A candidate reward function, generated by an LLM, whose utility for training depends on the policy's current competence and on the training phase.