LLM-generated rewards are not equally useful throughout training: their reliability depends on the current policy's competence and the training phase, so deciding when to verify and deploy them matters as much as generating them.
This paper addresses when and how to use LLM-generated rewards during reinforcement learning. The authors propose RHyVE, a method that verifies reward quality against the current policy's skill level and training phase, rather than treating all rewards as equally trustworthy throughout training.
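The core idea of conditioning trust in an LLM reward on policy competence and training phase can be sketched as a simple gating rule. This is an illustrative assumption, not the paper's actual RHyVE procedure: the function name, the verifier score, and the linear trust schedule are all hypothetical.

```python
def gated_reward(llm_reward: float,
                 verifier_score: float,
                 training_progress: float,
                 trust_threshold: float = 0.5) -> float:
    """Down-weight an LLM-generated reward when its verification is weak.

    verifier_score: estimated reliability of the LLM reward in [0, 1],
        e.g. conditioned on the current policy's competence (hypothetical).
    training_progress: fraction of training completed in [0, 1]; early in
        training the gate relies less on fine-grained LLM judgments.
    """
    # Trust grows with training progress and verifier confidence;
    # below the threshold, the LLM reward is dropped entirely.
    trust = verifier_score * (0.5 + 0.5 * training_progress)
    return llm_reward * trust if trust >= trust_threshold else 0.0
```

A binary gate like this is only one possibility; a soft weighting (returning `llm_reward * trust` unconditionally) would keep a partial shaping signal instead of discarding uncertain rewards outright.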