Imperfect reward signals used in RLHF can sometimes help rather than hurt model training, and evaluating reward quality requires understanding how errors interact with the learning algorithm, not just counting ranking mistakes.
This paper shows that not all reward errors are equally harmful when training language models with reinforcement learning. By analyzing how policy gradient optimization works, the authors categorize reward errors as harmful, benign, or even beneficial: some errors can actually help the model avoid getting stuck on mediocre outputs.
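
To make the intuition concrete, here is a minimal toy sketch (not the paper's actual framework, analysis, or experiments): a softmax policy over three hypothetical candidate responses is trained with exact REINFORCE-style policy-gradient updates, starting from logits that strongly favor the mediocre response. The reward vectors, option names, step count, and learning rate are all assumptions chosen for illustration. It compares a perfect reward against rewards with different ranking errors, reporting where the policy ends up and how quickly the good response overtakes the mediocre one.

```python
import numpy as np

OPTIONS = ["bad", "mediocre", "good"]
TRUE_QUALITY = np.array([0.0, 0.5, 1.0])  # assumed ground-truth quality ordering

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def train(reward, steps=2000, lr=0.5):
    """Exact expected policy-gradient updates on softmax logits.

    Returns the final policy and the first step at which 'good' becomes more
    likely than 'mediocre' (None if that never happens).
    """
    logits = np.array([0.0, 3.0, 0.0])  # policy starts stuck on "mediocre"
    overtake_step = None
    for t in range(steps):
        p = softmax(logits)
        if overtake_step is None and p[2] > p[1]:
            overtake_step = t
        baseline = p @ reward                     # expected reward under current policy
        logits += lr * p * (reward - baseline)    # gradient of E[reward] w.r.t. logits
    return softmax(logits), overtake_step

# Hypothetical reward functions with different kinds of ranking errors.
rewards = {
    "perfect":       TRUE_QUALITY,                # correct ranking
    "benign error":  np.array([0.6, 0.5, 1.0]),   # mis-ranks bad vs. mediocre; best answer still on top
    "harmful error": np.array([0.0, 1.0, 0.5]),   # mis-ranks mediocre above good
    "helpful error": np.array([0.0, 0.2, 1.0]),   # under-rewards the entrenched mediocre response
}

for name, r in rewards.items():
    final, overtake = train(r)
    dist = ", ".join(f"{o}={p:.2f}" for o, p in zip(OPTIONS, final))
    print(f"{name:14s} final policy: {dist} | good overtakes mediocre at step: {overtake}")
```

In this toy setup, only the error that mis-ranks the mediocre response above the good one steers the policy to the wrong answer; the error that merely swaps two non-preferred responses leaves the outcome intact, and under-rewarding the response the policy is currently stuck on speeds up its escape, mirroring the harmful/benign/beneficial distinction described above.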