Imperfect reward signals used in RLHF can sometimes help rather than hurt model training, and evaluating reward quality requires understanding how errors interact with the learning algorithm, not just counting ranking mistakes.
This paper shows that not all reward errors are equally harmful when training language models with reinforcement learning. By analyzing how policy gradient optimization works, the authors categorize reward errors as harmful, benign, or even beneficial: some errors can actually help the model avoid getting stuck on mediocre outputs.
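
To make the intuition concrete, here is a minimal toy sketch (not the paper's actual framework, analysis, or experiments): a softmax policy over three hypothetical candidate responses is trained with exact REINFORCE-style policy-gradient updates, starting from logits that strongly favor the mediocre response. The reward vectors, option names, step count, and learning rate are all assumptions chosen for illustration. It compares a perfect reward against rewards with different ranking errors, reporting where the policy ends up and how quickly the good response overtakes the mediocre one.

```python
import numpy as np

OPTIONS = ["bad", "mediocre", "good"]
TRUE_QUALITY = np.array([0.0, 0.5, 1.0])  # assumed ground-truth quality ordering

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def train(reward, steps=2000, lr=0.5):
    """Exact expected policy-gradient updates on softmax logits.

    Returns the final policy and the first step at which 'good' becomes more
    likely than 'mediocre' (None if that never happens).
    """
    logits = np.array([0.0, 3.0, 0.0])  # policy starts stuck on "mediocre"
    overtake_step = None
    for t in range(steps):
        p = softmax(logits)
        if overtake_step is None and p[2] > p[1]:
            overtake_step = t
        baseline = p @ reward                     # expected reward under current policy
        logits += lr * p * (reward - baseline)    # gradient of E[reward] w.r.t. logits
    return softmax(logits), overtake_step

# Hypothetical reward functions with different kinds of ranking errors.
rewards = {
    "perfect":       TRUE_QUALITY,                # correct ranking
    "benign error":  np.array([0.6, 0.5, 1.0]),   # mis-ranks bad vs. mediocre; best answer still on top
    "harmful error": np.array([0.0, 1.0, 0.5]),   # mis-ranks mediocre above good
    "helpful error": np.array([0.0, 0.2, 1.0]),   # under-rewards the entrenched mediocre response
}

for name, r in rewards.items():
    final, overtake = train(r)
    dist = ", ".join(f"{o}={p:.2f}" for o, p in zip(OPTIONS, final))
    print(f"{name:14s} final policy: {dist} | good overtakes mediocre at step: {overtake}")
```

In this toy setup, only the error that mis-ranks the mediocre response above the good one steers the policy to the wrong answer; the error that merely swaps two non-preferred responses leaves the outcome intact, and under-rewarding the response the policy is currently stuck on speeds up its escape, mirroring the harmful/benign/beneficial distinction described above.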