Right in the Right Way: LM Training with Verifiable Rewards and Human Demonstrations

Mehul Damani, Isha Puri, Idan Shenfeld, Jacob Andreas | July 1, 2026 | arXiv

Key Takeaway

Combining a verifiable reward (what can be checked automatically) with a signal learned from human demonstrations (what is hard to measure directly) trains models that are both accurate and human-like, and avoids the diversity collapse and reward hacking that RL on verifiable rewards alone often causes.

Summary

This paper combines reinforcement learning from verifiable rewards (such as code correctness) with human demonstrations to train language models. Its key innovation is an adversarial discriminator that learns from human-written examples and steers the model toward more natural, diverse outputs while preserving high task accuracy.
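As a rough illustration of the idea in the summary above, the sketch below adds a discriminator-based bonus to a binary verifiable reward. The additive form, the `weight` parameter, the log-probability bonus, and the function names are all assumptions made for illustration; the summary does not specify how the paper combines the two signals.

```python
import math

EPS = 1e-6  # keeps probabilities away from 0 and 1 so log() stays finite


def _clamp(p: float) -> float:
    """Clamp a probability into the open interval (0, 1)."""
    return min(max(p, EPS), 1.0 - EPS)


def combined_reward(is_correct: bool, p_human: float, weight: float = 0.5) -> float:
    """Hypothetical generator reward: verifiable term plus discriminator bonus.

    is_correct: outcome of an automatic check (e.g. unit tests pass).
    p_human:    discriminator's probability that the output is human-written.
    weight:     assumed trade-off coefficient, not taken from the paper.
    """
    verifiable = 1.0 if is_correct else 0.0
    # log(p) is <= 0 and approaches 0 as the output looks more human-written.
    style_bonus = math.log(_clamp(p_human))
    return verifiable + weight * style_bonus


def discriminator_loss(p_human_on_demo: float, p_human_on_sample: float) -> float:
    """Binary cross-entropy on one human demonstration and one model sample.

    The discriminator is pushed to score demonstrations near 1
    and model samples near 0.
    """
    return -(math.log(_clamp(p_human_on_demo))
             + math.log(1.0 - _clamp(p_human_on_sample)))
```

Under these assumptions, a correct answer that the discriminator judges human-like earns more than an equally correct answer it judges machine-like, so the generator cannot maximize reward through correctness alone.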

training, alignment, reasoning

Key Terms

reinforcement learning from verifiable rewards, reward hacking, adversarial generator-discriminator, diversity collapse