You can train models to be both accurate and human-like by combining objective rewards (what you can measure) with a learned signal from human examples (what's hard to measure), which helps avoid the diversity collapse and reward gaming that pure RL often causes.
This paper trains language models by combining reinforcement learning from verifiable rewards (such as code correctness) with human demonstrations. The key innovation is an adversarial discriminator, trained on human-written examples, that steers the model toward more natural, diverse outputs while still achieving high task accuracy.
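
To make the idea concrete, here is a minimal sketch of how a verifiable reward and a discriminator signal might be mixed into one training reward. The summary does not give the paper's actual formulation, so everything here is an assumption for illustration: the linear mixing rule, the weight `lam`, and the function names `combined_reward` and `discriminator_loss` are all hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def combined_reward(verifiable: float, disc_logit: float, lam: float = 0.5) -> float:
    """Mix a verifiable reward with a discriminator score.

    verifiable: objective reward in [0, 1], e.g. fraction of unit tests passed.
    disc_logit: discriminator logit; higher means "looks human-written".
    lam:        hypothetical mixing weight (assumption, not from the paper).
    """
    human_like = sigmoid(disc_logit)
    return (1.0 - lam) * verifiable + lam * human_like


def discriminator_loss(human_logits, model_logits) -> float:
    """Binary cross-entropy: human examples are labeled 1, model samples 0."""
    eps = 1e-12  # guards against log(0)
    human_terms = [-math.log(sigmoid(z) + eps) for z in human_logits]
    model_terms = [-math.log(1.0 - sigmoid(z) + eps) for z in model_logits]
    return (sum(human_terms) + sum(model_terms)) / (len(human_terms) + len(model_terms))


# A correct answer that the discriminator is unsure about (logit 0).
reward = combined_reward(verifiable=1.0, disc_logit=0.0, lam=0.5)
```

In this sketch, setting `lam` to 0 recovers pure RL on the verifiable reward, while larger values put more weight on resembling the human demonstrations. The discriminator would be retrained on fresh model samples as the policy improves, which is what makes the setup adversarial.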