Task-based reward signals in RL training genuinely improve model capabilities rather than merely amplifying existing patterns: sharpening alone is mathematically unstable and yields only limited gains.
This paper compares two approaches to improving AI models with reinforcement learning: distribution sharpening (making existing output patterns more extreme) versus task-reward learning (teaching new skills). On math tasks, the authors show that sharpening alone produces weak, unstable results, while task rewards enable robust performance gains and stable training.
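To make the distinction concrete, here is a minimal toy sketch (not the paper's actual formulation) of what "distribution sharpening" means: exponentiating a policy's output probabilities and renormalizing. The candidate answers, probabilities, and the `sharpen` helper below are all illustrative assumptions; the sketch shows only why sharpening cannot surface an answer the base policy assigns zero probability.

```python
import numpy as np

def sharpen(p: np.ndarray, temperature: float = 0.5) -> np.ndarray:
    """Return p_i^(1/T) / sum_j p_j^(1/T); T < 1 concentrates mass on existing modes."""
    q = p ** (1.0 / temperature)
    return q / q.sum()

# Hypothetical base policy over 4 candidate answers; suppose the
# correct answer (index 3) gets zero probability from the base model.
base = np.array([0.5, 0.3, 0.2, 0.0])

sharpened = sharpen(base)
# Mass shifts toward the existing mode (index 0), but index 3 stays at 0:
# sharpening amplifies what the model already does, it cannot create
# new behavior the way a task-reward signal can.
assert sharpened[3] == 0.0
```

By contrast, a reward signal computed from task success can assign positive learning signal to behaviors outside the current modes, which is the paper's argument for why task rewards yield genuinely new capability rather than amplified old capability.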