Progress advantage lets you extract step-level evaluation signals from standard RL post-training at no extra training cost, removing the need to build expensive process reward models for agent systems.
This paper shows that RL-trained language models already contain step-level scoring signals, with no separate reward model required. The authors derive 'progress advantage', a metric based on policy log-probability ratios that captures how good each step is, and demonstrate that it is useful for scaling, uncertainty estimation, and debugging across multiple benchmarks.
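To make the idea of a score built from policy log-probability ratios concrete, here is a minimal sketch. This summary does not give the paper's exact definition, so the sketch rests on assumptions: the ratio is taken between the RL-trained policy and a reference (pre-RL) policy, it is summed over the tokens of each step, and it is scaled by a coefficient `beta`. The function name `step_log_ratio_scores` and all of its parameters are hypothetical, chosen for illustration only.

```python
from typing import List, Sequence


def step_log_ratio_scores(
    policy_logprobs: Sequence[Sequence[float]],
    reference_logprobs: Sequence[Sequence[float]],
    beta: float = 1.0,
) -> List[float]:
    """Score each step as beta * sum over its tokens of (log pi - log pi_ref).

    ASSUMPTION: this form (RL policy vs. reference policy, summed per step,
    scaled by beta) is illustrative and may differ from the paper's definition.
    Each inner sequence holds the per-token log-probabilities of one step.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("both inputs must contain the same number of steps")

    scores: List[float] = []
    for step_policy, step_reference in zip(policy_logprobs, reference_logprobs):
        if len(step_policy) != len(step_reference):
            raise ValueError("each step must have matching token counts")
        # Sum of per-token log-ratios for this step.
        log_ratio = sum(p - r for p, r in zip(step_policy, step_reference))
        scores.append(beta * log_ratio)
    return scores


# Toy log-probabilities for a two-step trajectory (made-up numbers).
policy = [[-0.5, -0.25], [-2.0, -1.0]]
reference = [[-1.0, -0.75], [-1.0, -0.5]]
print(step_log_ratio_scores(policy, reference))  # [1.0, -1.5]
```

Under these assumptions, a positive score means the RL-trained policy makes the step more likely than the reference policy does, and a negative score means it makes the step less likely. In practice the per-token log-probabilities would come from scoring the same trajectory under both models.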