Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents

Changdae Oh, Wendi Li, Seongheon Park, Samuel Yeh, Tanwi Mallick et al. | June 24, 2026 | arXiv

Key Takeaway

You can extract free step-level evaluation signals from standard RL post-training using progress advantage, eliminating the need to build expensive process reward models for agent systems.

Summary

This paper shows that RL-trained language models already contain step-level scoring signals, so no separate reward model is needed. The authors derive 'progress advantage', a metric based on policy log-probability ratios that scores how good each step is, and demonstrate that it is useful for test-time scaling, uncertainty estimation, and debugging across multiple benchmarks.
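To make the idea of a log-probability-ratio score concrete, here is a minimal sketch. It is not the paper's definition: it assumes, purely for illustration, that a step's score is the sum over its tokens of the log-probability under the post-trained policy minus the log-probability under a reference (pre-RL) policy, scaled by a hypothetical coefficient `beta`. The function names and the toy numbers are also made up for this example.

```python
from typing import List, Sequence, Tuple


def step_log_ratio(policy_logprobs: Sequence[float],
                   ref_logprobs: Sequence[float],
                   beta: float = 1.0) -> float:
    """Score one agent step from per-token log-probabilities.

    ASSUMPTION: score = beta * sum_t [log pi(token_t) - log pi_ref(token_t)].
    This is an illustrative stand-in, not the paper's exact formula.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("both sequences must cover the same tokens")
    return beta * sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))


def score_trajectory(steps: Sequence[Tuple[Sequence[float], Sequence[float]]],
                     beta: float = 1.0) -> List[float]:
    """Return one score per step of a multi-step trajectory."""
    return [step_log_ratio(p, r, beta) for p, r in steps]


# Toy trajectory: each step is (policy log-probs, reference log-probs).
steps = [
    ([-0.5, -0.25], [-1.0, -0.75]),  # policy prefers these tokens more than the reference
    ([-2.0, -1.5], [-1.0, -1.0]),    # policy prefers these tokens less than the reference
]
scores = score_trajectory(steps)

# Index of the lowest-scoring step, e.g. as a candidate for debugging.
weakest = min(range(len(scores)), key=scores.__getitem__)
```

Under this assumption, a higher score means the post-trained policy assigns more probability to the step than the reference does. Such per-step scores could then be used to rank candidate steps at test time or to flag the weakest step in a failed trajectory.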

reasoning training evaluation

Key Terms

process-reward-model advantage-function policy-gradient markov-decision-process test-time-scaling