BRRL provides the first principled theoretical foundation for PPO-style clipped objectives, proving monotonic improvement and connecting trust-region methods to the Cross-Entropy Method, which offers both a deeper understanding and a path to improved algorithms.
This paper closes a theoretical gap in PPO by introducing BRRL, a framework that derives the mathematically optimal policy update with guaranteed improvement. The authors develop BPO, a practical algorithm that approximates this optimal solution, and extend it to GBPO for LLM fine-tuning. Experiments show BPO matches or beats PPO across robotics, games, and language-model tasks.
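For context, the object BRRL analyzes is the standard PPO clipped surrogate objective (Schulman et al., 2017). A minimal NumPy sketch of that textbook objective, not the paper's BPO update, with illustrative function and parameter names:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate objective (to be maximized).

    ratio:     per-sample probability ratio pi_new(a|s) / pi_old(a|s)
    advantage: per-sample advantage estimate
    eps:       clip range; ratios outside [1-eps, 1+eps] are truncated
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the elementwise minimum gives a pessimistic bound,
    # removing the incentive to move the ratio far outside the clip range.
    return np.minimum(unclipped, clipped).mean()

# A ratio beyond 1+eps is truncated (1.5 -> 1.2 with a positive advantage),
# while a ratio inside the range passes through unchanged:
value = ppo_clip_objective(np.array([1.5, 0.9]), np.array([1.0, -1.0]))
print(value)  # -> 0.15  (mean of 1.2 and -0.9)
```

The clipping is a heuristic surrogate for a trust-region constraint, which is exactly the part of PPO that lacked a rigorous improvement guarantee before this line of work.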