Rethinking the Divergence Regularization in LLM RL

Jiarui Yao, Xiangxin Zhou, Penghui Qi, Wee Sun Lee, Liefeng Bo et al. | June 8, 2026 | arXiv

Key Takeaway

When training LLMs with RL, regularize policy shifts smoothly instead of applying hard cutoffs: training becomes more stable, and useful learning signals are not thrown away.

Summary

This paper revisits how reinforcement learning for language models measures when a model's behavior has drifted too far during training, and how updates are limited once it has. Existing methods abruptly cut off the gradient updates from samples that cross the limit. The authors propose DRPO, which instead smoothly reduces the impact of those updates. The result is training that stays more stable and efficient across different model sizes.
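To make the contrast between a hard cutoff and a smooth reduction concrete, here is a minimal Python sketch. It is not the DRPO objective, which this summary does not spell out. `clipped_grad_weight` follows the standard PPO-style clipped surrogate as an example of a hard cutoff. `smooth_grad_weight`, its Gaussian kernel on the log-ratio, and the parameters `eps` and `sigma` are hypothetical names and choices made only for illustration.

```python
import math


def clipped_grad_weight(ratio, advantage, eps=0.2):
    """Per-sample gradient weight of a PPO-style clipped surrogate.

    The surrogate is min(r * A, clip(r, 1 - eps, 1 + eps) * A). Its derivative
    with respect to the probability ratio r is A while the unclipped term is
    active and exactly 0 once the clip binds (the hard cutoff).
    """
    if advantage >= 0 and ratio > 1 + eps:
        return 0.0
    if advantage < 0 and ratio < 1 - eps:
        return 0.0
    return advantage


def smooth_grad_weight(ratio, advantage, sigma=0.2):
    """Hypothetical smooth alternative (an assumption, not the paper's formula).

    The advantage is scaled by a Gaussian kernel on log(r), so the weight
    decays continuously as the policy moves away from r = 1 and never drops
    to zero abruptly. Unlike the clipped rule, this simplified kernel is
    symmetric and ignores the sign of the advantage.
    """
    return advantage * math.exp(-(math.log(ratio) ** 2) / (2 * sigma ** 2))


if __name__ == "__main__":
    # Compare the two weights for a positive-advantage sample as r grows.
    for r in (1.0, 1.1, 1.2, 1.3, 1.5):
        hard = clipped_grad_weight(r, 1.0)
        soft = smooth_grad_weight(r, 1.0)
        print(f"ratio={r:.1f}  hard={hard:.3f}  smooth={soft:.3f}")
```

With the hard rule, a sample just past the clip boundary contributes nothing to the update. With the smooth rule, it still contributes, only with a smaller weight that shrinks as the ratio moves further from 1.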

training, alignment, efficiency

Key Terms

trust-region, policy-gradient, divergence-regularization, off-policy