When training LLMs with RL, use smooth regularization on policy shifts instead of hard cutoffs: it improves training stability without throwing away useful learning signals.
This paper improves how language models learn from reinforcement learning by fixing how we measure when a model's behavior has changed too much during training. Existing methods abruptly cut off gradient updates once that happens; the authors propose DRPO, which instead smoothly reduces the impact of those updates. This keeps training more stable and efficient across different model sizes.
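
To make the contrast between a hard cutoff and a smooth reduction concrete, here is a minimal sketch. It is not DRPO's objective, which this summary does not spell out. The hard cutoff is modeled on the familiar PPO-style clipped surrogate, and the smooth variant uses a quadratic penalty on the probability ratio, chosen purely for illustration. The function names and the `eps` and `beta` parameters are assumptions, not taken from the paper.

```python
def clipped_grad(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Gradient w.r.t. `ratio` of the PPO-style clipped surrogate
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).

    Once the ratio leaves the trust region in the direction the advantage
    pushes it, the gradient drops to exactly zero (a hard cutoff).
    """
    if advantage >= 0:
        return advantage if ratio < 1 + eps else 0.0
    return advantage if ratio > 1 - eps else 0.0


def smooth_grad(ratio: float, advantage: float, beta: float = 1.0) -> float:
    """Gradient w.r.t. `ratio` of an illustrative smooth surrogate
    ratio * A - beta * (ratio - 1) ** 2.

    The quadratic penalty is a stand-in, NOT the paper's regularizer.
    The learning signal shrinks gradually as the policy drifts away
    from ratio == 1 instead of vanishing abruptly.
    """
    return advantage - 2.0 * beta * (ratio - 1.0)


if __name__ == "__main__":
    # Sweep the ratio past the clip boundary (1 + eps = 1.2) with A = 1.
    for r in (1.0, 1.1, 1.3, 1.5):
        print(f"ratio={r:.1f}  hard={clipped_grad(r, 1.0):.2f}  "
              f"smooth={smooth_grad(r, 1.0):.2f}")
```

With a positive advantage, the hard version keeps the full gradient up to the clip boundary and then discards it entirely, while the smooth version scales the gradient down continuously and eventually reverses it to pull the policy back. That continuity is the property the summary credits for retaining useful learning signal.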