General Preference Reinforcement Learning

Muhammad Umer, Muhammad Ahmed Mohsin, Ahsan Bilal, Arslan Chaudhry, Andreas Haupt et al. | May 18, 2026 | arXiv

Key Takeaway

GPRL addresses reward hacking in LLM training by treating response quality as multi-dimensional rather than scalar, so that online RL can be applied to open-ended tasks without collapsing onto a single exploitable reward axis.

Summary

This paper addresses a gap in LLM training by proposing General Preference Reinforcement Learning (GPRL), which handles open-ended tasks as traditional preference optimization does, while keeping the continuous exploration benefits of online RL.
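
The contrast between multi-dimensional and scalar quality can be made concrete with a small sketch. The code below is an illustration of the general idea suggested by the key terms (a skew-symmetric preference score combined with group-relative advantages), not the paper's actual formulation. The function names `pref_score`, `group_advantages`, and `unit`, the pairing of embedding coordinates, and the sign convention are all assumptions made for this example.

```python
import math


def pref_score(u, v):
    """Skew-symmetric preference score between two even-length embeddings.

    Hypothetical construction: coordinates are paired as (0, 1), (2, 3), ...
    and each pair contributes u[k] * v[k+1] - u[k+1] * v[k], which guarantees
    pref_score(u, v) == -pref_score(v, u) and pref_score(u, u) == 0.
    Assumed convention: a positive value means u is preferred over v.
    """
    if len(u) != len(v) or len(u) % 2 != 0:
        raise ValueError("embeddings must have the same even length")
    return sum(u[k] * v[k + 1] - u[k + 1] * v[k] for k in range(0, len(u), 2))


def group_advantages(embeddings):
    """Mean preference score of each response against the rest of its group."""
    n = len(embeddings)
    if n < 2:
        raise ValueError("a group needs at least two responses")
    return [
        sum(pref_score(embeddings[i], embeddings[j]) for j in range(n) if j != i)
        / (n - 1)
        for i in range(n)
    ]


def unit(angle_deg):
    """2-D unit vector at the given angle, used as a toy response embedding."""
    rad = math.radians(angle_deg)
    return [math.cos(rad), math.sin(rad)]


# Three toy responses whose pairwise preferences form a cycle:
# the first beats the second, the second beats the third, the third beats the first.
# No single scalar reward can reproduce this ordering, because scalar scores
# always imply a transitive ranking.
cycle = [unit(0), unit(120), unit(240)]
```

In this toy group every response has a mean score of approximately zero, so none of them is singled out as a winner to over-optimize. Because the score is skew-symmetric, the advantages within any group sum to zero (up to floating-point error), which plays a role similar to the mean-centering used in group-relative methods.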

training, alignment, reasoning

Key Terms

preference-optimization, reward-hacking, skew-symmetric, group-relative-policy-optimization, trust-region