UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning

Mohamed Nabail, Leo Cheng, Jingmin Wang, Nicholas Rhinehart | June 17, 2026 | arXiv

Key Takeaway

By jointly reasoning over uncertainty in rewards, dynamics, and values during planning, preference-based RL can achieve sample efficiency comparable to model-based methods while avoiding explicit reward design.

Summary

This paper presents UBP2, a method that learns reward models from human preference comparisons while actively exploring the environment. Unlike passive approaches, UBP2 uses ensemble models to balance learning about rewards, environment dynamics, and value functions, which improves sample efficiency in the early stages of training.
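To make the idea of an ensemble reward model learned from preferences concrete, here is a minimal sketch. It is not the paper's implementation: it assumes a linear reward over hand-made segment features, a Bradley-Terry preference likelihood, and a bootstrapped ensemble whose disagreement stands in for epistemic uncertainty about the reward. It does not model dynamics or value uncertainty. All function names, feature dimensions, and hyperparameters are hypothetical.

```python
import numpy as np

def fit_bt_reward(feats_a, feats_b, prefs, lr=0.5, steps=500):
    """Fit a linear reward r(x) = w @ x with gradient descent on the
    Bradley-Terry negative log-likelihood.
    prefs[i] is 1.0 if segment a was preferred over segment b, else 0.0."""
    diff = feats_a - feats_b
    w = np.zeros(diff.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(diff @ w)))     # P(a preferred over b)
        grad = diff.T @ (p - prefs) / len(prefs)  # gradient of the mean NLL
        w -= lr * grad
    return w

def fit_ensemble(feats_a, feats_b, prefs, n_members=5, seed=0):
    """Train each member on a bootstrap resample of the preference pairs."""
    rng = np.random.default_rng(seed)
    n = len(prefs)
    members = []
    for _ in range(n_members):
        idx = rng.integers(0, n, size=n)
        members.append(fit_bt_reward(feats_a[idx], feats_b[idx], prefs[idx]))
    return np.stack(members)                      # shape: (n_members, n_features)

def reward_mean_and_std(ensemble, feats):
    """Mean predicted reward and ensemble disagreement for each candidate."""
    preds = feats @ ensemble.T                    # shape: (n_items, n_members)
    return preds.mean(axis=1), preds.std(axis=1)

# Synthetic demo (assumption): preferences come from a hidden linear reward.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
feats_a = rng.normal(size=(200, 3))
feats_b = rng.normal(size=(200, 3))
prefs = ((feats_a - feats_b) @ true_w > 0).astype(float)

ensemble = fit_ensemble(feats_a, feats_b, prefs)
candidates = rng.normal(size=(10, 3))
mean_r, std_r = reward_mean_and_std(ensemble, candidates)

# The candidate the ensemble disagrees on most is a natural one to explore
# or to ask a human about next.
query_idx = int(np.argmax(std_r))
```

A planner in this spirit could score candidates by mean reward plus a bonus proportional to the disagreement. How UBP2 actually weighs reward, dynamics, and value uncertainty is not specified in this summary.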

training efficiency reasoning

Key Terms

preference-based reinforcement learning, epistemic uncertainty, ensemble methods, regret bounds