By jointly reasoning over uncertainty in rewards, dynamics, and values during planning, preference-based RL can achieve sample efficiency comparable to model-based methods while avoiding explicit reward design.
This paper presents UBP2, a method that learns a reward model from human preference comparisons while actively exploring the environment. Unlike passive approaches, UBP2 uses model ensembles to estimate uncertainty in rewards, environment dynamics, and value functions, and balances learning across all three. This improves sample efficiency in the early stages of training.
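To make the ensemble idea concrete, the following is a minimal sketch and not the UBP2 implementation. It assumes linear reward models, a Bradley-Terry preference likelihood, and the standard deviation across ensemble members as the uncertainty signal; none of these choices is stated in the text above, and the function names are hypothetical.

```python
import numpy as np

def segment_returns(weights, segment):
    """Predicted return of one trajectory segment under each ensemble member.

    weights: (K, D) array, one linear reward model per member (assumed form).
    segment: (T, D) array of state-action features over T steps.
    Returns an array of shape (K,).
    """
    return (segment @ weights.T).sum(axis=0)

def preference_prob(weights, seg_a, seg_b):
    """Bradley-Terry probability, per member, that seg_a is preferred to seg_b."""
    diff = segment_returns(weights, seg_a) - segment_returns(weights, seg_b)
    return 1.0 / (1.0 + np.exp(-diff))

def reward_uncertainty(weights, segment):
    """Ensemble disagreement: std of the predicted segment return across members."""
    return segment_returns(weights, segment).std()

# Toy data: 5 ensemble members, 3 features, two segments of 10 steps each.
rng = np.random.default_rng(0)
weights = rng.normal(size=(5, 3))
seg_a = rng.normal(size=(10, 3))
seg_b = rng.normal(size=(10, 3))

probs = preference_prob(weights, seg_a, seg_b)   # one probability per member
bonus = reward_uncertainty(weights, seg_a)       # scalar disagreement score
```

Under these assumptions, a segment on which the members disagree strongly is one where a preference query or further exploration is likely to be informative; analogous disagreement scores could be computed for dynamics and value ensembles.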