Diversity in RL emerges naturally once reward uncertainty is acknowledged: an agent that is unsure what the true reward function is will rationally explore multiple behaviors, which removes the need for ad hoc diversity bonuses.
This paper proposes a new way to train RL agents that naturally produces diverse behaviors by treating the reward function as uncertain rather than fixed. Instead of maximizing a single reward, the method optimizes over distributions of possible rewards, causing agents to explore multiple strategies without sacrificing performance.
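To make the intuition concrete, here is a minimal toy sketch. It is not the paper's algorithm. It assumes a three-action bandit with independent Gaussian beliefs over each action's reward, and it uses posterior-sampling-style action selection as a stand-in for "optimizing over a distribution of rewards". The function names, means, and standard deviations are all hypothetical, chosen only for illustration.

```python
import random
from collections import Counter


def sample_reward(means, stds, rng):
    """Draw one plausible reward vector from independent Gaussian beliefs."""
    return [rng.gauss(m, s) for m, s in zip(means, stds)]


def behavior_distribution(means, stds, n_samples=10_000, seed=0):
    """Fraction of sampled reward functions under which each action is optimal."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_samples):
        rewards = sample_reward(means, stds, rng)
        # Act optimally with respect to this one sampled reward function.
        best = max(range(len(rewards)), key=rewards.__getitem__)
        counts[best] += 1
    return [counts[a] / n_samples for a in range(len(means))]


# Hypothetical beliefs about the rewards of three actions (illustrative values).
means = [1.0, 0.9, 0.2]

# Fixed reward (zero uncertainty): a single behavior is always optimal.
certain = behavior_distribution(means, stds=[0.0, 0.0, 0.0])

# Uncertain reward: different sampled rewards favor different actions.
uncertain = behavior_distribution(means, stds=[0.5, 0.5, 0.5])

print(certain)    # all mass on action 0
print(uncertain)  # mass spread over several actions
```

In this toy setting, no diversity bonus appears anywhere in the objective. The spread over actions comes only from the width of the reward beliefs, and it collapses to a single behavior as the standard deviations shrink to zero.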