Treating the reward function as a distribution rather than a fixed scalar, reflecting ambiguity in what behavior is actually desired.