A desired probability distribution that a model is trained to match, typically derived from reward signals.