Forcing a model's output distribution to match a target distribution, here used to normalize reward structures across different tasks.