Target Policy Optimization (TPO) is a reinforcement learning method that separates the decision of which completions should be rewarded from how model parameters are updated. Instead of doing both in a single step, as standard policy-gradient methods do, TPO first builds a target probability distribution over scored completions and then trains the model to match that distribution with a cross-entropy loss.
TPO thus decouples what to optimize for (the target distribution) from how to optimize (the parameter updates). Because cross-entropy is minimized exactly when the model matches the target, updates naturally stop there rather than overshooting, which is particularly valuable in sparse-reward RL problems.
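The text does not specify the exact form of the target distribution, so the sketch below makes an illustrative assumption: the target is a softmax over completion rewards with a temperature `beta`, and the model is trained by minimizing cross-entropy against it. The function names and `beta` parameter are hypothetical, not part of TPO's published definition.

```python
import numpy as np

def target_distribution(rewards, beta=1.0):
    # Assumed form: softmax over completion rewards.
    # Higher-reward completions receive higher target probability.
    z = beta * (rewards - np.max(rewards))  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(target, model_probs, eps=1e-12):
    # Minimized exactly when model_probs == target (value = entropy of target),
    # which is why matching the target cannot overshoot it.
    return -np.sum(target * np.log(model_probs + eps))

# Four sampled completions with scalar reward scores (illustrative values).
rewards = np.array([0.0, 1.0, 0.2, 0.9])
target = target_distribution(rewards, beta=2.0)

# An untrained model assigns uniform probability to the completions.
model_probs = np.full(4, 0.25)
loss = cross_entropy(target, model_probs)
```

The key property is that the loss reaches its minimum when the model distribution equals the target, so gradient steps shrink toward zero as the model approaches it; there is no incentive to push probability mass past the target, unlike an unbounded reward-maximization objective.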