Target Policy Optimization (TPO) is a reinforcement learning method that separates the decision of which completions should be rewarded from how model parameters are updated. Instead of doing both in a single step, as standard policy-gradient methods do, TPO first builds a target probability distribution over scored completions and then trains the model to match that distribution with a cross-entropy loss.
TPO thus decouples what to optimize for (the target distribution) from how to optimize (the parameter updates). Because cross-entropy is minimized exactly when the model matches the target, updates naturally stop there rather than overshooting, which is particularly valuable in sparse-reward RL problems.
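The text does not specify the exact form of the target distribution, so the sketch below makes an illustrative assumption: the target is a softmax over completion rewards with a temperature `beta`, and the model is trained by minimizing cross-entropy against it. The function names and `beta` parameter are hypothetical, not part of TPO's published definition.

```python
import numpy as np

def target_distribution(rewards, beta=1.0):
    # Assumed form: softmax over completion rewards.
    # Higher-reward completions receive higher target probability.
    z = beta * (rewards - np.max(rewards))  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(target, model_probs, eps=1e-12):
    # Minimized exactly when model_probs == target (value = entropy of target),
    # which is why matching the target cannot overshoot it.
    return -np.sum(target * np.log(model_probs + eps))

# Four sampled completions with scalar reward scores (illustrative values).
rewards = np.array([0.0, 1.0, 0.2, 0.9])
target = target_distribution(rewards, beta=2.0)

# An untrained model assigns uniform probability to the completions.
model_probs = np.full(4, 0.25)
loss = cross_entropy(target, model_probs)
```

The key property is that the loss reaches its minimum when the model distribution equals the target, so gradient steps shrink toward zero as the model approaches it; there is no incentive to push probability mass past the target, unlike an unbounded reward-maximization objective.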