Training an agent to match a target policy by minimizing divergence (e.g., KL divergence) between predicted and target actions.