Target Policy Optimization — ThinkLLM