Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients — ThinkLLM