Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

Byung-Kwan Lee, Ximing Lu, Shizhe Diao, Minki Kang, Saurav Muralidharan et al. | June 16, 2026 | arXiv

Key Takeaway

Teaching small models through prompts (showing them correct and incorrect answers to tell apart) works better than traditional distillation or standard RL, especially for models under 1B parameters.

Summary

This paper introduces ZPPO (Zone of Proximal Policy Optimization), a training method in which a small student model learns from a larger teacher model without being trained to copy the teacher's exact outputs. Instead of forcing the student to imitate teacher predictions, ZPPO keeps the teacher in the prompt: it builds special question formats that help the student learn to discriminate correct from incorrect answers and to identify its own failure patterns.
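As a rough illustration of what "teacher in the prompt" could look like, the sketch below builds a two-candidate discrimination question and scores the student's choice. It is not the paper's implementation: the names `DiscriminationItem`, `build_discrimination_prompt`, and `discrimination_reward`, the prompt wording, the A/B format, and the 0/1 reward are all hypothetical, and the sketch assumes the teacher's answer is correct and the student's earlier answer is incorrect.

```python
import random
from dataclasses import dataclass


@dataclass
class DiscriminationItem:
    prompt: str          # text shown to the student
    correct_label: str   # "A" or "B", whichever holds the teacher answer


def build_discrimination_prompt(question, teacher_answer, student_answer, rng):
    """Place a teacher answer and a student answer side by side, in random order.

    Assumption (not from the paper): teacher_answer is correct and
    student_answer is one of the student's own incorrect attempts.
    """
    candidates = [("teacher", teacher_answer), ("student", student_answer)]
    rng.shuffle(candidates)  # randomize position so the label carries no signal

    lines = [f"Question: {question}", ""]
    correct_label = ""
    for label, (source, text) in zip(["A", "B"], candidates):
        lines.append(f"Candidate {label}: {text}")
        if source == "teacher":
            correct_label = label
    lines += ["", "Which candidate is correct? Reply with A or B."]
    return DiscriminationItem("\n".join(lines), correct_label)


def discrimination_reward(item, student_reply):
    """Hypothetical 0/1 reward: 1.0 only if the reply names the correct candidate."""
    return 1.0 if student_reply.strip().upper() == item.correct_label else 0.0


if __name__ == "__main__":
    item = build_discrimination_prompt(
        "What is 12 * 12?", "144", "124", random.Random(0)
    )
    print(item.prompt)
    print("reward for correct pick:", discrimination_reward(item, item.correct_label))
```

In this sketch the teacher never supplies a gradient target; it only contributes text to the student's input, and the learning signal comes from whether the student picks the right candidate. A reward of this kind could then be fed to an RL method such as GRPO, which the key terms mention.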

training efficiency alignment

Key Terms

knowledge-distillation, on-policy-learning, prompt-engineering, reinforcement-learning, grpo