On-policy distillation succeeds only when the teacher model offers capabilities genuinely absent from the student's training data and the two models share compatible reasoning patterns; a higher benchmark score alone is not enough.
This paper investigates why on-policy distillation, a technique for training smaller AI models to imitate larger ones, sometimes works and sometimes fails. The researchers find that success requires compatible reasoning patterns between student and teacher, along with teacher capabilities that genuinely extend beyond what the student's training data already covers.