You can make on-policy distillation 4x faster by precomputing teacher outputs once and enforcing 'teacher consistency' (using the same teacher throughout), which eliminates the need for a live teacher server during training.
This paper proposes Lightning OPD, a more efficient way to train large language models: knowledge is distilled from a teacher model without having to run the teacher continuously during training.
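A minimal sketch of the cached-teacher idea (illustrative only, not the paper's implementation): the teacher is run once over the training data, its per-token scores are cached, and the training loop reads from that cache instead of querying a live teacher server. The function and variable names here are assumptions for illustration.

```python
import math

def teacher_logprob(token: str) -> float:
    # Stand-in for an expensive teacher forward pass (hypothetical scoring rule).
    return -math.log(1 + len(token))

corpus = ["hello", "world", "distill", "hello"]

# Phase 1: precompute teacher outputs once; no live teacher is needed afterwards.
teacher_cache = {tok: teacher_logprob(tok) for tok in set(corpus)}

# Phase 2: the training loop consumes cached scores. Reusing one fixed cache
# every epoch also enforces teacher consistency (the same teacher throughout).
def distillation_loss(student_logprob: float, token: str) -> float:
    # Toy loss: squared difference between student and cached teacher score.
    return (student_logprob - teacher_cache[token]) ** 2

losses = [distillation_loss(-1.0, tok) for tok in corpus]
```

Because the cache is built in a single pass, the expensive teacher cost is paid once up front rather than on every training step, which is where the speedup comes from.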