You can make on-policy distillation 4x faster by precomputing teacher outputs once and enforcing 'teacher consistency' (using the same teacher throughout), which eliminates the need for a live teacher server during training.
This paper proposes Lightning OPD, a more efficient way to train large language models: knowledge is distilled from a teacher model without having to run the teacher continuously during training.
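A minimal sketch of the cached-teacher idea (illustrative only, not the paper's implementation): the teacher is run once over the training data, its per-token scores are cached, and the training loop reads from that cache instead of querying a live teacher server. The function and variable names here are assumptions for illustration.

```python
import math

def teacher_logprob(token: str) -> float:
    # Stand-in for an expensive teacher forward pass (hypothetical scoring rule).
    return -math.log(1 + len(token))

corpus = ["hello", "world", "distill", "hello"]

# Phase 1: precompute teacher outputs once; no live teacher is needed afterwards.
teacher_cache = {tok: teacher_logprob(tok) for tok in set(corpus)}

# Phase 2: the training loop consumes cached scores. Reusing one fixed cache
# every epoch also enforces teacher consistency (the same teacher throughout).
def distillation_loss(student_logprob: float, token: str) -> float:
    # Toy loss: squared difference between student and cached teacher score.
    return (student_logprob - teacher_cache[token]) ** 2

losses = [distillation_loss(-1.0, tok) for tok in corpus]
```

Because the cache is built in a single pass, the expensive teacher cost is paid once up front rather than on every training step, which is where the speedup comes from.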