Strong Teacher Not Needed? On Distillation in LLM Pretraining

Taiming Lu, Zhuang Liu | May 22, 2026 | arXiv

Key Takeaway

You don't need a powerful teacher to improve a larger language model through distillation: smaller teachers can still help, and over-training the teacher can actually hurt performance.

Summary

This paper challenges the assumption that knowledge distillation in language model training requires a strong teacher model. By systematically testing different teacher-student size combinations, the researchers found that even small, undertrained teachers can improve larger students when losses are properly balanced, and that stronger teachers don't always produce better results.
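To make "losses are properly balanced" concrete, here is a minimal sketch of one common way to mix a distillation term with the ordinary next-token loss. It is an illustration under stated assumptions, not the paper's exact recipe: the function names, the mixing weight `alpha`, the `temperature` parameter, and the choice of KL(teacher || student) as the distillation term are all hypothetical choices made for this example.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def mixed_distillation_loss(student_logits, teacher_logits, target_index,
                            alpha=0.5, temperature=1.0):
    """Per-token loss: (1 - alpha) * cross-entropy + alpha * KL(teacher || student).

    `alpha` and `temperature` are assumed hyperparameters for this sketch;
    the source does not specify how the losses are balanced.
    """
    # Standard language-modeling loss on the ground-truth next token.
    student_logp = log_softmax(student_logits)
    cross_entropy = -student_logp[target_index]

    # Distillation term: how far the student's distribution is from the teacher's.
    teacher_logp = log_softmax(teacher_logits, temperature)
    student_logp_t = log_softmax(student_logits, temperature)
    kl = sum(math.exp(t) * (t - s) for t, s in zip(teacher_logp, student_logp_t))

    return (1.0 - alpha) * cross_entropy + alpha * kl


# Usage: a three-token vocabulary where the correct next token is index 0.
loss = mixed_distillation_loss([2.0, 0.5, -1.0], [1.0, 1.0, 0.0], target_index=0, alpha=0.3)
```

With `alpha=0` this reduces to plain pretraining, and with `alpha=1` the student only imitates the teacher. A small or undertrained teacher gives a noisier target distribution, so a lower `alpha` is one plausible way to keep that noise from dominating the data loss.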

training efficiency

Key Terms

knowledge-distillation, teacher-student-divergence, loss-mixing, generalization