You don't need a powerful teacher to improve a larger language model through distillation: smaller teachers work fine, and over-training the teacher can actually hurt the student's performance.
This paper challenges the assumption that knowledge distillation in language model training requires a strong teacher model. By systematically testing different teacher-student size combinations, the researchers found that even small, undertrained teachers can improve larger students when losses are properly balanced, and that stronger teachers don't always produce better results.
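To make "properly balanced" concrete, here is a minimal sketch of one common way to combine a language-modeling loss with a distillation loss for a single token. It is an illustration under stated assumptions, not the paper's actual objective: the summary does not give the formula, so the convex weighting, the KL direction, and the names `balanced_loss` and `alpha` are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, target_index):
    """Standard next-token loss against the ground-truth token."""
    probs = softmax(student_logits)
    return -math.log(probs[target_index])


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student): how far the student is from the teacher."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def balanced_loss(student_logits, teacher_logits, target_index, alpha=0.5):
    """Hypothetical balanced objective (an assumption, not the paper's formula).

    alpha = 0 ignores the teacher entirely; alpha = 1 trains on the teacher only.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    ce = cross_entropy(student_logits, target_index)
    kd = kl_divergence(teacher_logits, student_logits)
    return (1.0 - alpha) * ce + alpha * kd


if __name__ == "__main__":
    student = [2.0, 0.5, -1.0]   # made-up logits over a 3-token vocabulary
    teacher = [1.0, 1.2, -0.5]   # a weaker teacher that is unsure of the target
    for alpha in (0.0, 0.3, 1.0):
        print(alpha, balanced_loss(student, teacher, target_index=0, alpha=alpha))
```

In this sketch, a smaller `alpha` limits how much a weak or undertrained teacher can pull the student away from the ground-truth data, which is one plausible reading of why balancing matters when the teacher is smaller than the student.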