When student models are trained on their own generated outputs, length inflation and repetition can destabilize training; divergence constraints and mixture distillation keep outputs stable and prevent performance collapse.
This paper identifies a critical failure mode in on-policy distillation (OPD) where student models generate increasingly long, repetitive outputs during training, causing data truncation and training instability.
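As a rough illustration of the two mitigations named above (not the paper's exact formulation), the sketch below shows a per-token distillation loss with a KL-divergence penalty toward a frozen reference policy, plus a batch-level mixture of on-policy (student-generated) and off-policy (teacher-generated) data. All function names, shapes, and the `beta`/`alpha` weights are illustrative assumptions.

```python
# Hypothetical sketch: KL-constrained on-policy distillation loss plus a
# batch-level mixture of on-policy and off-policy data. Names, signatures,
# and the beta/alpha weights are assumptions, not the paper's exact method.
import random

import torch
import torch.nn.functional as F


def kl_constrained_distill_loss(student_logits: torch.Tensor,
                                teacher_logits: torch.Tensor,
                                ref_logits: torch.Tensor,
                                beta: float = 0.1) -> torch.Tensor:
    """Per-token reverse-KL distillation with a divergence constraint.

    All logits have shape (batch, seq_len, vocab). `ref_logits` come from a
    frozen copy of the initial student; the beta-weighted KL term penalizes
    drift away from it, one way to damp length and repetition inflation.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    student_p = student_logp.exp()

    # Distillation term: KL(student || teacher), summed over the vocabulary
    # at each position, averaged over batch and sequence.
    distill = (student_p * (student_logp - teacher_logp)).sum(-1).mean()
    # Divergence constraint: KL(student || frozen reference).
    constraint = (student_p * (student_logp - ref_logp)).sum(-1).mean()
    return distill + beta * constraint


def next_mixture_batch(on_policy_iter, off_policy_iter, alpha: float = 0.5):
    """Mixture distillation at the batch level: with probability `alpha`,
    train on student-generated (on-policy) data; otherwise fall back to
    teacher-generated (off-policy) data, anchoring the data distribution."""
    return next(on_policy_iter if random.random() < alpha else off_policy_iter)
```

Under these assumptions, the reference-KL term bounds how far the student's output distribution can drift during self-generated training, while the off-policy fraction of each epoch keeps truncation-prone degenerate samples from dominating the data stream.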