When student models are trained on their own generated outputs, length inflation and repetition can destabilize training; divergence constraints and mixture distillation keep outputs stable and prevent performance collapse.
This paper identifies a critical failure mode in on-policy distillation (OPD) where student models generate increasingly long, repetitive outputs during training, causing data truncation and training instability.
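As a rough illustration of the two mitigations named above (not the paper's exact formulation), the sketch below shows a per-token distillation loss with a KL-divergence penalty toward a frozen reference policy, plus a batch-level mixture of on-policy (student-generated) and off-policy (teacher-generated) data. All function names, shapes, and the `beta`/`alpha` weights are illustrative assumptions.

```python
# Hypothetical sketch: KL-constrained on-policy distillation loss plus a
# batch-level mixture of on-policy and off-policy data. Names, signatures,
# and the beta/alpha weights are assumptions, not the paper's exact method.
import random

import torch
import torch.nn.functional as F


def kl_constrained_distill_loss(student_logits: torch.Tensor,
                                teacher_logits: torch.Tensor,
                                ref_logits: torch.Tensor,
                                beta: float = 0.1) -> torch.Tensor:
    """Per-token reverse-KL distillation with a divergence constraint.

    All logits have shape (batch, seq_len, vocab). `ref_logits` come from a
    frozen copy of the initial student; the beta-weighted KL term penalizes
    drift away from it, one way to damp length and repetition inflation.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    student_p = student_logp.exp()

    # Distillation term: KL(student || teacher), summed over the vocabulary
    # at each position, averaged over batch and sequence.
    distill = (student_p * (student_logp - teacher_logp)).sum(-1).mean()
    # Divergence constraint: KL(student || frozen reference).
    constraint = (student_p * (student_logp - ref_logp)).sum(-1).mean()
    return distill + beta * constraint


def next_mixture_batch(on_policy_iter, off_policy_iter, alpha: float = 0.5):
    """Mixture distillation at the batch level: with probability `alpha`,
    train on student-generated (on-policy) data; otherwise fall back to
    teacher-generated (off-policy) data, anchoring the data distribution."""
    return next(on_policy_iter if random.random() < alpha else off_policy_iter)
```

Under these assumptions, the reference-KL term bounds how far the student's output distribution can drift during self-generated training, while the off-policy fraction of each epoch keeps truncation-prone degenerate samples from dominating the data stream.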