On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity

Andrei Liviu Nicolicioiu, Mohammad Pezeshki, Aaron Courville | June 24, 2026 | arXiv

Key Takeaway

Self-distillation trades diversity for accuracy: models become overconfident in their preferred solutions, hurting performance on out-of-distribution tasks that need varied strategies.

Summary

This paper reveals a hidden cost of on-policy self-distillation: while it achieves high average accuracy, it reduces output diversity by amplifying the model's existing biases. The authors show theoretically and empirically that self-distillation concentrates probability mass on dominant modes, causing pass@k curves to flatten: generating more rollouts no longer improves accuracy the way it does for a more diverse model.
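To make the flattening effect concrete, the sketch below computes pass@k with the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), for n sampled rollouts of which c are correct. The per-problem counts are hypothetical toy numbers chosen for illustration, not results from the paper, and the helper names are assumptions of this sketch.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n rollouts, c of which are correct."""
    if n - c < k:
        # Fewer than k incorrect rollouts: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

N = 100  # rollouts per problem (hypothetical)

# Hypothetical correct counts on four problems.
# "diverse": moderate success everywhere, so extra rollouts keep helping.
# "concentrated": always right or always wrong, so extra rollouts add nothing.
diverse = [30, 30, 30, 30]
concentrated = [100, 100, 0, 0]

for k in (1, 10, 50):
    d = mean_pass_at_k(diverse, N, k)
    c = mean_pass_at_k(concentrated, N, k)
    print(f"k={k:>2}  diverse={d:.3f}  concentrated={c:.3f}")
```

In this toy setup the concentrated policy has the higher pass@1 (0.5 versus 0.3), but its curve stays at 0.5 for every k, while the diverse policy's curve rises toward 1 as k grows. This is the qualitative pattern the summary describes, under the stated assumptions.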

training, reasoning, evaluation

Key Terms

self-distillation, on-policy-learning, pass-at-k, rollout-diversity, mutual-information