DOPD: Dual On-policy Distillation

Xinlei Yu, Gen Li, Qingyi Si, Guibin Zhang, Yuqi Xu et al. | June 29, 2026 | arXiv

Key Takeaway

When the teacher or the student in on-policy distillation has access to privileged information, routing supervision according to advantage gaps keeps the student from learning to exploit the information asymmetry instead of acquiring real capabilities, which improves both LLM and vision-language model performance.

Summary

This paper addresses a key problem in on-policy distillation: giving teachers or students privileged information (extra inputs) creates a 'privilege illusion', in which students learn to mimic the information asymmetry rather than transferable skills.
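To make the setting concrete, here is a minimal sketch of a token-level on-policy distillation loss with a routing mask. It is an illustration under stated assumptions, not the paper's implementation: the reverse-KL objective is the common choice for on-policy distillation, and the names `reverse_kl`, `routed_distill_loss`, and `route_mask` are hypothetical. This summary does not describe how DOPD derives its routing decisions from advantage gaps, so the mask is simply taken as an input.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position."""
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    return sum(
        pi * (math.log(pi) - math.log(qi))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def routed_distill_loss(student_seq, teacher_seq, route_mask):
    """Mean reverse KL over the token positions selected by route_mask.

    student_seq / teacher_seq: per-token logits, scored on a sequence
    sampled from the student (the on-policy part).
    route_mask: hypothetical 0/1 flags, one per token. How these would be
    computed from advantage gaps is an assumption left outside this sketch.
    """
    selected = sum(route_mask)
    if selected == 0:
        return 0.0  # no token receives supervision
    total = sum(
        m * reverse_kl(s, t)
        for s, t, m in zip(student_seq, teacher_seq, route_mask)
    )
    return total / selected
```

Passing a mask of all ones recovers plain token-level distillation; zeroing a position drops the teacher's signal there, which is the kind of per-token control that routing implies.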

training efficiency

Key Terms

on-policy-distillation, privileged-information, advantage-gap, token-level-supervision