When teachers or students in distillation are given privileged information, routing supervision based on advantage gaps prevents students from learning to exploit the information asymmetry instead of acquiring real capabilities, improving both LLM and vision-language model performance.
This paper addresses a key problem in on-policy distillation: adding privileged information (extra inputs) to teachers or students creates a 'privilege illusion', in which students learn to mimic the information asymmetry rather than acquire transferable skills.