Token importance in distillation has two sources: tokens where the student is highly uncertain, and tokens where the student is confident but disagrees with the teacher. Selecting tokens along these two axes enables training on under 20% of tokens while matching full-token performance.
This paper identifies which tokens matter most when training a smaller AI model (the student) to imitate a larger one (the teacher), using the teacher's predictions as targets. The key insight: focus on tokens where the student is uncertain OR confidently wrong. With this approach, you can train on just 50% of tokens while matching full training, cutting memory use by nearly half.
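To make the two-axis view concrete, here is a minimal NumPy sketch of one way such a token selector could look. The specific scores (student entropy for the uncertainty axis, KL(teacher || student) for the confidently-wrong axis), the rank-normalized max combination, and the `keep_frac` parameter are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

def select_tokens(student_logits, teacher_logits, keep_frac=0.5):
    """Pick the keep_frac most important token positions for distillation.

    student_logits, teacher_logits: arrays of shape (T, V) — one row of
    vocabulary logits per token position. Returns indices of selected tokens.
    """
    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    p_s = softmax(student_logits)  # student distribution per token
    p_t = softmax(teacher_logits)  # teacher distribution per token

    # Axis 1: student uncertainty — entropy of the student's distribution.
    entropy = -(p_s * np.log(p_s + 1e-12)).sum(axis=-1)

    # Axis 2: disagreement — KL(teacher || student), which is large when
    # the student puts low probability on what the teacher favors,
    # i.e. when a confident student is confidently wrong.
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1)

    # Rank-normalize each score to [0, 1] so the two axes are comparable,
    # then flag a token if EITHER axis ranks it highly.
    def rank(v):
        return v.argsort().argsort() / max(len(v) - 1, 1)

    score = np.maximum(rank(entropy), rank(kl))

    k = max(1, int(keep_frac * len(score)))
    return np.argsort(-score)[:k]  # indices of tokens to train on
```

In practice the distillation loss would then be computed only at the returned positions, which is where the memory saving comes from: skipped tokens contribute no loss and need no teacher-target storage.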