Token importance in distillation has two sources: tokens where the student is highly uncertain, and tokens where the student is confident but disagrees with the teacher. Selecting tokens along these two axes enables training on under 20% of tokens while matching full-token performance.
This paper identifies which tokens matter most when training a smaller AI model (the student) to imitate a larger one (the teacher), using the teacher's predictions as targets. The key insight: focus on tokens where the student is uncertain OR confidently wrong. With this approach, you can train on just 50% of tokens while matching full training, cutting memory use by nearly half.
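To make the two-axis view concrete, here is a minimal NumPy sketch of one way such a token selector could look. The specific scores (student entropy for the uncertainty axis, KL(teacher || student) for the confidently-wrong axis), the rank-normalized max combination, and the `keep_frac` parameter are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

def select_tokens(student_logits, teacher_logits, keep_frac=0.5):
    """Pick the keep_frac most important token positions for distillation.

    student_logits, teacher_logits: arrays of shape (T, V) — one row of
    vocabulary logits per token position. Returns indices of selected tokens.
    """
    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    p_s = softmax(student_logits)  # student distribution per token
    p_t = softmax(teacher_logits)  # teacher distribution per token

    # Axis 1: student uncertainty — entropy of the student's distribution.
    entropy = -(p_s * np.log(p_s + 1e-12)).sum(axis=-1)

    # Axis 2: disagreement — KL(teacher || student), which is large when
    # the student puts low probability on what the teacher favors,
    # i.e. when a confident student is confidently wrong.
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1)

    # Rank-normalize each score to [0, 1] so the two axes are comparable,
    # then flag a token if EITHER axis ranks it highly.
    def rank(v):
        return v.argsort().argsort() / max(len(v) - 1, 1)

    score = np.maximum(rank(entropy), rank(kl))

    k = max(1, int(keep_frac * len(score)))
    return np.argsort(-score)[:k]  # indices of tokens to train on
```

In practice the distillation loss would then be computed only at the returned positions, which is where the memory saving comes from: skipped tokens contribute no loss and need no teacher-target storage.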