Cross-architecture distillation for diffusion language models is now practical: large diffusion LLMs can be compressed into much smaller students (13x fewer parameters) with little loss in performance, even when teacher and student have completely different designs.
This paper introduces TIDE, a framework for distilling knowledge from large diffusion language models into much smaller ones across different architectures. Unlike previous distillation methods that assume teacher and student share a single model type, TIDE handles mismatches in overall design, attention mechanism, and even tokenizer.
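To make the cross-architecture setting concrete, here is a minimal sketch of one common way to distill between models whose hidden sizes and tokenizers differ: compare intermediate representations through a small learned projector, after aligning the two tokenizations onto shared token boundaries. The class name, the cosine objective, and the pooling-based alignment assumed in the comments are illustrative choices, not TIDE's published method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossArchDistillLoss(nn.Module):
    """Hidden-state distillation across mismatched architectures (generic sketch).

    A learned linear projector maps student hidden states into the teacher's
    representation space, so the two models can be compared even when their
    hidden sizes and attention designs differ.
    """

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Projector bridges the dimensionality gap between student and teacher.
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_hidden: torch.Tensor,
                teacher_hidden: torch.Tensor) -> torch.Tensor:
        # student_hidden: (batch, seq, student_dim)
        # teacher_hidden: (batch, seq, teacher_dim)
        # Assumes the sequences were pre-aligned, e.g. by re-tokenizing the
        # same text with both tokenizers and pooling student sub-tokens onto
        # teacher token boundaries.
        projected = self.proj(student_hidden)
        # Cosine distance is scale-invariant, which helps when the two
        # models' activation magnitudes differ.
        return (1.0 - F.cosine_similarity(projected, teacher_hidden, dim=-1)).mean()

# Example usage: add the distillation term to the student's task loss.
loss_fn = CrossArchDistillLoss(student_dim=768, teacher_dim=4096)
student_h = torch.randn(2, 16, 768)
teacher_h = torch.randn(2, 16, 4096)
distill_loss = loss_fn(student_h, teacher_h)
```

Matching hidden states rather than output logits sidesteps the vocabulary mismatch entirely, which is why projector-based objectives are a natural fit when teacher and student use different tokenizers.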