Cross-architecture distillation for diffusion language models is now practical: large diffusion LLMs can be compressed into much smaller students (13x fewer parameters) with little loss in performance, even when teacher and student have completely different designs.
This paper introduces TIDE, a framework for distilling knowledge from large diffusion language models into much smaller ones across different architectures. Unlike previous distillation methods that assume teacher and student share a single model type, TIDE handles mismatches in overall design, attention mechanism, and even tokenizer.
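To make the cross-architecture setting concrete, here is a minimal sketch of one common way to distill between models whose hidden sizes and tokenizers differ: compare intermediate representations through a small learned projector, after aligning the two tokenizations onto shared token boundaries. The class name, the cosine objective, and the pooling-based alignment assumed in the comments are illustrative choices, not TIDE's published method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossArchDistillLoss(nn.Module):
    """Hidden-state distillation across mismatched architectures (generic sketch).

    A learned linear projector maps student hidden states into the teacher's
    representation space, so the two models can be compared even when their
    hidden sizes and attention designs differ.
    """

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Projector bridges the dimensionality gap between student and teacher.
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_hidden: torch.Tensor,
                teacher_hidden: torch.Tensor) -> torch.Tensor:
        # student_hidden: (batch, seq, student_dim)
        # teacher_hidden: (batch, seq, teacher_dim)
        # Assumes the sequences were pre-aligned, e.g. by re-tokenizing the
        # same text with both tokenizers and pooling student sub-tokens onto
        # teacher token boundaries.
        projected = self.proj(student_hidden)
        # Cosine distance is scale-invariant, which helps when the two
        # models' activation magnitudes differ.
        return (1.0 - F.cosine_similarity(projected, teacher_hidden, dim=-1)).mean()

# Example usage: add the distillation term to the student's task loss.
loss_fn = CrossArchDistillLoss(student_dim=768, teacher_dim=4096)
student_h = torch.randn(2, 16, 768)
teacher_h = torch.randn(2, 16, 4096)
distill_loss = loss_fn(student_h, teacher_h)
```

Matching hidden states rather than output logits sidesteps the vocabulary mismatch entirely, which is why projector-based objectives are a natural fit when teacher and student use different tokenizers.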