Byte-level language models can now generate roughly 50% faster by predicting multiple bytes per step instead of one at a time, making them practical for real-world use without sacrificing output quality.
Byte-level language models match token-based models in quality but generate slowly because they must produce text one byte at a time. This paper introduces three faster variants: BLT-D uses diffusion to generate multiple bytes per step, BLT-S uses local drafting with verification, and BLT-DV combines both. All three reduce memory bandwidth costs by over 50% during generation.
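To make the draft-and-verify idea behind BLT-S concrete, here is a minimal sketch of generic greedy speculative decoding applied at the byte level. It is not the paper's implementation: `toy_next_byte_dist`, `draft_and_verify_step`, the greedy acceptance rule, and the choice of `k` are all illustrative assumptions. A cheap draft model proposes a few bytes, and the expensive target model keeps the longest prefix it agrees with, so one expensive pass can emit several bytes.

```python
import numpy as np

VOCAB = 256  # byte-level vocabulary: one entry per possible byte value


def toy_next_byte_dist(context: bytes) -> np.ndarray:
    """Stand-in for a model's next-byte distribution (purely illustrative)."""
    h = (sum(context) * 31 + len(context)) % VOCAB
    probs = np.full(VOCAB, 1e-4)
    probs[h] += 1.0
    return probs / probs.sum()


def draft_and_verify_step(context: bytes, draft_model, target_model, k: int = 4) -> bytes:
    """One greedy draft-and-verify step: the cheap draft model proposes k
    bytes, and the target model accepts the longest prefix that matches its
    own greedy choices, correcting the first mismatch if there is one."""
    # 1) Draft k bytes autoregressively with the cheap local model.
    drafted = bytearray()
    ctx = bytearray(context)
    for _ in range(k):
        b = int(np.argmax(draft_model(bytes(ctx))))
        drafted.append(b)
        ctx.append(b)

    # 2) Verify: accept drafted bytes while they match the target model's
    #    greedy choice; on the first mismatch, emit the target's byte instead.
    out = bytearray()
    ctx = bytearray(context)
    all_accepted = True
    for b in drafted:
        target_b = int(np.argmax(target_model(bytes(ctx))))
        out.append(target_b)
        ctx.append(target_b)
        if target_b != b:
            all_accepted = False
            break  # drafted continuation diverged; stop accepting

    # 3) If every draft was accepted, the same verification pass also yields
    #    one extra byte from the target model, so a step emits up to k+1 bytes.
    if all_accepted:
        out.append(int(np.argmax(target_model(bytes(ctx)))))

    return bytes(out)


# Toy usage: with identical draft and target models every draft is accepted,
# so each step emits k+1 bytes for a single "expensive" verification pass.
text = b"seed "
for _ in range(4):
    text += draft_and_verify_step(text, toy_next_byte_dist, toy_next_byte_dist, k=4)
print(text)
```

In a real system the verification in step 2 would be a single batched forward pass over all drafted positions rather than a loop, which is where the memory-bandwidth savings come from: the large model's weights are read once per group of bytes instead of once per byte.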