Byte-level language models can now generate roughly 50% faster by predicting multiple bytes per step instead of one at a time, making them practical for real-world use without sacrificing output quality.
Byte-level language models match token-based models in quality but generate slowly because they must produce text one byte at a time. This paper introduces three faster variants: BLT-D uses diffusion to generate multiple bytes per step, BLT-S uses local drafting with verification, and BLT-DV combines both. All three reduce memory bandwidth costs by over 50% during generation.
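To make the draft-and-verify idea behind BLT-S concrete, here is a minimal sketch of generic greedy speculative decoding applied at the byte level. It is not the paper's implementation: `toy_next_byte_dist`, `draft_and_verify_step`, the greedy acceptance rule, and the choice of `k` are all illustrative assumptions. A cheap draft model proposes a few bytes, and the expensive target model keeps the longest prefix it agrees with, so one expensive pass can emit several bytes.

```python
import numpy as np

VOCAB = 256  # byte-level vocabulary: one entry per possible byte value


def toy_next_byte_dist(context: bytes) -> np.ndarray:
    """Stand-in for a model's next-byte distribution (purely illustrative)."""
    h = (sum(context) * 31 + len(context)) % VOCAB
    probs = np.full(VOCAB, 1e-4)
    probs[h] += 1.0
    return probs / probs.sum()


def draft_and_verify_step(context: bytes, draft_model, target_model, k: int = 4) -> bytes:
    """One greedy draft-and-verify step: the cheap draft model proposes k
    bytes, and the target model accepts the longest prefix that matches its
    own greedy choices, correcting the first mismatch if there is one."""
    # 1) Draft k bytes autoregressively with the cheap local model.
    drafted = bytearray()
    ctx = bytearray(context)
    for _ in range(k):
        b = int(np.argmax(draft_model(bytes(ctx))))
        drafted.append(b)
        ctx.append(b)

    # 2) Verify: accept drafted bytes while they match the target model's
    #    greedy choice; on the first mismatch, emit the target's byte instead.
    out = bytearray()
    ctx = bytearray(context)
    all_accepted = True
    for b in drafted:
        target_b = int(np.argmax(target_model(bytes(ctx))))
        out.append(target_b)
        ctx.append(target_b)
        if target_b != b:
            all_accepted = False
            break  # drafted continuation diverged; stop accepting

    # 3) If every draft was accepted, the same verification pass also yields
    #    one extra byte from the target model, so a step emits up to k+1 bytes.
    if all_accepted:
        out.append(int(np.argmax(target_model(bytes(ctx)))))

    return bytes(out)


# Toy usage: with identical draft and target models every draft is accepted,
# so each step emits k+1 bytes for a single "expensive" verification pass.
text = b"seed "
for _ in range(4):
    text += draft_and_verify_step(text, toy_next_byte_dist, toy_next_byte_dist, k=4)
print(text)
```

In a real system the verification in step 2 would be a single batched forward pass over all drafted positions rather than a loop, which is where the memory-bandwidth savings come from: the large model's weights are read once per group of bytes instead of once per byte.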