Reusing sparse attention routing decisions across layers substantially reduces the computational cost of long-context inference without sacrificing quality, a practical win for deploying reasoning-heavy models with extended context windows.
This paper proposes cross-layer sparse attention (CLSA), a technique that speeds up long-context inference in large language models by deciding once which tokens to attend to and reusing that decision across all decoder layers. By sharing both the key-value cache and the routing decisions, CLSA achieves speedups of up to 7.6x while maintaining accuracy on long-context tasks.
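
The idea of routing once and reusing the result can be illustrated with a minimal pure-Python sketch. This is not the paper's implementation: the function names (`select_top_k`, `sparse_attention`, `decode_step`), the dot-product top-k routing rule, and the single-head layout without per-layer projections are all assumptions made for illustration. It only shows how one set of selected token indices and one shared key-value cache might serve every layer.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_top_k(query, keys, k):
    """Routing step (assumed rule): indices of the k keys with the highest
    dot-product score against the query, returned in ascending order."""
    scores = [dot(query, key) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_attention(query, keys, values, indices):
    """Softmax attention restricted to the selected token indices."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in indices]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[i][d] for w, i in zip(weights, indices)) / total
        for d in range(dim)
    ]


def decode_step(query, keys, values, num_layers, k):
    """Hypothetical decode step: route once, then reuse the same indices and
    the same shared key-value cache in every layer."""
    indices = select_top_k(query, keys, k)  # the only routing computation
    routing_calls = 1
    hidden = list(query)
    for _ in range(num_layers):
        out = sparse_attention(hidden, keys, values, indices)
        hidden = [h + o for h, o in zip(hidden, out)]  # residual connection
    return hidden, indices, routing_calls
```

In this sketch the routing cost is paid once per decode step instead of once per layer, and each layer attends over only `k` cached tokens rather than the full context. A real model would apply per-layer query projections and multiple heads, which the sketch omits.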