You Only Index Once: Cross-Layer Sparse Attention with Shared Routing

Yutao Sun, Yanqi Zhang, Li Dong, Jianyong Wang, Furu Wei | June 4, 2026 | arXiv

Key Takeaway

Reusing sparse attention routing decisions across layers substantially reduces the computational cost of long-context inference without sacrificing quality, a practical win for deploying reasoning-heavy models with extended context windows.

Summary

This paper proposes cross-layer sparse attention (CLSA), a technique that speeds up long-context inference in large language models by computing once which tokens to attend to and reusing that decision across all decoder layers. By sharing both the key-value cache and the routing decisions, it achieves speedups of up to 7.6x while maintaining accuracy on long-context tasks.
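
The idea of routing once and reusing the result can be illustrated with a toy decoding step. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names (`top_k_indices`, `sparse_attention`, `decode_step`), the dot-product top-k selection rule, the single cache read by every layer, and the omission of per-layer projections are all hypothetical simplifications chosen for illustration.

```python
import math


def top_k_indices(query, keys, k):
    """Return the indices of the k keys with the largest dot product with query."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    return sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]


def sparse_attention(query, keys, values, indices):
    """Softmax attention restricted to the cached tokens listed in indices."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * x for q, x in zip(query, keys[i])) for i in indices]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[i][d] for w, i in zip(weights, indices)) / total
        for d in range(dim)
    ]


def decode_step(hidden, keys, values, num_layers, k):
    """One toy decoding step: select tokens once, reuse the selection in every layer.

    Assumption: all layers read the same keys/values (a shared cache), and the
    hidden state is used directly as the query (no learned projections).
    """
    indices = top_k_indices(hidden, keys, k)  # routing is computed a single time
    for _ in range(num_layers):
        attended = sparse_attention(hidden, keys, values, indices)
        hidden = [h + a for h, a in zip(hidden, attended)]  # residual update
    return hidden, indices
```

In this sketch the full scan over the cache happens once per decoding step rather than once per layer, and each layer then touches only `k` cached tokens. A call such as `decode_step(hidden, keys, values, num_layers=3, k=4)` returns the updated hidden state together with the token indices that every layer reused.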

efficiency architecture

Key Terms

sparse-attention kv-cache token-routing long-context-inference decoding-efficiency