DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

Yuxiang Huang, Nuno M. T. Gonçalves, Federico Alvetreti, Lei Li, Xu Han et al. | May 18, 2026 | arXiv

Key Takeaway

DashAttention enables efficient long-context processing by combining adaptive sparse selection with differentiable training, outperforming fixed-sparsity methods while maintaining gradient flow through both attention stages.

Summary

DashAttention improves how language models handle long documents through a two-stage, hierarchical attention mechanism. Rather than selecting the same number of relevant tokens for every query, it adapts how many it selects to what each query needs, and the entire process stays trainable. The reported result is full-attention quality with 75% fewer computations.
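The two-stage idea can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the summary does not say how blocks are scored or which sparse mapping is used, so the sketch assumes mean-pooled key blocks for the first stage and swaps in sparsemax (Martins and Astudillo, 2016), a known differentiable mapping whose number of nonzero outputs depends on the input. The names `sparsemax`, `two_stage_attention`, and `block_size` are hypothetical and chosen only for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def sparsemax(scores):
    """Euclidean projection of scores onto the probability simplex.

    Unlike top-k, the number of nonzero outputs is not fixed: it depends
    on how peaked the scores are. (Assumed stand-in, not from the paper.)
    """
    cumsum, tau = 0.0, 0.0
    for j, zj in enumerate(sorted(scores, reverse=True), start=1):
        cumsum += zj
        if 1.0 + j * zj > cumsum:  # position j is still inside the support
            tau = (cumsum - 1.0) / j
    return [max(s - tau, 0.0) for s in scores]


def two_stage_attention(q, keys, values, block_size):
    """Single-query sketch: sparse block selection, then dense attention
    inside the selected blocks. Returns (output, indices of active blocks)."""
    scale = 1.0 / math.sqrt(len(q))
    n = len(keys)
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]

    # Stage 1 (assumption): score each block by its mean-pooled key.
    block_scores = []
    for blk in blocks:
        mean_key = [sum(keys[t][i] for t in blk) / len(blk) for i in range(len(q))]
        block_scores.append(scale * dot(q, mean_key))
    block_weights = sparsemax(block_scores)

    # Stage 2: softmax attention within each block that kept nonzero weight.
    out = [0.0] * len(values[0])
    active = []
    for b, (blk, w) in enumerate(zip(blocks, block_weights)):
        if w == 0.0:
            continue  # block pruned: its tokens are never touched
        active.append(b)
        token_weights = softmax([scale * dot(q, keys[t]) for t in blk])
        for t, a in zip(blk, token_weights):
            for i, v in enumerate(values[t]):
                out[i] += w * a * v
    return out, active


keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [-1.0, 0.0], [-1.0, 0.0]]
values = [[1.0], [1.0], [2.0], [2.0], [3.0], [3.0]]
peaked_out, peaked_active = two_stage_attention([4.0, 0.0], keys, values, block_size=2)
flat_out, flat_active = two_stage_attention([0.4, 0.0], keys, values, block_size=2)
```

In this toy setup the sharply peaked query keeps a single block while the flatter query keeps all three, which is the adaptive behavior the summary describes. Because the block weights come from a continuous mapping rather than a hard top-k cut, the weights of the blocks that stay active carry gradients back to the first-stage scores.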

efficiency, architecture

Key Terms

sparse-attention, hierarchical-attention, key-value-caches, softmax-attention, differentiable