Sparse attention can now match or beat FlashAttention-2's speed on long contexts, making it practical to build models that handle extended input sequences without paying the full quadratic compute cost of dense attention.
AdaSplash-2 speeds up sparse attention with a histogram-based procedure for quickly computing the normalizer required by differentiable sparse attention, i.e. the threshold that determines which entries receive nonzero weight. As a result, transformers can handle long contexts efficiently: throughput matches softmax attention at short sequence lengths and pulls significantly ahead as sequences grow.
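The summary above doesn't spell out the histogram procedure, so here is a minimal sketch of the general idea for sparsemax, the simplest differentiable sparse attention map (the alpha = 2 case of alpha-entmax). For sparsemax, "computing the normalizer" means finding the threshold tau with sum(max(z - tau, 0)) = 1. Instead of sorting all scores, a histogram-based search buckets them, locates the single bucket where the mass crosses 1, and refines only there. Everything below is an illustrative assumption on my part, not AdaSplash-2's actual kernel: the function names, the bucketing scheme, and the NumPy setting (the real method runs inside fused GPU kernels) are all hypothetical.

```python
import numpy as np

def sparsemax_tau_sort(z):
    """Exact sparsemax threshold tau via a full sort (O(n log n) reference)."""
    zs = np.sort(z)[::-1]
    cssv = np.cumsum(zs) - 1.0                 # cumulative sums minus the target mass 1
    ks = np.arange(1, zs.size + 1)
    k = ks[zs - cssv / ks > 0][-1]             # largest support size with positive gap
    return cssv[k - 1] / k

def sparsemax_tau_hist(z, bins=16):
    """Histogram-based search for the same tau: bucket the scores, find the
    bucket where f(e) = sum(max(z - e, 0)) - 1 crosses zero, refine only there."""
    active = np.asarray(z, dtype=np.float64)
    acc_cnt, acc_sum = 0, 0.0                  # scores already known to lie above tau
    lo = (active.sum() - 1.0) / active.size    # tau always lies in [lo, hi]
    hi = active.max()
    while active.size > 8 and hi - lo > 1e-12:
        edges = np.linspace(lo, hi, bins + 1)
        idx = np.clip(np.searchsorted(edges, active, side="right") - 1, 0, bins - 1)
        cnt = np.bincount(idx, minlength=bins)
        ssum = np.bincount(idx, weights=active, minlength=bins)
        suf_cnt = np.cumsum(cnt[::-1])[::-1]   # counts of scores in bins j..end
        suf_sum = np.cumsum(ssum[::-1])[::-1]
        # f evaluated at each left bin edge; f is decreasing and tau is its root
        f = acc_sum + suf_sum - (acc_cnt + suf_cnt) * edges[:-1] - 1.0
        pos = np.nonzero(f >= 0.0)[0]
        if pos.size == 0:                      # numerical guard: fall back to exact finish
            break
        j = int(pos[-1])                       # tau lies in [edges[j], edges[j+1])
        if j + 1 < bins:                       # scores in bins above j are surely in support
            acc_cnt += int(suf_cnt[j + 1])
            acc_sum += float(suf_sum[j + 1])
        active = active[idx == j]              # keep only the undecided bucket
        lo, hi = edges[j], edges[j + 1]
    # finish exactly on the few surviving candidates
    a = np.sort(active)[::-1]
    csum = acc_sum + np.concatenate(([0.0], np.cumsum(a)))
    ccnt = acc_cnt + np.arange(a.size + 1)
    for k in range(a.size, -1, -1):
        if ccnt[k] == 0:
            continue
        tau = (csum[k] - 1.0) / ccnt[k]
        if (k == 0 or a[k - 1] > tau) and (k == a.size or a[k] <= tau):
            return tau
    return (csum[-1] - 1.0) / ccnt[-1]         # unreachable in exact arithmetic
```

Each pass either discards whole buckets of decided scores or shrinks the search interval by a factor of `bins`, which is why this kind of search avoids the full sort; a fused kernel can additionally skip attention blocks whose scores all fall below the threshold.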