Sparse attention can now match or beat FlashAttention-2's speed on long contexts, making it practical to build models that handle extended input sequences without paying the full quadratic compute cost of dense attention.
AdaSplash-2 speeds up sparse attention with a histogram-based procedure for quickly computing the normalizer required by differentiable sparse attention, i.e. the threshold that determines which entries receive nonzero weight. As a result, transformers can handle long contexts efficiently: throughput matches softmax attention at short sequence lengths and pulls significantly ahead as sequences grow.
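The summary above doesn't spell out the histogram procedure, so here is a minimal sketch of the general idea for sparsemax, the simplest differentiable sparse attention map (the alpha = 2 case of alpha-entmax). For sparsemax, "computing the normalizer" means finding the threshold tau with sum(max(z - tau, 0)) = 1. Instead of sorting all scores, a histogram-based search buckets them, locates the single bucket where the mass crosses 1, and refines only there. Everything below is an illustrative assumption on my part, not AdaSplash-2's actual kernel: the function names, the bucketing scheme, and the NumPy setting (the real method runs inside fused GPU kernels) are all hypothetical.

```python
import numpy as np

def sparsemax_tau_sort(z):
    """Exact sparsemax threshold tau via a full sort (O(n log n) reference)."""
    zs = np.sort(z)[::-1]
    cssv = np.cumsum(zs) - 1.0                 # cumulative sums minus the target mass 1
    ks = np.arange(1, zs.size + 1)
    k = ks[zs - cssv / ks > 0][-1]             # largest support size with positive gap
    return cssv[k - 1] / k

def sparsemax_tau_hist(z, bins=16):
    """Histogram-based search for the same tau: bucket the scores, find the
    bucket where f(e) = sum(max(z - e, 0)) - 1 crosses zero, refine only there."""
    active = np.asarray(z, dtype=np.float64)
    acc_cnt, acc_sum = 0, 0.0                  # scores already known to lie above tau
    lo = (active.sum() - 1.0) / active.size    # tau always lies in [lo, hi]
    hi = active.max()
    while active.size > 8 and hi - lo > 1e-12:
        edges = np.linspace(lo, hi, bins + 1)
        idx = np.clip(np.searchsorted(edges, active, side="right") - 1, 0, bins - 1)
        cnt = np.bincount(idx, minlength=bins)
        ssum = np.bincount(idx, weights=active, minlength=bins)
        suf_cnt = np.cumsum(cnt[::-1])[::-1]   # counts of scores in bins j..end
        suf_sum = np.cumsum(ssum[::-1])[::-1]
        # f evaluated at each left bin edge; f is decreasing and tau is its root
        f = acc_sum + suf_sum - (acc_cnt + suf_cnt) * edges[:-1] - 1.0
        pos = np.nonzero(f >= 0.0)[0]
        if pos.size == 0:                      # numerical guard: fall back to exact finish
            break
        j = int(pos[-1])                       # tau lies in [edges[j], edges[j+1])
        if j + 1 < bins:                       # scores in bins above j are surely in support
            acc_cnt += int(suf_cnt[j + 1])
            acc_sum += float(suf_sum[j + 1])
        active = active[idx == j]              # keep only the undecided bucket
        lo, hi = edges[j], edges[j + 1]
    # finish exactly on the few surviving candidates
    a = np.sort(active)[::-1]
    csum = acc_sum + np.concatenate(([0.0], np.cumsum(a)))
    ccnt = acc_cnt + np.arange(a.size + 1)
    for k in range(a.size, -1, -1):
        if ccnt[k] == 0:
            continue
        tau = (csum[k] - 1.0) / ccnt[k]
        if (k == 0 or a[k - 1] > tau) and (k == a.size or a[k] <= tau):
            return tau
    return (csum[-1] - 1.0) / ccnt[-1]         # unreachable in exact arithmetic
```

Each pass either discards whole buckets of decided scores or shrinks the search interval by a factor of `bins`, which is why this kind of search avoids the full sort; a fused kernel can additionally skip attention blocks whose scores all fall below the threshold.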