Sparse attention algorithms only work well in practice when paired with careful system design—SPIN shows that unifying different sparsity methods and optimizing GPU-CPU memory transfers can turn algorithmic gains into real performance improvements for long-context LLM serving.
SPIN is a system for serving large language models with long contexts. It combines sparse attention (which reads only the most relevant parts of the cached context rather than all of it) with coordinated memory management across GPU and CPU: it unifies different sparse attention methods into a common framework and optimizes how data moves between fast GPU memory and slower, larger CPU memory, achieving 1.66-5.66x higher throughput than existing systems.
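To make the core idea concrete, here is a minimal toy sketch of sparse attention: score every cached key, then attend only to the top-k most relevant ones instead of the whole cache. This is illustrative only and not SPIN's actual algorithm; the function name, the top-k selection rule, and the dimensions are all assumptions for the example (real systems typically select at block granularity and keep unselected entries in CPU memory).

```python
import numpy as np

def sparse_attention(q, K, V, k=4):
    """Toy sparse attention: attend only to the top-k keys for query q.

    Illustrative sketch, not SPIN's method -- real serving systems
    select at block granularity and page unselected KV entries to CPU.
    """
    scores = K @ q / np.sqrt(q.shape[0])   # similarity of the query to each cached key
    top = np.argsort(scores)[-k:]          # indices of the k most relevant keys
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                           # softmax over the selected subset only
    return w @ V[top]                      # weighted sum of the selected values

# Tiny usage example with a 64-entry cache of dimension 8.
rng = np.random.default_rng(0)
K = rng.normal(size=(64, 8))   # cached keys
V = rng.normal(size=(64, 8))   # cached values
q = rng.normal(size=8)         # current query
out = sparse_attention(q, K, V, k=4)
print(out.shape)  # (8,)
```

The point of the sketch is the cost model: full attention touches all 64 cached entries, while the sparse variant reads only 4 of them, which is what makes it attractive when the cache is too large for GPU memory.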