You can now run exact attention over billion-token sequences on a single GPU by streaming chunks through memory: no approximation is needed, just smarter scheduling of the computation.
This paper tackles the memory bottleneck that keeps long-context language models off single GPUs: standard attention materializes a score matrix that grows quadratically with sequence length. Instead of approximating attention (which costs accuracy), it mathematically decomposes the computation into smaller independent chunks that are processed one at a time, streaming partial results so the full score matrix never has to reside in memory at once.
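The core trick behind this style of chunked exact attention is an online softmax: keep a running row-maximum, normalizer, and weighted value sum, and rescale them as each new key/value chunk arrives, so the result is bitwise-exact attention without ever holding all scores. The sketch below is a minimal NumPy illustration of that general idea (function name, shapes, and chunk size are my own choices, not the paper's implementation):

```python
import numpy as np

def chunked_attention(q, k, v, chunk_size=128):
    """Exact softmax attention computed one key/value chunk at a time.

    q: (n_queries, d), k/v: (n_keys, d). Only one chunk of K/V (plus
    O(n_queries * d) running state) is needed at any moment.
    """
    d = q.shape[-1]
    scale = 1.0 / np.sqrt(d)
    m = np.full(q.shape[0], -np.inf)       # running max of scores per query row
    l = np.zeros(q.shape[0])               # running softmax normalizer
    acc = np.zeros_like(q)                 # running weighted sum of values
    for start in range(0, k.shape[0], chunk_size):
        kc = k[start:start + chunk_size]
        vc = v[start:start + chunk_size]
        s = (q @ kc.T) * scale             # scores for this chunk only
        m_new = np.maximum(m, s.max(axis=-1))
        corr = np.exp(m - m_new)           # rescale previous partial sums
        p = np.exp(s - m_new[:, None])     # unnormalized chunk probabilities
        l = l * corr + p.sum(axis=-1)
        acc = acc * corr[:, None] + p @ vc
        m = m_new
    return acc / l[:, None]                # normalize once at the end
```

Because each chunk only updates three small running statistics, the chunks are independent of the full sequence length, and the final division recovers exactly the same output as materializing the entire attention matrix.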