You can now run exact attention over billion-token sequences on a single GPU by streaming chunks through memory: no approximation is needed, just smarter scheduling of the computation.
This paper tackles the memory bottleneck that keeps long-context language models off single GPUs: standard attention materializes a score matrix that grows quadratically with sequence length. Instead of approximating attention (which costs accuracy), it mathematically decomposes the computation into smaller independent chunks that are processed one at a time, streaming partial results so the full score matrix never has to reside in memory at once.
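The core trick behind this style of chunked exact attention is an online softmax: keep a running row-maximum, normalizer, and weighted value sum, and rescale them as each new key/value chunk arrives, so the result is bitwise-exact attention without ever holding all scores. The sketch below is a minimal NumPy illustration of that general idea (function name, shapes, and chunk size are my own choices, not the paper's implementation):

```python
import numpy as np

def chunked_attention(q, k, v, chunk_size=128):
    """Exact softmax attention computed one key/value chunk at a time.

    q: (n_queries, d), k/v: (n_keys, d). Only one chunk of K/V (plus
    O(n_queries * d) running state) is needed at any moment.
    """
    d = q.shape[-1]
    scale = 1.0 / np.sqrt(d)
    m = np.full(q.shape[0], -np.inf)       # running max of scores per query row
    l = np.zeros(q.shape[0])               # running softmax normalizer
    acc = np.zeros_like(q)                 # running weighted sum of values
    for start in range(0, k.shape[0], chunk_size):
        kc = k[start:start + chunk_size]
        vc = v[start:start + chunk_size]
        s = (q @ kc.T) * scale             # scores for this chunk only
        m_new = np.maximum(m, s.max(axis=-1))
        corr = np.exp(m - m_new)           # rescale previous partial sums
        p = np.exp(s - m_new[:, None])     # unnormalized chunk probabilities
        l = l * corr + p.sum(axis=-1)
        acc = acc * corr[:, None] + p @ vc
        m = m_new
    return acc / l[:, None]                # normalize once at the end
```

Because each chunk only updates three small running statistics, the chunks are independent of the full sequence length, and the final division recovers exactly the same output as materializing the entire attention matrix.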