KV-cache offloading works well for simple long-context tasks, but it breaks down when a task requires extracting large amounts of information from the input, a critical gap for real-world applications such as document analysis.
This paper shows that KV-cache offloading, a technique for reducing GPU memory usage during long-context LLM inference, fails on tasks that require extracting substantial information from the prompt. The authors build a Text2JSON benchmark, demonstrate that existing offloading methods degrade accuracy significantly on it, and propose a simpler alternative that performs better across multiple models.
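To make the memory/latency trade-off concrete, here is a minimal conceptual sketch of KV-cache offloading. All names (`ToyKVOffloader`, the two-tier dictionaries) are hypothetical illustrations, not the paper's method or any real library's API: per-layer key/value tensors stay in a small "GPU" tier, older layers are evicted to a larger "CPU" tier, and a fetch back to the GPU tier is the latency cost offloading pays to save GPU memory.

```python
# Conceptual sketch only: a toy two-tier KV cache with LRU eviction.
# Real systems move tensors between device and host memory; here plain
# Python containers stand in for the two tiers.
from collections import OrderedDict

class ToyKVOffloader:
    def __init__(self, gpu_capacity: int):
        self.gpu_capacity = gpu_capacity   # max layers resident "on GPU"
        self.gpu = OrderedDict()           # layer_id -> (keys, values), LRU order
        self.cpu = {}                      # offloaded layers ("host memory")
        self.transfers = 0                 # count of CPU -> GPU fetches

    def put(self, layer_id, keys, values):
        self.gpu[layer_id] = (keys, values)
        self.gpu.move_to_end(layer_id)
        while len(self.gpu) > self.gpu_capacity:
            # Evict the least recently used layer to the CPU tier.
            old_id, kv = self.gpu.popitem(last=False)
            self.cpu[old_id] = kv

    def get(self, layer_id):
        if layer_id not in self.gpu:
            # Simulated CPU -> GPU transfer: the latency cost that
            # offloading trades for reduced GPU memory use.
            self.transfers += 1
            self.put(layer_id, *self.cpu.pop(layer_id))
        self.gpu.move_to_end(layer_id)
        return self.gpu[layer_id]

cache = ToyKVOffloader(gpu_capacity=2)
for layer in range(4):                     # 4 layers, only 2 fit "on GPU"
    cache.put(layer, keys=[layer], values=[layer * 10])
k, v = cache.get(0)                        # layer 0 was offloaded; fetch it back
print(k, v, cache.transfers)               # -> [0] [0] 1
```

The sketch also hints at why extraction-heavy tasks are hard for such schemes: when decoding must repeatedly attend to many different parts of the context, evicted entries are fetched back often and the transfer cost dominates.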