KV-cache offloading works well for simple long-context tasks, but it breaks down when a task requires extracting large amounts of information from the input, a critical gap for real-world applications such as document analysis.
This paper shows that KV-cache offloading, a technique for reducing GPU memory usage during long-context LLM inference, fails on tasks that require extracting substantial information from the prompt. The authors build a Text2JSON benchmark, demonstrate that existing offloading methods degrade accuracy significantly on it, and propose a simpler alternative that performs better across multiple models.
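To make the memory/latency trade-off concrete, here is a minimal conceptual sketch of KV-cache offloading. All names (`ToyKVOffloader`, the two-tier dictionaries) are hypothetical illustrations, not the paper's method or any real library's API: per-layer key/value tensors stay in a small "GPU" tier, older layers are evicted to a larger "CPU" tier, and a fetch back to the GPU tier is the latency cost offloading pays to save GPU memory.

```python
# Conceptual sketch only: a toy two-tier KV cache with LRU eviction.
# Real systems move tensors between device and host memory; here plain
# Python containers stand in for the two tiers.
from collections import OrderedDict

class ToyKVOffloader:
    def __init__(self, gpu_capacity: int):
        self.gpu_capacity = gpu_capacity   # max layers resident "on GPU"
        self.gpu = OrderedDict()           # layer_id -> (keys, values), LRU order
        self.cpu = {}                      # offloaded layers ("host memory")
        self.transfers = 0                 # count of CPU -> GPU fetches

    def put(self, layer_id, keys, values):
        self.gpu[layer_id] = (keys, values)
        self.gpu.move_to_end(layer_id)
        while len(self.gpu) > self.gpu_capacity:
            # Evict the least recently used layer to the CPU tier.
            old_id, kv = self.gpu.popitem(last=False)
            self.cpu[old_id] = kv

    def get(self, layer_id):
        if layer_id not in self.gpu:
            # Simulated CPU -> GPU transfer: the latency cost that
            # offloading trades for reduced GPU memory use.
            self.transfers += 1
            self.put(layer_id, *self.cpu.pop(layer_id))
        self.gpu.move_to_end(layer_id)
        return self.gpu[layer_id]

cache = ToyKVOffloader(gpu_capacity=2)
for layer in range(4):                     # 4 layers, only 2 fit "on GPU"
    cache.put(layer, keys=[layer], values=[layer * 10])
k, v = cache.get(0)                        # layer 0 was offloaded; fetch it back
print(k, v, cache.transfers)               # -> [0] [0] 1
```

The sketch also hints at why extraction-heavy tasks are hard for such schemes: when decoding must repeatedly attend to many different parts of the context, evicted entries are fetched back often and the transfer cost dominates.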