You can cut vision-language model KV cache memory roughly in half by compressing vision tokens according to what the text prompt actually needs, rather than keeping all visual information.
LightKV reduces GPU memory overhead in vision-language models by compressing the Key-Value (KV) cache during inference. It uses the text prompt to identify which vision tokens matter most, retaining only about 55% of them while maintaining performance and roughly halving memory use.
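
LightKV's exact scoring rule isn't reproduced here; the sketch below only illustrates the general idea of text-guided vision-token selection for a KV cache, assuming a simple cross-attention importance score. The helper name `select_vision_kv`, the 576/32 token counts, and the default `keep_ratio=0.55` are illustrative assumptions, not the paper's implementation.

```python
import torch

def select_vision_kv(vision_keys, vision_values, text_queries, keep_ratio=0.55):
    """Keep only the vision KV entries most attended to by the text prompt.

    vision_keys, vision_values: (num_vision_tokens, head_dim)
    text_queries:               (num_text_tokens, head_dim)
    Returns the compressed keys/values and the indices that were kept.
    """
    # Cross-attention scores between text-prompt queries and vision keys.
    scores = text_queries @ vision_keys.T / vision_keys.shape[-1] ** 0.5
    attn = scores.softmax(dim=-1)               # (num_text, num_vision)

    # Importance of each vision token = total attention it receives from the prompt.
    importance = attn.sum(dim=0)                # (num_vision,)

    # Retain the top keep_ratio fraction of vision tokens (55% by default).
    num_keep = max(1, int(keep_ratio * vision_keys.shape[0]))
    keep_idx = importance.topk(num_keep).indices.sort().values

    return vision_keys[keep_idx], vision_values[keep_idx], keep_idx


if __name__ == "__main__":
    torch.manual_seed(0)
    v_keys = torch.randn(576, 64)   # e.g. 576 vision tokens from a ViT encoder
    v_vals = torch.randn(576, 64)
    t_q    = torch.randn(32, 64)    # 32 text-prompt tokens

    k, v, idx = select_vision_kv(v_keys, v_vals, t_q)
    print(k.shape, v.shape)         # ~55% of the vision KV entries remain
```

In practice the same selection would be applied per attention head (or per layer), and the kept indices reused for the value cache so that keys and values stay aligned.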