You can cut vision-language model KV cache memory roughly in half by compressing vision tokens according to what the text prompt actually needs, rather than keeping all visual information.
LightKV reduces GPU memory overhead in vision-language models by compressing the Key-Value (KV) cache during inference. It uses the text prompt to identify which vision tokens matter most, retaining only about 55% of them while maintaining performance and roughly halving memory use.
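
LightKV's exact scoring rule isn't reproduced here; the sketch below only illustrates the general idea of text-guided vision-token selection for a KV cache, assuming a simple cross-attention importance score. The helper name `select_vision_kv`, the 576/32 token counts, and the default `keep_ratio=0.55` are illustrative assumptions, not the paper's implementation.

```python
import torch

def select_vision_kv(vision_keys, vision_values, text_queries, keep_ratio=0.55):
    """Keep only the vision KV entries most attended to by the text prompt.

    vision_keys, vision_values: (num_vision_tokens, head_dim)
    text_queries:               (num_text_tokens, head_dim)
    Returns the compressed keys/values and the indices that were kept.
    """
    # Cross-attention scores between text-prompt queries and vision keys.
    scores = text_queries @ vision_keys.T / vision_keys.shape[-1] ** 0.5
    attn = scores.softmax(dim=-1)               # (num_text, num_vision)

    # Importance of each vision token = total attention it receives from the prompt.
    importance = attn.sum(dim=0)                # (num_vision,)

    # Retain the top keep_ratio fraction of vision tokens (55% by default).
    num_keep = max(1, int(keep_ratio * vision_keys.shape[0]))
    keep_idx = importance.topk(num_keep).indices.sort().values

    return vision_keys[keep_idx], vision_values[keep_idx], keep_idx


if __name__ == "__main__":
    torch.manual_seed(0)
    v_keys = torch.randn(576, 64)   # e.g. 576 vision tokens from a ViT encoder
    v_vals = torch.randn(576, 64)
    t_q    = torch.randn(32, 64)    # 32 text-prompt tokens

    k, v, idx = select_vision_kv(v_keys, v_vals, t_q)
    print(k.shape, v.shape)         # ~55% of the vision KV entries remain
```

In practice the same selection would be applied per attention head (or per layer), and the kept indices reused for the value cache so that keys and values stay aligned.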