You can make vision-language models faster without losing visual detail by being selective about which attention layers process images: use efficient cross-attention for context, and add self-attention layers only when task complexity demands it.
VISOR improves vision-language model efficiency by selectively attending to visual information rather than compressing images. Instead of reducing visual tokens, it uses sparse cross-attention and dynamically chosen self-attention layers to process high-resolution details only when needed, reducing computation while maintaining performance on complex visual reasoning tasks.
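The selective scheme can be sketched as a single layer that always applies cheap cross-attention from text to visual tokens and falls back to full joint self-attention only when a complexity signal is high. This is an illustrative NumPy sketch under assumed names (`visor_style_layer`, the `complexity` gate, the threshold), not VISOR's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def visor_style_layer(text, vis, complexity, threshold=0.5):
    # Always: cross-attention — text queries attend to visual tokens.
    # Cost grows with (num_text * num_vis), avoiding quadratic
    # self-attention over the long visual sequence.
    out = text + attention(text, vis, vis)
    # Only when the task looks hard: self-attention over the joint
    # sequence, recovering fine-grained visual interactions at higher cost.
    if complexity > threshold:
        joint = np.concatenate([out, vis], axis=0)
        joint = joint + attention(joint, joint, joint)
        out = joint[: text.shape[0]]
    return out

rng = np.random.default_rng(0)
text = rng.standard_normal((4, 8))   # 4 text tokens, dim 8
vis = rng.standard_normal((16, 8))   # 16 visual tokens, dim 8
easy = visor_style_layer(text, vis, complexity=0.1)  # cheap path only
hard = visor_style_layer(text, vis, complexity=0.9)  # extra self-attention
```

The gate decides per input how much visual computation to spend; an easy query pays only the cross-attention cost, while a hard one also pays for joint self-attention.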