You can make vision-language models faster without losing visual detail by being selective about which attention layers process images: use efficient cross-attention for context, and add self-attention layers only when task complexity demands it.
VISOR improves vision-language model efficiency by selectively attending to visual information rather than compressing images. Instead of reducing visual tokens, it uses sparse cross-attention and dynamically chosen self-attention layers to process high-resolution details only when needed, reducing computation while maintaining performance on complex visual reasoning tasks.
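The selective scheme can be sketched as a single layer that always applies cheap cross-attention from text to visual tokens and falls back to full joint self-attention only when a complexity signal is high. This is an illustrative NumPy sketch under assumed names (`visor_style_layer`, the `complexity` gate, the threshold), not VISOR's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def visor_style_layer(text, vis, complexity, threshold=0.5):
    # Always: cross-attention — text queries attend to visual tokens.
    # Cost grows with (num_text * num_vis), avoiding quadratic
    # self-attention over the long visual sequence.
    out = text + attention(text, vis, vis)
    # Only when the task looks hard: self-attention over the joint
    # sequence, recovering fine-grained visual interactions at higher cost.
    if complexity > threshold:
        joint = np.concatenate([out, vis], axis=0)
        joint = joint + attention(joint, joint, joint)
        out = joint[: text.shape[0]]
    return out

rng = np.random.default_rng(0)
text = rng.standard_normal((4, 8))   # 4 text tokens, dim 8
vis = rng.standard_normal((16, 8))   # 16 visual tokens, dim 8
easy = visor_style_layer(text, vis, complexity=0.1)  # cheap path only
hard = visor_style_layer(text, vis, complexity=0.9)  # extra self-attention
```

The gate decides per input how much visual computation to spend; an easy query pays only the cross-attention cost, while a hard one also pays for joint self-attention.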