Gaze Heads: How VLMs Look at What They Describe

Rohit Gandikota, David Bau | June 12, 2026 | arXiv

Key Takeaway

VLMs have interpretable internal mechanisms (gaze heads) that can be surgically edited at inference time to control what the model describes, offering a practical way to steer multimodal outputs without model retraining.

Summary

This paper finds that vision-language models (VLMs) develop specialized attention heads, called "gaze heads," that track which image region the model is currently describing. By redirecting these heads' attention at inference time, the researchers can steer the model to describe any chosen image region without retraining. The reported result is 83% accuracy on comic panels, and the approach extends to natural images.
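To make the idea of redirecting attention concrete, here is a minimal NumPy sketch of an attention-mask intervention. It is an illustration under stated assumptions, not the authors' implementation: the function name `steer_gaze_heads`, the head indices, and the toy shapes are all hypothetical, and the keys are assumed to be image-patch tokens only (a real VLM also attends to text tokens). For the selected heads, pre-softmax scores on patches outside the target region are set to negative infinity, so all of that head's attention mass lands inside the region.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def steer_gaze_heads(logits, gaze_heads, region_mask):
    """Restrict selected heads to a target image region (hypothetical helper).

    logits:      (num_heads, num_patches) pre-softmax attention scores
                 for a single query token.
    gaze_heads:  indices of the heads to intervene on (assumed to be known).
    region_mask: boolean (num_patches,), True for patches inside the region.
    Returns the post-softmax attention weights, shape (num_heads, num_patches).
    """
    logits = np.asarray(logits, dtype=float)
    region_mask = np.asarray(region_mask, dtype=bool)
    if not region_mask.any():
        raise ValueError("region_mask must select at least one patch")

    steered = logits.copy()
    for h in gaze_heads:
        # Block every patch outside the target region for this head only.
        steered[h, ~region_mask] = -np.inf
    return softmax(steered, axis=-1)


# Toy example: 4 heads, 6 image patches, steer heads 1 and 3 to patches 2-3.
rng = np.random.default_rng(0)
toy_logits = rng.normal(size=(4, 6))
toy_region = np.array([False, False, True, True, False, False])
weights = steer_gaze_heads(toy_logits, gaze_heads=[1, 3], region_mask=toy_region)
print(np.round(weights, 3))
```

Heads that are not listed in `gaze_heads` keep their original attention pattern, which mirrors the targeted nature of the edit described above: only a small set of heads is modified and the model weights are untouched.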

multimodal

Key Terms

attention-head, vision-language-model, inference-time-steering, attention-mask-intervention, mechanistic-interpretability