Vision-language models (VLMs) contain interpretable internal mechanisms, called gaze heads, that can be edited at inference time to control which image region the model describes, offering a practical way to steer multimodal outputs without retraining.
This paper finds that vision-language models develop specialized attention heads, called 'gaze heads', whose attention tracks the image region the model is currently describing. By redirecting the attention of these heads during inference, researchers can steer the model to describe a chosen image region without retraining. The method achieves 83% accuracy on comic panels and extends to natural images.
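
The redirection idea can be illustrated with a small NumPy sketch. This is not the paper's implementation: the function names (`region_mask`, `redirect_heads`), the head indices, the patch-grid size, and the choice of a large negative bias on out-of-region logits are all assumptions made for illustration. The sketch takes pre-softmax attention scores from one query token to the image tokens and restricts only the selected heads to a target region.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def region_mask(grid_h, grid_w, box):
    """Boolean mask over a flattened patch grid.

    box = (row0, col0, row1, col1), end-exclusive (hypothetical convention).
    """
    r0, c0, r1, c1 = box
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    mask[r0:r1, c0:c1] = True
    return mask.reshape(-1)

def redirect_heads(logits, gaze_heads, target_mask, penalty=-1e9):
    """Restrict selected heads to the target region.

    logits: (num_heads, num_image_tokens) pre-softmax scores for one query token.
    gaze_heads: indices of heads to edit (assumed to be known in advance).
    Returns attention weights of the same shape.
    """
    edited = logits.copy()
    for h in gaze_heads:
        # Push out-of-region scores far down so softmax assigns them ~0 weight.
        edited[h, ~target_mask] += penalty
    return softmax(edited, axis=-1)

# Toy example: 4 heads attending over a 4x4 grid of image patches.
rng = np.random.default_rng(0)
num_heads, grid_h, grid_w = 4, 4, 4
logits = rng.normal(size=(num_heads, grid_h * grid_w))

target = region_mask(grid_h, grid_w, (0, 2, 2, 4))  # top-right quadrant
gaze_heads = [1, 3]                                 # hypothetical head indices

attn = redirect_heads(logits, gaze_heads, target)
mass_on_target = attn[:, target].sum(axis=1)        # per-head weight inside the region
```

In this toy setup the edited heads place all of their attention inside the target region, while the other heads keep their original distribution. In a real model the same edit would have to be applied inside the attention computation at each decoding step, and which heads count as gaze heads would have to be identified first.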