Vision-language models contain a sparse, identifiable causal circuit that controls whether they trust visual input or stored knowledge: removing just a few attention heads flips the model from knowledge-based to vision-based answers in most cases.
This paper examines how vision-language models choose between visual evidence and memorized knowledge when the two conflict. Using activation analysis, the authors identify a small set of attention heads (2.5-4.8% of heads) that acts as a causal switch: removing these heads makes a model answer from what the image shows rather than from what it has memorized.
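To make "removing" a head concrete, here is a minimal sketch of zero-ablation in a single toy self-attention layer, written in pure Python. It is an illustration under stated assumptions, not the paper's implementation: the function name `attention_with_ablation`, the `ablate` parameter, and the choice of zeroing a head's output (rather than, say, replacing it with a mean activation) are all hypothetical, and a real vision-language model would apply the same idea through hooks on its own attention modules.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_with_ablation(x, w_q, w_k, w_v, w_o, n_heads, ablate=()):
    """Toy self-attention over x (seq x d_model).

    Heads whose index is in `ablate` (a hypothetical parameter) are
    zero-ablated: their output is replaced by zeros before the output
    projection w_o, so they contribute nothing to the layer output.
    """
    d_model = len(x[0])
    d_head = d_model // n_heads
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    ablated = set(ablate)
    # Concatenated head outputs, initialised to zero.
    merged = [[0.0] * d_model for _ in x]
    for h in range(n_heads):
        if h in ablated:
            continue  # this head's slice of `merged` stays zero
        cols = slice(h * d_head, (h + 1) * d_head)
        q_h = [row[cols] for row in q]
        k_h = [row[cols] for row in k]
        v_h = [row[cols] for row in v]
        # Scaled dot-product scores: q_h @ k_h^T / sqrt(d_head).
        scale = math.sqrt(d_head)
        scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k_h]
                  for qi in q_h]
        weights = [softmax(r) for r in scores]
        out_h = matmul(weights, v_h)
        for i, row in enumerate(out_h):
            merged[i][cols] = row
    return matmul(merged, w_o)
```

Because the output projection is linear, each head's contribution to the layer output is additive, which is what makes per-head ablation a clean intervention: comparing the model's answer with and without a candidate set of heads isolates the effect of exactly those heads.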