You can make vision-language models robust to text-in-image attacks by identifying and surgically adjusting specific attention heads, with no retraining needed.
This paper investigates why CLIP vision models fail when images contain irrelevant text (typographic attacks), using mechanistic interpretability to pinpoint the attention heads that over-focus on that text. The authors then propose a training-free fix: selectively adjusting the identified heads, which improves robustness while leaving the model weights untouched.
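
To make "selectively adjusting" attention heads concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes that a layer's attention output can be written as a sum of per-head contributions, and that a set of text-focused heads has already been identified. The names `scale_heads`, `combine_heads`, and `suspect_heads`, the array shapes, and the choice of scaling factor are all hypothetical and chosen for illustration.

```python
import numpy as np


def scale_heads(head_outputs, head_ids, scale=0.0):
    """Return a copy of the per-head outputs with the selected heads rescaled.

    head_outputs: array of shape (num_heads, num_tokens, dim), where each
        slice is one head's contribution to the layer output (an assumption
        of this sketch).
    head_ids: indices of the heads to adjust (hypothetical; in practice they
        would come from an interpretability analysis).
    scale: 0.0 removes a head's contribution entirely; values between 0 and 1
        only dampen it.
    """
    adjusted = head_outputs.copy()
    adjusted[list(head_ids)] *= scale
    return adjusted


def combine_heads(head_outputs):
    """Sum the per-head contributions into a single layer output."""
    return head_outputs.sum(axis=0)


# Toy data: 4 heads, 3 tokens, 8-dimensional outputs.
rng = np.random.default_rng(0)
head_outputs = rng.normal(size=(4, 3, 8))

# Hypothetical indices of heads assumed to over-focus on text in the image.
suspect_heads = [1, 3]

adjusted = scale_heads(head_outputs, suspect_heads, scale=0.0)
combined = combine_heads(adjusted)
```

With `scale=0.0` the selected heads are removed from the sum, so `combined` equals the contribution of heads 0 and 2 alone. No weights are updated, which is what makes an intervention of this kind training-free. In a real model the same idea would be applied inside the forward pass, for example through a hook on the attention module.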