Towards Robustness against Typographic Attack with Training-free Concept Localization

Bohan Liu, Wenqian Ye, Guangzhi Xiong, Zhenghao He, Sanchit Sinha et al. | July 2, 2026 | arXiv

Key Takeaway

You can make vision-language models robust to text-in-image attacks by identifying and surgically adjusting specific attention heads, with no retraining needed.

Summary

This paper uses mechanistic interpretability to explain why CLIP vision models fail when an image contains irrelevant text (a typographic attack), pinpointing the attention heads that over-focus on that text. The authors then propose a training-free fix that selectively adjusts the identified components, improving robustness to such attacks.
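To make the idea of "find the heads that over-focus on text, then adjust them" concrete, here is a minimal NumPy sketch. It is not the authors' procedure: the summary does not say how heads are scored or adjusted, so the scoring rule (share of attention mass on text patches), the threshold, the scaling factor, and all function names below are assumptions chosen for illustration on toy arrays rather than a real CLIP model.

```python
import numpy as np

def text_attention_share(attn, text_mask):
    """Fraction of each head's attention that lands on text patches.

    attn: (heads, tokens) attention weights from the class token to each patch.
    text_mask: (tokens,) booleans, True where a patch contains rendered text.
    Both inputs are hypothetical; a real pipeline would extract them from a model.
    """
    attn = np.asarray(attn, dtype=float)
    mask = np.asarray(text_mask, dtype=bool)
    return attn[:, mask].sum(axis=1) / attn.sum(axis=1)

def select_heads(share, threshold=0.5):
    """Indices of heads whose text share exceeds an (assumed) threshold."""
    return np.flatnonzero(np.asarray(share) > threshold)

def downweight_heads(head_outputs, heads, scale=0.0):
    """Scale the selected heads' contributions, then sum over heads.

    head_outputs: (heads, dim) per-head contributions to the image embedding.
    scale=0.0 ablates the selected heads entirely; other values soften them.
    """
    out = np.array(head_outputs, dtype=float, copy=True)
    out[heads] *= scale
    return out.sum(axis=0)

# Toy example: 2 heads, 3 patches, the last patch contains text.
attn = [[0.1, 0.1, 0.8],   # head 0 looks mostly at the text patch
        [0.4, 0.4, 0.2]]   # head 1 looks mostly at the image content
text_mask = [False, False, True]
head_outputs = [[1.0, 0.0],
                [0.0, 1.0]]

share = text_attention_share(attn, text_mask)
flagged = select_heads(share)
embedding = downweight_heads(head_outputs, flagged)
```

In this toy setup only head 0 is flagged, so the adjusted embedding keeps just head 1's contribution. No weights are updated at any point, which is the sense in which an intervention of this kind is training-free.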

safety

Key Terms

typographic-attack mechanistic-interpretability circuit-mining attention-intervention