Multimodal AI judges can be fooled into trusting text over images. Training them on perceptually grounded examples significantly improves their ability to make consistent, verifiable evaluations.
This paper identifies and fixes a critical flaw in multimodal AI judges: they often trust plausible-sounding text over what they actually see in images. The authors build a dataset of carefully modified images and responses and use it to train judges to rely on visual evidence, which makes automated evaluation more reliable.