The tendency of multimodal judges to favor plausible text narratives over what they actually perceive in visual content.