Vision-language models struggle to genuinely reason about visual information—they primarily reason in text space, and adding images often degrades performance compared to text alone.
This paper shows that vision-language models often rely on text-based reasoning rather than genuinely understanding images. The authors construct CrossMath, a benchmark that poses identical problems in text-only, image-only, and image+text formats, and find that adding images hurts performance relative to the text-only version of the same problem. They also show that VLMs can be improved through targeted fine-tuning on multimodal reasoning tasks.
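The three-format setup can be read as a controlled evaluation protocol: hold the problem fixed and vary only the input modality, then compare per-condition accuracy on the same items. Below is a minimal sketch of such a harness, assuming a hypothetical item schema and a stubbed `model_answer` call; none of these names come from the paper's actual code or data release.

```python
# Illustrative three-condition evaluation harness in the spirit of CrossMath.
# The loader, item schema, and model_answer stub are assumptions for the sketch.
from collections import defaultdict
from typing import Optional

CONDITIONS = ["text_only", "image_only", "image_plus_text"]

def load_crossmath():
    # Placeholder loader: each item carries the same problem rendered three ways.
    return [
        {
            "id": "prob-001",
            "text_only": {"question": "If x + 3 = 7, what is x?", "image": None},
            "image_only": {"question": "", "image": "prob-001.png"},
            "image_plus_text": {"question": "If x + 3 = 7, what is x?", "image": "prob-001.png"},
            "answer": "4",
        },
    ]

def model_answer(question: str, image: Optional[str]) -> str:
    # Stand-in for a real VLM call; replace with your model or API of choice.
    return "4"

def evaluate(items):
    # Score each condition separately so modality effects can be compared
    # on identical underlying problems.
    correct = defaultdict(int)
    for item in items:
        for cond in CONDITIONS:
            variant = item[cond]
            pred = model_answer(variant["question"], variant["image"])
            correct[cond] += int(pred.strip() == item["answer"])
    n = len(items)
    return {cond: correct[cond] / n for cond in CONDITIONS}

if __name__ == "__main__":
    print(evaluate(load_crossmath()))
```

Because every condition uses the same problems and the same scoring, any accuracy gap between text-only and image+text can be attributed to how the model handles the added visual input rather than to differences in problem difficulty.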