Vision-language models struggle to genuinely reason about visual information—they primarily reason in text space, and adding images often degrades performance compared to text alone.
This paper shows that vision-language models often rely on text-based reasoning rather than genuinely understanding images. The authors construct CrossMath, a benchmark that poses identical problems in text-only, image-only, and image+text formats, and find that adding images hurts performance relative to the text-only version of the same problem. They also show that VLMs can be improved through targeted fine-tuning on multimodal reasoning tasks.
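The three-format setup can be read as a controlled evaluation protocol: hold the problem fixed and vary only the input modality, then compare per-condition accuracy on the same items. Below is a minimal sketch of such a harness, assuming a hypothetical item schema and a stubbed `model_answer` call; none of these names come from the paper's actual code or data release.

```python
# Illustrative three-condition evaluation harness in the spirit of CrossMath.
# The loader, item schema, and model_answer stub are assumptions for the sketch.
from collections import defaultdict
from typing import Optional

CONDITIONS = ["text_only", "image_only", "image_plus_text"]

def load_crossmath():
    # Placeholder loader: each item carries the same problem rendered three ways.
    return [
        {
            "id": "prob-001",
            "text_only": {"question": "If x + 3 = 7, what is x?", "image": None},
            "image_only": {"question": "", "image": "prob-001.png"},
            "image_plus_text": {"question": "If x + 3 = 7, what is x?", "image": "prob-001.png"},
            "answer": "4",
        },
    ]

def model_answer(question: str, image: Optional[str]) -> str:
    # Stand-in for a real VLM call; replace with your model or API of choice.
    return "4"

def evaluate(items):
    # Score each condition separately so modality effects can be compared
    # on identical underlying problems.
    correct = defaultdict(int)
    for item in items:
        for cond in CONDITIONS:
            variant = item[cond]
            pred = model_answer(variant["question"], variant["image"])
            correct[cond] += int(pred.strip() == item["answer"])
    n = len(items)
    return {cond: correct[cond] / n for cond in CONDITIONS}

if __name__ == "__main__":
    print(evaluate(load_crossmath()))
```

Because every condition uses the same problems and the same scoring, any accuracy gap between text-only and image+text can be attributed to how the model handles the added visual input rather than to differences in problem difficulty.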