This paper shows that how consistently different vision models represent an individual image (intra-modal agreement) strongly predicts whether vision and language models represent that same image similarly (cross-modal alignment). Images on which vision models agree most exhibit dramatically stronger alignment with language models, suggesting that representational convergence across modalities is driven by how unambiguously the environment constrains perception.
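To make the two quantities concrete, here is a minimal sketch of one plausible way to measure them; the mutual k-nearest-neighbour overlap metric, the synthetic embeddings, and all variable names are illustrative assumptions, not the paper's actual method. Per-image intra-modal agreement is computed between two vision embedding spaces, per-image cross-modal alignment between a vision space and a language space, and the two are correlated across images:

```python
import numpy as np

def knn_sets(emb, k=5):
    # k-nearest-neighbour indices per item under cosine similarity
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-matches
    return np.argsort(-sim, axis=1)[:, :k]

def per_item_overlap(emb_a, emb_b, k=5):
    # fraction of shared k-NN between two representation spaces, per item
    nn_a, nn_b = knn_sets(emb_a, k), knn_sets(emb_b, k)
    return np.array([len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)])

# Synthetic stand-ins for real model embeddings (assumption for illustration)
rng = np.random.default_rng(0)
n, d = 100, 32
vision_a = rng.normal(size=(n, d))
vision_b = vision_a + 0.1 * rng.normal(size=(n, d))  # second, similar vision model
language = vision_a + 0.5 * rng.normal(size=(n, d))  # noisier language embeddings

intra = per_item_overlap(vision_a, vision_b)  # intra-modal agreement per image
cross = per_item_overlap(vision_a, language)  # cross-modal alignment per image
r = np.corrcoef(intra, cross)[0, 1]           # does agreement predict alignment?
```

The paper's claim corresponds to `r` being strongly positive on real model embeddings: images where `intra` is high tend to have high `cross` as well.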