This paper shows that how consistently different vision models represent an individual image (intra-modal agreement) strongly predicts whether vision and language models represent that same image similarly (cross-modal alignment). Images on which vision models agree most exhibit dramatically stronger alignment with language models, suggesting that representational convergence across modalities is driven by how unambiguously the environment constrains perception.
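To make the two quantities concrete, here is a minimal sketch of one plausible way to measure them; the mutual k-nearest-neighbour overlap metric, the synthetic embeddings, and all variable names are illustrative assumptions, not the paper's actual method. Per-image intra-modal agreement is computed between two vision embedding spaces, per-image cross-modal alignment between a vision space and a language space, and the two are correlated across images:

```python
import numpy as np

def knn_sets(emb, k=5):
    # k-nearest-neighbour indices per item under cosine similarity
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-matches
    return np.argsort(-sim, axis=1)[:, :k]

def per_item_overlap(emb_a, emb_b, k=5):
    # fraction of shared k-NN between two representation spaces, per item
    nn_a, nn_b = knn_sets(emb_a, k), knn_sets(emb_b, k)
    return np.array([len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)])

# Synthetic stand-ins for real model embeddings (assumption for illustration)
rng = np.random.default_rng(0)
n, d = 100, 32
vision_a = rng.normal(size=(n, d))
vision_b = vision_a + 0.1 * rng.normal(size=(n, d))  # second, similar vision model
language = vision_a + 0.5 * rng.normal(size=(n, d))  # noisier language embeddings

intra = per_item_overlap(vision_a, vision_b)  # intra-modal agreement per image
cross = per_item_overlap(vision_a, language)  # cross-modal alignment per image
r = np.corrcoef(intra, cross)[0, 1]           # does agreement predict alignment?
```

The paper's claim corresponds to `r` being strongly positive on real model embeddings: images where `intra` is high tend to have high `cross` as well.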