This paper challenges the popular "Platonic Representation Hypothesis": the idea that AI models trained on different types of data (such as text and images) converge on the same underlying representation of reality. The authors argue that cross-modal alignment between vision and language models is much weaker than previously claimed, appearing only in small-scale experiments and reflecting broad semantic overlap rather than deep structural convergence.