This paper challenges the popular "Platonic Representation Hypothesis": the idea that AI models trained on different types of data (such as text and images) converge on the same underlying representation of reality. The authors argue that cross-modal alignment between vision and language models is much weaker than previously claimed, appearing only in small-scale experiments and reflecting broad semantic overlap rather than deep structural convergence.