Multimodal training doesn't automatically make language models more human-like; visual pretraining helps selectively for visually rich text, but language-internal representations remain the foundation for modeling human reading.
This paper compares language models trained only on text (LLMs) with models trained on both text and images (VLMs) to test whether visual training helps a model better match how humans read. Using brain scans and eye-tracking data from real readers, the researchers found that VLMs do not universally outperform LLMs; language-only training remains crucial.