Language models can encode spatial information in their hidden layers, but they fail to reliably bind viewpoint positions to the corresponding observations: a critical gap for spatial reasoning that fine-tuning specific attention heads can partially close.
This paper investigates how language models understand viewpoint rotation from text alone, without any visual input. The researchers designed a task in which a model must track how a viewpoint changes through a sequence of rotations and then predict what would be seen from the final orientation.
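To make the task concrete, here is a minimal sketch of what such a rotation-tracking problem looks like. This is a hypothetical illustration, not the paper's actual benchmark: the direction names, the `apply_turns` helper, and the landmark scene are all invented for the example. The model's job corresponds to composing the turns correctly and naming the landmark now in view.

```python
# Hypothetical illustration of a text-only rotation-tracking task
# (not the paper's code or data). An agent faces one of four compass
# directions; each direction has a landmark. Given a starting direction
# and a sequence of turns, predict what is now in view.

DIRECTIONS = ["north", "east", "south", "west"]

def apply_turns(start, turns):
    """Return the facing direction after applying each turn in order."""
    idx = DIRECTIONS.index(start)
    for turn in turns:
        if turn == "right":        # 90 degrees clockwise
            idx = (idx + 1) % 4
        elif turn == "left":       # 90 degrees counterclockwise
            idx = (idx - 1) % 4
        elif turn == "around":     # 180 degrees
            idx = (idx + 2) % 4
    return DIRECTIONS[idx]

# Landmarks visible in each direction (an arbitrary example scene).
scene = {"north": "mountain", "east": "river", "south": "forest", "west": "tower"}

facing = apply_turns("north", ["right", "right", "left"])
print(facing, "->", scene[facing])  # east -> river
```

A model that merely encodes the directions without binding them to the viewpoint can name plausible landmarks yet still fail to pick the one consistent with the composed rotation, which is the failure mode the paper probes.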