Language models can encode spatial information in their hidden layers, but they fail to reliably bind viewpoint positions to the corresponding observations: a critical gap for spatial reasoning that fine-tuning specific attention heads can partially close.
This paper investigates how language models understand viewpoint rotation from text alone, without any visual input. The researchers designed a task in which a model must track how a viewpoint changes through a sequence of rotations and then predict what would be seen from the final orientation.
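To make the task concrete, here is a minimal sketch of what such a rotation-tracking problem looks like. This is a hypothetical illustration, not the paper's actual benchmark: the direction names, the `apply_turns` helper, and the landmark scene are all invented for the example. The model's job corresponds to composing the turns correctly and naming the landmark now in view.

```python
# Hypothetical illustration of a text-only rotation-tracking task
# (not the paper's code or data). An agent faces one of four compass
# directions; each direction has a landmark. Given a starting direction
# and a sequence of turns, predict what is now in view.

DIRECTIONS = ["north", "east", "south", "west"]

def apply_turns(start, turns):
    """Return the facing direction after applying each turn in order."""
    idx = DIRECTIONS.index(start)
    for turn in turns:
        if turn == "right":        # 90 degrees clockwise
            idx = (idx + 1) % 4
        elif turn == "left":       # 90 degrees counterclockwise
            idx = (idx - 1) % 4
        elif turn == "around":     # 180 degrees
            idx = (idx + 2) % 4
    return DIRECTIONS[idx]

# Landmarks visible in each direction (an arbitrary example scene).
scene = {"north": "mountain", "east": "river", "south": "forest", "west": "tower"}

facing = apply_turns("north", ["right", "right", "left"])
print(facing, "->", scene[facing])  # east -> river
```

A model that merely encodes the directions without binding them to the viewpoint can name plausible landmarks yet still fail to pick the one consistent with the composed rotation, which is the failure mode the paper probes.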