Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

Mahtab Bigverdi, Lindsey Li, Weikai Huang, Yiming Liu, Jaemin Cho et al. | June 2, 2026 | arXiv

Key Takeaway

Training vision-language models to generate intermediate visual representations of unseen spatial configurations outperforms purely text-based reasoning on spatial tasks, and these representations remain interpretable without the model having to generate actual images at inference time.

Summary

This paper introduces Imaginative Perception Tokens (IPT), a training method that helps vision-language models reason about spaces they can't directly see. Instead of forcing spatial reasoning through text, IPT teaches models to generate intermediate visual representations of what they would perceive from a different viewpoint or in an occluded part of the scene.

multimodal reasoning training

Key Terms

vision-language-model, spatial-reasoning, chain-of-thought, intermediate-representations, supervision-signal