Training vision-language models to generate intermediate visual representations of unseen spatial configurations outperforms text-based reasoning on spatial tasks. These representations remain interpretable, and the model does not need to generate actual images at inference time.
This paper introduces Imaginative Perception Tokens (IPT), a training method that helps vision-language models reason about spaces they can't directly see. Instead of forcing spatial reasoning through text, IPT teaches models to generate intermediate visual representations of what they would perceive from a different viewpoint or in an occluded region.