ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

Yining Hong, Jiageng Liu, Han Yin, Manling Li, Leonidas Guibas et al. | May 18, 2026 | arXiv

Key Takeaway

AI agents fail at embodied spatial reasoning primarily because they choose poor actions, not because they cannot perceive the scene. They also hold on to wrong answers with high confidence even when the evidence contradicts them, whereas humans actively seek disconfirming evidence.

Summary

ESI-Bench is a benchmark for testing how well AI agents actively explore physical environments to understand spatial relationships. Rather than passively looking at images, agents must decide when to move, manipulate objects, and gather observations to solve tasks.
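
The perception-action loop described above can be illustrated with a toy sketch. This is not ESI-Bench code or its API: the `ToyScene` class, the `run_episode` function, and the action name `move_to_side` are all hypothetical, chosen only to show an agent that keeps acting until its observation is informative enough to answer.

```python
from dataclasses import dataclass


@dataclass
class ToyScene:
    """Hypothetical scene: a target sits behind an occluder and is
    only visible from a side viewpoint. Not part of ESI-Bench."""
    target_side: str = "left"   # ground truth position of the target
    viewpoint: str = "front"    # where the agent is currently looking from

    def observe(self) -> str:
        # From the front the occluder hides the target.
        return self.target_side if self.viewpoint == "side" else "occluded"

    def act(self, action: str) -> None:
        # The only action in this toy world changes the viewpoint.
        if action == "move_to_side":
            self.viewpoint = "side"


def run_episode(scene: ToyScene, max_steps: int = 5):
    """Alternate between observing and acting until the target is seen.

    Returns the answer (or None if the step budget runs out) and the
    list of observations gathered along the way.
    """
    history = []
    for _ in range(max_steps):
        obs = scene.observe()
        history.append(obs)
        if obs != "occluded":
            return obs, history       # evidence is sufficient: answer
        scene.act("move_to_side")     # evidence is missing: act to get it
    return None, history


answer, history = run_episode(ToyScene(target_side="right"))
print(answer, history)
```

A passive agent in this toy setting would have to answer from the first, occluded observation. The active agent instead spends one action to obtain the view that resolves the question, which is the kind of decision the benchmark is described as testing.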

multimodal reasoning

Key Terms

embodied-ai, perception-action-loop, spatial-intelligence, action-blindness, metacognitive-gap