AI agents fail at embodied spatial reasoning primarily because they make poor action choices, not because they can't see—and they confidently stick to wrong answers even when evidence contradicts them, unlike humans who actively seek disconfirming evidence.
ESI-Bench is a benchmark for testing how well AI agents actively explore physical environments to understand spatial relationships. Rather than passively looking at images, agents must decide when to move, manipulate objects, and gather observations to solve tasks.