Using executable code as an action interface lets vision-language models flexibly compose spatial reasoning operations and adapt to intermediate results, significantly improving performance on 3D/4D reasoning tasks without model-specific tuning.
SpatialClaw is a framework that helps AI agents reason about 3D space and object relationships by letting them write and execute Python code step by step. Instead of committing to a full analysis upfront or choosing from a rigid tool menu, the agent sees intermediate results and adapts its approach. It achieves 59.9% accuracy across spatial reasoning tasks, 11.2 points higher than prior methods.
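To make the step-by-step idea concrete, here is a minimal sketch of a code-as-action loop. It is not SpatialClaw's implementation: `run_step`, `run_agent`, `scripted_policy`, and the toy `objects` scene are all hypothetical names invented for illustration, and a hard-coded policy stands in for the vision-language model. The point is only that each snippet runs in a shared namespace and that its printed output is fed back before the next snippet is chosen.

```python
import contextlib
import io
import math
from typing import Callable, List, Optional


def run_step(code: str, namespace: dict) -> str:
    """Execute one snippet in the shared namespace and return what it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, namespace)  # state persists in `namespace` across steps
    return buffer.getvalue().strip()


def scripted_policy(observations: List[str]) -> Optional[str]:
    """Hypothetical stand-in for the model: chooses the next snippet
    based on the observations gathered so far, or None to stop."""
    if not observations:
        # Step 1: measure the distance between two objects.
        return "d = math.dist(objects['chair'], objects['table']); print(round(d, 2))"
    if len(observations) == 1:
        if float(observations[0]) > 2.0:
            # The objects are far apart, so adapt: compare heights instead.
            return "print('table' if objects['table'][2] > objects['chair'][2] else 'chair')"
        return "print('adjacent')"
    return None  # enough information gathered


def run_agent(policy: Callable[[List[str]], Optional[str]],
              namespace: dict, max_steps: int = 5) -> List[str]:
    """Alternate between asking the policy for code and executing it."""
    observations: List[str] = []
    for _ in range(max_steps):
        code = policy(observations)
        if code is None:
            break
        observations.append(run_step(code, namespace))
    return observations


# Toy scene (assumed data): object name -> (x, y, z) position in metres.
objects = {"chair": (0.0, 0.0, 0.0), "table": (1.0, 2.0, 2.0)}
observations = run_agent(scripted_policy, {"math": math, "objects": objects})
print(observations)
```

In this sketch the second snippet depends on the first snippet's result, which is the behavior a fixed upfront plan or a rigid tool menu cannot express as directly. A real system would also need sandboxing around `exec`, which is omitted here.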