SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

Seokju Cho, Ryo Hachiuma, Abhishek Badki, Hang Su, Byung-Kwan Lee et al. | June 11, 2026 | arXiv

Key Takeaway

Using executable code as an action interface lets vision-language models flexibly compose spatial reasoning operations and adapt to intermediate results, significantly improving performance on 3D/4D reasoning tasks without model-specific tuning.

Summary

SpatialClaw is a framework that helps AI agents reason about 3D space and object relationships by letting them write and execute Python code step-by-step. Instead of committing to a full analysis upfront or using rigid tool menus, the agent can see intermediate results and adapt its approach, achieving 59.9% accuracy across spatial reasoning tasks—11.2 points better than prior methods.
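
To make the step-by-step idea concrete, here is a minimal sketch of a stateful workspace in which variables persist between executed snippets and each snippet returns an observation. It is not SpatialClaw's implementation: the `Workspace` class, its `run` method, the object coordinates, and the camera convention (camera at the origin, depth along z) are all hypothetical and chosen only for illustration. The steps are scripted here, whereas an actual agent would read each observation before writing the next snippet.

```python
import contextlib
import io
import traceback


class Workspace:
    """Hypothetical stateful workspace: variables persist across steps."""

    def __init__(self):
        # Shared namespace reused by every executed snippet.
        self.namespace = {}

    def run(self, code: str) -> str:
        """Execute one snippet and return its printed output or error text."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)
        except Exception:
            # Return the error as an observation so the next step can adapt.
            buffer.write(traceback.format_exc(limit=1))
        return buffer.getvalue().strip()


# Scripted snippets standing in for code a model would write (assumed data).
steps = [
    # Step 1: store made-up 3D object positions (x, y, z) in the workspace.
    "import math\nchair = (1.0, 0.0, 2.0)\ntable = (4.0, 0.0, 6.0)",
    # Step 2: reuse the stored positions to compute a distance.
    "dist = math.dist(chair, table)\nprint(round(dist, 2))",
    # Step 3: compare depths, assuming the camera looks along +z from the origin.
    "print('chair is closer to camera' if chair[2] < table[2] "
    "else 'table is closer to camera')",
]

workspace = Workspace()
observations = [workspace.run(step) for step in steps]
```

Because the namespace persists, the second and third snippets can use `chair` and `table` without redefining them, and a failed snippet yields an error message as its observation instead of ending the session.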

reasoning agents multimodal

Key Terms

spatial-reasoning vision-language-model stateful-workspace tool-augmented-generation action-interface