GROW²: Grounding Which and Where for Robot Tool Use

Yuhong Deng, Yuyao Liu, David Hsu | June 29, 2026 | arXiv

Key Takeaway

By decomposing tool affordance grounding into semantic (which object/part) and geometric (where) levels, robots can generalize to novel objects and creative tool use without expensive end-to-end training.

Summary

GROW² enables robots to creatively use available objects as tools by splitting the problem into two steps: a vision-language model first identifies which object, and which part of it, to use; vision models then locate that part's 3D position. This lets robots solve tasks like cutting cake with a plate when no knife is available, without needing large labeled datasets.
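
To make the "where" step concrete, here is a minimal sketch of turning a selected image region into a 3D point using standard pinhole back-projection on an RGB-D frame. It is not the authors' implementation: the `Intrinsics` type, the `deproject` and `locate_part` helpers, and the assumption that the "which" step hands over a boolean pixel mask for the chosen part are all hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class Intrinsics:
    """Pinhole camera intrinsics in pixels (illustrative, not from the paper)."""
    fx: float
    fy: float
    cx: float
    cy: float


def deproject(u: int, v: int, depth_m: float, k: Intrinsics) -> Point3D:
    """Back-project pixel (u, v) with metric depth into the camera frame."""
    x = (u - k.cx) * depth_m / k.fx
    y = (v - k.cy) * depth_m / k.fy
    return (x, y, depth_m)


def locate_part(
    mask: List[List[bool]],
    depth: List[List[float]],
    k: Intrinsics,
) -> Optional[Point3D]:
    """Average the 3D points of masked pixels that have valid depth.

    `mask` is assumed to come from the semantic ("which") step; how that
    mask is produced is outside this sketch. Returns None when no masked
    pixel has a positive depth reading.
    """
    sx = sy = sz = 0.0
    n = 0
    for v, (mask_row, depth_row) in enumerate(zip(mask, depth)):
        for u, (selected, d) in enumerate(zip(mask_row, depth_row)):
            if selected and d > 0.0:  # depth of 0 marks a missing reading
                x, y, z = deproject(u, v, d, k)
                sx, sy, sz = sx + x, sy + y, sz + z
                n += 1
    if n == 0:
        return None
    return (sx / n, sy / n, sz / n)


if __name__ == "__main__":
    # Toy 2x2 frame: the right column is the selected part, 1 m away.
    k = Intrinsics(fx=2.0, fy=2.0, cx=1.0, cy=1.0)
    mask = [[False, True], [False, True]]
    depth = [[0.0, 1.0], [0.0, 1.0]]
    print(locate_part(mask, depth, k))
```

Averaging masked points is the simplest possible choice and is sensitive to depth noise and mask errors; a real system would likely filter outliers or fit geometry to the region instead.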

agents multimodal reasoning

Key Terms

affordance-prediction vision-language-model zero-shot-generalization rgb-d-image semantic-grounding