By decomposing tool affordance grounding into semantic (which object/part) and geometric (where) levels, robots can generalize to novel objects and creative tool use without expensive end-to-end training.
GROW² lets robots use available objects as improvised tools by splitting the problem into two steps: vision-language models first identify which object and which part of it to use (semantic grounding), and vision models then locate that part's 3D position (geometric grounding). This lets a robot solve tasks such as cutting a cake with a plate when no knife is available, without needing large labeled datasets.
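The two-step decomposition can be sketched as a small pipeline. This is only an illustration of the structure described above, not the GROW² implementation: every name here (`ground_tool`, `toy_semantic_grounder`, `make_toy_geometric_grounder`, the part label "edge") is hypothetical. A lookup table stands in for the vision-language model, and a point centroid stands in for the vision model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Point3D = Tuple[float, float, float]


@dataclass(frozen=True)
class SemanticGrounding:
    """Answer to 'which object and which part?' (semantic level)."""
    object_name: str
    part_name: str


@dataclass(frozen=True)
class ToolGrounding:
    """Semantic answer plus 'where?' (geometric level)."""
    semantic: SemanticGrounding
    position: Point3D


# Both stages are plain callables so real models could be swapped in.
SemanticGrounder = Callable[[str, List[str]], SemanticGrounding]
GeometricGrounder = Callable[[SemanticGrounding], Point3D]


def ground_tool(
    task: str,
    visible_objects: List[str],
    semantic_grounder: SemanticGrounder,
    geometric_grounder: GeometricGrounder,
) -> ToolGrounding:
    """Run the semantic step first, then the geometric step on its result."""
    semantic = semantic_grounder(task, visible_objects)
    position = geometric_grounder(semantic)
    return ToolGrounding(semantic, position)


def toy_semantic_grounder(task: str, visible_objects: List[str]) -> SemanticGrounding:
    """Assumption: a hard-coded preference table replaces a VLM query."""
    preferences = {"cut cake": [("knife", "blade"), ("plate", "edge")]}
    for object_name, part_name in preferences.get(task, []):
        if object_name in visible_objects:
            return SemanticGrounding(object_name, part_name)
    raise LookupError(f"no usable tool for task {task!r}")


def make_toy_geometric_grounder(
    part_points: Dict[Tuple[str, str], List[Point3D]],
) -> GeometricGrounder:
    """Assumption: the centroid of pre-segmented 3D points replaces a vision model."""

    def grounder(semantic: SemanticGrounding) -> Point3D:
        points = part_points[(semantic.object_name, semantic.part_name)]
        n = len(points)
        return (
            sum(p[0] for p in points) / n,
            sum(p[1] for p in points) / n,
            sum(p[2] for p in points) / n,
        )

    return grounder


if __name__ == "__main__":
    # No knife in the scene, so the toy grounder falls back to the plate.
    points = {
        ("plate", "edge"): [
            (0.0, 0.0, 0.5), (0.5, 0.0, 0.5), (0.0, 0.5, 0.5), (0.5, 0.5, 0.5),
        ]
    }
    result = ground_tool(
        "cut cake", ["plate", "cup"],
        toy_semantic_grounder, make_toy_geometric_grounder(points),
    )
    print(result)
```

Keeping the two stages behind separate callables mirrors the decomposition: the semantic stage never sees 3D data, and the geometric stage only needs the object and part names it is handed.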