Decoupling high-level reasoning from low-level control in robotic systems preserves the planning abilities of large vision-language models while improving execution accuracy on physical manipulation tasks.
HiVLA splits robot manipulation into two components: a vision-language model that plans tasks and identifies objects, and a specialized action model that executes precise movements. This separation lets the robot reason about complex, multi-step tasks while maintaining accuracy in fine-grained control, and it outperforms end-to-end approaches on real-robot tasks.
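The two-level decomposition can be sketched as a control loop: a planner turns an instruction into symbolic subgoals, and a low-level action model servoes toward each subgoal. This is a minimal illustrative sketch, not HiVLA's actual API; the class names, `Subgoal` fields, and the toy proportional controller are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Subgoal:
    """Hypothetical symbolic subgoal emitted by the planner."""
    object_name: str
    target_pose: tuple  # (x, y, z) goal position, assumed representation

class Planner:
    """Stands in for the vision-language model: decomposes a natural-language
    instruction into a sequence of subgoals (toy stub, not the real planner)."""
    def plan(self, instruction: str) -> list[Subgoal]:
        # Toy decomposition: a single fixed subgoal for demonstration.
        return [Subgoal("red_block", (0.4, 0.1, 0.02))]

class ActionModel:
    """Stands in for the specialized low-level controller: moves the
    end-effector a fraction of the way toward the subgoal each step."""
    def act(self, pose: tuple, subgoal: Subgoal, gain: float = 0.1) -> tuple:
        return tuple(p + gain * (g - p)
                     for p, g in zip(pose, subgoal.target_pose))

def run(instruction: str, start=(0.0, 0.0, 0.0), steps: int = 50) -> tuple:
    """High-level plan once, then iterate low-level control per subgoal."""
    planner, controller = Planner(), ActionModel()
    pose = start
    for sg in planner.plan(instruction):
        for _ in range(steps):
            pose = controller.act(pose, sg)
    return pose
```

The design point the sketch captures: the planner runs once per instruction at the symbolic level, while the action model runs at every control step, so planning latency never limits control frequency.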