Decoupling high-level reasoning from low-level control in robotic systems preserves the planning abilities of large vision-language models while improving execution accuracy on physical manipulation tasks.
HiVLA splits robot manipulation into two components: a vision-language model that plans tasks and identifies objects, and a specialized action model that executes precise movements. This separation lets the robot reason about complex, multi-step tasks while maintaining accuracy in fine-grained control, and it outperforms end-to-end approaches on real-robot tasks.
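The two-level decomposition can be sketched as a control loop: a planner turns an instruction into symbolic subgoals, and a low-level action model servoes toward each subgoal. This is a minimal illustrative sketch, not HiVLA's actual API; the class names, `Subgoal` fields, and the toy proportional controller are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Subgoal:
    """Hypothetical symbolic subgoal emitted by the planner."""
    object_name: str
    target_pose: tuple  # (x, y, z) goal position, assumed representation

class Planner:
    """Stands in for the vision-language model: decomposes a natural-language
    instruction into a sequence of subgoals (toy stub, not the real planner)."""
    def plan(self, instruction: str) -> list[Subgoal]:
        # Toy decomposition: a single fixed subgoal for demonstration.
        return [Subgoal("red_block", (0.4, 0.1, 0.02))]

class ActionModel:
    """Stands in for the specialized low-level controller: moves the
    end-effector a fraction of the way toward the subgoal each step."""
    def act(self, pose: tuple, subgoal: Subgoal, gain: float = 0.1) -> tuple:
        return tuple(p + gain * (g - p)
                     for p, g in zip(pose, subgoal.target_pose))

def run(instruction: str, start=(0.0, 0.0, 0.0), steps: int = 50) -> tuple:
    """High-level plan once, then iterate low-level control per subgoal."""
    planner, controller = Planner(), ActionModel()
    pose = start
    for sg in planner.plan(instruction):
        for _ in range(steps):
            pose = controller.act(pose, sg)
    return pose
```

The design point the sketch captures: the planner runs once per instruction at the symbolic level, while the action model runs at every control step, so planning latency never limits control frequency.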