Learning to Move Before Learning to Do: Task-Agnostic Pretraining for VLAs

Junhao Shi, Siyin Wang, Xiaopeng Yu, Li Ji, Jingjing Gong et al. | July 2, 2026 | arXiv

Key Takeaway

Separating motor skill learning from language grounding sharply reduces the labeled data needed for robot learning: TAP matches models trained on 1M+ expert trajectories while using far less labeled data, and it is more robust to real-world perturbations.

Summary

This paper proposes Task-Agnostic Pretraining (TAP), a two-stage approach for training Vision-Language-Action (VLA) models that separates learning how to move (from unlabeled robot interactions) from learning what to do (from minimal labeled data).
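
To make the two-stage split concrete, here is a toy sketch in Python with NumPy. It is not the paper's implementation: the summary gives no architectural details, so the sketch assumes stage one is an inverse-dynamics objective (suggested only by the key terms) and stage two is a small task-conditioned head fit on a handful of labeled demonstrations. The linear dynamics, the one-hot task ids standing in for language, and all names (`fit_inverse_dynamics`, `fit_task_head`, `act`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM, N_TASKS = 4, 4, 3

# Hypothetical toy dynamics (an assumption): next_state = state + action @ B.T
B = rng.normal(size=(STATE_DIM, ACTION_DIM)) + 5.0 * np.eye(STATE_DIM)

def step(state, action):
    return state + action @ B.T

def fit_inverse_dynamics(states, next_states, actions):
    """Stage 1 (assumed): regress the action from (state, next_state).
    Needs no task labels, only raw interaction data."""
    X = np.hstack([states, next_states])
    W, *_ = np.linalg.lstsq(X, actions, rcond=None)
    return W

def fit_task_head(states, task_ids, next_states):
    """Stage 2 (assumed): regress the desired next state from
    (state, one-hot task id) using a few labeled demonstrations."""
    X = np.hstack([states, np.eye(N_TASKS)[task_ids]])
    H, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    return H

def act(W, H, state, task_id):
    """The task head proposes a next state; the frozen inverse-dynamics
    model turns that proposal into an action."""
    desired_next = np.hstack([state, np.eye(N_TASKS)[task_id]]) @ H
    return np.hstack([state, desired_next]) @ W

# Stage 1: 2000 unlabeled transitions from random actions.
states = rng.normal(size=(2000, STATE_DIM))
actions = rng.normal(size=(2000, ACTION_DIM))
W = fit_inverse_dynamics(states, step(states, actions), actions)

# Stage 2: only 12 labeled demos; the toy expert moves halfway to a task goal.
goals = rng.normal(size=(N_TASKS, STATE_DIM))
demo_states = rng.normal(size=(12, STATE_DIM))
demo_tasks = np.arange(12) % N_TASKS
demo_next = 0.5 * demo_states + 0.5 * goals[demo_tasks]
H = fit_task_head(demo_states, demo_tasks, demo_next)

# Execute one step for an unseen state under task 1.
test_state = rng.normal(size=STATE_DIM)
action = act(W, H, test_state, task_id=1)
reached = step(test_state, action)
```

In this toy setting the motor model is learned entirely from unlabeled transitions, so the labeled stage only has to learn where to go for each task, which is why a dozen demonstrations suffice here. Real VLA training would replace both least-squares fits with learned networks over images and language.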

training efficiency multimodal

Key Terms

vision-language-action-model inverse-dynamics behavior-cloning embodied-ai self-supervised-pretraining