Separating motor skill learning from language grounding sharply reduces the labeled data needed for robot learning: TAP matches models trained on more than 1M expert trajectories while using far less labeled data, and it is more robust to real-world perturbations.
This paper proposes Task-Agnostic Pretraining (TAP), a two-stage approach for training Vision-Language-Action models for robots. It separates learning how to move (from unlabeled robot interactions) from learning what to do (from minimal labeled data).