Pretraining action modules on motion structure before vision-language alignment significantly improves robot learning efficiency and cross-embodiment generalization, particularly in data-scarce real-world settings.
This paper proposes a two-stage training approach for robot manipulation models: the first stage learns motion patterns from action trajectories alone, and the second transfers that knowledge to vision-language-action models.
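
To make the two-stage idea concrete, the following is a deliberately tiny sketch, not the paper's architecture or training procedure. It assumes scalar actions, a linear next-action predictor as the "action module", and a single scalar observation feature standing in for vision-language input. All names (`pretrain_action_module`, `align_with_observations`, `rollout`) and all numeric values are hypothetical and chosen only for illustration.

```python
from statistics import mean

def pretrain_action_module(trajectories):
    """Stage 1 (toy): fit a[t+1] ~ w * a[t] + b from action trajectories alone.

    Assumption: a least-squares linear predictor stands in for the action module.
    """
    pairs = [(traj[t], traj[t + 1])
             for traj in trajectories
             for t in range(len(traj) - 1)]
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    w = sum((x - mx) * (y - my) for x, y in pairs) / var
    b = my - w * mx
    return w, b

def align_with_observations(action_params, samples):
    """Stage 2 (toy): keep (w, b) frozen and fit only a gain k on an observation feature.

    Each sample is (obs, action, next_action); k is the least-squares fit of the
    residual left over after the frozen action module has made its prediction.
    """
    w, b = action_params
    num = sum((nxt - (w * a + b)) * obs for obs, a, nxt in samples)
    den = sum(obs ** 2 for obs, _, _ in samples)
    return num / den

def rollout(start, steps, w=0.9, b=0.1):
    """Generate a synthetic action trajectory (hypothetical dynamics, for the demo only)."""
    traj = [start]
    for _ in range(steps):
        traj.append(w * traj[-1] + b)
    return traj

# Stage 1: action-only data, no observations involved.
trajectories = [rollout(s, 5) for s in (2.0, -2.0, 0.5)]
w, b = pretrain_action_module(trajectories)

# Stage 2: a few observation-conditioned samples reuse the pretrained (w, b).
samples = [(obs, a, 0.9 * a + 0.1 + 0.5 * obs)
           for obs, a in ((1.0, 0.3), (-1.0, 0.7), (2.0, -0.2))]
k = align_with_observations((w, b), samples)
print(w, b, k)
```

In this toy, the second stage only has to estimate one parameter from three samples because the motion structure was already captured in the first stage. That is the intuition behind the data-efficiency claim, not evidence for it.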