Progress signals and semantic subtask grounding are critical for long-horizon bimanual manipulation: the model predicts both actions and a continuous progress value, which lets it transition automatically between assembly steps and reduces compounding errors.
FurnitureVLA tackles real-scale bimanual robot furniture assembly using vision-language-action models. The system combines a VR teleoperation interface for data collection, a simulation pipeline for training, and a progress-aware model that predicts both actions and assembly progress to handle long-horizon tasks (up to 1550 steps).
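As a rough illustration of how a predicted progress value could drive subtask transitions, the sketch below advances to the next subtask instruction once progress crosses a threshold. It is not the FurnitureVLA implementation: the `policy`, `get_observation`, and `apply_action` callables, the `run_assembly` name, and the 0.95 threshold are all hypothetical, and only the 1550-step budget comes from the text above.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical policy signature (an assumption, not the paper's API):
# (observation, subtask instruction) -> (action, progress in [0, 1]).
Policy = Callable[[object, str], Tuple[Sequence[float], float]]


def run_assembly(
    policy: Policy,
    get_observation: Callable[[], object],
    apply_action: Callable[[Sequence[float]], None],
    subtasks: List[str],
    done_threshold: float = 0.95,  # assumed value, not from the source
    max_steps: int = 1550,         # horizon length quoted in the text
) -> int:
    """Run subtasks in order and return how many finished within the step budget."""
    idx = 0
    for _ in range(max_steps):
        if idx >= len(subtasks):
            break  # every subtask is complete
        # The policy is conditioned on the current subtask instruction.
        action, progress = policy(get_observation(), subtasks[idx])
        apply_action(action)
        # Move to the next subtask once predicted progress is high enough.
        if progress >= done_threshold:
            idx += 1
    return idx
```

In this sketch the progress value is the only transition signal, so no external step detector or fixed per-step timer is needed. A real system would likely smooth the prediction over several frames before switching, which is omitted here.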