FurnitureVLA: Learning Long-Horizon Bimanual Furniture Assembly with Vision-Language-Action Model

Chenyang Ma, Yue Yang, Radu Corcodel, Siddarth Jain, Andrew Wu et al. | July 1, 2026 | arXiv

Key Takeaway

Progress signals and semantic subtask grounding are critical for long-horizon bimanual manipulation: the model predicts both actions and a continuous progress value, which lets it transition between assembly steps automatically and reduces compounding errors.

Summary

FurnitureVLA tackles real-scale bimanual robot furniture assembly using vision-language-action models. The system combines a VR teleoperation interface for data collection, a simulation pipeline for training, and a progress-aware model that predicts both actions and assembly progress to handle long-horizon tasks (up to 1550 steps).
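
To make the progress-driven transition concrete, here is a minimal sketch of a rollout loop that moves to the next subtask once the predicted progress is high enough. It is an illustration under stated assumptions, not the paper's implementation: the policy signature, the `done_threshold` value, the threshold rule itself, and all function names are hypothetical. The default step budget simply reuses the 1550-step figure quoted above.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical policy signature (an assumption, not the paper's API):
# (observation, subtask instruction) -> (action, progress in [0, 1]).
Policy = Callable[[object, str], Tuple[Sequence[float], float]]


def run_assembly(
    policy: Policy,
    subtasks: List[str],
    get_observation: Callable[[], object],
    apply_action: Callable[[Sequence[float]], None],
    done_threshold: float = 0.95,  # assumed cutoff for "subtask finished"
    max_steps: int = 1550,         # step budget, borrowed from the summary
) -> List[str]:
    """Run subtasks in order; return the list of subtasks judged complete."""
    index = 0
    completed: List[str] = []
    for _ in range(max_steps):
        if index >= len(subtasks):
            break  # every subtask has been completed
        observation = get_observation()
        # The policy is conditioned on the current subtask's instruction and
        # returns both an action and its own estimate of subtask progress.
        action, progress = policy(observation, subtasks[index])
        apply_action(action)
        # Advance to the next subtask once predicted progress crosses the cutoff.
        if progress >= done_threshold:
            completed.append(subtasks[index])
            index += 1
    return completed
```

In this sketch the progress estimate doubles as the transition signal, so no external step counter or hand-written completion check is needed. A real system would likely smooth the progress value over several steps before switching, which this example omits.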

agents reasoning multimodal

Key Terms

vision-language-action-model bimanual-manipulation long-horizon-tasks progress-signal sim-to-real-transfer