Synthetic data from reconstructed 3D scenes can effectively train perception-based humanoid robots for real-world loco-manipulation, eliminating the need for expensive human-annotated robot trajectories.
This paper addresses a key bottleneck in training humanoid robots: the lack of paired data combining egocentric camera views, language instructions, and robot motion. The authors generate 48,000 synthetic training examples by reconstructing real indoor scenes with 3D Gaussian Splatting, simulating robot trajectories, and rendering first-person views.
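
To make the shape of one paired training example concrete, here is a minimal Python sketch of how such a record might be assembled. It is not the authors' pipeline: every name in it (`PairedSample`, `simulate_trajectory`, `render_egocentric`, `make_sample`) is hypothetical, the simulator is replaced by straight-line interpolation between two poses, and the renderer is replaced by a string placeholder standing in for an image rendered from the reconstructed scene.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Simplified stand-in for robot state: (x, y, yaw). A real humanoid
# trajectory would carry full-body joint states; this is an assumption.
Pose = Tuple[float, float, float]


@dataclass
class PairedSample:
    """One training example: instruction, egocentric frames, robot motion."""
    instruction: str
    frames: List[str]   # placeholders for rendered first-person images
    motion: List[Pose]  # one pose per frame


def simulate_trajectory(start: Pose, goal: Pose, steps: int) -> List[Pose]:
    """Hypothetical simulator stub: linear interpolation from start to goal."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [
        tuple(s + (g - s) * t / (steps - 1) for s, g in zip(start, goal))
        for t in range(steps)
    ]


def render_egocentric(scene_id: str, pose: Pose) -> str:
    """Hypothetical renderer stub: returns a label instead of an image."""
    return f"{scene_id}@{pose[0]:.2f},{pose[1]:.2f},{pose[2]:.2f}"


def make_sample(scene_id: str, instruction: str,
                start: Pose, goal: Pose, steps: int = 5) -> PairedSample:
    """Simulate a trajectory, then render one view per pose along it."""
    motion = simulate_trajectory(start, goal, steps)
    frames = [render_egocentric(scene_id, pose) for pose in motion]
    return PairedSample(instruction=instruction, frames=frames, motion=motion)


sample = make_sample("scene_0", "walk to the table",
                     (0.0, 0.0, 0.0), (1.0, 2.0, 0.0), steps=5)
```

The point of the sketch is only the pairing: each rendered frame is indexed to the pose that produced it, and the whole sequence shares a single language instruction.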