By representing actions as embodiment-agnostic physical intents grounded in visual outcomes, UniT enables humanoid robots to learn directly from human video data, dramatically improving data efficiency and enabling zero-shot task transfer without robot-specific training.
UniT addresses a major bottleneck in training humanoid robots: the scarcity of robot data. Instead of collecting expensive robot demonstrations, it learns from abundant human videos by finding a shared "physical language": a unified way to represent actions that works across different body types.
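To make the "shared physical language" idea concrete, here is a toy sketch of cross-embodiment action transfer. This is not UniT's actual architecture: the linear maps, dimensions, and function names below are all invented for illustration. The point is only the structure: a visual outcome is encoded into an embodiment-agnostic intent, and per-embodiment decoders translate that same intent into either human or robot actions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- chosen for illustration, not from the paper.
FRAME_DIM = 64    # flattened visual features of one video frame
INTENT_DIM = 8    # shared, embodiment-agnostic "physical intent" code
HUMAN_ACT = 10    # e.g. hand keypoint deltas
ROBOT_ACT = 7     # e.g. 7-DoF arm joint targets

# One encoder maps visual outcomes (frame differences) to intents;
# separate per-embodiment decoders map intents to executable actions.
W_enc = rng.standard_normal((INTENT_DIM, FRAME_DIM)) / np.sqrt(FRAME_DIM)
W_human = rng.standard_normal((HUMAN_ACT, INTENT_DIM))
W_robot = rng.standard_normal((ROBOT_ACT, INTENT_DIM))

def encode_intent(frame_t, frame_t1):
    """Ground the intent in the visual outcome: what changed in the scene."""
    return W_enc @ (frame_t1 - frame_t)

def decode(intent, W_embodiment):
    """Translate the shared intent into one embodiment's action space."""
    return W_embodiment @ intent

# A human video clip supplies the intent ...
f0 = rng.standard_normal(FRAME_DIM)
f1 = rng.standard_normal(FRAME_DIM)
z = encode_intent(f0, f1)

# ... which either body can execute, with no robot-specific retraining.
human_action = decode(z, W_human)  # shape (10,)
robot_action = decode(z, W_robot)  # shape (7,)
```

Because the intent space is shared, adding a new embodiment only requires a new decoder, while the encoder keeps learning from abundant human video.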