By representing actions as embodiment-agnostic physical intents grounded in visual outcomes, UniT enables humanoid robots to learn directly from human video data, dramatically improving data efficiency and enabling zero-shot task transfer without robot-specific training.
UniT addresses a major bottleneck in training humanoid robots: the scarcity of robot data. Instead of collecting expensive robot demonstrations, it learns from abundant human videos by finding a shared "physical language": a unified way to represent actions that works across different body types.
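To make the "shared physical language" idea concrete, here is a toy sketch of cross-embodiment action transfer. This is not UniT's actual architecture: the linear maps, dimensions, and function names below are all invented for illustration. The point is only the structure: a visual outcome is encoded into an embodiment-agnostic intent, and per-embodiment decoders translate that same intent into either human or robot actions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- chosen for illustration, not from the paper.
FRAME_DIM = 64    # flattened visual features of one video frame
INTENT_DIM = 8    # shared, embodiment-agnostic "physical intent" code
HUMAN_ACT = 10    # e.g. hand keypoint deltas
ROBOT_ACT = 7     # e.g. 7-DoF arm joint targets

# One encoder maps visual outcomes (frame differences) to intents;
# separate per-embodiment decoders map intents to executable actions.
W_enc = rng.standard_normal((INTENT_DIM, FRAME_DIM)) / np.sqrt(FRAME_DIM)
W_human = rng.standard_normal((HUMAN_ACT, INTENT_DIM))
W_robot = rng.standard_normal((ROBOT_ACT, INTENT_DIM))

def encode_intent(frame_t, frame_t1):
    """Ground the intent in the visual outcome: what changed in the scene."""
    return W_enc @ (frame_t1 - frame_t)

def decode(intent, W_embodiment):
    """Translate the shared intent into one embodiment's action space."""
    return W_embodiment @ intent

# A human video clip supplies the intent ...
f0 = rng.standard_normal(FRAME_DIM)
f1 = rng.standard_normal(FRAME_DIM)
z = encode_intent(f0, f1)

# ... which either body can execute, with no robot-specific retraining.
human_action = decode(z, W_human)  # shape (10,)
robot_action = decode(z, W_robot)  # shape (7,)
```

Because the intent space is shared, adding a new embodiment only requires a new decoder, while the encoder keeps learning from abundant human video.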