Vega: Learning to Drive with Natural Language Instructions

Sicheng Zuo, Yuxuan Li, Wenzhao Zheng, Zheng Zhu, Jie Zhou et al. | March 26, 2026 | arXiv

Key Takeaway

Natural language instructions can guide autonomous driving decisions in real time, enabling personalized driving behavior beyond fixed rules. This opens the door to more flexible, user-responsive autonomous systems.

Summary

Vega is a vision-language-action model that learns to drive by following natural language instructions. The system combines visual perception, language understanding, and world modeling to generate safe driving trajectories. Researchers created a 100,000-scene dataset with diverse driving instructions and trajectories to train the model.

multimodal, agents, reasoning

Key Terms

vision-language-action model, instruction following, world modeling, trajectory generation, diffusion paradigm