InSight: Self-Guided Skill Acquisition via Steerable VLAs

Maggie Wang, Lars Osterberg, Stephen Tian, Ola Shorinwa, Jiajun Wu et al. | June 23, 2026 | arXiv

Key Takeaway

By making VLA models steerable at the primitive-action level, you can create a self-improving loop where robots identify skill gaps, practice autonomously, and expand their capabilities continuously.
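The loop described above (find skill gaps, practice, expand the skill set) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: the function name `skill_flywheel`, the `practice` callback, and the representation of tasks as lists of primitive labels are all hypothetical.

```python
from typing import Callable, Dict, List, Set


def skill_flywheel(
    tasks: Dict[str, List[str]],
    known: Set[str],
    practice: Callable[[str], bool],
    max_rounds: int = 10,
) -> Set[str]:
    """Toy self-improvement loop (hypothetical, not the paper's algorithm).

    tasks:    maps a task name to the primitive labels it requires.
    known:    primitives the robot can already execute.
    practice: assumed callback that attempts a primitive autonomously and
              returns True if the attempt succeeded well enough to keep.
    """
    known = set(known)  # copy so the caller's set is not mutated
    for _ in range(max_rounds):
        # 1. Identify skill gaps: required primitives that are not yet known.
        required = {p for prims in tasks.values() for p in prims}
        gaps = sorted(required - known)
        if not gaps:
            break
        # 2. Practice each missing primitive; 3. keep the ones that succeed.
        for primitive in gaps:
            if practice(primitive):
                known.add(primitive)
    return known


# Usage: one task, one primitive already known, practice always succeeds.
tasks = {"put cup in bowl": ["move gripper to cup", "close gripper", "move gripper to bowl"]}
learned = skill_flywheel(tasks, {"move gripper to bowl"}, practice=lambda p: True)
stuck = skill_flywheel(tasks, {"move gripper to bowl"}, practice=lambda p: False)
```

In a real system the `practice` callback would stand in for rollout, success detection, and fine-tuning; here it is reduced to a boolean so the control flow stays visible.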

Summary

InSight enables vision-language-action models to autonomously learn new manipulation skills by breaking down demonstrations into reusable primitive actions (like "move gripper to bowl").
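As a rough illustration of what breaking a demonstration into primitives could look like, the sketch below splits a recorded trajectory wherever the gripper opens or closes and attaches a language label to each segment. The segmentation rule, the `Step` and `Primitive` types, and the label for the closed-gripper case are assumptions made for this example; the summary does not say how InSight actually segments or labels demonstrations.

```python
from dataclasses import dataclass
from typing import List, Tuple

Pose = Tuple[float, float, float]  # simplified end-effector position (x, y, z)


@dataclass
class Step:
    pose: Pose
    gripper_closed: bool


@dataclass
class Primitive:
    label: str
    start: int  # index of the first step in the segment
    end: int    # index of the last step in the segment (inclusive)


def segment_demo(steps: List[Step], target: str) -> List[Primitive]:
    """Split a demo at gripper state changes (hypothetical heuristic)."""
    primitives: List[Primitive] = []
    seg_start = 0
    for i in range(1, len(steps) + 1):
        # A segment ends at the end of the demo or when the gripper state flips.
        at_end = i == len(steps)
        if at_end or steps[i].gripper_closed != steps[seg_start].gripper_closed:
            if steps[seg_start].gripper_closed:
                label = f"move {target} with gripper"  # assumed label
            else:
                label = f"move gripper to {target}"
            primitives.append(Primitive(label, seg_start, i - 1))
            seg_start = i
    return primitives


# Usage: approach with open gripper, carry with closed gripper, then release.
demo = [
    Step((0.0, 0.0, 0.3), False),
    Step((0.1, 0.0, 0.1), False),
    Step((0.1, 0.0, 0.1), True),
    Step((0.2, 0.1, 0.2), True),
    Step((0.3, 0.2, 0.1), True),
    Step((0.3, 0.2, 0.3), False),
]
segments = segment_demo(demo, "bowl")
```

Each resulting segment pairs a short language label with a span of end-effector poses, which is the kind of reusable unit the summary refers to as a primitive action.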

agents training multimodal

Key Terms

vision-language-action model, skill primitive, end-effector pose, autonomous skill acquisition, data flywheel