LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

Baochang Ren, Xinjie Liu, Xi Chen, Yanshuo Liu, Chenxi Li et al. | June 11, 2026 | arXiv

Key Takeaway

Vision-language-action models can now control lab robots by combining action token pretraining with flow matching, but success requires both lab-specific training data and support for multiple robot embodiments.

Summary

This paper introduces LabVLA, a vision-language-action model designed to control robots in scientific laboratories. The key innovation is a two-stage training approach: first pretraining the model to understand action tokens, then fine-tuning it with flow matching.
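To make the two stages concrete, here is a minimal NumPy sketch of the training signal each one could use. It is an illustration under stated assumptions, not LabVLA's implementation: the uniform binning tokenizer, the linear interpolation path, and the function names (tokenize_actions, flow_matching_pair, flow_matching_loss) are all hypothetical choices made for this example, and the summary does not specify how the paper does either step.

```python
import numpy as np


def tokenize_actions(actions, n_bins=256, low=-1.0, high=1.0):
    """Stage 1 (assumed form): map continuous actions to discrete token ids.

    Uniform binning over [low, high] is an assumption for illustration only.
    """
    clipped = np.clip(actions, low, high)
    scaled = (clipped - low) / (high - low)  # rescale to [0, 1]
    # The upper edge (scaled == 1.0) would land in bin n_bins, so cap it.
    return np.minimum((scaled * n_bins).astype(np.int64), n_bins - 1)


def flow_matching_pair(actions, rng):
    """Stage 2 (assumed form): build inputs and targets for flow matching.

    Uses a straight-line path from Gaussian noise to the expert action:
    x_t = (1 - t) * noise + t * action, whose time derivative is action - noise.
    """
    noise = rng.standard_normal(actions.shape)   # x_0 ~ N(0, I)
    t = rng.uniform(size=(actions.shape[0], 1))  # one time value per sample
    x_t = (1.0 - t) * noise + t * actions        # point on the path at time t
    target_velocity = actions - noise            # regression target
    return x_t, t, target_velocity


def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))
```

In a real model, a network conditioned on images and the language instruction would predict the velocity from x_t and t. At inference time it would integrate that velocity from noise to produce a continuous action.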

multimodal agents training

Key Terms

vision-language-action model, flow matching, action token pretraining, embodiment, knowledge insulation