Vision-language-action models can now control lab robots by combining action-token pretraining with flow matching, but success depends on both lab-specific training data and support for multiple robot embodiments.
This paper introduces LabVLA, a vision-language-action model designed to control robots in scientific laboratories. Its key innovation is a two-stage training approach: the model is first pretrained on action tokens, then fine-tuned with flow matching.
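To make the two stages concrete, here is a minimal NumPy sketch of the kind of objects each stage works with. It is not LabVLA's implementation: the summary does not say how actions are tokenized or which flow-matching path is used, so uniform binning and a linear noise-to-action path are assumptions, and every function name and parameter below is hypothetical.

```python
import numpy as np


def tokenize_actions(actions, num_bins=256, low=-1.0, high=1.0):
    """Stage 1 (assumed scheme): map continuous actions to integer token ids.

    Uses uniform binning over [low, high]; ids fall in [0, num_bins - 1].
    """
    clipped = np.clip(actions, low, high)
    scaled = (clipped - low) / (high - low)  # rescale to [0, 1]
    return np.minimum((scaled * num_bins).astype(np.int64), num_bins - 1)


def flow_matching_pair(actions, rng):
    """Stage 2 (assumed linear path): build one flow-matching training example.

    `actions` has shape (batch, action_dim). Returns the interpolated point
    x_t, the sampled time t, and the target velocity the model should predict.
    """
    noise = rng.standard_normal(actions.shape)
    t = rng.uniform(0.0, 1.0, size=(actions.shape[0], 1))
    x_t = (1.0 - t) * noise + t * actions  # straight line from noise to action
    target_velocity = actions - noise      # constant velocity along that line
    return x_t, t, target_velocity


def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))


def euler_sample(velocity_fn, noise, num_steps=10):
    """Generate actions by integrating the velocity field from t=0 to t=1."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x
```

In a real system `velocity_fn` would be the trained network, conditioned on images and the language instruction; here it is left as a plain callable so the sketch stays self-contained.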