VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

Jean Mercat, Sedrick Keh, Kushal Arora, Isabella Huang, Paarth Shah et al. | April 21, 2026 | arXiv

Key Takeaway

For roboticists and ML engineers: VLA Foundry replaces separate, often incompatible pipelines with a single training stack for embodied AI models. Released weights and open-source code make it practical to train and deploy robotic policies.

Summary

VLA Foundry is an open-source framework that unifies training of language models, vision-language models, and vision-language-action models in one codebase. Instead of stitching together separate pipelines, it provides end-to-end control from language pretraining through action fine-tuning, enabling researchers to train robotic manipulation policies from scratch or using pretrained backbones.
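The staged flow described above can be pictured with a small sketch. This is not VLA Foundry's API: the names `Stage`, `PIPELINE`, and `plan_stages`, the stage labels, and the `"lm"`/`"vlm"` backbone keys are all hypothetical, chosen only to illustrate how a single codebase might choose where to start training depending on whether a pretrained backbone is available.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Stage:
    """One training stage (hypothetical; not taken from the VLA Foundry code)."""
    name: str
    modalities: Tuple[str, ...]


# Assumed stage order, following the summary: language pretraining first,
# then vision-language training, then action fine-tuning.
PIPELINE: List[Stage] = [
    Stage("language_pretraining", ("text",)),
    Stage("vision_language_training", ("text", "image")),
    Stage("action_finetuning", ("text", "image", "action")),
]

# How many leading stages a pretrained backbone lets us skip (assumed mapping).
_STAGES_COVERED = {None: 0, "lm": 1, "vlm": 2}


def plan_stages(pretrained_backbone: Optional[str] = None) -> List[Stage]:
    """Return the stages still to run.

    pretrained_backbone=None means training from scratch; "lm" or "vlm"
    means a backbone already covers the earlier stage(s).
    """
    if pretrained_backbone not in _STAGES_COVERED:
        raise ValueError(f"unknown backbone kind: {pretrained_backbone!r}")
    return PIPELINE[_STAGES_COVERED[pretrained_backbone]:]


if __name__ == "__main__":
    for stage in plan_stages("vlm"):
        print(stage.name, stage.modalities)
```

In this sketch, training from scratch runs all three stages, while starting from a pretrained vision-language backbone leaves only action fine-tuning. Keeping every stage in one list is the point of the illustration: the same code path handles both cases.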

architecture training applications

Key Terms

vision-language-action model, vision-language model, end-to-end learning, backbone architecture, closed-loop policy