VLA models trained for robotics lose a significant amount of commonsense and world knowledge relative to their base vision-language models, particularly on complex semantic tasks. This finding matters for building reliable embodied AI systems.
This paper introduces Act2Answer, a benchmark that measures whether vision-language-action (VLA) models—AI systems trained to understand images and perform robot actions—retain commonsense and factual knowledge after fine-tuning on robotics data.