Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

Nikita Kachaev, Andrey Moskalenko, Matvey Skripkin, Nikita Kurlaev, Daria Pugacheva et al. | June 17, 2026 | arXiv

Key Takeaway

VLA models trained for robotics lose significant commonsense and world knowledge compared to their base vision-language models, particularly on complex semantic tasks. This loss matters for building reliable embodied AI systems.

Summary

This paper introduces Act2Answer, a benchmark that measures whether vision-language-action (VLA) models retain commonsense and factual knowledge after fine-tuning on robotics data. VLA models are AI systems trained to understand images and perform robot actions.
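
To make the idea of "knowledge retention" concrete, here is a minimal sketch of one way such a comparison could be scored: per-category accuracy of the fine-tuned VLA divided by the accuracy of its base vision-language model on the same questions. This is an illustrative assumption, not the metric defined by Act2Answer; the function names, category labels, and toy records below are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return {category: accuracy} from (category, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        correct[category] += int(is_correct)
    return {c: correct[c] / totals[c] for c in totals}


def retention(base_records, vla_records):
    """Ratio of VLA accuracy to base-model accuracy, per category.

    Hypothetical scoring rule: 1.0 means nothing was lost, 0.0 means
    everything was lost. Categories where the base model scores 0 are
    skipped because the ratio is undefined there.
    """
    base = accuracy_by_category(base_records)
    vla = accuracy_by_category(vla_records)
    return {c: vla[c] / base[c] for c in base if c in vla and base[c] > 0}


# Toy, made-up answers (not results from the paper).
base_answers = [("commonsense", True), ("commonsense", True),
                ("world", True), ("world", False)]
vla_answers = [("commonsense", True), ("commonsense", False),
               ("world", False), ("world", False)]

scores = retention(base_answers, vla_answers)
```

Reporting a ratio per category, rather than a single overall number, is one way to surface the pattern the takeaway describes, where some kinds of questions degrade more than others.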

evaluation, multimodal, agents

Key Terms

vision-language-action-model, embodied-agent, layerwise-probing, knowledge-retention