For robot learning systems, discrete action tokenization creates a hard ceiling on the performance gains available from better vision models: to see improvements, you must increase action representation capacity, not just encoder quality.
This paper explains why upgrading the vision encoder in a robot learning model doesn't always improve performance. The key issue is the 'Compression Gap': when robot actions are represented as discrete tokens drawn from a limited vocabulary, the token codebook becomes an information bottleneck, so gains from a better vision encoder are discarded before they reach the action output.
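The bottleneck can be made concrete with a minimal sketch of discrete action tokenization. This is a hypothetical nearest-neighbor codebook, not the paper's actual tokenizer: a continuous action is snapped to the closest of a fixed number of codebook entries, so any precision finer than the codebook's resolution is lost, regardless of how good the vision features are.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 256 discrete action tokens for a 7-DoF action space.
codebook = rng.uniform(-1.0, 1.0, size=(256, 7))

def tokenize(action: np.ndarray) -> int:
    """Map a continuous action to the index of its nearest codebook entry."""
    dists = np.linalg.norm(codebook - action, axis=1)
    return int(np.argmin(dists))

def detokenize(token: int) -> np.ndarray:
    """Reconstruct an action from its token: just the codebook entry."""
    return codebook[token]

action = rng.uniform(-1.0, 1.0, size=7)
recovered = detokenize(tokenize(action))

# The reconstruction error is floored by codebook resolution; improving the
# upstream encoder cannot reduce it below this floor.
quantization_error = float(np.linalg.norm(action - recovered))
```

With only 256 tokens covering a 7-dimensional continuous space, the quantization error stays bounded away from zero, which is the sense in which the codebook caps how much a better encoder can help.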