For robot learning systems, discrete action tokenization creates a hard ceiling on the performance gains available from better vision models: to see improvements, you must increase action representation capacity, not just encoder quality.
This paper explains why upgrading the vision encoder in a robot learning model doesn't always improve performance. The key issue is the 'Compression Gap': when robot actions are represented as discrete tokens drawn from a limited vocabulary, the token codebook becomes an information bottleneck, so gains from a better vision encoder are discarded before they reach the action output.
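The bottleneck can be made concrete with a minimal sketch of discrete action tokenization. This is a hypothetical nearest-neighbor codebook, not the paper's actual tokenizer: a continuous action is snapped to the closest of a fixed number of codebook entries, so any precision finer than the codebook's resolution is lost, regardless of how good the vision features are.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 256 discrete action tokens for a 7-DoF action space.
codebook = rng.uniform(-1.0, 1.0, size=(256, 7))

def tokenize(action: np.ndarray) -> int:
    """Map a continuous action to the index of its nearest codebook entry."""
    dists = np.linalg.norm(codebook - action, axis=1)
    return int(np.argmin(dists))

def detokenize(token: int) -> np.ndarray:
    """Reconstruct an action from its token: just the codebook entry."""
    return codebook[token]

action = rng.uniform(-1.0, 1.0, size=7)
recovered = detokenize(tokenize(action))

# The reconstruction error is floored by codebook resolution; improving the
# upstream encoder cannot reduce it below this floor.
quantization_error = float(np.linalg.norm(action - recovered))
```

With only 256 tokens covering a 7-dimensional continuous space, the quantization error stays bounded away from zero, which is the sense in which the codebook caps how much a better encoder can help.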