Vision Transformers do not learn by making their tokens independent. Instead, their representations grow more complex through richer transformations while strong token interactions are preserved, which challenges a common assumption about how these models develop.
This paper analyzes how Vision Transformers' internal representations change during training using geometric analysis tools.
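The summary does not name the specific geometric tools, so the sketch below is illustrative only and is not the paper's method. It assumes two stand-in quantities that could be tracked per layer across training checkpoints: the participation ratio of the token covariance, as a rough measure of representational complexity, and the mean pairwise cosine similarity between tokens, as a crude proxy for how coupled the tokens remain. The function names `participation_ratio` and `mean_pairwise_cosine` are hypothetical, and the inputs are assumed to be plain lists of nonzero token vectors.

```python
import math
from itertools import combinations


def participation_ratio(tokens):
    """Effective dimensionality of a set of token vectors.

    Computes (tr C)^2 / tr(C^2), where C is the covariance of the tokens.
    The value ranges from 1 (all variance on one axis) up to the feature
    dimension (variance spread evenly). Assumed stand-in metric, not the
    paper's own measure.
    """
    n, d = len(tokens), len(tokens[0])
    mean = [sum(t[j] for t in tokens) / n for j in range(d)]
    centered = [[t[j] - mean[j] for j in range(d)] for t in tokens]
    # Covariance matrix C (d x d), normalized by the number of tokens.
    cov = [
        [sum(row[a] * row[b] for row in centered) / n for b in range(d)]
        for a in range(d)
    ]
    trace = sum(cov[a][a] for a in range(d))
    # For a symmetric matrix, tr(C^2) equals the sum of squared entries.
    trace_sq = sum(cov[a][b] ** 2 for a in range(d) for b in range(d))
    return trace ** 2 / trace_sq if trace_sq > 0 else 0.0


def mean_pairwise_cosine(tokens):
    """Average cosine similarity over all distinct token pairs.

    Assumes every token vector has nonzero norm. Assumed proxy for token
    coupling, not a measure taken from the paper.
    """
    sims = []
    for u, v in combinations(tokens, 2):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        sims.append(dot / (norm_u * norm_v))
    return sum(sims) / len(sims)
```

Under these assumptions, the pattern described above would appear as a participation ratio that rises over training while the mean pairwise similarity stays well above zero, rather than falling toward independence.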