Vision Transformers struggle with binding (knowing which features belong to the same object). This limitation explains why they fail on tasks where objects share features or occlude one another, which makes binding a measurable and critical component of visual understanding.
This paper uses information theory to formalize the 'binding problem': how a model determines which visual features belong to the same object. The authors develop a probing method to measure binding information in Vision Transformers and show that binding is crucial for understanding scenes with multiple objects, especially when objects share features or overlap.
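The summary does not specify the authors' probe, so the following is only a minimal sketch of one generic way to put a number on binding information, not the paper's method. It assumes binding can be posed as a pairwise question (do two tokens belong to the same object?) and uses the standard variational bound I(Y; Z) >= H(Y) - CE, where CE is the held-out cross-entropy of a probe that predicts the label Y from the representation Z. The synthetic "tokens", the elementwise-product pair feature, and all function names are hypothetical stand-ins for real ViT patch embeddings and segmentation labels.

```python
import numpy as np


def make_synthetic_pairs(n_scenes=200, tokens_per_scene=8, dim=16, noise=0.5, seed=0):
    """Hypothetical stand-in for ViT patch tokens (assumption, not real model output).

    Each scene holds two objects; every token is its object's random code plus
    Gaussian noise. Returns pair features and same-object labels.
    """
    rng = np.random.default_rng(seed)
    feats, labels = [], []
    for _ in range(n_scenes):
        codes = rng.normal(size=(2, dim))                # one code per object
        obj = rng.integers(0, 2, size=tokens_per_scene)  # object id of each token
        tokens = codes[obj] + noise * rng.normal(size=(tokens_per_scene, dim))
        i, j = np.triu_indices(tokens_per_scene, k=1)    # all unordered token pairs
        feats.append(tokens[i] * tokens[j])              # symmetric pair feature
        labels.append((obj[i] == obj[j]).astype(float))  # 1.0 if same object
    return np.concatenate(feats), np.concatenate(labels)


def sigmoid(z):
    # Clip logits so np.exp cannot overflow.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))


def fit_logistic_probe(x, y, lr=0.1, steps=500):
    """Fit a linear (logistic) probe with plain full-batch gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        err = sigmoid(x @ w + b) - y   # gradient of the loss w.r.t. the logits
        w -= lr * (x.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b


def cross_entropy_nats(x, y, w, b):
    """Mean binary cross-entropy of the probe, in nats."""
    p = np.clip(sigmoid(x @ w + b), 1e-7, 1.0 - 1e-7)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def estimate_binding_bound():
    """Return (lower bound on I(Y; Z), label entropy H(Y)), both in nats."""
    x_train, y_train = make_synthetic_pairs(seed=0)
    x_test, y_test = make_synthetic_pairs(seed=1)  # held-out scenes
    w, b = fit_logistic_probe(x_train, y_train)
    p_same = y_test.mean()
    h_y = float(-(p_same * np.log(p_same) + (1.0 - p_same) * np.log(1.0 - p_same)))
    return h_y - cross_entropy_nats(x_test, y_test, w, b), h_y


if __name__ == "__main__":
    bound, h_y = estimate_binding_bound()
    print(f"binding information >= {bound:.3f} nats (label entropy {h_y:.3f} nats)")
```

Because the estimate is a lower bound, a low value can mean either that the representation carries little binding information or that the probe is too weak to extract it. A common control is to repeat the measurement with shuffled labels, or with a more expressive probe, before drawing conclusions.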