You can build efficient vision transformers by routing all patch interactions through a small set of learned core tokens instead of using all-to-all attention, achieving linear complexity without sacrificing performance.
This paper proposes VECA, a vision transformer that replaces quadratic all-to-all attention with linear-time attention using a small set of learned "core" tokens as communication hubs. Instead of every patch attending to every other patch, patches interact only through the cores: with M cores fixed, the cost drops from O(N²) to O(N·M) = O(N) in the number of patches N, while maintaining competitive accuracy on vision tasks.
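The routing idea can be sketched as a two-step cross-attention: cores first gather information from all patches, then each patch reads back from the updated cores. This is a minimal NumPy illustration, not VECA's actual implementation — the function name `core_attention` and the single-head, projection-free form are assumptions for clarity; the paper's formulation (projections, heads, normalization) is not specified here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def core_attention(patches, cores):
    """Patch interaction routed through M learned core tokens (sketch).

    patches: (N, d) patch embeddings; cores: (M, d) learned core tokens.
    Both attention maps are N x M or M x N, so total cost is O(N*M)
    instead of the O(N^2) of all-to-all patch attention.
    """
    d = patches.shape[-1]
    scale = 1.0 / np.sqrt(d)
    # Step 1: cores gather information from all patches.
    gather = softmax(cores @ patches.T * scale, axis=-1)   # (M, N)
    updated_cores = gather @ patches                        # (M, d)
    # Step 2: each patch reads back from the updated cores.
    scatter = softmax(patches @ updated_cores.T * scale, axis=-1)  # (N, M)
    return scatter @ updated_cores                          # (N, d)

# Example: 196 patches (14x14 grid), 8 cores, 64-dim embeddings.
rng = np.random.default_rng(0)
patches = rng.standard_normal((196, 64))
cores = rng.standard_normal((8, 64))
out = core_attention(patches, cores)
print(out.shape)  # (196, 64)
```

Note that doubling N doubles the work (both attention maps grow linearly with N), which is the linear-complexity property the summary describes.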