You can build efficient vision transformers by routing all patch interactions through a small set of learned core tokens instead of using all-to-all attention, achieving linear complexity without sacrificing performance.
This paper proposes VECA, a vision transformer that replaces quadratic all-to-all attention with linear-time attention using a small set of learned "core" tokens as communication hubs. Instead of every patch attending to every other patch, patches interact only through the cores: with M cores fixed, the cost drops from O(N²) to O(N·M) = O(N) in the number of patches N, while maintaining competitive accuracy on vision tasks.
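The routing idea can be sketched as a two-step cross-attention: cores first gather information from all patches, then each patch reads back from the updated cores. This is a minimal NumPy illustration, not VECA's actual implementation — the function name `core_attention` and the single-head, projection-free form are assumptions for clarity; the paper's formulation (projections, heads, normalization) is not specified here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def core_attention(patches, cores):
    """Patch interaction routed through M learned core tokens (sketch).

    patches: (N, d) patch embeddings; cores: (M, d) learned core tokens.
    Both attention maps are N x M or M x N, so total cost is O(N*M)
    instead of the O(N^2) of all-to-all patch attention.
    """
    d = patches.shape[-1]
    scale = 1.0 / np.sqrt(d)
    # Step 1: cores gather information from all patches.
    gather = softmax(cores @ patches.T * scale, axis=-1)   # (M, N)
    updated_cores = gather @ patches                        # (M, d)
    # Step 2: each patch reads back from the updated cores.
    scatter = softmax(patches @ updated_cores.T * scale, axis=-1)  # (N, M)
    return scatter @ updated_cores                          # (N, d)

# Example: 196 patches (14x14 grid), 8 cores, 64-dim embeddings.
rng = np.random.default_rng(0)
patches = rng.standard_normal((196, 64))
cores = rng.standard_normal((8, 64))
out = core_attention(patches, cores)
print(out.shape)  # (196, 64)
```

Note that doubling N doubles the work (both attention maps grow linearly with N), which is the linear-complexity property the summary describes.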