Variable-Width Transformers

Zhaofeng Wu, Oliver Sieberling, Shawn Tan, Rameswar Panda, Yury Polyanskiy et al. | June 16, 2026 | arXiv

Key Takeaway

Transformer layers need not all share the same width: narrowing the middle layers while keeping the early and late layers wide improves both efficiency and language-modeling performance, which suggests that different layers play different computational roles.

Summary

This paper proposes Variable-Width Transformers, which use wider layers at the beginning and end of the network and narrower layers in the middle. This non-uniform design outperforms standard uniform-width transformers of the same size on language modeling, while reducing computation by 22% and memory usage by 15%.
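
To make the wide-narrow-wide layout concrete, here is a minimal Python sketch of a per-layer width schedule together with a rough weight count. Everything in it is an assumption for illustration: the function names, the step-shaped schedule, the layer count, the widths 1024 and 512, and the common approximation of 4*d^2 attention weights plus 8*d^2 MLP weights per block. The summary does not say how the paper changes width between layers, so any projections between wide and narrow layers are not counted, and the printed ratio is not meant to reproduce the reported 22% or 15% figures.

```python
def u_shaped_widths(num_layers, wide, narrow, num_wide_each_end):
    """Hypothetical schedule: wide layers at both ends, narrow layers in the middle."""
    if 2 * num_wide_each_end > num_layers:
        raise ValueError("wide end layers exceed total layer count")
    middle = num_layers - 2 * num_wide_each_end
    return [wide] * num_wide_each_end + [narrow] * middle + [wide] * num_wide_each_end


def approx_block_params(width, mlp_ratio=4):
    """Rough weight count for one block at a given width.

    Assumes 4*d^2 for the attention projections and 2*mlp_ratio*d^2 for the
    MLP; biases, norms, and embeddings are ignored.
    """
    return 4 * width * width + 2 * mlp_ratio * width * width


def approx_total_params(widths):
    """Sum the per-block estimate over all layers."""
    return sum(approx_block_params(w) for w in widths)


# Illustrative numbers only; none of these come from the paper.
variable = u_shaped_widths(num_layers=8, wide=1024, narrow=512, num_wide_each_end=2)
uniform = [1024] * 8
ratio = approx_total_params(variable) / approx_total_params(uniform)

print(variable)
print(f"block-weight ratio vs. uniform: {ratio:.3f}")
```

Because the per-block estimate grows with the square of the width, halving the width of a middle layer cuts that layer's estimated weights to a quarter. This is why narrowing even a few layers can shift a noticeable share of the parameter budget, which a same-size comparison could then spend elsewhere.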

architecture efficiency scaling

Key Terms

transformer-architecture residual-stream kv-cache model-width nonuniform-capacity-allocation