Not all transformer layers need the same width: narrowing the middle layers while keeping the early and late layers wide improves both efficiency and language-modeling performance, suggesting that different layers play different computational roles.
This paper proposes Variable-Width Transformers, which keep the layers at the beginning and end of the network wide and narrow the layers in the middle. On language modeling, this non-uniform design outperforms standard transformers of the same size while reducing computation by 22% and memory usage by 15%.
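To make the wide-narrow-wide shape concrete, here is a minimal sketch of one way a per-layer width schedule could be generated. The summary above does not say how the widths are chosen, so the linear V-shaped interpolation, the function name `width_schedule`, and the example widths (512 and 256) are all illustrative assumptions rather than the paper's actual configuration.

```python
def width_schedule(num_layers, d_wide, d_narrow):
    """Return one hidden width per layer: widest at both ends, narrowest in the middle.

    Hypothetical helper: the linear interpolation is an assumption made for
    illustration, not the schedule used in the paper.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if num_layers == 1:
        return [d_wide]

    mid = (num_layers - 1) / 2  # index of the network's center (may be fractional)
    widths = []
    for i in range(num_layers):
        # t is 1.0 at the first and last layer and falls to 0.0 at the center.
        t = abs(i - mid) / mid
        widths.append(round(d_narrow + t * (d_wide - d_narrow)))
    return widths


if __name__ == "__main__":
    # Assumed example sizes: 5 layers, 512 wide at the ends, 256 in the middle.
    print(width_schedule(5, 512, 256))  # [512, 384, 256, 384, 512]
```

A real model built on such a schedule would likely also need some way to map activations between adjacent layers of different widths (for example, a learned linear projection). That detail is likewise an assumption here, since the summary does not describe it.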