Tapered Language Models

Reza Bayat, Ali Behrouz, Aaron Courville|June 22, 2026arXiv

Key Takeaway

You can improve language model efficiency by tapering MLP width across depth—allocating more capacity to early layers and less to later ones—a free performance gain that works across different architectures.

Summary

This paper shows that language models waste parameters by allocating them uniformly across layers. The authors propose Tapered Language Models, which gradually reduce the width of MLPs (the largest parameter-consuming components) from early to later layers. Across multiple architectures and scales, this simple change improves performance without extra cost.

architecture efficiency scaling

Key Terms

residual-stream mlp depth-aware-capacity-allocation parameter-budget