You can improve language model efficiency by tapering MLP width across depth, allocating more capacity to early layers and less to later ones. The change costs nothing extra and works across different architectures.
This paper shows that language models waste parameters by allocating them uniformly across layers. The authors propose Tapered Language Models, which gradually reduce the width of the MLPs, the components that consume the most parameters, from early to later layers. Across multiple architectures and scales, this simple change improves performance without extra cost.
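
To make the idea concrete, here is a minimal sketch of one possible tapering schedule. It is an illustration, not the authors' method: the linear shape, the `taper` parameter, and the helper names `tapered_widths` and `mlp_params` are all assumptions, since the summary above does not say which schedule the paper uses. The sketch keeps the average MLP width equal to the uniform baseline, so the total MLP parameter count stays roughly the same while early layers get more width and later layers get less.

```python
def tapered_widths(num_layers, base_width, taper=0.5):
    """Per-layer MLP hidden widths that shrink linearly with depth.

    Hypothetical schedule (an assumption, not taken from the paper):
    the first layer gets (1 + taper) * base_width, the last layer gets
    (1 - taper) * base_width, and layers in between are interpolated
    linearly, so the mean width stays at base_width up to rounding.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if not 0.0 <= taper < 1.0:
        raise ValueError("taper must be in [0, 1)")
    if num_layers == 1:
        return [base_width]

    widths = []
    for layer in range(num_layers):
        depth = layer / (num_layers - 1)          # 0.0 at first layer, 1.0 at last
        scale = (1.0 + taper) - 2.0 * taper * depth
        widths.append(round(base_width * scale))
    return widths


def mlp_params(d_model, widths):
    """Weight count for two-matrix MLPs (up and down projection), biases ignored."""
    return sum(2 * d_model * width for width in widths)


if __name__ == "__main__":
    uniform = [1000] * 5
    tapered = tapered_widths(num_layers=5, base_width=1000, taper=0.5)
    print(tapered)  # [1500, 1250, 1000, 750, 500]
    print(mlp_params(512, uniform), mlp_params(512, tapered))
```

In this toy setting both configurations have the same MLP parameter count; only the distribution across depth changes. Whether a linear schedule is the right shape, and how strong the taper should be, are open choices in this sketch.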