Aligning router weights with the principal singular directions of their experts improves MoE routing efficiency. This simple mathematical principle scales from 1B- to 11B-parameter models.
This paper improves Mixture-of-Experts (MoE) models by redesigning how routers select which experts to use. The authors propose aligning each router with the principal singular direction of its expert, that is, the direction the expert's weights amplify most, using a mathematical technique called Manifold Power Iteration. This alignment helps routers better match tokens to appropriate experts.
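
To make the idea of a "principal singular direction" concrete, here is a minimal sketch in plain Python. It uses ordinary power iteration on W^T W, not the paper's Manifold Power Iteration, whose details are not given in this summary. The toy matrix `W_expert`, the convention that the expert computes y = W x, and the helper names (`principal_input_direction`, `router_score`) are all assumptions made for illustration.

```python
import math

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def transpose(M):
    """Return the transpose of M as a list of rows."""
    return [list(col) for col in zip(*M)]

def normalize(x):
    """Scale x to unit Euclidean length."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    return [xi / norm for xi in x]

def principal_input_direction(W, iters=50):
    """Ordinary power iteration on W^T W.

    Approximates the top right singular vector of W, i.e. the unit
    input direction that W stretches the most.
    """
    Wt = transpose(W)
    # Deterministic start; assumed not orthogonal to the answer.
    v = normalize([1.0] * len(W[0]))
    for _ in range(iters):
        v = normalize(matvec(Wt, matvec(W, v)))
    return v

# Hypothetical toy expert (assumption): y = W x, with singular values 3 and 1.
W_expert = [[3.0, 0.0],
            [0.0, 1.0]]

# "Aligned" router weights for this expert: the principal input direction.
router_row = principal_input_direction(W_expert)

def router_score(token):
    """Router logit for this expert: dot product with the aligned weights."""
    return sum(r * t for r, t in zip(router_row, token))
```

In this toy setup the aligned router row points along the first input axis, so a token lying along that axis scores higher than one orthogonal to it. That is the intuition behind matching tokens to the expert that responds to them most strongly.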