Routers in Sparse Mixture-of-Experts models work best when they maintain geometric alignment with their experts; understanding this coupling can improve routing stability and reduce the need for complex auxiliary losses.
This paper reveals that routers in Sparse Mixture-of-Experts models learn a geometric relationship with their experts: router weights and expert weights receive gradients along the same directions, causing them to specialize together.
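The shared-gradient-direction claim can be illustrated with a toy, analytically differentiated MoE layer. This is a sketch under assumptions not stated in the summary (top-1 softmax routing, linear experts, a linear loss `L = v . y`), not the paper's actual setup: for a selected expert `e`, both the router row `r_e` and every row of the expert matrix `W_e` receive gradients proportional to the input token `x`, so they align along the same direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 8, 4

x = rng.normal(size=d)                 # token embedding
R = rng.normal(size=(n_experts, d))    # router weights (one row per expert)
W = rng.normal(size=(n_experts, d, d)) # linear experts (hypothetical toy setup)
v = rng.normal(size=d)                 # arbitrary loss direction: L = v . y

logits = R @ x
g = np.exp(logits) / np.exp(logits).sum()  # softmax gate
e = int(np.argmax(g))                      # top-1 routing
# forward pass: y = g_e * (W_e @ x), loss L = v . y

# analytic gradients for the selected expert:
# dL/dW_e = g_e * outer(v, x)  -> every row is a scalar multiple of x
grad_We = g[e] * np.outer(v, x)
# dL/dr_e = (v . W_e x) * g_e * (1 - g_e) * x  -> also a multiple of x
grad_re = (v @ (W[e] @ x)) * g[e] * (1 - g[e]) * x

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# both gradients lie along the token direction x (|cosine| ~ 1)
print(abs(cos(grad_re, x)))
print(abs(cos(grad_We[0], x)))
```

Because both gradients are multiples of the same input direction, updates push the router row and the expert weights to specialize toward the same region of token space, which is the coupling the summary describes.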