You can now tune hyperparameters on a single dense model and transfer them directly to Mixture-of-Experts (MoE) models of any size or configuration, eliminating the need for an expensive hyperparameter search when scaling with MoE.
Complete-muE is a framework for transferring hyperparameters, such as learning rate and weight decay, from dense neural networks to MoE models without retuning.