MobileMoE: Scaling On-Device Mixture of Experts

Yanbei Chen, Hanxian Huang, Ernie Chang, Jacob Szwejbka, Digant Desai et al. | May 26, 2026 | arXiv

Key Takeaway

MoE isn't just for giant models: on mobile devices, moderate sparsity with shared experts is both memory-optimal and compute-optimal, giving better performance with fewer active parameters than dense models.

Summary

MobileMoE brings the Mixture-of-Experts (MoE) architecture to phones and edge devices by optimizing it for memory and compute constraints. The models use 0.3-0.9B active parameters yet outperform larger dense models, and they run 2-4× faster on real smartphones while using less memory.
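
To make the distinction between active and total parameters concrete, here is a minimal sketch that counts both for a single MoE feed-forward layer with shared experts and top-k routing. Everything in it is an assumption for illustration: the MoEFFNConfig class, the gated three-matrix expert layout, and all sizes are hypothetical and are not taken from the MobileMoE paper.

```python
from dataclasses import dataclass


@dataclass
class MoEFFNConfig:
    """Hypothetical configuration for one MoE feed-forward layer."""
    d_model: int   # hidden size of the transformer
    d_expert: int  # inner width of each expert FFN
    n_routed: int  # routed experts stored in memory
    n_shared: int  # shared experts applied to every token
    top_k: int     # routed experts selected per token


def expert_params(cfg: MoEFFNConfig) -> int:
    # Assumption: each expert is a gated FFN with three
    # d_model x d_expert weight matrices and no biases.
    return 3 * cfg.d_model * cfg.d_expert


def router_params(cfg: MoEFFNConfig) -> int:
    # A linear router that scores every routed expert for each token.
    return cfg.d_model * cfg.n_routed


def total_params(cfg: MoEFFNConfig) -> int:
    # Everything that must be stored: all experts plus the router.
    return (cfg.n_routed + cfg.n_shared) * expert_params(cfg) + router_params(cfg)


def active_params(cfg: MoEFFNConfig) -> int:
    # What one token actually touches: top-k routed experts,
    # every shared expert, and the router.
    return (cfg.top_k + cfg.n_shared) * expert_params(cfg) + router_params(cfg)


# Illustrative sizes only, chosen to show moderate sparsity.
cfg = MoEFFNConfig(d_model=1024, d_expert=512, n_routed=8, n_shared=1, top_k=2)
total = total_params(cfg)
active = active_params(cfg)
print(f"total params per layer:  {total:,}")
print(f"active params per layer: {active:,}")
print(f"active fraction:         {active / total:.2f}")
```

In this toy setup, memory footprint scales with total_params while per-token compute scales with active_params, which is the trade-off the summary describes. Raising n_routed increases memory without changing per-token compute, whereas raising top_k or n_shared increases both.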

efficiency architecture scaling

Key Terms

mixture-of-experts sparse-activation quantization-aware-training on-device-inference scaling-law