By constraining tokens within the same document to share expert pools during pretraining, EMO produces naturally modular experts that specialize in semantic domains (math, code, etc.). This modularity enables memory-efficient deployment, since a workload only needs to load the expert pools it actually uses, without sacrificing performance.
EMO is a Mixture-of-Experts language model designed to run efficiently when you only need a subset of its capabilities. Instead of letting every token route freely across the full set of experts, EMO groups experts by document domain during training: code-heavy documents route to code experts, math documents to math experts, and so on.
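To make the routing constraint concrete, here is a minimal sketch of a document-constrained top-k router in PyTorch. The class name, the pool layout, and the per-document `domain_ids` input are illustrative assumptions, not EMO's actual implementation; the point is simply that each token's router logits are masked to the expert pool of its source document's domain before the top-k selection.

```python
import torch
import torch.nn.functional as F


class DocumentConstrainedRouter(torch.nn.Module):
    """Hypothetical sketch: a top-k MoE router that restricts each token
    to the expert pool assigned to its document's domain."""

    def __init__(self, d_model: int, n_experts: int,
                 pools: dict[int, list[int]], k: int = 2):
        super().__init__()
        self.gate = torch.nn.Linear(d_model, n_experts, bias=False)
        self.k = k
        # Precompute a boolean mask per domain: True where the expert
        # belongs to that domain's pool. Assumes domain ids 0..len(pools)-1
        # and that every pool has at least k experts.
        mask = torch.zeros(len(pools), n_experts, dtype=torch.bool)
        for domain, experts in pools.items():
            mask[domain, experts] = True
        self.register_buffer("pool_mask", mask)

    def forward(self, x: torch.Tensor, domain_ids: torch.Tensor):
        # x: (tokens, d_model); domain_ids: (tokens,) domain label of each
        # token's source document.
        logits = self.gate(x)                      # (tokens, n_experts)
        allowed = self.pool_mask[domain_ids]       # (tokens, n_experts)
        # Mask out-of-pool experts before top-k, so tokens from the same
        # document can only ever select experts from the shared pool.
        logits = logits.masked_fill(~allowed, float("-inf"))
        weights, experts = logits.topk(self.k, dim=-1)
        return F.softmax(weights, dim=-1), experts  # gate weights, expert ids
```

A usage example under the same assumptions:

```python
router = DocumentConstrainedRouter(
    d_model=512, n_experts=8,
    pools={0: [0, 1, 2, 3], 1: [4, 5, 6, 7]},  # e.g. 0 = code, 1 = math
)
x = torch.randn(16, 512)                        # 16 tokens
domain_ids = torch.zeros(16, dtype=torch.long)  # all from a code document
weights, experts = router(x, domain_ids)        # experts drawn only from pool 0
```

Because the mask is applied before top-k, gradients only ever flow to in-pool experts, which is what lets the pools specialize cleanly and lets deployment drop the unused ones.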