By constraining tokens within the same document to share expert pools during pretraining, EMO produces naturally modular experts that specialize in semantic domains (math, code, etc.). This modularity enables memory-efficient deployment, since a workload only needs to load the expert pools it actually uses, without sacrificing performance.
EMO is a Mixture-of-Experts language model designed to run efficiently when you only need a subset of its capabilities. Instead of letting every token route freely across the full set of experts, EMO groups experts by document domain during training: code-heavy documents route to code experts, math documents to math experts, and so on.
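To make the routing constraint concrete, here is a minimal sketch of a document-constrained top-k router in PyTorch. The class name, the pool layout, and the per-document `domain_ids` input are illustrative assumptions, not EMO's actual implementation; the point is simply that each token's router logits are masked to the expert pool of its source document's domain before the top-k selection.

```python
import torch
import torch.nn.functional as F


class DocumentConstrainedRouter(torch.nn.Module):
    """Hypothetical sketch: a top-k MoE router that restricts each token
    to the expert pool assigned to its document's domain."""

    def __init__(self, d_model: int, n_experts: int,
                 pools: dict[int, list[int]], k: int = 2):
        super().__init__()
        self.gate = torch.nn.Linear(d_model, n_experts, bias=False)
        self.k = k
        # Precompute a boolean mask per domain: True where the expert
        # belongs to that domain's pool. Assumes domain ids 0..len(pools)-1
        # and that every pool has at least k experts.
        mask = torch.zeros(len(pools), n_experts, dtype=torch.bool)
        for domain, experts in pools.items():
            mask[domain, experts] = True
        self.register_buffer("pool_mask", mask)

    def forward(self, x: torch.Tensor, domain_ids: torch.Tensor):
        # x: (tokens, d_model); domain_ids: (tokens,) domain label of each
        # token's source document.
        logits = self.gate(x)                      # (tokens, n_experts)
        allowed = self.pool_mask[domain_ids]       # (tokens, n_experts)
        # Mask out-of-pool experts before top-k, so tokens from the same
        # document can only ever select experts from the shared pool.
        logits = logits.masked_fill(~allowed, float("-inf"))
        weights, experts = logits.topk(self.k, dim=-1)
        return F.softmax(weights, dim=-1), experts  # gate weights, expert ids
```

A usage example under the same assumptions:

```python
router = DocumentConstrainedRouter(
    d_model=512, n_experts=8,
    pools={0: [0, 1, 2, 3], 1: [4, 5, 6, 7]},  # e.g. 0 = code, 1 = math
)
x = torch.randn(16, 512)                        # 16 tokens
domain_ids = torch.zeros(16, dtype=torch.long)  # all from a code document
weights, experts = router(x, domain_ids)        # experts drawn only from pool 0
```

Because the mask is applied before top-k, gradients only ever flow to in-pool experts, which is what lets the pools specialize cleanly and lets deployment drop the unused ones.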