MoE language models may be inherently more interpretable than dense models: their sparse routing pushes each expert toward a single, well-defined task, such as closing LaTeX brackets, rather than a mix of unrelated functions.
This paper investigates how Mixture-of-Experts (MoE) language models work by analyzing individual experts rather than individual neurons. The researchers find that MoE experts are more monosemantic (less ambiguous in function) than neurons in dense networks, and that experts specialize in specific linguistic tasks rather than broad topical domains, making MoE models easier to understand and interpret at scale.