Multimodal MoE models suffer from "routing distraction": visual inputs misdirect the expert-selection mechanism away from reasoning experts. The proposed fix guides the router to activate task-relevant domain experts.
This paper identifies a critical flaw in multimodal mixture-of-experts (MoE) models: they can perceive images correctly yet fail at reasoning tasks they solve easily when the same problem is posed as text. The researchers trace this failure to visual inputs causing the routing mechanism to activate the wrong experts, and propose a routing fix that improves performance by up to 3.17% on complex visual reasoning tasks.
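To make the routing idea concrete, here is a minimal sketch of top-k MoE expert selection with an optional per-expert logit bias that steers the router toward certain experts. This is an illustration of the general mechanism only, not the paper's actual method; the `expert_bias` parameter, the expert indices, and the toy logit values are all assumptions for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(logits, k=2, expert_bias=None):
    """Top-k MoE routing over per-expert gate logits.

    `expert_bias` (hypothetical) adds a per-expert logit offset,
    illustrating how routing can be steered toward task-relevant
    experts. Returns (expert_index, weight) pairs whose weights
    sum to 1.
    """
    if expert_bias is not None:
        logits = [l + b for l, b in zip(logits, expert_bias)]
    # Pick the k experts with the highest (possibly biased) logits.
    topk = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Renormalize the gate weights over the selected experts.
    probs = softmax([logits[i] for i in topk])
    return list(zip(topk, probs))

# Unbiased routing: a distracting visual token pulls selection toward
# experts 0 and 1 (assumed here to be perception experts).
logits = [2.0, 1.5, 0.4, 0.1]
print(route(logits, k=2))

# Biasing toward experts 2 and 3 (assumed reasoning experts) flips
# the selection, even though the raw gate logits are unchanged.
bias = [0.0, 0.0, 2.5, 2.5]
print(route(logits, k=2, expert_bias=bias))
```

With the bias applied, experts 2 and 3 win the top-k selection instead of 0 and 1, which is the intuition behind guiding the router toward task-relevant experts.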