Multimodal MoE models suffer from 'routing distraction': visual inputs cause the router to activate the wrong experts for reasoning. A simple intervention that steers expert selection toward domain experts significantly improves visual reasoning performance.
This paper identifies a failure mode in multimodal mixture-of-experts models: they perceive images correctly but fail at reasoning tasks that they solve easily when the task is presented as text.
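
To make the idea of guiding expert selection concrete, here is a minimal sketch of top-k routing for a single token, with an optional additive boost on the router logits of a chosen set of experts. The summary does not say how the paper's intervention is implemented, so everything here is an assumption for illustration: the function `route_top_k`, the `domain_experts` set, the `boost` value, and the example logits are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(logits, k, domain_experts=(), boost=0.0):
    """Pick the top-k experts for one token.

    Assumption (not from the paper): guidance is modeled as a constant
    added to the router logits of the experts in `domain_experts`.
    Returns (expert_indices, weights), with weights normalized over
    the selected experts only.
    """
    adjusted = [
        x + boost if i in domain_experts else x
        for i, x in enumerate(logits)
    ]
    # Rank experts by adjusted logit, highest first, and keep k of them.
    top = sorted(range(len(adjusted)), key=lambda i: adjusted[i], reverse=True)[:k]
    weights = softmax([adjusted[i] for i in top])
    return top, weights


# Hypothetical router logits for one token over four experts.
router_logits = [2.0, 0.5, 1.8, 1.5]

# Unguided routing selects experts 0 and 2.
baseline_experts, _ = route_top_k(router_logits, k=2)

# Treating experts 1 and 3 as the (hypothetical) domain experts and
# boosting them by 1.0 pulls expert 3 into the selected set.
guided_experts, guided_weights = route_top_k(
    router_logits, k=2, domain_experts={1, 3}, boost=1.0
)
```

In this toy setting the boost changes which experts are selected without changing the routing mechanism itself; how the domain experts are identified, and how strongly to steer toward them, are the parts a real method would need to specify.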