Multimodal MoE models suffer from "routing distraction": visual inputs misdirect the expert-selection mechanism away from reasoning experts. The proposed fix guides the router to activate task-relevant domain experts.
This paper identifies a critical flaw in multimodal mixture-of-experts (MoE) models: they can perceive images correctly yet fail at reasoning tasks they solve easily when the same problem is posed as text. The researchers trace this failure to visual inputs causing the routing mechanism to activate the wrong experts, and propose a routing fix that improves performance by up to 3.17% on complex visual reasoning tasks.
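To make the routing idea concrete, here is a minimal sketch of top-k MoE expert selection with an optional per-expert logit bias that steers the router toward certain experts. This is an illustration of the general mechanism only, not the paper's actual method; the `expert_bias` parameter, the expert indices, and the toy logit values are all assumptions for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(logits, k=2, expert_bias=None):
    """Top-k MoE routing over per-expert gate logits.

    `expert_bias` (hypothetical) adds a per-expert logit offset,
    illustrating how routing can be steered toward task-relevant
    experts. Returns (expert_index, weight) pairs whose weights
    sum to 1.
    """
    if expert_bias is not None:
        logits = [l + b for l, b in zip(logits, expert_bias)]
    # Pick the k experts with the highest (possibly biased) logits.
    topk = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Renormalize the gate weights over the selected experts.
    probs = softmax([logits[i] for i in topk])
    return list(zip(topk, probs))

# Unbiased routing: a distracting visual token pulls selection toward
# experts 0 and 1 (assumed here to be perception experts).
logits = [2.0, 1.5, 0.4, 0.1]
print(route(logits, k=2))

# Biasing toward experts 2 and 3 (assumed reasoning experts) flips
# the selection, even though the raw gate logits are unchanged.
bias = [0.0, 0.0, 2.5, 2.5]
print(route(logits, k=2, expert_bias=bias))
```

With the bias applied, experts 2 and 3 win the top-k selection instead of 0 and 1, which is the intuition behind guiding the router toward task-relevant experts.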