Fixing modality dominance requires supplying the missing information, not just redirecting attention: MoIR routes complementary information between modalities to produce more balanced, information-dense representations before the language model processes them.
Vision-language models often rely too heavily on one modality (vision or text), ignoring useful information from the other. This paper proposes MoIR, a method that identifies weak or ambiguous tokens in one modality and enriches them with information from the stronger modality before processing.
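The paper does not spell out the routing mechanism here, but the idea of detecting ambiguous tokens and enriching them cross-modally can be sketched. The snippet below is a minimal illustration, not MoIR itself: it assumes attention entropy as a proxy for token ambiguity and uses a simple gated cross-attention readout to pull information from the stronger modality. The function name `enrich_weak_tokens`, the entropy threshold, and the gating scheme are all hypothetical choices for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def enrich_weak_tokens(weak, strong, threshold=0.5):
    """Enrich ambiguous tokens in `weak` with cross-attended
    information from `strong` (illustrative sketch, not MoIR).

    weak:   (n, d) token embeddings of the weaker modality
    strong: (m, d) token embeddings of the stronger modality
    """
    # Scaled dot-product attention from weak tokens over strong tokens.
    scores = weak @ strong.T / np.sqrt(weak.shape[1])   # (n, m)
    attn = softmax(scores, axis=-1)

    # Proxy confidence: a token whose attention over the other modality
    # is diffuse (high normalized entropy) is treated as ambiguous.
    entropy = -(attn * np.log(attn + 1e-9)).sum(-1)
    ambiguity = entropy / np.log(strong.shape[0])       # in [0, 1]

    # Route cross-modal information only into the ambiguous tokens.
    readout = attn @ strong                              # (n, d)
    gate = (ambiguity > threshold).astype(weak.dtype)[:, None]
    return weak + gate * readout

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))    # toy text-token embeddings
vision = rng.normal(size=(6, 8))  # toy vision-token embeddings
enriched = enrich_weak_tokens(text, vision)
```

Confident tokens pass through unchanged, so the operation only adds information where the representation is weak, which matches the paper's stated goal of balancing modalities rather than uniformly mixing them.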