Fixing modality dominance requires supplying the missing information, not just redirecting attention: MoIR routes complementary information between modalities to produce more balanced, information-dense representations before the language model processes them.
Vision-language models often rely too heavily on one modality (vision or text), ignoring useful information from the other. This paper proposes MoIR, a method that identifies weak or ambiguous tokens in one modality and enriches them with information from the stronger modality before processing.
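The paper does not spell out the routing mechanism here, but the idea of detecting ambiguous tokens and enriching them cross-modally can be sketched. The snippet below is a minimal illustration, not MoIR itself: it assumes attention entropy as a proxy for token ambiguity and uses a simple gated cross-attention readout to pull information from the stronger modality. The function name `enrich_weak_tokens`, the entropy threshold, and the gating scheme are all hypothetical choices for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def enrich_weak_tokens(weak, strong, threshold=0.5):
    """Enrich ambiguous tokens in `weak` with cross-attended
    information from `strong` (illustrative sketch, not MoIR).

    weak:   (n, d) token embeddings of the weaker modality
    strong: (m, d) token embeddings of the stronger modality
    """
    # Scaled dot-product attention from weak tokens over strong tokens.
    scores = weak @ strong.T / np.sqrt(weak.shape[1])   # (n, m)
    attn = softmax(scores, axis=-1)

    # Proxy confidence: a token whose attention over the other modality
    # is diffuse (high normalized entropy) is treated as ambiguous.
    entropy = -(attn * np.log(attn + 1e-9)).sum(-1)
    ambiguity = entropy / np.log(strong.shape[0])       # in [0, 1]

    # Route cross-modal information only into the ambiguous tokens.
    readout = attn @ strong                              # (n, d)
    gate = (ambiguity > threshold).astype(weak.dtype)[:, None]
    return weak + gate * readout

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))    # toy text-token embeddings
vision = rng.normal(size=(6, 8))  # toy vision-token embeddings
enriched = enrich_weak_tokens(text, vision)
```

Confident tokens pass through unchanged, so the operation only adds information where the representation is weak, which matches the paper's stated goal of balancing modalities rather than uniformly mixing them.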