Beyond Text Following: Repairable Arbitration Reversals in Audio-Language Models

Yichen Gao, Yiqun Zhang, Zijing Wang, Yujia Li, Heng Guo et al. | June 3, 2026 | arXiv

Key Takeaway

Audio-language models encode audio evidence correctly but let conflicting text win the arbitration. This can be fixed at inference time by reweighting scores from the audio-only and audio-text branches, improving accuracy by 17.8 points without retraining.
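To make the reweighting idea concrete, here is a minimal sketch of combining answer scores from two forward passes. It is an illustration under stated assumptions, not the paper's exact procedure: the function names, the log-linear mixing rule, and the weight `alpha` are all hypothetical, and the summary does not say how the branches are actually weighted.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of answer-option logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def reweighted_choice(audio_only_logits, audio_text_logits, alpha=0.5):
    """Pick an answer from a weighted mix of two branches.

    ASSUMPTION: a log-linear mix with weight `alpha` on the audio-only
    branch. alpha=0 reproduces the ordinary audio-text decision.
    """
    a = log_softmax(audio_only_logits)
    b = log_softmax(audio_text_logits)
    combined = [alpha * x + (1 - alpha) * y for x, y in zip(a, b)]
    best = max(range(len(combined)), key=combined.__getitem__)
    return best, combined

# Toy conflict: option 0 is what the audio supports, option 1 is what the text claims.
audio_only = [2.0, 0.0]   # audio-only branch favors option 0
audio_text = [0.0, 1.0]   # audio-text branch follows the text
baseline, _ = reweighted_choice(audio_only, audio_text, alpha=0.0)
repaired, _ = reweighted_choice(audio_only, audio_text, alpha=0.5)
```

In this toy case the baseline follows the text (option 1), while giving the audio-only branch half the weight restores the audio-supported answer (option 0). No model weights change; only the scores are recombined at decision time.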

Summary

Audio-language models often prefer text over audio even when audio is clearly correct. This paper shows that the audio information is actually encoded in the model but gets overridden during decision-making. By removing conflicting text and measuring how the model's preference changes, researchers found that 64% of conflicts can be reversed.
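One plausible way to operationalize a "reversible conflict" is sketched below. The margin definition, the function name, and the toy numbers are assumptions made for illustration. The paper's exact metric is not given in this summary.

```python
def reversal_rate(margins_with_text, margins_without_text):
    """Fraction of text-won conflicts that flip to audio once the text is removed.

    ASSUMPTION: each margin is (score of audio-supported answer) minus
    (score of text-supported answer) for one item. A negative margin
    means the model sided with the text.
    """
    if len(margins_with_text) != len(margins_without_text):
        raise ValueError("margin lists must be the same length")
    # Keep only items where the model followed the conflicting text.
    lost = [(w, wo) for w, wo in zip(margins_with_text, margins_without_text) if w < 0]
    if not lost:
        return 0.0
    # A reversal: the audio-supported answer wins once the text is gone.
    flipped = sum(1 for _, wo in lost if wo > 0)
    return flipped / len(lost)

# Hypothetical margins for four items (not data from the paper).
with_text = [-1.0, -0.5, 0.3, -2.0]
without_text = [0.8, -0.1, 0.9, 1.5]
rate = reversal_rate(with_text, without_text)
```

Here three of the four items are lost to the text, and two of those flip when the text is removed, so the rate is 2/3. A high rate on real data would support the claim that the audio evidence is encoded but overridden.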

multimodal evaluation reasoning

Key Terms

activation-patching counterfactual-evaluation arbitration logit-distillation