Audio-language models encode audio evidence correctly, but that evidence loses out to text when the two conflict. This can be corrected at inference time by reweighting the scores from an audio-only branch and an audio-text branch, which improves accuracy by 17.8 points without retraining.
Audio-language models often prefer text over audio even when the audio is clearly correct. This paper shows that the audio information is in fact encoded in the model but is overridden during decision-making. By removing the conflicting text and measuring how the model's preference changes, the authors find that 64% of conflicts can be reversed.
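
To make the inference-time reweighting concrete, here is a minimal sketch. It assumes the model exposes one score per answer option from an audio-only pass and from an audio-text pass. The log-linear mixing rule, the `weight` parameter, and the function names are illustrative assumptions, not the paper's exact formulation.

```python
import math


def log_softmax(scores):
    """Turn raw per-option scores into log-probabilities."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def reweight(audio_only_scores, audio_text_scores, weight=0.5):
    """Mix the two branches in log space (assumed rule, for illustration).

    weight = 0.0 reproduces the audio-text branch alone;
    weight = 1.0 trusts the audio-only branch alone.
    """
    audio_lp = log_softmax(audio_only_scores)
    joint_lp = log_softmax(audio_text_scores)
    return [weight * a + (1.0 - weight) * j
            for a, j in zip(audio_lp, joint_lp)]


def pick(scores):
    """Index of the highest-scoring option."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical two-option conflict: option 0 is supported by the audio,
# option 1 is suggested by the misleading text.
audio_only = [2.0, 0.0]   # audio-only branch favors option 0
audio_text = [0.0, 1.0]   # audio-text branch is pulled toward option 1

baseline_choice = pick(reweight(audio_only, audio_text, weight=0.0))
reweighted_choice = pick(reweight(audio_only, audio_text, weight=0.5))
```

In this toy case the unweighted audio-text branch selects the text-suggested option, while giving the audio-only branch equal weight restores the audio-supported one. In practice the weight would need to be chosen on held-out data.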