Current audio AI models perform poorly on tasks that require genuine audio understanding, which suggests they exploit dataset biases and metadata rather than actually listening to and reasoning about sound.
AUDITA is a new benchmark that pairs real-world audio with human-authored trivia questions. It is designed to test whether AI models truly understand audio content rather than relying on shortcuts. Humans answer 32% of the questions correctly, while state-of-the-art models score below 9%, revealing a significant gap in audio reasoning capabilities.