Current large audio-language models fail to properly control or interpret paralinguistic cues (emotion, tone, style) in speech; these failures account for 43% of errors in conversational tasks, a critical gap for building natural-sounding voice assistants.
SpeechParaling-Bench is a benchmark for testing how well AI speech models handle paralinguistic features such as emotion, tone, and speaking style. It covers more than 100 fine-grained features across 1,000+ English and Chinese speech samples and uses an AI judge to compare model outputs consistently. Evaluations show that current models struggle significantly to control these subtle speech qualities.
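To make the AI-judge step concrete, here is a minimal sketch of one way pairwise judging could work. The prompt wording, the `call_judge` callable, the A/B/TIE verdict labels, and the tie handling are all illustrative assumptions, not SpeechParaling-Bench's actual protocol.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical judge prompt: asks a text LLM to compare two annotated
# speech outputs against a target paralinguistic feature.
JUDGE_PROMPT = """You are rating paralinguistic control in speech output.
Target feature: {feature}
Instruction given to both models: {instruction}

Annotated output A: {output_a}
Annotated output B: {output_b}

Which output better realizes the target feature?
Answer with exactly one of: A, B, or TIE."""


@dataclass
class JudgedPair:
    feature: str       # e.g. "emotion: mild sarcasm"
    instruction: str   # the control instruction both models received
    output_a: str      # transcript/annotation of model A's speech
    output_b: str      # transcript/annotation of model B's speech


def judge_pair(pair: JudgedPair, call_judge: Callable[[str], str]) -> str:
    """Ask the judge model for a verdict and normalize it to A/B/TIE."""
    prompt = JUDGE_PROMPT.format(
        feature=pair.feature,
        instruction=pair.instruction,
        output_a=pair.output_a,
        output_b=pair.output_b,
    )
    verdict = call_judge(prompt).strip().upper()
    # Fall back to TIE if the judge's reply doesn't parse cleanly.
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"


def win_rate(pairs: list[JudgedPair], call_judge: Callable[[str], str]) -> float:
    """Fraction of comparisons where model A is preferred (ties count half)."""
    verdicts = [judge_pair(p, call_judge) for p in pairs]
    wins = verdicts.count("A") + 0.5 * verdicts.count("TIE")
    return wins / len(verdicts) if verdicts else 0.0
```

In practice, LLM-as-judge setups usually also swap the A/B presentation order and average the two verdicts to reduce position bias; whether this benchmark does so is not stated here.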