Current large audio-language models fail to properly control or interpret paralinguistic cues (emotion, tone, style) in speech; these failures account for 43% of errors in conversational tasks, a critical gap for building natural-sounding voice assistants.
SpeechParaling-Bench is a benchmark for testing how well AI speech models handle paralinguistic features such as emotion, tone, and speaking style. It covers more than 100 fine-grained features across 1,000+ English and Chinese speech samples and uses an AI judge to compare model outputs consistently. Evaluations show that current models struggle significantly to control these subtle speech qualities.
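To make the AI-judge step concrete, here is a minimal sketch of one way pairwise judging could work. The prompt wording, the `call_judge` callable, the A/B/TIE verdict labels, and the tie handling are all illustrative assumptions, not SpeechParaling-Bench's actual protocol.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical judge prompt: asks a text LLM to compare two annotated
# speech outputs against a target paralinguistic feature.
JUDGE_PROMPT = """You are rating paralinguistic control in speech output.
Target feature: {feature}
Instruction given to both models: {instruction}

Annotated output A: {output_a}
Annotated output B: {output_b}

Which output better realizes the target feature?
Answer with exactly one of: A, B, or TIE."""


@dataclass
class JudgedPair:
    feature: str       # e.g. "emotion: mild sarcasm"
    instruction: str   # the control instruction both models received
    output_a: str      # transcript/annotation of model A's speech
    output_b: str      # transcript/annotation of model B's speech


def judge_pair(pair: JudgedPair, call_judge: Callable[[str], str]) -> str:
    """Ask the judge model for a verdict and normalize it to A/B/TIE."""
    prompt = JUDGE_PROMPT.format(
        feature=pair.feature,
        instruction=pair.instruction,
        output_a=pair.output_a,
        output_b=pair.output_b,
    )
    verdict = call_judge(prompt).strip().upper()
    # Fall back to TIE if the judge's reply doesn't parse cleanly.
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"


def win_rate(pairs: list[JudgedPair], call_judge: Callable[[str], str]) -> float:
    """Fraction of comparisons where model A is preferred (ties count half)."""
    verdicts = [judge_pair(p, call_judge) for p in pairs]
    wins = verdicts.count("A") + 0.5 * verdicts.count("TIE")
    return wins / len(verdicts) if verdicts else 0.0
```

In practice, LLM-as-judge setups usually also swap the A/B presentation order and average the two verdicts to reduce position bias; whether this benchmark does so is not stated here.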