Real-time voice AI systems can hear emotional cues but do not use them in decision-making. They need explicit prompting to consider tone, and even then improve only partially, which makes them risky for emotionally sensitive interactions.
This paper evaluates four leading real-time voice AI systems (including GPT-4 Realtime, Gemini Live, and Qwen Omni) and finds that they ignore emotional tone and vocal delivery when making decisions, even though they can perceive these cues when asked about them directly.