Adaptive Turn-Taking for Real-time Multi-Party Voice Agents

Soumyajit Mitra, Prabhat Pandey, Abhinav Jain, Shanmukha Sahith, K V Vijay Girish | June 11, 2026 | arXiv

Key Takeaway

Role-based conditioning significantly improves voice agents' decisions about when to speak in group conversations, a critical capability for real-time multi-party voice applications such as meeting assistants and collaborative AI systems.

Summary

This paper presents ModeratorLM, a voice agent that improves turn-taking in multi-party conversations by conditioning behavior on an explicitly assigned role (e.g., moderator, participant). The system uses a speech language model with streaming processing and optional chain-of-thought reasoning.
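To make the idea of role-conditioned turn-taking concrete, the following is a minimal toy sketch, not the paper's method. It replaces the speech language model with a hand-written rule and replaces audio with text chunks. All names (`Chunk`, `should_speak`, `run_stream`, `demo_chunks`) and the rules themselves are assumptions made for illustration. The only point is that the same streamed conversation can yield different speak/stay-silent decisions depending on the assigned role.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Chunk:
    """Hypothetical stand-in for one streamed chunk of speech."""
    speaker: str
    text: str        # transcript text standing in for audio
    ends_turn: bool  # True if the speaker pauses after this chunk


def should_speak(role: str, agent_name: str, history: List[Chunk]) -> bool:
    """Toy rule standing in for a model's role-conditioned decision."""
    last = history[-1]
    if not last.ends_turn:
        return False  # this sketch never interrupts mid-turn
    addressed = agent_name.lower() in last.text.lower()
    if role == "moderator":
        # Assumed behavior: a moderator also picks up open questions.
        return addressed or last.text.rstrip().endswith("?")
    # Assumed behavior: a participant speaks only when addressed by name.
    return addressed


def run_stream(role: str, agent_name: str, chunks: Iterable[Chunk]) -> Iterator[int]:
    """Process chunks one at a time; yield indices where the agent would speak."""
    history: List[Chunk] = []
    for i, chunk in enumerate(chunks):
        history.append(chunk)
        if should_speak(role, agent_name, history):
            yield i


demo_chunks = [
    Chunk("alice", "I think we should ship", False),
    Chunk("alice", "on Friday.", True),
    Chunk("bob", "Does anyone disagree?", True),
    Chunk("bob", "Agent, what do you think?", True),
]
```

With these demo chunks, `list(run_stream("moderator", "Agent", demo_chunks))` gives `[2, 3]`, while the same call with role `"participant"` gives `[3]`: the moderator takes the open question, and the participant waits to be addressed.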

agents, reasoning, multimodal

Key Terms

turn-taking-detection, role-playing-benchmark, streaming-inference, chain-of-thought-reasoning, synthetic-conversation-dataset