Efficient ASR Training with Conversations that Never Happened

Máté Gedeon, Péter Mihajlik | June 2, 2026 | arXiv

Key Takeaway

LLM-generated synthetic conversations paired with TTS can effectively replace scarce real conversational data for training speech recognition systems, especially when real multi-speaker dialogue is expensive to collect.

Summary

This paper shows how to train better speech recognition systems for low-resource languages using synthetic conversations produced with LLMs and text-to-speech. Instead of collecting expensive real conversations, the authors have an LLM write multi-speaker dialogues with realistic speaker metadata, then synthesize the audio with TTS, so each generated transcript serves as the label for its audio.
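The pipeline described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Speaker` fields, the prompt wording, the `Name: utterance` turn format, the `voice_id` values, and the manifest layout are all hypothetical, and the actual LLM and TTS calls are left out because the summary does not name the systems used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Speaker:
    # Hypothetical speaker metadata; the paper's actual fields are not given here.
    name: str
    age: int
    gender: str
    voice_id: str  # assumed identifier for a TTS voice


def build_prompt(speakers, topic):
    """Compose an LLM prompt asking for a dialogue between the given speakers."""
    roster = "\n".join(f"- {s.name} ({s.age}, {s.gender})" for s in speakers)
    return (
        f"Write a natural spoken conversation about {topic}.\n"
        f"Speakers:\n{roster}\n"
        "Format each turn as 'Name: utterance' on its own line."
    )


def parse_dialogue(text, speakers):
    """Parse 'Name: utterance' lines into (Speaker, utterance) turns.

    Lines without a colon, with an unknown speaker name, or with an
    empty utterance are skipped.
    """
    by_name = {s.name: s for s in speakers}
    turns = []
    for line in text.splitlines():
        name, sep, utterance = line.partition(":")
        speaker = by_name.get(name.strip())
        if sep and speaker and utterance.strip():
            turns.append((speaker, utterance.strip()))
    return turns


def tts_manifest(turns, dialogue_id):
    """Build one TTS job per turn; the text doubles as the ASR transcript."""
    return [
        {
            "audio_path": f"{dialogue_id}_{i:03d}.wav",
            "voice_id": speaker.voice_id,
            "text": utterance,
        }
        for i, (speaker, utterance) in enumerate(turns)
    ]
```

In use, the string returned by `build_prompt` would be sent to an LLM, the reply passed through `parse_dialogue`, and each entry of `tts_manifest` handed to a TTS system. The resulting audio files and their `text` fields then form paired training data for ASR.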

data training applications

Key Terms

automatic-speech-recognition text-to-speech synthetic-data data-augmentation