LLM-generated synthetic conversations paired with text-to-speech (TTS) can effectively replace scarce real conversational data for training speech recognition systems, especially when real multi-speaker dialogue is expensive to collect.
This paper shows how to train better speech recognition systems for low-resource languages by generating synthetic conversations with LLMs and text-to-speech. Instead of collecting expensive real conversations, the authors have an LLM write multi-speaker dialogues conditioned on realistic speaker metadata, then use TTS to synthesize the corresponding audio.
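The pipeline described above (speaker metadata, then LLM dialogue, then per-turn TTS audio) might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Speaker` fields, the prompt wording, the `Name: utterance` output format, and the `stub_llm` and `stub_tts` functions are all hypothetical stand-ins for whatever LLM and TTS systems are actually used.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Speaker:
    # Hypothetical metadata fields; the paper's actual schema may differ.
    name: str
    age: int
    gender: str
    voice_id: str  # assumed identifier of a TTS voice matched to this speaker


@dataclass(frozen=True)
class Turn:
    speaker: Speaker
    text: str


def build_prompt(speakers: List[Speaker], topic: str, language: str) -> str:
    """Compose an LLM prompt that asks for 'Name: utterance' lines."""
    roster = "\n".join(f"- {s.name}, {s.age}, {s.gender}" for s in speakers)
    return (
        f"Write a natural conversation in {language} about {topic}.\n"
        f"Speakers:\n{roster}\n"
        "Format every line as 'Name: utterance'."
    )


def parse_dialogue(raw: str, speakers: List[Speaker]) -> List[Turn]:
    """Convert 'Name: utterance' lines to turns, skipping malformed lines."""
    by_name: Dict[str, Speaker] = {s.name: s for s in speakers}
    turns: List[Turn] = []
    for line in raw.splitlines():
        name, sep, text = line.partition(":")
        speaker = by_name.get(name.strip())
        if sep and speaker is not None and text.strip():
            turns.append(Turn(speaker, text.strip()))
    return turns


def synthesize(
    turns: List[Turn], tts: Callable[[str, str], bytes], prefix: str
) -> List[dict]:
    """Run TTS per turn and build (audio, transcript) training records."""
    manifest = []
    for i, turn in enumerate(turns):
        audio = tts(turn.text, turn.speaker.voice_id)
        manifest.append(
            {
                "id": f"{prefix}_{i:03d}",
                "speaker": turn.speaker.name,
                "text": turn.text,
                "num_bytes": len(audio),
            }
        )
    return manifest


def stub_llm(prompt: str) -> str:
    # Placeholder for a real LLM call; returns a canned dialogue.
    return (
        "Amina: Did you finish the report?\n"
        "Jonas: Almost, I need one more hour.\n"
        "this line has no speaker label\n"
        "Amina: Great, thanks."
    )


def stub_tts(text: str, voice_id: str) -> bytes:
    # Placeholder for a real TTS engine; returns dummy bytes, not audio.
    return f"{voice_id}|{text}".encode("utf-8")


speakers = [
    Speaker("Amina", 34, "female", "voice_a"),
    Speaker("Jonas", 41, "male", "voice_b"),
]
prompt = build_prompt(speakers, "a project deadline", "English")
turns = parse_dialogue(stub_llm(prompt), speakers)
manifest = synthesize(turns, stub_tts, "dlg0")
```

Each manifest record pairs a synthesized utterance with its transcript and speaker label, which is the form of supervision a speech recognition model trains on. Keeping the LLM and TTS behind plain callables lets either component be swapped without touching the parsing or manifest logic.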