Long-form audio reasoning needs better training data and evaluation benchmarks. Synthetic generation with realistic audio characteristics can provide both, and traditional cascaded pipelines (speech-to-text followed by text summarization) still beat end-to-end models on this task.
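To make the cascade concrete, here is a minimal sketch using off-the-shelf Hugging Face pipelines. The checkpoints (Whisper for transcription, BART-CNN for summarization) are illustrative assumptions, not the systems evaluated in this work, and a real long-form transcript would need chunked or hierarchical summarization rather than the single truncated call shown here.

```python
from transformers import pipeline

# Stage 1: chunked speech-to-text. chunk_length_s lets the ASR pipeline
# handle audio longer than the model's 30-second window.
# NOTE: checkpoint choices below are illustrative, not from the paper.
asr = pipeline("automatic-speech-recognition",
               model="openai/whisper-small",
               chunk_length_s=30)

# Stage 2: text-only summarization of the transcript.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def cascaded_summary(audio_path: str) -> str:
    transcript = asr(audio_path)["text"]
    # BART's input window is ~1024 tokens, so this truncates; production
    # systems summarize chunk-by-chunk and then merge.
    result = summarizer(transcript, max_length=256, min_length=64,
                        truncation=True)
    return result[0]["summary_text"]
```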
Researchers created a synthetic dataset of 8,800 doctor-patient conversations (1,300 hours of audio) to train and evaluate AI systems on long-form audio understanding. The pipeline generates realistic dialogues, synthesizes multi-speaker audio with background noise, and produces medical summaries (SOAP notes) as reference outputs; every stage uses open-source models.
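As a rough illustration of that three-stage recipe (dialogue text, multi-speaker audio with noise, reference SOAP note), here is a minimal Python sketch. The `generate_dialogue`, `synthesize_turn`, and `summarize_to_soap` helpers are hypothetical stand-ins for the LLM and TTS calls, and the 10 dB SNR is an illustrative value; only the noise-mixing math is fully worked out.

```python
import numpy as np

# --- Hypothetical stand-ins for the pipeline's model calls (not the paper's code) ---

def generate_dialogue(scenario: str) -> list[tuple[str, str]]:
    """Stand-in for an LLM call that drafts a doctor-patient dialogue."""
    return [("doctor", f"What brings you in today regarding {scenario}?"),
            ("patient", "I've had a persistent cough for two weeks.")]

def synthesize_turn(role: str, text: str, sr: int) -> np.ndarray:
    """Stand-in for multi-speaker TTS: a distinct tone per role, one second per turn."""
    freq = 220.0 if role == "doctor" else 330.0
    t = np.linspace(0, 1.0, sr, endpoint=False)
    return 0.3 * np.sin(2 * np.pi * freq * t)

def summarize_to_soap(transcript: str) -> str:
    """Stand-in for an LLM call that writes the reference SOAP note."""
    return "S: cough x2 weeks. O: ... A: ... P: ..."

# --- The one fully worked step: mixing background noise at a target SNR ---

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    if len(noise) < len(speech):  # loop the noise clip to cover the speech
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale noise so that 10 * log10(p_speech / p_noise_scaled) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    mixed = speech + scale * noise
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed  # avoid clipping

def build_sample(scenario: str, sr: int = 16_000):
    turns = generate_dialogue(scenario)
    audio = np.concatenate([synthesize_turn(role, text, sr) for role, text in turns])
    noise = np.random.default_rng(0).normal(0, 0.1, sr)  # placeholder ambience clip
    audio = mix_at_snr(audio, noise, snr_db=10.0)        # SNR value is illustrative
    transcript = "\n".join(f"{role}: {text}" for role, text in turns)
    return audio, transcript, summarize_to_soap(transcript)
```

In a real run, the stand-ins would be replaced by the open-source dialogue, TTS, and summarization models the pipeline uses; the SNR-controlled mixing is what gives the audio its "realistic" character rather than clean studio speech.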