Long-form audio reasoning needs better training data and evaluation benchmarks. Synthetic generation with realistic audio characteristics can provide both, and traditional cascaded pipelines (speech-to-text followed by text summarization) still beat end-to-end models on this task.
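To make the cascade concrete, here is a minimal sketch using off-the-shelf Hugging Face pipelines. The checkpoints (Whisper for transcription, BART-CNN for summarization) are illustrative assumptions, not the systems evaluated in this work, and a real long-form transcript would need chunked or hierarchical summarization rather than the single truncated call shown here.

```python
from transformers import pipeline

# Stage 1: chunked speech-to-text. chunk_length_s lets the ASR pipeline
# handle audio longer than the model's 30-second window.
# NOTE: checkpoint choices below are illustrative, not from the paper.
asr = pipeline("automatic-speech-recognition",
               model="openai/whisper-small",
               chunk_length_s=30)

# Stage 2: text-only summarization of the transcript.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def cascaded_summary(audio_path: str) -> str:
    transcript = asr(audio_path)["text"]
    # BART's input window is ~1024 tokens, so this truncates; production
    # systems summarize chunk-by-chunk and then merge.
    result = summarizer(transcript, max_length=256, min_length=64,
                        truncation=True)
    return result[0]["summary_text"]
```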
Researchers created a synthetic dataset of 8,800 doctor-patient conversations (1,300 hours of audio) to train and evaluate AI systems on long-form audio understanding. The pipeline generates realistic dialogues, synthesizes multi-speaker audio with background noise, and produces medical summaries (SOAP notes) as reference outputs; every stage uses open-source models.
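As a rough illustration of that three-stage recipe (dialogue text, multi-speaker audio with noise, reference SOAP note), here is a minimal Python sketch. The `generate_dialogue`, `synthesize_turn`, and `summarize_to_soap` helpers are hypothetical stand-ins for the LLM and TTS calls, and the 10 dB SNR is an illustrative value; only the noise-mixing math is fully worked out.

```python
import numpy as np

# --- Hypothetical stand-ins for the pipeline's model calls (not the paper's code) ---

def generate_dialogue(scenario: str) -> list[tuple[str, str]]:
    """Stand-in for an LLM call that drafts a doctor-patient dialogue."""
    return [("doctor", f"What brings you in today regarding {scenario}?"),
            ("patient", "I've had a persistent cough for two weeks.")]

def synthesize_turn(role: str, text: str, sr: int) -> np.ndarray:
    """Stand-in for multi-speaker TTS: a distinct tone per role, one second per turn."""
    freq = 220.0 if role == "doctor" else 330.0
    t = np.linspace(0, 1.0, sr, endpoint=False)
    return 0.3 * np.sin(2 * np.pi * freq * t)

def summarize_to_soap(transcript: str) -> str:
    """Stand-in for an LLM call that writes the reference SOAP note."""
    return "S: cough x2 weeks. O: ... A: ... P: ..."

# --- The one fully worked step: mixing background noise at a target SNR ---

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    if len(noise) < len(speech):  # loop the noise clip to cover the speech
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale noise so that 10 * log10(p_speech / p_noise_scaled) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    mixed = speech + scale * noise
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed  # avoid clipping

def build_sample(scenario: str, sr: int = 16_000):
    turns = generate_dialogue(scenario)
    audio = np.concatenate([synthesize_turn(role, text, sr) for role, text in turns])
    noise = np.random.default_rng(0).normal(0, 0.1, sr)  # placeholder ambience clip
    audio = mix_at_snr(audio, noise, snr_db=10.0)        # SNR value is illustrative
    transcript = "\n".join(f"{role}: {text}" for role, text in turns)
    return audio, transcript, summarize_to_soap(transcript)
```

In a real run, the stand-ins would be replaced by the open-source dialogue, TTS, and summarization models the pipeline uses; the SNR-controlled mixing is what gives the audio its "realistic" character rather than clean studio speech.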