Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas

Yuxuan Li, Lingxi Xie, Xinyue Huo, Jihao Qiu, Jiacheng Shao et al. | July 2, 2026 | arXiv

Key Takeaway

Reasoning models can improve speaker identification in video by combining multiple modalities with contextual evidence, outperforming traditional audio-only approaches on challenging cases such as short utterances.

Summary

This paper tackles speaker recognition in long-form TV dramas by introducing DramaSR-532K, a large benchmark with 532K annotated dialogue lines, and DramaSR-LRM, a reasoning-based approach that combines audio, text, and visual information to accurately identify which character is speaking. The method works especially well on short utterances where voice alone isn't reliable.

multimodal reasoning applications

Key Terms

speaker-recognition, multimodal-input, tool-use, reasoning-model