Reasoning models can improve speaker identification in video by combining multiple modalities and contextual evidence, outperforming traditional audio-only approaches on challenging cases.
This paper tackles speaker recognition in long-form TV dramas. It introduces DramaSR-532K, a large benchmark of 532K annotated dialogue lines, and DramaSR-LRM, a reasoning-based approach that combines audio, text, and visual information to identify which character is speaking. The method works especially well on short utterances, where voice alone is not a reliable cue.
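To make concrete why extra modalities help on short utterances, here is a minimal sketch of a plain late-fusion heuristic. It is not the DramaSR-LRM method, which is described as reasoning-based; it only illustrates the intuition that voice evidence should count for less when the clip is short. Every name here (`Evidence`, `audio_weight`, `identify_speaker`), the 3-second threshold, the character names, and all the scores are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Evidence:
    """Hypothetical per-utterance scores for each candidate character."""
    audio: Dict[str, float]   # voice-similarity score per character
    text: Dict[str, float]    # dialogue-context score per character
    visual: Dict[str, float]  # on-screen / lip-activity score per character
    duration_s: float         # utterance length in seconds


def audio_weight(duration_s: float, full_trust_s: float = 3.0) -> float:
    """Scale trust in the voice cue linearly with duration, capped at 1.0.

    The 3-second default is an illustrative assumption, not a value
    taken from the paper.
    """
    return min(1.0, max(0.0, duration_s / full_trust_s))


def identify_speaker(ev: Evidence) -> str:
    """Return the character with the highest fused score."""
    w_audio = audio_weight(ev.duration_s)
    candidates = set(ev.audio) | set(ev.text) | set(ev.visual)
    fused = {
        name: w_audio * ev.audio.get(name, 0.0)
        + ev.text.get(name, 0.0)
        + ev.visual.get(name, 0.0)
        for name in candidates
    }
    return max(fused, key=fused.get)


# A half-second line: the voice cue slightly favors Anna,
# while context and on-screen evidence favor Ben.
short_line = Evidence(
    audio={"Anna": 0.6, "Ben": 0.4},
    text={"Anna": 0.3, "Ben": 0.7},
    visual={"Anna": 0.2, "Ben": 0.8},
    duration_s=0.5,
)

# A six-second line with a clear voice match and uninformative context.
long_line = Evidence(
    audio={"Anna": 0.9, "Ben": 0.1},
    text={"Anna": 0.5, "Ben": 0.5},
    visual={"Anna": 0.5, "Ben": 0.5},
    duration_s=6.0,
)

if __name__ == "__main__":
    print(identify_speaker(short_line))
    print(identify_speaker(long_line))
```

In this toy setup an audio-only decision on the short line would pick Anna, while the fused score picks Ben because the voice cue is down-weighted. On the long line the voice cue carries full weight and decides the outcome.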