LLMs outperform traditional word-error metrics for evaluating speech recognition by understanding semantic meaning rather than just counting mistakes, opening the door to more human-aligned ASR evaluation.
This paper shows that large language models can evaluate speech recognition quality far more reliably than traditional metrics such as Word Error Rate. Rather than simply counting mismatched words, an LLM can judge whether a transcript preserves the intended meaning and classify the errors it finds in ways that match human judgments of speech quality, reaching 92-94% agreement with human raters.
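To make the contrast concrete, here is a minimal sketch (not the paper's actual evaluation pipeline) of the two approaches: a literal Word Error Rate computation versus a prompt asking an LLM to judge whether a transcript preserves the reference meaning. The prompt wording, the 1-5 scale, and the example sentences are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch: WER vs. an LLM-based semantic judgment of an ASR transcript.
# The prompt text and rating scale below are assumptions for illustration only.

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance (substitutions, insertions, deletions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


def llm_judge_prompt(reference: str, hypothesis: str) -> str:
    """Builds a prompt asking an LLM to rate semantic fidelity (hypothetical wording)."""
    return (
        "You are evaluating an automatic speech recognition transcript.\n"
        f"Reference: {reference}\n"
        f"Transcript: {hypothesis}\n"
        "Does the transcript preserve the meaning of the reference? "
        "Answer with a rating from 1 (meaning lost) to 5 (meaning fully preserved) "
        "and name the error type (e.g. meaning-changing substitution, harmless filler)."
    )


if __name__ == "__main__":
    ref = "do not send the payment until friday"
    hyp_minor = "do not send the payment until friday okay"  # harmless insertion
    hyp_major = "do send the payment until friday"           # drops "not", flips the meaning

    # Both hypotheses get the same WER (one edit out of seven words),
    # even though one is harmless and the other reverses the instruction.
    print(f"WER (minor error) = {wer(ref, hyp_minor):.2f}")
    print(f"WER (major error) = {wer(ref, hyp_major):.2f}")

    # An LLM judge sees the full sentences and can rate the second case as meaning-breaking.
    print(llm_judge_prompt(ref, hyp_major))
```

The toy example illustrates the paper's core point: the two transcripts are indistinguishable to WER, while a semantic judge can separate a cosmetic insertion from an error that inverts the speaker's intent.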