Current text-to-audio-video models excel at aesthetic quality but fail at semantic control: they cannot reliably render text, maintain speech coherence, or control musical pitch, which shows that evaluation needs to go beyond visual appeal.
AVGen-Bench is a benchmark for evaluating text-to-audio-video generation systems across 11 real-world tasks. It combines specialist models with multimodal AI to assess both perceptual quality and semantic accuracy, and it finds that current systems produce visually appealing content but struggle with text rendering, speech coherence, and musical pitch control.