Current text-to-audio-video generators produce polished-looking output but fail at semantic tasks such as rendering text, maintaining speech coherence, and controlling musical pitch, so evaluation needs to go beyond visual aesthetics to catch these failures.
AVGen-Bench is a benchmark for evaluating text-to-audio-video generation systems across 11 real-world tasks. It uses specialist models and multimodal AI to assess both perceptual quality and semantic accuracy. The evaluation reveals that current systems struggle with text rendering, speech coherence, physical reasoning, and musical pitch control despite producing visually appealing outputs.