AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation

Ziwei Zhou, Zeyuan Lai, Rui Wang, Yifan Yang, Zhen Xing et al. | April 9, 2026 | arXiv

Key Takeaway

Current text-to-audio-video generators produce visually appealing output but fail at semantic tasks such as rendering text, maintaining speech coherence, and controlling musical pitch. Evaluation needs to go beyond visual aesthetics to catch these failures.

Summary

AVGen-Bench is a benchmark for evaluating text-to-audio-video generation systems across 11 real-world tasks. It uses specialist models and multimodal AI to assess both perceptual quality and semantic accuracy, revealing that current systems struggle with text rendering, speech coherence, physical reasoning, and musical pitch control despite producing visually appealing outputs.
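The gap between perceptual quality and semantic accuracy described above can be illustrated with a small aggregation sketch. This is not AVGen-Bench's actual scoring code: the `SampleScore` record, the `summarize` function, the [0, 1] score range, and the `gap_threshold` value are all assumptions made here for illustration. The idea is only to show how per-task averages along two axes can surface tasks that look good but miss the prompt.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SampleScore:
    """One scored generation (hypothetical record, not the benchmark's schema)."""
    task: str          # task label, e.g. "text_rendering" (illustrative name)
    perceptual: float  # assumed in [0, 1], e.g. from a quality-oriented specialist model
    semantic: float    # assumed in [0, 1], e.g. from a multimodal judge


def summarize(scores, gap_threshold=0.3):
    """Average both axes per task and flag tasks whose perceptual mean
    exceeds the semantic mean by more than gap_threshold (an assumed cutoff)."""
    by_task = {}
    for s in scores:
        by_task.setdefault(s.task, []).append(s)

    report = {}
    for task, items in by_task.items():
        perceptual = mean(i.perceptual for i in items)
        semantic = mean(i.semantic for i in items)
        report[task] = {
            "perceptual": perceptual,
            "semantic": semantic,
            "flagged": (perceptual - semantic) > gap_threshold,
        }
    return report
```

Reporting the two axes separately, rather than collapsing them into one number, keeps a high aesthetic score from masking a semantic failure, which is the kind of failure the summary says current systems exhibit.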

evaluation, multimodal, applications

Key Terms

multimodal-evaluation, semantic-controllability, text-to-audio-video-generation, specialist-models