Many AI benchmarks contain hidden flaws that distort model rankings and performance scores; automated auditing can catch these issues at scale and improve the reliability of capability assessments.
This paper introduces Auto Benchmark Audit (ABA), an AI agent that automatically checks benchmark tasks for hidden problems such as incomplete specifications, environment conflicts, and broken evaluation logic. Applied to 168 benchmarks across nine domains, ABA found critical issues in over 25% of tasks, issues that human reviewers had missed.