Existing test-generation benchmarks do not verify whether generated tests actually run or whether they match the code changes they accompany. This benchmark addresses that gap by grounding evaluation in real executable environments and commit history, and it reveals that state-of-the-art agents still struggle on recent tasks.
TestEvo-Bench is a benchmark for evaluating AI agents on test and code co-evolution tasks: writing new tests for code changes and updating failing tests. Unlike static benchmarks, it uses real commits from Java projects, paired with executable environments, to measure pass rates, coverage, and mutation scores.
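
As a rough illustration of the three execution-based metrics named above, the sketch below computes each one as a simple ratio from raw counts. The definitions are the conventional ones (passed tests over executed tests, covered lines over total lines, killed mutants over generated mutants) and are an assumption here; the benchmark's exact formulas may differ. The `RunResult` class, its field names, and the `summarize` helper are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Hypothetical raw counts from executing an agent's tests on one task."""
    tests_run: int
    tests_passed: int
    lines_total: int
    lines_covered: int
    mutants_total: int
    mutants_killed: int


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def summarize(result: RunResult) -> dict[str, float]:
    """Compute the three metrics under the assumed conventional definitions."""
    return {
        # Fraction of executed tests that passed.
        "pass_rate": ratio(result.tests_passed, result.tests_run),
        # Fraction of lines executed by the tests.
        "line_coverage": ratio(result.lines_covered, result.lines_total),
        # Fraction of injected mutants detected (killed) by the tests.
        "mutation_score": ratio(result.mutants_killed, result.mutants_total),
    }


# Made-up counts, purely to show the arithmetic.
example = RunResult(
    tests_run=10, tests_passed=8,
    lines_total=200, lines_covered=150,
    mutants_total=40, mutants_killed=30,
)
metrics = summarize(example)
print(metrics)
```

The three ratios are kept separate rather than merged into one score because they answer different questions: whether the tests run, how much code they exercise, and whether they would detect a fault.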