TestEvo-Bench: An Executable and Live Benchmark for Test and Code Co-Evolution

Jiale Amber Wang, Kaiyuan Wang, Pengyu Nie | July 2, 2026 | arXiv

Key Takeaway

Existing test-generation benchmarks do not check whether generated tests actually run or whether they match the code changes they target. TestEvo-Bench addresses this by grounding evaluation in real executable environments and commit history, and it shows that state-of-the-art agents still struggle on recent tasks.

Summary

TestEvo-Bench is a benchmark for evaluating AI agents on test and code co-evolution tasks: writing new tests for code changes and updating failing tests. Unlike static benchmarks, it draws on real commits from Java projects with executable environments, and it measures pass rates, coverage, and mutation scores.
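
To make the three execution-grounded metrics concrete, here is a minimal sketch of how they might be computed from the raw counts of one test run. The summary does not give the benchmark's exact formulas, so this assumes the conventional definitions (tests passed over tests run, lines covered over total lines, mutants killed over total mutants). The names `RunResult` and `execution_metrics` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Raw counts from executing a generated test suite (hypothetical schema)."""
    tests_run: int
    tests_passed: int
    lines_total: int
    lines_covered: int
    mutants_total: int
    mutants_killed: int


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def execution_metrics(result: RunResult) -> dict:
    """Compute the three metrics under the conventional definitions (an assumption)."""
    return {
        # Fraction of executed tests that passed.
        "pass_rate": ratio(result.tests_passed, result.tests_run),
        # Fraction of lines exercised by the tests (line coverage assumed).
        "coverage": ratio(result.lines_covered, result.lines_total),
        # Fraction of injected mutants that made at least one test fail.
        "mutation_score": ratio(result.mutants_killed, result.mutants_total),
    }


if __name__ == "__main__":
    example = RunResult(
        tests_run=4, tests_passed=3,
        lines_total=200, lines_covered=150,
        mutants_total=20, mutants_killed=12,
    )
    print(execution_metrics(example))
```

All three values require actually executing the tests, which is why a static benchmark that only compares generated text against a reference cannot report them.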

evaluation, agents, applications

Key Terms

test-generation, code-reasoning, mutation-score, execution-grounded-metrics, data-contamination