Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?

Zhi Chen, Zhensu Sun, Yuling Shi, David Lo, Lingxiao Jiang | July 1, 2026 | arXiv

Key Takeaway

Performance-optimization benchmarks for coding agents have significant reliability issues: reference patches are unstable across machines, scoring rules heavily influence rankings, and most tasks are already solved by existing submissions, making leaderboard positions unreliable indicators of tru...

Summary

This paper audits three major benchmarks that evaluate coding agents on performance-optimization tasks.
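
To make the point about scoring rules concrete, here is a minimal sketch of how two plausible rules can order the same agents differently. The agent names, speedup values, threshold, and both scoring functions are hypothetical and chosen only for illustration; they are not taken from the paper or from any of the audited benchmarks.

```python
# Hypothetical per-task speedups (baseline runtime / patched runtime).
# These numbers are invented for illustration, not measured results.
speedups = {
    "agent_a": [4.0, 1.0, 1.0, 1.0],  # one large win, no change elsewhere
    "agent_b": [1.3, 1.3, 1.3, 1.3],  # modest win on every task
}


def mean_speedup(values):
    """Rule 1 (assumed): arithmetic mean of per-task speedups."""
    return sum(values) / len(values)


def solve_rate(values, threshold=1.2):
    """Rule 2 (assumed): fraction of tasks whose speedup meets a threshold."""
    return sum(v >= threshold for v in values) / len(values)


def rank(score_fn):
    """Return agent names ordered from best to worst under score_fn."""
    return sorted(speedups, key=lambda name: score_fn(speedups[name]), reverse=True)


print(rank(mean_speedup))  # agent_a first: mean 1.75 vs 1.30
print(rank(solve_rate))    # agent_b first: solve rate 1.00 vs 0.25
```

Under the mean-speedup rule a single large win dominates, while the threshold rule rewards consistency, so the leaderboard order flips even though the underlying measurements are identical.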

evaluation, agents, applications

Key Terms

benchmark, coding-agent, leaderboard, runtime-instability