Performance-optimization benchmarks for coding agents have significant reliability issues: reference patches are unstable across machines, scoring rules heavily influence rankings, and most tasks are already solved by existing submissions, making leaderboard positions unreliable indicators of tru...
This paper audits three major benchmarks that evaluate coding agents on performance-optimization tasks.