An augmented version of HumanEval with 80× more test cases per problem, designed to catch false positives that slip past the original, weaker test suite
Uses the same 164 HumanEval problems but adds 13,000+ additional test cases generated via type-aware input mutation and LLM-based input synthesis, averaging 80× more tests per problem. This catches solutions that pass the original tests but contain subtle bugs.
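
To make the mechanism concrete, here is a minimal sketch of the type-aware mutation idea: mutate seed inputs according to their Python type, then check each candidate solution against a trusted reference solution on the enlarged input pool (differential testing). This is an illustration only, not EvalPlus's actual implementation; the function names `mutate`, `generate_inputs`, and `differential_test` are hypothetical.

```python
import random
from typing import Any, Callable, List, Tuple

def mutate(value: Any) -> Any:
    """Type-aware mutation: perturb a value according to its Python type."""
    if isinstance(value, bool):          # check bool before int (bool is an int subclass)
        return not value
    if isinstance(value, int):
        return value + random.choice([-1, 1])
    if isinstance(value, float):
        return value * random.uniform(0.5, 1.5)
    if isinstance(value, str):
        if not value:
            return random.choice("abcxyz")
        i = random.randrange(len(value))
        return value[:i] + random.choice("abcxyz") + value[i + 1:]
    if isinstance(value, list):
        return [mutate(v) for v in value]
    if isinstance(value, tuple):
        return tuple(mutate(v) for v in value)
    return value                          # leave unsupported types unchanged

def generate_inputs(seed_inputs: List[Tuple], n_new: int) -> List[Tuple]:
    """Grow the input pool by repeatedly mutating existing inputs."""
    pool = list(seed_inputs)
    while len(pool) < len(seed_inputs) + n_new:
        base = random.choice(pool)
        pool.append(tuple(mutate(arg) for arg in base))
    return pool

def differential_test(candidate: Callable, reference: Callable,
                      inputs: List[Tuple]) -> List[Tuple]:
    """Flag every input where the candidate disagrees with the reference."""
    failures = []
    for args in inputs:
        try:
            if candidate(*args) != reference(*args):
                failures.append(args)
        except Exception:
            failures.append(args)
    return failures

if __name__ == "__main__":
    # A buggy candidate that truncates instead of rounding half-up: it passes
    # both seed tests but fails on many mutated inputs.
    reference = lambda x: int(x + 0.5) if x >= 0 else -int(-x + 0.5)
    candidate = lambda x: int(x)
    seeds = [(0.2,), (1.1,)]
    print(differential_test(candidate, reference, generate_inputs(seeds, 200)))
```

The key design point is that the expected output for each generated input comes from a reference solution rather than being hand-written, which is what lets an augmented suite scale to many times more tests per problem.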
The Open LLM Leaderboard covers open-weight models only; commercial API models (GPT-4o, Claude, Gemini) are not submitted to it, so their scores here come from provider-reported benchmarks.
| # | Model | Score |
|---|---|---|
| 1 | GPT-4o | 86.6% |