An augmented version of HumanEval with 80× more test cases per problem, designed to catch false positives that slip past the original, weaker test suite
Uses the same 164 HumanEval problems but adds 13,000+ additional test cases generated via type-aware input mutation and LLM-based input synthesis, averaging 80× more tests per problem. This catches solutions that pass the original tests but contain subtle bugs.
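
To make the mechanism concrete, here is a minimal sketch of the type-aware mutation idea: mutate seed inputs according to their Python type, then check each candidate solution against a trusted reference solution on the enlarged input pool (differential testing). This is an illustration only, not EvalPlus's actual implementation; the function names `mutate`, `generate_inputs`, and `differential_test` are hypothetical.

```python
import random
from typing import Any, Callable, List, Tuple

def mutate(value: Any) -> Any:
    """Type-aware mutation: perturb a value according to its Python type."""
    if isinstance(value, bool):          # check bool before int (bool is an int subclass)
        return not value
    if isinstance(value, int):
        return value + random.choice([-1, 1])
    if isinstance(value, float):
        return value * random.uniform(0.5, 1.5)
    if isinstance(value, str):
        if not value:
            return random.choice("abcxyz")
        i = random.randrange(len(value))
        return value[:i] + random.choice("abcxyz") + value[i + 1:]
    if isinstance(value, list):
        return [mutate(v) for v in value]
    if isinstance(value, tuple):
        return tuple(mutate(v) for v in value)
    return value                          # leave unsupported types unchanged

def generate_inputs(seed_inputs: List[Tuple], n_new: int) -> List[Tuple]:
    """Grow the input pool by repeatedly mutating existing inputs."""
    pool = list(seed_inputs)
    while len(pool) < len(seed_inputs) + n_new:
        base = random.choice(pool)
        pool.append(tuple(mutate(arg) for arg in base))
    return pool

def differential_test(candidate: Callable, reference: Callable,
                      inputs: List[Tuple]) -> List[Tuple]:
    """Flag every input where the candidate disagrees with the reference."""
    failures = []
    for args in inputs:
        try:
            if candidate(*args) != reference(*args):
                failures.append(args)
        except Exception:
            failures.append(args)
    return failures

if __name__ == "__main__":
    # A buggy candidate that truncates instead of rounding half-up: it passes
    # both seed tests but fails on many mutated inputs.
    reference = lambda x: int(x + 0.5) if x >= 0 else -int(-x + 0.5)
    candidate = lambda x: int(x)
    seeds = [(0.2,), (1.1,)]
    print(differential_test(candidate, reference, generate_inputs(seeds, 200)))
```

The key design point is that the expected output for each generated input comes from a reference solution rather than being hand-written, which is what lets an augmented suite scale to many times more tests per problem.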
The Open LLM Leaderboard covers open-weight models only; commercial API models (GPT-4o, Claude, Gemini) are not submitted to it, so their scores here come from provider-reported benchmarks.
| # | Model | Score |
|---|---|---|
| 1 | GPT-4o | 86.6% |