Continuously updated coding benchmark using new competitive programming problems from LeetCode, AtCoder, and Codeforces to prevent contamination
Collects new competitive programming problems published after the training cutoff dates of the evaluated models. Tasks cover code generation, self-repair, code execution prediction, and test output prediction. The problem set is refreshed automatically to avoid benchmark contamination.
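The date-based filtering described above can be illustrated with a minimal sketch. It assumes each problem carries a release date and each model a known training cutoff; the `Problem` class, the `uncontaminated` helper, the problem titles, and all dates are hypothetical and are not taken from the benchmark's actual tooling.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    """Hypothetical record for one competitive programming problem."""
    source: str      # e.g. "LeetCode", "AtCoder", "Codeforces"
    title: str
    released: date   # date the problem was first published


def uncontaminated(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Illustrative data only; titles and dates are made up.
problems = [
    Problem("LeetCode", "example-a", date(2024, 1, 15)),
    Problem("AtCoder", "example-b", date(2024, 6, 2)),
    Problem("Codeforces", "example-c", date(2024, 9, 20)),
]

# Assumed training cutoff for some evaluated model.
cutoff = date(2024, 4, 30)

eligible = uncontaminated(problems, cutoff)
for p in eligible:
    print(p.source, p.title, p.released.isoformat())
```

Because the filter depends on the cutoff, each model may be scored on a different subset of problems, so comparisons are most meaningful over a window that postdates every compared model's cutoff.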
Shows open-weight models only. Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard; their scores come from provider-reported benchmarks.
| # | Model | Score |
|---|---|---|
| 1 | Qwen3 235B A22B | 70.7% |
| 2 | Gemini 2.5 Pro | 70.4% |
| 3 | DeepSeek R1 | 65.9% |