A harder variant of MMLU with 10 answer choices instead of 4 and more reasoning-intensive questions, which reduces the noise from random guessing.
12,032 questions across 14 domains, expanded from MMLU's 4-choice format to a 10-choice format. Questions are filtered for difficulty and augmented with reasoning-heavy problems from STEM sources. The larger answer set lowers the random-guessing baseline from 25% to 10%.
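The effect of the answer-set size on guessing can be sketched with a little arithmetic. The snippet below is a minimal illustration, not the leaderboard's scoring code: the helper names `random_baseline` and `chance_corrected` are hypothetical, and the rescaling (random guessing maps to 0, a perfect score to 1) is one common convention assumed here for illustration. Scores below 10% on a 10-choice benchmark would be consistent with a rescaling of this kind, but this page does not state how its scores are computed.

```python
def random_baseline(num_choices: int) -> float:
    """Expected accuracy when picking uniformly at random among the choices."""
    return 1.0 / num_choices


def chance_corrected(raw_accuracy: float, num_choices: int) -> float:
    """Rescale accuracy so random guessing maps to 0 and a perfect score to 1.

    Assumption: this linear rescaling is illustrative only; it is not
    confirmed to be the formula behind the scores on this page.
    """
    baseline = random_baseline(num_choices)
    return (raw_accuracy - baseline) / (1.0 - baseline)


# Random-guessing baselines for the two formats.
print(random_baseline(4))   # 4-choice MMLU
print(random_baseline(10))  # 10-choice MMLU-Pro

# A hypothetical model with 40% raw accuracy in each format:
# the same raw number sits further above chance with 10 choices.
print(round(chance_corrected(0.40, 4), 3))
print(round(chance_corrected(0.40, 10), 3))
```

With fewer points attainable by luck, a given raw accuracy on the 10-choice format reflects more genuine signal than the same accuracy on the 4-choice format.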
This table shows open-weight models only. Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard; their scores come from provider-reported benchmarks.
| # | Model | Score |
|---|---|---|
| 1 | Qwen2.5 32B | 53.4% |
| 2 | Qwen2.5 32B Instruct | 51.9% |
| 3 | Qwen2.5 14B Instruct | 43.4% |
| 4 | Phi 3 medium 128k instruct | 41.2% |
| 5 | gemma 2 27b it | 38.3% |
| 6 | Qwen2.5 7B | 37.4% |
| 7 | Qwen2.5 7B Instruct | 36.5% |
| 8 | gemma 2 9b it | 31.9% |
| 9 | Llama 3.1 8B Instruct | 31.1% |
| 10 | Qwen2.5 Coder 7B Instruct | 26.1% |
| 11 | Qwen2.5 3B Instruct | 25.1% |
| 12 | Meta Llama 3 8B | 24.6% |
| 13 | Llama 3.2 3B Instruct | 24.4% |
| 14 | Qwen2.5 1.5B Instruct | 20.0% |
| 15 | Mistral 7B Instruct v0.2 | 19.1% |
| 16 | phi 2 | 18.1% |
| 17 | gemma 2 2b it | 17.2% |
| 18 | Qwen2 1.5B Instruct | 16.7% |
| 19 | Llama 3.2 3B | 16.5% |
| 20 | Qwen2.5 0.5B Instruct | 7.7% |
| 21 | Llama 3.2 1B Instruct | 7.6% |
| 22 | gpt j 6b | 2.7% |
| 23 | Llama 3.2 1B | 2.3% |
| 24 | distilgpt2 | 2.1% |
| 25 | gpt2 | 1.8% |
| 26 | falcon 7b instruct | 1.7% |
| 27 | gpt2 large | 1.6% |
| 28 | pythia 160m | 1.3% |
| 29 | TinyLlama 1.1B Chat v1.0 | 1.1% |