MMLU-Pro

general | Score: 0-100 (% accuracy) | 29 models scored

About

A harder variant of MMLU with 10 answer choices instead of 4 and more reasoning-intensive questions, which reduces noise from random guessing.

Methodology

12,032 questions across 14 domains, expanded from MMLU's 4-choice format to a 10-choice format. Questions are filtered for difficulty and augmented with reasoning-heavy problems from STEM sources. The wider format lowers the random-guessing baseline from 25% to 10%, so a given score reflects less luck.
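The effect of the answer-choice count on the guessing baseline can be illustrated with a short sketch. It assumes uniform random guessing over k choices (baseline 1/k). The function names `random_baseline` and `chance_corrected` are hypothetical helpers, and the rescaling shown is a common illustrative convention, not necessarily how the scores on this page were computed.

```python
def random_baseline(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing over num_choices options."""
    if num_choices < 1:
        raise ValueError("num_choices must be at least 1")
    return 1.0 / num_choices


def chance_corrected(accuracy: float, num_choices: int) -> float:
    """Rescale accuracy so the random baseline maps to 0.0 and a perfect score to 1.0.

    Illustrative convention only (an assumption, not the leaderboard's documented method).
    """
    if num_choices < 2:
        raise ValueError("need at least 2 choices for a meaningful correction")
    baseline = random_baseline(num_choices)
    return (accuracy - baseline) / (1.0 - baseline)


if __name__ == "__main__":
    # Compare the 4-choice (MMLU) and 10-choice (MMLU-Pro) guessing baselines.
    for k in (4, 10):
        print(f"{k} choices: random baseline = {random_baseline(k):.0%}")
```

Under this convention, the same raw accuracy leaves more headroom above chance on a 10-choice test than on a 4-choice one, which is the sense in which guessing contributes less noise.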


Model Leaderboard

This leaderboard shows open-weight models only. Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard; their scores come from provider-reported benchmarks.

#	Model	Score
1	Qwen2.5 32B	53.4%
2	Qwen2.5 32B Instruct	51.9%
3	Qwen2.5 14B Instruct	43.4%
4	Phi 3 medium 128k instruct	41.2%
5	gemma 2 27b it	38.3%
6