MMLU (Massive Multitask Language Understanding)
Tests broad academic knowledge across 57 subjects.
A multiple-choice exam spanning STEM, the humanities, the social sciences, and more, with difficulty ranging from elementary to professional level. Models are typically evaluated in a 5-shot format.
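The 5-shot format means each test question is preceded by five solved examples from the same subject. A minimal sketch of how such a prompt is assembled, using hypothetical helper names and toy data (not real MMLU items):

```python
CHOICES = "ABCD"

def format_question(item, with_answer):
    """Render one question; append the gold letter only for few-shot examples."""
    lines = [item["question"]]
    lines += [f"{letter}. {option}" for letter, option in zip(CHOICES, item["options"])]
    lines.append("Answer: " + CHOICES[item["answer"]] if with_answer else "Answer:")
    return "\n".join(lines)

def build_prompt(subject, few_shot, target):
    """Concatenate up to five solved examples, then the unanswered target question."""
    parts = [f"The following are multiple choice questions (with answers) about {subject}."]
    parts += [format_question(ex, with_answer=True) for ex in few_shot[:5]]
    parts.append(format_question(target, with_answer=False))
    return "\n\n".join(parts)

# Toy data for illustration only.
shots = [
    {"question": f"Example question {i}?", "options": ["w", "x", "y", "z"], "answer": i % 4}
    for i in range(5)
]
target = {"question": "What is 2 + 2?", "options": ["3", "4", "5", "6"], "answer": 1}

prompt = build_prompt("elementary mathematics", shots, target)
```

The model is scored on which answer letter it assigns the highest likelihood to after the final `Answer:`.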
Note: commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard; their scores below come from provider-reported benchmarks.
| # | Model | Score |
|---|---|---|
| 1 | Gemini 2.0 Flash | 89.0% |
| 2 | GPT-4o | 88.7% |