MMLU (Massive Multitask Language Understanding)
Tests broad academic knowledge across 57 subjects.
A multiple-choice exam spanning STEM, the humanities, the social sciences, and more, with difficulty ranging from elementary to professional level. Models are typically evaluated in a 5-shot format.
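The 5-shot format means each test question is preceded by five solved examples from the same subject. A minimal sketch of how such a prompt is assembled, using hypothetical helper names and toy data (not real MMLU items):

```python
CHOICES = "ABCD"

def format_question(item, with_answer):
    """Render one question; append the gold letter only for few-shot examples."""
    lines = [item["question"]]
    lines += [f"{letter}. {option}" for letter, option in zip(CHOICES, item["options"])]
    lines.append("Answer: " + CHOICES[item["answer"]] if with_answer else "Answer:")
    return "\n".join(lines)

def build_prompt(subject, few_shot, target):
    """Concatenate up to five solved examples, then the unanswered target question."""
    parts = [f"The following are multiple choice questions (with answers) about {subject}."]
    parts += [format_question(ex, with_answer=True) for ex in few_shot[:5]]
    parts.append(format_question(target, with_answer=False))
    return "\n\n".join(parts)

# Toy data for illustration only.
shots = [
    {"question": f"Example question {i}?", "options": ["w", "x", "y", "z"], "answer": i % 4}
    for i in range(5)
]
target = {"question": "What is 2 + 2?", "options": ["3", "4", "5", "6"], "answer": 1}

prompt = build_prompt("elementary mathematics", shots, target)
```

The model is scored on which answer letter it assigns the highest likelihood to after the final `Answer:`.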
Note: commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard; their scores below come from provider-reported benchmarks.
| # | Model | Score |
|---|---|---|
| 1 | Gemini 2.0 Flash | 89.0% |
| 2 | GPT-4o | 88.7% |