MATH Benchmark
Competition mathematics problems across seven subjects and five difficulty levels, testing advanced mathematical reasoning
12,500 competition mathematics problems from AMC 10/12, AIME, and other competitions, spanning 7 subject areas and 5 difficulty levels. Models must produce step-by-step solutions whose final answers are evaluated for exact correctness.
The rankings below cover open-weight models. Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard, so the GPT-4o entry is included for reference only, using its provider-reported score.
| # | Model | Score |
|---|---|---|
| 1 | GPT-4o | 76.6% |
| 2 | Qwen2.5 32B Instruct | 62.5% |
| 3 | Qwen2.5 14B Instruct | 54.8% |
| 4 | Qwen2.5 7B Instruct | 50.0% |
| 5 | Qwen2.5 Coder 7B Instruct | 37.2% |
| 6 | Qwen2.5 3B Instruct | 36.8% |
| 7 | Qwen2.5 32B | 35.6% |
| 8 | Qwen2.5 7B | 25.1% |
| 9 | Gemma 2 27B IT | 23.9% |
| 10 | Qwen2.5 1.5B Instruct | 22.1% |
| 11 | Gemma 2 9B IT | 19.5% |
| 12 | Phi-3 Medium 128k Instruct | 19.2% |
| 13 | Llama 3.2 3B Instruct | 17.7% |
| 14 | Llama 3.1 8B Instruct | 15.6% |
| 15 | Qwen2 1.5B Instruct | 7.2% |
| 16 | Llama 3.2 1B Instruct | 7.0% |
| 17 | Meta Llama 3 8B | 4.5% |
| 18 | Mistral 7B Instruct v0.2 | 3.0% |
| 19 | Phi-2 | 2.9% |
| 20 | Llama 3.2 3B | 1.9% |
| 21 | TinyLlama 1.1B Chat v1.0 | 1.5% |
| 22 | GPT-J 6B | 1.4% |
| 23 | GPT-2 Large | 1.2% |
| 24 | Llama 3.2 1B | 1.2% |
| 25 | Falcon 7B Instruct | 1.2% |
| 26 | Pythia 160M | 0.9% |
| 27 | DistilGPT2 | 0.6% |
| 28 | GPT-2 | 0.2% |
| 29 | Gemma 2 2B IT | 0.1% |
| 30 | Qwen2.5 0.5B Instruct | 0.0% |