MATH Benchmark
Competition mathematics problems across seven subjects and five difficulty levels, testing advanced mathematical reasoning
12,500 competition mathematics problems from AMC 10/12, AIME, and other competitions, spanning 7 subject areas and 5 difficulty levels. Models must produce step-by-step solutions whose final answers are evaluated for exact correctness.
The rankings below cover open-weight models. Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard, so the GPT-4o entry is included for reference only, using its provider-reported score.
| # | Model | Score |
|---|---|---|
| 1 | GPT-4o | 76.6% |
| 2 | Qwen2.5 32B Instruct | 62.5% |
| 3 | Qwen2.5 14B Instruct | 54.8% |
| 4 | Qwen2.5 7B Instruct | 50.0% |
| 5 | Qwen2.5 Coder 7B Instruct | 37.2% |
| 6 | Qwen2.5 3B Instruct | 36.8% |
| 7 | Qwen2.5 32B | 35.6% |
| 8 | Qwen2.5 7B | 25.1% |
| 9 | Gemma 2 27B IT | 23.9% |
| 10 | Qwen2.5 1.5B Instruct | 22.1% |
| 11 | Gemma 2 9B IT | 19.5% |
| 12 | Phi-3 Medium 128k Instruct | 19.2% |
| 13 | Llama 3.2 3B Instruct | 17.7% |
| 14 | Llama 3.1 8B Instruct | 15.6% |
| 15 | Qwen2 1.5B Instruct | 7.2% |
| 16 | Llama 3.2 1B Instruct | 7.0% |
| 17 | Meta Llama 3 8B | 4.5% |
| 18 | Mistral 7B Instruct v0.2 | 3.0% |
| 19 | Phi-2 | 2.9% |
| 20 | Llama 3.2 3B | 1.9% |
| 21 | TinyLlama 1.1B Chat v1.0 | 1.5% |
| 22 | GPT-J 6B | 1.4% |
| 23 | GPT-2 Large | 1.2% |
| 24 | Llama 3.2 1B | 1.2% |
| 25 | Falcon 7B Instruct | 1.2% |
| 26 | Pythia 160M | 0.9% |
| 27 | DistilGPT2 | 0.6% |
| 28 | GPT-2 | 0.2% |
| 29 | Gemma 2 2B IT | 0.1% |
| 30 | Qwen2.5 0.5B Instruct | 0.0% |