GPQA Diamond

Graduate-Level Google-Proof Q&A — Diamond Subset

reasoning
Score: 0-100 (% accuracy)
30 models scored

About

Expert-level multiple-choice questions in biology, chemistry, and physics. The Diamond subset contains the hardest questions, each verified by multiple domain experts.

Methodology

The Diamond subset comprises 198 expert-crafted multiple-choice questions across biology, chemistry, and physics, selected from the 448-question GPQA main set. Each question was validated by at least two domain experts, and the questions are designed so that they cannot be answered through web search alone. Models are evaluated on zero-shot or few-shot accuracy.
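As a rough illustration of how a 0-100 accuracy score can be computed for multiple-choice questions, the sketch below compares predicted answer letters against gold answer letters. The function name `accuracy_percent`, the A-D letter format, and the toy data are assumptions made for this example; they are not taken from the benchmark's own evaluation code.

```python
from typing import Sequence


def accuracy_percent(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the share of matching answer letters on a 0-100 scale."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty question set")
    # Compare letters case-insensitively and ignore surrounding whitespace.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


# Hypothetical toy data: four questions with answer choices labelled A-D.
gold = ["A", "C", "B", "D"]
predicted = ["A", "c", "D", "D"]
score = accuracy_percent(predicted, gold)
print(f"{score:.1f}%")  # 3 of 4 correct -> 75.0%
```

With four answer choices per question, random guessing would land near 25% on this raw scale, which is a useful reference point when reading raw accuracy figures.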


Model Leaderboard

Scores for open-weight models come from the Open LLM Leaderboard. Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard; their scores come from provider-reported benchmarks.

#	Model	Score
1	GPT-4o	53.6%
2	Qwen2.5 32B	21.6%
3	gemma 2 27b it	16.7%
4	gemma 2 9b it	14.8%
5	Qwen2.5 32B Instruct