Graduate-Level Google-Proof Q&A — Diamond Subset
Expert-level multiple-choice questions in biology, chemistry, and physics. The Diamond subset contains the hardest questions, verified by multiple domain experts.
GPQA comprises 448 expert-written multiple-choice questions across biology, chemistry, and physics; the Diamond subset contains the 198 hardest of them. Each question was validated by at least two domain experts and is designed so that it cannot be answered through web search alone. Models are scored by zero-shot or few-shot accuracy.
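As a rough illustration of the accuracy metric mentioned above, the following Python sketch scores a set of multiple-choice predictions against an answer key. The function names, the toy data, and the four-choice default are assumptions made for this example, not part of any leaderboard's code. The second helper shows one common way to rescale accuracy so that random guessing maps to 0; whether the scores on this page are rescaled that way is not stated here.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of questions where the predicted choice matches the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def rescale_above_chance(acc: float, num_choices: int = 4) -> float:
    """Map chance-level accuracy to 0 and perfect accuracy to 1.

    Assumption: every question has `num_choices` options, so random
    guessing scores 1 / num_choices. Values below chance are clipped to 0.
    """
    chance = 1.0 / num_choices
    return max(0.0, (acc - chance) / (1.0 - chance))


# Hypothetical toy data: four questions, three answered correctly.
predictions = ["A", "C", "B", "D"]
answers = ["A", "B", "B", "D"]

raw = accuracy(predictions, answers)
rescaled = rescale_above_chance(raw)
print(f"raw accuracy: {raw:.3f}, rescaled: {rescaled:.3f}")
```

Under this kind of rescaling, a model that does no better than guessing on four-choice questions scores 0 rather than 25%, so raw and rescaled figures should not be compared directly.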
The table lists mainly open-weight models. Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard; where one appears, its score comes from provider-reported benchmarks.
| # | Model | Score |
|---|---|---|
| 1 | GPT-4o | 53.6% |
| 2 | Qwen2.5 32B | 21.6% |
| 3 | gemma 2 27b it | 16.7% |
| 4 | gemma 2 9b it | 14.8% |
| 5 | Qwen2.5 32B Instruct | 11.7% |
| 6 | Phi 3 medium 128k instruct | 11.5% |
| 7 | Qwen2.5 7B | 10.0% |
| 8 | Qwen2.5 14B Instruct | 9.6% |
| 9 | Llama 3.1 8B Instruct | 8.7% |
| 10 | Meta Llama 3 8B | 7.4% |
| 11 | Qwen2.5 Coder 7B Instruct | 5.6% |
| 12 | Qwen2.5 7B Instruct | 5.5% |
| 13 | Llama 3.2 3B Instruct | 3.8% |
| 14 | Mistral 7B Instruct v0.2 | 3.5% |
| 15 | Llama 3.2 1B Instruct | 3.4% |
| 16 | gemma 2 2b it | 3.2% |
| 17 | Qwen2.5 3B Instruct | 3.0% |
| 18 | phi 2 | 2.9% |
| 19 | Llama 3.2 3B | 2.3% |
| 20 | Qwen2 1.5B Instruct | 1.6% |
| 21 | gpt2 large | 1.2% |
| 22 | distilgpt2 | 1.2% |
| 23 | pythia 160m | 1.1% |
| 24 | gpt2 | 1.1% |
| 25 | Qwen2.5 0.5B Instruct | 1.0% |
| 26 | Qwen2.5 1.5B Instruct | 0.8% |
| 27 | gpt j 6b | 0.0% |
| 28 | TinyLlama 1.1B Chat v1.0 | 0.0% |
| 29 | Llama 3.2 1B | 0.0% |
| 30 | falcon 7b instruct | 0.0% |