Tests models on research-level scientific programming problems drawn from real scientific papers across physics, chemistry, biology, and mathematics.
Problems require implementing algorithms described in scientific literature, then verifying correctness against test cases. Covers numerical methods, simulations, and domain-specific computations across STEM fields.
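
As a rough illustration of the implement-then-verify pattern described above, the sketch below solves a toy task of the same shape: it implements a composite trapezoidal rule and checks it against reference values within a numeric tolerance. This is not an actual benchmark problem. The function names `trapezoid` and `passes_all`, the test cases, and the tolerance are all assumptions chosen for illustration.

```python
import math


def trapezoid(f, a, b, n=1000):
    """Composite trapezoidal rule on [a, b] with n equal subintervals."""
    h = (b - a) / n
    # Endpoints carry weight 1/2, interior points carry weight 1.
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


# Hypothetical test cases: (integrand, lower bound, upper bound, exact value).
TEST_CASES = [
    (math.sin, 0.0, math.pi, 2.0),
    (lambda x: x * x, 0.0, 1.0, 1.0 / 3.0),
    (math.exp, 0.0, 1.0, math.e - 1.0),
]


def passes_all(tol=1e-5):
    """Return True only if every test case matches within the tolerance."""
    return all(
        math.isclose(trapezoid(f, a, b), expected, abs_tol=tol)
        for f, a, b, expected in TEST_CASES
    )


if __name__ == "__main__":
    print("all test cases pass:", passes_all())
```

In this sketch a solution counts as correct only when every test case passes, so a single numerical error fails the whole task. Whether the benchmark itself scores problems this strictly is not stated here.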
Shows open-weight models only. Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard — their scores come from provider-reported benchmarks.
| # | Model | Score |
|---|---|---|
| 1 | o3 Mini | 10.8% |
| 2 | DeepSeek R1 | 4.6% |
| 3 | Claude 3.5 Sonnet | 4.6% |
| 4 | DeepSeek V3 | 3.1% |
| 5 | Llama 3.1 405B Instruct | 1.5% |
| 6 | Claude 3 Sonnet | 1.5% |
| 7 | Qwen2 72B Instruct | 1.5% |
| 8 | GPT-4 Turbo | 1.5% |
| 9 | GPT-4o | 1.5% |
| 10 | Claude 3 Opus | 1.5% |
| 11 | o1-mini | 1.5% |
| 12 | Llama 3.1 70B Instruct | 0.0% |