An extremely difficult exam crowdsourced from subject-matter experts across 100+ disciplines, designed to be the hardest test for AI systems
3,000 questions contributed by domain experts across mathematics, the sciences, humanities, and specialized fields. Each question is vetted to ensure it tests deep understanding rather than memorization. Frontier models typically score below 20%, leaving substantial headroom before the benchmark saturates.
Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard; the scores shown for them here come from provider-reported benchmarks.
| # | Model | Score |
|---|---|---|
| 1 | GPT-4o | 9.4% |