An extremely difficult exam crowdsourced from subject-matter experts across 100+ disciplines, designed to be the hardest test for AI systems
3,000 questions contributed by domain experts across mathematics, the sciences, humanities, and specialized fields. Each question is vetted to ensure it tests deep understanding rather than memorization. Frontier models typically score below 20%, leaving substantial headroom before the benchmark saturates.
Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard; the scores shown for them here come from provider-reported benchmarks.
| # | Model | Score |
|---|---|---|
| 1 | GPT-4o | 9.4% |