Standardized tests used to measure AI model capabilities. Each benchmark evaluates different skills, from reasoning and coding to math and conversation.
A harder variant of MMLU with 10 answer choices instead of 4 and more reasoning-intensive questions, reducing the noise that random guessing adds to scores.
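For a sense of scale, here is a small sketch of how the random-guess baseline and its statistical noise shrink when the number of answer choices grows from 4 to 10 (the 1,000-question test size is chosen only for illustration):

```python
# Random-guess baseline and its standard error on a hypothetical 1,000-question
# multiple-choice test: more answer choices means a lower floor and less
# score variance from lucky guessing.
import math

def guess_baseline(num_choices: int, num_questions: int = 1000):
    p = 1 / num_choices                               # expected score from pure guessing
    stderr = math.sqrt(p * (1 - p) / num_questions)   # standard error of that score
    return p, stderr

for k in (4, 10):
    p, se = guess_baseline(k)
    print(f"{k} choices: baseline {p:.0%} +/- {se:.1%}")
```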
Tests broad academic knowledge across 57 subjects.
An extremely difficult exam crowdsourced from subject-matter experts across 100+ disciplines, designed to be the hardest test for AI systems.
Expert-level multiple-choice questions in biology, chemistry, and physics. The Diamond subset contains the hardest questions verified by multiple domain experts.
A challenging subset of BIG-Bench tasks where models previously failed to match average human performance.
Tests multi-step reasoning over natural language narratives, including murder mysteries, team allocation puzzles, and object placement tracking.
164 Python programming problems in which the model generates a function from its signature and docstring, evaluated by running unit tests.
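A minimal sketch of this kind of execution-based scoring, assuming one generated sample per problem and plain subprocess execution (real harnesses sandbox execution and estimate pass@k from multiple samples):

```python
# Run each generated solution against the problem's unit tests in a subprocess
# and report the fraction of problems whose tests all pass (pass@1).
import os
import subprocess
import tempfile

def passes_tests(solution_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Run the generated solution plus the problem's unit tests; exit code 0 = pass."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n\n" + test_code + "\n")
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.remove(path)

def pass_at_1(samples: list[tuple[str, str]]) -> float:
    """Fraction of problems whose single generated solution passes all tests."""
    return sum(passes_tests(code, tests) for code, tests in samples) / len(samples)
```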
An augmented version of HumanEval with 80× more test cases per problem to catch false positives from weak test suites.
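A hypothetical illustration of the false positives those extra tests are meant to catch: a buggy completion that satisfies a small test suite and only fails once edge cases are added (the function and tests below are invented for this example):

```python
def is_prime(n: int) -> bool:
    # Buggy completion: only rules out divisibility by 2 and 3.
    if n < 2:
        return False
    return n in (2, 3) or (n % 2 != 0 and n % 3 != 0)

# A small, weak test suite: every check passes, so the bug goes unnoticed.
weak_suite = [(7, True), (13, True), (9, False), (4, False)]
print(all(is_prime(n) == expected for n, expected in weak_suite))   # True

# An extended suite with more edge cases exposes the false positive.
print(is_prime(25))   # True according to the buggy code, but 25 = 5 * 5
```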
Tests AI coding assistants on real-world programming tasks across multiple languages using the Aider coding tool. Measures ability to edit existing codebases to pass tests.
Continuously updated coding benchmark using newly released competitive programming problems from LeetCode, AtCoder, and Codeforces to prevent training-data contamination.
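The contamination control comes down to a date filter: only problems published after a model's training-data cutoff are scored. A toy sketch (the problem entries and cutoff date below are invented):

```python
# Keep only problems the model cannot have seen during training.
from datetime import date

problems = [
    {"id": "abc123", "source": "AtCoder",    "released": date(2024, 9, 1)},
    {"id": "lc2954", "source": "LeetCode",   "released": date(2023, 12, 3)},
    {"id": "cf1921", "source": "Codeforces", "released": date(2024, 1, 15)},
]

model_cutoff = date(2024, 3, 1)  # hypothetical training-data cutoff
eval_set = [p for p in problems if p["released"] > model_cutoff]
print([p["id"] for p in eval_set])  # ['abc123']
```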
Tests models on research-level scientific programming problems drawn from real scientific papers across physics, chemistry, biology, and mathematics.
Evaluates models on completing real-world terminal and shell tasks, including file manipulation, system commands, and scripting.
Competition mathematics problems across seven subjects and five difficulty levels, testing advanced mathematical reasoning.
15 challenging math competition problems from AIME 2024, used as a difficult math reasoning benchmark for frontier models.
15 challenging math competition problems from AIME 2025, used as a difficult math reasoning benchmark for frontier models.
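AIME answers are always integers from 0 to 999, so grading can be a strict exact match on the model's final number. A simplified sketch (real harnesses extract the answer more carefully from the model's full reasoning):

```python
import re

def extract_answer(response: str) -> int | None:
    """Take the last standalone 1-3 digit integer in the response as the final answer."""
    matches = re.findall(r"\b\d{1,3}\b", response)
    return int(matches[-1]) if matches else None

def grade(response: str, gold: int) -> bool:
    return extract_answer(response) == gold

print(grade("... so the remainder is 073.", 73))  # True
```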
Tests whether models can follow explicit, verifiable instructions such as 'write exactly 3 paragraphs' or 'include the word ocean at least twice'.
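Verifiable here means that compliance can be checked with simple deterministic code rather than a human or model judge. A sketch of checkers for the two example instructions above (the helper names are made up):

```python
def has_exact_paragraph_count(text: str, count: int) -> bool:
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return len(paragraphs) == count

def contains_word_at_least(text: str, word: str, times: int) -> bool:
    return text.lower().split().count(word.lower()) >= times

response = "The ocean is vast.\n\nWaves shape the shore.\n\nThe ocean endures."
print(has_exact_paragraph_count(response, 3))          # True
print(contains_word_at_least(response, "ocean", 2))    # True
```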
Evaluates instruction following with diverse, complex prompts that test how precisely a model adheres to specified constraints.
Live crowd-sourced evaluation where users chat with two anonymous models side-by-side and vote for the better response. Produces Elo ratings.
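A sketch of how pairwise votes become ratings, using the standard Elo update rule (the leaderboard's actual computation may differ, for example by fitting a Bradley-Terry model over all votes at once):

```python
def expected_score(r_a: float, r_b: float) -> float:
    # Probability that model A beats model B implied by their current ratings.
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))

# One vote: both models start at 1000, model A wins the comparison.
print(elo_update(1000, 1000, a_won=True))  # (1016.0, 984.0)
```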
80 multi-turn conversation questions scored 1-10 by a GPT-4 judge across categories including writing, roleplay, reasoning, math, coding, and STEM.
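A simplified sketch of how per-turn judge scores roll up into a single benchmark score (the judge prompt and score parsing are omitted, and the example conversations are invented):

```python
# Average the judge's per-turn scores across all conversations.
conversations = [
    {"category": "writing", "turn_scores": [9, 8]},
    {"category": "math",    "turn_scores": [6, 5]},
]
total_turns = sum(len(c["turn_scores"]) for c in conversations)
overall = sum(s for c in conversations for s in c["turn_scores"]) / total_turns
print(round(overall, 2))  # 7.0
```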