Standardized tests used to measure AI model capabilities. Each benchmark evaluates different skills, from reasoning and coding to math and conversation.
A harder variant of MMLU with 10 answer choices instead of 4 and more reasoning-intensive questions, reducing the noise that random guessing adds to scores.
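For a sense of scale, here is a small sketch of how the random-guess baseline and its statistical noise shrink when the number of answer choices grows from 4 to 10 (the 1,000-question test size is chosen only for illustration):

```python
# Random-guess baseline and its standard error on a hypothetical 1,000-question
# multiple-choice test: more answer choices means a lower floor and less
# score variance from lucky guessing.
import math

def guess_baseline(num_choices: int, num_questions: int = 1000):
    p = 1 / num_choices                               # expected score from pure guessing
    stderr = math.sqrt(p * (1 - p) / num_questions)   # standard error of that score
    return p, stderr

for k in (4, 10):
    p, se = guess_baseline(k)
    print(f"{k} choices: baseline {p:.0%} +/- {se:.1%}")
```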
Tests broad academic knowledge across 57 subjects.
An extremely difficult exam crowdsourced from subject-matter experts across 100+ disciplines, designed to be the hardest test for AI systems.
Expert-level multiple-choice questions in biology, chemistry, and physics. The Diamond subset contains the hardest questions verified by multiple domain experts.
A challenging subset of BIG-Bench tasks where models previously failed to match average human performance.
Tests multi-step reasoning over natural language narratives, including murder mysteries, team allocation puzzles, and object placement tracking.
164 Python programming problems in which the model generates a function from its signature and docstring, evaluated by running unit tests.
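A minimal sketch of this kind of execution-based scoring, assuming one generated sample per problem and plain subprocess execution (real harnesses sandbox execution and estimate pass@k from multiple samples):

```python
# Run each generated solution against the problem's unit tests in a subprocess
# and report the fraction of problems whose tests all pass (pass@1).
import os
import subprocess
import tempfile

def passes_tests(solution_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Run the generated solution plus the problem's unit tests; exit code 0 = pass."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n\n" + test_code + "\n")
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.remove(path)

def pass_at_1(samples: list[tuple[str, str]]) -> float:
    """Fraction of problems whose single generated solution passes all tests."""
    return sum(passes_tests(code, tests) for code, tests in samples) / len(samples)
```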
An augmented version of HumanEval with 80× more test cases per problem to catch false positives from weak test suites.
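A hypothetical illustration of the false positives those extra tests are meant to catch: a buggy completion that satisfies a small test suite and only fails once edge cases are added (the function and tests below are invented for this example):

```python
def is_prime(n: int) -> bool:
    # Buggy completion: only rules out divisibility by 2 and 3.
    if n < 2:
        return False
    return n in (2, 3) or (n % 2 != 0 and n % 3 != 0)

# A small, weak test suite: every check passes, so the bug goes unnoticed.
weak_suite = [(7, True), (13, True), (9, False), (4, False)]
print(all(is_prime(n) == expected for n, expected in weak_suite))   # True

# An extended suite with more edge cases exposes the false positive.
print(is_prime(25))   # True according to the buggy code, but 25 = 5 * 5
```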
Tests AI coding assistants on real-world programming tasks across multiple languages using the Aider coding tool. Measures ability to edit existing codebases to pass tests.
Continuously updated coding benchmark using newly released competitive programming problems from LeetCode, AtCoder, and Codeforces to prevent training-data contamination.
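The contamination control comes down to a date filter: only problems published after a model's training-data cutoff are scored. A toy sketch (the problem entries and cutoff date below are invented):

```python
# Keep only problems the model cannot have seen during training.
from datetime import date

problems = [
    {"id": "abc123", "source": "AtCoder",    "released": date(2024, 9, 1)},
    {"id": "lc2954", "source": "LeetCode",   "released": date(2023, 12, 3)},
    {"id": "cf1921", "source": "Codeforces", "released": date(2024, 1, 15)},
]

model_cutoff = date(2024, 3, 1)  # hypothetical training-data cutoff
eval_set = [p for p in problems if p["released"] > model_cutoff]
print([p["id"] for p in eval_set])  # ['abc123']
```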
Tests models on research-level scientific programming problems drawn from real scientific papers across physics, chemistry, biology, and mathematics.
Evaluates models on completing real-world terminal and shell tasks, including file manipulation, system commands, and scripting.
Competition mathematics problems across seven subjects and five difficulty levels, testing advanced mathematical reasoning.
15 challenging math competition problems from AIME 2024, used as a difficult math reasoning benchmark for frontier models.
15 challenging math competition problems from AIME 2025, used as a difficult math reasoning benchmark for frontier models.
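AIME answers are always integers from 0 to 999, so grading can be a strict exact match on the model's final number. A simplified sketch (real harnesses extract the answer more carefully from the model's full reasoning):

```python
import re

def extract_answer(response: str) -> int | None:
    """Take the last standalone 1-3 digit integer in the response as the final answer."""
    matches = re.findall(r"\b\d{1,3}\b", response)
    return int(matches[-1]) if matches else None

def grade(response: str, gold: int) -> bool:
    return extract_answer(response) == gold

print(grade("... so the remainder is 073.", 73))  # True
```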
Tests whether models can follow explicit, verifiable instructions such as 'write exactly 3 paragraphs' or 'include the word ocean at least twice'.
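Verifiable here means that compliance can be checked with simple deterministic code rather than a human or model judge. A sketch of checkers for the two example instructions above (the helper names are made up):

```python
def has_exact_paragraph_count(text: str, count: int) -> bool:
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return len(paragraphs) == count

def contains_word_at_least(text: str, word: str, times: int) -> bool:
    return text.lower().split().count(word.lower()) >= times

response = "The ocean is vast.\n\nWaves shape the shore.\n\nThe ocean endures."
print(has_exact_paragraph_count(response, 3))          # True
print(contains_word_at_least(response, "ocean", 2))    # True
```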
Evaluates instruction following with diverse, complex prompts that test how precisely a model adheres to specified constraints.
Live crowd-sourced evaluation where users chat with two anonymous models side-by-side and vote for the better response. Produces Elo ratings.
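A sketch of how pairwise votes become ratings, using the standard Elo update rule (the leaderboard's actual computation may differ, for example by fitting a Bradley-Terry model over all votes at once):

```python
def expected_score(r_a: float, r_b: float) -> float:
    # Probability that model A beats model B implied by their current ratings.
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))

# One vote: both models start at 1000, model A wins the comparison.
print(elo_update(1000, 1000, a_won=True))  # (1016.0, 984.0)
```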
80 multi-turn conversation questions scored 1-10 by a GPT-4 judge across categories including writing, roleplay, reasoning, math, coding, and STEM.
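A simplified sketch of how per-turn judge scores roll up into a single benchmark score (the judge prompt and score parsing are omitted, and the example conversations are invented):

```python
# Average the judge's per-turn scores across all conversations.
conversations = [
    {"category": "writing", "turn_scores": [9, 8]},
    {"category": "math",    "turn_scores": [6, 5]},
]
total_turns = sum(len(c["turn_scores"]) for c in conversations)
overall = sum(s for c in conversations for s in c["turn_scores"]) / total_turns
print(round(overall, 2))  # 7.0
```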