ThinkLLM

Benchmarks

Standardized tests used to measure AI model capabilities. Each benchmark evaluates different skills — from reasoning and coding to math and conversation.

General Knowledge

MMLU-Pro

Metric: %

A harder variant of MMLU with 10 answer choices instead of 4 and more reasoning-intensive questions, reducing noise from random guessing.

General Knowledge · 29 models
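
The guessing-noise point is simple arithmetic: a model answering at random scores about 25% with 4 choices but only 10% with 10. A quick illustration in Python:

    # Expected accuracy of pure random guessing as the number of answer choices grows.
    for num_choices in (4, 10):
        print(f"{num_choices} choices -> random-guess accuracy of {1 / num_choices:.0%}")
    # 4 choices -> random-guess accuracy of 25%
    # 10 choices -> random-guess accuracy of 10%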

MMLU

Metric: %

Tests broad academic knowledge across 57 subjects.

General Knowledge · 2 models

Humanity's Last Exam

Metric: %

An extremely difficult exam crowdsourced from subject-matter experts across 100+ disciplines, designed to be the hardest test for AI systems.

General Knowledge · 1 model

Reasoning

GPQA Diamond

Metric: %

Expert-level multiple-choice questions in biology, chemistry, and physics. The Diamond subset contains the hardest questions verified by multiple domain experts.

Reasoning · 30 models

BBH

Metric: %

A challenging subset of BIG-Bench tasks where models previously failed to match average human performance.

Reasoning · 29 models

MuSR

Metric: %

Tests multi-step reasoning over natural language narratives, including murder mysteries, team allocation puzzles, and object placement tracking.

Reasoning · 29 models

Coding

HumanEval

Metric: %

164 Python programming problems requiring code generation; generated solutions are evaluated by running unit tests.

Coding · 2 models
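
HumanEval scores are usually reported as pass@k, the probability that at least one of k sampled completions passes a problem's tests. A minimal sketch of the standard unbiased estimator from the original HumanEval paper, assuming n completions are sampled per problem and c of them pass:

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased estimate of the chance that at least one of k samples passes,
        # given that c of the n sampled completions passed the tests.
        if n - c < k:
            return 1.0  # every size-k subset contains a passing sample
        return 1.0 - comb(n - c, k) / comb(n, k)

    print(pass_at_k(n=20, c=3, k=1))  # 0.15

Averaging this estimate over all 164 problems gives the reported score.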

HumanEval+

Metric: %

An augmented version of HumanEval with 80× more test cases per problem to catch false positives from weak test suites.

Coding · 1 model

Aider Polyglot

Metric: %

Tests AI coding assistants on real-world programming tasks across multiple languages using the Aider coding tool. Measures ability to edit existing codebases to pass tests.

Coding

LiveCodeBench

Metric: %

A continuously updated coding benchmark using new competitive programming problems from LeetCode, AtCoder, and Codeforces to prevent contamination.

Coding

SciCode

Metric: %

Tests models on research-level scientific programming problems drawn from real scientific papers across physics, chemistry, biology, and mathematics.

Coding

TerminalBench

Metric: %

Evaluates models on completing real-world terminal and shell tasks, including file manipulation, system commands, and scripting.

Coding

Math

MATH

Metric: %

Competition mathematics problems across seven subjects and five difficulty levels, testing advanced mathematical reasoning.

Math · 30 models

AIME 2024

Metric: %

15 challenging math competition problems from AIME 2024, used as a difficult math reasoning benchmark for frontier models.

Math

AIME 2025

Metric: %

15 challenging math competition problems from AIME 2025, used as a difficult math reasoning benchmark for frontier models.

Math

Instruction Following

IFEval

Metric: %

Tests whether models can follow explicit, verifiable instructions such as 'write exactly 3 paragraphs' or 'include the word ocean at least twice'.

Instruction Following · 29 models
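
Because every instruction is verifiable, responses can be scored with simple programmatic checks rather than a judge model. An illustrative, hand-rolled check in the same spirit (not IFEval's own code):

    def follows_instructions(response: str) -> bool:
        # Two example constraints: exactly 3 paragraphs, and the word "ocean"
        # appearing at least twice.
        paragraphs = [p for p in response.split("\n\n") if p.strip()]
        has_three_paragraphs = len(paragraphs) == 3
        mentions_ocean_twice = response.lower().count("ocean") >= 2
        return has_three_paragraphs and mentions_ocean_twice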

IFBench

Metric: %

Evaluates instruction following using diverse, complex instructions that test how precisely a model adheres to specified constraints.

Instruction Following

Conversation

Chatbot Arena

Metric: Elo rating

Live crowdsourced evaluation where users chat with two anonymous models side by side and vote for the better response. The votes produce Elo ratings.

Conversation · 1 model
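
A classic Elo update from a single head-to-head vote looks like the sketch below; the K-factor and starting ratings are placeholders, and the leaderboard's actual fitting procedure may differ in detail.

    def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0) -> tuple[float, float]:
        # Expected score of model A against model B under the Elo model.
        expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
        score_a = 1.0 if a_wins else 0.0
        delta = k * (score_a - expected_a)
        # The winner gains rating and the loser loses the same amount.
        return r_a + delta, r_b - delta

    print(elo_update(1000.0, 1000.0, a_wins=True))  # (1016.0, 984.0)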

MT-Bench

Metric: score out of 10

80 multi-turn conversation questions scored by GPT-4 on writing, roleplay, reasoning, math, coding, and STEM.

Conversation · 1 model

Long Context

LCR

Metric: %

Tests models on retrieving specific information from very long documents, measuring long-context comprehension and retrieval accuracy.

Long Context

Agent

TAU2

Metric: %

Tests models on autonomous tool use and agentic task completion in realistic web and computer interaction scenarios.

Agent