IFBench

Instruction following · Score: 0-100 (% accuracy) · 6 models scored

About

Evaluates instruction following using diverse, complex instructions that test how precisely a model adheres to specified constraints.

Methodology

Tests models on complex, multi-constraint instructions across diverse task types. Responses are evaluated automatically, using programmatic and LLM-based verification. More challenging than IFEval because the constraints are more complex and varied.
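As a rough illustration of what programmatic verification of a multi-constraint instruction could look like, here is a minimal Python sketch. The checker names (max_words, contains_keyword, ends_with), the sample responses, and the all-constraints-must-pass scoring rule are assumptions made for this example; they are not taken from the benchmark's actual code, and the LLM-based verification path is not shown.

```python
from typing import Callable, List, Tuple

Check = Callable[[str], bool]

# Hypothetical constraint checkers (illustrative only, not the benchmark's own set).
def max_words(limit: int) -> Check:
    """Pass when the response has at most `limit` whitespace-separated words."""
    return lambda text: len(text.split()) <= limit

def contains_keyword(keyword: str) -> Check:
    """Pass when the response mentions `keyword`, ignoring case."""
    return lambda text: keyword.lower() in text.lower()

def ends_with(suffix: str) -> Check:
    """Pass when the response, minus trailing whitespace, ends with `suffix`."""
    return lambda text: text.rstrip().endswith(suffix)

def follows_all(response: str, checks: List[Check]) -> bool:
    """Assumed rule: a response counts as correct only if every constraint passes."""
    return all(check(response) for check in checks)

def accuracy(results: List[bool]) -> float:
    """Percentage of responses that satisfied all of their constraints (0-100)."""
    return 100.0 * sum(results) / len(results) if results else 0.0

# Made-up responses paired with the constraints attached to their prompts.
examples: List[Tuple[str, List[Check]]] = [
    ("Paris is the capital of France.",
     [max_words(10), contains_keyword("paris"), ends_with(".")]),
    ("The capital city of France is, of course, the beautiful city of Paris",
     [max_words(10), ends_with(".")]),
]

results = [follows_all(response, checks) for response, checks in examples]
score = accuracy(results)
print(f"{score:.1f}%")
```

Under this assumed rule, a response that misses any single constraint scores zero for that item, which is one reason multi-constraint instructions tend to be harder than single-constraint ones.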


Model Leaderboard

Shows open-weight models only. Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard — their scores come from provider-reported benchmarks.

#	Model	Score
1	Grok 4.3	81.3%
2	o3	69.3%
3	Claude Opus 4.5	58.0%
4	Gemini 2.5 Pro	52.3%
5	Claude Sonnet 4	42.3%