IFEval

Instruction-Following Eval

instruction following | Score: 0-100 (% accuracy) | 29 models scored

About

Tests whether models can follow explicit, verifiable instructions, such as 'write exactly 3 paragraphs' or 'include the word ocean at least twice'.
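To make "verifiable" concrete, here is a minimal sketch of how the two example instructions could be checked in code. The function names are hypothetical, and the sketch assumes that paragraphs are separated by blank lines and that keywords match as whole words, case-insensitively; the benchmark's own checkers may use different conventions.

```python
import re


def has_exact_paragraphs(response: str, n: int) -> bool:
    """True if the response has exactly n non-empty paragraphs.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", response.strip()) if p.strip()]
    return len(paragraphs) == n


def has_keyword_at_least(response: str, keyword: str, min_count: int) -> bool:
    """True if keyword occurs as a whole word at least min_count times.

    Assumption: matching is case-insensitive.
    """
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= min_count


response = (
    "The ocean is vast.\n\n"
    "Waves cross the ocean daily.\n\n"
    "Tides follow the moon."
)
print(has_exact_paragraphs(response, 3))
print(has_keyword_at_least(response, "ocean", 2))
```

Because each check is a plain function returning a boolean, the result does not depend on a grader's judgment.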

Methodology

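541 prompts covering 25 types of verifiable instructions (length constraints, keyword inclusion, formatting requirements, etc.). Responses are checked programmatically, so no human or LLM judge is needed. Two metrics are reported: strict, which checks the response exactly as written and requires every constraint to be met, and loose, which also accepts a response when a lightly normalized variant of it (for example, with markdown emphasis or a leading or trailing line removed) satisfies the constraints.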
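The scoring can be sketched roughly as follows. This is an illustration under stated assumptions, not the benchmark's actual implementation: it assumes that loose scoring re-checks lightly normalized variants of the response, as described in the IFEval paper. The particular normalizations, the helper names (`variants`, `strict_pass`, `loose_pass`, `accuracy`), and the sample responses are all hypothetical.

```python
from typing import Callable, List

Check = Callable[[str], bool]


def variants(response: str) -> List[str]:
    """Return the response plus relaxed variants (illustrative set only)."""
    lines = response.split("\n")
    no_first = "\n".join(lines[1:])   # drop a possible intro line
    no_last = "\n".join(lines[:-1])   # drop a possible closing line
    candidates = [response, no_first, no_last]
    # Also try each candidate with markdown asterisks removed.
    return candidates + [c.replace("*", "") for c in candidates]


def strict_pass(response: str, checks: List[Check]) -> bool:
    """Every check must pass on the response exactly as written."""
    return all(check(response) for check in checks)


def loose_pass(response: str, checks: List[Check]) -> bool:
    """Every check must pass on at least one relaxed variant."""
    return all(any(check(v) for v in variants(response)) for check in checks)


def accuracy(results: List[bool]) -> float:
    """Percentage of prompts that passed, on a 0-100 scale."""
    return 100.0 * sum(results) / len(results)


# Hypothetical constraints: all lowercase, and "ocean" at least twice.
checks: List[Check] = [
    lambda r: r == r.lower(),
    lambda r: r.lower().count("ocean") >= 2,
]
responses = [
    "Here is your answer:\nthe ocean is wide and the ocean is deep",
    "the ocean is calm",
    "the ocean meets the ocean",
]
strict_score = accuracy([strict_pass(r, checks) for r in responses])
loose_score = accuracy([loose_pass(r, checks) for r in responses])
print(f"strict: {strict_score:.1f}%  loose: {loose_score:.1f}%")
```

In this toy run the first response fails strict scoring only because of its capitalized intro line, which the loose variant without the first line no longer contains.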

Paper Dataset

Model Leaderboard

Shows open-weight models only. Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard; their scores come from provider-reported benchmarks.

#	Model	Score
1	Qwen2.5 32B Instruct	83.5%
2	Qwen2.5 14B Instruct	81.6%
3	gemma 2 27b it	79.8%
4	Qwen2.5 7B Instruct	75.9%
5	gemma 2 9b it