Tests whether models can follow explicit, verifiable instructions such as 'write exactly 3 paragraphs' or 'include the word ocean at least twice'.
Methodology
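541 prompts covering 25 types of verifiable instructions (length constraints, keyword inclusion, formatting requirements, etc.). Responses are checked programmatically, so no human or LLM judge is needed. Two metrics: strict (the response must satisfy each instruction exactly as written) and loose (the response is also checked after light normalization, such as removing markdown emphasis or a leading or trailing line, and passes if any variant satisfies the instruction).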
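As a rough illustration of what programmatic verification can look like, the sketch below checks the two example instructions from the description ('write exactly 3 paragraphs' and 'include the word ocean at least twice'). The function names, the blank-line paragraph rule, and the whole-word keyword match are assumptions made for this example; they are not the benchmark's actual checker code.

```python
import re

# Hypothetical checkers for illustration only; the benchmark's real
# verification code may define paragraphs and keyword matches differently.

def has_exact_paragraphs(response: str, n: int) -> bool:
    """True if the response has exactly n paragraphs separated by blank lines."""
    paragraphs = [p for p in re.split(r"\n\s*\n", response.strip()) if p.strip()]
    return len(paragraphs) == n

def has_keyword_at_least(response: str, keyword: str, min_count: int) -> bool:
    """True if keyword appears as a whole word at least min_count times (case-insensitive)."""
    matches = re.findall(rf"\b{re.escape(keyword)}\b", response, flags=re.IGNORECASE)
    return len(matches) >= min_count

def follows_all(response: str, checks) -> bool:
    """A prompt counts as followed only if every attached check passes."""
    return all(check(response) for check in checks)

# Assumed example: one prompt carrying both instructions.
checks = [
    lambda r: has_exact_paragraphs(r, 3),
    lambda r: has_keyword_at_least(r, "ocean", 2),
]
response = (
    "The ocean is vast.\n\n"
    "Many species live in the ocean.\n\n"
    "We should protect it."
)
print(follows_all(response, checks))
```

Because each check is a plain function of the response text, the result is deterministic and needs no judge model.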
Shows open-weight models only. Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard; their scores come from provider-reported benchmarks.