Instruction-Following Eval
Tests whether models can follow explicit, verifiable instructions such as 'write exactly 3 paragraphs' or 'include the word ocean at least twice'.
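541 prompts with 25 types of verifiable instructions (length constraints, keyword inclusion, formatting requirements, etc.). Evaluated by programmatic verification, so no human or LLM judge is needed. Two metrics: strict (the response is checked exactly as written) and loose (the response is also checked after minor normalizations, such as stripping markdown emphasis or a leading or trailing line, to reduce false negatives).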
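To make "programmatic verification" concrete, here is a minimal sketch of how the two example instructions above might be checked in Python. The function names (`has_exact_paragraphs`, `has_keyword_at_least`, `check_prompt`) and the paragraph-splitting rule are illustrative assumptions, not the benchmark's actual implementation, and the sketch covers only the per-instruction checks, not how results are aggregated into the reported score.

```python
import re
from typing import Callable, List, Tuple


def has_exact_paragraphs(response: str, n: int) -> bool:
    """True if the response has exactly n non-empty paragraphs.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", response.strip()) if p.strip()]
    return len(paragraphs) == n


def has_keyword_at_least(response: str, keyword: str, min_count: int) -> bool:
    """True if keyword occurs as a whole word at least min_count times (case-insensitive)."""
    hits = re.findall(rf"\b{re.escape(keyword)}\b", response, flags=re.IGNORECASE)
    return len(hits) >= min_count


def check_prompt(
    response: str, checks: List[Callable[[str], bool]]
) -> Tuple[bool, List[bool]]:
    """Run every check on one response.

    Returns (all instructions followed, per-instruction results).
    """
    results = [check(response) for check in checks]
    return all(results), results
```

For a prompt that carries both instructions, `check_prompt(response, [lambda r: has_exact_paragraphs(r, 3), lambda r: has_keyword_at_least(r, "ocean", 2)])` returns an overall pass/fail flag plus one boolean per instruction, so no judge model is involved at any point.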
This table shows open-weight models only. Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard; their scores come from provider-reported benchmarks.
| # | Model | Score |
|---|---|---|
| 1 | Qwen2.5 32B Instruct | 83.5% |
| 2 | Qwen2.5 14B Instruct | 81.6% |
| 3 | gemma 2 27b it | 79.8% |
| 4 | Qwen2.5 7B Instruct | 75.9% |
| 5 | gemma 2 9b it | 74.4% |
| 6 | Llama 3.2 3B Instruct | 73.9% |
| 7 | Qwen2.5 3B Instruct | 64.7% |
| 8 | Qwen2.5 Coder 7B Instruct | 61.0% |
| 9 | Phi 3 medium 128k instruct | 60.4% |
| 10 | Llama 3.2 1B Instruct | 57.0% |
| 11 | gemma 2 2b it | 56.7% |
| 12 | Mistral 7B Instruct v0.2 | 55.0% |
| 13 | Llama 3.1 8B Instruct | 49.2% |
| 14 | Qwen2.5 1.5B Instruct | 44.8% |
| 15 | Qwen2.5 32B | 40.8% |
| 16 | Qwen2.5 7B | 33.7% |
| 17 | Qwen2 1.5B Instruct | 33.7% |
| 18 | Qwen2.5 0.5B Instruct | 30.7% |
| 19 | phi 2 | 27.4% |
| 20 | gpt j 6b | 25.2% |
| 21 | gpt2 large | 20.5% |
| 22 | falcon 7b instruct | 19.7% |
| 23 | pythia 160m | 18.2% |
| 24 | gpt2 | 17.9% |
| 25 | Llama 3.2 1B | 14.8% |
| 26 | Meta Llama 3 8B | 14.6% |
| 27 | Llama 3.2 3B | 13.4% |
| 28 | distilgpt2 | 6.1% |
| 29 | TinyLlama 1.1B Chat v1.0 | 6.0% |