Evaluates instruction following using diverse, complex instructions that test how precisely a model adheres to specified constraints.
Tests models on following complex, multi-constraint instructions across diverse task types. Uses automatic evaluation with programmatic and LLM-based verification. More challenging than IFEval due to more complex and varied constraints.
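As a rough illustration of the programmatic side of that verification, the sketch below checks one response against several constraints at once. The benchmark's actual checkers are not described here, so the constraint types (`max_words`, `contains_keyword`, `bullet_count`) and the all-or-nothing `strict_pass` rule are assumptions chosen for illustration only.

```python
import re
from typing import Callable, Dict

Check = Callable[[str], bool]

# Hypothetical constraint checkers; not the benchmark's real constraint set.
def max_words(limit: int) -> Check:
    """Response must contain at most `limit` whitespace-separated tokens."""
    return lambda text: len(text.split()) <= limit

def contains_keyword(keyword: str) -> Check:
    """Response must mention `keyword` (case-insensitive)."""
    return lambda text: keyword.lower() in text.lower()

def bullet_count(n: int) -> Check:
    """Response must contain exactly `n` lines starting with '- ' or '* '."""
    pattern = re.compile(r"^[ \t]*[-*] ", flags=re.MULTILINE)
    return lambda text: len(pattern.findall(text)) == n

def check_response(response: str, constraints: Dict[str, Check]) -> Dict[str, bool]:
    """Run every constraint and report pass/fail per constraint."""
    return {name: check(response) for name, check in constraints.items()}

def strict_pass(results: Dict[str, bool]) -> bool:
    """Assumed scoring rule: a response passes only if every constraint holds."""
    return all(results.values())

constraints = {
    "at_most_10_words": max_words(10),
    "mentions_bananas": contains_keyword("bananas"),
    "exactly_3_bullets": bullet_count(3),
}
response = "- apples are red\n- bananas are yellow"
results = check_response(response, constraints)
```

Here the response satisfies the length and keyword constraints but has only two bullets, so `strict_pass(results)` is false. Constraints that cannot be checked mechanically (tone or style, for example) are presumably where the LLM-based verification comes in.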
Shows open-weight models only. Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard — their scores come from provider-reported benchmarks.
| # | Model | Score |
|---|---|---|
| 1 | Grok 4.3 | 81.3% |
| 2 | o3 | 69.3% |
| 3 | Claude Opus 4.5 | 58.0% |
| 4 | Gemini 2.5 Pro | 52.3% |
| 5 | Claude Sonnet 4 | 42.3% |
| 6 | DeepSeek R1 | 38.0% |