80 multi-turn conversation questions scored by GPT-4 across 8 categories, from writing and roleplay to math, coding, and STEM
80 multi-turn questions across 8 categories: writing, roleplay, extraction, reasoning, math, coding, knowledge (STEM), and knowledge (humanities). GPT-4 serves as the judge, scoring responses on a 1-10 scale. Designed to test conversational ability and instruction following.
Shows open-weight and commercial models together. Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard — their scores here come from provider-reported benchmarks.
| # | Model | Score |
|---|---|---|
| 1 | GPT-4o | 9.3/10 |