80 multi-turn conversation questions scored by GPT-4 across 8 categories, from writing and roleplay to math, coding, and STEM
80 multi-turn questions across 8 categories: writing, roleplay, extraction, reasoning, math, coding, knowledge (STEM), and knowledge (humanities). GPT-4 serves as the judge, scoring responses on a 1-10 scale. Designed to test conversational ability and instruction following.
Shows open-weight and commercial models together. Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard — their scores here come from provider-reported benchmarks.
| # | Model | Score |
|---|---|---|
| 1 | GPT-4o | 9.3/10 |