TerminalBench

coding

Score: 0-100 (% tasks completed)

3 models scored

About

Evaluates models on completing real-world terminal and shell tasks, including file manipulation, system commands, and scripting.

Methodology

Models are given access to a terminal and asked to complete practical shell tasks. The tasks cover bash scripting, file operations, process management, and system administration. Success is judged by whether the desired system state is achieved.
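The state-based judging described above, together with the 0-100 score (% of tasks completed), can be illustrated with a minimal sketch. This is not the benchmark's actual harness: the names `run_task`, `score`, and `demo`, the three sample tasks, and the 30-second timeout are all hypothetical. It also assumes a POSIX `sh` and runs commands in a local temporary directory rather than an isolated environment.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable


def run_task(commands: str, check: Callable[[Path], bool], workdir: Path) -> bool:
    """Run shell commands in workdir, then judge only the resulting state."""
    try:
        # The exit code is deliberately ignored: only the final state counts.
        subprocess.run(
            ["sh", "-c", commands],
            cwd=workdir,
            capture_output=True,
            timeout=30,  # hypothetical limit, chosen for illustration
        )
    except subprocess.TimeoutExpired:
        return False
    return bool(check(workdir))


def score(results: list[bool]) -> float:
    """Percentage of tasks completed, on a 0-100 scale."""
    return 100.0 * sum(results) / len(results) if results else 0.0


def demo() -> float:
    # Hypothetical tasks: (commands a model might issue, check on final state).
    tasks = [
        ("mkdir -p logs && touch logs/app.log",
         lambda d: (d / "logs" / "app.log").is_file()),
        ("echo hello > out.txt",
         lambda d: (d / "out.txt").read_text().strip() == "hello"),
        # This one fails the check: the expected file is never created.
        ("true", lambda d: (d / "missing.txt").exists()),
    ]
    results = []
    for commands, check in tasks:
        with tempfile.TemporaryDirectory() as tmp:
            results.append(run_task(commands, check, Path(tmp)))
    return score(results)


if __name__ == "__main__":
    print(f"{demo():.1f}%")
```

The check inspects only the files left behind, so any sequence of commands that produces the same end state would pass. Two of the three sample tasks succeed, so the sketch reports a score of roughly two thirds.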

Model Leaderboard

Shows open-weight models only. Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard; their scores come from provider-reported benchmarks.

#	Model	Score
1	Claude Opus 4	39.2%
2	Claude Sonnet 4	35.5%
3	Claude 3.7 Sonnet	35.2%