Evaluates models on completing real-world terminal and shell tasks, including file manipulation, system commands, and scripting.
Models are given access to a terminal and asked to complete practical shell tasks covering Bash scripting, file operations, process management, and system administration. Success is judged by whether the desired system state is achieved.
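To illustrate what judging by system state can look like, here is a minimal sketch, not the benchmark's actual harness. It assumes a hypothetical task ("create `out/report.txt` containing the word `done`") and hypothetical helper names (`run_task`, `state_is_correct`, `evaluate`). The point is that the check inspects the resulting filesystem rather than the command text or its output.

```python
import subprocess
import tempfile
from pathlib import Path


def run_task(command: str, workdir: Path) -> int:
    """Run a shell command inside workdir and return its exit code."""
    result = subprocess.run(
        command, shell=True, cwd=workdir, capture_output=True, text=True
    )
    return result.returncode


def state_is_correct(workdir: Path) -> bool:
    """Hypothetical check: out/report.txt must exist and contain 'done'."""
    target = workdir / "out" / "report.txt"
    return target.is_file() and target.read_text().strip() == "done"


def evaluate(command: str) -> bool:
    """Run the command in a fresh sandbox directory, then inspect the end state."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        run_task(command, workdir)
        # Only the final state matters, not how the command got there.
        return state_is_correct(workdir)
```

Under these assumptions, `evaluate("mkdir -p out && echo done > out/report.txt")` passes, while `evaluate("echo done")` fails: the second command prints the right word but leaves the required file uncreated.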
Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard; scores for such models, including the Claude entries below, come from provider-reported benchmarks.
| # | Model | Score |
|---|---|---|
| 1 | Claude Opus 4 | 39.2% |
| 2 | Claude Sonnet 4 | 35.5% |
| 3 | Claude 3.7 Sonnet | 35.2% |