Tool and Agent Use Benchmark 2
Tests models on autonomous tool use and agentic task completion in realistic web and computer interaction scenarios
Models must complete multi-step tasks using tools (web search, code execution, API calls) in realistic scenarios. Evaluates planning, tool selection, error recovery, and goal completion across diverse domains.
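To make the evaluated behaviors concrete, here is a minimal sketch of a multi-step task runner with tool selection, error recovery, and a completion-rate score. It is an illustration under stated assumptions, not the benchmark's actual harness: the tool names (`calculator`, `flaky_search`, `backup_search`), the plan format, and the idea that the score is the percentage of tasks completed are all hypothetical and not taken from the source.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tools; names and behavior are illustrative assumptions only.
def calculator(arg: str) -> str:
    a, b = arg.split("+")
    return str(int(a) + int(b))

def flaky_search(arg: str) -> str:
    # Simulates a tool failure the agent has to recover from.
    raise RuntimeError("search backend unavailable")

def backup_search(arg: str) -> str:
    return f"result for {arg}"

TOOLS: Dict[str, Callable[[str], str]] = {
    "calculator": calculator,
    "flaky_search": flaky_search,
    "backup_search": backup_search,
}

# One step = (candidate tools in order of preference, argument to pass).
Step = Tuple[List[str], str]

def run_task(plan: List[Step], tools: Dict[str, Callable[[str], str]]) -> bool:
    """Return True only if every step succeeds with one of its candidate tools."""
    for candidates, arg in plan:
        for name in candidates:
            try:
                tools[name](arg)
                break  # step succeeded, move on to the next step
            except Exception:
                continue  # error recovery: fall back to the next candidate
        else:
            return False  # every candidate failed, so the task fails
    return True

def completion_rate(plans: List[List[Step]],
                    tools: Dict[str, Callable[[str], str]]) -> float:
    """Percentage of tasks completed end to end (assumed scoring rule)."""
    results = [run_task(plan, tools) for plan in plans]
    return 100.0 * sum(results) / len(results)

example_plans: List[List[Step]] = [
    # Recovers from the failing search by falling back, then calculates.
    [(["flaky_search", "backup_search"], "weather"), (["calculator"], "2+3")],
    # No fallback available, so this task fails.
    [(["flaky_search"], "weather")],
]
```

In this sketch, `completion_rate(example_plans, TOOLS)` counts a task as solved only when every step succeeds, so a single unrecovered tool failure fails the whole task.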
Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard; their scores come from provider-reported benchmarks.
| # | Model | Score |
|---|---|---|
| 1 | Claude Sonnet 4 | 60.0% |
| 2 | Claude Opus 4 | 59.6% |
| 3 | Claude 3.7 Sonnet | 58.4% |
| 4 | o1 | 50.0% |
| 5 | o4-mini | 49.2% |
| 6 | Claude 3.5 Sonnet | 46.0% |
| 7 | GPT-4o | 42.8% |
| 8 | o3 Mini | 32.4% |