TAU2

Tool and Agent Use Benchmark 2

agent | Score: 0-100 (% tasks completed) | 8 models scored

About

Tests models on autonomous tool use and agentic task completion in realistic web and computer interaction scenarios.

Methodology

Models must complete multi-step tasks using tools (web search, code execution, API calls) in realistic scenarios. Evaluates planning, tool selection, error recovery, and goal completion across diverse domains.
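As a rough illustration of how a "% tasks completed" score on a 0-100 scale could be derived, the sketch below averages per-task pass/fail outcomes. It is not the benchmark's actual scoring code: the `TaskResult` type, the `completion_score` function, and the task names are all hypothetical, and the sketch assumes each task is graded as simply completed or not.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one agent run (hypothetical structure, for illustration only)."""
    task_id: str     # hypothetical task identifier
    completed: bool  # True if the agent reached the task goal


def completion_score(results: list[TaskResult]) -> float:
    """Return the percentage of completed tasks on a 0-100 scale."""
    if not results:
        raise ValueError("no task results to score")
    completed = sum(1 for r in results if r.completed)
    return 100.0 * completed / len(results)


# Made-up outcomes: three of five tasks completed.
example_results = [
    TaskResult("task-1", True),
    TaskResult("task-2", False),
    TaskResult("task-3", True),
    TaskResult("task-4", True),
    TaskResult("task-5", False),
]
print(f"{completion_score(example_results):.1f}%")
```

Under this assumption, a leaderboard entry such as 60.0% would correspond to an agent completing 60 of every 100 tasks.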


Model Leaderboard

Shows open-weight models only. Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard — their scores come from provider-reported benchmarks.

#	Model	Score
1	Claude Sonnet 4	60.0%
2	Claude Opus 4	59.6%
3	Claude 3.7 Sonnet	58.4%
4	o1	50.0%
5	o4-mini	49.2%