Tool and Agent Use Benchmark 2
Tests models on autonomous tool use and agentic task completion in realistic web and computer interaction scenarios
Models must complete multi-step tasks using tools such as web search, code execution, and API calls. The benchmark evaluates planning, tool selection, error recovery, and goal completion across diverse domains.
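As a rough illustration of what a single multi-step episode could look like, here is a minimal Python sketch. It is not the benchmark's actual harness: the tool implementations, the `EpisodeResult` fields, the `run_episode` function, and the fixed plan format are all hypothetical names introduced only for this example. It shows one plausible way to record tool calls, count recoveries from tool errors, and check goal completion.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in tools; a real harness would call live services.
def web_search(query: str) -> str:
    return f"results for {query}"

def code_execution(expr: str) -> str:
    # Evaluates a simple arithmetic expression with builtins disabled.
    return str(eval(expr, {"__builtins__": {}}, {}))

TOOLS: Dict[str, Callable[[str], str]] = {
    "web_search": web_search,
    "code_execution": code_execution,
}

@dataclass
class EpisodeResult:
    steps_taken: int = 0          # tool calls that succeeded
    errors_recovered: int = 0     # steps that succeeded only after a retry
    completed: bool = False       # whether the goal check passed
    trace: List[Tuple[str, str, str]] = field(default_factory=list)

def run_episode(
    plan: List[Tuple[str, str, str]],
    goal_check: Callable[[List[Tuple[str, str, str]]], bool],
    max_retries: int = 1,
) -> EpisodeResult:
    """Run a plan of (tool_name, argument, fallback_argument) steps."""
    result = EpisodeResult()
    for tool_name, arg, fallback in plan:
        tool = TOOLS[tool_name]
        attempts = [arg] + [fallback] * max_retries
        for i, attempt in enumerate(attempts):
            try:
                output = tool(attempt)
            except Exception as exc:
                # Record the failure and try the fallback argument next.
                result.trace.append((tool_name, attempt, f"error: {type(exc).__name__}"))
                continue
            result.trace.append((tool_name, attempt, output))
            result.steps_taken += 1
            if i > 0:
                result.errors_recovered += 1
            break
        else:
            # Every attempt failed, so the task ends incomplete.
            return result
    result.completed = goal_check(result.trace)
    return result

# Example: the second step fails once (division by zero), then recovers.
demo_plan = [
    ("web_search", "example query", "example query"),
    ("code_execution", "10 / 0", "10 / 2"),
]
demo_result = run_episode(demo_plan, lambda trace: trace[-1][2] == "5.0")
```

In this sketch the plan is fixed in advance to keep the example deterministic; in a real agentic evaluation the model under test would choose each tool call itself, and the harness would only observe and score the resulting trace.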
No model scores recorded yet
Scores will appear here as the pipeline processes model data