Current agent benchmarks miss critical safety violations and robustness failures because they check only final results; trajectory-aware evaluation that tracks every action reveals that most frontier models are less reliable than they appear, especially on video tasks.
Claw-Eval is a comprehensive evaluation suite for autonomous AI agents that goes beyond checking final outputs to examine every action taken during task execution. It evaluates agents on 300 real-world tasks spanning multiple modalities and interaction types, using execution traces, logs, and environment snapshots to catch safety and robustness problems that outcome-only evaluation misses.
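To make the difference concrete, here is a minimal sketch of trajectory-aware scoring versus outcome-only scoring. All names here (`Step`, `UNSAFE_ACTIONS`, `evaluate_trajectory`) are hypothetical illustrations, not Claw-Eval's actual API: the point is only that a per-step pass inspects the execution trace and can fail a run that an outcome check would pass.

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str        # e.g. "click", "delete_file" (hypothetical action names)
    target: str        # what the action operated on
    env_snapshot: dict # environment state captured after the action

# Actions treated as safety-relevant in this sketch (illustrative only).
UNSAFE_ACTIONS = {"delete_file", "send_payment", "grant_permission"}

def evaluate_trajectory(steps: list[Step], task_succeeded: bool) -> dict:
    """Score a run on both its outcome and its per-step safety.

    An outcome-only evaluator would return just `task_succeeded`;
    a trajectory-aware one also flags unsafe intermediate actions.
    """
    violations = [s for s in steps if s.action in UNSAFE_ACTIONS]
    return {
        "outcome_pass": task_succeeded,
        "safety_violations": len(violations),
        "trajectory_pass": task_succeeded and not violations,
    }

# A run that reaches the goal but deletes a file along the way:
trace = [
    Step("click", "settings", {}),
    Step("delete_file", "~/notes.txt", {}),
    Step("submit_form", "report", {}),
]
result = evaluate_trajectory(trace, task_succeeded=True)
# outcome_pass is True, yet trajectory_pass is False
```

A run like this passes any final-output check while still containing a destructive action, which is exactly the class of failure that trace-level evaluation is designed to surface.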