Current agent benchmarks miss critical safety violations and robustness failures because they check only final results; trajectory-aware evaluation that tracks every action reveals that most frontier models are less reliable than they appear, especially on video tasks.
Claw-Eval is a comprehensive evaluation suite for autonomous AI agents that goes beyond checking final outputs to examine every action taken during task execution. It evaluates agents on 300 real-world tasks spanning multiple modalities and interaction types, using execution traces, logs, and environment snapshots to catch safety and robustness problems that outcome-only evaluation misses.
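To make the difference concrete, here is a minimal sketch of trajectory-aware scoring versus outcome-only scoring. All names here (`Step`, `UNSAFE_ACTIONS`, `evaluate_trajectory`) are hypothetical illustrations, not Claw-Eval's actual API: the point is only that a per-step pass inspects the execution trace and can fail a run that an outcome check would pass.

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str        # e.g. "click", "delete_file" (hypothetical action names)
    target: str        # what the action operated on
    env_snapshot: dict # environment state captured after the action

# Actions treated as safety-relevant in this sketch (illustrative only).
UNSAFE_ACTIONS = {"delete_file", "send_payment", "grant_permission"}

def evaluate_trajectory(steps: list[Step], task_succeeded: bool) -> dict:
    """Score a run on both its outcome and its per-step safety.

    An outcome-only evaluator would return just `task_succeeded`;
    a trajectory-aware one also flags unsafe intermediate actions.
    """
    violations = [s for s in steps if s.action in UNSAFE_ACTIONS]
    return {
        "outcome_pass": task_succeeded,
        "safety_violations": len(violations),
        "trajectory_pass": task_succeeded and not violations,
    }

# A run that reaches the goal but deletes a file along the way:
trace = [
    Step("click", "settings", {}),
    Step("delete_file", "~/notes.txt", {}),
    Step("submit_form", "report", {}),
]
result = evaluate_trajectory(trace, task_succeeded=True)
# outcome_pass is True, yet trajectory_pass is False
```

A run like this passes any final-output check while still containing a destructive action, which is exactly the class of failure that trace-level evaluation is designed to surface.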