Current AI agents struggle with real-world web tasks that require document understanding, multi-step navigation, and detailed form-filling; even frontier models succeed on less than 40% of everyday online tasks.
ClawBench is a benchmark of 153 real-world online tasks, such as booking appointments, filling out forms, and submitting applications, spanning 144 live websites; it tests whether AI agents can handle everyday online work.