Enterprise agents need evaluation frameworks that measure artifact quality, cost, runtime, and skill transfer, not just task completion, because real workplace tasks are complex and heterogeneous and demand reproducible, auditable results.
This paper introduces EnterpriseClawBench, a benchmark for evaluating AI agents in real workplace environments. Built from actual enterprise sessions, it contains 852 reproducible tasks involving file handling, tool use, and business artifact creation.