EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions

Jincheng Zhong, Weizhi Wang, Che Jiang, Kai Tian, Zhenzhao Yuan et al. | June 22, 2026 | arXiv

Key Takeaway

Enterprise agents need evaluation frameworks that measure artifact quality, cost, runtime, and skill transfer, not just task completion, because real workplace tasks are complex and heterogeneous, and their results must be reproducible and auditable.

Summary

This paper introduces EnterpriseClawBench, a benchmark for evaluating AI agents in real workplace environments. Built from actual enterprise sessions, it contains 852 reproducible tasks involving file handling, tool use, and business artifact creation.

evaluation, agents, applications

Key Terms

agentic-tasks, artifact-delivery, tool-use, semantic-rubrics, skill-transfer-behavior