From Model Scaling to System Scaling: Scaling the Harness in Agentic AI

Shangding Gu | May 25, 2026 | arXiv

Key Takeaway

Agent performance depends as much on system design (memory, routing, verification) as on model capability; evaluating agents requires measuring trajectory quality and system hygiene, not just final outcomes.

Summary

This paper argues that building better AI agents requires attention to the system architecture around the language model, not just larger models. It introduces the concept of 'scaling the harness', meaning the design of the memory, tool-use, verification, and orchestration layers that turn a model into a working agent, and proposes benchmarks that measure agent quality beyond task success alone.
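
To make the idea of a harness concrete, here is a minimal Python sketch of the four layers named in the summary wrapped around a model call. It is an illustration under stated assumptions, not the paper's implementation: the `Harness` class, its field names, the `tool:argument` action format, and `toy_model` are all hypothetical and do not come from the source.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Harness:
    """Hypothetical harness: memory, tool use, verification, orchestration."""

    model: Callable[[str], str]                  # stand-in for a language model call
    tools: dict[str, Callable[[str], str]]       # tool-use layer
    verify: Callable[[str], bool]                # verification layer
    memory: list[str] = field(default_factory=list)                  # memory layer
    trajectory: list[tuple[str, str]] = field(default_factory=list)  # step log

    def run(self, task: str, max_steps: int = 3) -> Optional[str]:
        """Orchestration loop: ask the model, call a tool, verify, remember."""
        for _ in range(max_steps):
            context = "\n".join(self.memory + [task])
            action = self.model(context)         # assumed format: "tool:argument"
            tool_name, _, arg = action.partition(":")
            tool = self.tools.get(tool_name)
            result = tool(arg) if tool else f"unknown tool {tool_name!r}"
            # Record every step so the trajectory can be inspected afterwards.
            self.trajectory.append((action, result))
            self.memory.append(f"{action} -> {result}")
            if tool and self.verify(result):
                return result
        return None


def toy_model(context: str) -> str:
    # Toy policy: first picks a tool that does not exist, then corrects itself
    # once the failure shows up in the memory it is given as context.
    return "upper:hello" if "unknown tool" in context else "shout:hello"


harness = Harness(
    model=toy_model,
    tools={"upper": str.upper},
    verify=lambda out: out.isupper(),
)
answer = harness.run("uppercase the word hello")
```

In this toy run the final answer alone hides the failed first step; only the recorded `trajectory` shows it, which is the kind of signal the summary has in mind when it speaks of measuring more than task success.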

agents architecture evaluation

Key Terms

agentic-systems agent-harness skill-routing context-governance trustworthy-memory