Agent performance depends as much on system design (memory, routing, verification) as on model capability; evaluating agents requires measuring trajectory quality and system hygiene, not just final outcomes.
This paper argues that building better AI agents requires focusing on the system architecture around language models, not just making the models bigger. It introduces the concept of 'scaling the harness', meaning the design of the memory, tool-use, verification, and orchestration layers that turn a model into a working agent, and proposes benchmarks that measure agent quality beyond task success.
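To make the four harness layers concrete, the following is a minimal sketch rather than the paper's implementation. Every name in it (`Harness`, `Step`, `toy_model`, `trajectory_report`, and the `verified_fraction` metric) is a hypothetical stand-in chosen for illustration, and the "model" is a hard-coded policy instead of a real language model. It shows a loop that orchestrates tool calls, verifies each result, writes verified results to memory, and records a trajectory that can be scored separately from the final answer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One recorded tool call in the agent's trajectory."""
    tool: str
    arg: str
    result: str
    verified: bool


@dataclass
class Harness:
    """Hypothetical harness wrapping a model with the four layers named in the text."""
    model: Callable[[str, List[str]], Tuple[str, str]]  # (task, memory) -> (tool, arg)
    tools: Dict[str, Callable[[str], str]]               # tool-use layer
    verify: Callable[[str], bool]                        # verification layer
    memory: List[str] = field(default_factory=list)      # memory layer
    trajectory: List[Step] = field(default_factory=list)

    def run(self, task: str, max_steps: int = 5) -> Optional[str]:
        """Orchestration layer: loop until the model finishes or steps run out."""
        for _ in range(max_steps):
            tool, arg = self.model(task, self.memory)
            if tool == "finish":
                return arg
            result = self.tools[tool](arg)
            ok = self.verify(result)
            self.trajectory.append(Step(tool, arg, result, ok))
            if ok:
                # Only verified results are allowed into memory.
                self.memory.append(result)
        return None


def toy_model(task: str, memory: List[str]) -> Tuple[str, str]:
    """Stand-in for a language model: call 'add' once, then finish with the result."""
    if not memory:
        return ("add", "2,3")
    return ("finish", memory[-1])


def trajectory_report(h: Harness, answer: Optional[str], expected: str) -> dict:
    """Score the run on more than the final outcome (metrics are illustrative)."""
    steps = len(h.trajectory)
    verified = sum(1 for s in h.trajectory if s.verified)
    return {
        "success": answer == expected,
        "steps": steps,
        "verified_fraction": verified / steps if steps else 0.0,
    }


harness = Harness(
    model=toy_model,
    tools={"add": lambda s: str(sum(int(x) for x in s.split(",")))},
    verify=lambda s: s.lstrip("-").isdigit(),
)
answer = harness.run("add 2 and 3")
report = trajectory_report(harness, answer, expected="5")
print(report)
```

The report separates the final outcome (`success`) from properties of the path taken (`steps`, `verified_fraction`), which is the kind of distinction the summary draws between task success and trajectory quality. A real benchmark would need far richer measures than these two.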