AI agents are expensive and unpredictable: token costs vary wildly (up to a 30x difference on the same task), models differ dramatically in efficiency, and even frontier models can't accurately predict their own token usage before running.
This paper analyzes how much AI agents spend on tokens when solving coding tasks. The researchers studied eight frontier LLMs on real-world coding benchmarks and found that agentic tasks consume roughly 1000x more tokens than simpler coding tasks, with large variability between runs. Surprisingly, spending more tokens doesn't guarantee better results: accuracy often peaks at an intermediate cost level and then plateaus.
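As a back-of-the-envelope illustration, a 30x spread in token usage on the same task translates directly into a 30x spread in dollar cost. The sketch below uses made-up per-token prices and token counts (none are from the paper) purely to show the arithmetic:

```python
# Toy illustration with hypothetical numbers: how per-task token
# variability translates into cost spread. The price and token
# counts below are invented for illustration, not from the paper.

PRICE_PER_1K_TOKENS = 0.01  # hypothetical flat price, in dollars


def run_cost(tokens_used: int) -> float:
    """Dollar cost of a single agent run at the flat price above."""
    return tokens_used / 1000 * PRICE_PER_1K_TOKENS


# Hypothetical token counts for the same task across runs/models,
# spanning the kind of 30x spread described above.
runs = [50_000, 200_000, 1_500_000]

cheapest, priciest = min(runs), max(runs)
print(f"spread: {priciest / cheapest:.0f}x")
print(f"cost range: ${run_cost(cheapest):.2f} to ${run_cost(priciest):.2f}")
```

The same arithmetic explains why per-run budgeting is hard: without knowing in advance where a run will land in that range, the expected cost of an agentic task is dominated by its worst-case token usage.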