More reasoning isn't always better for function-calling agents: brief, structured reasoning (8-32 tokens) helps the model select the right function, while longer reasoning leads it to hallucinate invalid functions.
This paper studies how much reasoning, measured in tokens, language agents should do before calling functions. Testing on 200 tasks, the authors find a surprising non-monotonic pattern: a brief 32-token reasoning budget improves accuracy by 45%, whereas longer reasoning hurts performance.
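
The kind of measurement summarized above, accuracy and invalid-function rate as a function of reasoning budget, can be sketched as a small evaluation loop. This is a minimal illustration, not the authors' harness: the `Task` fields, the `Agent` callable signature, and the `sweep_budgets` helper are all hypothetical names introduced here, and the agent is assumed to return only the name of the function it chose.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Sequence


@dataclass(frozen=True)
class Task:
    """One function-calling task (hypothetical schema, for illustration only)."""
    prompt: str
    allowed_functions: Sequence[str]  # names the agent may legally call
    expected_function: str            # the correct choice for this task


# Assumed agent interface: (task, reasoning budget in tokens) -> chosen function name.
Agent = Callable[[Task, int], str]


def sweep_budgets(
    agent: Agent, tasks: List[Task], budgets: Iterable[int]
) -> Dict[int, Dict[str, float]]:
    """Run every task at each budget and record accuracy and invalid-call rate."""
    if not tasks:
        raise ValueError("tasks must not be empty")
    results: Dict[int, Dict[str, float]] = {}
    for budget in budgets:
        correct = 0
        invalid = 0
        for task in tasks:
            chosen = agent(task, budget)
            if chosen not in task.allowed_functions:
                invalid += 1  # hallucinated a function outside the allowed set
            elif chosen == task.expected_function:
                correct += 1
        results[budget] = {
            "accuracy": correct / len(tasks),
            "invalid_rate": invalid / len(tasks),
        }
    return results
```

Passing a real model wrapper as `agent` and a range of budgets such as `[0, 8, 32, 128, 512]` would give one row per budget. A non-monotonic pattern would show up as accuracy peaking at a small budget while `invalid_rate` rises at larger ones. The budget values here are placeholders, not the paper's exact grid.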