Retrieval is not the main bottleneck for frontier models on deep research tasks. They fail primarily at deriving answers from the evidence they collect and at calibrating their confidence, which suggests that future improvements should focus on reasoning and verification rather than search.
DeepWeb-Bench is a challenging benchmark for evaluating AI agents that answer research questions by searching the web, collecting evidence, and reasoning over it. Unlike existing benchmarks, it focuses on tasks that require large-scale evidence gathering, cross-source verification, and complex multi-step reasoning, all areas where current frontier models still struggle significantly.