LLM citations are unreliable at scale, but the problem is measurable and fixable: models equipped with URL-checking tools can self-correct, cutting hallucinated citations from 3-13% to under 1%.
This paper shows that 3-13% of citation URLs produced by LLMs and research agents are fabricated outright (hallucinated), while another 5-18% are broken. The authors measure this across 10+ models and 200k+ URLs, then release urlhealth, a tool that checks whether URLs are real using the Wayback Machine and helps models self-correct, reducing broken citations by up to 79x.
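
The summary only tells us that urlhealth validates URLs via the Wayback Machine, so the sketch below is an illustration of that kind of check, not urlhealth's actual interface. It combines a live HTTP probe with the Wayback Machine's public availability API (archive.org/wayback/available); the function name `check_url` and the returned fields are hypothetical.

```python
# Minimal sketch of a URL-health check, assuming the approach described above:
# a hallucinated URL typically has neither a live page nor an archived capture.
import requests

WAYBACK_API = "https://archive.org/wayback/available"

def check_url(url: str, timeout: float = 10.0) -> dict:
    """Classify a citation URL as live, archived-only, or broken/fabricated."""
    result = {"url": url, "live": False, "archived": False, "snapshot": None}

    # 1. Is the URL reachable right now? (Some servers reject HEAD; a
    #    production checker would fall back to GET on failure.)
    try:
        resp = requests.head(url, timeout=timeout, allow_redirects=True)
        result["live"] = resp.status_code < 400
    except requests.RequestException:
        pass  # unreachable: fall through to the archive check

    # 2. Does the Wayback Machine hold a snapshot? The availability API
    #    returns the closest archived capture for a URL, if any exists.
    try:
        resp = requests.get(WAYBACK_API, params={"url": url}, timeout=timeout)
        snap = resp.json().get("archived_snapshots", {}).get("closest")
        if snap and snap.get("available"):
            result["archived"] = True
            result["snapshot"] = snap["url"]
    except (requests.RequestException, ValueError):
        pass  # treat API errors as "no snapshot found"

    return result

if __name__ == "__main__":
    # A URL with neither "live" nor "archived" set is a likely hallucination.
    print(check_url("https://example.com"))
```

A self-correcting loop would feed this verdict back to the model: keep live URLs, swap dead-but-archived ones for their snapshot, and ask the model to re-source anything that is neither.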