Even with web search enabled, LLMs rely heavily on parametric memory for citations and fail on recent papers; a two-stage pipeline separating retrieval from revision reduces errors more effectively than improving the base model alone.
Large language models with web search still make frequent errors in BibTeX citations for scientific papers, especially recent or obscure ones. This paper benchmarks three frontier models on 931 papers, identifies two types of citation errors, and shows that a two-stage retrieval-then-revision approach using deterministic tools raises the fraction of fully correct entries from 51% to 78%.
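The two-stage idea can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function names (`fetch_metadata`, `revise_entry`) and the in-memory lookup table are assumptions standing in for real deterministic tools, which would query a bibliographic database rather than a hard-coded catalog.

```python
# Hedged sketch of a retrieval-then-revision pipeline for BibTeX entries.
# Stage 1 retrieves ground-truth metadata deterministically; stage 2 patches
# the LLM-drafted entry, trusting retrieval over parametric memory.

def fetch_metadata(title: str) -> dict:
    """Stage 1: retrieval. Stubbed with a tiny lookup table; a real pipeline
    would query a bibliographic database or search API instead."""
    catalog = {
        "attention is all you need": {
            "author": "Vaswani, Ashish and others",
            "year": "2017",
            "booktitle": "Advances in Neural Information Processing Systems",
        },
    }
    return catalog.get(title.lower(), {})

def revise_entry(entry: dict) -> dict:
    """Stage 2: revision. Overwrite drafted fields with retrieved values,
    keeping the LLM's draft only where retrieval returned nothing."""
    retrieved = fetch_metadata(entry.get("title", ""))
    return {**entry, **retrieved}

# An LLM draft with a hallucinated year and venue:
draft = {
    "title": "Attention Is All You Need",
    "author": "Vaswani, Ashish and others",
    "year": "2018",       # wrong in the draft
    "booktitle": "ICML",  # wrong in the draft
}
fixed = revise_entry(draft)
print(fixed["year"], fixed["booktitle"])
# → 2017 Advances in Neural Information Processing Systems
```

The key design choice the paper's result motivates: because revision is a deterministic merge over retrieved fields, its correctness depends only on retrieval quality, not on the base model's memory.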