Even with web search enabled, LLMs rely heavily on parametric memory for citations and fail on recent papers; a two-stage pipeline separating retrieval from revision reduces errors more effectively than improving the base model alone.
Large language models with web search still make frequent errors in BibTeX citations for scientific papers, especially recent or obscure ones. This paper benchmarks three frontier models on 931 papers, identifies two types of citation errors, and shows that a two-stage retrieval-then-revision approach using deterministic tools raises the fraction of fully correct entries from 51% to 78%.
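The two-stage idea can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function names (`fetch_metadata`, `revise_entry`) and the in-memory lookup table are assumptions standing in for real deterministic tools, which would query a bibliographic database rather than a hard-coded catalog.

```python
# Hedged sketch of a retrieval-then-revision pipeline for BibTeX entries.
# Stage 1 retrieves ground-truth metadata deterministically; stage 2 patches
# the LLM-drafted entry, trusting retrieval over parametric memory.

def fetch_metadata(title: str) -> dict:
    """Stage 1: retrieval. Stubbed with a tiny lookup table; a real pipeline
    would query a bibliographic database or search API instead."""
    catalog = {
        "attention is all you need": {
            "author": "Vaswani, Ashish and others",
            "year": "2017",
            "booktitle": "Advances in Neural Information Processing Systems",
        },
    }
    return catalog.get(title.lower(), {})

def revise_entry(entry: dict) -> dict:
    """Stage 2: revision. Overwrite drafted fields with retrieved values,
    keeping the LLM's draft only where retrieval returned nothing."""
    retrieved = fetch_metadata(entry.get("title", ""))
    return {**entry, **retrieved}

# An LLM draft with a hallucinated year and venue:
draft = {
    "title": "Attention Is All You Need",
    "author": "Vaswani, Ashish and others",
    "year": "2018",       # wrong in the draft
    "booktitle": "ICML",  # wrong in the draft
}
fixed = revise_entry(draft)
print(fixed["year"], fixed["booktitle"])
# → 2017 Advances in Neural Information Processing Systems
```

The key design choice the paper's result motivates: because revision is a deterministic merge over retrieved fields, its correctness depends only on retrieval quality, not on the base model's memory.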