GhostCite: A Large-Scale Analysis of Citation Validity in the Age of Large Language Models


Citations provide the basis for trusting scientific claims; when they are invalid or fabricated, this trust collapses. With the advent of Large Language Models (LLMs), this risk has intensified: LLMs are increasingly used for academic writing, yet their tendency to fabricate citations ("ghost citations") poses a systemic threat to citation validity. To quantify this threat and inform mitigation, we develop CiteVerifier, an open-source framework for large-scale citation verification, and conduct the first comprehensive study of citation validity in the LLM era through three experiments built on it. We benchmark 13 state-of-the-art LLMs on citation generation across 40 research domains, finding that all models hallucinate citations at rates from 14.23% to 94.93%, with significant variation across research domains. Moreover, we analyze 2.2 million citations from 56,381 papers published at top-tier AI/ML and Security venues (2020–2025), confirming that 1.07% of papers contain invalid or fabricated citations (604 papers), with an 80.9% increase in 2025 alone. Furthermore, we survey 97 researchers and analyze 94 valid responses after removing 3 conflicting samples, revealing a critical "verification gap": 41.5% of researchers copy-paste BibTeX without checking and 44.4% choose no-action responses when encountering suspicious references; meanwhile, 76.7% of reviewers do not thoroughly check references and 80.0% never suspect fake citations. Our findings reveal an accelerating crisis where unreliable AI tools, combined with inadequate human verification by researchers and insufficient peer review scrutiny, enable fabricated citations to contaminate the scientific record. We propose interventions for researchers, venues, and tool developers to protect citation integrity.


💡 Research Summary

The paper “GhostCite: A Large‑Scale Analysis of Citation Validity in the Age of Large Language Models” investigates a newly emerging form of scholarly misconduct: the generation of fabricated references, termed “ghost citations,” by large language models (LLMs) used for academic writing. The authors develop an open‑source framework called CiteVerifier to automatically parse, retrieve, and validate citations at scale, and they conduct three complementary experiments to answer three research questions: (Q1) How prevalent are ghost citations in LLM outputs? (Q2) Have they already contaminated the published scientific record? (Q3) Why do existing verification mechanisms fail?

CiteVerifier Architecture
CiteVerifier combines a hybrid parsing pipeline (GROBID plus an LLM‑assisted fallback) with a cascaded retrieval strategy: a local cache of bibliographic records, queries to academic databases (DBLP, Google Scholar, Crossref), and finally a web search fallback. Retrieved records are compared to the parsed citation using a similarity‑based classifier that outputs “Valid,” “Suspicious,” or “Invalid.” The system also caches results and uses LLM‑assisted re‑parsing to handle non‑standard formats.
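The cascaded lookup and three-way verdict described above can be sketched as follows. This is a minimal illustration of the idea, not the authors' code: the function names, the database stubs, the use of `difflib` as the similarity measure, and the thresholds are all assumptions for the sketch.

```python
from difflib import SequenceMatcher

# Hypothetical retrieval stubs; the real system queries DBLP, Google Scholar,
# Crossref, and a web-search fallback. Here they find nothing.
def query_dblp(title: str): return None
def query_crossref(title: str): return None
def web_search(title: str): return None

CACHE: dict[str, dict] = {}  # local cache of previously resolved records

def lookup(citation_key: str, parsed_title: str):
    """Cascade: local cache -> academic databases -> web-search fallback."""
    if citation_key in CACHE:
        return CACHE[citation_key]
    for source in (query_dblp, query_crossref, web_search):
        record = source(parsed_title)
        if record is not None:
            CACHE[citation_key] = record  # cache hits for later runs
            return record
    return None  # untraceable citation

def classify(parsed_title: str, record, valid_t: float = 0.9,
             suspicious_t: float = 0.6) -> str:
    """Compare the parsed citation against the retrieved record.

    Thresholds are illustrative; the paper's classifier is similarity-based
    but its exact features and cutoffs are not given in this summary.
    """
    if record is None:
        return "Invalid"
    sim = SequenceMatcher(None, parsed_title.lower(),
                          record["title"].lower()).ratio()
    if sim >= valid_t:
        return "Valid"
    return "Suspicious" if sim >= suspicious_t else "Invalid"
```

An exact title match would return "Valid", a partial match "Suspicious", and an unretrievable reference "Invalid", mirroring the three labels the summary attributes to the system.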

Experiment I – LLM Benchmark (Q1)
Thirteen state‑of‑the‑art LLMs (including GPT‑5, Claude‑4, DeepSeek, Hunyuan, etc.) were prompted to generate reference lists across 40 computer‑science sub‑domains aligned with arXiv categories. In total 375,440 citations were produced from 22,800 API calls (≈ $800 cost). All models hallucinated citations, with rates ranging from 14.23 % (DeepSeek) to 94.93 % (Hunyuan). The hallucination rate varied by domain by up to 51.39 percentage points, indicating strong domain sensitivity. The authors also experimented with prompting strategies (search‑augmented vs. chain‑of‑thought) and found that even the best‑performing configurations still produced substantial false references.
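The two headline metrics of Experiment I, per-model hallucination rate and per-domain spread, reduce to simple ratios. The sketch below uses invented counts purely for illustration; the per-domain figures are not the paper's data.

```python
def hallucination_rate(fabricated: int, total: int) -> float:
    """Percentage of generated citations that could not be verified."""
    return 100.0 * fabricated / total

# Illustrative per-domain rates (percent) for one hypothetical model.
domain_rates = {"cs.CR": 30.0, "cs.LG": 45.0, "cs.CL": 60.0}

# Domain sensitivity: gap between the best and worst domain for this model.
spread = max(domain_rates.values()) - min(domain_rates.values())
```

With these made-up numbers the spread is 30 percentage points; the paper reports gaps of up to 51.39 percentage points across its 40 real domains.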

Experiment II – Archival Analysis (Q2)
The authors collected 56,381 papers from eight top AI/ML and security venues (NeurIPS, ICML, IJCAI, AAAI, IEEE S&P, USENIX Security, CCS, NDSS) published between 2020 and 2025, totaling 2,199,409 citations. CiteVerifier automatically flagged 2,530 citations as potentially invalid. A manual verification team of 16 researchers spent roughly one month reviewing these flags, confirming 739 citations in 604 distinct papers (1.07 % of the corpus) as definitively invalid—either severely malformed or untraceable. The proportion of papers with invalid citations rose sharply in 2025, showing an 80.9 % increase over the 2020‑2024 average, suggesting rapid acceleration. Moreover, some fabricated citations appeared in up to 16 different papers, evidencing error propagation across the literature.
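The headline prevalence figure can be checked directly from the counts reported above:

```python
# Arithmetic check of the archival-analysis rate: 604 confirmed papers
# out of a 56,381-paper corpus.
confirmed_papers = 604
corpus_size = 56_381

share = 100 * confirmed_papers / corpus_size  # percent of papers affected
# share rounds to 1.07, matching the figure quoted in the text
```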

Experiment III – User Study (Q3)
A survey was distributed to 300 randomly selected authors and program committee members from the same eight venues; 97 responses were received, of which 94 were deemed valid after removing three conflicting entries. Among respondents who use AI tools (n = 86), 87.2 % reported using them for research tasks and 86.7 % claimed they “always verify” AI‑generated citations. However, self‑reported behavior diverged from practice: 41.5 % admitted to copy‑pasting BibTeX entries without checking, and 44.4 % would take no action when encountering a suspicious reference. Of the 30 reviewer respondents, 76.7 % said they do not thoroughly check references, and 80.0 % never suspect fabricated citations. Overall, 74.5 % view peer review as ineffective at catching metadata errors, while 70.2 % strongly support automated citation checks.

Key Insights

  1. LLM‑Generated Ghost Citations Are Widespread – Even the most advanced models produce a non‑trivial fraction of fabricated references, and the problem is highly domain‑dependent.
  2. The Published Record Is Already Contaminated – Over one percent of recent AI/ML and security papers contain invalid citations, with a steep upward trend and evidence of repeated propagation.
  3. Human Verification Is Systematically Weak – Researchers and reviewers rely on trust‑by‑default norms; many copy‑paste citations without verification, and reviewers rarely scrutinize reference lists. This creates a “verification gap” that allows ghost citations to persist.
  4. Scalable Automated Verification Is Feasible – CiteVerifier demonstrates that a combination of robust parsing, multi‑source retrieval, caching, and similarity‑based classification can process millions of citations with acceptable precision, though coverage gaps in bibliographic databases remain a limitation.

Proposed Interventions

  • Adopt Automated Verification Pipelines: Integrate tools like CiteVerifier into conference submission systems and journal editorial workflows to flag suspicious references before peer review.
  • Policy Enhancements: Require explicit disclosure of AI assistance in manuscript preparation and mandate citation verification as part of author responsibilities.
  • Community Education: Train researchers and reviewers on the risks of ghost citations and promote a culture of systematic reference checking.
  • Infrastructure Improvements: Expand and harmonize bibliographic databases, develop cross‑publisher APIs, and explore citation‑graph anomaly detection to complement text‑based verification.

Limitations and Future Work
The authors acknowledge that CiteVerifier’s recall is bounded by the completeness of external databases; some legitimate but obscure works may be misclassified. LLM‑assisted re‑parsing can introduce its own errors, and the manual verification step, while thorough, is labor‑intensive and may not scale without further automation. Future research directions include (a) multimodal verification that incorporates PDF content and metadata, (b) graph‑based detection of anomalous citation patterns, (c) longitudinal studies of policy impact, and (d) broader cross‑disciplinary analyses beyond computer science.

Conclusion
“GhostCite” provides the first large‑scale, empirical quantification of ghost citations in the LLM era, revealing a rapidly growing threat to scholarly integrity. By combining a scalable verification framework with extensive empirical analysis and a behavioral survey, the paper demonstrates that both technological and cultural interventions are required to safeguard the trust infrastructure of scientific communication.

