Structurally Human, Semantically Biased: Detecting LLM-Generated References with Embeddings and GNNs

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Large language models are increasingly used to curate bibliographies, raising the question: are their reference lists distinguishable from human ones? We build paired citation graphs, ground truth and GPT-4o-generated (from parametric knowledge), for 10,000 focal papers (≈ 275k references) from SciSciNet, and add a field-matched random baseline that preserves out-degree and field distributions while breaking latent structure. We compare (i) structure-only node features (degree/closeness/eigenvector centrality, clustering, edge count) with (ii) 3072-D title/abstract embeddings, using a Random Forest (RF) on graph-level aggregates and Graph Neural Networks (GNNs) with node features. Structure alone barely separates GPT from ground truth (RF accuracy ≈ 0.60) despite cleanly rejecting the random baseline (≈ 0.89–0.92). By contrast, embeddings sharply increase separability: an RF on aggregated embeddings reaches ≈ 0.83, and GNNs with embedding node features achieve 93% test accuracy on GPT vs. ground truth. We show the robustness of our findings by replicating the pipeline with Claude Sonnet 4.5 and with multiple embedding models (OpenAI and SPECTER), with RF separability for ground truth vs. Claude ≈ 0.77 and clean rejection of the random baseline. Thus, LLM bibliographies, generated purely from parametric knowledge, closely mimic human citation topology but leave detectable semantic fingerprints; detection and debiasing should target content signals rather than global graph structure.


💡 Research Summary

This paper investigates whether reference lists generated by large language models (LLMs) can be distinguished from those curated by humans, focusing on both the structural properties of the induced citation graphs and the semantic content of the cited works. Using the SciSciNet dataset, the authors select 10,000 “focal” papers (published in Q1 journals between 1999 and 2021, each with 3–54 references) amounting to roughly 275 k citations. For each focal paper they prompt GPT‑4o to suggest the same number of references that the original paper cites, using only the paper’s title, authors, venue, year, and abstract. The suggested references are matched against the SciSciNet database via fuzzy title/author matching to ensure they exist. An identical pipeline is repeated with Claude Sonnet 4.5 to test robustness.
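The matching step described above can be sketched with the standard library's `difflib`. This is a minimal illustration, not the authors' implementation: the paper does not specify the similarity measure or threshold, so both the use of `SequenceMatcher` and the 0.9 cutoff are assumptions.

```python
from difflib import SequenceMatcher

def fuzzy_match(candidate_title, db_titles, threshold=0.9):
    """Return the best-matching database title, or None if the best
    similarity falls below the (illustrative) threshold."""
    best_title, best_score = None, 0.0
    for title in db_titles:
        score = SequenceMatcher(None, candidate_title.lower(), title.lower()).ratio()
        if score > best_score:
            best_title, best_score = title, score
    return best_title if best_score >= threshold else None

# Hypothetical mini-database standing in for SciSciNet titles.
db = ["Attention Is All You Need", "Deep Residual Learning for Image Recognition"]
print(fuzzy_match("attention is all you need", db))  # Attention Is All You Need
```

In practice the paper matches on titles and authors jointly; a production matcher would combine several fields and handle near-duplicates, but the shape of the lookup is the same.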

Two families of graphs are constructed: (i) Ground‑truth graphs, containing the focal paper and its actual human‑cited references; (ii) Generated graphs, containing the focal paper and the LLM‑suggested references. Edges represent citation relationships retrieved from SciSciNet. To create a meaningful baseline, the authors generate field‑matched random graphs by permuting references within the same top‑field (or sub‑field) while preserving each focal paper’s out‑degree (reference count) and field‑level distributions of citation frequencies and publication years. A temporally constrained variant further ensures that resampled references precede the focal paper’s year. This baseline deliberately destroys latent citation structure while keeping observable statistics.
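The field-matched permutation can be sketched in pure Python. This is a simplified sketch under stated assumptions: it permutes references within fields across focal papers, preserving each paper's out-degree and per-paper field composition, but omits the paper's additional matching on citation-frequency and publication-year distributions and the temporal constraint.

```python
import random
from collections import defaultdict

def field_matched_baseline(focal_refs, ref_field, seed=0):
    """Permute references within each field across focal papers.

    focal_refs: {focal_id: [ref_id, ...]}  (out-degree is preserved)
    ref_field:  {ref_id: field}
    Returns a new assignment with identical per-paper reference counts
    and field composition, but with latent co-citation structure broken.
    """
    rng = random.Random(seed)
    # Pool all references by field, then shuffle each pool.
    pools = defaultdict(list)
    for refs in focal_refs.values():
        for r in refs:
            pools[ref_field[r]].append(r)
    for pool in pools.values():
        rng.shuffle(pool)
    # Reassign: each focal paper draws one reference per original field slot.
    return {focal: [pools[ref_field[r]].pop() for r in refs]
            for focal, refs in focal_refs.items()}
```

The temporally constrained variant described above would additionally restrict each pool to references published before the focal paper's year before drawing.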

Structural analysis: The authors extract five normalized node‑level metrics—degree centrality, closeness centrality, eigenvector centrality, clustering coefficient, and edge count. For each graph they compute four aggregate statistics (mean, median, inter‑quartile range, and max‑to‑mean ratio), yielding a compact set of interpretable descriptors that scale across graphs of varying size. Visualizations (box‑plots, scatter‑plots) show that GPT‑generated graphs closely mirror ground‑truth graphs in median values and dispersion for all metrics, including the presence of high‑degree hub nodes and moderate clustering. Random graphs, by contrast, cluster at low degree and near‑zero clustering, reflecting a tree‑like sparsity.
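The descriptor extraction above can be approximated with NetworkX. A sketch, assuming undirected graphs and using `statistics.quantiles` for the inter-quartile range; the exact normalization and any handling of edge cases in the paper may differ.

```python
import statistics
import networkx as nx

def graph_descriptors(G):
    """Aggregate node-level metrics (degree, closeness, eigenvector
    centrality, clustering) into graph-level descriptors via mean,
    median, IQR, and max-to-mean ratio, plus the raw edge count."""
    metrics = {
        "degree": nx.degree_centrality(G),
        "closeness": nx.closeness_centrality(G),
        "eigenvector": nx.eigenvector_centrality(G, max_iter=1000),
        "clustering": nx.clustering(G),
    }
    feats = {}
    for name, values in metrics.items():
        v = sorted(values.values())
        q = statistics.quantiles(v, n=4) if len(v) > 1 else [v[0]] * 3
        mean = statistics.fmean(v)
        feats[f"{name}_mean"] = mean
        feats[f"{name}_median"] = statistics.median(v)
        feats[f"{name}_iqr"] = q[2] - q[0]
        feats[f"{name}_max_to_mean"] = max(v) / mean if mean else 0.0
    feats["edge_count"] = G.number_of_edges()
    return feats

G = nx.karate_club_graph()  # small built-in example graph
print(graph_descriptors(G)["edge_count"])  # 78
```

The fixed-length output makes graphs of very different sizes comparable, which is what lets a single classifier operate across all 10,000 pairs.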

A Random Forest (RF) classifier trained on these structural descriptors achieves ≈ 60 % accuracy when distinguishing GPT‑generated from ground‑truth graphs—only marginally better than chance—yet it cleanly separates both from the random baseline with ≈ 0.90–0.92 accuracy. This demonstrates that LLMs can reproduce the global topology of human citation networks.
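The qualitative pattern, near-chance separation of near-identical feature distributions versus easy rejection of a shifted baseline, can be reproduced on synthetic data. The distributions below are invented stand-ins for the paper's structural descriptors, chosen only to mimic the reported gap.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 300, 20

# Synthetic stand-ins: GPT-like graphs sit almost on top of ground truth,
# while the field-matched random baseline is clearly shifted.
ground_truth = rng.normal(0.0, 1.0, (n, d))
gpt_like = rng.normal(0.1, 1.0, (n, d))
random_base = rng.normal(1.5, 1.0, (n, d))

def rf_accuracy(a, b):
    """Held-out accuracy of an RF separating two descriptor samples."""
    X = np.vstack([a, b])
    y = np.array([0] * len(a) + [1] * len(b))
    Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xtr, ytr)
    return clf.score(Xte, yte)

acc_random = rf_accuracy(ground_truth, random_base)  # easy: near 1.0
acc_gpt = rf_accuracy(ground_truth, gpt_like)        # hard: near chance
print(f"vs. random baseline: {acc_random:.2f}, vs. GPT-like: {acc_gpt:.2f}")
```

The point of the sketch is the asymmetry, not the numbers: when two classes share their feature distribution, no classifier on those features can do much better than chance.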

Semantic analysis: Each node (paper) is also represented by a high‑dimensional embedding derived from its title and abstract. The authors use two embedding families: (a) OpenAI’s 3072‑dimensional model, and (b) SPECTER, a citation‑aware transformer. Graph‑level semantic features are obtained by aggregating node embeddings (mean, std, max). An RF classifier on these aggregated embeddings raises the human‑vs‑LLM discrimination to ≈ 0.83 accuracy. More importantly, Graph Neural Networks (GNNs) that ingest node embeddings as input features (e.g., GraphSAGE, GAT) achieve ≈ 93 % test accuracy on the GPT‑vs‑ground‑truth task. The same pipeline applied to Claude‑generated graphs yields RF accuracy ≈ 0.77, confirming that the phenomenon is not model‑specific.
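The graph-level semantic aggregation described above (mean, std, max over node embeddings) is straightforward to sketch in NumPy; the concatenation order is an assumption, and the paper's RF then consumes these fixed-length vectors.

```python
import numpy as np

def semantic_graph_features(node_embeddings):
    """Aggregate per-node title/abstract embeddings (n_nodes x dim) into
    one graph-level vector by concatenating the element-wise mean,
    standard deviation, and maximum."""
    E = np.asarray(node_embeddings, dtype=float)
    return np.concatenate([E.mean(axis=0), E.std(axis=0), E.max(axis=0)])

# Example: 5 nodes with 3072-dim embeddings -> one 9216-dim graph vector.
emb = np.random.default_rng(0).normal(size=(5, 3072))
print(semantic_graph_features(emb).shape)  # (9216,)
```

Unlike the RF, the GNNs skip this pooling-first step: they ingest the raw per-node embeddings and learn the aggregation jointly with the graph structure, which is consistent with their higher reported accuracy.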

Robustness checks: The authors repeat the entire pipeline with the sub‑field random baseline, the temporally constrained baseline, and with both embedding families. Across all variants, random graphs remain easily separable, while LLM‑generated graphs stay close to ground‑truth structurally but diverge semantically. The consistency of results across GPT‑4o, Claude 4.5, OpenAI embeddings, and SPECTER underscores the reliability of the findings.

Key insights:

  1. Structural mimicry – LLMs, when generating references purely from parametric knowledge, reproduce the salient graph‑level statistics of human citation networks (degree distribution, clustering, hub prevalence). Simple topological metrics are insufficient for detection.
  2. Semantic fingerprint – Title/abstract embeddings retain enough signal to differentiate LLM‑generated reference sets from human ones. The semantic space reflects the model’s internal knowledge distribution rather than the nuanced relevance judgments humans make.
  3. Detection strategy – Effective detection and debiasing of LLM‑generated bibliographies should focus on content‑level signals (semantic embeddings, GNN‑based joint structure‑semantic models) rather than coarse network topology.
  4. Potential bias amplification – The study notes that LLMs tend to reinforce known scientometric biases (Matthew effect, recency bias, shorter titles, fewer authors) while reducing self‑citations. If such patterns propagate unchecked, they could subtly reshape citation ecosystems.
  5. Practical tools – The authors release their code, data, and trained GNN models, enabling downstream systems (literature‑review assistants, citation recommendation engines) to flag potentially synthetic reference lists for human verification.

In conclusion, the paper provides a comprehensive empirical framework for evaluating LLM‑generated bibliographies. It shows that while LLMs can convincingly emulate the structure of scholarly citation graphs, they leave a detectable semantic imprint. Consequently, future scholarly‑AI pipelines should incorporate semantic‑aware monitoring to safeguard the integrity of academic citation practices.

