Good practices for a literature survey are not followed by authors while preparing scientific manuscripts
The number of citations received by authors in scientific journals has become a major parameter to assess individual researchers and the journals themselves through the impact factor. A fair assessment therefore requires that the criteria for selecting references in a given manuscript should be unbiased with respect to the authors or the journals cited. In this paper, we advocate that authors should follow two mandatory principles to select papers (later reflected in the list of references) while surveying the literature for a given research project: i) consider similarity of content with the topics investigated, lest closely related work be reproduced or ignored; ii) perform a systematic search over the network of citations, including seminal or closely related papers. We apply complex-network formalisms to two datasets of papers from the arXiv repository to show that neither of these two criteria is fulfilled in practice.
💡 Research Summary
The paper addresses a growing concern in scholarly publishing: citations have become a primary metric for evaluating both individual researchers and journals, most notably through the impact factor. Because of this, the authors argue that the process of selecting references for a manuscript must be unbiased and grounded in two mandatory principles. The first principle demands that authors consider the similarity of content between their own work and potential references, ensuring that closely related studies are neither omitted nor redundantly reproduced. The second principle requires a systematic exploration of the citation network, meaning that authors should actively seek out seminal works and papers that are directly or indirectly linked to their topic, rather than relying on a narrow set of familiar sources.
To test whether these principles are actually followed in practice, the authors adopt a quantitative approach based on complex-network theory. They construct two large datasets drawn from the arXiv repository, one from physics sub-fields and another from computer science, comprising roughly 10,000 papers in total. For each paper, they extract the metadata, the list of cited works, and the title and abstract text. Using Latent Dirichlet Allocation (LDA), they infer a topic distribution for every document and compute pairwise cosine similarities between these distributions, yielding a numerical "content-similarity score" for any potential citation.
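As a rough sketch of what such a pipeline looks like, the snippet below fits an LDA model with scikit-learn and computes pairwise cosine similarities between the resulting topic distributions. The toy corpus, the number of topics, and the preprocessing are illustrative assumptions; the paper's actual corpus, vocabulary handling, and LDA configuration are not reproduced here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

# Stand-ins for the "title + abstract" text of three papers (hypothetical)
docs = [
    "complex networks and citation analysis of scholarly papers",
    "topic models for mining large text corpora",
    "citation networks, impact factor and bibliometric indicators",
]

# Bag-of-words representation of each document
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit LDA; fit_transform returns one topic distribution per document
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)      # shape: (n_docs, n_topics)

# sim[i, j] is then the content-similarity score between papers i and j
sim = cosine_similarity(theta)
print(sim.round(2))
```

Under this reading, a pair (i, j) with sim[i, j] close to 1 would be a strong citation candidate according to the first principle.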
Simultaneously, they model the citation relationships as a directed graph. Standard network metrics such as in-degree, betweenness centrality, and k-step reachability are calculated to quantify how far an author's reference list extends into the broader citation network. Two key evaluation metrics are defined: (1) the probability that a highly similar paper (top 10 % in content similarity) appears in the reference list, and (2) the proportion of cited papers that lie more than one citation step away from the other "seed" references, i.e., the remaining papers directly cited by the manuscript. By comparing these empirical probabilities with a null model that assumes random citation, the authors can assess the degree of adherence to the two principles.
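The sketch below illustrates this kind of graph analysis with networkx, assuming a directed graph in which an edge u -> v means "u cites v". The toy graph, the node names, and the exact reachability definition are assumptions made here for illustration; the paper's precise metric construction may differ.

```python
import networkx as nx

# Toy citation graph (hypothetical): an edge u -> v means "u cites v"
G = nx.DiGraph()
G.add_edges_from([
    ("m", "a"), ("m", "b"), ("m", "c"),  # manuscript m and its references
    ("a", "b"),                          # b is one step from seed a
    ("a", "d"), ("d", "c"),              # c is two steps from seed a
])

manuscript = "m"
refs = set(G.successors(manuscript))     # the "seed" reference list

# Standard metrics mentioned above
in_deg = dict(G.in_degree())             # raw citation counts
betw = nx.betweenness_centrality(G)      # global centrality
print("in-degree:", in_deg)

# Minimum citation distance from any *other* seed to a given reference
def seed_distance(G, refs, target):
    dists = []
    for s in refs - {target}:
        try:
            dists.append(nx.shortest_path_length(G, s, target))
        except nx.NetworkXNoPath:
            pass
    return min(dists, default=float("inf"))

depth = {r: seed_distance(G, refs, r) for r in refs}
shallow = sum(1 for d in depth.values() if d <= 1) / len(refs)
print(f"fraction of references within one step of other seeds: {shallow:.2f}")
```

A null model of the kind described above could be simulated by rewiring the reference list to randomly chosen nodes of G and recomputing the same fraction for comparison.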
The results reveal a systematic deviation from the ideal. Papers with high content similarity are cited only about 35 % of the time, while low-similarity papers still appear in roughly 20 % of reference lists, indicating that content relevance is not the dominant driver of citation choice. Moreover, the citation networks of most manuscripts are shallow: over 85 % of references lie within one step of the seed papers, and only a small fraction (≈15 %) extends to two or more steps. This demonstrates that authors tend to stay within a limited "citation bubble," rarely exploring the broader literature that systematic network traversal could reach.
A striking pattern emerges when the authors examine the influence of prestige. Papers authored by well‑known researchers or published in high‑impact journals are disproportionately represented in reference lists, even when their topical relevance is low. The authors label this phenomenon “prestige bias” and argue that it reinforces an “echo‑chamber” effect, where a small elite of journals and authors dominate the citation landscape, marginalizing novel ideas and early‑career researchers.
In the discussion, the paper proposes concrete remedies. First, the development and adoption of automated literature‑search tools that leverage graph‑based recommendation algorithms could help authors discover relevant but less visible works. Second, journals and scientific societies should formalize citation policies that explicitly require justification of reference relevance, encouraging transparency. Third, research assessment frameworks should move beyond raw citation counts to incorporate diversity metrics—such as the spread of cited authors across institutions and countries—and content‑relevance scores. By integrating these technical and policy‑level interventions, the authors contend that the scholarly ecosystem can become more equitable, reduce citation‑driven distortions, and ultimately foster a healthier progression of scientific knowledge.
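As a concrete but purely hypothetical operationalization of such a diversity metric, one could compute the Shannon entropy of the institutional affiliations appearing in a reference list. The paper proposes the idea of diversity metrics; this particular formula is an assumption made here for illustration.

```python
from collections import Counter
from math import log

def citation_diversity(affiliations):
    """Shannon entropy (natural log) of the institutions in a reference
    list: 0.0 when every cited work comes from a single institution,
    ln(k) when citations are spread evenly over k institutions."""
    counts = Counter(affiliations)
    total = sum(counts.values())
    entropy = 0.0
    for c in counts.values():
        p = c / total
        entropy -= p * log(p)
    return entropy

# Hypothetical reference lists
narrow = ["MIT", "MIT", "MIT", "MIT"]
broad = ["MIT", "USP", "ETH", "Tsinghua"]
print(citation_diversity(narrow))   # 0.0   (maximal concentration)
print(citation_diversity(broad))    # ~1.386 = ln 4 (maximal spread)
```

The same construction applies to countries or author identities, and such scores could be combined with the content-similarity measure described earlier.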