Exploiting citation networks for large-scale author name disambiguation

We present a novel algorithm and validation method for disambiguating author names in very large bibliographic data sets and apply it to the full Web of Science (WoS) citation index. Our algorithm relies only upon the author and citation graphs available for the whole period covered by the WoS. A pair-wise publication similarity metric, based on common co-authors, self-citations, shared references and citations, is established to perform a two-step agglomerative clustering that first connects individual papers and then merges similar clusters. This parameterized model is optimized using an h-index-based recall measure, favoring the correct assignment of well-cited publications, and a name-initials-based precision using WoS metadata and cross-referenced Google Scholar profiles. Despite the use of limited metadata, we reach a recall of 87% and a precision of 88%, with a preference for researchers with high h-index values. The 47 million articles of the WoS can be disambiguated on a single machine in less than a day. We also develop an h-index distribution model; its predictions are in excellent agreement with the empirical data, yielding insight into the utility of the h-index in real academic ranking scenarios.


💡 Research Summary

The paper presents a scalable algorithm for author name disambiguation that relies exclusively on the author and citation graphs available in the Web of Science (WoS) database. Recognizing that many bibliographic records, especially older ones, lack rich metadata such as full first names, affiliations, or email addresses, the authors design a method that uses only three universally present pieces of information for each paper: the list of co‑authors (A), the set of references (R), and the set of citing papers (C).

A pairwise similarity score sᵢⱼ between any two papers i and j is defined as a weighted sum of four components: (1) the normalized overlap of co‑author lists, (2) the count of potential self‑citations (i.e., when one paper cites the other), (3) the overlap of reference lists, and (4) the overlap of citing papers. The normalization uses an “overlap coefficient” to mitigate bias toward papers with long author or citation lists. Four tunable weights (α) control the relative importance of each component.
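The similarity score described above can be sketched as follows. This is a minimal illustration of the four weighted components, assuming a simple dict layout for each paper's metadata; the actual weight values α are tuned by the paper's optimization procedure, and equal weights here are purely illustrative.

```python
def overlap(x, y):
    """Overlap coefficient |X ∩ Y| / min(|X|, |Y|); 0 if either set is empty."""
    x, y = set(x), set(y)
    m = min(len(x), len(y))
    return len(x & y) / m if m else 0.0

def similarity(paper_i, paper_j, alpha=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four similarity components for papers i and j.

    Each paper is assumed to be a dict with keys:
      'id'      -> the paper's own identifier
      'authors' -> co-author list A
      'refs'    -> set of cited paper IDs R
      'cits'    -> set of citing paper IDs C
    """
    a_auth, a_self, a_ref, a_cit = alpha
    # (1) normalized co-author overlap
    s = a_auth * overlap(paper_i['authors'], paper_j['authors'])
    # (2) potential self-citations: one paper cites the other
    self_cites = (paper_i['id'] in paper_j['refs']) + (paper_j['id'] in paper_i['refs'])
    s += a_self * self_cites
    # (3) shared references, (4) shared citing papers
    s += a_ref * overlap(paper_i['refs'], paper_j['refs'])
    s += a_cit * overlap(paper_i['cits'], paper_j['cits'])
    return s
```

The overlap coefficient (rather than, say, Jaccard similarity) keeps a short co-author or reference list from being penalized when compared against a much longer one.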

Using this similarity matrix, the algorithm proceeds in two agglomerative clustering stages. In the first stage, any paper pair with sᵢⱼ exceeding a threshold β₁ is linked, and the resulting connected components become provisional clusters. In the second stage, a cluster‑to‑cluster similarity S_γ,κ is computed by summing the sᵢⱼ values (above a second threshold β₂) across all cross‑cluster paper pairs and normalizing by the product of the cluster sizes. If S_γ,κ exceeds a third threshold β₃, the two clusters are merged. Remaining singleton papers are attached to existing clusters if their similarity to any member exceeds a fourth threshold β₄.
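The two clustering stages can be sketched in compact form. This is an illustrative toy version, assuming precomputed pairwise scores `sims[(i, j)]`; it uses union–find for the connected components and a greedy pairwise merge loop, and omits the final β₄ singleton-attachment step for brevity.

```python
def disambiguation_clusters(papers, sims, b1, b2, b3):
    """Two-stage agglomerative clustering over papers with scores sims[(i, j)]."""
    # Stage 1: link pairs with s_ij > β1 (union-find), take connected components.
    parent = {p: p for p in papers}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for (i, j), s in sims.items():
        if s > b1:
            parent[find(i)] = find(j)
    comps = {}
    for p in papers:
        comps.setdefault(find(p), []).append(p)
    clusters = list(comps.values())

    # Stage 2: merge a cluster pair when the sum of cross-cluster scores
    # above β2, normalized by the product of cluster sizes, exceeds β3.
    merged = True
    while merged:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(
                    s for (i, j), s in sims.items()
                    if s > b2 and ((i in clusters[a] and j in clusters[b]) or
                                   (i in clusters[b] and j in clusters[a])))
                if total / (len(clusters[a]) * len(clusters[b])) > b3:
                    clusters[a] += clusters.pop(b)
                    merged = True
                    break
            if merged:
                break
    return clusters
```

The production system obviously cannot afford the quadratic cross-cluster scan shown here; this sketch only mirrors the logic of the two thresholds.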

Parameter optimization is guided not by generic precision/recall but by a task‑specific metric: the h‑index recall R_h. The authors collect Google Scholar profiles (GSP) for 3,000 surnames, cross‑reference each profile’s publications with WoS entries, and compute the h‑index of the ground‑truth profile. For each candidate clustering, they identify the cluster that captures the largest share of a researcher’s papers and calculate R_h as the ratio of that cluster’s h‑index to the true h‑index. This focuses the algorithm on correctly assigning the most‑cited papers, which dominate the h‑index, rather than on perfect recall of every low‑cited work. Precision is estimated with a name‑initial proxy: after clustering all papers sharing a surname (ignoring initials), the most frequent initial in each cluster is assumed correct, and the proportion of papers carrying that initial yields a precision estimate (an upper bound, since lumping errors among authors sharing the same initial go undetected).
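The h-index recall is straightforward to compute once the citation counts of the ground-truth profile and of the candidate cluster are known. A minimal sketch, assuming each input is simply a list of per-paper citation counts:

```python
def h_index(citation_counts):
    """Largest h such that h papers each have at least h citations."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

def h_index_recall(true_counts, cluster_counts):
    """R_h: ratio of the candidate cluster's h-index to the true h-index."""
    h_true = h_index(true_counts)
    return h_index(cluster_counts) / h_true if h_true else 1.0
```

Because the h-index is driven by a researcher's most-cited papers, R_h stays high even when many low-cited papers are missing from the cluster, which is exactly the bias the authors want.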

A massive random search over 10,000 parameter configurations on the 3,000 surnames identifies a region near the origin of the precision‑vs‑h‑index‑error plane, indicating the best trade‑off. Further fine‑tuning on four disjoint samples (A–D) confirms that using all four similarity components (authors, self‑citations, references, citations) yields the lowest errors.
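The parameter search can be pictured as follows. This is a hypothetical sketch: `evaluate` stands in for the full clustering-plus-validation pipeline (returning the precision error and h-index-recall error for one configuration), and the sampling ranges for the α weights and β thresholds are invented for illustration.

```python
import math
import random

def random_search(evaluate, n_configs=10_000, seed=0):
    """Sample parameter configurations and keep the one whose
    (precision error, h-index error) point lies closest to the origin."""
    rng = random.Random(seed)
    best_cfg, best_dist = None, math.inf
    for _ in range(n_configs):
        cfg = {
            'alpha': [rng.uniform(0, 2) for _ in range(4)],  # component weights
            'beta': [rng.uniform(0, 1) for _ in range(4)],   # thresholds β1–β4
        }
        precision_err, h_recall_err = evaluate(cfg)  # e.g. 1 - P, 1 - R_h
        dist = math.hypot(precision_err, h_recall_err)
        if dist < best_dist:
            best_cfg, best_dist = cfg, dist
    return best_cfg, best_dist
```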

Implementation handles 47 million papers, 141 million co‑author entries, and 526 million citation links from 1900–2011. The system runs on a single high‑performance machine, completing the full disambiguation in under 24 hours. Evaluation on the entire WoS shows an average recall of 87% and precision of 88%. Importantly, for researchers with high h‑indices (top 10% of the population), h‑index recall exceeds 95%, demonstrating the algorithm’s bias toward correctly clustering highly cited work.

With the disambiguated author clusters, the authors compute the empirical distribution P(h) of h‑indices across the whole scientific corpus. Contrary to earlier assumptions, P(h) does not follow a Pareto (Lotka) law; the observed distribution deviates from it significantly. The paper proposes a simple stochastic model (based on log‑normal assumptions about citation accumulation) that matches the empirical P(h) closely, suggesting that the h‑index reflects both productivity and the underlying citation network dynamics.
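The flavor of such a stochastic model can be conveyed with a toy Monte Carlo simulation: each simulated author receives a random number of papers, each paper a log-normally distributed citation count, and the resulting h-indices are tabulated. All distributional choices and parameter values here are illustrative assumptions, not the paper's fitted model.

```python
import random
from collections import Counter

def simulate_h_distribution(n_authors=10_000, mean_papers=20,
                            mu=1.0, sigma=1.2, seed=42):
    """Tabulate a toy h-index distribution under log-normal citation counts."""
    rng = random.Random(seed)
    hist = Counter()
    for _ in range(n_authors):
        # Productivity: exponentially distributed paper counts (assumption).
        n_papers = max(1, int(rng.expovariate(1 / mean_papers)))
        # Citations per paper: log-normal (as in the paper's assumption).
        cites = sorted((int(rng.lognormvariate(mu, sigma)) for _ in range(n_papers)),
                       reverse=True)
        # h-index: length of the prefix where the i-th paper has >= i citations.
        h = sum(1 for rank, c in enumerate(cites, start=1) if c >= rank)
        hist[h] += 1
    return hist
```

Plotting `hist` on log–log axes makes the deviation from a pure power law visible even in this toy setting.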

In summary, the study delivers a practical, metadata‑light disambiguation pipeline that scales to the entire Web of Science, achieves high h‑index‑focused accuracy, and provides new insights into the statistical properties of the h‑index. The methodology can be adapted for other large bibliographic collections and for alternative evaluation goals by redefining the optimization objective.

