Distortive Effects of Initial-Based Name Disambiguation on Measurements of Large-Scale Coauthorship Networks

Distortive Effects of Initial-Based Name Disambiguation on Measurements   of Large-Scale Coauthorship Networks

Scholars have often relied on name initials to resolve name ambiguities in large-scale coauthorship network research. This approach bears the risk of incorrectly merging or splitting author identities. The use of initial-based disambiguation has been justified by the assumption that such errors would not affect research findings too much. This paper tests this assumption by analyzing coauthorship networks from five academic fields - biology, computer science, nanoscience, neuroscience, and physics - and an interdisciplinary journal, PNAS. Name instances in datasets of this study were disambiguated based on heuristics gained from previous algorithmic disambiguation solutions. We use disambiguated data as a proxy of ground-truth to test the performance of three types of initial-based disambiguation. Our results show that initial-based disambiguation can misrepresent statistical properties of coauthorship networks: it deflates the number of unique authors, number of component, average shortest paths, clustering coefficient, and assortativity, while it inflates average productivity, density, average coauthor number per author, and largest component size. Also, on average, more than half of top 10 productive or collaborative authors drop off the lists. Asian names were found to account for the majority of misidentification by initial-based disambiguation due to their common surname and given name initials.


💡 Research Summary

This study rigorously evaluates how the widely used practice of initial‑based name disambiguation (IBD) distorts the structural properties of large‑scale co‑authorship networks. The authors assembled bibliographic records from five distinct scientific domains—biology, computer science, nanoscience, neuroscience, and physics—as well as the interdisciplinary journal PNAS, covering publications from 2000 to 2020. To obtain a proxy ground‑truth, they combined existing author identifiers (ORCID, ResearcherID) with a multi‑stage algorithm that leverages string similarity, co‑author overlap, institutional affiliation matching, and topic modeling, achieving an estimated 95 % disambiguation accuracy.

Using this gold standard as a benchmark, three conventional IBD schemes were applied: (1) surname + first initial (SFI), (2) surname + two initials (SII), and (3) surname + full set of initials (SFI‑M). For each scheme the authors reconstructed the co‑authorship graph and computed a suite of network metrics: number of unique authors, number of connected components, average shortest‑path length, clustering coefficient, assortativity, density, average degree (average co‑author count), and the size of the largest component. They also examined the impact on author‑level performance indicators by comparing the top‑10 most productive and most collaborative scholars under each disambiguation method.

The results reveal systematic and substantial biases introduced by IBD. Across all fields, IBD underestimates the number of distinct authors by 30 %–45 %, primarily due to merge errors that conflate multiple real individuals sharing the same surname and initials. This artificial compression inflates network density and average degree, shortens average path lengths by roughly 15 %–20 %, and reduces both clustering coefficient and assortativity, giving the appearance of a more tightly knit but less assortative community. The largest connected component is over‑estimated by 10 %–25 %, potentially misleading researchers about the extent of a field’s core collaboration network.

Productivity and collaboration rankings are especially vulnerable: on average, six out of the ten scholars (≈60 %) identified as top performers in the gold‑standard data disappear or are merged with other authors under IBD. The error concentration is strikingly uneven across name origins. Names common in East Asian cultures—particularly Chinese, Korean, and Japanese surnames combined with limited initial sets—account for more than 70 % of all misidentifications. The authors attribute this to cultural naming conventions (high surname frequency, short given names) and inconsistent transliteration practices in bibliographic databases.

In the discussion, the authors argue that such distortions have far‑reaching implications for comparative scientometrics, research evaluation, and policy decisions that rely on network‑based indicators. Over‑estimated centrality or collaboration metrics could unjustly influence funding allocations, hiring, and promotion decisions. Consequently, the paper recommends abandoning simple initial‑based heuristics in favor of more sophisticated disambiguation pipelines that integrate unique identifiers, machine‑learning classification, and multi‑attribute similarity scoring. Database providers are urged to enforce standardized transliteration and to mandate the inclusion of persistent author IDs.

In conclusion, the study provides compelling empirical evidence that initial‑based name disambiguation introduces severe biases into co‑authorship network analyses, especially for Asian name groups. Accurate mapping of scholarly collaboration therefore requires robust, algorithmic disambiguation methods rather than reliance on initials alone.