Several studies have found that collaboration networks are scale-free, proposing that such networks can be modeled by specific network evolution mechanisms like preferential attachment. This study argues that collaboration networks can look more or less scale-free depending on the methods for resolving author name ambiguity in bibliographic data. Analyzing networks constructed from multiple datasets containing 3.4M to 9.6M publication records, this study shows that collaboration networks in which author names are disambiguated by the commonly used heuristic, i.e., forename-initial-based name matching, tend to produce degree distributions better fitted to power-law slopes with the typical scaling parameter ($2 < \alpha < 3$) than networks disambiguated by more accurate algorithm-based methods. This tendency is observed across collaboration networks generated under various conditions such as cumulative years, 5- and 1-year sliding windows, and random sampling, and simulation shows that it arises mainly from artefactual entities created by inaccurate disambiguation. This cautionary study calls for special attention from scholars analyzing network data in which entities such as people, organizations, and genes can be merged or split by improper disambiguation.
A network is called "scale-free" if its node degree distribution follows a power-law pattern of $x^{-\alpha}$, where $x$ is a node degree and $\alpha$ is a scaling parameter (Barabási & Albert, 1999). Scale-free networks have attracted considerable scholarly attention, mainly because of the implication that complex networks can be modeled by generic principles (Keller, 2005). Until recently, scholars across domains have reported observations of scale-free networks and proposed diverse mechanisms generating such a universal pattern (e.g., Barabási et al., 2002; Pastor-Satorras & Vespignani, 2001). Among many types of networks, scientific collaboration networks have been confirmed to exhibit scale-free-ness (e.g., Barabási et al., 2002; Milojević, 2010a; Newman, 2001). In a collaboration network, authors are represented by nodes, and two nodes are connected by an edge if the two authors appear together in a paper's byline. Conventionally, only the existence of a coauthoring relationship between a pair of authors is considered for scale-free network analyses, ignoring collaboration frequency. This means that a node degree in a collaboration network corresponds to the number of distinct coauthors who have ever collaborated with the author represented by the node. In several studies, degree distributions in collaboration networks have been found to follow a power-law: a few authors have large numbers of coauthors while many others have small numbers of coauthors, and this skewness of the coauthor distribution approximately fits a pattern of $x^{-\alpha}$, sometimes only across a limited range of $x$ values.
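The degree definition above can be illustrated with a minimal Python sketch. The function name `coauthor_degrees` and the toy bylines are hypothetical and not taken from the datasets; keeping single-author papers as degree-0 nodes is also an assumption of this sketch rather than a documented choice of the study.

```python
from collections import defaultdict
from itertools import combinations


def coauthor_degrees(bylines):
    """Map each author to the number of distinct coauthors across all bylines."""
    neighbors = defaultdict(set)
    for byline in bylines:
        authors = set(byline)  # a name repeated within one byline counts once
        for author in authors:
            neighbors[author]  # touch the key so solo authors become degree-0 nodes
        for a, b in combinations(authors, 2):
            # An edge records only that two authors have coauthored at least once;
            # repeated collaborations do not change the sets.
            neighbors[a].add(b)
            neighbors[b].add(a)
    return {author: len(coauthors) for author, coauthors in neighbors.items()}


# Hypothetical toy bylines for illustration only.
papers = [
    ["Brown, C", "Smith, J"],
    ["Brown, C", "Smith, J"],           # repeated collaboration, no new edge
    ["Brown, C", "Lee, K", "Park, S"],
    ["Diaz, M"],                        # single-author paper
]
degrees = coauthor_degrees(papers)
print(degrees)
```

Because neighbors are stored as sets, collaboration frequency is ignored by construction, which matches the convention described above.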
Serving as evidence of scale-free social networks, the aggregated findings of scale-free-ness in collaboration networks have formed an important basis of various efforts to model human interaction patterns besides physical, technical, and biological complex networks (Keller, 2005). Some scholars have, however, reported that degree distributions in collaboration networks do not follow a power-law (Franceschet, 2011; Grossman, 2002; Moody, 2004; Newman, 2004). In addition, several others have noted that scale-free collaboration networks might result from bibliographic data compromised by author name ambiguity (Fegley & Torvik, 2013; Kim & Diesner, 2015). This study takes the latter data-quality approach to understanding scale-free networks.
In bibliographic data, an author is usually represented by an alphabetical string, which can lead to name ambiguity. For example, two distinct authors who have the same name (e.g., two authors named “Charles Brown”) can be misrepresented as one if we identify authors by their names, which is called “merging of entities.” Another ambiguous case is an author who uses different name variants across papers (e.g., Charles Brown, Charles C. Brown, and Charlie Brown), causing the work of that author to be attributed to multiple distinct authors, which is called “splitting of entities.”
To address this ambiguity problem, many scale-free collaboration networks have been constructed under the assumption that two names that match on forename initials and surname refer to the same author. This initial-based author matching can produce disambiguation errors by merging two distinct authors who share the same name initials (e.g., Charles Brown and Clarke Brown) or by mistakenly regarding two names of one author (e.g., Charles Brown and Charles C. Brown) as belonging to different authors. Scale-free collaboration network studies using this initial-based heuristic have acknowledged the misidentification problem but argued that the errors induced by initial matching would not change “much” of the knowledge discovered from ambiguous bibliographic data (Barabási et al., 2002; Newman, 2001).
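A small Python sketch can show how initial-based matching produces both kinds of error on the example names above. The helper names (`parse_name`, `first_initial_key`, `all_initials_key`, `group_by`) are hypothetical, and the parsing assumes names written as "Forename(s) Surname" with a single-token surname and at least one forename, which real bibliographic data will not always satisfy.

```python
import re
from collections import defaultdict


def parse_name(full_name):
    """Split 'Forename [Middle ...] Surname' into (forename tokens, surname).

    Assumption: the last token is the surname and at least one forename exists.
    """
    tokens = re.sub(r"[.,]", " ", full_name).split()
    return tokens[:-1], tokens[-1]


def first_initial_key(full_name):
    """Match key made of the surname plus the first forename initial."""
    forenames, surname = parse_name(full_name)
    return f"{surname.lower()}, {forenames[0][0].lower()}"


def all_initials_key(full_name):
    """Match key made of the surname plus all forename initials."""
    forenames, surname = parse_name(full_name)
    initials = "".join(token[0].lower() for token in forenames)
    return f"{surname.lower()}, {initials}"


def group_by(key_fn, names):
    """Group name strings that the given key function treats as one author."""
    groups = defaultdict(set)
    for name in names:
        groups[key_fn(name)].add(name)
    return dict(groups)


names = ["Charles Brown", "Clarke Brown", "Charles C. Brown"]
first_groups = group_by(first_initial_key, names)
all_groups = group_by(all_initials_key, names)
print(first_groups)
print(all_groups)
```

Under the first-initial key, all three strings collapse into one node, so Clarke Brown is merged with Charles Brown. Under the all-initials key, Charles Brown and Clarke Brown are still merged, while Charles C. Brown is split off as a separate node even if it refers to the same person as Charles Brown.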
To counter the argument that author name ambiguity has a negligible impact on collaboration networks, this study shows that the scale-free-ness of collaboration networks can be affected by artefactual nodal entities created by ambiguous author names. To do so, this study uses three large-scale bibliographic datasets to construct collaboration networks in which author names are disambiguated by three different methods: all forename initials plus surname, the first forename initial plus surname, and algorithmic disambiguation. Then, a power-law fitting test is conducted on the degree distributions of collaboration networks generated under various conditions such as 5- and 1-year sliding windows, cumulative years, and random selection of paper records. In addition, simulation with incremental changes in disambiguation errors is used to examine how merged or split author entities are related to the rise of scale-free networks.
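To make the idea of fitting a scaling parameter concrete, the sketch below estimates $\alpha$ from a list of node degrees with a widely used maximum-likelihood approximation for integer data, $\hat{\alpha} \approx 1 + n \left[ \sum_i \ln \frac{x_i}{x_{\min} - 0.5} \right]^{-1}$. This is an illustrative assumption, not necessarily the fitting procedure used in this study; the function name `estimate_alpha` and the fixed `x_min` are hypothetical, and a complete test would also select `x_min` from the data and assess goodness of fit.

```python
import math


def estimate_alpha(degrees, x_min=1):
    """Approximate maximum-likelihood estimate of the power-law exponent.

    Only degrees >= x_min are used; x_min must be a positive integer.
    """
    if x_min < 1:
        raise ValueError("x_min must be at least 1")
    tail = [d for d in degrees if d >= x_min]
    if not tail:
        raise ValueError("no degrees at or above x_min")
    # Each term is positive because d >= x_min > x_min - 0.5.
    log_sum = sum(math.log(d / (x_min - 0.5)) for d in tail)
    return 1.0 + len(tail) / log_sum


# Hypothetical degree lists: adding one high-degree node flattens the fitted slope.
alpha_uniform = estimate_alpha([1, 1, 1])
alpha_with_hub = estimate_alpha([1, 1, 1, 10])
print(alpha_uniform, alpha_with_hub)
```

In this toy input, a single artefactual high-degree node (for example, one created by merging several authors) lowers the estimated exponent, which hints at why disambiguation errors can change how scale-free a network appears.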
This paper analyzes collaboration networks constructed from three large-scale scholarly datasets covering biomedicine, physics, and computer science. This selection represents academic fields that have been frequently studied by researchers for scale-free networks as well as bibliometrics in general.
MEDLINE: Maintained by the U.S. National Library of Medicine, this dataset contains almost 24M publication records pu