Topological-collaborative approach for disambiguating authors’ names in collaborative networks
Concepts and methods of complex networks have been employed to uncover patterns in a myriad of complex systems. Unfortunately, the relevance and significance of these patterns strongly depend on the reliability of the data sets. In the study of collaboration networks, for instance, unavoidable noise pervades author collaboration datasets when different authors share the same name. To address this problem, we derive a hybrid approach based on authors’ collaboration patterns and on topological features of collaborative networks. Our results show that the combination of strategies, in most cases, performs better than the traditional approach, which disregards topological features. We also show that the main factor for improving the discriminability of homonymous authors is the average distance between authors. Finally, we show that it is possible to predict the weighting associated with each strategy composing the hybrid system by examining the discrimination obtained from the traditional analysis of collaboration patterns. Because the methodology devised here is generic, our approach is potentially useful to classify many other networked systems governed by complex interactions.
💡 Research Summary
The paper tackles the pervasive problem of author name ambiguity in scholarly databases, where different individuals share identical name strings, leading to noisy citation and productivity metrics. Traditional disambiguation approaches rely primarily on collaboration patterns—examining co‑author overlaps to infer whether two name instances belong to the same person. While effective in many cases, these methods ignore the richer structural information embedded in the underlying co‑authorship network.
To address this limitation, the authors propose a hybrid framework that fuses traditional collaboration‑based features with a suite of complex‑network topological measurements. First, they construct a weighted co‑authorship graph: each node represents an author instance (with each homonymous occurrence treated as a distinct node), and an edge weight w_ij is defined as the sum over shared papers of 1/k_ρ, where k_ρ is the number of authors on paper ρ. This normalization ensures that collaborations involving few authors receive higher weight than those in large author groups.
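The weighting rule above can be illustrated with a minimal Python sketch. It assumes each paper is given as a plain list of author labels; the function name `coauthorship_weights` and the toy data are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def coauthorship_weights(papers):
    """Return {(a, b): w_ab}, where w_ab sums 1/k_rho over the papers shared by a and b.

    `papers` is an iterable of author lists; k_rho is the number of distinct
    authors on paper rho. Keys are sorted pairs, so (a, b) always has a < b.
    """
    weights = defaultdict(float)
    for authors in papers:
        unique = sorted(set(authors))  # ignore repeated names on one paper
        k = len(unique)
        for a, b in combinations(unique, 2):
            weights[(a, b)] += 1.0 / k
    return dict(weights)


# Toy data (illustrative only): three papers with 2, 3 and 4 authors.
papers = [
    ["A", "B"],
    ["A", "B", "C"],
    ["C", "D", "E", "F"],
]
weights = coauthorship_weights(papers)
```

In this toy example the pair (A, B) accumulates 1/2 + 1/3 from its two shared papers, while any pair from the four-author paper receives only 1/4, which is the down-weighting of large author groups described above.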
Two complementary feature vectors are then extracted for every ambiguous node v_i. The “collaborative” vector w(i) contains the normalized edge weights to all other nodes, capturing direct co‑author frequencies. The “topological” vector µ(i) encodes F = 9 measurements: (1) degree (number of distinct co‑authors), (2) strength (sum of edge weights), (3) average and standard deviation of neighbor degree, (4) average and standard deviation of neighbor strength, (5) clustering coefficient, (6) average shortest‑path length (computed within the same community component), (7) betweenness centrality, and (8) hierarchical extensions that consider 2‑hop and 3‑hop neighborhoods. Among these, the average shortest‑path length emerges as the most discriminative, reflecting global connectivity patterns that differ between distinct authors even when their immediate co‑author sets overlap.
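As a rough illustration of how some of these measurements could be computed, the sketch below derives a subset of them for one node of a small weighted graph stored as a dictionary of dictionaries. It is an assumption-laden example rather than the authors' implementation: the function name `local_features` is hypothetical, the clustering coefficient is the unweighted variant, shortest paths are counted in hops within the node's connected component, and betweenness and the hierarchical (2-hop, 3-hop) extensions are omitted for brevity.

```python
from collections import deque
from statistics import mean, pstdev


def local_features(adj, node):
    """Compute a few topological measurements for `node`.

    `adj` maps each node to {neighbor: edge_weight} for an undirected graph.
    """
    neighbors = adj[node]
    degree = len(neighbors)
    strength = sum(neighbors.values())

    nbr_degrees = [len(adj[n]) for n in neighbors]
    nbr_strengths = [sum(adj[n].values()) for n in neighbors]

    # Unweighted clustering: fraction of neighbor pairs that are themselves linked.
    nbrs = list(neighbors)
    links = sum(1 for i, u in enumerate(nbrs) for v in nbrs[i + 1:] if v in adj[u])
    possible = degree * (degree - 1) / 2
    clustering = links / possible if possible else 0.0

    # Breadth-first search gives hop distances to every reachable node.
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    hops = [d for n, d in dist.items() if n != node]

    return {
        "degree": degree,
        "strength": strength,
        "avg_nbr_degree": mean(nbr_degrees) if nbr_degrees else 0.0,
        "std_nbr_degree": pstdev(nbr_degrees) if nbr_degrees else 0.0,
        "avg_nbr_strength": mean(nbr_strengths) if nbr_strengths else 0.0,
        "std_nbr_strength": pstdev(nbr_strengths) if nbr_strengths else 0.0,
        "clustering": clustering,
        "avg_shortest_path": mean(hops) if hops else 0.0,
    }


# Toy graph (illustrative only): a triangle A-B-C with a pendant node D on C.
adj = {
    "A": {"B": 0.5, "C": 0.5},
    "B": {"A": 0.5, "C": 1.0},
    "C": {"A": 0.5, "B": 1.0, "D": 0.25},
    "D": {"C": 0.25},
}
features_a = local_features(adj, "A")
features_c = local_features(adj, "C")
```

Here nodes A and C share neighbors, yet their average shortest-path lengths differ (4/3 versus 1), which hints at why a distance-based measurement can separate nodes whose immediate neighborhoods overlap.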
For disambiguation, the authors define a similarity score between two ambiguous nodes i and j as a convex combination:
S(i,j) = α·Sim_collab(i,j) + (1‑α)·Sim_topo(i,j),
where Sim_collab and Sim_topo are cosine similarities of the collaborative and topological vectors, respectively, and α ∈ [0, 1].
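The combined score can be sketched in a few lines of Python. This is a minimal example under stated assumptions: feature vectors are plain lists of floats, α is restricted to [0, 1] (which is what "convex combination" implies), and the names `cosine` and `hybrid_similarity` as well as the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def hybrid_similarity(w_i, w_j, mu_i, mu_j, alpha):
    """S(i,j) = alpha * Sim_collab + (1 - alpha) * Sim_topo, with alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] for a convex combination")
    return alpha * cosine(w_i, w_j) + (1.0 - alpha) * cosine(mu_i, mu_j)


# Toy vectors (illustrative only): no shared co-authors, but parallel
# topological profiles.
w_i, w_j = [1.0, 0.0], [0.0, 1.0]
mu_i, mu_j = [1.0, 2.0], [2.0, 4.0]
score = hybrid_similarity(w_i, w_j, mu_i, mu_j, alpha=0.3)
```

With these toy inputs the collaborative similarity is 0 and the topological similarity is 1, so the score reduces to 1 − α. Setting α = 1 recovers the purely collaboration-based comparison.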