Finding missing edges and communities in incomplete networks
Many algorithms have been proposed for predicting missing edges in networks, but they do not usually take account of which edges are missing. We focus on networks which have missing edges of the form that is likely to occur in real networks, and compare algorithms that find these missing edges. We also investigate the effect of this kind of missing data on community detection algorithms.
Research Summary
The paper addresses a critical gap in network analysis: most edge-prediction algorithms are evaluated on networks where missing edges are assumed to be random, ignoring the specific mechanisms that cause incompleteness in real-world data. The authors define three realistic missing-edge scenarios: crawled (boundary) networks, random-deletion networks, and limited-degree networks. Each reflects a common data-collection bias: breadth-first-search sampling, non-response or measurement error, and degree censoring (e.g., fixed-choice surveys), respectively.
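The three scenarios can be sketched as edge-sampling procedures on a complete edge list. This is a minimal illustration only: the summary does not say exactly how the authors construct each incomplete network, so the function names, the parameters (`fraction`, `n_expand`, `max_reports`), and the rule that a censored edge survives if either endpoint reports it are all assumptions.

```python
import random
from collections import deque


def to_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbours)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def random_deletion(edges, fraction, rng):
    """Drop a uniformly random fraction of edges (non-response or measurement error)."""
    edges = list(edges)
    n_remove = int(round(fraction * len(edges)))
    removed = set(rng.sample(edges, n_remove))
    return [e for e in edges if e not in removed]


def crawled(edges, seed_node, n_expand):
    """Breadth-first crawl (assumed form): expand the first n_expand nodes
    reached from seed_node and keep every edge incident to an expanded node.
    Edges between two unexpanded boundary nodes are never observed."""
    adj = to_adjacency(edges)
    queue, seen, expanded = deque([seed_node]), {seed_node}, set()
    while queue and len(expanded) < n_expand:
        u = queue.popleft()
        expanded.add(u)
        for v in sorted(adj[u]):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return [(u, v) for u, v in edges if u in expanded or v in expanded]


def limited_degree(edges, max_reports, rng):
    """Fixed-choice censoring (assumed form): each node reports at most
    max_reports neighbours; an edge is kept if either endpoint reports it."""
    adj = to_adjacency(edges)
    reported = set()
    for u, nbrs in adj.items():
        nbrs = sorted(nbrs)
        chosen = nbrs if len(nbrs) <= max_reports else rng.sample(nbrs, max_reports)
        for v in chosen:
            reported.add(frozenset((u, v)))
    return [(u, v) for u, v in edges if frozenset((u, v)) in reported]


# Toy graph used only for illustration: a 5-cycle with two chords.
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2), (1, 3)]
rng = random.Random(0)
observed_random = random_deletion(toy_edges, 0.3, rng)
observed_crawl = crawled(toy_edges, seed_node=0, n_expand=1)
observed_capped = limited_degree(toy_edges, max_reports=1, rng=rng)
```

In each case the edges missing from the returned list are the prediction targets, so the same scoring method can be compared across the three kinds of incompleteness.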
To assess the impact of these biases, the authors generate incomplete versions of three categories of networks: synthetic Erdős–Rényi graphs, LFR benchmark graphs (which embed community structure and tunable clustering), and several real-world networks (Karate club, email, terrorist, scientometrics, C. elegans metabolic, and a grassland food-web). For each incomplete network they evaluate a suite of edge-prediction methods: local similarity measures based on common neighbours (CN, Adamic–Adar, Resource Allocation), normalised variants (Jaccard, Meet/Min, Geometric), a degree-based Preferential Attachment (PA) score, and two global probabilistic models, Hierarchical Random Graph (HRG) and Stochastic Block Model (BM). Performance is measured by the Area Under the ROC Curve (AUC), which quantifies the probability that a true missing edge receives a higher score than a random non-edge.
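Several of the local scores and the AUC can be sketched in a few lines. The formulas below are the standard textbook definitions of CN, Adamic–Adar, Resource Allocation, Jaccard and PA, which may differ in detail from the variants used in the paper; the helper names and the choice to count ties as one half are assumptions made for illustration.

```python
import math


def adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbours)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def local_scores(adj, u, v):
    """Similarity scores for a candidate pair (u, v) in the observed graph."""
    nu, nv = adj.get(u, set()), adj.get(v, set())
    common = nu & nv
    union = nu | nv
    return {
        # Common neighbours: number of shared neighbours.
        "CN": len(common),
        # Adamic-Adar: shared neighbours weighted by 1 / log(degree).
        # A shared neighbour has degree >= 2, so the logarithm is positive.
        "AA": sum(1.0 / math.log(len(adj[z])) for z in common),
        # Resource Allocation: shared neighbours weighted by 1 / degree.
        "RA": sum(1.0 / len(adj[z]) for z in common),
        # Jaccard: shared neighbours normalised by the size of the union.
        "Jaccard": len(common) / len(union) if union else 0.0,
        # Preferential Attachment: product of the two degrees.
        "PA": len(nu) * len(nv),
    }


def auc(missing_scores, nonedge_scores):
    """Probability that a missing edge outscores a non-edge, by exhaustive
    pairwise comparison; a tie counts as half a win."""
    wins = 0.0
    for m in missing_scores:
        for n in nonedge_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(missing_scores) * len(nonedge_scores))


# Toy observed graph: nodes 0 and 3 are not linked but share neighbours 1 and 2.
observed = adjacency([(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)])
pair_scores = local_scores(observed, 0, 3)
example_auc = auc([2, 1], [0, 1])
```

An AUC of 0.5 corresponds to chance-level ranking, and 1.0 means every missing edge is ranked above every non-edge. For large networks the exhaustive comparison is usually replaced by sampling random pairs.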
Key findings include:
- Random graphs (ER) lack structural cues; all local methods perform at chance level, while PA shows modest improvement only when degree information is informative.
- LFR graphs and real networks with appreciable clustering benefit most from common-neighbour based scores (CN, AA, RA). Their performance correlates with the clustering coefficient: higher clustering yields higher AUC.
- Crawled networks suffer from low-degree peripheral vertices; PA underperforms because degree information is biased, whereas local measures still succeed if the underlying community structure is strong.
- Limited-degree networks equalise vertex degrees, making normalised scores (Jaccard, Meet/Min) more effective; PA excels because it directly leverages the remaining degree heterogeneity.
- Random-deletion networks preserve the overall degree distribution, allowing global models (HRG, BM) to achieve the highest AUC, outperforming all local heuristics.
- Computational cost: HRG and BM are substantially more expensive, limiting their applicability to small or medium-sized networks.
Beyond edge prediction, the authors examine how each missing-edge scenario affects community-detection algorithms (e.g., Louvain, Infomap). They observe that crawled networks often become disconnected, causing many community-detection methods to fail or produce fragmented partitions. Random-deletion networks retain connectivity and thus yield relatively stable community assignments. Limited-degree networks alter degree-based centrality measures, which can blur community boundaries and degrade modularity-based detection.
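Because connectivity matters for the community-detection results described above, a simple first check on an incomplete network is to count its connected components before running any detection method. The sketch below is a generic depth-first traversal, not a procedure taken from the paper; the function name and the toy path graph are illustrative assumptions.

```python
def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as a list of sets."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Iterative depth-first search from an unvisited node.
        stack, component = [start], set()
        seen.add(start)
        while stack:
            u = stack.pop()
            component.add(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(component)
    return components


# Toy example: a path 0-1-2-3-4-5 observed without the edge (2, 3).
nodes = list(range(6))
full_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
observed_edges = [e for e in full_edges if e != (2, 3)]
parts_full = connected_components(nodes, full_edges)
parts_observed = connected_components(nodes, observed_edges)
```

If the observed network splits into several components, any partition found on it is constrained by that split, which is worth knowing before comparing community assignments with those of the complete network.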
The study concludes that the nature of missing data must be explicitly considered when selecting edge-prediction or community-detection techniques. For networks sampled via crawling, local similarity measures are preferable; for data with random measurement loss, global probabilistic models provide the best recovery; and for degree-censored datasets, degree-aware scores such as PA or normalised similarity indices are optimal. The authors also stress the practical relevance of their findings for researchers and practitioners who must preprocess incomplete network data before downstream analysis.