Finding missing edges and communities in incomplete networks

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Many algorithms have been proposed for predicting missing edges in networks, but they rarely take into account which edges are likely to be missing. We focus on networks whose missing edges take the forms that occur in real data, compare algorithms for finding those missing edges, and investigate how this kind of missing data affects community-detection algorithms.


💡 Research Summary

The paper addresses a critical gap in network analysis: most edge‑prediction algorithms are evaluated on networks where missing edges are assumed to be random, ignoring the specific mechanisms that cause incompleteness in real‑world data. The authors define three realistic missing‑edge scenarios (crawled or boundary networks, random‑deletion networks, and limited‑degree networks), each reflecting a common data‑collection bias: breadth‑first‑search sampling, non‑response or measurement error, and degree‑censoring (e.g., fixed‑choice surveys), respectively.
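These three sampling schemes are simple enough to reproduce directly. The sketch below generates each incomplete-network type from a complete graph stored as a dict of neighbour sets; it is an illustrative reconstruction of the scenarios described above, not the authors' code, and the function names and parameters (`budget`, `p`, `k`) are my own.

```python
import random
from collections import deque

# A graph is a dict mapping each vertex to the set of its neighbours.
# Vertex labels are assumed sortable (e.g. ints).

def crawled(adj, budget):
    """BFS crawl from a random seed: keep only vertices reached within
    the budget, so peripheral (boundary) vertices lose edges."""
    seed = random.choice(list(adj))
    visited, queue = {seed}, deque([seed])
    while queue and len(visited) < budget:
        v = queue.popleft()
        for u in adj[v]:
            if u not in visited and len(visited) < budget:
                visited.add(u)
                queue.append(u)
    return {v: adj[v] & visited for v in visited}

def random_deletion(adj, p):
    """Drop each edge independently with probability p
    (non-response / measurement error)."""
    out = {v: set() for v in adj}
    for v in adj:
        for u in adj[v]:
            if v < u and random.random() >= p:
                out[v].add(u)
                out[u].add(v)
    return out

def limited_degree(adj, k):
    """Degree censoring: each vertex 'names' at most k neighbours, as in
    a fixed-choice survey; an edge survives if either endpoint names it."""
    out = {v: set() for v in adj}
    for v in adj:
        for u in random.sample(sorted(adj[v]), min(k, len(adj[v]))):
            out[v].add(u)
            out[u].add(v)
    return out
```

Note that under `limited_degree` a vertex can still end up with degree above `k`, because other vertices may name it; only the number of edges it contributes itself is censored.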

To assess the impact of these biases, the authors generate incomplete versions of three categories of networks: synthetic Erdős‑Rényi graphs, LFR benchmark graphs (which embed community structure and tunable clustering), and several real‑world networks (Karate club, email, terrorist, scientometrics, C. elegans metabolic, and a grassland food‑web). For each incomplete network they evaluate a suite of edge‑prediction methods: local similarity measures based on common neighbours (CN, Adamic‑Adar, Resource Allocation), normalised variants (Jaccard, Meet/Min, Geometric), a degree‑based Preferential Attachment (PA) score, and two global probabilistic models—Hierarchical Random Graph (HRG) and Stochastic Block Model (BM). Performance is measured by the Area Under the ROC Curve (AUC), which quantifies the probability that a true missing edge receives a higher score than a random non‑edge.

Key findings include:

  1. Random graphs (ER) lack structural cues; all local methods perform at chance level, while PA shows modest improvement only when degree information is informative.

  2. LFR graphs and real networks with appreciable clustering benefit most from common‑neighbour based scores (CN, AA, RA). Their performance correlates with the clustering coefficient: higher clustering yields higher AUC.

  3. Crawled networks suffer from low‑degree peripheral vertices; PA underperforms because degree information is biased, whereas local measures still succeed if the underlying community structure is strong.

  4. Limited‑degree networks equalise vertex degrees, making normalised scores (Jaccard, Meet/Min) more effective; PA excels because it directly leverages the remaining degree heterogeneity.

  5. Random‑deletion networks preserve the overall degree distribution, allowing global models (HRG, BM) to achieve the highest AUC, outperforming all local heuristics.

  6. Computational cost: HRG and BM are substantially more expensive, limiting their applicability to small or medium‑sized networks.
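Finding 2 ties prediction quality to the clustering coefficient, which is cheap to compute. A minimal sketch (for a graph stored as a dict of neighbour sets; my own illustration) of the local coefficient and its network average:

```python
def clustering(adj, v):
    """Local clustering coefficient: the fraction of neighbour pairs
    of v that are themselves connected."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2 * links / (k * (k - 1))

def avg_clustering(adj):
    """Average local clustering coefficient over all vertices."""
    return sum(clustering(adj, v) for v in adj) / len(adj)
```

On this view, common-neighbour scores work precisely because a high clustering coefficient means that two endpoints of a (missing) edge tend to share neighbours.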

Beyond edge prediction, the authors examine how each missing‑edge scenario affects community‑detection algorithms (e.g., Louvain, Infomap). They observe that crawled networks often become disconnected, causing many community‑detection methods to fail or produce fragmented partitions. Random‑deletion networks retain connectivity and thus yield relatively stable community assignments. Limited‑degree networks alter degree‑based centrality measures, which can blur community boundaries and degrade modularity‑based detection.
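Fragmentation is easy to check before running a community-detection method. The sketch below (my own illustration, for a graph stored as a dict of neighbour sets) counts connected components; a crawled or heavily deleted network that splits into many components is a warning sign for the partition-based methods mentioned above.

```python
def components(adj):
    """Connected components via iterative DFS; returns a list of
    vertex sets, one per component."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        stack, comp = [s], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps
```

A simple preprocessing choice, then, is to run detection on the largest component only, at the cost of discarding the fragments.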

The study concludes that the nature of missing data must be explicitly considered when selecting edge‑prediction or community‑detection techniques. For networks sampled via crawling, local similarity measures are preferable; for data with random measurement loss, global probabilistic models provide the best recovery; and for degree‑censored datasets, degree‑aware scores such as PA or normalised similarity indices are optimal. The authors also stress the practical relevance of their findings for researchers and practitioners who must preprocess incomplete network data before downstream analysis.

