Performance of Local Information Based Link Prediction: A Sampling Perspective


Link prediction is widely employed to uncover missing links in snapshots of real-world networks, which are usually obtained through various sampling methods. In the previous literature, however, the known edges of the sampled snapshot are divided randomly into a training set and a probe set when evaluating prediction performance, without considering the sampling process that produced the snapshot. Different sampling methods can lead to different patterns of missing links, especially biased ones, so a random-partition evaluation is no longer convincing once the sampling method is taken into account. To fill this void, this paper reevaluates the performance of local-information-based link prediction methods by letting the sampling method govern the division into training and probe sets. Interestingly, each prediction approach performs unevenly across different sampling methods, and most perform poorly when the sampling is biased, which indicates that the performance of these methods was overestimated in prior works.


💡 Research Summary

The paper revisits the problem of link prediction in complex networks from the perspective of sampling bias. While most prior studies evaluate link prediction algorithms by randomly splitting the known edges of a complete network into a training set (typically 90 % of edges) and a probe set (the remaining 10 %), real‑world network data are rarely obtained in this way. In practice, networks are collected through crawling, API queries, sensor measurements, or other sampling procedures that are often biased toward certain structural features (e.g., high‑degree nodes, dense cores). The authors argue that such sampling bias fundamentally changes the distribution of missing links, making the conventional random‑split evaluation unreliable.

To investigate this issue, the authors select five representative sampling methods: (1) Breadth‑First Search (BFS), which preferentially explores high‑degree regions; (2) Metropolis‑Hastings Random Walk (MHRW), designed to achieve a uniform stationary distribution; (3) Frontier Sampling (FS), a multidimensional random walk; (4) Forest‑Fire (FF), a probabilistic “burning” process that mimics information diffusion; and (5) Pure Random (PR), an idealized method that selects edges uniformly at random (used as a baseline). For each method, a sampling fraction s_f ranging from 0.1 to 0.9 determines how many edges are placed in the training set E_T; the remainder forms the probe set E_P.
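The mechanics of a sampling-governed split can be sketched in a few lines of Python for the BFS case. This is an illustrative toy, not the paper's implementation: the function name and the dict-of-sets graph representation are our own. Edges encountered by a breadth-first crawl form the training set E_T until the sampling fraction s_f is reached; everything else becomes the probe set E_P.

```python
import random
from collections import deque

def bfs_edge_sample(adj, s_f, seed=None):
    """Toy BFS edge sampler: crawl the graph breadth-first, putting every
    edge met along the way into the training set E_T until a fraction
    s_f of all edges is collected; the rest is the probe set E_P.
    `adj` maps each node to its set of neighbours (undirected graph)."""
    rng = random.Random(seed)
    all_edges = {frozenset((u, v)) for u in adj for v in adj[u]}
    target = int(s_f * len(all_edges))
    start = rng.choice(sorted(adj))           # random crawl seed node
    visited, queue = {start}, deque([start])
    train = set()
    while queue and len(train) < target:
        u = queue.popleft()
        for v in sorted(adj[u]):
            train.add(frozenset((u, v)))      # every traversed edge is "observed"
            if v not in visited:
                visited.add(v)
                queue.append(v)
            if len(train) >= target:          # stop at the sampling fraction
                break
    return train, all_edges - train           # (E_T, E_P)
```

Because a crawl like this reaches hub nodes early and expands outward from them, edges in the low-degree periphery are systematically pushed into the probe set, which is exactly the kind of bias the paper examines.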

The study evaluates ten classic local‑information link prediction scores: Common Neighbours (CN), Adamic‑Adar (AA), Resource Allocation (RA), Salton Index (SAI), Jaccard Index (JI), Sørensen Index (SPI), Hub‑Promoted Index (HPI), Hub‑Depressed Index (HDI), Leicht‑Holme‑Newman (LHN), and Preferential Attachment (PA). Each score is computed using only the edges in E_T; all non‑adjacent node pairs are ranked, and the top |E_P| pairs are taken as predictions. Performance is measured by Precision (the fraction of correctly predicted probe edges) and AUC (the probability that a randomly chosen missing link receives a higher score than a randomly chosen nonexistent link).
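A minimal sketch of this scoring-and-evaluation pipeline, under assumptions of our own (helper names and the dict-of-sets graph representation are illustrative, not from the paper): a few of the local scores are computed from the training adjacency only, all non-adjacent pairs are ranked, and Precision counts how many of the top |E_P| pairs fall in the probe set.

```python
from itertools import combinations
from math import log

def local_scores(adj, x, y):
    """A few of the classic local-information scores for a non-adjacent
    pair (x, y), computed from the training graph only.
    `adj` maps each node to its set of training-set neighbours."""
    cn = adj[x] & adj[y]                      # common neighbours
    kx, ky = len(adj[x]), len(adj[y])         # training degrees
    return {
        "CN": len(cn),
        # a common neighbour is adjacent to both x and y, so its
        # degree is >= 2 and log() below is never zero
        "AA": sum(1.0 / log(len(adj[z])) for z in cn),
        "RA": sum(1.0 / len(adj[z]) for z in cn),
        "JI": len(cn) / len(adj[x] | adj[y]) if (adj[x] | adj[y]) else 0.0,
        "PA": kx * ky,
    }

def precision(adj, probe, score="CN"):
    """Rank all non-adjacent pairs by one score; Precision is the
    fraction of the top-|E_P| pairs that are real probe edges."""
    pairs = [(x, y) for x, y in combinations(sorted(adj), 2)
             if y not in adj[x]]
    pairs.sort(key=lambda p: local_scores(adj, *p)[score], reverse=True)
    top = pairs[:len(probe)]
    hits = sum(1 for x, y in top if frozenset((x, y)) in probe)
    return hits / len(probe) if probe else 0.0
```

The key point the paper stresses is visible in the first function: every score sees only the training adjacency, so whatever edges the sampler failed to observe directly distort the degrees and common-neighbour counts the scores are built on.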

Experiments are conducted on nine real‑world networks from diverse domains (social, collaboration, biological, etc.). The results reveal systematic patterns: (i) Biased sampling methods (BFS, FF) dramatically reduce both AUC and Precision for almost all scores. In BFS‑sampled subgraphs, high‑degree nodes dominate the training set, leaving low‑degree regions under‑represented; consequently, scores that rely on common neighbours (CN, AA, RA) perform near random (AUC≈0.5). (ii) MHRW and FS, which aim for a more uniform node coverage, preserve modest predictive power; CN, AA, and RA still outperform random but fall short of the levels observed under pure random sampling (typically 5–10 % lower AUC). (iii) Scores heavily dependent on node degree (PA, HPI) can be over‑estimated in biased samples because the training set contains many high‑degree edges, inflating their scores on the probe set. (iv) Only the pure random baseline reproduces the high performance reported in earlier literature (AUC≈0.9, Precision≈0.8), confirming that previous evaluations implicitly assumed an unbiased sampling scenario.

The authors conclude that (a) the sampling process must be explicitly accounted for when benchmarking link prediction algorithms; (b) many widely used local‑information measures are vulnerable to sampling bias, especially under methods that concentrate on dense cores; and (c) future work should develop (i) preprocessing techniques to correct for sampling bias, (ii) new prediction models (potentially leveraging global structural information or graph neural networks) that are robust to biased observations, and (iii) an integrated evaluation framework that jointly considers sampling strategy and prediction performance. Such advances are essential for reliable deployment of link prediction in real‑world applications such as recommendation systems, fraud detection, and biological interaction discovery.

