Effective and Efficient Similarity Index for Link Prediction of Complex Networks
Predicting missing links in incomplete networks, such as protein-protein interaction networks, or likely but not yet existing links in evolving networks, such as friendship networks in web society, can guide further experiments or provide valuable information to web users. In this paper, we introduce a local path index to estimate the likelihood that a link exists between two nodes. We propose a network model with controllable density and noise strength in link generation, and we collect data from six real networks. Extensive numerical simulations on both modeled and real networks demonstrate the high effectiveness and efficiency of the local path index compared with two well-known and widely used indices, the common neighbors and the Katz index. Indeed, the local path index gives predictions competitive in accuracy with those of the Katz index while requiring much less CPU time and memory, which makes it a strong candidate for practical applications in data mining of very large networks.
💡 Research Summary
The paper addresses the problem of link prediction in complex networks, where the goal is to estimate the likelihood that a pair of nodes, currently unconnected, will form a link in the future. Traditional approaches fall into two categories. Local methods such as Common Neighbors (CN) are computationally cheap but only exploit the immediate neighborhood, limiting their predictive power. Global methods like the Katz index incorporate information from all possible paths, weighting shorter paths more heavily, and thus achieve higher accuracy; however, they require matrix inversion or high‑order matrix powers, leading to O(N³) time complexity and prohibitive memory consumption for large‑scale graphs.
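To make the contrast concrete, here is a minimal sketch (not the paper's code) of the two baselines on a small dense adjacency matrix. It assumes the standard Katz definition, the sum over walk lengths l of β^l·(A^l)ᵢⱼ, and approximates it by truncating the series instead of inverting a matrix. The function names, the cutoff `max_len`, and the example graph are illustrative assumptions.

```python
def mat_mul(X, Y):
    """Multiply two square matrices stored as lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def common_neighbours(A):
    """CN score matrix: (A^2)_ij is the number of neighbours shared by i and j."""
    return mat_mul(A, A)


def katz_truncated(A, beta, max_len):
    """Approximate Katz scores: sum_{l=1..max_len} beta^l * (A^l)_ij.

    Assumption: beta is below 1 / (largest eigenvalue of A), so the full
    series converges and the truncation error shrinks as max_len grows.
    """
    n = len(A)
    scores = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in A]  # holds A^l, starting at l = 1
    weight = beta                  # holds beta^l
    for _ in range(max_len):
        for i in range(n):
            for j in range(n):
                scores[i][j] += weight * power[i][j]
        power = mat_mul(power, A)
        weight *= beta
    return scores


# Hypothetical example: the path graph 0 - 1 - 2 - 3.
PATH_GRAPH = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
cn = common_neighbours(PATH_GRAPH)
katz = katz_truncated(PATH_GRAPH, beta=0.1, max_len=5)
```

On this graph CN cannot tell the pair (0, 3) apart from a disconnected pair, because the two nodes share no neighbour, whereas Katz still gives it a small positive score through the length-3 walk.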
To bridge this gap, the authors propose the Local Path (LP) index, defined as
sᵢⱼ = (A²)ᵢⱼ + ε·(A³)ᵢⱼ,
where A is the adjacency matrix, (A²)ᵢⱼ counts the number of 2‑hop (common‑neighbor) paths between nodes i and j, (A³)ᵢⱼ counts 3‑hop paths, and ε is a small positive constant (0 < ε < 1). By adding a weighted contribution of 3‑hop paths, LP captures additional structural information beyond CN while still requiring only two matrix multiplications, yielding a computational cost close to linear in the number of edges (≈ O(N·⟨k⟩²)).
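The definition above can be evaluated for a single node pair without forming any matrix, using only neighbour sets. The following is a minimal sketch under that assumption, for an undirected graph without self-loops. The helper names `build_adjacency` and `local_path_score` are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build undirected neighbour sets from an iterable of (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def local_path_score(adj, i, j, eps):
    """LP score s_ij = (A^2)_ij + eps * (A^3)_ij from neighbour sets."""
    # (A^2)_ij: nodes adjacent to both i and j.
    two_hop = len(adj[i] & adj[j])
    # (A^3)_ij: walks i -> u -> v -> j, counted by taking each neighbour u
    # of i and counting the nodes v adjacent to both u and j.
    three_hop = sum(len(adj[u] & adj[j]) for u in adj[i])
    return two_hop + eps * three_hop


# Hypothetical example: the path graph 0 - 1 - 2 - 3.
adj = build_adjacency([(0, 1), (1, 2), (2, 3)])
score_02 = local_path_score(adj, 0, 2, eps=0.01)  # one common neighbour
score_03 = local_path_score(adj, 0, 3, eps=0.01)  # only a 3-hop path
```

Here the pair (0, 3) has no common neighbour, so CN alone would score it 0, while the ε-weighted 3‑hop term gives it a small positive score that still ranks below the pair (0, 2).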
The authors evaluate LP on both synthetic and real networks. For synthetic data, they generate graphs with controllable average degree and a tunable noise level (random spurious edges). Experiments show that as the network becomes sparser or noisier, CN’s performance deteriorates sharply, whereas LP maintains high AUC values, demonstrating robustness to missing and erroneous links.
Real‑world tests involve six diverse datasets: an autonomous‑system (AS) Internet topology, a router‑level network, a product‑co‑purchase network, a citation network, a protein‑protein interaction (PPI) network, and an online social network. In each case, 10 % of edges are randomly removed, and the algorithms attempt to recover them. Performance is measured by Area Under the ROC Curve (AUC) and Precision@L (the fraction of correctly recovered links among the top‑L predictions). LP consistently outperforms CN by 15–25 % in AUC and matches or slightly trails Katz (within 1–3 %) while requiring orders of magnitude less CPU time and memory. For the largest network (≈ 100 k nodes), Katz fails due to memory overflow, whereas LP completes in under a minute.
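The evaluation protocol described here can be sketched as follows. This is an illustrative outline rather than the authors' code: it assumes the usual pairwise reading of AUC (the probability that a hidden link scores higher than a nonexistent one, with ties counted as one half), and every function name and parameter below is hypothetical.

```python
import random


def split_edges(edges, probe_fraction=0.1, seed=0):
    """Randomly hide a fraction of edges; return (training, probe) lists."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_probe = max(1, int(round(probe_fraction * len(shuffled))))
    return shuffled[n_probe:], shuffled[:n_probe]


def auc(probe_scores, nonexistent_scores):
    """Fraction of (hidden, nonexistent) score pairs ranked correctly.

    A tie contributes 0.5, so a predictor that scores at random gets
    about 0.5 overall.
    """
    wins = 0.0
    for p in probe_scores:
        for q in nonexistent_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(probe_scores) * len(nonexistent_scores))


def precision_at_l(ranked_pairs, probe_pairs, top_l):
    """Fraction of the top_l ranked candidate pairs that are hidden links.

    Assumes pairs are stored in one canonical form, e.g. (min, max).
    """
    probe_set = set(probe_pairs)
    hits = sum(1 for pair in ranked_pairs[:top_l] if pair in probe_set)
    return hits / top_l
```

In use, the scores would come from an index such as CN, LP, or Katz computed on the training edges only, so that the hidden links play no part in their own prediction.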
A sensitivity analysis of ε reveals an optimal range of 0.01–0.05. Values outside this interval either diminish the contribution of 3‑hop paths (ε ≈ 0) or over‑emphasize them (ε ≈ 1), leading to increased susceptibility to noise. The authors thus recommend ε ≈ 0.02 as a default setting for most applications.
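A toy calculation shows why a large ε can change the ranking. The path counts below are made-up numbers chosen for illustration, not data from the paper: pair "a" has one common neighbour and no 3‑hop paths, and pair "b" has no common neighbour but five 3‑hop paths.

```python
def lp_from_counts(two_hop, three_hop, eps):
    """LP score from precomputed 2-hop and 3-hop path counts."""
    return two_hop + eps * three_hop


# Hypothetical (two_hop, three_hop) counts for two candidate pairs.
PAIR_A = (1, 0)
PAIR_B = (0, 5)


def preferred_pair(eps):
    """Return which pair LP ranks higher for a given eps."""
    score_a = lp_from_counts(PAIR_A[0], PAIR_A[1], eps)
    score_b = lp_from_counts(PAIR_B[0], PAIR_B[1], eps)
    if score_a > score_b:
        return "a"
    if score_b > score_a:
        return "b"
    return "tie"
```

With a small ε such as 0.02 the common neighbour dominates and pair "a" ranks first. With ε = 0.5 the five 3‑hop paths outweigh it and pair "b" ranks first, which is the over-emphasis the sensitivity analysis warns about.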
Key contributions of the work are: (1) the introduction of a simple yet powerful similarity measure that balances local and semi‑global information; (2) a thorough complexity analysis showing near‑linear scalability; (3) extensive empirical validation across heterogeneous networks, establishing the method’s generality; and (4) practical guidance on parameter selection.
In conclusion, the Local Path index offers an attractive compromise between accuracy and efficiency, making it suitable for large‑scale data‑mining tasks where computational resources are limited. Future directions suggested include extending the approach to temporal networks, integrating attribute information, and embedding LP within machine‑learning pipelines for even richer predictive models.