Missing links prediction: comparing machine learning with physics-rooted approaches
An active research line within the broader field of network science is the one concerning link prediction. Close in scope to network reconstruction, link prediction targets specific connections with the aim of uncovering the missing ones, as well as predicting those most likely to emerge in the future, from the available information. In this paper, we consider two families of methods, i.e. those rooted in statistical physics and those based upon machine learning: the members of the first family identify missing links as the most probable non-observed ones, the probability coefficients being determined by solving maximum-entropy benchmarks over the accessible network structure; the members of the second family, instead, associate the presence of single edges to explanatory node-specific variables. Running likelihood-based models such as the Configuration Model, or one of its many fitness-based variants, in parallel with the Gradient Boosting Decision Tree algorithm reveals that the former’s accuracy is comparable to (and sometimes slightly higher than) the latter’s. Such a result confirms that white-box algorithms are viable competitors to the currently available black-box ones, being computationally faster and more interpretable than the latter.
💡 Research Summary
This paper investigates the problem of missing‑link prediction on two real‑world binary undirected networks—the World Trade Web and an electronic interbank deposits market—by directly contrasting two methodological families: (i) white‑box, physics‑inspired maximum‑entropy models, exemplified by the Configuration Model (CM), and (ii) black‑box machine‑learning ensembles, specifically Gradient Boosting Decision Trees (GBDT) implemented via LightGBM.
The authors first formalize the link‑prediction task. Let A be the full adjacency matrix, E_obs the set of observed edges, and E_miss the set of edges deliberately removed (10 %, 20 %, 30 % or 50 % of the total edges, each removal repeated ten times). The training set consists of the observed adjacency A_obs together with a feature matrix X that may contain node‑specific attributes (GDP, degree) and edge‑specific attributes (geographic distance). Performance is assessed with five standard metrics: True Positive Rate (TPR), False Positive Rate (FPR), Jaccard Index (JI), overall Accuracy (ACC) and the Area Under the Receiver Operating Characteristic curve (AUROC).
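The first four metrics reduce to confusion-matrix counts taken over the non-observed node pairs. A minimal sketch (function and variable names are illustrative, not taken from the paper):

```python
def link_prediction_metrics(predicted, missing, candidates):
    """Confusion-matrix metrics over the non-observed node pairs.

    predicted  -- set of pairs declared as missing links
    missing    -- set of truly removed pairs (E_miss)
    candidates -- set of all non-observed pairs scored by the model
    """
    tp = len(predicted & missing)           # removed links correctly recovered
    fp = len(predicted - missing)           # non-links wrongly declared missing
    fn = len(missing - predicted)           # removed links the model misses
    tn = len(candidates) - tp - fp - fn     # non-links correctly left out
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    ji = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    acc = (tp + tn) / len(candidates)
    return {"TPR": tpr, "FPR": fpr, "JI": ji, "ACC": acc}
```

The AUROC is obtained separately, by sweeping the score threshold instead of fixing the number of predicted links.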
Physics‑based approach
Maximum‑entropy theory is used to construct a probability distribution over graphs that respects only a chosen set of constraints. The authors focus on the degree sequence as the sole constraint, leading to the Hamiltonian H_CM = Σ_{i<j} a_{ij}(θ_i+θ_j). This yields link probabilities p_{ij}=x_i x_j/(1+x_i x_j) with x_i = e^{‑θ_i}. The parameters {x_i} are obtained by solving the coupled nonlinear equations k_i^{obs}= Σ_{j≠i} p_{ij}, which is done via the iterative update x_i^{(t+1)} = k_i^{obs} / Σ_{j≠i} x_j^{(t)}/(1+x_i^{(t)}x_j^{(t)}). The resulting probabilities are used as scores; the |E_miss| highest‑scoring non‑observed pairs are declared as predicted missing links.
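The fixed-point iteration above fits in a few lines of NumPy. The sketch below is an illustrative implementation of the update x_i ← k_i / Σ_{j≠i} x_j/(1 + x_i x_j), not the authors' code:

```python
import numpy as np

def fit_cm(A_obs, tol=1e-10, max_iter=10_000):
    """Fit Configuration-Model fitnesses x_i on an observed adjacency matrix."""
    k = A_obs.sum(axis=1).astype(float)     # observed degree sequence
    x = k / np.sqrt(k.sum() + 1.0)          # simple initial guess (+1 guards the empty graph)
    for _ in range(max_iter):
        xx = np.outer(x, x)                 # xx[i, j] = x_i * x_j
        terms = x / (1.0 + xx)              # terms[i, j] = x_j / (1 + x_i x_j)
        np.fill_diagonal(terms, 0.0)        # exclude j = i from the sum
        x_new = k / terms.sum(axis=1)
        if np.max(np.abs(x_new - x)) < tol:
            return x_new
        x = x_new
    return x

def cm_link_probabilities(x):
    """p_ij = x_i x_j / (1 + x_i x_j), with zero diagonal."""
    xx = np.outer(x, x)
    P = xx / (1.0 + xx)
    np.fill_diagonal(P, 0.0)
    return P
```

At convergence the expected degrees Σ_{j≠i} p_ij reproduce the observed ones, and the off-diagonal entries of P serve directly as the CM link scores.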
Machine‑learning approach
GBDT builds an additive model s_{ij}^{(K)} = α + Σ_{k=1}^{K} ρ_k φ_k(f_i, f_j, g_{ij}|θ_k), where the φ_k are shallow decision trees taking node features f_i, f_j and edge features g_{ij} as input. At each iteration, the next tree is fitted to the negative gradient of the logistic loss ℓ = −Σ_{i<j} [a_{ij} ln σ(s_{ij}) + (1 − a_{ij}) ln(1 − σ(s_{ij}))], with σ the logistic function; the final scores σ(s_{ij}^{(K)}) are then used to rank the non-observed pairs.