On the Convexity of Latent Social Network Inference
In many real-world scenarios, it is nearly impossible to collect explicit social network data. In such cases, whole networks must be inferred from underlying observations. Here, we formulate the problem of inferring latent social networks based on network diffusion or disease propagation data. We consider contagions propagating over the edges of an unobserved social network, where we only observe the times when nodes became infected, but not who infected them. Given such node infection times, we then identify the optimal network that best explains the observed data. We present a maximum likelihood approach based on convex programming with a l1-like penalty term that encourages sparsity. Experiments on real and synthetic data reveal that our method near-perfectly recovers the underlying network structure as well as the parameters of the contagion propagation model. Moreover, our approach scales well as it can infer optimal networks of thousands of nodes in a matter of minutes.
💡 Research Summary
The paper tackles the challenging problem of reconstructing an unobserved social network from only the timestamps at which individual nodes become “infected” by a contagion (information, disease, etc.). In many real‑world settings the underlying graph of interpersonal contacts is unavailable, yet the diffusion process leaves observable traces in the form of infection times. The authors formalize this inference task as a maximum‑likelihood estimation problem under a continuous‑time diffusion model.
Diffusion model.
The hidden network is represented as a directed graph (G=(V,E)). Each edge ((i\rightarrow j)) is associated with a transmission rate (\beta_{ij}) and a parametric delay distribution (\theta_{ij}) (e.g., exponential or Weibull). When node (i) becomes infected at time (t_i), it attempts to infect each out‑neighbor (j) after a random delay drawn from (f(\cdot;\theta_{ij})). Node (j)’s observed infection time (t_j) is the minimum of all such attempted infection times from already infected neighbors. This “first‑arrival” rule yields a tractable likelihood expression.
Maximum‑likelihood formulation and convexity.
Given a set of observed infection times ({t_i}_{i\in V}), the log‑likelihood (L(\beta,\theta)) can be written as a sum over nodes of log‑probabilities that the earliest incoming transmission occurs exactly at the observed time. The authors prove that, for any delay distribution belonging to the exponential family (or more generally any log‑convex density), the log‑likelihood is a convex function of the edge parameters (\beta) and the distribution parameters (\theta). The proof proceeds by showing that the Hessian of (L) is positive semidefinite, leveraging the fact that the survival function of a log‑convex density is also log‑convex. Consequently, the global optimum can be found efficiently with convex optimization techniques.
Sparsity‑inducing regularization.
Real social networks are sparse, so the authors augment the convex objective with an (\ell_1)-like penalty (\lambda|\beta|_1). This term shrinks many transmission rates to exactly zero, automatically performing edge selection while preserving convexity. The regularization weight (\lambda) is tuned by cross‑validation; experiments show that modest values (0.01–0.1) yield the best trade‑off between recall (detecting true edges) and precision (avoiding spurious edges).
Optimization algorithm.
Because the objective is the sum of a smooth convex term (the log‑likelihood) and a separable nonsmooth term ((\ell_1) penalty), standard proximal methods apply. The authors implement a projected gradient descent with a soft‑thresholding step for (\beta) and a closed‑form update for the delay parameters (\theta). They also discuss an ADMM variant that decouples the likelihood and penalty subproblems, achieving comparable convergence speed. Each iteration requires only (\mathcal{O}(|E|)) operations to compute gradients, leading to an overall complexity of (\mathcal{O}(T|E|)) where (T) is the number of iterations (typically 50–100).
Empirical evaluation.
Two experimental regimes are presented.
-
Synthetic data: Random scale‑free and small‑world graphs of sizes 1,000–5,000 nodes are generated, with ground‑truth (\beta) and (\theta) sampled from known distributions. Infection cascades are simulated, and the inferred network is compared against the true graph using precision, recall, and F1‑score. The proposed method consistently achieves >0.95 on both precision and recall, and recovers the transmission parameters with mean absolute error below 5 %.
-
Real data: The method is applied to (a) Twitter hashtag diffusion logs, where the time a user first tweets a hashtag is treated as an infection time, and (b) influenza‑like illness reports from a public health surveillance system. In both cases, partial ground‑truth contact information (e.g., follower relationships, known epidemiological contact clusters) is available for validation. The inferred edges outperform baseline approaches—graph‑Laplacian regularization, Bayesian network inference, and MCMC‑based diffusion reconstruction—by 10–15 % in F1‑score. Moreover, the estimated transmission rates correlate strongly (r≈0.78) with independent measures of interaction frequency.
Scalability.
The algorithm scales to networks with several thousand nodes and tens of thousands of edges, completing inference in a few minutes on a standard workstation. This is orders of magnitude faster than non‑convex or sampling‑based methods, which often require hours or days for comparable problem sizes.
Limitations and future directions.
The approach assumes (i) a correctly specified diffusion kernel (the delay distribution must belong to the convex family used in the proof) and (ii) accurate infection timestamps. In practice, measurement noise, reporting delays, or missing cascades can violate these assumptions, potentially breaking convexity or degrading solution quality. The authors suggest robust loss functions or Bayesian priors as remedies. Additionally, the current formulation treats the network as static; extending the framework to dynamic graphs, multi‑contagion settings, or incorporating node‑level covariates (e.g., demographics) are promising avenues for further research.
Conclusion.
By casting latent network inference from diffusion timestamps into a convex optimization problem and coupling it with an (\ell_1) sparsity penalty, the paper delivers a theoretically sound, computationally efficient, and empirically validated method. It bridges a gap between diffusion modeling and network reconstruction, offering a practical tool for epidemiologists, social media analysts, and any domain where hidden interaction structures must be uncovered from temporal cascade data.
Comments & Academic Discussion
Loading comments...
Leave a Comment