Strategies for online inference of model-based clustering in large and growing networks
In this paper we adapt online estimation strategies to perform model-based clustering on large networks. Our work focuses on two algorithms, the first based on the SAEM algorithm and the second on variational methods. These two strategies are compared with existing approaches on simulated and real data. We use the method to decipher the connection structure of the political websphere during the 2008 US political campaign. We show that our online EM-based algorithms offer a good trade-off between precision and speed when estimating parameters for mixture distributions in the context of random graphs.
💡 Research Summary
This paper addresses the challenge of performing model‑based clustering on massive and continuously growing networks by introducing two online estimation frameworks. The first method adapts the Stochastic Approximation Expectation‑Maximization (SAEM) algorithm to a streaming edge setting. As each edge arrives, sufficient statistics are incrementally updated using a Robbins‑Monro step‑size schedule, and the M‑step maximizes the likelihood of the mixed stochastic block model parameters based on the current statistics. The second method reformulates variational inference in an online fashion: node‑wise cluster assignment probabilities are treated as variational parameters that are locally refined whenever a new edge involving the node is observed, using coordinate ascent on the variational lower bound, which is equivalent to minimizing the KL divergence to the posterior. Both approaches avoid storing the full adjacency matrix, reducing memory complexity to O(NK) and keeping per‑update computation proportional to the number of clusters K.
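The incremental update of sufficient statistics described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the step-size form t^(-alpha), and the ratio-based M-step for block connection probabilities are assumptions chosen for illustration. The sketch only shows the generic stochastic-approximation recursion s ← s + γ_t (s_new − s), where the step sizes satisfy the usual Robbins‑Monro conditions when 0.5 < alpha ≤ 1.

```python
def step_size(t, alpha=0.7):
    """Hypothetical Robbins-Monro schedule: gamma_t = t ** (-alpha), 0.5 < alpha <= 1."""
    if t < 1:
        raise ValueError("t must be a positive iteration index")
    return t ** (-alpha)


def update_statistics(stats, observed, t, alpha=0.7):
    """Move each running statistic toward its newly observed value.

    Implements stats + gamma_t * (observed - stats), element-wise.
    `stats` and `observed` are flat lists of the same length.
    """
    gamma = step_size(t, alpha)
    return [s + gamma * (o - s) for s, o in zip(stats, observed)]


def m_step_connectivity(edge_stats, pair_stats, eps=1e-12):
    """Assumed M-step for Bernoulli edges: ratio of running edge mass to
    running pair mass, one value per block pair."""
    return [e / max(p, eps) for e, p in zip(edge_stats, pair_stats)]


# Usage: feed a stream of 0/1 edge indicators for a single block pair.
stats = [0.0]
for t, edge in enumerate([1.0, 0.0, 1.0, 1.0], start=1):
    stats = update_statistics(stats, [edge], t, alpha=1.0)
```

With alpha = 1 the step size is 1/t, so the recursion reduces to a running average of the observations. Smaller values of alpha give more weight to recent observations.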
Extensive experiments were conducted on synthetic networks (10⁴–10⁵ nodes, 5–10 clusters) and on a real‑world dataset comprising roughly 20,000 political websites and over 500,000 hyperlinks from the 2008 U.S. presidential campaign. Accuracy was measured with Adjusted Rand Index and Normalized Mutual Information, while runtime and memory consumption were recorded. The online SAEM algorithm consistently achieved slightly higher ARI/NMI scores than the online variational method and was 5–10 times faster than traditional batch variational EM, which also suffered from memory overflow on the largest graphs. In the political web‑sphere analysis, the algorithms uncovered two dominant sub‑graphs corresponding to conservative and liberal media, with a few central sites acting as bridges between the factions, offering interpretable insights into opinion diffusion.
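For readers unfamiliar with the Adjusted Rand Index used in the evaluation above, the following is a small self-contained Python sketch of the standard pair-counting formula. It is an illustration only, not the evaluation code used in the paper; the function name and the convention of returning 1.0 in the degenerate case are assumptions.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two flat clusterings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: treat as perfect agreement

    # Contingency table cells and its row/column marginals.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / total_pairs  # chance-level agreement
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions have one cluster
    return (sum_cells - expected) / (max_index - expected)
```

The index is invariant to label permutations, so `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])` gives 1.0, while partitions that agree no better than chance score around 0 and can be negative.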
The contributions of the work are threefold: (1) introduction of scalable online SAEM and variational algorithms that overcome the O(N²) bottleneck of batch methods; (2) theoretical justification of convergence based on stochastic approximation and variational optimization principles; (3) empirical validation demonstrating a favorable trade‑off between clustering precision and computational efficiency, as well as practical applicability to real‑time social network analysis. The authors suggest future extensions to handle non‑stationary streams, adaptive numbers of clusters, and integration with online anomaly detection for broader deployment in dynamic network environments.