Pseudo-likelihood methods for community detection in large sparse networks
Many algorithms have been proposed for fitting network models with communities, but most of them do not scale well to large networks, and often fail on sparse networks. Here we propose a new fast pseudo-likelihood method for fitting the stochastic block model for networks, as well as a variant that allows for an arbitrary degree distribution by conditioning on degrees. We show that the algorithms perform well under a range of settings, including on very sparse networks, and illustrate on the example of a network of political blogs. We also propose spectral clustering with perturbations, a method of independent interest, which works well on sparse networks where regular spectral clustering fails, and use it to provide an initial value for pseudo-likelihood. We prove that pseudo-likelihood provides consistent estimates of the communities under a mild condition on the starting value, for the case of a block model with two communities.
💡 Research Summary
The paper tackles the long‑standing challenge of community detection in massive, sparse networks by introducing a fast pseudo‑likelihood (PL) framework for fitting the stochastic block model (SBM). Traditional maximum‑likelihood approaches such as EM, variational Bayes, or modularity maximization become computationally prohibitive when the number of nodes N reaches hundreds of thousands or more, especially because the full likelihood involves O(N²) terms. The authors circumvent this bottleneck by replacing the full likelihood with a product of conditional likelihoods for each vertex, i.e., a pseudo‑likelihood that only depends on the vertex’s immediate neighbors. This reduces the computational cost to O(E log K), where E is the number of edges and K the number of communities, making the method scalable to very large graphs.
Two algorithmic variants are presented. The first is the basic PL‑SBM, which assumes a homogeneous degree distribution. The second, “degree‑conditioned pseudo‑likelihood,” explicitly conditions on each vertex’s observed degree, thereby allowing arbitrary degree heterogeneity while still estimating only the block matrix B. This variant is essentially a computationally lighter alternative to the degree‑corrected SBM (DC‑SBM) and avoids over‑fitting degree effects.
A crucial practical issue for any iterative method is the choice of an initial labeling. The authors propose “Spectral Clustering with Perturbations” (SCP) as a robust initializer. SCP adds a small random matrix εU to the graph Laplacian (or normalized Laplacian) before extracting the leading K eigenvectors. The perturbation widens the eigengap that typically collapses in extremely sparse graphs, leading to more stable K‑means clustering of the eigenvectors. Empirically, SCP improves normalized mutual information (NMI) by roughly 10–15 % over standard spectral clustering on graphs with average degree as low as 3.
The PL algorithm follows an EM‑like two‑step iteration. In the E‑step, given current community assignments, the conditional probability that each node belongs to each community is updated. In the M‑step, the block probabilities are re‑estimated using these soft assignments; for the degree‑conditioned version the degree constraints are enforced by normalizing rows of B. Convergence is declared when the change in pseudo‑likelihood falls below 10⁻⁶. The authors prove that, for a two‑community SBM, any initialization that achieves accuracy better than random guessing by a constant ε (i.e., >½ + ε) guarantees that the PL estimator converges to the true community partition with high probability. The proof hinges on showing that the PL objective is a first‑order approximation of the true log‑likelihood and that the EM dynamics constitute a contraction mapping near the true parameters.
Extensive experiments validate the approach. Synthetic benchmarks vary average degree (2–10), community size imbalance (1:1, 3:1, 9:1), and number of blocks. The PL methods consistently achieve NMI > 0.85 even at average degree 3, while being 5–8× faster than conventional EM for SBM. In a real‑world test on a political‑blog network (≈1,200 nodes, ≈19,000 edges) the PL‑SBM correctly classifies liberal versus conservative blogs with 93 % accuracy, outperforming state‑of‑the‑art methods such as Infomap (89 %) and Louvain modularity optimization. The degree‑conditioned variant is particularly beneficial when the degree distribution follows a power law, as it prevents the block estimates from being biased by high‑degree hubs.
The paper also discusses limitations. The consistency proof is limited to two communities; extending it to K > 2 remains an open theoretical problem. The method’s performance degrades when community sizes are extremely imbalanced (e.g., 95 % vs. 5 %) because the EM updates become dominated by the majority class. Moreover, when degree itself encodes community information (core‑periphery structures), conditioning on degree can suppress useful signal. The authors suggest future work on hybrid pseudo‑likelihood models that jointly estimate degree parameters and block memberships, as well as online extensions for streaming graphs.
In summary, the authors deliver a practically efficient and theoretically grounded solution for community detection in large, sparse networks. By leveraging pseudo‑likelihood, degree conditioning, and a novel spectral initializer, they achieve both speed and accuracy that surpass many existing techniques, while also providing a rigorous consistency guarantee under mild initialization conditions. This work opens avenues for scalable network analysis in domains ranging from social media to biological interaction networks.
Comments & Academic Discussion
Loading comments...
Leave a Comment