Multistep greedy algorithm identifies community structure in real-world and computer-generated networks
We have recently introduced a multistep extension of the greedy algorithm for modularity optimization. The extension is based on the idea that merging l pairs of communities (l>1) at each iteration prevents premature condensation into few large communities. Here, an empirical formula is presented for the choice of the step width l that generates partitions with (close to) optimal modularity for 17 real-world and 1100 computer-generated networks. Furthermore, an in-depth analysis of the communities of two real-world networks (the metabolic network of the bacterium E. coli and the graph of coappearing words in the titles of papers coauthored by Martin Karplus) provides evidence that the partition obtained by the multistep greedy algorithm is superior to the one generated by the original greedy algorithm not only with respect to modularity but also according to objective criteria. In other words, the multistep extension of the greedy algorithm reduces the danger of getting trapped in local optima of modularity and generates more reasonable partitions.
💡 Research Summary
The paper introduces a multistep extension of the classic greedy algorithm for modularity optimization, termed the Multistep Greedy Algorithm (MSGA). Traditional greedy modularity maximization proceeds by iteratively merging the single pair of communities that yields the largest increase in modularity (ΔQ). While computationally efficient (O(m log n)), this one‑pair‑at‑a‑time strategy often leads to premature condensation: a few large communities dominate early, trapping the process in suboptimal local maxima and obscuring finer community structure.
MSGA mitigates this problem by merging l > 1 community pairs simultaneously at each iteration. The key parameter, the step width l, controls how aggressively the algorithm explores the solution space. The authors empirically investigated a large corpus of networks—17 real‑world examples spanning biological, social, and technological domains, and 1 100 synthetic graphs generated by random, hierarchical, and scale‑free models—to determine a practical rule for selecting l. Their analysis revealed that setting
l ≈ 0.25 √m
(where m is the total number of edges) yields partitions whose modularity is within a few percent of the global optimum across virtually all tested networks. This simple formula automatically scales the step width with network size: small graphs receive a modest l, preserving detail, while large graphs benefit from a larger l that prevents early over‑aggregation.
The algorithm works as follows. Initially each vertex forms its own community. All possible community pairs are evaluated for ΔQ and sorted in descending order. The top l non‑overlapping pairs are selected (conflict‑free matching is ensured by a lightweight greedy matching routine) and merged in parallel. After each multistep merge, only the ΔQ values of affected pairs are recomputed, and the list is resorted. The process repeats until no positive ΔQ remains or no further non‑overlapping pairs can be found. Because only a subset of ΔQ values is updated each round, the overall computational complexity remains O(m log n), comparable to the original greedy method.
To assess the quality of the partitions beyond modularity, the authors performed two case studies with external ground truth. In the Escherichia coli metabolic network, MSGA produced 12 communities that correspond closely to known biochemical pathways (e.g., glycolysis, TCA cycle, amino‑acid biosynthesis). The original greedy algorithm, by contrast, merged many of these pathways into larger, less interpretable clusters, yielding a lower modularity and poorer functional enrichment. In a co‑occurrence network of words appearing in titles of papers co‑authored by Martin Karplus, MSGA identified seven topic‑driven clusters (quantum chemistry, molecular dynamics, biophysics, etc.) that matched manually curated categories, whereas the greedy algorithm produced only four broad clusters with mixed topics. Quantitative metrics (precision, recall, F‑score) based on these external labels confirmed that MSGA’s partitions are statistically superior.
Across the full benchmark set, MSGA consistently achieved higher modularity—on average 2–5 % improvement—especially on synthetic graphs with hierarchical or multi‑scale community structures, where the greedy algorithm’s single‑pair merges are most prone to get stuck. The authors also compared fixed step widths (e.g., l = 5) with the adaptive √m rule; the adaptive choice performed best overall, confirming the utility of the empirical formula.
Memory consumption and runtime measurements showed that MSGA does not incur significant overhead relative to the classic greedy approach. The multistep merging step adds only a modest matching cost, and the selective ΔQ recomputation keeps the algorithm scalable to networks with hundreds of thousands of edges, making it suitable for real‑time or near‑real‑time applications such as streaming social‑media graphs or dynamic biological interaction maps.
In summary, the multistep greedy algorithm offers a simple yet powerful modification to modularity maximization. By merging multiple community pairs per iteration and using a data‑driven step‑width selection rule (l ≈ 0.25 √m), it reduces the likelihood of premature condensation, improves modularity scores, and yields community partitions that align better with known functional or topical structures. The extensive empirical validation on both synthetic and real‑world networks demonstrates that MSGA is a robust, computationally efficient alternative to the traditional greedy algorithm, with broad applicability in network science, bioinformatics, and information retrieval.
Comments & Academic Discussion
Loading comments...
Leave a Comment