Detecting highly overlapping community structure by greedy clique expansion
In complex networks it is common for each node to belong to several communities, implying a highly overlapping community structure. Recent advances in benchmarking indicate that existing community assignment algorithms that are capable of detecting overlapping communities perform well only when the extent of community overlap is kept to modest levels. To overcome this limitation, we introduce a new community assignment algorithm called Greedy Clique Expansion (GCE). The algorithm identifies distinct cliques as seeds and expands these seeds by greedily optimizing a local fitness function. We perform extensive benchmarks on synthetic data to demonstrate that GCE’s good performance is robust across diverse graph topologies. Significantly, GCE is the only algorithm to perform well on these synthetic graphs, in which every node belongs to multiple communities. Furthermore, when put to the task of identifying functional modules in protein interaction data, and college dorm assignments in Facebook friendship data, we find that GCE performs competitively.
💡 Research Summary
The paper addresses the challenging problem of detecting highly overlapping community structures in complex networks, where each node may belong to several communities simultaneously. Existing overlapping community detection algorithms perform adequately only when the degree of overlap is modest; their accuracy deteriorates sharply as nodes become members of multiple groups. To overcome this limitation, the authors propose a novel method called Greedy Clique Expansion (GCE).
GCE operates in two main phases. First, it extracts maximal cliques from the input graph and treats each clique as a seed for a potential community. Maximal cliques are chosen because they represent densely connected subgraphs that are likely to lie at the core of true communities. The second phase expands each seed greedily by optimizing a local fitness function:
F(S) = |E_in(S)| / (|E_in(S)| + α·|E_out(S)|),
where |E_in(S)| is the number of edges internal to the candidate set S, |E_out(S)| is the number of edges leaving S, and α is a tunable parameter controlling the penalty for external connections (α=1.0 is used in most experiments). At each iteration the algorithm evaluates the effect of adding or removing a single node on F(S) and performs the move that yields the greatest increase. Because the fitness function is purely local, the expansion does not require global recomputation, which makes the method scalable. Overlap is naturally accommodated: a node already assigned to another community may still be added if it improves the fitness of the current seed, allowing multiple memberships without post‑processing constraints.
The authors conduct extensive benchmarking on synthetic networks generated with the LFR benchmark model. They vary network size, average degree, community size distribution, and, crucially, the average number of community memberships per node (ranging from 1 to 5). Performance is measured using Normalized Mutual Information (NMI) and the Omega Index. GCE consistently achieves NMI values above 0.80 even when each node belongs to five communities, whereas competing algorithms such as CPM, OSLOM, COPRA, and LFM drop below 0.60 once the overlap exceeds two communities. Moreover, GCE’s performance remains stable across different graph densities and degree heterogeneities, indicating robustness to diverse topologies.
Real‑world applicability is demonstrated on two datasets. In a human protein‑protein interaction (PPI) network, the communities discovered by GCE are evaluated against Gene Ontology (GO) functional modules. Enrichment analysis shows that GCE’s modules have significantly lower p‑values (average ≈10⁻⁵) than those produced by other methods, suggesting a better alignment with biological function. In a Facebook friendship network from a U.S. university, the algorithm is tasked with recovering dormitory assignments, a ground‑truth that naturally exhibits overlapping membership (students often belong to multiple social circles). GCE attains a matching accuracy of 0.73, outperforming OSLOM (0.61) and COPRA (0.58), and retains high precision even when nodes have multiple dorm affiliations.
The paper discusses strengths and limitations. Strengths include (1) high‑quality seed selection via maximal cliques, (2) a simple yet effective local fitness function that avoids expensive global optimization, and (3) an inherent ability to handle multi‑membership without additional post‑processing. Limitations arise in extremely sparse graphs where maximal cliques are scarce, potentially starving the algorithm of seeds, and the sensitivity of the α parameter to network characteristics, which may require dataset‑specific tuning. The authors suggest future work on dynamic networks, extensions to weighted and directed graphs, and automated parameter selection or sampling‑based clique discovery to improve scalability further.
In conclusion, Greedy Clique Expansion provides a powerful, scalable solution for detecting highly overlapping community structures. Its superior performance on both synthetic benchmarks with extreme overlap and on real biological and social networks demonstrates its practical relevance, positioning GCE as a leading tool for community detection in domains where multi‑membership is the norm.
Comments & Academic Discussion
Loading comments...
Leave a Comment