Clique counting in MapReduce: algorithms and experiments
We tackle the problem of counting the number of $k$-cliques in large-scale graphs, for any constant $k \ge 3$. Clique counting is essential in a variety of applications, including social network analysis. Due to its computationally intensive nature, we settle for parallel solutions in the MapReduce framework, which has become in the last few years a *de facto* standard for batch processing of massive data sets. We give both theoretical and experimental contributions. On the theory side, we design the first exact scalable algorithm for counting (and listing) $k$-cliques. Our algorithm uses $O(m^{3/2})$ total space and $O(m^{k/2})$ work, where $m$ is the number of graph edges. This matches the best-known bounds for triangle listing when $k=3$ and is work-optimal in the worst case for any $k$, while keeping the communication cost independent of $k$. We also design a sampling-based estimator that can dramatically reduce the running time and space requirements of the exact approach, while providing very accurate solutions with high probability. We then assess the effectiveness of different clique counting approaches through an extensive experimental analysis over the Amazon EC2 platform, considering both our algorithms and their state-of-the-art competitors. The experimental results clearly highlight the algorithm of choice in different scenarios and prove our exact approach to be the most effective when the number of $k$-cliques is large, gracefully scaling to non-trivial values of $k$ even on clusters of small/medium size. Our approximation algorithm achieves extremely accurate estimates and large speedups, especially on the toughest instances for the exact algorithms. As a side effect, our study also sheds light on the number of $k$-cliques of several real-world graphs, mainly social networks, and on its growth rate as a function of $k$.
💡 Research Summary
The paper addresses the challenging problem of counting k‑cliques (for any constant k ≥ 3) in massive graphs using the MapReduce programming model. Clique counting is a fundamental primitive for many applications such as social‑network analysis, spam detection, and biological‑network pattern discovery, but exact sequential algorithms quickly become infeasible on today’s large datasets. The authors therefore propose scalable parallel solutions that work within the de‑facto batch‑processing framework of MapReduce/Hadoop. Their contributions are twofold: an exact algorithm (named F³k) and a sampling‑based approximation scheme.
The exact algorithm exploits a total order on vertices (by degree, breaking ties with vertex IDs) and assigns each k‑clique to its smallest vertex. For every vertex u, the algorithm builds the induced subgraph G⁺(u) formed by the "high‑neighbors" Γ⁺(u) (those vertices v with u ≺ v). All k‑cliques for which u is the smallest vertex are then in one‑to‑one correspondence with (k‑1)‑cliques inside G⁺(u). The computation proceeds in three MapReduce rounds: (1) each mapper emits (u, v) for edges where u ≺ v, allowing reducers to collect Γ⁺(u); (2) reducers receive each edge (x, y) together with the list of vertices u that have both x and y in their high‑neighbor sets Γ⁺(u), thereby linking edges to the appropriate G⁺(u); (3) for each u a reducer reconstructs G⁺(u) and counts its (k‑1)‑cliques, finally emitting the contribution of u. The algorithm uses O(m^{3/2}) total space and O(m^{k/2}) total work, where m is the number of edges. Local memory per mapper/reducer is O(m) and local running time O(m^{(k‑1)/2}). Importantly, the communication cost does not depend on k, unlike previous multi‑way join (AFU_k) and Partition approaches, whose replication factor grows as b^{k‑2}. Consequently, F³k matches the optimal bounds for triangle listing (k = 3) and is work‑optimal for any k, while keeping communication overhead constant with respect to k. The authors note that F³k does not formally belong to the MRC class because some reducers use linear memory, but empirical results show that this does not hinder practical scalability.
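The per‑vertex counting idea behind the exact algorithm can be illustrated with a small sequential sketch (function names and data layout are illustrative, not from the paper; the paper distributes this work across MapReduce reducers). Each k‑clique is charged to its smallest vertex u under the (degree, ID) total order and counted as a (k−1)‑clique inside G⁺(u):

```python
from itertools import combinations

def count_k_cliques(edges, k):
    """Sequential sketch: charge each k-clique to its smallest vertex."""
    # Build adjacency sets from the edge list.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Total order on vertices: degree first, vertex ID to break ties.
    rank = {u: (len(adj[u]), u) for u in adj}

    total = 0
    for u in adj:
        # High-neighbors Gamma+(u): neighbors that follow u in the order.
        gamma_plus = [v for v in adj[u] if rank[v] > rank[u]]
        # Count (k-1)-cliques inside the induced subgraph G+(u).
        # (Brute force here; the paper's reducers do this step locally.)
        for cand in combinations(gamma_plus, k - 1):
            if all(b in adj[a] for a, b in combinations(cand, 2)):
                total += 1
    return total
```

Because every vertex in Γ⁺(u) has degree at least deg(u), each high‑neighbor set has size O(√m), which is what bounds the local work and yields the O(m^{k/2}) total work.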
The second contribution is a sampling‑based estimator. The idea is to sample a subset of high‑neighbor sets, count cliques only in the sampled subgraphs, and scale the result by the inverse sampling probability. By applying Chernoff‑type concentration bounds, the authors derive the number of samples needed to guarantee, with probability 1 − δ, an ε‑relative error. For k = 3 the required conditions are weaker than those of the state‑of‑the‑art triangle‑sampling algorithm of Pagh and Tsourakakis. The estimator fits within the MRC model, uses sublinear local memory, and dramatically reduces both runtime and space while still delivering highly accurate estimates.
Experimental evaluation is performed on Amazon EC2 clusters of various sizes, using real‑world networks from the SNAP repository (e.g., YouTube, Gowalla) and synthetic graphs generated by the preferential‑attachment model. The authors report several key findings: (i) In many real graphs the number of k‑cliques q_k grows extremely fast (up to tens or hundreds of trillions for modest k), making exact listing potentially require terabytes of output; (ii) Among exact methods, F³k consistently outperforms the triangle‑counting algorithm of Suri and Vassilvitskii and the multi‑way join algorithm of Afrati et al. on “hard” instances where q_k is large, while on easy instances (small q_k or k = 3) the older methods can be slightly faster; (iii) The approximation algorithm achieves speed‑ups of an order of magnitude or more, solving in minutes instances that are impossible to solve exactly, with average relative error around 0.08 % and negligible variance across runs and cluster configurations; (iv) The experiments also reveal diverse growth patterns of q_k across datasets—some graphs exhibit rapid exponential growth, others show sub‑linear or even decreasing trends as k increases.
Overall, the paper makes a substantial contribution to large‑scale graph analytics. It provides the first work‑optimal, k‑independent‑communication exact algorithm for k‑clique counting in MapReduce, a practical sampling‑based estimator with provable guarantees, and an extensive empirical study that validates the theoretical claims and offers insights into the behavior of cliques in real networks. The authors release their implementation (https://github.com/CliqueCounter/QkCount/) to facilitate reproducibility and future research.