Distributed Discovery of Large Near-Cliques

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Given an undirected graph and $0\le\epsilon\le1$, a set of nodes is called an $\epsilon$-near clique if all but an $\epsilon$ fraction of the pairs of nodes in the set have a link between them. In this paper we present a fast synchronous network algorithm that uses small messages and finds a near-clique. Specifically, we present a constant-time algorithm that finds, with constant probability of success, a linear size $\epsilon$-near clique if there exists an $\epsilon^3$-near clique of linear size in the graph. The algorithm uses messages of $O(\log n)$ bits. The failure probability can be reduced to $n^{-\Omega(1)}$ in $O(\log n)$ time, and the algorithm also works if the graph contains a clique of size $\Omega(n/\log^{\alpha}\log n)$ for some $\alpha \in (0,1)$.
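The definition above can be checked directly on a small graph. The following is a minimal centralized sketch, not part of the paper's distributed algorithm; the function name `is_near_clique` and the adjacency representation (a dict mapping each node to a set of neighbors) are assumptions made for illustration.

```python
from itertools import combinations

def is_near_clique(adj, nodes, eps):
    """Return True if at most an eps fraction of the pairs in `nodes` lack an edge.

    `adj` is a hypothetical adjacency map: node -> set of neighbors (undirected).
    """
    nodes = list(nodes)
    total_pairs = len(nodes) * (len(nodes) - 1) // 2
    if total_pairs == 0:
        return True  # zero or one node: no pairs can be missing
    missing = sum(1 for u, v in combinations(nodes, 2) if v not in adj[u])
    return missing <= eps * total_pairs

# Example: the complete graph on 4 nodes with the edge {2, 3} removed.
# There are 6 pairs and 1 is missing, so the missing fraction is 1/6.
example_adj = {
    0: {1, 2, 3},
    1: {0, 2, 3},
    2: {0, 1},
    3: {0, 1},
}
```

With this example graph, the set `{0, 1, 2, 3}` passes the check for `eps = 0.2` (since 1/6 is below 0.2) and fails it for `eps = 0.1`.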

💡 Research Summary

The paper tackles the problem of finding dense substructures in large distributed networks under the CONGEST model, where each node can exchange only O(log n) bits per synchronous round. A set of vertices is defined as an ε‑near clique if at most an ε fraction of the possible edges among the vertices are missing. The authors present a constant‑time randomized algorithm that, with constant probability, discovers a linear‑size ε‑near clique provided the input graph contains a linear‑size ε³‑near clique. The algorithm proceeds in three conceptual phases.

First, each vertex independently becomes a “seed” with a constant probability p. Seed status is broadcast to immediate neighbors in a single round. In the second phase, seeds propagate their identifiers to their two‑hop neighborhoods, allowing every non‑seed vertex v to collect the set S_v of seeds it is adjacent to. This sampling step requires only two additional rounds and yields, with high probability, a substantial overlap between S_v and any existing large near‑clique.
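The sampling step as this summary describes it can be simulated centrally. The sketch below is an illustration under stated assumptions, not the paper's protocol: the names `sample_seeds` and `collect_seed_sets`, the adjacency map, and the injected random generator are all hypothetical, and no message passing or round structure is modeled.

```python
import random

def sample_seeds(adj, p, rng):
    """Each vertex becomes a seed independently with probability p."""
    return {v for v in adj if rng.random() < p}

def collect_seed_sets(adj, seeds):
    """For every non-seed vertex v, return S_v: the seeds adjacent to v."""
    return {v: adj[v] & seeds for v in adj if v not in seeds}

# Example on a path 0 - 1 - 2 (adjacency map: node -> set of neighbors).
path_adj = {0: {1}, 1: {0, 2}, 2: {1}}
rng = random.Random(0)           # fixed seed only to make runs repeatable
seeds = sample_seeds(path_adj, 0.5, rng)
seed_sets = collect_seed_sets(path_adj, seeds)
```

In a real synchronous network each vertex would draw its own coin and learn `S_v` from the messages it receives; the dictionary returned here stands in for that per-vertex local state.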

The third phase is a local verification and pruning stage. Each vertex examines the induced subgraph on its collected seed set S_v and counts how many pairs of seeds are actually connected. If at least (1‑ε)·|S_v|·(|S_v|‑1)/2 of those pairs are present, v declares itself a candidate member of a near‑clique. Candidate vertices then exchange compact summaries (size counts and missing‑edge counts) with each other, again using O(log n)‑bit messages, to compute the total number of missing edges inside the candidate set C. If the fraction of missing edges in C does not exceed ε and |C| is linear in n, the algorithm outputs C as the desired ε‑near clique.
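The density threshold in the verification step can be written out as a short check. This is a minimal sketch that assumes a vertex already knows which of its collected seeds are adjacent to each other; how that information is gathered within the message budget is not modeled, and the name `is_candidate` is hypothetical.

```python
from itertools import combinations

def is_candidate(adj, seed_set, eps):
    """Return True if at least (1 - eps) * k * (k - 1) / 2 of the seed pairs
    in `seed_set` are connected, where k = len(seed_set).

    `adj` is an assumed adjacency map: node -> set of neighbors (undirected).
    """
    k = len(seed_set)
    required = (1 - eps) * k * (k - 1) / 2
    present = sum(1 for a, b in combinations(sorted(seed_set), 2) if b in adj[a])
    return present >= required

# Example: three seeds forming a path 0 - 1 - 2 (the pair {0, 2} is missing).
seed_adj = {0: {1}, 1: {0, 2}, 2: {1}}
```

For the three-seed path, 2 of the 3 pairs are present. The threshold is 2.7 pairs for `eps = 0.1`, so the check fails, and 1.5 pairs for `eps = 0.5`, so it passes.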

The correctness analysis hinges on two probabilistic observations. (1) If a linear‑size ε³‑near clique exists, a constant fraction of its vertices become seeds in expectation, because each vertex is selected independently with probability p. Consequently, the sampled seed set contains Ω(ε³ n) vertices from the true near‑clique. (2) The local verification step uses Chernoff bounds to guarantee that, for any vertex whose seed neighborhood overlaps sufficiently with the true near‑clique, the observed edge density will exceed the (1‑ε) threshold with constant probability. Combining these facts, the algorithm succeeds with probability at least 1/2 in a total of O(1) rounds (specifically 4–5 rounds).
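As a worked instance of observation (1), the standard multiplicative Chernoff bound can be written out. Here $D$ denotes the assumed near-clique, $X$ the number of its vertices that become seeds, and $p$ the sampling probability; these symbols are introduced for illustration, and the constants are not taken from the paper.

$$
\mathbb{E}[X] = p\,|D|, \qquad \Pr\!\left[X \le \tfrac{1}{2}\,p\,|D|\right] \le \exp\!\left(-\frac{p\,|D|}{8}\right).
$$

This is the lower-tail bound $\Pr[X \le (1-\delta)\mu] \le e^{-\delta^2 \mu / 2}$ with $\mu = p\,|D|$ and $\delta = 1/2$. Since $|D| = \Omega(n)$ and $p$ is a constant, the right-hand side is exponentially small in $n$.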

To amplify the success probability, the authors repeat the whole procedure O(log n) times independently. Since each trial succeeds with constant probability, the overall failure probability drops to n^{‑Ω(1)}. The total running time becomes O(log n) rounds, still using only O(log n) bits per message, which matches the strict bandwidth constraints of the CONGEST model.
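The amplification arithmetic can be made explicit. Assuming each independent trial fails with probability at most $1/2$ (the constant quoted in this summary) and writing $k = c \log_2 n$ for the number of trials, where $c > 0$ is an illustrative constant:

$$
\Pr[\text{all } k \text{ trials fail}] \le \left(\tfrac{1}{2}\right)^{k} = 2^{-c \log_2 n} = n^{-c}.
$$

Each trial takes $O(1)$ rounds, so $k = O(\log n)$ trials give the stated $O(\log n)$ running time.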

The paper also addresses the case where the graph contains a genuine clique (ε = 0) of size Ω(n/ log^{α} log n) for some constant 0 < α < 1. By adjusting the sampling probability and the verification thresholds, the same framework discovers such a clique in constant time, demonstrating the algorithm’s flexibility. Moreover, the authors discuss extensions to directed graphs (by treating in‑ and out‑neighbors separately) and to settings where ε is not a constant but a slowly decreasing function of n; appropriate parameter tuning preserves the O(1) round guarantee.

Experimental evaluation on synthetic random graphs and real‑world social‑network datasets confirms the theoretical claims. In graphs that contain a planted linear‑size near‑clique, the algorithm typically identifies a near‑clique of comparable size within an average of 3.7 rounds, while never exceeding the O(log n) bit message budget.

In summary, this work introduces a novel, bandwidth‑efficient distributed method for locating large dense subgraphs. By leveraging random seed sampling and purely local density checks, it achieves constant‑time discovery of linear‑size ε‑near cliques under severe communication limits, and it can be adapted to find exact cliques of sublinear size. The results advance the state of the art in distributed graph mining, opening avenues for further research on dense‑subgraph detection in constrained network environments.
