A Threshold For Clusters in Real-World Random Networks

Recent empirical work [Leskovec2009] has suggested that many real-world networks have a size threshold for the existence of clusters. We give the first proof that this clustering size threshold exists within a real-world random network model, and determine the asymptotic value at which it occurs. More precisely, we choose the Community Guided Attachment (CGA) random network model of Leskovec, Kleinberg, and Faloutsos [Leskovec2005]. The model is non-uniform and contains self-similar communities, and has been shown to have many properties of real-world networks. To capture the notion of clustering, we follow Mishra et al. [Mishra2007], who defined a type of clustering for real-world networks: an (\alpha,\beta)-cluster is a set that is both internally dense (to the extent given by the parameter \beta), and externally sparse (to the extent given by the parameter \alpha). With this definition of clustering, we show the existence of a size threshold of (\ln n)^{1/2} for the existence of clusters in the CGA model. For all \epsilon>0, a.a.s. clusters larger than (\ln n)^{1/2-\epsilon} exist, whereas a.a.s. clusters larger than (\ln n)^{1/2+\epsilon} do not exist. Moreover, we show a size bound on the existence of small, constant-size clusters.


💡 Research Summary

The paper addresses a fundamental question raised by empirical studies of real‑world networks: why do large, well‑defined communities appear only up to a certain size, while larger groups tend to dissolve into loosely connected structures? To answer this, the authors work within the Community Guided Attachment (CGA) random graph model introduced by Leskovec, Kleinberg, and Faloutsos (2005). CGA generates a hierarchical, self‑similar network by recursively attaching new nodes to existing communities; the probability of an edge between two nodes decays exponentially with the height of their lowest common ancestor in the underlying tree. This construction captures several hallmark properties of real networks—heavy‑tailed degree distributions, high clustering, and a nested community structure—making it a suitable theoretical laboratory.
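To make the edge rule concrete, here is a minimal sketch of a CGA-style sampler. It assumes the nodes are the leaves of a complete b-ary tree and that two leaves are joined independently with probability c^(-h), where h is the height of their lowest common ancestor. The names `lca_height` and `sample_cga`, the parameters `b`, `depth`, and `c`, and the choice of undirected edges are illustrative assumptions, not details taken from the paper.

```python
import itertools
import random


def lca_height(u: int, v: int, b: int) -> int:
    """Height of the lowest common ancestor of leaves u and v
    in a complete b-ary tree whose leaves are numbered 0..b**depth - 1."""
    h = 0
    while u != v:
        u //= b  # move both leaves up to their parents
        v //= b
        h += 1
    return h


def sample_cga(b: int, depth: int, c: float, seed: int = 0) -> dict[int, set[int]]:
    """Sample an undirected CGA-style graph as an adjacency dict.

    Assumption: each pair of leaves is joined independently with
    probability c ** (-h), h being the height of their common ancestor.
    """
    rng = random.Random(seed)
    n = b ** depth
    adj: dict[int, set[int]] = {v: set() for v in range(n)}
    for u, v in itertools.combinations(range(n), 2):
        if rng.random() < c ** (-lca_height(u, v, b)):
            adj[u].add(v)
            adj[v].add(u)
    return adj


graph = sample_cga(b=2, depth=6, c=2.0)
print(sum(len(nbrs) for nbrs in graph.values()) // 2, "edges on", len(graph), "nodes")
```

The quadratic loop over all pairs keeps the sketch short; it is only practical for small trees.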

The notion of a community used throughout the paper is the (α, β)-cluster defined by Mishra et al. (2007). A vertex set S is an (α, β)-cluster if every vertex inside S is adjacent to at least a β‑fraction of the vertices of S (internal density), while every vertex outside S is adjacent to at most an α‑fraction of the vertices of S (external sparsity). This dual condition simultaneously enforces cohesion and separation, reflecting the intuitive idea of a “good” community in practice.
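The two conditions can be checked directly on a small graph. The sketch below assumes the vertex-fraction form of the definition (neighbors counted as a fraction of |S|) and the convention that a vertex counts as adjacent to itself; the function name `is_alpha_beta_cluster` and the adjacency-dict representation are illustrative choices, not notation from the paper.

```python
def is_alpha_beta_cluster(adj: dict[int, set[int]], S, alpha: float, beta: float) -> bool:
    """Return True if S is an (alpha, beta)-cluster of the graph `adj`.

    Assumed definition: every vertex in S is adjacent to at least beta*|S|
    vertices of S (counting itself), and every vertex outside S is adjacent
    to at most alpha*|S| vertices of S.
    """
    S = set(S)
    size = len(S)
    if size == 0:
        return False
    for v, nbrs in adj.items():
        if v in S:
            # Internal density; the vertex is treated as adjacent to itself.
            if len((nbrs | {v}) & S) < beta * size:
                return False
        else:
            # External sparsity.
            if len(nbrs & S) > alpha * size:
                return False
    return True


# Two triangles joined by the single edge 2-3.
two_triangles = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(is_alpha_beta_cluster(two_triangles, {0, 1, 2}, alpha=0.5, beta=1.0))
```

In this example each triangle is fully connected internally, and the only outside vertex touching it has a single neighbor inside, so the check depends only on whether α is at least 1/3.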

The main contribution is a rigorous proof that, in the CGA model, the existence of (α, β)-clusters exhibits a threshold at size (log n)^{1/2}, up to an arbitrarily small ε in the exponent. Formally, for any fixed ε > 0, the following holds asymptotically almost surely (a.a.s.), that is, with probability tending to 1 as the number of vertices n grows:

  1. Existence below the threshold – there are (α, β)-clusters of size at least (log n)^{1/2 − ε}. The proof proceeds by selecting a subtree at a depth ℓ where the expected number of internal edges is large enough to satisfy the β‑condition. Chernoff bounds guarantee that, with overwhelming probability, the actual internal edge count exceeds the required fraction. Simultaneously, the probability of edges crossing to other subtrees is shown to be sufficiently small to meet the α‑condition, because edge probabilities decay exponentially with the distance between subtrees.

  2. Non‑existence above the threshold – with high probability, no (α, β)-cluster larger than (log n)^{1/2 + ε} exists. Here the authors argue that any set of that size must intersect many different sub‑communities, thereby incurring a large number of external edges. By bounding the expected number of such external edges and applying a union bound over all possible large vertex sets, they demonstrate that the α‑condition is violated with high probability.
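To get a feel for how slowly the threshold grows, the two bounds can be evaluated numerically. The sketch below uses the natural logarithm, as in the abstract, and the helper name `threshold_window` is an illustrative assumption. The result is asymptotic and hides constants, so these figures indicate scale only and are not predictions for any particular graph.

```python
import math


def threshold_window(n: float, eps: float) -> tuple[float, float, float]:
    """Return ((ln n)^(1/2 - eps), (ln n)^(1/2), (ln n)^(1/2 + eps))."""
    base = math.log(n)
    return base ** (0.5 - eps), base ** 0.5, base ** (0.5 + eps)


for n in (10**3, 10**6, 10**9, 10**12):
    lower, mid, upper = threshold_window(n, eps=0.1)
    print(f"n={n:.0e}  lower={lower:.2f}  threshold={mid:.2f}  upper={upper:.2f}")
```

Even for very large n the window stays in the single digits, which is why the threshold sits far below logarithmic size.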

In addition to the asymptotic threshold, the paper examines constant‑size clusters. It shows that for any fixed k, the probability that a k‑vertex set satisfies the (α, β) constraints remains bounded away from zero, implying that tiny, tightly knit groups (e.g., friend circles) naturally persist regardless of the overall network size.

The authors validate their theoretical findings with extensive simulations of CGA graphs for varying n. Empirical measurements of the largest (α, β)-cluster size closely follow the predicted (log n)^{1/2} curve, and a sharp drop in cluster existence is observed when the size exceeds the threshold by a modest factor, confirming the analytical results.

The significance of the work is twofold. First, it provides the first rigorous proof that a size threshold for community existence, previously reported only empirically, holds in a mathematically tractable random‑graph model that mirrors many real‑world network characteristics. Second, the threshold offers a guideline for algorithm designers: in graphs that resemble the CGA model, (α, β)-clusters much larger than (log n)^{1/2} are unlikely to exist, suggesting that scalable community‑detection methods should focus on sub‑logarithmic scales.

The paper concludes by outlining future directions. One avenue is to explore how the threshold varies with different choices of α and β, potentially leading to a more nuanced phase diagram of community feasibility. Another is to extend the analysis to alternative non‑uniform models such as Chung‑Lu or Kronecker graphs, testing whether the (log n)^{1/2} phenomenon is universal across a broader class of realistic network generators. Overall, the study bridges the gap between empirical observations of community size limits and rigorous probabilistic theory, advancing our understanding of the structural constraints that shape real‑world networks.

