A Scalable Generative Graph Model with Community Structure
Network data is ubiquitous and growing, yet we lack realistic generative network models that can be calibrated to match real-world data. The recently proposed Block Two-Level Erdős-Rényi (BTER) model can be tuned to capture two fundamental properties: degree distribution and clustering coefficients. The latter is particularly important for reproducing graphs with community structure, such as social networks. In this paper, we compare BTER to other scalable models and show that it gives a better fit to real data. We provide a scalable implementation that requires only O(d_max) storage, where d_max is the maximum number of neighbors for a single node. The generator is trivially parallelizable, and we show results for a Hadoop MapReduce implementation for modeling a real-world web graph with over 4.6 billion edges. We propose that the BTER model can be used as a graph generator for benchmarking purposes and provide idealized degree distributions and clustering coefficient profiles that can be tuned for user specifications.
💡 Research Summary
The paper presents a detailed study of the Block Two‑Level Erdős‑Rényi (BTER) model, a scalable generative graph algorithm designed to simultaneously reproduce two fundamental characteristics of real‑world networks: a heavy‑tailed degree distribution and high, degree‑dependent clustering coefficients. The authors first motivate the need for realistic synthetic graphs, noting that privacy, security, and size constraints often prevent direct use of real data, while existing generative models (preferential attachment, Stochastic Kronecker Graphs, Chung‑Lu, etc.) either fail to capture clustering or require costly parameter fitting.
BTER addresses these shortcomings by dividing the generation process into three logical stages. In the preprocessing stage, each vertex is assigned a target degree according to the supplied degree distribution {n_d}. Vertices are then grouped into “affinity blocks” based on degree and the desired clustering profile {c_d}. The size of each block and its internal edge‑creation probability ρ are derived analytically from c_d, ensuring that low‑degree vertices receive higher intra‑block connectivity, which creates many triangles locally.
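The preprocessing stage can be illustrated with a minimal sketch. The summary does not give the exact block-size or ρ formulas, so the choices below are assumptions: vertices are sorted by target degree, each block takes d_min + 1 consecutive vertices (d_min being the smallest degree in the block), and ρ is set to the cube root of c_{d_min}. The cube root follows from a rough argument: if a vertex places a fraction ρ of its edges inside the block, about ρ² of its wedges lie inside the block and each closes with probability ρ, giving clustering near ρ³. The function name `build_affinity_blocks` is hypothetical.

```python
from typing import Dict, List, Tuple


def build_affinity_blocks(
    degrees: List[int], ccd: Dict[int, float]
) -> List[Tuple[List[int], float]]:
    """Group vertices into affinity blocks (illustrative assumption, not the paper's code).

    degrees[v] is the target degree of vertex v.
    ccd[d] is the target clustering coefficient for degree-d vertices.
    Returns a list of (member_vertex_ids, rho) pairs.
    """
    # Degree-1 vertices cannot sit in a triangle, so they are left out of blocks.
    order = sorted(
        (v for v, d in enumerate(degrees) if d >= 2), key=lambda v: degrees[v]
    )
    blocks = []
    i = 0
    while i < len(order):
        d_min = degrees[order[i]]
        size = d_min + 1  # assumed block size: a degree-d vertex can reach d block-mates
        members = order[i : i + size]  # the final block may be smaller than `size`
        rho = ccd.get(d_min, 0.0) ** (1.0 / 3.0)  # assumed: c_d ~ rho^3
        blocks.append((members, rho))
        i += size
    return blocks
```

With this assumed rule, low-degree vertices land in small blocks with large ρ whenever c_d decreases with d, which matches the behavior described above.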
Phase 1 generates intra‑block edges using dense Erdős‑Rényi graphs. Because the block parameters are chosen to match the target clustering, this phase yields the required number of closed wedges (triangles) without significantly altering the overall degree distribution. Phase 2 adds inter‑block edges using a fast variant of the Chung‑Lu model: each endpoint is sampled independently with probability proportional to its target degree, and m = |E| such endpoint pairs are drawn. This yields the expected edge probability d_i d_j / (2m), reproducing the prescribed degree sequence in expectation while preserving the clustering built in Phase 1. Self‑loops and duplicate edges are discarded after generation, a step that has negligible impact on the final statistics.
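The two phases can be sketched in a few lines of Python. This is a single-machine illustration under stated assumptions, not the authors' implementation: `phase1_edges` and `phase2_edges` are hypothetical names, blocks are passed as (members, ρ) pairs, and the Phase 2 weights are whatever per-vertex degree weights the caller supplies. Storing undirected edges in a set removes duplicates, and self-loops are skipped explicitly.

```python
import random
from itertools import combinations
from typing import List, Sequence, Set, Tuple

Edge = Tuple[int, int]


def phase1_edges(blocks: Sequence[Tuple[List[int], float]], rng: random.Random) -> Set[Edge]:
    """Erdős-Rényi inside each block: every vertex pair is linked with probability rho."""
    edges: Set[Edge] = set()
    for members, rho in blocks:
        for u, v in combinations(members, 2):
            if rng.random() < rho:
                edges.add((min(u, v), max(u, v)))
    return edges


def phase2_edges(weights: Sequence[float], num_pairs: int, rng: random.Random) -> Set[Edge]:
    """Chung-Lu style sampling: draw num_pairs endpoint pairs, each endpoint
    chosen independently with probability proportional to its weight."""
    vertices = range(len(weights))
    heads = rng.choices(vertices, weights=weights, k=num_pairs)
    tails = rng.choices(vertices, weights=weights, k=num_pairs)
    edges: Set[Edge] = set()
    for u, v in zip(heads, tails):
        if u != v:  # discard self-loops
            edges.add((min(u, v), max(u, v)))  # the set discards duplicates
    return edges
```

A full generator would take the union of the two edge sets. Because each endpoint pair is drawn independently, the Phase 2 loop can be split across workers without coordination.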
From an implementation perspective, the authors achieve O(d_max) memory usage by storing only the remaining “stubs” for each vertex and compact block lists. Edge sampling is performed by binary search over cumulative degree weights, giving O(log d_max) time per endpoint selection. Because every edge is generated independently, the algorithm is trivially parallelizable; the paper demonstrates a Hadoop MapReduce implementation that scales to a web graph with 130 million vertices and 4.6 billion edges, completing in a few hours on a modest cluster.
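One way to obtain O(d_max) storage and O(log d_max) time per endpoint is to index the cumulative weights by degree rather than by vertex. The sketch below assumes vertex ids are assigned in order of increasing degree, so a vertex can be recovered from its degree class plus an offset; that layout and the name `make_endpoint_sampler` are assumptions for illustration and are not taken from the paper.

```python
import bisect
import random
from itertools import accumulate
from typing import Callable, List


def make_endpoint_sampler(degree_counts: List[int], rng: random.Random) -> Callable[[], int]:
    """degree_counts[d] = number of vertices with target degree d (index 0 unused).

    Returns a function that samples a vertex id with probability
    proportional to its degree, using arrays of length d_max + 1 only.
    """
    # Total weight of degree class d is d * n_d.
    weights = [d * n for d, n in enumerate(degree_counts)]
    cumulative = list(accumulate(weights))
    # offsets[d] = id of the first vertex with degree d (ids sorted by degree).
    offsets = [0] + list(accumulate(degree_counts))[:-1]
    total = cumulative[-1]

    def sample() -> int:
        r = rng.random() * total
        # Binary search for the degree class containing r; empty classes are skipped.
        d = min(bisect.bisect_right(cumulative, r), len(cumulative) - 1)
        # All vertices in a class have equal weight, so pick one uniformly.
        return offsets[d] + rng.randrange(degree_counts[d])

    return sample
```

For example, with `degree_counts = [0, 4, 0, 2]` there are four degree-1 vertices (ids 0 to 3) and two degree-3 vertices (ids 4 and 5), and the degree-3 vertices together receive 6/10 of the draws.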
Experimental evaluation compares BTER against SKG, Chung‑Lu, and other community‑aware models on several large real‑world datasets. BTER consistently matches the empirical degree distribution (within a few percent) and reproduces the degree‑wise clustering curve far more accurately than the baselines—often achieving clustering coefficients an order of magnitude larger. Moreover, BTER requires no iterative optimization; all parameters are computed directly from the input sequences, which simplifies its use for benchmarking when a target graph is unavailable.
The authors also discuss how to generate synthetic input profiles when real data are absent. They recommend a generalized log‑normal distribution for degrees (as a flexible alternative to pure power‑law) and provide functional forms for c_d (e.g., c_d ≈ α·d^{‑β}) that can be tuned to desired community strength. For extremely large graphs, they reference a recent sampling technique that estimates clustering coefficients efficiently.
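Synthetic input profiles of this kind are easy to produce. The sketch below is an assumption-laden illustration: the degree counts use one possible generalized log-normal parameterization, n_d ∝ exp(−(ln d / a)^b), which the summary does not spell out, and the clustering profile uses the c_d ≈ α·d^{−β} form quoted above, capped at 1. The function names and parameter names are hypothetical.

```python
import math
from typing import Dict, List


def degree_counts_lognormal(n: int, d_max: int, a: float, b: float) -> List[int]:
    """Return counts[d] ~ n * p_d with p_d proportional to exp(-(ln d / a)^b).

    Index 0 is unused. Counts are rounded, so they sum to n only approximately.
    """
    weights = [0.0] + [
        math.exp(-((math.log(d) / a) ** b)) for d in range(1, d_max + 1)
    ]
    total = sum(weights)
    return [round(n * w / total) for w in weights]


def clustering_profile(d_max: int, alpha: float, beta: float) -> Dict[int, float]:
    """Return c_d = min(1, alpha * d^(-beta)) for d = 2..d_max."""
    return {d: min(1.0, alpha * d ** (-beta)) for d in range(2, d_max + 1)}
```

Larger `alpha` or smaller `beta` raises the clustering targets across all degrees, which corresponds to stronger community structure in the generated graph.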
In summary, the paper makes four principal contributions: (1) a clear, mathematically grounded description of BTER that captures both degree distribution and clustering; (2) a memory‑efficient, O(log d_max) per‑edge implementation suitable for billions of edges; (3) extensive empirical validation showing superior fidelity to real networks compared with existing scalable models; and (4) practical guidance for using BTER as a benchmark generator, including synthetic profile creation and parallel deployment. The work establishes BTER as a leading tool for generating realistic, community‑rich synthetic graphs at massive scale.