Fast Generation of Large Scale Social Networks with Clustering
A key challenge within the social network literature is the problem of network generation - that is, how can we create synthetic networks that match characteristics traditionally found in most real world networks? Important characteristics that are present in social networks include a power law degree distribution, small diameter and large amounts of clustering; however, most current network generators, such as the Chung Lu and Kronecker models, largely ignore the clustering present in a graph and choose to focus on preserving other network statistics, such as the power law distribution. Models such as the exponential random graph model have a transitivity parameter, but are computationally difficult to learn, making scaling to large real world networks intractable. In this work, we propose an extension to the Chung Lu random graph model, the Transitive Chung Lu (TCL) model, which incorporates the notion of a random transitive edge. That is, with some probability it will choose to connect to a node exactly two hops away, having been introduced to a ‘friend of a friend’. In all other cases it will follow the standard Chung Lu model, selecting a ‘random surfer’ from anywhere in the graph according to the given invariant distribution. We prove TCL’s expected degree distribution is equal to the degree distribution of the original graph, while being able to capture the clustering present in the network. The single parameter required by our model can be learned in seconds on graphs with millions of edges, while networks can be generated in time that is linear in the number of edges. We demonstrate the performance of TCL on four real-world social networks, including an email dataset with hundreds of thousands of nodes and millions of edges, showing TCL generates graphs that match the degree distribution, clustering coefficients and hop plots of the original networks.
💡 Research Summary
The paper tackles a long‑standing gap in synthetic social network generation: most scalable models faithfully reproduce the power‑law degree distribution and small‑world distances, but they fail to capture the high clustering that characterises real‑world social graphs. Classical generators such as Chung‑Lu (CL) and Kronecker focus on degree‑preserving random wiring, while exponential random graph models (ERGMs) can incorporate a transitivity term but are computationally prohibitive for networks with millions of nodes and edges.
To bridge this divide, the authors propose the Transitive Chung‑Lu (TCL) model, an extension of CL that introduces a single tunable parameter ρ controlling the probability of creating a “transitive” edge. When an edge is to be added, TCL first selects a source node u according to the CL invariant distribution (proportional to degree). With probability ρ it then connects u to a node v exactly two hops away, i.e., a friend‑of‑a‑friend. With complementary probability 1‑ρ it falls back to the standard CL mechanism and draws v from the same degree‑proportional distribution, independently of u. The authors prove that this mixture preserves the expected degree sequence: the expected degree of any vertex under TCL equals its degree in the input graph. At the same time, the transitive step raises the likelihood of forming triangles, thereby increasing the clustering of the generated graph.
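The sampling loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions that the summary does not state: the friend‑of‑a‑friend is reached by a two‑step random walk from u, the graph is initialised with a plain CL sample, and each newly added edge retires the oldest edge so that the edge count stays fixed. The name `tcl_generate` and its parameters are hypothetical.

```python
import random
from collections import deque

def tcl_generate(edges, rho, num_updates=None, seed=0):
    """Illustrative TCL-style sampler over an undirected edge list (a sketch)."""
    rng = random.Random(seed)
    m = len(edges)
    # Choosing a uniformly random endpoint slot of the input edge list
    # returns node v with probability d_v / (2m), i.e. degree-proportional.
    endpoints = [u for e in edges for u in e]

    def sample_pi():
        return rng.choice(endpoints)

    adj = {}  # multigraph adjacency lists; self-loops and duplicates are kept

    def add(u, v):
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)

    def remove(u, v):
        # list.remove is linear in the degree; acceptable for a sketch.
        adj[u].remove(v)
        adj[v].remove(u)

    # Assumption: start from a plain Chung-Lu sample with m edges.
    queue = deque()
    for _ in range(m):
        u, v = sample_pi(), sample_pi()
        add(u, v)
        queue.append((u, v))

    for _ in range(m if num_updates is None else num_updates):
        u = sample_pi()
        nbrs = adj.get(u)
        if nbrs and rng.random() < rho:
            # Transitive step (assumed): two-step random walk from u.
            w = rng.choice(nbrs)
            v = rng.choice(adj[w])
        else:
            # Standard Chung-Lu step: second endpoint drawn from pi.
            v = sample_pi()
        add(u, v)
        remove(*queue.popleft())  # assumption: retire the oldest edge
        queue.append((u, v))
    return list(queue)

toy_edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]
sample = tcl_generate(toy_edges, rho=0.5, seed=1)
```

The two‑step walk is chosen here because a random walk started from the degree‑proportional distribution stays in that distribution after every step, which is consistent with the degree‑preservation claim. A production version would also need a policy for self‑loops and duplicate edges, which this sketch simply keeps.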
Learning ρ is reduced to a one‑dimensional optimisation problem. The authors compute the expected clustering coefficient as a function of ρ using only the degree sequence and the adjacency list, then select the ρ that minimises the absolute difference between this expectation and the observed clustering of the real graph. Because the calculation touches each edge only a constant number of times, the learning phase runs in linear time O(|E|) and completes in a few seconds even for graphs with several million edges.
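As a rough illustration of the one‑dimensional fit described above, the sketch below scans ρ over a grid and keeps the value whose predicted clustering is closest to the observed one. The callable `expected_clustering` is a hypothetical stand‑in for whatever estimator maps ρ to an expected clustering coefficient; the paper's actual estimator is not reproduced here, and the linear `toy_curve` is invented purely to exercise the search.

```python
def fit_rho(target_clustering, expected_clustering, steps=100):
    """Grid search for rho in [0, 1] minimising |expected(rho) - target|."""
    best_rho, best_gap = 0.0, float("inf")
    for i in range(steps + 1):
        rho = i / steps
        gap = abs(expected_clustering(rho) - target_clustering)
        if gap < best_gap:
            best_rho, best_gap = rho, gap
    return best_rho

# Hypothetical, made-up response curve used only to demonstrate the search.
def toy_curve(rho):
    return 0.02 + 0.4 * rho

rho_hat = fit_rho(0.18, toy_curve)
```

If the expected clustering is monotone in ρ, a bisection or golden‑section search would reach the same answer with fewer evaluations; the grid is used here only because it makes no assumption about the shape of the curve.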
The experimental evaluation covers four diverse social datasets: an Enron email network (≈ 0.5 M nodes, 1.5 M edges), a Facebook friendship graph, a Twitter follower subgraph, and a university instant‑messenger network. For each dataset the authors compare TCL against the baseline CL, Kronecker, and ERGM (where feasible). The metrics include: (i) degree‑distribution fidelity (Kolmogorov‑Smirnov test), (ii) average and global clustering coefficients, (iii) hop‑plot and average shortest‑path length, and (iv) runtime for both parameter learning and graph generation.
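Two of the metrics listed above, the global clustering coefficient and the hop plot, are simple enough to compute directly. The following is a small pure‑Python sketch for undirected simple graphs; the helper names (`build_adj`, `global_clustering`, `hop_plot`) are illustrative and do not come from the paper, and the all‑pairs BFS is only practical for small graphs.

```python
from collections import deque
from itertools import combinations

def build_adj(edges):
    """Adjacency sets for an undirected simple graph (self-loops dropped)."""
    adj = {}
    for u, v in edges:
        if u != v:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return adj

def global_clustering(adj):
    """Transitivity: 3 * (number of triangles) / (number of connected triples)."""
    closed = triples = 0
    for nbrs in adj.values():
        k = len(nbrs)
        triples += k * (k - 1) // 2
        # Each triangle is counted once per corner, hence three times in total.
        closed += sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return closed / triples if triples else 0.0

def hop_plot(adj):
    """Map h -> number of ordered node pairs (u != v) within h hops."""
    exact = {}
    for src in adj:
        dist = {src: 0}
        frontier = deque([src])
        while frontier:  # breadth-first search from src
            x = frontier.popleft()
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    frontier.append(y)
        for d in dist.values():
            if d > 0:
                exact[d] = exact.get(d, 0) + 1
    plot, total = {}, 0
    for h in sorted(exact):
        total += exact[h]
        plot[h] = total
    return plot

adj = build_adj([(0, 1), (1, 2), (2, 0), (2, 3)])  # a triangle with one tail
cc = global_clustering(adj)   # 0.6
hops = hop_plot(adj)          # {1: 8, 2: 12}
```

Comparing these quantities between an original graph and a generated one gives a quick check of whether clustering and distances are reproduced. For graphs with millions of edges, sampled BFS sources or an approximate neighbourhood function would be needed instead of the exhaustive loop shown here.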
Results show that TCL matches the original degree distribution with KS‑test p‑values > 0.9, on par with CL. More importantly, TCL reproduces the average clustering coefficient within 5 % of the real graph, a dramatic improvement over CL (≈ 0.02) and Kronecker (≈ 0.07). ERGM can achieve comparable clustering but at the cost of hours‑long fitting times and memory blow‑ups. Hop plots and average path lengths generated by TCL are virtually indistinguishable from the ground truth, confirming that the transitive step does not distort the small‑world property. In terms of efficiency, learning ρ on a 1 M‑edge graph takes about 2 seconds, and generating a synthetic graph of the same size completes in sub‑second time; scaling linearly, a 10 M‑edge graph is generated in under a second, whereas ERGM would require days.
The authors conclude that TCL offers a practical, theoretically sound, and highly scalable solution for synthetic social network generation that simultaneously preserves degree heterogeneity, short diameters, and realistic clustering. Its simplicity (only one extra parameter and a minor modification to the CL edge‑sampling routine) makes it easy to integrate into existing pipelines. The paper also acknowledges a limitation: a single global ρ cannot capture community‑specific variations in clustering. Suggested future work includes a multi‑parameter version of TCL in which ρ varies per community or per node, and an adaptation of the framework to dynamic graphs whose edges evolve over time. Overall, TCL represents a significant step forward in generating large‑scale, realistic social graphs for simulation, benchmarking, and privacy‑preserving data release.