Revisiting Degree Distribution Models for Social Graph Analysis

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, refer to the original arXiv source.

Degree distribution models are essential tools for analyzing and understanding the structure and formation of social networks, and can help guide the design of efficient graph algorithms. In particular, the power-law degree distribution has long been used to model the structure of online social networks, and is the basis for algorithms and heuristics in graph applications such as influence maximization and social search. Along with recent measurement results, our interest in this topic was sparked by our own experimental results on social graphs, which deviated significantly from those predicted by a power-law model. In this work, we seek a deeper understanding of these deviations, and propose an alternative model with significant implications for graph algorithms and applications. We start by quantifying this deviation using a variety of real social graphs, and show that their structures cannot be accurately modeled using elementary distributions, including the power-law. Instead, we propose the Pareto-Lognormal (PLN) model, verify its goodness-of-fit using graphical and statistical methods, and present an analytical study of its asymptotic differences from the power-law. To demonstrate the quantitative benefits of the PLN model, we compare the results of three wide-ranging graph applications on real social graphs against those on synthetic graphs generated using the PLN and power-law models. We show that synthetic graphs generated using PLN are much better predictors of degree distributions in real graphs, and produce experimental results with errors that are orders of magnitude smaller than those produced by the power-law model.


💡 Research Summary

This paper revisits the long‑standing assumption that online social networks (OSNs) follow a power‑law degree distribution. The authors first demonstrate, through extensive empirical analysis on seven real‑world OSN graphs (including multiple Facebook regional crawls, a Facebook random‑walk sample, and an Orkut graph), that elementary distributions (power‑law, lognormal, exponential) fail to capture both the low‑degree bulk and the high‑degree tail of the data. Visual diagnostics, namely complementary cumulative distribution function (CCDF) plots and Q‑Q plots, reveal that the power‑law fit dramatically overestimates the number of super‑nodes, the exponential fit underestimates them, and the lognormal fit gives mixed results.
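The CCDF diagnostic and a baseline power-law fit can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the exponent estimator is the standard continuous-approximation maximum-likelihood formula, which the summary does not confirm the paper uses.

```python
import math
from collections import Counter


def empirical_ccdf(degrees):
    """Return sorted (degree, P(D >= degree)) pairs for a list of node degrees."""
    n = len(degrees)
    counts = Counter(degrees)
    ccdf = []
    remaining = n  # number of nodes with degree >= current d
    for d in sorted(counts):
        ccdf.append((d, remaining / n))
        remaining -= counts[d]
    return ccdf


def fit_power_law_alpha(degrees, d_min=1):
    """Continuous-approximation MLE of the exponent alpha for degrees >= d_min.

    Assumes a density proportional to d**(-alpha); this is an illustrative
    baseline, not necessarily the fitting procedure used in the paper.
    """
    tail = [d for d in degrees if d >= d_min]
    log_sum = sum(math.log(d / d_min) for d in tail)
    if log_sum == 0:
        raise ValueError("all degrees equal d_min; alpha is not identifiable")
    return 1.0 + len(tail) / log_sum


if __name__ == "__main__":
    toy_degrees = [1, 1, 2, 3]  # toy data, not from any real graph
    print(empirical_ccdf(toy_degrees))
    print(fit_power_law_alpha(toy_degrees))
```

Plotting the empirical CCDF on log-log axes next to the fitted model's CCDF gives the kind of visual comparison described above: a power-law appears as a straight line, so curvature in the empirical points signals a poor fit.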

Motivated by these discrepancies, the authors propose the Pareto‑Lognormal (PLN) distribution as a more faithful model. PLN combines a Pareto component (parameter β) governing the lower‑degree region with a lognormal component (parameters μ and τ) governing the upper tail. The paper derives the probability density function, cumulative distribution function, and maximum‑likelihood estimators for PLN, and fits the model to the seven datasets. On three error metrics (Kolmogorov‑Smirnov distance, mean absolute error, and log‑scale RMSE), PLN consistently outperforms all baseline models, reducing error by one to four orders of magnitude. In particular, PLN accurately predicts the frequency of very high‑degree nodes (e.g., degree > 2000), where the power‑law can be off by a factor of 10⁴.
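The three error metrics can be sketched as follows, assuming the empirical and model distributions have already been evaluated at the same degree values. The exact definitions used in the paper are not given in this summary, so the formulas below (in particular the base-10 logarithm in the log-scale RMSE) are assumptions, and the function names are hypothetical.

```python
import math


def ks_distance(empirical, model):
    """Kolmogorov-Smirnov distance: largest absolute gap between the two curves."""
    return max(abs(e - m) for e, m in zip(empirical, model))


def mean_absolute_error(empirical, model):
    """Average absolute gap between the two curves."""
    return sum(abs(e - m) for e, m in zip(empirical, model)) / len(empirical)


def log_rmse(empirical, model):
    """RMSE computed on log10 values, so relative errors in the tail count.

    Both inputs must be strictly positive.
    """
    squared = [(math.log10(e) - math.log10(m)) ** 2
               for e, m in zip(empirical, model)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Toy CCDF values at three degrees; not taken from the paper.
    emp = [1.0, 0.5, 0.1]
    fit = [1.0, 0.4, 0.01]
    print(ks_distance(emp, fit), mean_absolute_error(emp, fit), log_rmse(emp, fit))
```

The log-scale metric is the one most sensitive to tail errors: a model that is off by a factor of ten at a CCDF value of 0.1 barely moves the KS distance, but contributes a full unit of error on the log10 scale.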

The authors also provide a theoretical asymptotic analysis, showing that while the power‑law maintains a constant tail exponent, PLN’s lognormal tail decays faster than any power law, aligning with the observed sparsity of extreme hubs in real OSNs. Closed‑form bounds for the degree threshold corresponding to any percentile (e.g., top 0.1 %) are derived, and empirical validation confirms that PLN’s predictions are within a few percent of the true values, whereas the power‑law’s predictions can be off by orders of magnitude.
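To make the percentile-threshold idea concrete, the sketch below computes the degree above which a given top fraction of nodes lies, under two textbook distributions: a continuous power law and a plain lognormal. The lognormal is used here only as a stand-in for PLN's upper tail. These are not the paper's closed-form bounds, the function names are hypothetical, and the parameter values are arbitrary rather than fitted.

```python
import math
from statistics import NormalDist


def power_law_threshold(top_fraction, alpha, x_min=1.0):
    """Degree x with P(D >= x) = top_fraction for a continuous power law.

    Assumes density proportional to x**(-alpha) on x >= x_min with alpha > 1,
    so that P(D >= x) = (x / x_min) ** (-(alpha - 1)).
    """
    return x_min * top_fraction ** (-1.0 / (alpha - 1.0))


def lognormal_threshold(top_fraction, mu, tau):
    """Degree x with P(D >= x) = top_fraction for a lognormal(mu, tau)."""
    z = NormalDist().inv_cdf(1.0 - top_fraction)  # standard normal quantile
    return math.exp(mu + tau * z)


if __name__ == "__main__":
    # Arbitrary illustrative parameters, not values from the paper.
    for p in (0.01, 0.001, 0.0001):
        print(p,
              power_law_threshold(p, alpha=2.5),
              lognormal_threshold(p, mu=3.0, tau=1.0))
```

Under the power law, each tenfold reduction in the top fraction multiplies the threshold by the same constant factor, 10^(1/(α−1)). Under the lognormal, that multiplicative factor shrinks as the fraction gets smaller, which is one way to see why a lognormal tail predicts far fewer extreme hubs.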

To assess practical impact, three representative graph applications are evaluated on synthetic graphs generated from both power‑law and PLN models: (1) graph partitioning, where power‑law‑based graphs yield modularity scores up to 30 % worse than real graphs, while PLN‑based graphs are within 5 %; (2) influence maximization using a greedy seed selection algorithm, where power‑law graphs overestimate spread by about 40 % compared to real networks, whereas PLN graphs differ by less than 10 %; and (3) link‑privacy attacks that exploit degree information, where power‑law dramatically inflates attack success rates, while PLN matches the observed success rates closely. These results underscore that the choice of degree distribution model can fundamentally alter algorithmic performance estimates.
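The summary does not say how the synthetic graphs were built from each degree model. One common approach, shown here purely as an assumption, is to sample a degree sequence from the fitted distribution and wire it up with a configuration model (random stub pairing). The sketch uses a plain lognormal sampler as a stand-in for a PLN sampler, and all names and parameters are hypothetical.

```python
import random
from collections import Counter


def sample_lognormal_degrees(n, mu, tau, seed=0):
    """Draw n integer degrees (>= 1) from a lognormal(mu, tau) distribution."""
    rng = random.Random(seed)
    degrees = [max(1, round(rng.lognormvariate(mu, tau))) for _ in range(n)]
    if sum(degrees) % 2 == 1:
        degrees[0] += 1  # stub pairing needs an even total degree
    return degrees


def configuration_model(degrees, seed=0):
    """Pair degree 'stubs' uniformly at random and return the edge list.

    The result may contain self-loops and parallel edges; this sketch
    keeps them rather than rewiring or discarding.
    """
    if sum(degrees) % 2 != 0:
        raise ValueError("degree sum must be even")
    rng = random.Random(seed)
    # Node i appears degrees[i] times in the stub list.
    stubs = [node for node, d in enumerate(degrees) for _ in range(d)]
    rng.shuffle(stubs)
    return list(zip(stubs[0::2], stubs[1::2]))


if __name__ == "__main__":
    degs = sample_lognormal_degrees(1000, mu=2.0, tau=0.8, seed=42)
    edges = configuration_model(degs, seed=42)
    realized = Counter(node for edge in edges for node in edge)
    print(len(edges), max(realized.values()))
```

Swapping the degree sampler (power law versus PLN) while keeping the wiring step fixed isolates the effect of the degree model on downstream results such as partition quality or simulated influence spread.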

Finally, the paper sketches a path toward a generative model that reproduces the PLN degree distribution. By analyzing daily snapshots of a Facebook graph, the authors observe that new nodes often join multiple communities simultaneously, a process that naturally yields a lognormal tail. They suggest integrating this multi‑community attachment mechanism into stochastic block or preferential‑attachment frameworks to produce synthetic graphs whose evolution mirrors real OSNs.

In sum, the study provides compelling evidence that the classic power‑law assumption is inadequate for modern social graphs, introduces the Pareto‑Lognormal distribution as a better‑fitting alternative, validates it both statistically and in downstream applications, and outlines future work on generative modeling that could reshape how researchers simulate and analyze social networks.

