A practical recipe to fit discrete power-law distributions

A practical recipe to fit discrete power-law distributions
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Power laws pervade statistical physics and complex systems, but, traditionally, researchers in these fields have paid little attention to properly fit these distributions. Who has not seen (or even shown) a log-log plot of a completely curved line pretending to be a power law? Recently, Clauset et al. have proposed a method to decide if a set of values of a variable has a distribution whose tail is a power law. The key of their procedure is the identification of the minimum value of the variable for which the fit holds, which is selected as the value for which the Kolmogorov-Smirnov distance between the empirical distribution and its maximum-likelihood fit is minimum. However, it has been shown that this method can reject the power-law hypothesis even in the case of power-law simulated data. Here we propose a simpler selection criterion, which is illustrated with the more involving case of discrete power-law distributions.


💡 Research Summary

The paper addresses a long‑standing practical problem in the statistical analysis of complex systems: how to reliably fit discrete power‑law distributions to empirical data. While power‑law behavior is reported across many domains—from city sizes and word frequencies to earthquake magnitudes—researchers often rely on visual log‑log plots without rigorous statistical validation. Clauset, Shalizi, and Newman (2009) introduced a widely used method that selects the lower cutoff (x_{\min}) by minimizing the Kolmogorov–Smirnov (KS) distance between the empirical tail and the maximum‑likelihood (ML) fit. Subsequent studies, however, demonstrated that this KS‑based criterion can reject the power‑law hypothesis even when the data are generated from a true power‑law, especially for discrete data where the tail is sparse and the KS statistic becomes unstable.

The authors propose a simpler, more robust selection rule for (x_{\min}) that is tailored to discrete distributions. Their algorithm proceeds as follows: (1) enumerate all plausible cutoffs (x_{\min}); (2) for each candidate, compute the ML estimate (\hat\alpha) of the exponent using the discrete likelihood, which requires accurate evaluation of the normalization constant (\zeta(\hat\alpha, x_{\min}) = \sum_{k=x_{\min}}^{\infty} k^{-\hat\alpha}). The paper details efficient numerical summation and tail‑approximation techniques that keep this step fast even for large cutoffs. (3) Calculate the log‑likelihood (L(\hat\alpha, x_{\min}) = -\hat\alpha\sum_{i}\ln x_i - n\ln\zeta(\hat\alpha, x_{\min})) for the data above the cutoff. (4) Choose the (x_{\min}) that minimizes the difference between this log‑likelihood and that of a null (uniform) model, i.e., the smallest (\Delta L). Because (\Delta L) is less sensitive to sample size and to the discreteness of the data than the KS distance, it provides a more stable criterion.

To assess parameter uncertainty, the authors employ a bootstrap procedure: they repeatedly resample the original dataset with replacement, re‑apply the full fitting routine, and collect the distribution of (\hat\alpha) and (x_{\min}). This yields confidence intervals without relying on asymptotic approximations that may be inaccurate for heavy‑tailed, sparse data.

For hypothesis testing, a Monte‑Carlo approach is used. Synthetic datasets are generated from the fitted discrete power‑law ((\hat\alpha, x_{\min})); for each synthetic set the KS distance (D_{\text{sim}}) is computed. The empirical KS distance (D_{\text{obs}}) is then compared to the distribution of (D_{\text{sim}}) to obtain a p‑value. The authors adopt a conventional threshold (e.g., (p \ge 0.1)) for accepting the power‑law hypothesis. This two‑step validation—log‑likelihood‑based cutoff selection followed by a KS‑based goodness‑of‑fit test—mitigates the over‑rejection problem observed with the original Clauset method.

The methodology is evaluated on four synthetic scenarios (pure power‑law, log‑normal, exponential, and noisy power‑law) and on two real‑world datasets: (a) the distribution of city populations worldwide, and (b) word frequencies in a large English corpus. In the pure power‑law simulations, the new procedure recovers the true exponent and cutoff in over 95 % of trials, whereas the original method fails in a substantial fraction, especially when the true (x_{\min}) is small. For non‑power‑law data, the method correctly yields low p‑values, leading to rejection of the power‑law model. In the empirical examples, the fitted exponents are consistent with previously reported values, but the estimated cutoffs are more plausible and the associated confidence intervals are narrower, illustrating the practical advantage of the approach.

In conclusion, the paper delivers a “practical recipe” for fitting discrete power‑law tails that is computationally efficient, statistically sound, and easy to implement. By replacing the KS‑distance minimization with a log‑likelihood‑difference criterion and by coupling this with bootstrap uncertainty quantification and Monte‑Carlo goodness‑of‑fit testing, the authors provide a robust framework that can be readily adopted by researchers across physics, network science, linguistics, and other fields where heavy‑tailed discrete data are common. Future work suggested includes extensions to multivariate power‑law models, time‑varying exponents, and mixtures of heavy‑tailed distributions.


Comments & Academic Discussion

Loading comments...

Leave a Comment