Estimation of global network statistics from incomplete data
Complex networks underlie an enormous variety of social, biological, physical, and virtual systems. A profound complication for the science of complex networks is that in most cases, observing all nodes and all network interactions is impossible. Previous work addressing the impacts of partial network data is surprisingly limited, focuses primarily on missing nodes, and suggests that network statistics derived from subsampled data are not suitable estimators for the same network statistics describing the overall network topology. We generate scaling methods to predict true network statistics, including the degree distribution, from only partial knowledge of nodes, links, or weights. Our methods are transparent and do not assume a known generating process for the network, thus enabling prediction of network statistics for a wide variety of applications. We validate analytical results on four simulated network classes and empirical data sets of various sizes. We perform subsampling experiments by varying proportions of sampled data and demonstrate that our scaling methods can provide very good estimates of true network statistics while acknowledging limits. Lastly, we apply our techniques to a set of rich and evolving large-scale social networks, Twitter reply networks. Based on 100 million tweets, we use our scaling techniques to propose a statistical characterization of the Twitter Interactome from September 2008 to November 2008. Our treatment allows us to find support for Dunbar’s hypothesis in detecting an upper threshold for the number of active social contacts that individuals maintain over the course of one week.
💡 Research Summary
The paper tackles a fundamental obstacle in network science: the inability to observe every node and edge in large‑scale complex systems. While prior work has largely focused on missing nodes and has often concluded that statistics derived from subsampled networks are unreliable proxies for the true network, this study proposes a set of transparent scaling methods that can predict global network statistics—including the full degree distribution—from partial observations.
Four sampling regimes are considered: (1) random node sampling (induced subgraph on a fraction q of nodes), (2) random link failure (all nodes observed but each edge retained with probability q), (3) random link sampling (edges are directly sampled with probability q), and (4) weighted interaction sampling (a fraction q of weighted interactions is observed). For each regime the authors derive the conditional probability that a node of true degree i appears with observed degree k in the sampled network:
Pr(k|i) = C(i,k) q^k (1−q)^{i−k}.
Using this, they express the observed degree distribution \tilde{P}_k as a binomial mixture of the true distribution P_i (Equations 1–3). The core contribution is the inversion formula (Equation 3) that recovers P_k from \tilde{P}_k and the known sampling fraction q:
\hat{P}_k = Σ_{i=k}^{k_max} (−1)^{i−k} C(i,k) (1−q)^{i−k} q^{−i} \tilde{P}_i.
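This differs from earlier work by a factor of q and, while not guaranteeing non‑negativity analytically, proves robust in empirical tests. The authors also provide simple scaling relations for other global metrics, which depend on the sampling type. Writing n, m, and k̄_obs for the observed node count, edge count, and average degree, the relations quoted here are n ≈ q N, m ≈ q^2 M, and k̄_obs ≈ q k̄. These are the scalings expected for an induced subgraph on a fraction q of nodes, so the true values are estimated by dividing each observed value by its factor (for example N̂ = n/q and M̂ = m/q^2). The clustering coefficient and giant component size are reported as approximately preserved (Ĉ ≈ C, Ŝ ≈ S), again depending on the sampling type.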
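The binomial mixture and its inversion can be illustrated with a minimal Python sketch. It is not the authors' code: the function names and the toy distribution `p_true` are hypothetical, and the inversion weights are those obtained by inverting the kernel Pr(k|i) directly, namely (−1)^{i−k} C(i,k) (1−q)^{i−k} q^{−i}. The sketch assumes q is known exactly and that the observed distribution is noise‑free, which real subsamples are not.

```python
from math import comb

def thin_degree_distribution(p_true, q):
    """Forward model: each of a node's i links is kept independently with
    probability q, so the observed distribution is a binomial mixture."""
    k_max = len(p_true) - 1
    return [
        sum(comb(i, k) * q**k * (1 - q)**(i - k) * p_true[i]
            for i in range(k, k_max + 1))
        for k in range(k_max + 1)
    ]

def invert_degree_distribution(p_obs, q):
    """Inverse of the binomial mixture. With noisy input the alternating
    signs can produce negative entries, as the summary notes."""
    k_max = len(p_obs) - 1
    return [
        sum((-1)**(i - k) * comb(i, k) * (1 - q)**(i - k) * q**(-i) * p_obs[i]
            for i in range(k, k_max + 1))
        for k in range(k_max + 1)
    ]

# Hypothetical toy distribution over degrees 0..3 (not data from the paper).
p_true = [0.1, 0.3, 0.4, 0.2]
q = 0.6
p_obs = thin_degree_distribution(p_true, q)
p_hat = invert_degree_distribution(p_obs, q)
```

On an exact mixture the inversion returns the original distribution up to floating‑point error. On an empirical histogram the q^{−i} factor amplifies sampling noise in the tail, which is one plausible reason negative estimates can appear for small q.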
Validation is performed on four synthetic network models (Erdős–Rényi, scale‑free, small‑world, and range‑dependent) and six real‑world networks (Elephants, Airlines, Karate Club, Dolphins, Condensed Matter, Power Grid). For each network, 100 independent subsamples are generated for q ranging from 5 % to 100 % in 5 % increments. Across all cases the scaling formulas accurately recover the original statistics; degree‑distribution reconstruction typically yields mean absolute errors below 5 %.
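A subsampling experiment of this kind can be sketched as follows for random node sampling only. This is an illustrative setup, not the authors' pipeline: the graph size, edge probability, seed, and helper names are assumptions, and only the standard library is used. An induced subgraph keeps each node with probability q and each edge with probability q^2, so observed counts are rescaled by 1/q and 1/q^2.

```python
import random

def erdos_renyi_edges(n, p, rng):
    """Edge list of a G(n, p) random graph on nodes 0..n-1."""
    return [(u, v) for u in range(n) for v in range(u + 1, n)
            if rng.random() < p]

def sample_nodes(n, edges, q, rng):
    """Keep each node with probability q and return the induced subgraph."""
    kept = {u for u in range(n) if rng.random() < q}
    induced = [(u, v) for u, v in edges if u in kept and v in kept]
    return kept, induced

def scale_up(n_obs, m_obs, q):
    """Rescale observed counts to estimates for the full network."""
    n_hat = n_obs / q
    m_hat = m_obs / q**2
    k_hat = (2 * m_obs / n_obs) / q if n_obs else 0.0
    return n_hat, m_hat, k_hat

rng = random.Random(0)          # fixed seed so the run is repeatable
N, p, q = 400, 0.05, 0.5        # hypothetical parameters
edges = erdos_renyi_edges(N, p, rng)
M = len(edges)
k_true = 2 * M / N

# 100 independent subsamples at a single q, mirroring the experiment design.
runs = []
for _ in range(100):
    kept, induced = sample_nodes(N, edges, q, rng)
    runs.append(scale_up(len(kept), len(induced), q))

mean_n_hat = sum(r[0] for r in runs) / len(runs)
mean_m_hat = sum(r[1] for r in runs) / len(runs)
mean_k_hat = sum(r[2] for r in runs) / len(runs)
```

Averaged over the subsamples, the rescaled estimates should sit close to N, M, and the true average degree. Individual runs scatter more, especially the edge estimate, because edges that share a node are kept or dropped together.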
Weighted networks are examined in two experiments. In the first, all edges receive a uniform weight w (values 1–5) and sampling is performed on interactions with weight >0. In the second, edge weights follow either an “equal effort” scheme (node strength equalized) or a uniform integer distribution between 1 and 9. Results show that average strength and average degree scale linearly with the mean weight, confirming that the proposed framework extends to weighted graphs without modification.
The methodology is then applied to a massive Twitter reply network constructed from over 100 million tweets (September 2008–November 2008). Only 25 %–55 % of the total tweet stream was captured, but the known capture rate serves as the sampling fraction q. Using the scaling relations, the authors estimate the full “Twitter Interactome” statistics: total active users, average weekly replies per user, clustering, and component structure. Notably, the estimated distribution of weekly active contacts exhibits an upper bound near 150 contacts, providing empirical support for Dunbar’s hypothesis on cognitive limits to stable social relationships.
In conclusion, the paper demonstrates that, provided the sampling proportion is known, global network statistics can be reliably inferred from incomplete data without assuming any specific generative model. Limitations include the necessity of an accurate q estimate and occasional negative values in the inverted degree distribution, suggesting avenues for future work such as q‑estimation techniques, extensions to dynamic or multilayer networks, and theoretical guarantees of non‑negativity. The work offers a practical toolkit for researchers dealing with massive, partially observed networks across disciplines.