Empirical comparison of network sampling techniques
In recent years, the storage and analysis of large-scale, fast-evolving networks have presented a great challenge, and a number of different techniques have therefore been proposed for sampling large networks. In general, network exploration techniques approximate the original networks more accurately than random node and link selection; yet link selection with an additional subgraph induction step outperforms most other techniques. In this paper, we apply subgraph induction also to random walk and forest-fire sampling. We analyze different real-world networks and the changes in their properties introduced by sampling, and we compare several sampling techniques based on the match between the original networks and their sampled variants. The results reveal that techniques with subgraph induction underestimate the degree and clustering distributions while overestimating the average degree and density of the original networks; techniques without the subgraph induction step exhibit exactly the opposite behavior. Hence, the performance of sampling techniques from the random selection category does not differ significantly from network exploration sampling, while clear differences exist between techniques with the subgraph induction step and those without it.
💡 Research Summary
This paper presents a systematic empirical comparison of eight network sampling techniques applied to ten real‑world graphs, focusing on the impact of subgraph induction (i.e., adding all edges among sampled nodes) on the preservation of structural properties. The techniques are divided into two broad categories: random selection (RNS – random node selection, RND – degree‑biased node selection, RLS – random link selection, RLI – random link selection with induction) and network‑exploration (RWS – random walk, RWI – random walk with induction, FFS – forest‑fire, FFI – forest‑fire with induction). For each method, 100 independent samples of size 15 % of the original graph are generated, and four key metrics are evaluated: (1) degree distribution, (2) clustering‑coefficient distribution, (3) average degree, and (4) graph density.
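The simplest of the techniques above, random node selection, already illustrates the induction idea: pick a fraction of nodes, then keep every original edge whose endpoints were both sampled. The sketch below is illustrative only, not the authors' implementation; the adjacency-dict representation and the function names `random_node_sample` and `induced_subgraph` are assumptions for the example.

```python
import random

def induced_subgraph(adj, nodes):
    """Keep every edge of `adj` whose two endpoints are both sampled."""
    node_set = set(nodes)
    return {u: {v for v in adj[u] if v in node_set} for u in node_set}

def random_node_sample(adj, fraction=0.15, seed=None):
    """RNS sketch: sample a uniform fraction of nodes (15% mirrors the
    sample size used in the paper), then take the induced subgraph."""
    rng = random.Random(seed)
    k = max(1, round(fraction * len(adj)))
    nodes = rng.sample(list(adj), k)
    return induced_subgraph(adj, nodes)
```

A degree-biased variant in the spirit of RND would only change how `nodes` is drawn (weighting by degree); the induction step stays identical.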
Degree and clustering distributions are compared using the Kolmogorov‑Smirnov D‑statistic, while average degree and density are assessed via mean differences and two‑tailed Student‑t tests on externally standardized residuals. The statistical analysis shows that the presence or absence of subgraph induction is the dominant factor influencing sampling accuracy; the distinction between random‑selection and exploration‑based methods is comparatively minor.
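The Kolmogorov-Smirnov D-statistic used for the distribution comparisons is simply the largest gap between two empirical CDFs, here of the original and sampled degree (or clustering) values. A minimal self-contained version, assuming plain Python lists of observed values:

```python
def ecdf(values):
    """Return the empirical CDF of a list of observations."""
    xs = sorted(values)
    n = len(xs)
    return lambda t: sum(1 for x in xs if x <= t) / n

def ks_distance(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D: the maximum absolute difference
    between the two empirical CDFs, evaluated at every observed value."""
    fa, fb = ecdf(sample_a), ecdf(sample_b)
    points = set(sample_a) | set(sample_b)
    return max(abs(fa(t) - fb(t)) for t in points)
```

In practice `scipy.stats.ks_2samp` computes the same statistic more efficiently; the explicit version above just makes the definition visible.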
Techniques that employ subgraph induction (RLI, RWI, FFI, and also the node‑selection methods RNS and RND, which by construction retain every edge among the sampled nodes) consistently produce samples with higher average degree and density than the original networks, because every original edge between already sampled nodes is kept. Consequently, they tend to underestimate the tail of the degree distribution and the clustering‑coefficient distribution (the sampled graphs appear less heterogeneous and less locally clustered). Conversely, non‑induced methods (RLS, RWS, FFS) overestimate the degree and clustering distributions while underestimating average degree and density, reflecting the sparsity of the raw sampled subgraph.
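The density bias of induction is easiest to see for link selection: RLS keeps only the drawn links, while RLI additionally pulls in every other original edge whose endpoints were both reached. A hedged sketch (edge-list representation and the function name are assumptions for illustration, not the paper's code):

```python
import random

def random_link_sample(edges, fraction=0.15, induce=False, seed=None):
    """RLS sketch: draw a fraction of links uniformly at random.
    With induce=True (RLI), also add every original edge whose two
    endpoints were reached by the drawn links -- the induction step
    that can only add edges, never remove them."""
    rng = random.Random(seed)
    k = max(1, round(fraction * len(edges)))
    picked = set(rng.sample(list(edges), k))
    nodes = {u for e in picked for u in e}
    if induce:
        picked |= {e for e in edges if e[0] in nodes and e[1] in nodes}
    return nodes, picked
```

Because the induction step is a pure superset operation on the same node set, the induced sample's edge count, and hence its average degree and density, can only be greater than or equal to the raw sample's, which is exactly the systematic overestimation described above.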
Among the non‑induced methods, random walk sampling (RWS) yields the best match for the degree distribution, likely due to its bias toward high‑degree vertices, which compensates for the lack of added edges. Forest‑fire sampling (FFS) performs the worst overall, generating very sparse samples that miss high‑degree nodes and thus distort both degree and clustering statistics. The induced versions of these exploration methods (RWI, FFI) improve the degree‑distribution match relative to their non‑induced counterparts, but still exhibit the systematic under‑estimation of clustering noted above.
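The degree bias of random walk sampling comes from the walk itself: high-degree vertices are reached more often because more walks pass through them. A minimal sketch of a restarting random walk sampler; the restart probability, the adjacency-dict input, and the function name are illustrative assumptions (the sketch also assumes the start node's component holds at least `n_nodes` vertices, otherwise the loop would not terminate):

```python
import random

def random_walk_sample(adj, n_nodes, restart=0.15, seed=None):
    """RWS sketch: walk from a random start node, jumping back to the
    start with probability `restart`, until `n_nodes` distinct nodes
    have been visited; keep only the traversed edges (no induction)."""
    rng = random.Random(seed)
    start = rng.choice(list(adj))
    visited, edges, cur = {start}, set(), start
    while len(visited) < n_nodes:
        if rng.random() < restart or not adj[cur]:
            cur = start                      # restart the walk
            continue
        nxt = rng.choice(sorted(adj[cur]))   # step to a random neighbor
        edges.add((min(cur, nxt), max(cur, nxt)))
        visited.add(nxt)
        cur = nxt
    return visited, edges
```

Forest-fire sampling differs only in the exploration rule, burning a random number of neighbors from each visited node instead of stepping to one, which is what produces its sparser, more tree-like samples.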
Random‑node selection methods (RNS, RND) achieve accuracy comparable to the induced exploration methods despite often producing disconnected samples, indicating that connectivity per se is not a prerequisite for preserving degree distribution. The degree‑biased node selection (RND) offers modest improvements over plain random node selection (RNS) but does not close the gap to the best exploration‑based techniques.
Statistical testing confirms that differences attributable to subgraph induction are significant at the 0.05 level across all four metrics. This suggests that practitioners must choose sampling strategies based on the specific structural property of interest: if the goal is to estimate global density or average degree, induced methods are preferable; if the aim is to preserve local clustering patterns or avoid inflating degree heterogeneity, non‑induced methods are more suitable.
The study also highlights that the trade‑off between preserving degree distribution versus clustering distribution is inherent to the induction step: adding edges raises node degrees uniformly, shrinking the variance of the degree distribution and diluting local clustering. Conversely, omitting induction retains the original sparsity, leading to higher apparent clustering but lower average degree.
In conclusion, the paper demonstrates that subgraph induction is the primary determinant of sampling performance, outweighing the choice between random‑selection and network‑exploration paradigms. By quantifying how induction systematically biases key network metrics, the authors provide clear guidance for selecting appropriate sampling techniques in large‑scale network analysis, enabling more accurate inference while managing computational and storage constraints.