Network datasets appear across a wide range of scientific fields, including biology, physics, and the social sciences. To enable data-driven discoveries from these networks, statistical inference techniques like estimation and hypothesis testing are crucial. However, the size of modern networks often exceeds the storage and computational capacities of existing methods, making timely, statistically rigorous inference difficult. In this work, we introduce a subsampling-based approach aimed at reducing the computational burden associated with estimation and two-sample hypothesis testing. Our strategy involves selecting a small random subset of nodes from the network, conducting inference on the resulting subgraph, and then using interpolation based on the observed connections between the subsample and the rest of the nodes to estimate the entire graph. We develop the methodology under the generalized random dot product graph framework, which affords broad applicability and permits rigorous analysis. Within this setting, we establish consistency guarantees and corroborate the practical effectiveness of the approach through comprehensive simulation studies.
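The subsample-then-interpolate strategy described above can be illustrated with a rough sketch, restricted for simplicity to the plain RDPG case (a probability matrix with no negative eigenvalues). The function name, the least-squares interpolation step, and all implementation details below are illustrative assumptions, not the exact procedure developed in this paper:

```python
import numpy as np

def subsampled_ase(A, m, d, rng=None):
    """Estimate RDPG latent positions from a node subsample (a sketch).

    Embed the induced subgraph on m randomly chosen nodes, then place
    every remaining node by least squares against its observed links to
    the subsample (a Nystrom-style out-of-sample extension). Only the
    m x m subgraph is eigendecomposed, so the cost is O(m^3) rather
    than O(n^3).
    """
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    S = rng.choice(n, size=m, replace=False)
    rest = np.setdiff1d(np.arange(n), S)

    # Spectral embedding of the small subgraph only.
    vals, vecs = np.linalg.eigh(A[np.ix_(S, S)])
    top = np.argsort(np.abs(vals))[::-1][:d]      # d largest |eigenvalues|
    X_S = vecs[:, top] * np.sqrt(np.abs(vals[top]))

    # Interpolate the remaining nodes from their edges into the subsample:
    # solve A[i, S] ~ x_i X_S^T for all i outside S in one call.
    X_rest, *_ = np.linalg.lstsq(X_S, A[np.ix_(rest, S)].T, rcond=None)

    X_hat = np.empty((n, d))
    X_hat[S], X_hat[rest] = X_S, X_rest.T
    return X_hat
```

Applied to a noiseless rank-d probability matrix, this sketch recovers the latent positions exactly up to rotation; on a sampled adjacency matrix it yields the kind of approximate estimate whose consistency the paper studies.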
In recent years, the use of network datasets has grown substantially, as they arise in a wide variety of applications across numerous scientific fields. These datasets typically represent a collection of entities (referred to as nodes) linked by pairwise relationships (known as edges). Examples of such data range from social networks (Milgram et al., 1967) and online communities (Adamic and Glance, 2005) to protein interactions (Kuzmanov and Emili, 2013), power grids (Pagani and Aiello, 2013), epidemic modelling (Venkatramanan et al., 2021), biomedical texts (Komolafe et al., 2021), and complex brain connectivity networks (Roncal et al., 2013; Bassett and Bullmore, 2017). Owing to the inherently complex structure of such data, statistical inference tools such as estimation and hypothesis testing have become essential for extracting meaningful insights.
The study of random graphs (also known as networks) began with the Erdős–Rényi model, introduced independently by Erdős and Rényi (1959) and Gilbert (1959). One of the most extensively studied models is the stochastic block model (SBM) (Holland et al., 1983; Fienberg et al., 1985), which was later extended to the degree-corrected block model (DCBM) (Karrer and Newman, 2011) and the popularity-adjusted block model (Sengupta and Chen, 2018). More recently, the latent position model known as the random dot product graph (RDPG) (Young and Scheinerman, 2007) has emerged as one of the most flexible and interpretable frameworks. Comprehensive reviews of existing methodologies can be found in Kolaczyk and Csárdi (2014), Goldenberg et al. (2010), and Newman (2018). In this paper, we focus on the generalized random dot product graph (GRDPG) model, introduced by Rubin-Delanchy et al. (2022), which adds flexibility to the RDPG by relaxing the positive semi-definiteness assumption on the underlying probability matrix; this model underpins the methodology and theoretical developments presented here.
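To make the GRDPG concrete: given latent positions collected in a matrix X, edge probabilities are P = X I_{p,q} X^T, where I_{p,q} is diagonal with p entries equal to +1 and q entries equal to -1 (q = 0 recovers the ordinary RDPG). A minimal sampler, with function and variable names chosen purely for illustration, might look as follows:

```python
import numpy as np

def sample_grdpg(X, p_dim, q_dim, rng=None):
    """Sample an undirected, loop-free graph from a GRDPG.

    Edge probabilities are P = X I_{p,q} X^T; the latent positions X
    must be chosen so that every entry of P lies in [0, 1].
    """
    rng = np.random.default_rng(rng)
    Ipq = np.diag(np.r_[np.ones(p_dim), -np.ones(q_dim)])
    P = X @ Ipq @ X.T
    assert P.min() >= 0 and P.max() <= 1
    upper = rng.random((len(X), len(X))) < P      # independent Bernoulli draws
    A = np.triu(upper, 1)                         # keep the upper triangle
    return (A + A.T).astype(int)                  # symmetrize, no self-loops

# toy example: n = 200 nodes, 2-dimensional latent positions, q = 0 (RDPG)
rng = np.random.default_rng(0)
X = rng.uniform(0.2, 0.6, size=(200, 2))
A = sample_grdpg(X, p_dim=2, q_dim=0, rng=rng)
```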
However, for large-scale networks, such as those containing millions of nodes, inferential tasks often face severe scalability challenges, as the procedures proposed in the existing literature encounter substantial computational limitations. In the context of estimating the random dot product graph (RDPG) model, for instance, Sussman et al. (2012) introduced the adjacency spectral embedding (ASE) method, which estimates the latent positions through an eigendecomposition of the adjacency matrix. While effective, this procedure can be prohibitively expensive when applied to large networks. A similar challenge exists for two-sample hypothesis testing, where the goal is to determine whether two networks on a common vertex set share the same edge probability matrix. Under the RDPG model, Tang et al. (2017a,b) proposed hypothesis tests based on ASE, Bhadra et al. (2025a) developed a bootstrap-based procedure for two-sample testing under various network models, and Levin et al. (2017) proposed a testing procedure based on the omnibus embedding. These methods involve estimation as an inherent step and therefore face the same computational bottleneck discussed above; moreover, the procedures of Tang et al. (2017a) and Bhadra et al. (2025a) require repeated execution of that step, making them even harder to scale. These challenges motivate the development of scalable, computationally efficient techniques for both estimation and hypothesis testing.
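A minimal sketch of the ASE idea makes the bottleneck visible: the embedding takes the d eigenvalues of A largest in magnitude and scales the corresponding eigenvectors, giving X_hat = U |Λ|^{1/2}, but the full eigendecomposition costs O(n^3) time and O(n^2) memory. The helper below is an illustrative implementation, not the authors' code:

```python
import numpy as np

def adjacency_spectral_embedding(A, d):
    """d-dimensional spectral embedding of a symmetric adjacency matrix.

    Keeps the d eigenvalues of A largest in absolute value and returns
    X_hat = U |Lambda|^{1/2}; under a (G)RDPG, X_hat estimates the latent
    positions up to an (indefinite-)orthogonal transformation. The full
    eigendecomposition is the O(n^3) step that dominates for large n.
    """
    vals, vecs = np.linalg.eigh(A)                 # A assumed symmetric
    top = np.argsort(np.abs(vals))[::-1][:d]       # d largest |eigenvalues|
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))
```

In practice, truncated solvers (e.g. `scipy.sparse.linalg.eigsh`) reduce the cost for sparse graphs, but the dependence on the full node set remains, which is what the subsampling approach of this paper sidesteps.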
Several well-established statistical strategies have been developed to address the computational challenges associated with large-scale data, including subsampling, which draws inference from a small subset of the observed data (Politis et al., 1999; Ma et al., 2015); sketching, which constructs a low-dimensional projection of the data (Mahoney et al., 2011; Raskutti and Mahoney, 2016); and the classical divide-and-conquer approach, which partitions the dataset into subsets, analyzes them separately, and then aggregates the results into a global inference. However, implementing these strategies for network data presents several difficulties. First, network models such as the GRDPG exhibit invariance properties that lead to non-identifiability of model parameters, making it challenging to aggregate inference results from disjoint subsets and thereby limiting the applicability of divide-and-conquer methods. Second, the presence of node-specific parameters in many models renders naive subsampling ineffective, since every node must be included in the estimation process. These issues also make the application of sketching model-specific: for example, Zhang et al. (2022) proposed such a method only for stochastic block models. For RDPG models, Chakraborty et al. (2025) proposed a modified divide-and-conquer method, which requires the partitioned subgraphs to share a common overlap.