Post-selection inference with a single realization of a network

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Given a dataset consisting of a single realization of a network, we consider conducting inference on a parameter selected from the data. In particular, we focus on the setting where the parameter of interest is a linear combination of the mean connectivities within and between estimated communities. Inference in this setting poses a challenge, since the communities are themselves estimated from the data. Furthermore, since only a single realization of the network is available, sample splitting is not possible. In this paper, we show that it is possible to split a single realization of a network consisting of $n$ nodes into two (or more) networks involving the same $n$ nodes; the first network can be used to select a data-driven parameter, and the second to conduct inference on that parameter. In the case of weighted networks with Poisson or Gaussian edges, we obtain two independent realizations of the network; by contrast, in the case of Bernoulli edges, the two realizations are dependent, and so extra care is required. We establish the theoretical properties of our estimators, in the sense of confidence intervals that attain the nominal (selective) coverage, and demonstrate their utility in numerical simulations and in application to a dataset representing the relationships among dolphins in Doubtful Sound, New Zealand.


💡 Research Summary

The paper tackles a fundamental problem in network analysis: how to perform valid statistical inference on a parameter that is chosen based on the same single observed network. Traditional sample‑splitting methods require multiple independent realizations, which are unavailable when only one network is observed. The authors propose a novel “splitting” strategy that creates two subnetworks—training (A_tr) and testing (A_te)—while preserving the same set of nodes.

For Gaussian and Poisson edge models, they employ a thinning procedure governed by a parameter ε∈(0,1): each observed edge value is randomly split into two parts (e.g., by binomial thinning in the Poisson case). This yields two independent adjacency matrices, A_tr and A_te, with means εM and (1−ε)M respectively, where M denotes the unknown edge‑wise mean matrix. The Fisher information about M is divided between the two matrices in proportion to ε, so no information is lost in total.
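For the Poisson case, the thinning step above can be sketched as follows (a minimal illustration; the function name, the symmetric zero-diagonal setup, and ε=0.5 are our choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson_thin(A, eps=0.5, rng=rng):
    # Split a symmetric Poisson-edge adjacency matrix A into two independent
    # matrices with means eps*M and (1-eps)*M via binomial thinning:
    # A_tr[ij] | A[ij] ~ Binomial(A[ij], eps), and A_te = A - A_tr.
    n = A.shape[0]
    iu = np.triu_indices(n, k=1)                # thin each edge only once
    A_tr = np.zeros_like(A)
    A_tr[iu] = rng.binomial(A[iu].astype(int), eps)
    A_tr = A_tr + A_tr.T                        # symmetrize
    A_te = A - A_tr
    return A_tr, A_te

# Example: Poisson edges with common mean 3, zero diagonal
upper = np.triu(rng.poisson(3.0, size=(6, 6)), k=1)
A = upper + upper.T
A_tr, A_te = poisson_thin(A, eps=0.5)
```

By construction the two pieces add back up to the original network, and standard Poisson-thinning arguments give their independence.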

For Bernoulli edges, independent thinning is not possible. Instead, the authors introduce a “fission” operation: each edge is toggled independently with probability γ∈(0,0.5) to form A_tr, while the original adjacency matrix is retained as A_te. The two matrices are dependent, but the conditional distribution of A_te given A_tr is known in closed form; inference therefore treats A_tr as the training set and conditions on it when analyzing A_te.
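The toggling step can be sketched as below (an illustrative implementation of the fission operation as described in this summary; the function name and γ=0.2 are ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def bernoulli_fission(A, gamma=0.2, rng=rng):
    # Form A_tr by independently flipping each upper-triangular edge of the
    # binary matrix A with probability gamma; A_te is A itself. Each entry of
    # A_tr agrees with A with probability 1 - gamma by construction, which is
    # what makes the conditional law of A_te given A_tr tractable.
    n = A.shape[0]
    iu = np.triu_indices(n, k=1)
    flips = rng.random(iu[0].size) < gamma
    A_tr = np.zeros_like(A)
    A_tr[iu] = np.where(flips, 1 - A[iu], A[iu])
    A_tr = A_tr + A_tr.T                        # symmetrize
    return A_tr, A.copy()

# Example: a symmetric binary network with zero diagonal
upper = np.triu(rng.integers(0, 2, size=(6, 6)), k=1)
A = upper + upper.T
A_tr, A_te = bernoulli_fission(A, gamma=0.2)
```

Unlike the Poisson/Gaussian case, A_tr and A_te here share randomness, so any downstream inference must condition on A_tr rather than treat the two as independent.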

The selected parameter θ is a linear combination of mean connectivities within and between estimated communities. Communities are estimated from A_tr using any clustering algorithm (e.g., spectral clustering), producing a membership matrix \hat Z_tr. From \hat Z_tr and the mean matrix M they construct a K×K matrix B whose (k,ℓ) entry is the average expected edge weight between communities k and ℓ. The parameter of interest is θ = u^⊤ vec(B), where u is a user‑specified weight vector with ‖u‖₂ = 1 that may itself depend on the training data. Examples include the within‑community mean of the first community, or the difference between two communities' means.
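The empirical analogue of B and θ can be computed as sketched below (our illustration: in practice M is unknown, so the block averages would be taken over A_te; here we average a given weighted matrix, and the example labels, matrix, and choice of u are hypothetical):

```python
import numpy as np

def community_means(A, labels, K):
    # K x K matrix of average edge weights between (and within) the
    # estimated communities; within-community blocks exclude the diagonal.
    B = np.zeros((K, K))
    for k in range(K):
        for l in range(K):
            rows = np.where(labels == k)[0]
            cols = np.where(labels == l)[0]
            block = A[np.ix_(rows, cols)]
            if k == l:
                m = rows.size
                B[k, l] = (block.sum() - np.trace(block)) / max(m * (m - 1), 1)
            else:
                B[k, l] = block.mean() if block.size else 0.0
    return B

# theta = u^T vec(B): here, the difference between the two
# within-community means, with u normalized to unit length
labels = np.array([0, 0, 1, 1])
A = np.array([[0., 2., 1., 1.],
              [2., 0., 1., 1.],
              [1., 1., 0., 2.],
              [1., 1., 2., 0.]])
B = community_means(A, labels, K=2)
u = np.array([1.0, 0.0, 0.0, -1.0]) / np.sqrt(2)  # picks out B[0,0] - B[1,1]
theta = u @ B.ravel()
```

Because u only needs unit norm, the same machinery covers single block means, contrasts between blocks, or any other linear summary of B.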

Because θ depends on the data, the authors adopt the selective inference framework (Fithian et al., 2017). They derive confidence intervals C_α(A_te; A_tr) that satisfy conditional coverage P(θ∈C_α | A_tr) ≥ 1‑α for any selection rule, including data‑dependent u. The construction uses the known conditional distribution of A_te given A_tr to estimate the variance of θ̂ and to form a pivot whose intervals attain (asymptotically) exact selective coverage.
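As a rough illustration of the final step only, the sketch below forms a plug-in Wald-type interval under an independent-Poisson-edges assumption. This is our simplification: the paper's actual intervals use the conditional distribution of A_te given A_tr, and the numbers below are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def wald_ci(theta_hat, var_hat, alpha=0.05):
    # Generic plug-in Wald interval; stands in for the paper's selective
    # interval, which instead pivots on the law of A_te given A_tr.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * np.sqrt(var_hat)
    return theta_hat - half, theta_hat + half

# For Poisson edges (mean = variance), the variance of a block mean is
# roughly (block mean) / (number of edge values averaged):
theta_hat = 2.4            # e.g. an estimated within-community mean (made up)
var_hat = theta_hat / 120  # 120 edge values in the block (hypothetical)
lo, hi = wald_ci(theta_hat, var_hat)
```

The selective guarantee in the paper comes from conditioning on A_tr, not from the normal approximation alone; this sketch only shows how an estimated variance turns into an interval.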

Theoretical contributions include: (1) proofs that thinning and fission preserve the total Fisher information about M, and (2) guarantees that the selective confidence intervals achieve the nominal coverage despite the data‑driven selection.

Simulation studies vary network size, number of communities, edge distributions, and signal strength. The proposed method consistently attains the nominal 95% coverage, outperforming alternatives that split nodes, hold out edges, or rely on matrix completion. In the Bernoulli case, the conditional approach still provides accurate intervals, whereas naïve methods fail.

An empirical application uses the well‑known dolphin social network from Doubtful Sound, New Zealand. After estimating three communities from the training split, the authors compute a confidence interval for the difference in average interaction strength between the first two communities. The interval excludes zero, indicating a statistically significant difference that aligns with ecological interpretations.

In summary, the paper delivers a general, distribution‑agnostic framework for post‑selection inference with a single network realization. By converting one network into two statistically tractable pieces, it enables researchers to select parameters (e.g., community‑based connectivity measures) and then obtain valid, selective confidence intervals. The methodology is applicable to Gaussian, Poisson, and Bernoulli edge models, and the authors suggest extensions to dependent edges, dynamic networks, and multilayer structures as promising future directions.

