Generative modeling for the bootstrap

Reading time: 6 minutes

📝 Original Info

  • Title: Generative modeling for the bootstrap
  • ArXiv ID: 2602.17052
  • Date: 2026-02-19
  • Authors: Author information was not provided in the paper.

📝 Abstract

Generative modeling builds on and substantially advances the classical idea of simulating synthetic data from observed samples. This paper shows that this principle is not only natural but also theoretically well-founded for bootstrap inference: it yields statistically valid confidence intervals that apply simultaneously to both regular and irregular estimators, including settings in which Efron's bootstrap fails. In this sense, the generative modeling-based bootstrap can be viewed as a modern version of the smoothed bootstrap: it could mitigate the curse of dimensionality and remain effective in challenging regimes where estimators may lack root-$n$ consistency or a Gaussian limit.

📄 Full Content

Simulating synthetic data from observed samples is by no means a new idea. In statistics, this principle has been proposed repeatedly in support of various data-analytic tasks. Its roots can be traced at least to the work of Scott et al. (1954) and Neyman and Scott (1956), who employed synthetic sampling for model checking and for detecting unsuspected patterns. Later, Efron (1979) introduced the bootstrap, which relies on repeated sampling from the empirical distribution function for statistical inference. Rubin (1993) and Little (1993) further explored the idea in the context of privacy protection, advocating the release of fully synthetic datasets and proposing to use multiple imputation (Rubin, 1987), which generates new data by sampling from a Bayesian posterior distribution.

From a different corner of the scientific landscape, machine learning, originally centered on prediction (Breiman, 2001), has undergone tremendous advances, particularly with the rise of deep learning. Against this backdrop, the seminal contributions of Kingma and Welling (2013) and Goodfellow et al. (2014), followed by Chen et al. (2018), Song et al. (2020), and many others, have driven rapid progress in generative modeling.

2 Generative modeling-based bootstrap

Consider random vectors $Z, Z_1, Z_2, \ldots \in \mathcal{Z} \subset \mathbb{R}^p$ sampled independently from some unknown data distribution $P_Z$ with an unknown support $\mathcal{Z}$. In this paper, the support of the distribution $P_Z$ of $Z$ refers to the smallest closed set $\mathcal{Z} \subseteq \mathbb{R}^p$ such that $\mathbb{P}(Z \in \mathcal{Z}) = 1$. A common statistical task is to estimate and infer an estimand $\theta_0 = \theta_0(P_Z)$ using an estimator $\theta_n = \theta_n(Z_1, \ldots, Z_n)$, which is a function of the data $\{Z_i : i \in [n]\}$ of size $n$, where $[n] := \{1, 2, \ldots, n\}$.

Unlike estimation, inference requires a deeper understanding of the stochastic behavior of $\theta_n$: in particular, its (limiting) distribution. To approximate the distribution of $\theta_n$, bootstrap methods are widely used and typically proceed in two steps:

Step 1: For each bootstrap iteration, resample $n$ synthetic observations $\widetilde{Z}_1, \ldots, \widetilde{Z}_n \in \mathcal{Z}_n$ from a (random) distribution $P_{Z,n}$, with support $\mathcal{Z}_n$, that is learned from the data and intended to approximate $P_Z$.

Step 2: Use the conditional distribution of $\theta_n(\widetilde{Z}_1, \ldots, \widetilde{Z}_n)$ given the original sample to approximate the sampling distribution of $\theta_n = \theta_n(Z_1, \ldots, Z_n)$.

Different bootstrap procedures arise from different choices of $P_{Z,n}$. Choosing $P_{Z,n}$ to be the empirical measure of $\{Z_i\}_{i \in [n]}$ corresponds to the original proposal of Efron (1979) and remains the most widely used form of bootstrap.
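To make the two steps concrete, here is a minimal Python sketch of Efron's bootstrap with a percentile confidence interval. The estimator (the sample median), the sample size, and the number of bootstrap replications are illustrative choices, not taken from the paper.

```python
# Minimal sketch of Efron's nonparametric bootstrap (Steps 1 and 2 above),
# using the sample median as theta_n and a percentile interval.
import numpy as np

rng = np.random.default_rng(0)

def theta_n(z):
    """Example estimator: the sample median."""
    return np.median(z)

# Observed data Z_1, ..., Z_n drawn from an unknown distribution P_Z.
z = rng.normal(loc=1.0, scale=2.0, size=200)

# Step 1: resample n observations from the empirical measure.
# Step 2: use the distribution of theta_n over the resamples.
n_boot = 2000
boot_stats = np.array([
    theta_n(rng.choice(z, size=z.size, replace=True)) for _ in range(n_boot)
])

# Percentile confidence interval at level 95%.
lo, hi = np.quantile(boot_stats, [0.025, 0.975])
print(f"estimate {theta_n(z):.3f}, 95% percentile CI [{lo:.3f}, {hi:.3f}]")
```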

Adopting the generative modeling philosophy, we introduce a new class of choices for $P_{Z,n}$ by incorporating additional randomness. Let $U, U_1, U_2, \ldots$ and $\widetilde{U}_1, \widetilde{U}_2, \ldots \in \mathcal{U} \subset \mathbb{R}^p$ be random vectors sampled independently from a known distribution $P_U$, with support $\mathcal{U}$, and independent of the data. One may regard the $U_i$'s and $\widetilde{U}_i$'s as noise and $P_U$ as the corresponding noise distribution. A broad class of generative models approximates the data distribution $P_Z$ by learning a generator

$\widehat{G}_n : \mathcal{U} \to \mathbb{R}^p$   (2.1)

from either the paired observations $\{(Z_i, U_i)\}_{i \in [n]}$ or from $\{Z_i\}_{i \in [n]}$ alone. The goal of this learning process is to ensure that the pushforward distribution $\widehat{G}_n \# P_U$ is close, in some predefined metric, to the true data distribution $P_Z$. The sample $\{\widehat{G}_n(\widetilde{U}_i)\}_{i \in [n]}$ then constitutes a size-$n$ synthetic dataset, created from noise.

Because $\widehat{G}_n \# P_U$ is intended to approximate $P_Z$, it is natural to introduce a new class of bootstrap procedures by setting

$P_{Z,n} = \widehat{G}_n \# P_U$

and using the conditional distribution of

$\theta_n\big(\widehat{G}_n(\widetilde{U}_1), \ldots, \widehat{G}_n(\widetilde{U}_n)\big)$

to approximate the sampling distribution of $\theta_n(Z_1, \ldots, Z_n)$. In this paper, we refer to such procedures as generative modeling-based bootstraps.
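As a toy illustration of this recipe (not from the paper), the sketch below uses a deliberately simple moment-matched Gaussian pushforward as the learned generator $\widehat{G}_n$; in practice $\widehat{G}_n$ would be a deep generative model, but the resampling loop, which draws fresh noise and pushes it through the fitted generator, is the same.

```python
# Sketch of a generative modeling-based bootstrap: "learn" a generator G_n,
# then bootstrap by pushing fresh noise through it. The Gaussian generator and
# the eigenvalue estimator are illustrative stand-ins, not the paper's method.
import numpy as np

rng = np.random.default_rng(1)

def theta_n(z):
    """Example estimator: largest eigenvalue of the sample covariance."""
    return np.linalg.eigvalsh(np.cov(z, rowvar=False)).max()

# Observed data Z_1, ..., Z_n in R^p.
n, p = 300, 3
z = rng.multivariate_normal(np.zeros(p), np.diag([3.0, 1.0, 0.5]), size=n)

# "Learn" a generator from the data: here, crude moment matching.
mu = z.mean(axis=0)
chol = np.linalg.cholesky(np.cov(z, rowvar=False))

def G_n(u):
    """Generator: pushforward of N(0, I_p) noise via fitted mean and Cholesky factor."""
    return mu + u @ chol.T

# Bootstrap: draw fresh noise U_tilde, push it through G_n, recompute theta_n.
n_boot = 1000
boot_stats = np.array([
    theta_n(G_n(rng.standard_normal((n, p)))) for _ in range(n_boot)
])

lo, hi = np.quantile(boot_stats, [0.025, 0.975])
print(f"estimate {theta_n(z):.3f}, 95% interval [{lo:.3f}, {hi:.3f}]")
```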

Different generative models correspond to different choices of $\widehat{G}_n$ in (2.1). To introduce the generative models of interest, we begin with some additional notation. For any vector, let $\dim(\cdot)$ denote its dimension, and let $\|\cdot\|_2$ and $\|\cdot\|_\infty$ denote its $\ell_2$ and $\ell_\infty$ norms, respectively. Whenever "$\le$" is used to compare two vectors, the comparison is done componentwise. For any (not necessarily square) matrix $A$, let $\|A\|_{\mathrm{op}}$ denote its spectral norm and $\|A\|_{\max}$ denote the maximum absolute value among its entries. For a square matrix, let $\det(\cdot)$ denote its determinant. Throughout the manuscript, the symbols "$\vee$" and "$\wedge$" denote the maximum and minimum, respectively, of two quantities.

We first introduce the function class of neural networks.

Definition 2.1 (Neural networks). A neural network function class, denoted by $\mathcal{F}_\alpha(L, W, B, q_1, q_2)$, consists of all neural networks with depth $L$, width bound $W$, magnitude bound $B$, input dimension $q_1$, output dimension $q_2$, and activation function $\alpha(\cdot): \mathbb{R} \to \mathbb{R}$.

A function $f \in \mathcal{F}_\alpha(L, W, B, q_1, q_2)$ is a mapping $f: \mathbb{R}^{q_1} \to \mathbb{R}^{q_2}$ defined recursively by $f(x) = x^{(L)}$, where $x^{(0)} = x$ and

$x^{(\ell)} = \alpha\big(A^{(\ell)} x^{(\ell-1)} + b^{(\ell)}\big), \quad \ell \in [L],$

with $\alpha(\cdot)$ applied componentwise, and with weight matrices $A^{(\ell)}$ and bias vectors $b^{(\ell)}$ whose dimensions are consistent with the input dimension $q_1$, the output dimension $q_2$, and the width bound $W$, and whose entries are bounded in absolute value by $B$.
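A small Python sketch of the recursion in Definition 2.1 follows; the depth, widths, magnitude bound, and the ReLU activation are illustrative assumptions rather than choices made in the paper.

```python
# Sketch of the recursion in Definition 2.1: x^(0) = x and
# x^(l) = alpha(A^(l) x^(l-1) + b^(l)) for l = 1, ..., L, alpha componentwise.
import numpy as np

def relu(x):
    """A common choice of activation alpha."""
    return np.maximum(x, 0.0)

def neural_net(x, weights, biases, alpha=relu):
    """Evaluate f(x) = x^(L) for given layer matrices A^(l) and vectors b^(l)."""
    out = x
    for A, b in zip(weights, biases):
        out = alpha(A @ out + b)   # x^(l) = alpha(A^(l) x^(l-1) + b^(l))
    return out

# Illustrative network: depth L = 3, input dimension q1 = 4, output dimension
# q2 = 2, width bound W = 8, magnitude bound B = 1 (entries clipped to [-B, B]).
rng = np.random.default_rng(2)
dims = [4, 8, 8, 2]
weights = [np.clip(rng.standard_normal((dims[l + 1], dims[l])), -1, 1)
           for l in range(3)]
biases = [np.clip(rng.standard_normal(dims[l + 1]), -1, 1) for l in range(3)]

print(neural_net(rng.standard_normal(4), weights, biases))
```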

Reference

This content is AI-processed based on open access ArXiv data.
