M-estimation under Two-Phase Multiwave Sampling with Applications to Prediction-Powered Inference

📝 Original Info

  • Title: M-estimation under Two-Phase Multiwave Sampling with Applications to Prediction-Powered Inference
  • ArXiv ID: 2602.16933
  • Date: 2026-02-18
  • Authors: Dan M. Kluger, Stephen Bates

📝 Abstract

In two-phase multiwave sampling, inexpensive measurements are collected on a large sample and expensive, more informative measurements are adaptively obtained on subsets of units across multiple waves. Adaptively collecting the expensive measurements can increase efficiency but complicates statistical inference. We give valid estimators and confidence intervals for M-estimation under adaptive two-phase multiwave sampling. We focus on the case where proxies for the expensive variables -- such as predictions from pretrained machine learning models -- are available for all units and propose a Multiwave Predict-Then-Debias estimator that combines proxy information with the expensive, higher-quality measurements to improve efficiency while removing bias. We establish asymptotic linearity and normality and propose asymptotically valid confidence intervals. We also develop an approximately greedy sampling strategy that improves efficiency relative to uniform sampling. Data-based simulation studies support the theoretical results and demonstrate efficiency gains.

📄 Full Content

With recent advances in machine learning and artificial intelligence, researchers are increasingly assembling and analyzing large datasets in which some variables are algorithmic outputs rather than direct measurements of the quantity of interest. For example, a study may use a predicted protein structure from a protein language model rather than a structure measured with crystallography, because the latter is costly and time-intensive. In this situation, naively applying traditional methods for statistical analysis will result in biased estimators and invalid confidence intervals. Nonetheless, there is an emerging toolkit of statistical methods that work with such data, provided the analyst has access to a small number of gold standard direct measurements to complement the algorithmic predictions (e.g., Angelopoulos et al., 2023a; Song et al., 2026).
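
To make the idea concrete, the following is a minimal sketch of the simplest case: estimating a mean when a proxy is available for every unit and gold standard values are available on a uniformly sampled subset. It is an illustration of the general debiasing idea, not the multiwave estimator developed in this paper. The function name `debiased_mean`, its arguments, and the simulated data are assumptions made for this example.

```python
import random
from statistics import fmean


def debiased_mean(proxies, labeled_idx, gold):
    """Proxy mean over all units plus the average proxy error on labeled units.

    proxies: proxy value for every unit.
    labeled_idx: indices of units with a gold standard measurement
        (assumed here to be a uniform random subset, not an adaptive one).
    gold: dict mapping each labeled index to its gold standard value.
    """
    proxy_mean = fmean(proxies)
    # Average of (gold - proxy) estimates the bias of the proxy mean.
    correction = fmean(gold[i] - proxies[i] for i in labeled_idx)
    return proxy_mean + correction


if __name__ == "__main__":
    rng = random.Random(0)
    n = 10_000
    truth = [rng.gauss(1.0, 1.0) for _ in range(n)]
    # Simulated proxies with a systematic offset, standing in for model predictions.
    proxies = [y + 0.5 + rng.gauss(0.0, 0.3) for y in truth]
    labeled_idx = rng.sample(range(n), 500)
    gold = {i: truth[i] for i in labeled_idx}
    print("naive proxy mean:", fmean(proxies))
    print("debiased mean:   ", debiased_mean(proxies, labeled_idx, gold))
    print("full-data mean:  ", fmean(truth))
```

The naive proxy mean inherits the systematic offset of the proxies, while the correction term estimates that offset from the labeled subset and subtracts it out.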

In this paper, we consider the version of this problem where the researcher can adaptively collect such gold standard measurements. This has the promise of increasing sample efficiency, resulting in narrower confidence intervals for the parameter of interest with the same amount of data. A major technical challenge is that adaptive sampling schemes can introduce statistical dependencies across the samples, rendering it substantially more challenging to conduct valid statistical inference. In this work, we introduce a new estimator and confidence intervals for the adaptive setting and prove their validity.

Our work can be viewed as an instance of two-phase multiwave sampling (McIsaac and Cook, 2015; Chen and Lumley, 2020, 2022) in which expensive variables are adaptively collected across multiple measurement waves. While the literature on two-phase multiwave sampling also studies practical sampling strategies and estimators in the regime we study, to our knowledge this literature has not established asymptotic normality of M-estimators using theory that accounts for statistical dependencies induced by the proposed sampling strategies. Moreover, much of the work in this literature assumes stratified sampling from pre-specified strata. In this paper, we consider more flexible sampling strategies that do not require stratification.

Simultaneously, our work can also be viewed as a part of the broader adaptive sampling and experimental design literature, and it is closely related to recent research on Active Statistical Inference (Zrnic and Candes, 2024). Given the difficulty of constructing asymptotically normal estimators in adaptive sampling settings, most studies restrict their attention to one of two simpler adaptive sampling regimes. The first is a data splitting regime, in which the optimal sampling rule is estimated on an independent pilot dataset and inference is conducted on the remaining data using standard asymptotic theory for i.i.d. samples. This leads to validity, but loses power since the pilot sample is discarded. The second sampling regime is online sampling: the data is observed in a sequence and the decision of whether to measure a data point must be made once and for all based on data collected up to that point. This sampling scheme allows for the use of martingale techniques to construct confidence intervals. However, a major limitation of the online regime is that it does not allow revisiting earlier samples if they are not measured. In contrast, in our work, if some particularly valuable data points were not measured in early waves, they are likely to still be measured in later waves after a better estimate of the optimal sampling strategy is obtained. We discuss related work in detail in Section 6.
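
The data splitting regime described above can be sketched as follows for the special case of a mean. This is a hedged illustration of that baseline regime only, not of the multiwave procedure proposed in this paper. The helper names, the group labels, and the rule of sampling in proportion to the pilot root-mean-square proxy error are all hypothetical choices made for this example.

```python
import math
import random
from statistics import fmean


def rms_error_by_group(groups, proxies, gold, pilot_idx):
    """Root-mean-square proxy error per group, estimated on labeled pilot units."""
    sq = {}
    for i in pilot_idx:
        sq.setdefault(groups[i], []).append((gold[i] - proxies[i]) ** 2)
    return {g: math.sqrt(fmean(v)) for g, v in sq.items()}


def sampling_probs(groups, idx, rms, budget_fraction, floor=0.02):
    """Hypothetical rule: probability proportional to the group's pilot RMS error."""
    raw = {i: rms.get(groups[i], 1.0) for i in idx}
    total = sum(raw.values())
    if total == 0.0:  # proxies looked exact on the pilot: fall back to uniform
        return {i: budget_fraction for i in idx}
    scale = budget_fraction * len(idx) / total
    # Clip to [floor, 1]; after clipping the budget is only approximately met.
    return {i: min(1.0, max(floor, scale * r)) for i, r in raw.items()}


def ipw_debiased_mean(proxies, idx, probs, sampled, gold):
    """Proxy mean on `idx` plus an inverse-probability-weighted error correction."""
    correction = sum((gold[i] - proxies[i]) / probs[i] for i in sampled) / len(idx)
    return fmean(proxies[i] for i in idx) + correction


if __name__ == "__main__":
    rng = random.Random(1)
    n = 20_000
    groups = [rng.choice("ab") for _ in range(n)]
    truth = [rng.gauss(0.0, 1.0) for _ in range(n)]
    noise_sd = {"a": 0.1, "b": 1.0}  # proxies are much noisier in group "b"
    proxies = [y + rng.gauss(0.2, noise_sd[g]) for y, g in zip(truth, groups)]

    pilot_idx = list(range(1_000))    # pilot split: used only to choose the rule
    main_idx = list(range(1_000, n))  # inference split
    pilot_labeled = rng.sample(pilot_idx, 200)
    gold = {i: truth[i] for i in pilot_labeled}
    rms = rms_error_by_group(groups, proxies, gold, pilot_labeled)

    # Probabilities are fixed before any main-split sampling indicator is drawn.
    probs = sampling_probs(groups, main_idx, rms, budget_fraction=0.1)
    sampled = [i for i in main_idx if rng.random() < probs[i]]
    gold.update({i: truth[i] for i in sampled})
    print("estimate:       ", ipw_debiased_mean(proxies, main_idx, probs, sampled, gold))
    print("main-split mean:", fmean(truth[i] for i in main_idx))
```

Because the sampling probabilities depend only on the pilot split, the sampling indicators on the inference split are independent given the pilot, which is what allows standard theory to apply. The pilot labels do not enter the final estimate, which illustrates the loss of power noted above.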

We introduce an estimator for the two-phase multiwave setting. We prove that this estimator is asymptotically linear and asymptotically normal, and use this to provide asymptotic confidence intervals. To our knowledge, this is the first approach to M-estimation in two-phase multiwave sampling with theoretical guarantees. We also discuss how the user should choose the sampling strategy for increased statistical efficiency.
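
For readers less familiar with the terminology, the properties claimed above take the following generic textbook form. The notation (loss $\ell_\theta$, data $X$, influence terms $\psi_i$, sample size $N$, limiting covariance $\Sigma$) is assumed here for illustration and is not taken from the paper, whose formal statements additionally account for the adaptive sampling.

```latex
% M-estimation target, asymptotic linearity, and asymptotic normality (generic form)
\theta_0 = \arg\min_{\theta} \mathbb{E}\left[\ell_\theta(X)\right],
\qquad
\sqrt{N}\,\bigl(\hat\theta - \theta_0\bigr)
  = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \psi_i + o_p(1),
\qquad
\sqrt{N}\,\bigl(\hat\theta - \theta_0\bigr) \xrightarrow{d} \mathcal{N}(0, \Sigma).
```

Given a consistent estimate of $\Sigma$, the normal limit yields the usual Wald-type confidence intervals for each coordinate of $\theta_0$.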

The outline of this paper is as follows. In Section 2, we introduce the formal setting and notation and describe the point estimator and its corresponding confidence intervals. In Section 3, we present our main theoretical results, which (i) establish asymptotic linearity of the point estimator in M-estimation tasks under fairly mild conditions, (ii) provide a central limit theorem for the point estimator, and (iii) establish conditions under which the confidence intervals are asymptotically valid. In Section 4, we use the asymptotic variance formula obtained in the previous section to motivate sampling strategies that are designed to reduce asymptotic variance. In Section 5, we conduct simulations to test the empirical performance and coverage of a few of these sampling strategies. In Section 6, we review related work. The proofs for all theoretical results are provided in the appendix.

In this section, we formally in

…(Full text truncated)…

Reference

This content is AI-processed based on ArXiv data.
