A pseudo empirical likelihood approach for stratified samples with nonresponse
Nonresponse is common in surveys. When the response probability of a survey variable $Y$ depends on $Y$ through an observed auxiliary categorical variable $Z$ (i.e., the response probability of $Y$ is conditionally independent of $Y$ given $Z$), a simple method often used in practice is to use $Z$ categories as imputation cells and construct estimators by imputing nonrespondents or reweighting respondents within each imputation cell. This simple method, however, is inefficient when some $Z$ categories have small sizes and ad hoc methods are often applied to collapse small imputation cells. Assuming a parametric model on the conditional probability of $Z$ given $Y$ and a nonparametric model on the distribution of $Y$, we develop a pseudo empirical likelihood method to provide more efficient survey estimators. Our method avoids any ad hoc collapsing small $Z$ categories, since reweighting or imputation is done across $Z$ categories. Asymptotic distributions for estimators of population means based on the pseudo empirical likelihood method are derived. For variance estimation, we consider a bootstrap procedure and its consistency is established. Some simulation results are provided to assess the finite sample performance of the proposed estimators.
💡 Research Summary
Non‑response is a pervasive issue in survey research, and a common practical solution assumes that the response probability for the study variable Y depends only on an observed auxiliary categorical variable Z (i.e., the response mechanism is conditionally independent of Y given Z). Under this “missing at random” (MAR) setting, practitioners typically treat each Z‑category as an imputation cell, either imputing missing Y values within the cell or re‑weighting respondents. This cell‑wise approach, however, becomes inefficient when some Z‑categories contain few units; analysts often resort to ad‑hoc collapsing of small cells, a subjective step that can degrade statistical efficiency and complicate inference.
The paper proposes a pseudo‑empirical likelihood (PEL) framework that simultaneously models the conditional distribution of Z given Y parametrically and treats the marginal distribution of Y non‑parametrically. Specifically, a parametric model (e.g., a multinomial logistic regression) is posited for (P(Z=z\mid Y=y;\beta)), while the empirical distribution of Y is left unrestricted. The observed data consist of respondents ((R_i=1)) and non‑respondents ((R_i=0)). A pseudo‑likelihood is constructed by combining the empirical likelihood contributions of respondents with the parametric constraints imposed by the Z|Y model, using Lagrange multipliers to enforce the moment conditions (\sum_i w_i h(Z_i,Y_i;\beta)=0). Because the constraints involve all observations, the resulting weights are adjusted across Z‑categories, allowing information to “borrow strength” from larger cells and eliminating the need for any manual cell collapsing.
The authors derive the asymptotic properties of the PEL estimators. They prove consistency for both the population mean of Y and the nuisance parameter (\beta), and they establish a joint asymptotic normality result: (\sqrt{n}(\hat\theta-\theta_0)) converges to a multivariate normal distribution with a covariance matrix that explicitly incorporates the stratification design, the distribution of Z, and the non‑response mechanism. The derivations rely on standard empirical likelihood theory extended to incorporate the parametric Z|Y constraints, and they show that the influence of the non‑parametric part does not inflate the asymptotic variance beyond the information bound.
For variance estimation, the paper adopts a bootstrap scheme tailored to stratified sampling with non‑response. Each bootstrap replicate resamples strata with replacement, reproduces the response indicator according to the estimated response model, and recomputes the PEL estimator. The authors prove that the bootstrap variance converges in probability to the true asymptotic variance, thereby providing a practical, simulation‑based method for constructing confidence intervals and conducting hypothesis tests without resorting to complex analytic variance formulas.
A comprehensive simulation study evaluates finite‑sample performance under a variety of realistic scenarios: different numbers of Z categories (5–20), highly unbalanced category sizes (some cells as small as 5 % of the total), varying overall non‑response rates (10 %–40 %), and both normal and skewed distributions for Y. Across all settings, the PEL estimator exhibits markedly lower mean‑squared error than the traditional cell‑wise imputation/re‑weighting estimator—typically a 20 %–40 % reduction. The advantage is especially pronounced when small Z categories are present, where the conventional method suffers from high variance or bias. Moreover, bootstrap‑based confidence intervals achieve nominal coverage, whereas intervals based on the naïve cell‑wise approach often under‑ or over‑cover.
The paper’s contributions are threefold. First, it provides a theoretically sound and computationally feasible solution to the inefficiency caused by small imputation cells, removing the need for ad‑hoc cell collapsing. Second, it extends empirical likelihood methodology to accommodate a mixed parametric–non‑parametric model in the presence of non‑response, preserving the desirable higher‑order properties of empirical likelihood (e.g., Bartlett correction potential). Third, it offers a bootstrap variance estimator that is provably consistent under complex survey designs, facilitating straightforward inference for practitioners.
In conclusion, the pseudo‑empirical likelihood approach offers a robust, efficient alternative for handling non‑response when auxiliary categorical information is available. It leverages all available data, respects the survey design, and delivers reliable inference without subjective preprocessing steps. Future research directions suggested by the authors include extending the framework to continuous or mixed‑type auxiliary variables, accommodating multi‑stage sampling designs, and exploring penalized versions of the likelihood to handle high‑dimensional auxiliary information.
Comments & Academic Discussion
Loading comments...
Leave a Comment