Doubly Robust Machine Learning for Population Size Estimation with Missing Covariates: Application to Gaza Conflict Mortality
Population size estimation from capture-recapture data is central for studying hard-to-reach populations, incorporating auxiliary covariates to account for heterogeneous capture probabilities and recapture dependencies. However, missing attributes pose a critical methodological challenge due to reluctance to share sensitive information, data collection limitations, and imperfect record linkage. Existing approaches either ignore missingness or rely on a priori imputation, potentially introducing substantial bias. In this work, we develop a novel nonparametric estimation framework using a Missing at Random assumption to identify capture probabilities under missing covariates. Using semiparametric efficiency theory, we construct one-step estimators that combine efficiency, robustness, and finite-sample validity: they approximately achieve the nonparametric efficiency bound, accommodate flexible machine learning methods through a doubly robust structure, and provide approximately valid inference for any sample size. Simulations demonstrate substantial improvements over naive imputation approaches, with our doubly robust ML estimators maintaining valid inference even at high missingness rates where competing methods fail. We apply our methodology to re-estimate mortality in the Gaza Strip from October 7, 2023, to June 30, 2024, using three-list capture-recapture data with missing demographic information. Our approach yields more conservative yet precise estimates compared to previous methods, indicating the true death toll exceeds official statistics by approximately 26%. Our framework provides practitioners with principled tools for handling incomplete data in conflict settings and other applications with hard-to-reach populations.
💡 Research Summary
This paper tackles a pervasive problem in capture‑recapture (multiple‑systems) population size estimation: the presence of missing covariates. While modern methods incorporate auxiliary variables to model heterogeneous capture probabilities and list dependencies, they typically assume that all covariates are fully observed. In practice—especially for hard‑to‑reach or conflict‑affected populations—key demographic or behavioral attributes are often missing due to privacy concerns, administrative hurdles, or imperfect record linkage. Ignoring this missingness or applying naïve imputation can lead to substantial bias and invalid confidence intervals.
The authors propose a novel non‑parametric framework that (i) identifies the capture probability under a Missing‑at‑Random (MAR) assumption combined with a “no highest‑order interaction” log‑linear model, and (ii) constructs doubly robust (DR) machine‑learning estimators that achieve near‑semiparametric efficiency while remaining valid for any finite sample size. The key identification result shows that the overall capture probability ψ (the probability that an individual appears on at least one list) can be expressed as the harmonic mean of the conditional capture probability γ(V,X) under the observed‑data distribution Q. This holds even when the covariate vector X is partially missing, provided MAR holds conditional on the always‑observed covariates V and the missingness indicator R.
Building on this, the paper derives the efficient influence function (EIF) for ψ⁻¹ and proposes a one‑step update that incorporates two nuisance functions: (a) the conditional capture probability γ̂(V,X) and (b) the missingness mechanism π̂(R=1|V,X). Both nuisance functions may be estimated with arbitrary machine‑learning algorithms (random forests, gradient boosting, neural networks, etc.). The DR property guarantees consistency if either γ̂ or π̂ is correctly specified, protecting the estimator against misspecification of the other. Moreover, the one‑step construction ensures √n‑rate convergence and asymptotic normality, delivering valid confidence intervals even when flexible learners converge at slower non‑parametric rates.
Simulation studies explore missingness rates from 10 % to 70 % and a variety of data‑generating mechanisms (linear, non‑linear, interaction‑rich). Compared with complete‑case analysis, simple mean imputation, and earlier DR estimators that assume fully observed covariates, the proposed estimator exhibits markedly lower mean‑squared error, negligible bias, and coverage probabilities close to the nominal 95 % level across all scenarios. Notably, when missingness exceeds 50 %, the method still retains its robustness, highlighting the practical advantage of the DR approach in high‑missingness settings.
The methodology is applied to a three‑list capture‑recapture dataset documenting civilian deaths in the Gaza Strip from 7 Oct 2023 to 30 Jun 2024. The lists comprise field surveys, hospital records, and humanitarian organization reports. Approximately 40 % of the demographic covariates (e.g., age) are missing. Prior analyses using standard multiple‑systems estimation reported roughly 66 000 deaths (95 % CI ≈ 58 000–74 000). The new DR‑ML estimator yields 59 441 deaths (95 % CI ≈ 50 708–68 173), a more conservative point estimate but with a tighter interval, and still indicates that the true death toll exceeds official statistics by about 26 %.
In sum, the paper makes three substantive contributions: (1) it establishes a clear identification strategy for capture probabilities under MAR and a restricted log‑linear interaction structure; (2) it introduces a doubly robust, machine‑learning‑compatible one‑step estimator that attains the non‑parametric efficiency bound while remaining valid for any sample size; and (3) it demonstrates, both via simulations and a real‑world conflict mortality case study, that the method delivers reliable inference even with substantial covariate missingness. The framework is broadly applicable to epidemiology, human rights monitoring, and any domain where hidden populations must be quantified from incomplete, multi‑source data. Future work may relax the MAR assumption, address non‑ignorable missingness, and extend the approach to streaming or longitudinal capture‑recapture settings.
Comments & Academic Discussion
Loading comments...
Leave a Comment