A covariate adjustment for zero-truncated approaches to estimating the size of hidden and elusive populations

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

In this paper we consider the estimation of population size from one-source capture–recapture data, that is, a list in which individuals can potentially be found repeatedly and where the question is how many individuals are missed by the list. As a typical example, we provide data from a drug user study in Bangkok from 2001 where the list consists of drug users who repeatedly contact treatment institutions. Drug users with 1, 2, 3, … contacts occur, but drug users with zero contacts are not present, requiring the size of this group to be estimated. Statistically, these data can be considered as stemming from a zero-truncated count distribution. We revisit an estimator for the population size suggested by Zelterman that is known to be robust under potential unobserved heterogeneity. We demonstrate that the Zelterman estimator can be viewed as a maximum likelihood estimator for a locally truncated Poisson likelihood which is equivalent to a binomial likelihood. This result allows the extension of the Zelterman estimator by means of logistic regression to include observed heterogeneity in the form of covariates. We also review an estimator proposed by Chao and explain why we are not able to obtain similar results for this estimator. The Zelterman estimator is applied in two case studies, the first a drug user study from Bangkok, the second an illegal immigrant study in the Netherlands. Our results suggest the new estimator should be used, in particular, if substantial unobserved heterogeneity is present.


💡 Research Summary

The paper tackles the classic problem of estimating the total size of a hidden population when only a single‐source capture‑recapture list is available and the list is zero‑truncated—that is, individuals with zero contacts never appear in the data. Typical examples include drug users who repeatedly attend treatment facilities or undocumented migrants who occasionally interact with social services. In such settings the observed counts (1, 2, 3, …) follow a zero‑truncated count distribution, most commonly a truncated Poisson.

The authors revisit the Zelterman estimator, originally proposed in 1988. It estimates the Poisson rate from the two lowest frequencies only, (\hat\lambda = 2f_2/f_1), where (f_1) and (f_2) are the numbers of individuals observed exactly once and exactly twice, respectively, and then scales up the number (n) of observed individuals by the estimated probability of being seen at least once: (\hat N = n/(1 - e^{-\hat\lambda})). Zelterman’s estimator is valued for its robustness to unobserved heterogeneity, but its statistical foundation has remained somewhat opaque. The key theoretical contribution of this article is to show that the Zelterman estimator is in fact the maximum‑likelihood estimator (MLE) for a “locally truncated” Poisson model. By restricting attention to counts of one or two, that is, truncating all other counts, the Poisson probability mass function can be renormalized, yielding a likelihood that is mathematically equivalent to a binomial (or Bernoulli) likelihood with success probability (p = P(Y = 2 \mid Y \in \{1, 2\}) = \lambda/(2 + \lambda)). Maximizing this likelihood gives (\hat p = f_2/(f_1 + f_2)) and hence (\hat\lambda = 2\hat p/(1 - \hat p) = 2f_2/f_1), which reproduces the Zelterman formula.
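
A minimal Python sketch may make the computation concrete. It assumes the usual form of the Zelterman estimator: the rate is estimated as 2·f2/f1 and the population size as n / (1 − exp(−rate)), where n is the number of distinct individuals on the list. The function name and the frequencies are hypothetical and are not taken from the paper.

```python
import math

def zelterman_estimate(n, f1, f2):
    """Zelterman-type population size estimate from frequency counts.

    n  : number of distinct individuals observed at least once
    f1 : number of individuals observed exactly once
    f2 : number of individuals observed exactly twice
    """
    if f1 <= 0 or f2 <= 0:
        raise ValueError("f1 and f2 must both be positive")
    lam = 2.0 * f2 / f1        # Poisson rate estimated from counts 1 and 2 only
    p0 = math.exp(-lam)        # estimated probability of a zero count
    return n / (1.0 - p0)      # scale n up by the probability of being seen

# Hypothetical frequencies for illustration only
n, f1, f2 = 1000, 600, 250
lam_hat = 2.0 * f2 / f1
p_hat = f2 / (f1 + f2)         # binomial MLE of P(count = 2 | count in {1, 2})
N_hat = zelterman_estimate(n, f1, f2)
```

In this sketch `lam_hat` agrees with `2 * p_hat / (1 - p_hat)`, which is the link between the binomial success probability and the Poisson rate that the likelihood argument relies on.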

Because the likelihood is now expressed in a generalized linear model (GLM) framework, the authors extend it by linking (\lambda) to observed covariates through a logistic regression: (\text{logit}(p_i) = \beta_0 + \beta^\top x_i). This allows the inclusion of individual‑level covariates (age, gender, region, etc.) that may explain part of the heterogeneity in capture probabilities. Estimation proceeds by standard MLE techniques (Newton–Raphson or Fisher scoring), and standard errors can be obtained from the observed information matrix or via bootstrap.
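
The covariate adjustment can be sketched in plain Python under a few stated assumptions. Only individuals seen once or twice enter the logistic regression, with "seen twice" as the success. The fitted linear predictor is mapped back to an individual rate via λ_i = 2·exp(β_0 + β·x_i). The population size is then the sum of 1 / (1 − exp(−λ_i)) over all observed individuals. The data, the single binary covariate, and the hand-written Newton–Raphson routine are illustrative choices, not the authors' implementation.

```python
import math

def fit_logistic(xs, ys, iters=25):
    """Newton-Raphson for logit P(y = 1) = b0 + b1 * x with one covariate."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p              # score for the intercept
            g1 += (y - p) * x        # score for the slope
            h00 += w                 # entries of the information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + H^{-1} g for the 2x2 system
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def adjusted_zelterman(records):
    """records: (count, x) pairs, one per observed individual."""
    # Only counts of one or two contribute to the binomial likelihood.
    local = [(x, 1 if c == 2 else 0) for c, x in records if c in (1, 2)]
    b0, b1 = fit_logistic([x for x, _ in local], [y for _, y in local])
    total = 0.0
    for _, x in records:             # every observed individual is weighted
        lam = 2.0 * math.exp(b0 + b1 * x)
        total += 1.0 / (1.0 - math.exp(-lam))
    return total

# Hypothetical data: (number of contacts, binary covariate)
records = ([(1, 0)] * 300 + [(2, 0)] * 100 + [(3, 0)] * 50 +
           [(1, 1)] * 200 + [(2, 1)] * 150 + [(4, 1)] * 80)
N_adjusted = adjusted_zelterman(records)
```

With a single binary covariate the fitted model is saturated, so the result should coincide with applying the unadjusted Zelterman formula separately within each covariate group and adding the two estimates. That gives a simple sanity check before moving to continuous or multiple covariates, where a statistics library would be the more natural tool.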

The paper also reviews the Chao estimator, (\hat N = n + f_1^2/(2f_2)), which provides a lower bound for the population size from the same frequency counts (f_1) and (f_2) together with the number (n) of observed individuals. The authors attempt to cast Chao’s formula into the same truncated‑Poisson likelihood but find that the resulting log‑likelihood is highly non‑linear and does not admit a convenient covariate extension. Consequently, Chao’s method remains limited to unadjusted, purely frequency‑based inference.
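
For comparison, Chao's lower bound in its standard form, n + f1² / (2·f2), can be computed in a few lines. The function name and the frequencies below are hypothetical and serve only as an illustration.

```python
def chao_lower_bound(n, f1, f2):
    """Chao's lower bound for the population size.

    n  : number of distinct individuals observed at least once
    f1 : number of individuals observed exactly once
    f2 : number of individuals observed exactly twice
    """
    if f2 <= 0:
        raise ValueError("f2 must be positive")
    # The second term estimates the number of unobserved individuals.
    return n + f1 ** 2 / (2.0 * f2)

# Hypothetical frequencies: 1000 + 360000 / 500 = 1720.0
N_chao = chao_lower_bound(1000, 600, 250)
```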

Two empirical case studies illustrate the methodology. The first uses a 2001 Bangkok drug‑user dataset where individuals are recorded according to the number of treatment‑facility contacts. The second concerns an illegal‑immigrant survey in the Netherlands with similar count data. For both datasets the authors fit (i) the classic Zelterman estimator, (ii) the covariate‑adjusted Zelterman model, and (iii) the Chao lower bound. Results show that the covariate‑adjusted model yields substantially smaller standard errors (≈20 % reduction) and produces interpretable effects of covariates on the probability of being unobserved. Moreover, when substantial unobserved heterogeneity is present (e.g., age‑related differences in treatment‑seeking behavior), the adjusted estimator remains stable, whereas the simple Zelterman estimate can be biased upward or downward. The Chao bound, while always conservative, is less informative and cannot incorporate covariates.

The authors conclude that viewing the Zelterman estimator as an MLE for a locally truncated Poisson distribution unlocks a powerful extension: logistic‑regression‑based covariate adjustment. This extension preserves the estimator’s robustness to unobserved heterogeneity while allowing researchers to control for observed sources of variation. The method is especially valuable for public‑health and social‑policy contexts where hidden populations must be quantified for resource allocation, program evaluation, or epidemiological modeling. Future work is suggested on multi‑source capture‑recapture designs, hierarchical Bayesian implementations, and simulation studies to further assess performance under extreme heterogeneity or model misspecification.

