Coarsened data in small area estimation: a Bayesian two-part model for mapping smoking behaviour
Estimating health indicators for restricted sub-populations is a recurring challenge in epidemiology and public health. When survey data are used, Small Area Estimation (SAE) methods can improve precision by borrowing strength across domains. In many applications, however, outcomes are self-reported and affected by coarsening mechanisms, such as rounding and digit preference, that reduce data resolution and may bias inference. This paper addresses both issues by developing a Bayesian unit-level SAE framework for semi-continuous, coarsened responses. Motivated by the 2019 Italian European Health Interview Survey, we estimate smoking indicators for domains defined by the cross-classification of Italian regions and age groups, capturing both smoking prevalence and intensity. The model adopts a two-part structure: a logistic component for smoking prevalence and a flexible mixture of Lognormal distributions for average cigarette consumption, coupled with an explicit model for coarsening and topcoding. Simulation studies show that ignoring coarsening can yield biased and unstable domain estimates with poor interval coverage, whereas the proposed model improves accuracy and achieves near-nominal coverage. The empirical application provides a detailed picture of smoking patterns across region-age domains, helping to characterize the dynamics of the phenomenon and inform targeted public health policies.
💡 Research Summary
This paper tackles two pervasive challenges in public‑health survey analysis: (i) the need for reliable small‑area estimates when sample sizes are limited, and (ii) the distortion of self‑reported quantitative outcomes by coarsening mechanisms such as rounding, digit preference (heaping), and top‑coding. Using the 2019 Italian component of the European Health Interview Survey (EHIS), the authors aim to estimate, for each of 84 region‑age domains (21 NUTS‑2 regions × 4 age groups), three key smoking indicators: the proportion of daily smokers, the mean number of cigarettes smoked per day among daily smokers, and the proportion of heavy smokers (≥20 cigarettes/day).
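As a rough illustration of the three target indicators, the sketch below computes them for a single domain from unit-level records. It is not the authors' estimator: the function name `domain_indicators` is hypothetical, survey weights are ignored, and the heavy-smoker proportion is assumed to be taken over the whole domain population rather than over smokers only.

```python
import numpy as np

def domain_indicators(smoker, cigs_per_day, heavy_threshold=20.0):
    """Return (prevalence, mean intensity among smokers, heavy-smoker share).

    Assumptions (not from the paper): unweighted data, and the heavy-smoker
    share is computed over all respondents in the domain.
    """
    smoker = np.asarray(smoker, dtype=bool)
    cigs = np.asarray(cigs_per_day, dtype=float)
    prevalence = smoker.mean()
    if not smoker.any():
        # No smokers sampled: intensity is undefined for this domain.
        return prevalence, float("nan"), 0.0
    mean_intensity = cigs[smoker].mean()
    heavy_share = (smoker & (cigs >= heavy_threshold)).mean()
    return prevalence, mean_intensity, heavy_share

# Toy domain with five respondents, three of them daily smokers.
prev, mean_cigs, heavy = domain_indicators([1, 0, 1, 1, 0], [10, 0, 20, 25, 0])
```

In small domains these direct quantities are noisy or undefined, which is the motivation for a model that borrows strength across domains.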
The methodological contribution is a Bayesian unit‑level small‑area model built in a two‑part framework. The first part models the binary smoking status W with a logistic mixed model that includes individual covariates (sex, education, age) and a domain‑specific random effect. The second part models the latent continuous intensity Z (average cigarettes per day for smokers) using a mixture of two Log‑Normal distributions, allowing the distribution to capture heterogeneity in consumption levels. Because the observed intensity Z* is coarsened, the authors introduce an auxiliary latent variable G that maps the true Z to the reported value via a heaping matrix and a censoring rule: values are rounded to the nearest multiple of 5 (5, 10, 15, 20) and any value above 20 is collapsed into a single top‑coded category (recorded as 21).
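The data-generating process described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' specification: covariates are omitted from the logistic part, the heaping matrix is replaced by a single hypothetical heaping probability `p_heap`, top-coding is applied to the reported value, and all parameter values are invented for the example.

```python
import numpy as np

def simulate_two_part(n, rng, beta0=-1.2, u_d=0.0,
                      weights=(0.6, 0.4), mus=(2.0, 3.0), sigmas=(0.5, 0.3)):
    """Draw smoking status W and latent intensity Z for one domain.

    beta0, u_d, weights, mus and sigmas are illustrative values only.
    """
    p_smoke = 1.0 / (1.0 + np.exp(-(beta0 + u_d)))      # logistic part
    w = rng.random(n) < p_smoke
    comp = rng.choice(len(weights), size=n, p=weights)  # mixture component label
    z = rng.lognormal(mean=np.asarray(mus)[comp], sigma=np.asarray(sigmas)[comp])
    return w, np.where(w, z, 0.0)                       # non-smokers have Z = 0

def coarsen(z, rng, p_heap=0.7, base=5, top=20, top_code=21):
    """Map latent Z to a reported Z* with heaping and top-coding.

    With probability p_heap a smoker reports the nearest multiple of `base`,
    otherwise the nearest integer; reported values above `top` become `top_code`.
    """
    z = np.asarray(z, dtype=float)
    heaped = rng.random(z.shape) < p_heap
    to_base = np.maximum(base, base * np.round(z / base))
    to_unit = np.maximum(1.0, np.round(z))
    z_star = np.where(heaped, to_base, to_unit)
    z_star = np.where(z_star > top, top_code, z_star)
    return np.where(z > 0, z_star, 0.0)

rng = np.random.default_rng(2019)
w, z = simulate_two_part(5000, rng)
z_star = coarsen(z, rng)
# Share of smokers at or above 20 cigarettes/day, before and after coarsening.
heavy_latent = (z[w] >= 20).mean()
heavy_reported = (z_star[w] >= 20).mean()
```

Comparing `heavy_latent` with `heavy_reported` shows how rounding toward 20 can shift the heavy-smoker share, which is the kind of distortion an explicit coarsening model is meant to undo.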
The full hierarchical model jointly estimates the logistic parameters, the mixture weights, means and variances of the Log‑Normal components, and the heaping/censoring probabilities. Priors are chosen to be weakly informative, but special attention is paid to the variance parameters of the Log‑Normal components: the authors prove that non‑restrictive priors can lead to divergent posterior moments, and they provide sufficient conditions (inverse‑Gamma priors on variances) that guarantee the existence of posterior moments for all target domain quantities.
Inference is performed via Markov chain Monte Carlo, combining Gibbs steps for conjugate blocks and Metropolis‑Hastings updates for the mixture and heaping parameters. The authors conduct a simulation study that mimics the EHIS design, comparing three specifications: (a) a naïve model that ignores coarsening, (b) a model that includes coarsening but uses a single Log‑Normal for Z, and (c) the full proposed model. Results show that ignoring coarsening leads to substantial bias (up to 30 % in small domains) and poor 95 % interval coverage (≈70 %). The full model reduces bias to below 5 % and restores coverage to the 93–97 % range, while also achieving lower mean‑squared error for the heavy‑smoker proportion.
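The performance measures quoted above (relative bias and interval coverage) are typically computed per domain across simulation replicates. The helpers below are a minimal sketch of that bookkeeping; the function names and the array layout (replicates in rows, domains in columns) are assumptions made for illustration, not taken from the paper.

```python
import numpy as np

def relative_bias(estimates, truth):
    """Per-domain relative bias; `estimates` has shape (replicates, domains)."""
    estimates = np.asarray(estimates, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return (estimates.mean(axis=0) - truth) / truth

def coverage(lower, upper, truth):
    """Per-domain share of replicates whose interval contains the true value."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return ((lower <= truth) & (truth <= upper)).mean(axis=0)

# Toy example: two replicates, two domains.
truth = [1.0, 2.0]
rb = relative_bias([[1.1, 2.0], [0.9, 2.4]], truth)
cov = coverage([[0.8, 2.1], [0.7, 1.9]], [[1.2, 2.5], [1.1, 2.9]], truth)
```

With credible intervals at the 95% level, coverage close to 0.95 in every domain is the benchmark against which the three specifications would be compared.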
Applying the model to the actual Italian EHIS data, the authors find that prevalence estimates (daily‑smoker proportion) are relatively robust across methods, but intensity‑related indicators differ markedly. After accounting for coarsening, the estimated mean number of cigarettes per day among smokers drops by 1.8–2.3 in several southern regions, and the heavy‑smoker proportion is correspondingly lower. These adjustments indicate that the raw survey data over‑estimate smoking intensity, especially in the 20–44 age groups, where rounding to multiples of five is most pronounced. The refined domain estimates provide a more accurate picture of regional smoking patterns, which can inform targeted tobacco‑control policies and resource allocation.
In summary, the paper makes three substantive contributions: (1) it integrates an explicit coarsening mechanism into a Bayesian small‑area framework, thereby correcting bias that would otherwise contaminate domain estimates; (2) it introduces a flexible Log‑Normal mixture to capture heterogeneity in smoking intensity; and (3) it establishes theoretical conditions ensuring proper posterior behavior for models with Log‑Normal components. The work bridges methodological gaps between measurement‑error literature and small‑area estimation, offering a practical tool for public‑health agencies dealing with coarsened survey data. Future extensions could incorporate temporal dynamics, external validation data, or alternative coarsening structures for other health‑behavior outcomes.