Learning from Neighbors with PHIBP: Predicting Infectious Disease Dynamics in Data-Sparse Environments
Modeling sparse count data, which arise across numerous scientific fields, presents significant statistical challenges. This chapter addresses these challenges in the context of infectious disease prediction, with a focus on predicting outbreaks in geographic regions that have historically reported zero cases. To this end, we present the detailed computational framework and experimental application of the Poisson Hierarchical Indian Buffet Process (PHIBP), with demonstrated success in handling sparse count data in microbiome and ecological studies. The PHIBP’s architecture, grounded in the concept of absolute abundance, systematically borrows statistical strength from related regions and circumvents the known sensitivities of relative-rate methods to zero counts. Through a series of experiments on infectious disease data, we show that this principled approach provides a robust foundation for generating coherent predictive distributions and for the effective use of comparative measures such as alpha and beta diversity. The chapter’s emphasis on algorithmic implementation and experimental results confirms that this unified framework delivers both accurate outbreak predictions and meaningful epidemiological insights in data-sparse settings.
💡 Research Summary
The paper tackles the pervasive problem of modeling sparse count data in infectious disease surveillance, where many region‑disease pairs contain zero reported cases. It introduces the Poisson Hierarchical Indian Buffet Process (PHIBP), a Bayesian non‑parametric framework originally devised for microbiome and ecological count data, and demonstrates its applicability to epidemiological forecasting.
The authors begin by contrasting absolute‑abundance modeling with the more common compositional (relative‑abundance) approaches. In compositional methods, counts are normalized to proportions that must sum to one, creating a “closure” problem that induces spurious negative correlations and makes zero counts problematic. PHIBP instead treats raw counts as realizations of Poisson processes driven by absolute intensity parameters. Each disease type ℓ has a global mean intensity λ_ℓ. For a specific region j, a subordinated Lévy process τ_j transforms λ_ℓ into a local intensity σ_{j,ℓ}(λ_ℓ), preserving independence across diseases while allowing region‑specific variation. This absolute‑rate parameterization cleanly separates sampling zeros (low intensity that happens not to be observed) from structural zeros (near‑zero intensity indicating true absence).
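The closure problem described above can be seen in a small simulation. This is a minimal sketch, not taken from the paper: the three equal rates, the number of regions, and all variable names are assumptions chosen for illustration. Counts are drawn independently per disease, so any correlation that appears after normalization is produced by the normalization itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 3 diseases with independent Poisson counts in 5000 regions.
rates = np.array([20.0, 20.0, 20.0])
counts = rng.poisson(rates, size=(5000, 3))

# Absolute counts are independent by construction, so correlations are near zero.
corr_abs = np.corrcoef(counts, rowvar=False)

# Compositional view: normalize each region's counts to proportions summing to one.
totals = counts.sum(axis=1, keepdims=True)
keep = totals[:, 0] > 0  # proportions are undefined for all-zero rows
props = counts[keep] / totals[keep]
corr_rel = np.corrcoef(props, rowvar=False)

print("correlation of absolute counts (diseases 0, 1):", round(corr_abs[0, 1], 3))
print("correlation of proportions     (diseases 0, 1):", round(corr_rel[0, 1], 3))
```

With three equally common diseases, the proportions are negatively correlated (close to -0.5) even though the underlying counts are independent, which is the artifact that an absolute-rate parameterization avoids.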
A second architectural decision highlighted is the use of “thinning” rather than exchangeable sampling. Traditional exchangeable partition models condition on a fixed sample size n, which is unrealistic for disease reporting where the total number of observed cases is itself random, driven by population size, testing effort, and reporting practices. PHIBP models the underlying intensities first, then generates observed counts by Poisson sampling with exposure terms (population‑at‑risk, test volume, etc.). This perspective yields several practical benefits: (i) transparent representation of zero‑inflation, (ii) additive prediction rules that naturally extend to new regions or time periods, and (iii) tractable posterior and predictive distributions derived from mixed‑Poisson calculus.
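The mixed‑Poisson reasoning behind these benefits can be illustrated with the simplest conjugate case. This is a sketch under stated assumptions, not the PHIBP posterior: a single Gamma prior stands in for the hierarchical prior, and the prior values, exposure units, and function names are hypothetical. It shows how exposures enter additively and how a region with only zero counts still receives a non‑degenerate predictive probability of cases.

```python
import math

def gamma_poisson_update(shape, rate, counts, exposures):
    """Conjugate update for a Gamma(shape, rate) prior on the intensity when
    each count is Poisson(exposure * intensity). Exposures simply add up."""
    return shape + sum(counts), rate + sum(exposures)

def prob_zero(shape, rate, exposure):
    """Mixed-Poisson (negative binomial) probability of zero cases
    for a future period with the given exposure."""
    return (rate / (rate + exposure)) ** shape

# Hypothetical prior, e.g. informed by neighboring regions.
prior_shape, prior_rate = 0.5, 10.0

# A region that reported zero cases in three periods, each with exposure 2.0
# (units are an assumption, e.g. 100k person-years).
post_shape, post_rate = gamma_poisson_update(
    prior_shape, prior_rate, counts=[0, 0, 0], exposures=[2.0, 2.0, 2.0]
)

# Predictive probability of at least one case in the next period.
p_outbreak = 1.0 - prob_zero(post_shape, post_rate, exposure=2.0)
print(f"posterior: Gamma({post_shape}, {post_rate}), P(at least one case) = {p_outbreak:.4f}")
```

Observing zeros lowers the predictive probability by increasing the rate parameter, but never drives it to zero, in contrast to a plug‑in relative frequency of 0.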
Mathematically, PHIBP builds on completely random measures (CRMs) and Lévy subordinators. The global intensities form a Poisson random measure on a Polish space; region‑specific Lévy measures τ_j act as “subordinators” that warp the global CRM, producing a hierarchy of CRMs. By selecting different Lévy families (Gamma, stable, etc.) and biasing functions h_j, practitioners can encode domain knowledge such as mobility patterns, healthcare access, or environmental covariates directly into the prior. The authors emphasize that PHIBP subsumes hierarchical Dirichlet processes (HDP) and related models (PHIBP can replicate all functionalities of HDP, but the converse does not hold), making it a strictly more expressive framework.
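A finite toy version of this hierarchy can be simulated when the subordinator is taken from the Gamma family. The sketch below is an illustration under assumptions, not the authors' algorithm: it truncates the model to a fixed number of diseases, uses the standard fact that a gamma subordinator evaluated at "time" λ_ℓ is Gamma‑distributed with shape proportional to λ_ℓ, and introduces hypothetical concentration parameters `theta` and exposures per region.

```python
import numpy as np

rng = np.random.default_rng(1)
n_diseases, n_regions = 6, 4

# Global intensities lambda_l, one per disease (hypothetical Gamma draw).
lam = rng.gamma(shape=0.5, scale=2.0, size=n_diseases)

# Assumed per-region concentration: larger theta_j keeps region j closer to
# the global intensities, smaller theta_j allows more local deviation.
theta = np.array([0.5, 1.0, 5.0, 50.0])

# Gamma subordinator tau_j evaluated at lambda_l:
# sigma_{j,l} ~ Gamma(shape = theta_j * lambda_l, rate = theta_j),
# so that E[sigma_{j,l} | lambda_l] = lambda_l. Draws are independent across l.
sigma = rng.gamma(shape=theta[:, None] * lam[None, :], scale=1.0 / theta[:, None])

# Observed counts: Poisson sampling with a hypothetical exposure per region.
exposure = np.array([1.0, 2.0, 0.5, 3.0])
counts = rng.poisson(exposure[:, None] * sigma)

print("global intensities:", np.round(lam, 3))
print("counts (regions x diseases):")
print(counts)
```

Because every region's local intensity is centered on the same global λ_ℓ, a disease unseen in one region can still receive positive intensity there when it is common elsewhere, which is the borrowing of strength the hierarchy is meant to provide.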
The experimental section applies PHIBP to real‑world disease reporting data (e.g., county‑level COVID‑19 and influenza counts). The inference targets are the posterior distributions of global rates H_ℓ and local rates σ_{j,ℓ}, from which the authors derive Bayesian estimates of alpha‑ and beta‑diversity measures and predictive distributions for diseases not yet observed in a given county. Results show that PHIBP yields substantially tighter credible intervals and higher predictive accuracy for zero‑count regions compared with compositional baselines. Moreover, the model’s ability to quantify uncertainty in diversity metrics provides actionable insight for public‑health decision makers allocating surveillance resources.
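Deriving diversity measures from posterior rates can be sketched as follows. The summary does not say which diversity indices the authors use, so Shannon entropy (alpha) and Bray‑Curtis dissimilarity (beta) are assumptions here, as are the toy counts and the conjugate Gamma draws that stand in for actual PHIBP posterior samples.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy counts for 2 regions x 4 diseases, with one unit of exposure each.
counts = np.array([[12, 0, 3, 0],
                   [9, 1, 0, 0]])
exposure = np.array([1.0, 1.0])

# Stand-in posterior draws of local rates (Gamma-Poisson conjugacy with a
# hypothetical Gamma(0.5, 1.0) prior); shape is (draws, regions, diseases).
a, b = 0.5, 1.0
draws = rng.gamma(shape=a + counts,
                  scale=1.0 / (b + exposure)[:, None],
                  size=(2000, 2, 4))

# Alpha diversity: Shannon entropy of each region's normalized rates, per draw.
p = draws / draws.sum(axis=-1, keepdims=True)
log_p = np.log(p, where=p > 0, out=np.zeros_like(p))
alpha = -(p * log_p).sum(axis=-1)

# Beta diversity: Bray-Curtis dissimilarity between the two regions, per draw.
x, y = draws[:, 0, :], draws[:, 1, :]
beta = np.abs(x - y).sum(axis=-1) / (x + y).sum(axis=-1)

# Posterior summaries: 95% credible intervals.
alpha_ci = np.percentile(alpha, [2.5, 97.5], axis=0)
beta_ci = np.percentile(beta, [2.5, 97.5])
print("alpha 95% CI per region:", np.round(alpha_ci.T, 3))
print("beta 95% CI:", np.round(beta_ci, 3))
```

Because the indices are computed on draws of absolute rates rather than on observed proportions, diseases with zero observed counts still contribute, and the spread of the draws gives an uncertainty statement for each metric.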
Finally, the paper outlines future extensions: (1) dynamic PHIBP models that incorporate temporal evolution of intensities for real‑time monitoring, (2) integration of covariates (demographics, climate, mobility) via tailored Lévy measures, and (3) leveraging AI/LLM tools to translate expert epidemiological knowledge into structured priors within the PHIBP hierarchy. In sum, the work demonstrates that absolute‑abundance, thinning‑based Bayesian hierarchies like PHIBP can overcome the limitations of relative‑rate methods, delivering robust, interpretable, and scalable predictions in data‑sparse infectious disease settings.