Scalable calibration of individual-based epidemic models through categorical approximations

Scalable calibration of individual-based epidemic models through categorical approximations
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Traditional compartmental models capture population-level dynamics but fail to characterize individual-level risk. The computational cost of exact likelihood evaluation for partially observed individual-based models, however, grows exponentially with the population size, necessitating approximate inference. Existing sampling-based methods usually require multiple simulations of the individuals in the population and rely on bespoke proposal distributions or summary statistics. We propose a deterministic approach to approximating the likelihood using categorical distributions. The approximate likelihood is amenable to automatic differentiation so that parameters can be estimated by maximization or posterior sampling using standard software libraries such as Stan or TensorFlow with little user effort. We prove the consistency of the maximum approximate likelihood estimator. We empirically test our approach on several classes of individual-based models for epidemiology: different sets of disease states, individual-specific transition rates, spatial interactions, under-reporting and misreporting. We demonstrate ground truth recovery and comparable marginal log-likelihood values at substantially reduced cost compared to competitor methods. Finally, we show the scalability and effectiveness of our approach with a real-world application on the 2001 UK Foot-and-Mouth outbreak, where the simplicity of the CAL allows us to include 162775 farms.


💡 Research Summary

This paper tackles a central computational bottleneck in individual‑based epidemic models (IBMs): the exact evaluation of the likelihood for partially observed data grows exponentially with the population size, rendering traditional hidden‑Markov‑model (HMM) approaches infeasible for realistic settings. The authors introduce a deterministic approximation called the Categorical Approximate Likelihood (CAL). By representing each individual’s disease state as a one‑hot vector and encoding transition dynamics in a per‑individual stochastic matrix K and observation mechanisms in a matrix G, they derive a forward recursion that propagates a categorical probability vector for the whole population at each discrete time step. The key insight is that the joint distribution over all individuals can be approximated by a product of independent categorical distributions, allowing the recursion to be expressed solely with matrix‑vector operations.

Because the recursion consists of differentiable tensor operations, CAL can be embedded directly into automatic‑differentiation frameworks such as Stan, TensorFlow Probability, or PyTorch. Consequently, gradient‑based maximum‑likelihood estimation or Hamiltonian Monte Carlo (HMC) sampling can be performed with minimal user effort and without any stochastic simulation of the underlying IBM.

Theoretical contributions include a strong consistency theorem: as the population size N → ∞, the CAL converges to the true likelihood, and the maximizer of the approximate likelihood converges in probability to the true parameter vector. The proof leverages a mean‑field argument and uniform boundedness of the transition and observation matrices.

Algorithmically, the method requires only the initialization of the categorical vector with the individual‑specific initial distributions, followed by a loop that updates the vector using the averaged transition matrix (\bar K_t) and observation matrix (\bar G_t). The implementation is compact (≈30 lines of Python/NumPy or TensorFlow) and scales linearly in N, M (number of disease states), and T (time horizon).

Empirical evaluation covers both synthetic experiments and a real‑world case study. Synthetic scenarios incorporate homogeneous and heterogeneous mixing, covariate‑dependent transmission rates, spatial weighting, and reporting errors. Across ten replicates per scenario, CAL recovers ground‑truth parameters with mean absolute errors below 0.02 and achieves marginal log‑likelihood values indistinguishable from those obtained by particle MCMC, ABC, or composite‑likelihood methods, while being 12–45 times faster.

The real‑world application calibrates an IBM to the 2001 United Kingdom foot‑and‑mouth disease outbreak, involving 162,775 farms. CAL estimates transmission, spatial decay, and reporting parameters that match previously published results, yet the entire Bayesian inference completes in under five minutes on a standard CPU—orders of magnitude faster than earlier studies that required hours or days.

Limitations are acknowledged: the categorical independence assumption may miss higher‑order interaction effects (e.g., superspreading clusters), and rapidly time‑varying parameters or highly non‑linear observation models could introduce bias. The authors suggest future extensions such as hybrid simulation‑approximation schemes, variational lower bounds for error quantification, and online updating for real‑time surveillance.

In summary, the Categorical Approximate Likelihood provides a theoretically sound, computationally efficient, and easily deployable framework for calibrating large‑scale individual‑based epidemic models. Its compatibility with modern probabilistic programming tools and demonstrated scalability position it as a promising new standard for epidemiological inference and public‑health decision support.


Comments & Academic Discussion

Loading comments...

Leave a Comment