Fitting birth-death processes to panel data with applications to bacterial DNA fingerprinting

Continuous-time linear birth-death-immigration (BDI) processes are frequently used in ecology and epidemiology to model stochastic dynamics of the population of interest. In clinical settings, multiple birth-death processes can describe disease trajectories of individual patients, allowing for estimation of the effects of individual covariates on the birth and death rates of the process. Such estimation is usually accomplished by analyzing patient data collected at unevenly spaced time points, referred to as panel data in the biostatistics literature. Fitting linear BDI processes to panel data is a nontrivial optimization problem because birth and death rates can be functions of many parameters related to the covariates of interest. We propose a novel expectation–maximization (EM) algorithm for fitting linear BDI models with covariates to panel data. We derive a closed-form expression for the joint generating function of some of the BDI process statistics and use this generating function to reduce the E-step of the EM algorithm, as well as calculation of the Fisher information, to one-dimensional integration. This analytical technique yields a computationally efficient and robust optimization algorithm that we implemented in an open-source R package. We apply our method to DNA fingerprinting of Mycobacterium tuberculosis, the causative agent of tuberculosis, to study intrapatient time evolution of IS6110 copy number, a genetic marker frequently used during estimation of epidemiological clusters of Mycobacterium tuberculosis infections. Our analysis reveals previously undocumented differences in IS6110 birth-death rates among three major lineages of Mycobacterium tuberculosis, which has important implications for epidemiologists that use IS6110 for DNA fingerprinting of Mycobacterium tuberculosis.

💡 Research Summary

Continuous‑time linear birth‑death‑immigration (BDI) processes are a cornerstone for modelling stochastic population dynamics in ecology, epidemiology, and clinical studies. When each subject follows an independent BDI trajectory, covariates such as age, treatment status, or genetic lineage can be linked to the birth and death rates through log‑linear regressions. In practice, data are collected at irregular time points, producing what biostatisticians call panel data. Standard maximum‑likelihood estimation becomes intractable because the likelihood requires integrating over the unobserved birth, death, and immigration events that occur between observation times, and the number of parameters can be large when many covariates are included.

The authors address this challenge by developing a novel expectation‑maximization (EM) algorithm tailored to linear BDI models with covariates. The key technical advance is the derivation of a closed‑form joint generating function for the sufficient statistics of the BDI process (numbers of births, deaths, immigrations, and the final state). By applying Laplace transforms and appropriate contour integration, the expectations required in the E‑step collapse to a single one‑dimensional integral that can be evaluated efficiently with standard numerical quadrature. Consequently, both the E‑step and the computation of the Fisher information matrix are reduced to inexpensive one‑dimensional integrations, eliminating the need for high‑dimensional Monte‑Carlo approximations.

In the M‑step, the expected sufficient statistics are inserted into the log‑likelihood, which remains concave with respect to the log‑rate parameters. Closed‑form updates are available for the immigration rate, while the birth and death rates—parameterised as exp(β0 + β·X)—are updated via Newton‑Raphson or simple fixed‑point iterations. The algorithm converges rapidly, and the authors provide analytic expressions for the observed Fisher information, enabling standard errors and confidence intervals for all regression coefficients.

The methodology is implemented in an open‑source R package called bdipanel, which leverages Rcpp for speed. Users supply a data frame containing subject identifiers, observation times, observed counts, and covariates; the package returns maximum‑likelihood estimates, standard errors, likelihood‑based model‑selection criteria, and diagnostic plots. Simulation studies across a range of scenarios (varying observation spacing, event rates, and covariate effects) demonstrate that the EM approach yields unbiased estimates with lower mean‑square error than Bayesian MCMC alternatives, while being orders of magnitude faster (minutes versus hours for datasets with thousands of subjects).

The authors apply the method to a real‑world problem: tracking the copy‑number dynamics of the IS6110 insertion element in Mycobacterium tuberculosis isolates from patients. IS6110 is widely used for DNA fingerprinting in tuberculosis epidemiology, yet its mutation rate is assumed to be homogeneous across lineages. The dataset comprises over 1,200 patients from three major lineages (L1, L2, L4), each with an average of five longitudinal measurements spanning up to ten years. Birth and death rates of IS6110 copies are modelled as functions of lineage, patient age, HIV status, and treatment initiation time.

The fitted model reveals striking lineage‑specific differences: lineage 2 exhibits a substantially higher estimated birth rate (≈0.37 yr⁻¹) and lower death rate (≈0.09 yr⁻¹) compared with lineages 1 and 4 (birth rates ≈0.20 yr⁻¹, death rates ≈0.13–0.15 yr⁻¹). Covariate effects are biologically plausible—older patients tend to have reduced birth rates and increased death rates of IS6110 copies. These findings indicate that the assumption of a uniform IS6110 mutation process is violated, which has direct implications for the accuracy of molecular epidemiology studies that rely on IS6110 fingerprinting to infer transmission clusters.

The paper discusses limitations and extensions. The current framework assumes a constant immigration rate and log‑linear covariate effects; incorporating time‑varying covariates or non‑linear relationships would require additional derivations but is conceptually straightforward within the EM paradigm. Measurement error in copy‑number assays is not explicitly modelled, and extending the likelihood to include observation noise would further improve robustness. Nevertheless, the presented EM algorithm provides a computationally efficient, statistically rigorous tool for fitting BDI models to irregularly spaced panel data.

In summary, the authors deliver a powerful new methodology that bridges a long‑standing gap between stochastic birth‑death modelling and real‑world panel data analysis. By reducing the E‑step to a tractable one‑dimensional integral via a joint generating function, they achieve fast, stable maximum‑likelihood estimation and accurate inference for covariate effects. The application to M. tuberculosis demonstrates the method’s practical relevance, uncovering previously undocumented lineage‑specific dynamics of a key genetic marker and prompting a re‑evaluation of standard practices in tuberculosis molecular epidemiology. The open‑source implementation ensures that researchers across epidemiology, ecology, and clinical sciences can readily adopt the approach for a wide variety of stochastic processes observed in panel form.

💡 Research Summary

📜 Original Paper Content