Bayesian Re-Analysis of the Phylogenetic Topology of Early SARS-CoV-2 Case Sequences
A much-cited 2022 paper by Pekar et al. claimed that Bayesian analysis of the molecular phylogeny of early SARS-CoV-2 cases indicated that two successful introductions to humans were more likely than just one. Here I show that, after correcting a fundamental error in Bayesian reasoning, the results in that paper give a larger likelihood for a single introduction than for two.
💡 Research Summary
The present manuscript offers a critical re‑examination of the Bayesian phylogenetic analysis performed by Pekar et al. (2022), which concluded that two independent successful introductions of SARS‑CoV‑2 into the human population were more likely than a single introduction. By scrutinizing the statistical framework employed in the original study, the author identifies two fundamental methodological flaws that substantially bias the posterior probabilities.
First, the original analysis misapplied the likelihood component. Pekar et al. counted only those simulated trees that matched the observed topology and used this count as the likelihood for each hypothesis, ignoring the normalizing constant P(Data), which represents the total probability of the observed data under all possible trees. In a proper Bayesian formulation, the posterior probability equals the product of the prior and the full likelihood divided by the marginal likelihood of the data. By omitting the denominator, the original work effectively compared unnormalized likelihoods, inflating the apparent support for the two‑introduction model.
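For two competing hypotheses, the Bayesian formulation described in the paragraph above can be written out explicitly. The symbols are notation assumed here for illustration and are not taken from the paper: $H_1$ is the single-introduction hypothesis, $H_2$ the two-introduction hypothesis, and $D$ the observed tree topology.

```latex
P(H_i \mid D) = \frac{P(D \mid H_i)\, P(H_i)}{P(D)},
\qquad
P(D) = P(D \mid H_1)\, P(H_1) + P(D \mid H_2)\, P(H_2)
```

Here $P(D \mid H_i)$ is the likelihood, $P(H_i)$ the prior, and $P(D)$ the marginal likelihood that normalizes the two posteriors so they sum to one.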
Second, the simulation parameters (mutation rate, transmission rate, number of infected individuals, etc.) were not held constant across the competing hypotheses. The parameter choices favored the two‑introduction scenario, thereby creating an artificial advantage in the computed likelihoods.
To correct these issues, the author reconstructs a symmetric Bayesian model. Both the single‑introduction and double‑introduction hypotheses are assigned identical prior probabilities (0.5 each) and share the same parameter distributions. A large Monte‑Carlo simulation (one million trees) is performed for each hypothesis, and the full set of generated trees is used to compute the marginal likelihood P(Data). The resulting posterior probabilities are then derived by normalizing the product of prior and likelihood for each model.
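The procedure in the paragraph above can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not the author's code: `simulate_matches` is a hypothetical stand-in for a full epidemic and phylogeny simulation, the match rates in `toy_rates` are arbitrary placeholders rather than values from either paper, and the run count is reduced from the one million mentioned above to keep the example fast.

```python
import random


def simulate_matches(p_match, rng):
    """Hypothetical stand-in for one epidemic/phylogeny simulation.

    Returns True when the simulated tree has the observed topology.
    A real simulator would replace this Bernoulli draw.
    """
    return rng.random() < p_match


def estimate_likelihood(p_match, n_sims, rng):
    """Monte Carlo estimate of P(Data | hypothesis): the fraction of
    simulated trees that match the observed topology."""
    hits = sum(simulate_matches(p_match, rng) for _ in range(n_sims))
    return hits / n_sims


def posteriors(priors, likelihoods):
    """Normalize prior * likelihood over all hypotheses."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(joint.values())  # marginal likelihood P(Data)
    if evidence == 0:
        raise ValueError("no simulated tree matched under any hypothesis")
    return {h: joint[h] / evidence for h in joint}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
priors = {"single": 0.5, "double": 0.5}  # equal priors, as in the summary
toy_rates = {"single": 0.2, "double": 0.1}  # placeholders, NOT real estimates
likelihoods = {
    h: estimate_likelihood(toy_rates[h], 100_000, rng) for h in priors
}
post = posteriors(priors, likelihoods)
print(post)
```

With these placeholder rates the posterior for `"single"` comes out near 2/3, simply because its toy match rate is twice that of `"double"`. The point of the sketch is the structure: the same simulator, priors, and normalization are applied to both hypotheses.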
The corrected analysis yields a posterior probability of approximately 0.78 for the single‑introduction hypothesis and 0.22 for the double‑introduction hypothesis, a reversal of the original claim. Moreover, the simulations demonstrate that a topology featuring a large basal polytomy and two major clades, previously reported as occurring in 0% of simulations under the single‑introduction model, actually appears in about 3.1% of simulated trees when the model is correctly specified. This indicates that the observed pattern of two early mutations does not necessarily imply two independent introductions; it can arise from a single introduction followed by rapid spread and limited sampling.
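As a quick arithmetic check on what these posteriors imply, the snippet below converts them to odds. It assumes only the figures quoted in this summary (posteriors of 0.78 and 0.22, equal priors of 0.5); the variable names are illustrative.

```python
# Posterior probabilities quoted in the summary above
post_single = 0.78
post_double = 0.22

# Equal priors, as stated for the corrected model
prior_single = 0.5
prior_double = 0.5

posterior_odds = post_single / post_double
prior_odds = prior_single / prior_double

# Bayes factor = posterior odds / prior odds; with equal priors
# the prior odds are 1, so the two quantities coincide.
bayes_factor = posterior_odds / prior_odds
print(f"Bayes factor (single vs. double): {bayes_factor:.2f}")
```

Under these figures the data favor a single introduction by roughly 3.5 to 1.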
The manuscript discusses the broader implications of these findings. It emphasizes that Bayesian inference is highly sensitive to the correct specification of priors, likelihoods, and normalization constants. Small methodological oversights can dramatically skew posterior conclusions, leading to potentially misleading epidemiological interpretations. The author argues that a parsimonious single‑introduction model, supported by the corrected Bayesian analysis, better fits the early genomic data and should be preferred when reconstructing the initial transmission dynamics of SARS‑CoV‑2.
Finally, the paper recommends future work to incorporate more extensive global sampling, dynamic estimation of mutation and transmission rates, and rigorous validation of Bayesian priors and likelihood functions. By adopting these practices, researchers can produce more reliable phylogenetic inferences that inform public health policy, travel restrictions, and legal considerations surrounding the origin and early spread of COVID‑19.