Bayesian Computation and Model Selection in Population Genetics


Until recently, the use of Bayesian inference in population genetics was limited to a few cases because for many realistic population genetic models the likelihood function cannot be calculated analytically. The situation changed with the advent of likelihood-free inference algorithms, often subsumed under the term Approximate Bayesian Computation (ABC). A key innovation was the use of a post-sampling regression adjustment, allowing larger tolerance values and thereby shifting computation time to realistic orders of magnitude (see Beaumont et al., 2002). Here we propose a reformulation of the regression adjustment in terms of a General Linear Model (GLM). This allows the integration into the framework of Bayesian statistics and the use of its methods, including model selection via Bayes factors. We then apply the proposed methodology to the question of population subdivision among western chimpanzees, Pan troglodytes verus.


💡 Research Summary

The paper addresses a long‑standing limitation of Bayesian inference in population genetics: for most realistic demographic models the likelihood cannot be evaluated analytically, which historically restricted Bayesian methods to a handful of tractable cases. The emergence of likelihood‑free algorithms, collectively known as Approximate Bayesian Computation (ABC), opened the door to Bayesian analysis of complex models, but early implementations required extremely small tolerance thresholds (ε) to obtain accurate posterior approximations, leading to prohibitive computational costs. Beaumont et al. (2002) introduced a post‑sampling regression adjustment that allowed larger ε values by correcting simulated parameter draws toward the observed summary statistics, thereby shifting much of the computational burden to a regression step. However, this adjustment was not embedded in a formal Bayesian framework; consequently, it could not be used directly for model comparison because model evidence (the marginal likelihood) remained inaccessible.
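The accept/reject mechanism with tolerance ε described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the simulator, prior, and all variable names here are invented stand-ins for a real coalescent simulation.

```python
# Toy sketch of plain rejection ABC (pre-regression-adjustment style).
# All names are illustrative; a real application would call a coalescent
# simulator and compute genetic summary statistics instead.
import numpy as np

rng = np.random.default_rng(0)

def simulate_summary(theta, rng):
    # Stand-in for an intractable-likelihood simulator: the summary
    # statistic is the parameter plus noise.
    return theta + rng.normal(0.0, 1.0)

def rejection_abc(s_obs, prior_sample, n_sims, eps, rng):
    """Keep (theta_i, s_i) pairs whose summaries fall within eps of s_obs."""
    accepted = []
    for _ in range(n_sims):
        theta = prior_sample(rng)
        s = simulate_summary(theta, rng)
        if abs(s - s_obs) <= eps:  # tolerance threshold epsilon
            accepted.append((theta, s))
    return accepted

prior = lambda rng: rng.uniform(-5.0, 5.0)
kept = rejection_abc(s_obs=1.0, prior_sample=prior, n_sims=20000, eps=0.5, rng=rng)
```

Shrinking `eps` makes the accepted sample a better posterior approximation but discards more simulations, which is exactly the cost the regression adjustment alleviates.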

In response, the authors propose to reinterpret the regression adjustment as a General Linear Model (GLM). After generating a large set of simulated parameter–summary statistic pairs (θ_i, s_i) from the prior, they retain only those simulations whose summary statistics lie within ε of the observed statistics s_obs. For the retained set they fit a GLM of the form

 θ_i = β_0 + βᵀ (s_i – s_obs) + ε_i,

where ε_i ~ N(0, σ²) and β, σ² are assigned prior distributions. By performing Bayesian inference on the GLM parameters (via MCMC or variational methods) they obtain a posterior distribution for β and σ², which in turn yields an analytically tractable, regression‑adjusted posterior for the original parameters θ given s_obs. Because the posterior is now expressed explicitly, the marginal likelihood p(s_obs | M) can be estimated by integrating over the GLM parameters, enabling the calculation of Bayes factors for model selection.
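The adjustment itself can be sketched with an ordinary least-squares fit standing in for the paper's fully Bayesian GLM (the paper places priors on β and σ²; here we use point estimates for brevity, and the toy data are invented). Each accepted draw is projected to s = s_obs via θ*_i = θ_i − β̂ᵀ(s_i − s_obs):

```python
# Sketch of the regression adjustment on accepted ABC draws.
# Least-squares point estimates stand in for the paper's Bayesian GLM.
import numpy as np

def regression_adjust(thetas, summaries, s_obs):
    """Fit theta_i = beta0 + beta^T (s_i - s_obs) + noise, then return
    the adjusted draws theta_i - beta^T (s_i - s_obs)."""
    thetas = np.asarray(thetas, dtype=float)
    S = np.atleast_2d(np.asarray(summaries, dtype=float))
    if S.shape[0] != len(thetas):
        S = S.T  # ensure one row per accepted draw
    X = np.column_stack([np.ones(len(thetas)), S - s_obs])
    coef, *_ = np.linalg.lstsq(X, thetas, rcond=None)
    beta = coef[1:]
    return thetas - (S - s_obs) @ beta

# Toy data: summaries linearly related to the parameter plus noise.
rng = np.random.default_rng(1)
theta = rng.normal(0.0, 2.0, size=500)
s = theta + rng.normal(0.0, 0.2, size=500)
adjusted = regression_adjust(theta, s[:, None], s_obs=0.0)
```

Because the toy summaries carry most of the information about θ, the adjusted draws concentrate sharply around the value consistent with s_obs; their variance is far smaller than that of the raw accepted draws.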

The methodological contribution has two practical implications. First, the GLM adjustment tolerates much larger ε values without sacrificing posterior accuracy, dramatically reducing the number of required simulations. Second, the ability to compute model evidence brings ABC into the full Bayesian paradigm, allowing rigorous comparison of competing demographic scenarios.
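Once each model's marginal likelihood p(s_obs | M) is available, model comparison reduces to a ratio. The sketch below shows only that final arithmetic; the log-evidence values are invented for illustration and are not from the paper.

```python
# Hypothetical sketch: converting two estimated log marginal likelihoods
# into a Bayes factor. The numbers below are made up for illustration.
import math

log_evidence_A = -152.3  # log p(s_obs | M_A)
log_evidence_B = -149.8  # log p(s_obs | M_B)

# Work in log space to avoid underflow of tiny evidence values.
bayes_factor_BA = math.exp(log_evidence_B - log_evidence_A)
```

A Bayes factor above about 10 is conventionally read as "strong" evidence for the favored model on Jeffreys' scale.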

To demonstrate the approach, the authors apply it to the problem of population subdivision in western chimpanzees (Pan troglodytes verus). They formulate two competing models: (A) a single panmictic population and (B) a two‑subpopulation model reflecting a hypothesized geographic split. For each model they specify prior distributions on effective population sizes, migration rates, and divergence times, then simulate genetic data with coalescent simulation software. Summary statistics include mean heterozygosity, pairwise F_ST, and linkage‑disequilibrium measures. After performing the GLM‑based regression adjustment, they estimate the marginal likelihoods of both models. Model comparison yields a Bayes factor of roughly 12 in favor of the two‑subpopulation model B, indicating strong evidence for subdivision. Sensitivity analyses with alternative sets of summary statistics show that while the absolute Bayes factor varies, the qualitative conclusion remains robust.
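As a reminder of what one such summary statistic measures, here is Wright's F_ST for a single biallelic locus in two equally sized subpopulations. This is a textbook formula, not the paper's estimator, and the allele frequencies are invented.

```python
# Illustrative sketch: Wright's F_ST = (H_T - H_S) / H_T for two
# equally sized subpopulations at one biallelic locus.
def fst_two_pops(p1, p2):
    """p1, p2: frequency of one allele in each subpopulation."""
    p_bar = (p1 + p2) / 2.0
    h_t = 2.0 * p_bar * (1.0 - p_bar)            # total expected heterozygosity
    h_s = (2*p1*(1-p1) + 2*p2*(1-p2)) / 2.0      # mean within-pop heterozygosity
    return (h_t - h_s) / h_t

print(round(fst_two_pops(0.2, 0.6), 4))  # → 0.1667
```

F_ST is 0 when allele frequencies coincide and grows as they diverge, which is why it discriminates between the panmictic and subdivided models.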

Computationally, the authors report that using an ε up to five times larger than in traditional ABC still yields accurate posterior adjustments, cutting the total runtime to a few tens of hours on a modern computing cluster—a feasible scale for many research groups.

In summary, the paper provides a theoretically sound and practically efficient bridge between ABC and full Bayesian model selection. By casting the regression adjustment as a GLM, it restores the ability to compute Bayes factors, thereby enabling principled comparison of complex demographic hypotheses. The framework is general and can be extended to other fields that rely on simulation‑based inference, such as ecological modeling, epidemiology, and systems biology, where likelihoods are intractable but summary statistics are available.

