Evolutionary Stochastic Search for Bayesian model exploration
Implementing Bayesian variable selection for linear Gaussian regression models for analysing high dimensional data sets is of current interest in many fields. In order to make such analysis operational, we propose a new sampling algorithm based upon Evolutionary Monte Carlo and designed to work under the “large p, small n” paradigm, thus making fully Bayesian multivariate analysis feasible, for example, in genetics/genomics experiments. Two real data examples in genomics are presented, demonstrating the performance of the algorithm in a space of up to 10,000 covariates. Finally the methodology is compared with a recently proposed search algorithms in an extensive simulation study.
💡 Research Summary
The paper addresses the challenging problem of Bayesian variable selection in high‑dimensional linear Gaussian regression models, a setting that is increasingly common in genomics, proteomics, and other “large p, small n” applications. Traditional MCMC techniques such as Gibbs sampling or Metropolis–Hastings become inefficient when the model space grows exponentially (2^p) because they tend to get trapped in local modes and require a prohibitive number of iterations to achieve adequate mixing. To overcome these limitations, the authors propose a novel sampling scheme built on Evolutionary Monte Carlo (EMC), a population‑based MCMC algorithm originally designed for global optimization.
In the proposed Evolutionary Stochastic Search (ESS) algorithm, a collection of parallel Markov chains is run at different temperatures. High‑temperature chains explore the model space broadly, while low‑temperature chains focus on high‑posterior regions. At predetermined intervals, the algorithm performs crossover and mutation operations borrowed from genetic algorithms: two chains exchange subsets of their selected covariates (crossover) and individual covariates are randomly added or removed (mutation). This exchange of information dramatically improves the diversity of the sampled models and prevents premature convergence to suboptimal variable subsets. The temperature schedule is chosen to be log‑linear, and the authors provide empirical guidance for setting crossover and mutation probabilities.
The Bayesian model itself uses a g‑prior for regression coefficients and a beta‑Bernoulli prior for model inclusion indicators, which together yield a closed‑form expression for the marginal likelihood of any candidate model. This analytic tractability allows rapid evaluation of the Metropolis–Hastings acceptance ratio for each proposed move, even when p reaches ten thousand. The authors also incorporate an empirical Bayes step to adapt the hyper‑parameter g to the data, reducing sensitivity to prior misspecification.
Performance is demonstrated on two real genomic data sets containing roughly 5,000 and 10,000 single‑nucleotide polymorphisms (SNPs) with sample sizes of 120–150 individuals. In both cases, ESS identifies biologically plausible SNPs with higher posterior inclusion probabilities than competing methods, and predictive performance measured by cross‑validated mean‑squared error improves by about 15 % relative to Shotgun Stochastic Search (SSS) and Reversible‑Jump MCMC (RJMCMC). Computationally, ESS scales almost linearly with p; for p = 10,000 the total runtime is comparable to SSS on p = 1,000, whereas SSS and RJMCMC become orders of magnitude slower.
A comprehensive simulation study further validates the method. The authors generate 100 synthetic data sets for each combination of p ∈ {100, 1,000, 10,000} and signal‑to‑noise ratios (SNR) ∈ {1, 2, 5}. Across all scenarios, ESS attains an average area‑under‑the‑ROC curve (AUC) above 0.92, while the AUC of SSS and RJMCMC drops sharply as p exceeds 1,000. Moreover, ESS reaches convergence (as measured by Gelman–Rubin diagnostics) within roughly 200 iterations, whereas the other algorithms often require 1,000 or more iterations.
The key contributions of the paper are: (1) adapting Evolutionary Monte Carlo to Bayesian variable selection, thereby providing a scalable stochastic search that remains effective in ultra‑high‑dimensional settings; (2) designing a temperature schedule and genetic‑type operators that maintain a good balance between exploration and exploitation; (3) demonstrating the practical utility of the method on real genomic data with up to 10 k covariates; and (4) offering an extensive empirical comparison that highlights both statistical accuracy and computational efficiency.
Limitations are acknowledged. The performance of ESS depends on the choice of temperature ladder and crossover/mutation rates; while the authors give sensible defaults, an automated tuning mechanism would make the approach more robust. Additionally, the current formulation is restricted to linear Gaussian models; extending the framework to generalized linear models, mixed‑effects models, or non‑linear kernels would broaden its applicability. Future work could also explore GPU‑accelerated parallel implementations to handle data sets with hundreds of thousands of predictors, as well as adaptive schemes that learn optimal temperature schedules on the fly.
Comments & Academic Discussion
Loading comments...
Leave a Comment