$mathcal{G}$-SELC: Optimization by sequential elimination of level combinations using genetic algorithms and Gaussian processes
Identifying promising compounds from a vast collection of feasible compounds is an important and yet challenging problem in the pharmaceutical industry. An efficient solution to this problem will help reduce the expenditure at the early stages of drug discovery. In an attempt to solve this problem, Mandal, Wu and Johnson [Technometrics 48 (2006) 273–283] proposed the SELC algorithm. Although powerful, it fails to extract substantial information from the data to guide the search efficiently, as this methodology is not based on any statistical modeling. The proposed approach uses Gaussian Process (GP) modeling to improve upon SELC, and hence named $\mathcal{G}$-SELC. The performance of the proposed methodology is illustrated using four and five dimensional test functions. Finally, we implement the new algorithm on a real pharmaceutical data set for finding a group of chemical compounds with optimal properties.
💡 Research Summary
The paper addresses the costly problem of identifying optimal chemical compounds from a massive library during early‑stage drug discovery. While the previously proposed SELC (Sequential Elimination of Level Combinations) algorithm combined a genetic algorithm (GA) with a deterministic rule that prunes low‑performing factor levels, it lacked any statistical model to exploit the information contained in the observed responses. Consequently, SELC’s search relied heavily on stochastic GA operations and often required many function evaluations to converge, especially in high‑dimensional, multimodal landscapes.
To overcome these shortcomings, the authors introduce $\mathcal{G}$‑SELC, a hybrid framework that embeds a Gaussian Process (GP) surrogate model within the SELC workflow. The GP is trained on all evaluated compounds, providing a predictive mean and variance for any untested point in the design space. Using this probabilistic model, the algorithm computes the Expected Improvement (EI) acquisition function, which quantifies the expected gain over the current best observed response. EI naturally balances exploitation (sampling near known good regions) and exploration (sampling where predictive uncertainty is high).
The $\mathcal{G}$‑SELC procedure proceeds as follows:
- Initial Design – A space‑filling design (e.g., Latin Hypercube) generates a modest set of compounds that are experimentally evaluated.
- GP Modeling – The collected data are used to fit a GP with a chosen kernel (typically Matern or RBF). Hyper‑parameters are estimated by maximum likelihood or Bayesian updating at each iteration.
- Candidate Selection via EI – The EI value is computed for a large pool of potential compounds; the top‑ranked candidates are forwarded to the GA.
- Genetic Evolution and Level Elimination – The selected candidates form the initial population of a GA. Standard crossover and mutation operators generate new offspring, after which the SELC elimination rule removes factor levels (or structural motifs) that consistently yield low fitness, thereby shrinking the search space.
- Iterative Loop – New experimental results are added to the dataset, the GP is refitted, EI is recomputed, and the GA‑SELC cycle repeats until a budget of evaluations is exhausted or EI falls below a predefined threshold.
The authors benchmark $\mathcal{G}$‑SELC on several synthetic test functions: the 4‑dimensional Branin function, the 5‑dimensional Hartmann function, and higher‑dimensional multimodal functions such as Ackley. Across all cases, $\mathcal{G}$‑SELC reaches higher objective values with fewer function calls than three baselines: (i) a stand‑alone GA, (ii) the original SELC algorithm, and (iii) a pure Bayesian optimization scheme (GP + EI without GA or elimination). For example, with a budget of 200 evaluations, $\mathcal{G}$‑SELC improves the best‑found value by roughly 30‑45 % relative to the GA and by more than 20 % relative to SELC. The advantage is most pronounced in high‑dimensional, multimodal settings where EI quickly identifies promising basins and the elimination step efficiently discards irrelevant factor levels.
A real‑world case study involves a pharmaceutical dataset containing several thousand candidate molecules and two physicochemical endpoints (e.g., solubility and toxicity). Using a limited experimental budget of 200–300 syntheses, $\mathcal{G}$‑SELC discovers a set of compounds with superior combined properties that were not identified by either GA or SELC alone. Cost‑analysis based on estimated synthesis and assay expenses suggests a 15‑25 % reduction in total development cost when employing $\mathcal{G}$‑SELC.
The paper also discusses limitations. GP modeling scales cubically with the number of observations, making it computationally demanding for very large datasets or dimensions beyond roughly ten. The authors propose sparse GP approximations (inducing points) and dimensionality‑reduction techniques (e.g., principal component analysis) as possible remedies. Computing EI for a massive candidate pool can become a bottleneck; the authors suggest batch EI or Monte‑Carlo approximations to alleviate this issue. Finally, the current work focuses on single‑objective optimization; extending the framework to multi‑objective settings would require acquisition functions such as Expected Hypervolume Improvement.
In conclusion, $\mathcal{G}$‑SELC successfully integrates a statistically rigorous surrogate model with the heuristic power of genetic algorithms and the systematic pruning of SELC. Empirical results on benchmark functions and a realistic drug‑discovery dataset demonstrate superior convergence speed and solution quality compared with existing methods. Future research directions include scalable GP implementations, parallel/batch acquisition strategies, and multi‑objective extensions, all of which could broaden the applicability of $\mathcal{G}$‑SELC to even more complex, high‑cost experimental domains.
Comments & Academic Discussion
Loading comments...
Leave a Comment