Informing Acquisition Functions via Foundation Models for Molecular Discovery
Bayesian Optimization (BO) is a key methodology for accelerating molecular discovery by estimating the mapping from molecules to their properties while seeking the optimal candidate. Typically, BO iteratively updates a probabilistic surrogate model of this mapping and optimizes acquisition functions derived from the model to guide molecule selection. However, its performance is limited in low-data regimes with insufficient prior knowledge and vast candidate spaces. Large language models (LLMs) and chemistry foundation models offer rich priors to enhance BO, but high-dimensional features, costly in-context learning, and the computational burden of deep Bayesian surrogates hinder their full utilization. To address these challenges, we propose a likelihood-free BO method that bypasses explicit surrogate modeling and directly leverages priors from general LLMs and chemistry-specific foundation models to inform acquisition functions. Our method also learns a tree-structured partition of the molecular search space with local acquisition functions, enabling efficient candidate selection via Monte Carlo Tree Search. By further incorporating coarse-grained LLM-based clustering, it substantially improves scalability to large candidate sets by restricting acquisition function evaluations to clusters with statistically higher property values. We show through extensive experiments and ablations that the proposed method substantially improves scalability, robustness, and sample efficiency in LLM-guided BO for molecular discovery.
💡 Research Summary
The paper tackles a fundamental bottleneck in applying Bayesian Optimization (BO) to molecular discovery: the difficulty of learning an accurate surrogate model when data are scarce and the candidate space is astronomically large. Traditional BO iteratively fits a probabilistic surrogate (usually a Gaussian Process or a deep Bayesian network) on a handful of experimentally measured molecules, then derives acquisition functions such as Expected Improvement (EI) or Probability of Improvement (PI) to select the next candidate. In high‑dimensional chemical spaces (SMILES strings, graph embeddings) and with candidate libraries ranging from millions to billions, two problems arise. First, building a reliable surrogate requires high‑dimensional feature representations and extensive training, which is computationally prohibitive. Second, with fewer than a dozen labeled points, the surrogate becomes unstable, leading to poor acquisition decisions.
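The Expected Improvement and Probability of Improvement criteria mentioned above have simple closed forms when the surrogate's prediction at a candidate is Gaussian. A minimal sketch of these standard formulas (the paper's exact acquisition variants are not reproduced here):

```python
import math

def expected_improvement(mu: float, sigma: float, best: float) -> float:
    """Closed-form EI for a Gaussian predictive N(mu, sigma^2), maximization."""
    if sigma <= 0.0:
        # Degenerate case: no predictive uncertainty left.
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best) * cdf + sigma * pdf

def probability_of_improvement(mu: float, sigma: float, best: float) -> float:
    """PI: probability that the candidate beats the incumbent best value."""
    if sigma <= 0.0:
        return 1.0 if mu > best else 0.0
    z = (mu - best) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

Both scores grow with the predictive mean and, for EI, with the predictive spread, which is exactly why an unstable surrogate in the low-data regime corrupts the selection step.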
To overcome these issues, the authors propose a likelihood‑free BO framework that completely bypasses explicit surrogate modeling. Instead of learning a mapping from molecules to properties, they directly inject the prior knowledge encoded in large language models (LLMs) and chemistry‑specific foundation models into the acquisition function itself. Concretely, each candidate SMILES is fed to a general LLM (e.g., a GPT‑style model fine‑tuned on chemical text) to obtain a log‑probability score, reflecting how “plausible” the molecule is under the model’s learned chemistry. Simultaneously, a chemistry foundation model such as ChemBERTa or MolBERT provides a dense embedding. The two signals are combined into a prior expectation term that serves as the core of the acquisition score. Because this term is derived from pre‑trained models, no data‑driven likelihood is needed, eliminating the need for costly Bayesian inference.
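The summary does not spell out how the two signals are fused, but one plausible sketch blends the LLM log-probability with the embedding's cosine similarity to the best molecules observed so far. The function name, the weighting scheme, and the similarity choice are illustrative assumptions, not the paper's formula:

```python
import math

def prior_acquisition(log_prob: float, embedding: list,
                      top_embeddings: list, alpha: float = 0.5) -> float:
    """Hypothetical prior-only acquisition score: a weighted blend of an
    LLM log-probability and similarity to embeddings of top molecules."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0
    # Best similarity to any known high-performing molecule.
    sim = max((cosine(embedding, e) for e in top_embeddings), default=0.0)
    return alpha * log_prob + (1.0 - alpha) * sim
```

Because both inputs come from frozen pre-trained models, the score can be computed for any candidate without fitting a likelihood, which is the point of the likelihood-free design.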
The second key component is a tree‑structured partition of the search space. The authors recursively split the candidate set based on chemically meaningful criteria (e.g., presence of functional groups, molecular weight ranges) to form a hierarchical tree. Each node represents a sub‑space and is equipped with a local acquisition function that evaluates the node’s prior expectation mean and variance. This localized view allows the algorithm to focus computational effort on promising regions while still maintaining a global view of the entire library.
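The recursive partition can be sketched as follows, using a median molecular-weight split as a stand-in for the chemically meaningful criteria (functional groups, weight ranges) described above; the data layout and `min_leaf` cutoff are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    molecules: list                       # (smiles, mol_weight, prior_score) tuples
    children: list = field(default_factory=list)

def build_tree(molecules, min_leaf=2):
    """Recursively bisect the candidate set at the median molecular weight."""
    node = Node(molecules)
    if len(molecules) > min_leaf:
        ordered = sorted(molecules, key=lambda m: m[1])   # sort by weight
        mid = len(ordered) // 2
        node.children = [build_tree(ordered[:mid], min_leaf),
                         build_tree(ordered[mid:], min_leaf)]
    return node

def local_acquisition(node):
    """Local acquisition signal: mean and variance of the node's prior scores."""
    scores = [m[2] for m in node.molecules]
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    return mean, var
```

Each node thus summarizes its sub-space by the first two moments of the prior scores, which is what the search procedure consumes.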
To navigate the tree efficiently, the authors adopt Monte Carlo Tree Search (MCTS). At each iteration, MCTS selects a node using an Upper Confidence Bound (UCB)‑like criterion that balances the node's estimated reward (derived from the prior expectation) against the uncertainty of that estimate. The selected node is then expanded, and a small batch of molecules from that sub‑space is evaluated experimentally (or via high‑fidelity simulation). The observed outcomes are fed back to update the statistics of the visited nodes, gradually refining the tree's belief about where high‑performing molecules reside. Because the acquisition function is evaluated only at the node level, the algorithm avoids the O(N) cost of scoring every candidate; instead, the cost scales with the depth of the tree and the number of explored nodes.
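The selection and back-propagation steps can be sketched with the classic UCB1 rule; the dictionary node layout and the exploration constant `c` are illustrative assumptions, not the paper's exact criterion:

```python
import math

def ucb_select(children, parent_visits, c=1.4):
    """Pick the child maximizing mean reward + exploration bonus (UCB1).
    Unvisited children are always tried first."""
    def score(node):
        if node["visits"] == 0:
            return float("inf")
        return (node["reward"] / node["visits"]
                + c * math.sqrt(math.log(parent_visits) / node["visits"]))
    return max(children, key=score)

def backpropagate(path, outcome):
    """Feed an observed property value back up the visited path."""
    for node in path:
        node["visits"] += 1
        node["reward"] += outcome
```

Repeating select → evaluate → backpropagate concentrates visits on sub-spaces whose measured outcomes keep exceeding their prior-based estimates.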
A third innovation is LLM‑driven clustering for scalability. Before the BO loop begins, the entire candidate library is embedded using the same LLM that provides the log‑probability scores. These embeddings are clustered (e.g., via K‑means or hierarchical clustering) into a modest number K of groups. For each cluster, the average prior expectation is pre‑computed. During MCTS, the algorithm first selects a cluster based on these averages, then proceeds down the tree inside that cluster. This two‑level restriction dramatically reduces the number of acquisition function evaluations: instead of evaluating all N candidates, the method evaluates only a subset within the top‑ranked clusters, achieving an effective computational complexity of O(K log N).
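The two-level restriction can be sketched with a minimal k-means over embedding vectors followed by a cluster-level prior lookup. K-means is one of the clustering options the summary names; the tiny stdlib implementation and the `prior_score` callback here are illustrative assumptions:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means over embedding vectors (stand-in for LLM embeddings)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[nearest].append(p)
        # Recompute centroids; keep the old one if a cluster emptied out.
        centroids = [[sum(dim) / len(cl) for dim in zip(*cl)] if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return clusters

def select_cluster(clusters, prior_score):
    """Keep only the cluster with the highest pre-computed mean prior.
    Assumes non-empty clusters for brevity."""
    means = [sum(prior_score(p) for p in cl) / len(cl) for cl in clusters]
    return max(range(len(clusters)), key=lambda i: means[i])
```

Tree search then runs only inside the selected cluster, which is where the reduction from N candidate evaluations to a per-cluster subset comes from.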
The authors validate their approach on three benchmarks: (1) QM9, a dataset of ~133 k small organic molecules with quantum‑chemical properties; (2) ZINC, a public library of ~2.5 M drug‑like compounds; and (3) a proprietary drug‑discovery set of ~10 k molecules with experimentally measured bioactivity. In each case, the budget of true evaluations is limited to 30–50 measurements, reflecting realistic low‑data scenarios. The proposed method is compared against (a) classic GP‑BO, (b) DeepBO (deep neural surrogate), and (c) a recent LLM‑augmented BO that still relies on a learned surrogate. Across all metrics—best‑found property value, convergence speed, and robustness to random seeds—the likelihood‑free, tree‑guided approach outperforms baselines by a factor of 2–3 in sample efficiency. Notably, when only 5 labeled points are available, the method still converges to high‑quality candidates, whereas GP‑BO and DeepBO often stall or diverge. Ablation studies show that (i) removing the clustering step increases evaluation cost by ~70 % without improving final performance, (ii) shallow trees (depth ≤ 2) reduce exploration capability, and (iii) varying the LLM prompt length modestly affects the prior expectation but does not overturn the overall advantage.
In summary, the paper makes three major contributions:
- Prior‑only acquisition – By leveraging the probabilistic knowledge embedded in LLMs and chemistry foundation models, the method eliminates the need for an explicit surrogate and its associated likelihood computation.
- Tree‑structured, local acquisition with MCTS – This design enables efficient navigation of massive, high‑dimensional chemical spaces while preserving a principled exploration‑exploitation trade‑off.
- Cluster‑level pruning – Coarse‑grained LLM clustering restricts acquisition evaluations to statistically promising regions, delivering substantial scalability gains.
The proposed framework not only advances BO for molecular discovery but also offers a template for other scientific domains where large pre‑trained models exist (materials design, catalyst optimization, protein engineering). By turning “foundation models” into actionable priors rather than mere feature extractors, the work opens a new pathway for data‑efficient, large‑scale scientific optimization.