Amortized Molecular Optimization via Group Relative Policy Optimization
Molecular design encompasses tasks ranging from de-novo design to structural alteration of given molecules or fragments. For the latter, state-of-the-art methods predominantly function as “Instance Optimizers”, expending significant compute restarting the search for every input structure. While model-based approaches theoretically offer amortized efficiency by learning a policy transferable to unseen structures, existing methods struggle to generalize. We identify a key failure mode: the high variance arising from the heterogeneous difficulty of distinct starting structures. To address this, we introduce GRXForm, adapting a pre-trained Graph Transformer model that optimizes molecules via sequential atom-and-bond additions. We employ Group Relative Policy Optimization (GRPO) for goal-directed fine-tuning to mitigate variance by normalizing rewards relative to the starting structure. Empirically, GRXForm generalizes to out-of-distribution molecular scaffolds without inference-time oracle calls or refinement, achieving scores in multi-objective optimization competitive with leading instance optimizers.
💡 Research Summary
The paper addresses the fundamental trade‑off between instance‑based molecular optimization, which repeatedly performs costly search for each new scaffold, and amortized optimization, which seeks a single policy that can instantly generate optimized molecules for any input structure. While instance optimizers such as genetic algorithms (GA) and guided diffusion models achieve high scores, they require thousands of oracle evaluations per task, making them unsuitable for high‑throughput or interactive drug‑design scenarios. Existing amortized approaches have struggled to match these scores because the difficulty of improving a given scaffold varies dramatically across inputs, leading to high‑variance reward signals that destabilize reinforcement‑learning (RL) training.
To overcome this, the authors introduce GRXForm, a framework built on the pre‑trained GraphXForm architecture (a decoder‑only Graph Transformer) that constructs molecules atom‑by‑atom through a hierarchical three‑level action space (operation, target selection, bond specification). Chemical validity is guaranteed at every step by a valence‑based action mask, and the model is first pre‑trained with supervised learning on the ChEMBL‑35 dataset to learn a realistic prior over chemical space.
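To make the validity constraint concrete, the check below is a minimal sketch of a valence-based mask: an action that would push an atom past its maximum valence is disallowed. The element table and function names are illustrative assumptions, not the paper's actual mask, which operates over GraphXForm's full hierarchical action space.

```python
# Hypothetical, simplified valence table (ignores charges, radicals, aromaticity).
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}

def bond_addition_allowed(atom_symbol: str,
                          current_bond_order_sum: int,
                          new_bond_order: int) -> bool:
    """Mask check: adding `new_bond_order` must not exceed the atom's valence."""
    return current_bond_order_sum + new_bond_order <= MAX_VALENCE[atom_symbol]

# A carbon with three existing single bonds can accept one more single bond,
# but an oxygen with two bonds is already saturated.
bond_addition_allowed("C", 3, 1)  # True
bond_addition_allowed("O", 2, 1)  # False
```

In the actual model, such a mask is applied to the policy's logits before sampling, so invalid actions receive zero probability and every generated trajectory corresponds to a chemically valid molecule.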
The core methodological contribution is Group Relative Policy Optimization (GRPO), adapted from large‑language‑model reasoning to molecular design. For each training batch, a set of B starting scaffolds {S₁,…,S_B} is sampled. For each scaffold S_i, the policy generates G distinct complete trajectories using Stochastic Beam Search (SBS). The raw oracle rewards r_{i,j} of these G molecules are then normalized by the group mean μ_i, producing a relative advantage A_{i,j}=r_{i,j}−μ_i. This dynamic, scaffold‑specific baseline replaces the global baseline used in standard REINFORCE, effectively “level‑setting” easy and hard scaffolds: even if all trajectories for a hard scaffold receive low absolute scores, the best among them still yields a positive advantage, while overly high scores on easy scaffolds are dampened. Consequently, reward variance across scaffolds is dramatically reduced, leading to more stable gradient estimates without the need for a separate value network.
Training proceeds as follows: (1) load the pre‑trained GraphXForm weights; (2) define one or more property oracles (e.g., QED, LogP, or a multi‑objective scalar); (3) for each scaffold generate a group of G candidates via SBS; (4) evaluate each candidate with the oracle; (5) compute group‑wise relative advantages and update the policy using the REINFORCE gradient with Adam. The only hyper‑parameters specific to GRPO are the group size G and the beam width used in SBS; the authors show that modest values (e.g., G=8, beam width=5) already yield strong performance.
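The five training steps above can be condensed into a toy, self-contained REINFORCE loop with a group-relative baseline. Everything here is a deliberately simplified stand-in: the "policy" is a bare logit vector instead of GraphXForm, plain i.i.d. sampling replaces Stochastic Beam Search, and the "oracle" just rewards occurrences of one token. It is a sketch of the gradient logic only, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, B, G, T = 6, 4, 8, 5      # token vocab, scaffolds per batch, group size, steps
logits = np.zeros(vocab)          # toy unconditional policy
lr = 0.5

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for step in range(200):
    p = softmax(logits)
    # (3) generate G trajectories per "scaffold" (plain sampling, not SBS)
    seqs = rng.choice(vocab, size=(B, G, T), p=p)
    # (4) toy oracle: fraction of token 3 in each trajectory -> (B, G) rewards
    r = (seqs == 3).mean(axis=-1)
    # (5a) group-relative advantages: subtract each scaffold's group mean
    A = r - r.mean(axis=1, keepdims=True)
    # (5b) REINFORCE: grad of log-prob of a trajectory w.r.t. the logits
    # of a softmax policy is (token counts - T * p)
    counts = np.stack([(seqs == k).sum(-1) for k in range(vocab)], axis=-1)
    grad = (A[..., None] * (counts - T * p)).mean(axis=(0, 1))
    logits += lr * grad           # plain SGD stands in for Adam

# After training, the policy concentrates on the rewarded token.
```

The same structure carries over to the real setting: replace the logit vector with the pre-trained GraphXForm policy, the sampler with SBS at the chosen beam width, and the toy reward with the property oracle(s).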
Experimental evaluation uses the Practical Molecular Optimization (PMO) benchmark, which enforces a strict budget of 10 000 oracle calls per task. GRXForm is compared against top instance optimizers (Mol GA, GenMol) and recent amortized models. Results demonstrate that GRXForm matches or exceeds the best instance‑based scores while requiring zero oracle calls at inference time. Moreover, when tested on out‑of‑distribution scaffolds that were not seen during training, performance degradation is minimal, confirming the effectiveness of the group‑relative baseline in promoting generalization. In multi‑objective settings, GRXForm produces a well‑spread Pareto front, indicating that the method can balance competing objectives without additional engineering (e.g., Pareto ranking or scalarization tricks).
The authors acknowledge limitations: (i) the computational cost of generating G trajectories per scaffold during training can be high, especially when the oracle itself is expensive (e.g., docking or free‑energy calculations); (ii) GRPO’s performance depends on the choice of G and beam width, requiring some tuning. They suggest future work on meta‑learning to adapt these hyper‑parameters automatically, and on integrating multi‑task learning to handle several oracles simultaneously.
In summary, the paper presents a practical, scalable solution for amortized molecular optimization. By normalizing rewards relative to scaffold‑specific groups, GRPO mitigates the variance problem that has hampered previous RL‑based generative models. Combined with a chemically valid Graph Transformer backbone, GRXForm delivers high‑quality, property‑optimized molecules in a single forward pass, opening the door to rapid library generation, real‑time user‑in‑the‑loop design, and cost‑effective use of high‑fidelity computational chemistry oracles.