Open Materials Generation with Inference-Time Reinforcement Learning
Continuous-time generative models for crystalline materials enable inverse materials design by learning to predict stable crystal structures, but incorporating explicit target properties into the generative process remains challenging. Policy-gradient reinforcement learning (RL) provides a principled mechanism for aligning generative models with downstream objectives but typically requires access to the score, which has prevented its application to flow-based models that learn only velocity fields. We introduce Open Materials Generation with Inference-time Reinforcement Learning (OMatG-IRL), a policy-gradient RL framework that operates directly on the learned velocity fields and eliminates the need for explicit computation of the score. OMatG-IRL leverages stochastic perturbations of the underlying generation dynamics, preserving the baseline performance of the pretrained generative model while enabling exploration and policy-gradient estimation at inference time. Using OMatG-IRL, we present the first application of RL to crystal structure prediction (CSP). Our method enables effective reinforcement of an energy-based objective while preserving diversity through composition conditioning, and it achieves performance competitive with score-based RL approaches. Finally, we show that OMatG-IRL can learn time-dependent velocity-annealing schedules, enabling accurate CSP with order-of-magnitude improvements in sampling efficiency and a corresponding reduction in generation time.
💡 Research Summary
The paper introduces Open Materials Generation with Inference‑time Reinforcement Learning (OMatG‑IRL), a novel policy‑gradient reinforcement‑learning framework that works directly on the velocity fields learned by continuous‑time flow‑matching and stochastic‑interpolant (SI) generative models. Traditional score‑based diffusion models provide the log‑density gradient (score) required by most RL approaches, but flow‑matching models only learn a velocity field, making it difficult to apply RL without explicit score computation. OMatG‑IRL circumvents this limitation by adding a small, time‑dependent noise schedule σ_ref(t) to the pretrained deterministic ODE defined by the velocity field b_{θ_ref}(t, x). This creates a surrogate stochastic differential equation (SDE) that preserves the original model’s performance while enabling stochastic exploration during inference.
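The surrogate SDE idea can be sketched with a simple Euler–Maruyama integrator. This is a minimal illustration, not the paper's implementation: the function names and the particular decaying form of `sigma_ref` are assumptions, since the paper's exact noise schedule is not reproduced here.

```python
import numpy as np

def sigma_ref(t, sigma_max=0.1):
    """Hypothetical time-dependent noise schedule (decays to zero at t = 1)."""
    return sigma_max * (1.0 - t)

def sample_sde(velocity_field, x0, n_steps=100, rng=None):
    """Euler-Maruyama integration of the surrogate SDE
        dx = b(t, x) dt + sigma_ref(t) dW,
    which reduces to the pretrained deterministic ODE when sigma_ref = 0."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        drift = velocity_field(t, x)               # pretrained velocity field b(t, x)
        noise = sigma_ref(t) * np.sqrt(dt) * rng.standard_normal(x.shape)
        x = x + drift * dt + noise                 # small perturbation enables exploration
    return x
```

Because the injected noise is small, trajectories stay close to those of the original ODE, which is what lets the pretrained model's baseline quality survive while still giving the policy a stochastic likelihood to differentiate.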
The authors formulate the numerical integration of the generative process as a Markov decision process where the state consists of the current time and configuration (t, x). Actions correspond to the next configuration sampled from the policy πθ(a|s) = pθ(x_{t+Δt} | x_t). Rewards are assigned only at the terminal time (t = 1) and can be any black‑box function of the final crystal structure, such as a negative formation energy or a distance metric to a reference structure. To train the policy, they adopt Group‑Relative Policy Optimization (GRPO), which compares terminal rewards across a group of G rollouts generated under the same composition. The group‑relative advantage Âᵢ = (rᵢ − mean(r))/std(r) is used for every timestep, eliminating the need for a learned value function. The RL objective combines a PPO‑style clipped surrogate loss with a KL‑regularization term that penalizes deviation from the frozen reference policy, ensuring stable updates.
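The GRPO advantage and the clipped objective described above can be sketched as follows. This is a schematic NumPy version under stated assumptions: per-rollout log-probabilities are pre-summed over timesteps, the KL term is approximated by a simple log-probability difference, and the function names are illustrative rather than taken from the paper's code.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: normalize terminal rewards within a group of G
    rollouts sharing the same composition, A_i = (r_i - mean(r)) / std(r)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def clipped_surrogate(logp_new, logp_old, logp_ref, advantages,
                      clip_eps=0.2, kl_coef=0.01):
    """PPO-style clipped surrogate loss plus a KL penalty toward the frozen
    reference policy (a crude one-sample KL estimate is used here)."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -np.minimum(unclipped, clipped).mean()
    kl_penalty = kl_coef * (logp_new - logp_ref).mean()
    return policy_loss + kl_penalty
```

Because the same normalized advantage is reused at every timestep of a rollout, no learned critic or value baseline is needed, which keeps the method lightweight at inference time.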
A key contribution is the demonstration that this velocity‑based RL works as well as score‑based RL when the latter is available. The authors apply OMatG‑IRL to the crystal‑structure‑prediction (CSP) task, where the composition is fixed and only the fractional coordinates and lattice vectors are generated. Because composition conditioning already provides diversity, no explicit diversity reward is required, unlike prior work on de‑novo molecule generation. The method successfully reinforces an energy‑based objective, producing lower‑energy structures while maintaining high match rates to known stable crystals.
Furthermore, the framework is extended to learn a time‑dependent velocity‑annealing schedule β(t). Instead of using handcrafted annealing (e.g., linear or cosine schedules), the schedule is parameterized and optimized jointly with the policy. The learned schedule performs aggressive exploration early (large Δt and σ) and fine‑grained refinement later (small Δt and σ), yielding an order‑of‑magnitude reduction in the number of integration steps needed for accurate CSP (from hundreds to a few dozen) without sacrificing quality.
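One simple way to parameterize such a learnable schedule is to let a vector of logits define positive per-step fractions of the unit time interval, so that gradient updates can shift probability mass toward large early steps and small late ones. This parameterization is an assumption for illustration; the paper's actual form of β(t) is not reproduced here.

```python
import numpy as np

def annealing_schedule(theta):
    """Hypothetical learned schedule: softmax over logits theta yields positive
    step sizes dt_i that sum to 1, giving time points 0 = t_0 < ... < t_N = 1.
    Uneven logits produce aggressive early steps and fine late refinement."""
    w = np.exp(theta - theta.max())   # numerically stable softmax
    dt = w / w.sum()                  # positive step sizes summing to 1
    t = np.concatenate(([0.0], np.cumsum(dt)))
    return t, dt
```

With uniform logits this recovers the standard fixed-step integrator, so the learned schedule strictly generalizes the handcrafted ones it replaces.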
Experimental results on several benchmark datasets show: (1) comparable or slightly higher average terminal rewards than score‑based RL baselines; (2) dramatically improved sampling efficiency due to the learned annealing schedule; (3) stable training without mode collapse, thanks to the group‑relative advantage normalization and KL regularization; and (4) preservation of structural diversity through composition conditioning.
The paper also discusses limitations: rewards are only terminal, making it harder to optimize intermediate properties; multi‑objective optimization (e.g., simultaneously targeting band gaps and mechanical strength) would require reward shaping or scalarization; and scalability to larger, more complex crystal systems remains to be explored.
In summary, OMatG‑IRL provides a practical and theoretically sound way to integrate reinforcement learning with velocity‑field generative models, opening the door to efficient, goal‑directed crystal‑structure generation and potentially broader applications in materials discovery.