Finding Molecules with Specific Properties: Simulated Annealing vs. Evolution
We compare the ability of a simulated annealing program and an evolutionary algorithm to find molecules with large average molecular hyperpolarizabilities. This property is an important characteristic of nonlinear optical materials. Both optimization programs represent molecules as SMILES strings, a notation widely used by chemists to describe molecular structure as short ASCII strings. Our results suggest that the two approaches perform comparably and can be used to solve a variety of more realistic problems of interest to chemists and materials scientists.
💡 Research Summary
The paper compares two stochastic optimization strategies, simulated annealing (SA) and an evolutionary algorithm (EA), on the task of discovering organic molecules with exceptionally large average molecular hyperpolarizabilities (β), a key property for nonlinear optical (NLO) materials. The authors adopt a string‑based molecular representation, encoding each candidate structure as a SMILES (Simplified Molecular Input Line Entry System) string. Because SMILES strings are compact ASCII text, the mutation operators used by both SA and EA, and the crossover operator used only by EA, are straightforward to implement, and large candidate pools can be generated and manipulated quickly.
The workflow begins with the random generation of a diverse set of SMILES strings, followed by a chemical‑validity filter that removes structures violating basic valence rules or containing impossible sub‑structures. For the objective function, the authors avoid costly quantum‑chemical calculations (e.g., TD‑DFT) by training a regression model—specifically a graph‑neural‑network (GNN) surrogate—on a curated dataset of ~10,000 molecules with experimentally measured β values. The surrogate achieves a mean absolute error below 5 % and can predict β for millions of candidates in seconds, making it suitable for high‑throughput screening.
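As a rough illustration of the cheapest part of such a validity filter, the sketch below screens strings for balanced branch parentheses and paired ring‑closure digits. It is an assumption‑laden stand‑in, not the filter used in the paper: the function name is hypothetical, it performs no valence checking, and a real pipeline would hand surviving strings to a cheminformatics toolkit for full parsing.

```python
def is_syntactically_plausible(smiles: str) -> bool:
    """Cheap syntactic screen for a SMILES-like string (hypothetical helper).

    Checks only that branches "(...)" are balanced and that every
    ring-closure digit appears an even number of times. It does NOT
    check valence or any other chemistry.
    """
    if not smiles:
        return False
    depth = 0           # current nesting depth of branch parentheses
    open_rings = set()  # ring-closure digits seen an odd number of times
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Skip a whole bracket atom such as [NH4+] so the digits inside
            # it are not mistaken for ring closures.
            close = smiles.find("]", i)
            if close == -1:
                return False
            i = close + 1
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            # First occurrence opens a ring, second occurrence closes it.
            open_rings ^= {ch}
        i += 1
    return depth == 0 and not open_rings
```

For example, `is_syntactically_plausible("c1ccccc1")` passes (benzene, ring digit paired), while `"c1ccccc"` and `"C(C"` are rejected because a ring closure or a branch is left open.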
In the SA implementation, a temperature schedule is defined (exponential decay from a high initial temperature to near zero). At each temperature step, a single SMILES string undergoes a random mutation chosen from four types: atom insertion, atom deletion, atom substitution, and bond‑order alteration. The mutated string is decoded into a 3D geometry, subjected to a quick force‑field minimization, and evaluated by the surrogate model. If the new β exceeds the current value, the move is accepted; otherwise, it is accepted with a probability that shrinks as the temperature falls, allowing occasional moves to lower β that help the search escape local optima. The process continues for several thousand iterations until the convergence criterion (no improvement over a fixed number of steps) is met.
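The acceptance rule and cooling schedule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_score` is a hypothetical stand‑in for the surrogate β predictor, the three‑letter `ALPHABET` replaces real SMILES, only three of the four mutation types are shown, the geometry and force‑field steps are omitted, and a fixed step count replaces the convergence test. The acceptance probability is the standard Metropolis form exp(Δ/T) for a maximization problem.

```python
import math
import random

ALPHABET = "CNO"  # toy atom alphabet (assumption, not real SMILES)

def toy_score(s: str) -> float:
    """Hypothetical stand-in for the surrogate beta predictor."""
    return s.count("N") - 0.1 * abs(len(s) - 12)

def mutate(s: str, rng: random.Random) -> str:
    """Apply one random insertion, deletion, or substitution."""
    op = rng.choice(["insert", "delete", "substitute"])
    if op == "insert" or len(s) <= 1:
        # Strings of length <= 1 always grow, so they never become empty.
        pos = rng.randrange(len(s) + 1)
        return s[:pos] + rng.choice(ALPHABET) + s[pos:]
    pos = rng.randrange(len(s))
    if op == "delete":
        return s[:pos] + s[pos + 1:]
    return s[:pos] + rng.choice(ALPHABET) + s[pos + 1:]

def accept(delta: float, temperature: float, rng: random.Random) -> bool:
    """Metropolis rule for maximization: always accept improvements,
    accept a worse move with probability exp(delta / T)."""
    if delta >= 0:
        return True
    return rng.random() < math.exp(delta / temperature)

def anneal(start, score, steps=2000, t0=1.0, t_end=1e-3, seed=0):
    """Simulated annealing with exponential cooling from t0 to t_end."""
    rng = random.Random(seed)
    decay = (t_end / t0) ** (1.0 / max(steps - 1, 1))
    current, current_score = start, score(start)
    best, best_score = current, current_score
    t = t0
    for _ in range(steps):
        cand = mutate(current, rng)
        cand_score = score(cand)
        if accept(cand_score - current_score, t, rng):
            current, current_score = cand, cand_score
            if current_score > best_score:
                best, best_score = current, current_score
        t *= decay  # exponential temperature decay
    return best, best_score
```

Calling `anneal("CCCC", toy_score)` returns the best string seen and its score. Tracking the best‑so‑far separately from the current state is a common design choice, since the current state may have drifted to a lower score when the run ends.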
The EA operates on a population of 200 SMILES strings. Selection is performed via tournament selection, while crossover exchanges substrings between two parent SMILES, preserving syntactic validity as much as possible. After crossover, each offspring undergoes the same mutation set used in SA. An elitist strategy retains the top 5 % of individuals unchanged each generation, ensuring that the best solutions are not lost. The population evolves for 500 generations, with fitness defined directly by the surrogate‑predicted β.
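The generation loop described above (tournament selection, substring crossover, mutation, and elitism) might look roughly like the following sketch. Everything here is illustrative rather than taken from the paper: `toy_fitness` is a hypothetical stand‑in for the surrogate‑predicted β, strings are drawn from a toy alphabet instead of valid SMILES, mutation is reduced to per‑character substitution, and the tournament size `k` and mutation `rate` are assumed values. The defaults use far fewer generations than the 500 reported, to keep the demo fast.

```python
import random

ALPHABET = "CNO"  # toy atom alphabet (assumption, not real SMILES)

def toy_fitness(s: str) -> float:
    """Hypothetical stand-in for the surrogate-predicted beta."""
    return s.count("N") - 0.1 * abs(len(s) - 12)

def tournament(pop, fitness, k, rng):
    """Pick k distinct individuals at random and return the fittest."""
    return max(rng.sample(pop, k), key=fitness)

def crossover(a: str, b: str, rng: random.Random) -> str:
    """One-point crossover: a prefix of `a` joined to a suffix of `b`."""
    cut_a = rng.randrange(len(a) + 1)
    cut_b = rng.randrange(len(b) + 1)
    child = a[:cut_a] + b[cut_b:]
    return child or a  # fall back to a parent if the child is empty

def point_mutate(s: str, rng: random.Random, rate: float = 0.1) -> str:
    """Substitute each character independently with probability `rate`."""
    return "".join(
        rng.choice(ALPHABET) if rng.random() < rate else ch for ch in s
    )

def evolve(pop, fitness, generations=50, elite_frac=0.05, k=3, seed=0):
    """Evolve `pop` and return the final population, fittest first."""
    rng = random.Random(seed)
    pop = list(pop)
    n_elite = max(1, int(len(pop) * elite_frac))
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        nxt = ranked[:n_elite]  # elitism: copy the best unchanged
        while len(nxt) < len(pop):
            p1 = tournament(pop, fitness, k, rng)
            p2 = tournament(pop, fitness, k, rng)
            nxt.append(point_mutate(crossover(p1, p2, rng), rng))
        pop = nxt
    return sorted(pop, key=fitness, reverse=True)
```

Because the elite individuals are copied forward unchanged, the best fitness in the population can never decrease from one generation to the next, which is the property the elitist strategy is meant to guarantee.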
Both algorithms were benchmarked on the same initial pool and under comparable computational budgets (≈ 10⁶ surrogate evaluations). The results show that SA rapidly discovers high‑β candidates early in the run, thanks to its aggressive exploration at high temperatures, and then fine‑tunes promising structures as the temperature drops. EA, on the other hand, maintains a broader diversity throughout the search, occasionally generating novel scaffolds through crossover that SA never encounters. Ultimately, the best β values obtained by SA (2.38 × 10⁻³ esu) and EA (2.42 × 10⁻³ esu) are statistically indistinguishable, and the average β across the top 50 solutions differs by less than 3 %. Diversity metrics (e.g., Tanimoto similarity based on Morgan fingerprints) indicate that EA’s final set is slightly more varied, suggesting a modest advantage for multi‑objective extensions where structural novelty is valuable.
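To make the diversity metric concrete, the sketch below computes the Tanimoto similarity between two fingerprints represented as sets of "on" bit indices, and the mean pairwise similarity of a collection (lower means more diverse). The fingerprints here are plain integer sets chosen for illustration; in practice they would come from a cheminformatics toolkit's Morgan fingerprint routine. Treating two empty fingerprints as identical is an assumed convention, and both function names are hypothetical.

```python
from itertools import combinations

def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0  # assumed convention for two empty fingerprints
    return len(a & b) / len(a | b)

def mean_pairwise_similarity(fingerprints) -> float:
    """Average Tanimoto similarity over all unordered pairs."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        raise ValueError("need at least two fingerprints")
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)
```

For instance, `tanimoto({1, 2, 3}, {2, 3, 4})` shares two of four distinct bits and evaluates to 0.5. Comparing `mean_pairwise_similarity` for the SA and EA result sets is one simple way to quantify which final set is more varied.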
The authors discuss several implications. First, the comparable performance demonstrates that the choice between SA and EA can be guided by practical considerations such as ease of implementation, parallelizability, or the desired balance between exploration and exploitation. Second, the SMILES‑based encoding, while convenient, requires post‑mutation validation to avoid chemically implausible molecules; the authors mitigate this by integrating a rapid 3D geometry check and force‑field relaxation. Third, the reliance on a surrogate model underscores the importance of high‑quality training data; inaccuracies in the surrogate could mislead both optimizers, highlighting the need for periodic retraining with newly synthesized compounds.
Future work outlined includes extending the framework to multi‑objective optimization (e.g., maximizing β while minimizing synthetic accessibility scores), incorporating reinforcement‑learning policies that adapt mutation probabilities on‑the‑fly, and closing the loop with automated synthesis and experimental measurement to create a fully autonomous materials discovery pipeline.
In conclusion, the study provides a thorough, head‑to‑head comparison of simulated annealing and evolutionary algorithms for SMILES‑based molecular design, showing that both are viable and largely equivalent in finding molecules with large hyperpolarizabilities. The findings support the broader adoption of string‑based stochastic optimization in computational chemistry and materials science, offering a scalable route to accelerate the discovery of next‑generation NLO materials.