Generative structural elucidation from mass spectra as an iterative optimization problem

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Liquid chromatography tandem mass spectrometry (LC-MS/MS) is a critical analytical technique for molecular identification across metabolomics, environmental chemistry, and chemical forensics. A variety of computational methods have emerged for structural annotation of spectral features of interest, but many of these features cannot be confidently annotated with reference structures or spectra. Here, we introduce FOAM (Formula-constrained Optimization for Annotating Metabolites), a computational workflow that poses structure elucidation from LC-MS/MS as an iterative optimization problem. FOAM couples a formula-constrained graph genetic algorithm with spectral simulation to explore candidate annotations given an experimental spectrum. We demonstrate FOAM’s performance on the NIST'20 and MassSpecGym datasets as both a standalone elucidation pipeline and as a complement to existing inverse models. This work establishes iterative optimization as an effective and extensible paradigm for structural elucidation.


💡 Research Summary

The paper introduces FOAM (Formula-constrained Optimization for Annotating Metabolites), a computational workflow that frames LC-MS/MS-based structural elucidation as an iterative optimization problem. FOAM first infers a molecular formula from the high-resolution precursor mass using tools such as SIRIUS, MIST-CF, or BUDDY. All structures matching this formula in a virtual library (e.g., PubChem), or supplied by other de novo generators, become the initial population. Each candidate is fragmented in silico with ICEBERG, a geometric deep-learning spectral simulator that predicts spectra under the experimental collision energy, adduct, and instrument settings. The predicted spectrum is compared to the experimental MS² spectrum using an entropy-based similarity metric. In parallel, structural complexity is quantified by the SAScore. These two objectives are combined in a Pareto-ranking scheme; non-dominated sorting selects the top-ranking individuals to form a mating pool. A graph-based genetic algorithm then applies formula-constrained crossover and mutation operators, guaranteeing that offspring retain the exact elemental composition of the target molecule. The cycle of simulation, scoring, selection, and variation repeats for a predefined number of generations or until a computational budget (e.g., 7,500 ICEBERG calls for NIST’20) is exhausted. The final output is a list of candidates ranked by predicted spectral similarity.
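The scoring and selection step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes the unweighted spectral entropy similarity as the "entropy-based" metric, peaks already binned to shared m/z keys, and that a lower SAScore is preferred. All function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Spectrum = Dict[float, float]  # assumed format: binned m/z -> intensity


def _normalize(spec: Spectrum) -> Spectrum:
    """Scale intensities to sum to 1 (assumes at least one positive peak)."""
    total = sum(spec.values())
    return {mz: i / total for mz, i in spec.items() if i > 0}


def _entropy(spec: Spectrum) -> float:
    """Shannon entropy of a normalized spectrum."""
    return -sum(p * math.log(p) for p in spec.values() if p > 0)


def entropy_similarity(a: Spectrum, b: Spectrum) -> float:
    """Unweighted spectral entropy similarity: 1 for identical, 0 for disjoint."""
    a, b = _normalize(a), _normalize(b)
    merged = {mz: 0.5 * a.get(mz, 0.0) + 0.5 * b.get(mz, 0.0)
              for mz in set(a) | set(b)}
    return 1.0 - (2.0 * _entropy(merged) - _entropy(a) - _entropy(b)) / math.log(4.0)


def dominates(p: Tuple[float, float], q: Tuple[float, float]) -> bool:
    """p, q = (similarity, sa_score). Assumption: maximize similarity, minimize SAScore."""
    no_worse = p[0] >= q[0] and p[1] <= q[1]
    strictly_better = p[0] > q[0] or p[1] < q[1]
    return no_worse and strictly_better


def pareto_fronts(scores: List[Tuple[float, float]]) -> List[List[int]]:
    """Simple non-dominated sorting: peel off successive Pareto fronts."""
    remaining = set(range(len(scores)))
    fronts = []
    while remaining:
        front = sorted(
            i for i in remaining
            if not any(dominates(scores[j], scores[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


def mating_pool(scores: List[Tuple[float, float]], size: int) -> List[int]:
    """Fill the pool front by front until `size` candidate indices are chosen."""
    pool: List[int] = []
    for front in pareto_fronts(scores):
        pool.extend(front)
        if len(pool) >= size:
            break
    return pool[:size]
```

For example, with scores `[(0.9, 3.0), (0.8, 2.0), (0.7, 4.0), (0.5, 2.5)]` the first two candidates form the first front because neither is better on both objectives, and the other two form the second front. The quadratic peeling loop is chosen for readability; a production search would likely use a faster sorting routine.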

Evaluation on two benchmark datasets, NIST’20 and MassSpecGym, demonstrates that FOAM can recover the true structure in 68% of test cases. Within three generations, the true molecule appears as the top-ranked candidate in 11% of runs and among the top-10 candidates in 31% of runs. Spectral similarity improves steadily across generations (average increase of 0.09 over 60 generations), while structural similarity (Tanimoto on 2048-bit Morgan fingerprints) peaks after two to three generations (≈0.62) before slowly declining due to the emergence of high-scoring decoys. The authors show that performance correlates strongly with the relevance of the initial seed structures and with ICEBERG’s prediction accuracy; the latter is identified as the primary bottleneck. When combined with existing inverse models (e.g., fingerprint-based retrieval or conditional generative networks), FOAM provides complementary gains, illustrating its extensibility.
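The structural similarity metric mentioned above is straightforward to compute once fingerprints are available. The sketch below assumes each fingerprint is represented as the set of its on-bit indices (for a 2048-bit Morgan fingerprint, integers in 0..2047). Generating the fingerprints themselves would require a cheminformatics toolkit and is not shown. The helper names are hypothetical.

```python
from typing import Iterable, Set


def tanimoto(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / union


def best_tanimoto(truth: Set[int], candidates: Iterable[Set[int]]) -> float:
    """Highest similarity to the true structure within one generation's candidates."""
    return max(tanimoto(truth, c) for c in candidates)
```

Tracking `best_tanimoto` per generation is one way to observe the reported pattern, where structural similarity peaks early while spectral similarity keeps rising.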

Key contributions include: (1) reframing metabolite annotation as a formula‑constrained multi‑objective optimization; (2) integrating a state‑of‑the‑art spectral predictor as a surrogate oracle within a genetic algorithm; (3) demonstrating that iterative refinement can explore chemical space beyond pre‑enumerated libraries while respecting elemental composition; and (4) providing a systematic analysis of factors influencing success (seed relevance, oracle accuracy). Limitations are acknowledged: the quality of ICEBERG predictions directly limits the ability to distinguish true structures from high‑scoring decoys, and the graph‑genetic operators, while formula‑preserving, may explore the space inefficiently for very large or highly flexible molecules. Future work is suggested to incorporate more accurate fragmentation models, adaptive mutation strategies, and reinforcement‑learning‑driven search policies.
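To make the formula constraint concrete, the following sketch checks whether candidate structures share the target's elemental composition. It is only an illustrative post-hoc filter: per the summary, FOAM's operators preserve the formula by construction rather than by filtering. The parser assumes flat formulas such as `C6H12O6`, without parentheses, charges, or isotope labels, and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import List

# One element symbol (capital letter plus optional lowercase) and an optional count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula: str) -> Counter:
    """Count atoms per element in a flat molecular formula string."""
    counts: Counter = Counter()
    for element, n in _TOKEN.findall(formula):
        counts[element] += int(n) if n else 1
    return counts


def same_formula(a: str, b: str) -> bool:
    """True when both strings describe the same elemental composition."""
    return parse_formula(a) == parse_formula(b)


def keep_formula_matches(target: str, offspring: List[str]) -> List[str]:
    """Keep only offspring formulas whose composition matches the target."""
    return [f for f in offspring if same_formula(target, f)]
```

Because the comparison is on element counts, isomers written in different orders (for example `C2H6O` and `C2H5OH`) are treated as the same formula, which is the intended behavior when searching among structures of fixed composition.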

Overall, FOAM establishes iterative, formula‑constrained optimization as an effective and extensible paradigm for de‑novo structural elucidation from mass spectra, offering a promising avenue for metabolomics, environmental chemistry, and forensic applications where reference spectra are scarce.

