Enhancing NMR Shielding Predictions of Atoms-in-Molecules Machine Learning Models with Neighborhood-Informed Representations
Accurate prediction of nuclear magnetic resonance (NMR) shielding with machine learning (ML) models remains a central challenge for data-driven spectroscopy. We present atomic variants of the Coulomb matrix (aCM) and bag-of-bonds (aBoB) descriptors, and extend them using radial basis functions (RBFs) to yield smooth, per-atom representations (aCM-RBF, aBoB-RBF). Local structural information is incorporated by augmenting each atomic descriptor with contributions from the n nearest neighbors, resulting in the family of descriptors, aCM-RBF(n) and aBoB-RBF(n). For 13C shielding prediction on the QM9NMR dataset (831,925 shielding values across 130,831 molecules), aBoB-RBF(4) achieves an out-of-sample mean error of 1.69 ppm, outperforming models reported in previous studies. While explicit three-body descriptors further reduce errors at a higher cost, aBoB-RBF(4) offers the best balance of accuracy and efficiency. Benchmarking on external datasets comprising larger molecules (GDBm, Drug12/Drug40, and pyrimidinone derivatives) confirms the robustness and transferability of aBoB-RBF(4), establishing it as a practical tool for ML-based NMR shielding prediction.
💡 Research Summary
This paper introduces neighborhood-informed atomic descriptors for machine learning (ML) prediction of nuclear magnetic resonance (NMR) shielding constants, a key quantity in spectroscopic analysis.
The core challenge addressed is the accurate and efficient prediction of NMR shielding, which is computationally expensive with quantum chemical methods like DFT and often inaccurate with empirical rules for complex molecules. The authors propose a novel family of atomic descriptors specifically designed for the Atoms-in-Molecules (AIM) ML framework, where properties are predicted per atom based on its local chemical environment.
The methodological innovation proceeds in three key steps. First, the authors derive atomic variants of two well-known global molecular descriptors: the atomic Coulomb Matrix (aCM) and the atomic Bag-of-Bonds (aBoB). Second, they transform these discrete representations into smooth, continuous ones by employing Radial Basis Function (RBF) expansions, resulting in aCM-RBF and aBoB-RBF. The most critical advancement is the third step: enriching these atomic descriptors by concatenating information from the n nearest neighboring atoms. This creates the final descriptor families, aCM-RBF(n) and aBoB-RBF(n), which implicitly capture many-body effects crucial for NMR shielding without explicitly encoding complex three-body terms.
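The neighbor-augmentation step can be illustrated with a minimal sketch. It is not the authors' implementation: the Coulomb-like pair weight Z_i*Z_j/r, the Gaussian grid (`centers`, `width`), the zero-padding for atoms with fewer than n neighbors, and the function names `atomic_rbf_descriptor` and `neighbor_augmented` are all assumptions made for illustration. Only the overall idea, a smooth per-atom vector concatenated with the vectors of the n nearest atoms, comes from the summary above.

```python
import numpy as np

def atomic_rbf_descriptor(Z, R, i, centers, width=0.2):
    """Hypothetical smooth per-atom descriptor for atom i.

    Each pair (i, j) contributes a Coulomb-like weight Z_i*Z_j/r_ij that is
    spread over a radial grid with Gaussian basis functions. This is an
    illustrative form, not the paper's exact aCM-RBF or aBoB-RBF definition.
    """
    d = np.linalg.norm(R - R[i], axis=1)          # distances from atom i
    others = np.arange(len(Z)) != i               # exclude atom i itself
    weights = Z[i] * Z[others] / d[others]
    gauss = np.exp(-((d[others][:, None] - centers[None, :]) ** 2)
                   / (2.0 * width ** 2))          # (n_others, n_centers)
    return (weights[:, None] * gauss).sum(axis=0)

def neighbor_augmented(Z, R, i, n, centers, width=0.2):
    """Concatenate atom i's descriptor with those of its n nearest atoms."""
    d = np.linalg.norm(R - R[i], axis=1)
    nearest = [j for j in np.argsort(d) if j != i][:n]
    blocks = [atomic_rbf_descriptor(Z, R, i, centers, width)]
    blocks += [atomic_rbf_descriptor(Z, R, j, centers, width) for j in nearest]
    # Assumption: pad with zeros when the molecule has fewer than n other atoms,
    # so every atom gets a vector of the same length.
    blocks += [np.zeros(len(centers))] * (n - len(nearest))
    return np.concatenate(blocks)

# Usage: a methane-like geometry (coordinates in angstrom, approximate).
Z = np.array([6.0, 1.0, 1.0, 1.0, 1.0])
R = 0.63 * np.array([[0, 0, 0], [1, 1, 1], [1, -1, -1],
                     [-1, 1, -1], [-1, -1, 1]], dtype=float)
centers = np.linspace(0.5, 5.0, 10)
x = neighbor_augmented(Z, R, i=0, n=4, centers=centers)
print(x.shape)  # (50,) = (1 + 4 neighbors) * 10 grid points
```

Because the neighbor blocks are themselves built from pair terms centered on other atoms, the concatenated vector carries information about the surroundings of atom i's neighbors without any explicit three-body term, which is the sense in which the summary calls the many-body information implicit.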
These descriptors were evaluated using Kernel Ridge Regression (KRR) on the large-scale QM9NMR dataset, which contains 831,925 13C shielding values across 130,831 small organic molecules. A 5,000-molecule subset was used for efficient hyperparameter optimization. The results show a clear hierarchy of performance: continuous RBF-based descriptors outperform their discrete counterparts, and including neighbor information (with optimal n=4) yields the largest improvement.
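A minimal sketch of the KRR workflow described here, written with NumPy only, is given below. The summary does not state which kernel or hyperparameter grid the authors used, so the Gaussian kernel, the parameter names `sigma` and `lam`, the simple hold-out split, and the helper `select_hyperparameters` are assumptions for illustration. The toy data stand in for per-atom descriptor vectors and shielding values.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def krr_fit(X, y, sigma, lam):
    """Solve (K + lam*I) alpha = y for the regression coefficients."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, sigma):
    """Predict targets for X_new from the fitted coefficients."""
    return gaussian_kernel(X_new, X_train, sigma) @ alpha

def mae(y_true, y_pred):
    """Mean absolute error, the metric quoted in ppm in the text."""
    return float(np.mean(np.abs(y_true - y_pred)))

def select_hyperparameters(X, y, sigmas, lams, n_train):
    """Hypothetical grid search on a small subset with one hold-out split."""
    X_tr, y_tr, X_va, y_va = X[:n_train], y[:n_train], X[n_train:], y[n_train:]
    best = None
    for sigma in sigmas:
        for lam in lams:
            alpha = krr_fit(X_tr, y_tr, sigma, lam)
            err = mae(y_va, krr_predict(X_tr, alpha, X_va, sigma))
            if best is None or err < best[0]:
                best = (err, sigma, lam)
    return best  # (validation MAE, sigma, lam)

# Usage on toy data: random vectors and a smooth synthetic target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]
err, sigma, lam = select_hyperparameters(
    X, y, sigmas=[0.5, 1.0, 2.0, 4.0], lams=[1e-8, 1e-4, 1e-2], n_train=150)
print(f"best sigma={sigma}, lam={lam}, hold-out MAE={err:.3f}")
```

Tuning on a small subset and then training on the full set keeps the search cheap, since each fit solves a dense linear system whose cost grows cubically with the number of training atoms.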
The aBoB-RBF(4) descriptor was the best-performing of the proposed descriptors, achieving an out-of-sample mean absolute error (MAE) of 1.69 ppm for 13C shielding on the QM9NMR test set, lower than the errors reported in previous studies on the same dataset. A more complex descriptor that explicitly includes three-body interactions (SLATM) achieved a slightly lower error of 1.58 ppm, but at a substantially higher computational cost. The aBoB-RBF(4) model thus offers the best practical trade-off between accuracy and efficiency.
To rigorously test the model’s robustness and transferability, the authors performed extensive benchmarking on several external datasets containing larger molecules outside the training domain: GDBm (molecules with 10-17 heavy atoms), Drug12/Drug40 (drug-like molecules with 7-23 heavy atoms), and a set of 208 pyrimidinone derivatives. The aBoB-RBF(4) model maintained high predictive accuracy across all these challenging sets, confirming its generalizability to broader chemical space.
In conclusion, this work systematically improves atomic representations for ML by incorporating neighborhood information. The resulting aBoB-RBF(4) descriptor combines chemical intuitiveness, computational efficiency, and high accuracy, which makes it a practical tool for NMR shielding prediction with potential for high-throughput applications in fields such as drug discovery and metabolomics.