Improving the accuracy and generalizability of molecular property regression models with a substructure-substitution-rule-informed framework
Artificial Intelligence (AI)-aided drug discovery is an active research field, yet AI models often exhibit poor accuracy in regression tasks for molecular property prediction, and perform catastrophically poorly for out-of-distribution (OOD) molecules. Here, we present MolRuleLoss, a substructure-substitution-rule-informed framework that improves the accuracy and generalizability of multiple molecular property regression models (MPRMs) such as GEM and UniMol for diverse molecular property prediction tasks. MolRuleLoss incorporates partial derivative constraints for substructure substitution rules (SSRs) into an MPRM’s loss function. When using GEM models for predicting lipophilicity, water solubility, and solvation-free energy (using lipophilicity, ESOL, and freeSolv datasets from MoleculeNet), the root mean squared error (RMSE) values with and without MolRuleLoss were 0.587 vs. 0.660, 0.777 vs. 0.798, and 1.252 vs. 1.877, respectively, representing 2.6-33.3% performance improvements. We show that both the number and the quality of SSRs contribute to the magnitude of prediction accuracy gains obtained upon adding MolRuleLoss to an MPRM. MolRuleLoss improved the generalizability of MPRMs for “activity cliff” molecules in a lipophilicity prediction task and improved the generalizability of MPRMs for OOD molecules in a melting point prediction task. In a molecular weight prediction task for OOD molecules, MolRuleLoss reduced the RMSE value of a GEM model from 29.507 to 0.007. We also provide a formal demonstration that the upper bound of the variation for property change of SSRs is positively correlated with an MPRM’s error. Together, we show that using the MolRuleLoss framework as a bolt-on boosts the prediction accuracy and generalizability of multiple MPRMs, supporting diverse applications in areas like cheminformatics and AI-aided drug discovery.
💡 Research Summary
Artificial intelligence–driven drug discovery relies heavily on accurate molecular property regression models (MPRMs) to predict physicochemical attributes such as lipophilicity, solubility, and free energy. While modern deep‑learning architectures (e.g., GEM, UniMol) achieve impressive performance on in‑distribution (ID) data, they often fail dramatically on out‑of‑distribution (OOD) molecules, especially those that exhibit “activity cliffs” where small structural changes cause large property shifts. The paper introduces MolRuleLoss, a novel loss‑function augmentation that embeds substructure‑substitution‑rule (SSR) information directly into the training objective via partial‑derivative constraints.
An SSR is a pre‑computed quantitative rule describing how replacing a specific substructure (e.g., a methyl group) with another (e.g., a hydroxyl group) is expected to change a target property. By constraining the partial derivatives of the model's output so that its predicted property change under each substitution aligns with the rule's expected change, MolRuleLoss forces the network to respect chemically plausible transformations during training. The approach requires no architectural modifications; it is a "bolt‑on" that can be attached to any differentiable MPRM.
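The mechanism can be sketched as an auxiliary loss term added to the ordinary regression loss. The sketch below is illustrative only: the function and variable names (`ssr_penalty`, `ssr_pairs`, the pairing of molecules per rule) are hypothetical stand-ins, not the paper's actual API, and the exact constraint form in MolRuleLoss may differ.

```python
# Minimal sketch of an SSR-informed auxiliary loss (illustrative;
# names are hypothetical, not the paper's actual implementation).

def ssr_penalty(predict, ssr_pairs):
    """Penalize deviations of predicted property changes from the
    expected change stated by each substructure-substitution rule.

    predict   -- callable mapping a molecule representation to a float
    ssr_pairs -- list of (mol_a, mol_b, expected_delta) triples, where
                 mol_b is mol_a with one substructure substituted
    """
    penalty = 0.0
    for mol_a, mol_b, expected_delta in ssr_pairs:
        observed_delta = predict(mol_b) - predict(mol_a)
        penalty += (observed_delta - expected_delta) ** 2
    return penalty / max(len(ssr_pairs), 1)

def total_loss(regression_loss, predict, ssr_pairs, weight=0.1):
    """Standard regression loss plus the weighted SSR constraint."""
    return regression_loss + weight * ssr_penalty(predict, ssr_pairs)
```

In a real differentiable pipeline, `predict` would be the MPRM's forward pass and the penalty would be backpropagated alongside the task loss; the weight balancing the two terms would be a tunable hyperparameter.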
The authors evaluate MolRuleLoss on three MoleculeNet regression benchmarks: Lipophilicity, ESOL (water solubility), and FreeSolv (solvation free energy). Combined with a GEM backbone, RMSE improves from 0.660 to 0.587 (Lipophilicity), 0.798 to 0.777 (ESOL), and 1.877 to 1.252 (FreeSolv), corresponding to relative gains of 2.6%–33.3%. A systematic ablation study shows that both the quantity and the quality of SSRs matter: growing the rule set from 50 to 200 entries yields larger error reductions, and high‑quality rules (those that correlate strongly with experimental measurements) contribute disproportionately to the gains.
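The relative gains follow directly from the reported RMSE pairs; a quick check:

```python
# Relative RMSE improvement per benchmark, from the reported values
# (baseline GEM vs. GEM + MolRuleLoss).
results = {
    "Lipophilicity": (0.660, 0.587),
    "ESOL":          (0.798, 0.777),
    "FreeSolv":      (1.877, 1.252),
}

for task, (baseline, with_mrl) in results.items():
    gain = 100.0 * (baseline - with_mrl) / baseline
    print(f"{task}: {baseline:.3f} -> {with_mrl:.3f} ({gain:.1f}% improvement)")
# Lipophilicity: 11.1%, ESOL: 2.6%, FreeSolv: 33.3%
```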
Generalization tests focus on OOD scenarios. In a lipophilicity task that includes activity‑cliff molecules, MolRuleLoss‑augmented models maintain lower prediction variance and higher correlation with experimental values than baseline models. In a melting‑point prediction task using an external OOD dataset, the augmented models again achieve lower RMSE, demonstrating improved robustness to unseen chemical space. The most striking result appears in a molecular‑weight regression on OOD compounds: the GEM model’s RMSE drops from 29.507 to an almost perfect 0.007 after applying MolRuleLoss.
Beyond empirical results, the paper provides a theoretical analysis linking the upper bound of SSR‑induced property variation (the maximum absolute Δproperty for a rule) to the expected model error. The authors prove that larger SSR variation bounds are positively correlated with higher prediction error, which justifies the regularization effect of MolRuleLoss: constraining the model to respect SSRs with tight variation bounds also tightens the upper bound on its error.
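The intuition behind this relationship can be sketched informally. This is a reconstruction from the summary, not the paper's exact theorem, and the notation ($y$, $\hat{f}$, $\Delta_r$, $\varepsilon_r$) is assumed: if training enforces the rule's expected change exactly, a triangle-inequality argument bounds how much error can differ between rule-related molecules.

```latex
% Assumed notation (not the paper's exact statement):
%   y      : true property value
%   \hat{f}: model prediction
%   e(x) = \hat{f}(x) - y(x) : prediction error
% Rule r maps x -> x' with true change y(x') - y(x) = \Delta_r + \delta,
% where |\delta| \le \varepsilon_r is the rule's variation bound.
% If training enforces \hat{f}(x') - \hat{f}(x) = \Delta_r, then
\begin{align*}
e(x') - e(x)
  &= \bigl(\hat{f}(x') - \hat{f}(x)\bigr) - \bigl(y(x') - y(x)\bigr) \\
  &= \Delta_r - (\Delta_r + \delta) = -\delta,
\qquad\text{so}\quad \lvert e(x') - e(x)\rvert \le \varepsilon_r .
\end{align*}
```

Under this sketch, errors can drift across rule-related molecules by at most $\varepsilon_r$ per substitution, so rules with tighter variation bounds keep predictions anchored more firmly, consistent with the stated positive correlation between the variation bound and model error.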
In summary, MolRuleLoss offers a chemically informed, loss‑function‑level regularization that substantially boosts both accuracy and OOD generalization across multiple regression tasks and model families. Its plug‑and‑play nature makes it readily applicable to existing pipelines, and its ability to encode domain knowledge addresses a key limitation of purely data‑driven approaches. The framework therefore represents a significant step toward more reliable AI‑assisted drug discovery, cheminformatics, and broader molecular property prediction applications.