SMILES-Mamba: Chemical Mamba Foundation Models for Drug ADMET Prediction
In drug discovery, predicting the absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties of small-molecule drugs is critical for ensuring safety and efficacy. However, accurate prediction of these properties is resource-intensive and typically requires extensive experimental data. To address this challenge, we propose SMILES-Mamba, a two-stage model that leverages both unlabeled and labeled data through a combination of self-supervised pretraining and fine-tuning. The model first pre-trains on a large corpus of unlabeled SMILES strings to capture the underlying chemical structures and relationships, and is then fine-tuned on smaller, labeled datasets specific to ADMET tasks. Our results show that SMILES-Mamba achieves competitive performance across 22 ADMET datasets, obtaining the highest score on 14 tasks, highlighting the potential of self-supervised learning for molecular property prediction. This approach not only improves prediction accuracy but also reduces dependence on large labeled datasets, offering a promising direction for future research in drug discovery.
💡 Research Summary
The paper introduces SMILES‑Mamba, a two‑stage foundation model designed to improve the prediction of ADMET (absorption, distribution, metabolism, excretion, toxicity) properties for small‑molecule drugs. In the first stage, the authors pre‑train a Mamba‑based sequence model on a large corpus of unlabeled SMILES strings sampled from the ZINC database (250 K molecules). The pre‑training objective is next‑token prediction, allowing the model to learn statistical regularities and structural patterns inherent in SMILES without any property labels. The backbone, Mamba, is a specialized implementation of the Structured State Space Sequence (S4) architecture, which captures long‑range dependencies with linear‑time complexity (O(N)) as opposed to the quadratic cost of standard Transformers. This makes it well‑suited for handling the often long SMILES strings while keeping memory and compute requirements modest.
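The next-token objective described above can be illustrated with a small sketch. The character-level tokenizer, special tokens, and toy corpus below are assumptions for illustration; the paper's actual tokenizer and the Mamba backbone that consumes these (input, target) pairs may differ.

```python
# Sketch of the self-supervised next-token pre-training setup on SMILES.
# No property labels are needed: targets are just the inputs shifted by one.

def build_vocab(smiles_corpus):
    """Map every character seen in the corpus, plus special tokens, to an id."""
    chars = sorted({ch for s in smiles_corpus for ch in s})
    vocab = {"<bos>": 0, "<eos>": 1}
    vocab.update({ch: i + 2 for i, ch in enumerate(chars)})
    return vocab

def encode(smiles, vocab):
    """Wrap a SMILES string in <bos>/<eos> markers and convert to token ids."""
    return [vocab["<bos>"]] + [vocab[ch] for ch in smiles] + [vocab["<eos>"]]

def next_token_pairs(smiles, vocab):
    """At each position the model predicts the following token:
    inputs are ids[:-1], targets are ids[1:]."""
    ids = encode(smiles, vocab)
    return ids[:-1], ids[1:]

corpus = ["CCO", "c1ccccc1", "CC(=O)O"]   # toy stand-in for the ZINC SMILES
vocab = build_vocab(corpus)
x, y = next_token_pairs("CCO", vocab)     # x and y differ by a one-step shift
```

A Mamba (or any causal sequence) model would then be trained with cross-entropy between its per-position logits and `y`; the linear-time state-space recurrence is what keeps this cheap for long SMILES strings.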
After pre‑training, the model is fine‑tuned on 22 publicly available ADMET datasets covering absorption (Caco2, HIA, Pgp, Bioav, Lipo, AqSol), distribution (BBB, PPBR, VD), metabolism (multiple CYP inhibition and substrate assays), and excretion (Half‑Life, CL‑Micro, CL‑Hepa). These datasets contain both binary classification and continuous regression tasks, with sample sizes ranging from a few hundred to several thousand. Fine‑tuning adapts the generic chemical representations learned during pre‑training to the specific downstream tasks, dramatically reducing the amount of labeled data required for high performance.
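As a simplified stand-in for this fine-tuning stage, the sketch below trains a task-specific linear head on top of fixed molecule embeddings, switching between a logistic loss for binary endpoints (e.g. HIA, BBB) and squared error for continuous ones (e.g. Caco2, Lipo). The embedding dimension, toy data, and plain gradient-descent loop are assumptions; the paper fine-tunes the full Mamba backbone end-to-end rather than a frozen linear probe.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_head(embeddings, labels, task="classification", lr=0.1, steps=1000):
    """Train a linear head on pre-trained embeddings.

    For both the logistic loss (classification) and the squared-error loss
    (regression), the gradient w.r.t. the pre-activation z is (pred - labels),
    so one update rule covers both ADMET task types."""
    n, d = embeddings.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        z = embeddings @ w + b
        pred = sigmoid(z) if task == "classification" else z
        grad = (pred - labels) / n
        w -= lr * embeddings.T @ grad
        b -= lr * grad.sum()
    return w, b

# Toy data: 64 "molecules" with 8-dim embeddings and a linearly separable label.
X = rng.normal(size=(64, 8))
y = (X[:, 0] > 0).astype(float)
w, b = fit_head(X, y)
acc = ((sigmoid(X @ w + b) > 0.5) == y).mean()
```

In the full model the small labeled set updates the backbone as well, which is where the pre-trained chemical representations pay off on the low-data CYP tasks.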
Experimental results show that SMILES‑Mamba achieves the top score on 14 out of the 22 tasks, outperforming a suite of strong baselines including graph neural networks (GCN, GAT), Transformer‑based language models (MolBERT, ChemBERTa), and other recent self‑supervised approaches. The gains are especially pronounced on tasks with limited labeled data, such as CYP inhibition and substrate prediction, highlighting the benefit of the self‑supervised pre‑training. In addition to accuracy, the authors report favorable training efficiency: Mamba’s linear‑time state‑space updates lead to lower GPU memory consumption and faster convergence compared with attention‑based models.
The authors also discuss practical considerations. The SMILES‑based representation avoids the need for graph construction and can be processed directly by standard tokenizers, simplifying data pipelines. The model and code will be released after acceptance, facilitating reproducibility and further research. Potential extensions include multi‑task learning across ADMET endpoints, integration with reaction prediction models, and application to drug repurposing or virtual screening campaigns.
In summary, SMILES‑Mamba demonstrates that a self‑supervised, state‑space sequence model trained on massive unlabeled SMILES data can serve as a powerful foundation for downstream ADMET prediction, delivering state‑of‑the‑art performance while reducing reliance on costly experimental labels. This work points toward a new paradigm in computational drug discovery where large‑scale unsupervised learning complements traditional cheminformatics and accelerates the identification of safe and effective drug candidates.