IUPAC-Induced Computational Approaches for Identifying Boosters of Small Biomolecule Functionality: A Case Study of Human Tyrosyl-DNA Phosphodiesterase 1 (TDP1) Inhibitors
This paper introduces several proof-of-concept (PoC) computational methods intended to offer biochemical researchers straightforward, time- and cost-effective strategies to accelerate their work. Although Machine Learning (ML) models were developed, the study’s central purpose was to explore approaches for identifying the functional groups/fragments in small biomolecules that are desirable for a specific functionality, in this case inhibition of human tyrosyl-DNA phosphodiesterase 1 (TDP1). This was achieved primarily by tokenising IUPAC names to generate features. Additionally, a CID_SID ML model for predicting TDP1 activity was developed and its applicability explored. Since these computational approaches were not experimentally validated due to a lack of appropriate laboratory facilities, they are presented as open proposals for further laboratory investigation.
💡 Research Summary
The manuscript presents a proof‑of‑concept study that leverages IUPAC systematic names as a source of machine‑learning features for the discovery of small‑molecule modulators, using human tyrosyl‑DNA phosphodiesterase 1 (TDP1) inhibition as a case study. The authors propose three complementary computational pipelines: (1) tokenisation of IUPAC names to extract discrete functional groups, conversion of these groups into a binary feature matrix, and training of a Random Forest Classifier (RFC) to predict TDP1 activity; (2) ranking of functional groups by importance derived from the RFC model, thereby providing a “booster” list for medicinal chemists; and (3) a CID‑SID based model that uses PubChem compound identifiers (CID) and substance identifiers (SID) as implicit structural and contextual descriptors, reproducing a previously published workflow for other target families.
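The second pipeline, ranking tokens by Random Forest importance, can be illustrated with a minimal sketch. It assumes scikit-learn is installed; the three-token vocabulary, the toy matrix, and the labels are hypothetical values invented for illustration and are not taken from the study.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical token vocabulary and binary presence/absence matrix
# (rows = compounds, columns = tokens). Illustrative only, not study data.
tokens = ["imidazo", "phenyl", "methyl"]
X = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 1],
    [1, 1, 0],
    [0, 1, 1],
    [0, 1, 0],
    [0, 1, 1],
    [0, 1, 0],
]
# Hypothetical labels: 1 = active, 0 = inactive.
y = [1, 1, 1, 1, 0, 0, 0, 0]

# Fit a Random Forest; random_state is fixed only for repeatability.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X, y)

# Pair each token with its impurity-based importance and sort descending.
ranking = sorted(
    zip(tokens, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for token, score in ranking:
    print(f"{token:10s} {score:.3f}")
```

Impurity-based importance is unsigned: a high score says a token helps separate the classes, not whether it favours activity or inactivity. Reading a ranking as a "booster" list therefore needs an extra check, such as comparing the active fraction among compounds with and without the token.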
Data were drawn from PubChem BioAssay AID 686978 (≈425 k compounds, three activity classes) and merged with AID 1996 (solubility data) to mitigate the severe class imbalance. After curation (duplicate removal, consistency checks, literature cross‑validation), the authors retrieved IUPAC names for all compounds via bulk CID queries. Tokenisation retained only substrings of four or more characters, assuming they correspond to meaningful chemical fragments (e.g., “phenyl”, “imidazo”). Exact string matching generated a binary presence/absence matrix, which was merged with the activity labels. The RFC was trained on a balanced training set, and the test set was also artificially balanced (equal numbers of active, inactive, and inconclusive compounds) to avoid the accuracy paradox. Feature‑importance analysis highlighted several functional groups that correlate positively with TDP1 inhibition.
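The tokenisation and matrix-building steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: splitting names on non-letter characters is an assumed delimiter rule, the helper names `tokenise_iupac` and `binary_matrix` are hypothetical, and the three example names were chosen only for demonstration. The four-character threshold is the one stated above.

```python
import re
from typing import List, Set, Tuple

MIN_TOKEN_LEN = 4  # threshold stated in the summary


def tokenise_iupac(name: str, min_len: int = MIN_TOKEN_LEN) -> Set[str]:
    """Split a name on non-letter characters (assumed rule) and keep
    fragments of at least min_len characters."""
    fragments = re.split(r"[^a-z]+", name.lower())
    return {frag for frag in fragments if len(frag) >= min_len}


def binary_matrix(names: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Build a sorted vocabulary and a presence/absence row per name."""
    token_sets = [tokenise_iupac(n) for n in names]
    vocabulary = sorted(set().union(*token_sets))
    # Exact matching: a token counts only if it appears as a whole fragment.
    rows = [[int(tok in toks) for tok in vocabulary] for toks in token_sets]
    return vocabulary, rows


# Illustrative names only; not drawn from the assay data.
names = [
    "1-phenyl-1H-imidazole",
    "N-methyl-1-phenylmethanamine",
    "2-amino-3-phenylpropanoic acid",
]
vocab, matrix = binary_matrix(names)
print(vocab)
for row in matrix:
    print(row)
```

In this sketch the second name yields the token "phenylmethanamine" rather than "phenyl", so the phenyl column stays 0 for a compound that does contain a phenyl group. That is the kind of miss that exact string matching produces.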
The CID‑SID model follows a previously validated approach that treats CID as a canonical structural key and SID as a link to experimental provenance. By feeding these identifiers into a gradient‑boosting or random‑forest framework, the model predicts TDP1 inhibition for compounds originally designed for unrelated targets.
Strengths of the work include the novel use of IUPAC names as a low‑cost, readily available source of chemical information, and the transparent, reproducible pipeline built with scikit‑learn. However, several limitations temper the impact. The tokenisation strategy relies on exact string matches, which can miss partial matches (e.g., “amino” embedded in longer names) and is vulnerable to non‑standard or ambiguous IUPAC nomenclature. Merging the primary assay with a solubility dataset introduces a deliberate selection bias toward compounds with known aqueous properties, potentially narrowing chemical space. The balanced test set, while useful for methodological illustration, does not reflect the highly imbalanced nature of real high‑throughput screens. Crucially, the manuscript does not report quantitative performance metrics (ROC‑AUC, F1, precision‑recall), making it impossible to assess the predictive power of either model. No experimental validation (biochemical or cellular assays) was performed, so the risk of false positives/negatives remains unquantified. Finally, IUPAC‑derived features cannot capture stereochemistry, charge, or electronic effects that are often decisive for enzyme inhibition.
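The point about imbalanced screens and missing metrics can be made concrete with a small worked example. The sketch below uses only the standard library; the 10-in-1000 active ratio is a hypothetical figure chosen for illustration, not the ratio in the assay, and `binary_metrics` is a helper written for this example.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 for the positive class (1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical imbalanced screen: 10 actives among 1000 compounds.
y_true = [1] * 10 + [0] * 990
# A trivial "model" that labels every compound inactive.
always_inactive = [0] * 1000

metrics = binary_metrics(y_true, always_inactive)
print(metrics)
```

Under these assumed numbers the trivial predictor is 99% accurate yet finds no actives, so its recall and F1 are zero. This is why accuracy measured on a balanced test set says little about behaviour on a real screen, and why class-sensitive metrics are needed.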
In summary, the study demonstrates that systematic name tokenisation can generate interpretable binary descriptors useful for rapid, inexpensive screening and for highlighting functional groups of interest. To become a practical tool for drug discovery, future work should (i) integrate structure‑based descriptors (SMILES, SELFIES, molecular fingerprints) alongside IUPAC tokens, (ii) benchmark the RFC against more sophisticated models (e.g., graph neural networks, LLM‑derived embeddings), (iii) evaluate performance on realistic, imbalanced datasets, and (iv) validate top predictions experimentally. With these enhancements, the proposed IUPAC‑induced pipelines could serve as valuable adjuncts in early‑stage hit identification, accelerating the discovery of TDP1 inhibitors and, by extension, other therapeutic targets.