ChemCLIP: Bridging Organic and Inorganic Anticancer Compounds Through Contrastive Learning

The discovery of anticancer therapeutics has traditionally treated organic small molecules and metal-based coordination complexes as separate chemical domains, limiting knowledge transfer despite their shared biological objectives. This disparity is …

Authors: Mohamad Koohi-Moghadam, Hongzhe Sun, Hongyan Li

Mohamad Koohi-Moghadam 1,*, Hongzhe Sun 2, Hongyan Li 2, Kyongtae Tyler Bae 1

1 Department of Diagnostic Radiology, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pok Fu Lam Road, Hong Kong SAR, PR China
2 Department of Chemistry, The University of Hong Kong, Pok Fu Lam Road, Hong Kong S.A.R., PRC
* Corresponding author: M.K (koohi@hku.hk)

Abstract

The discovery of anticancer therapeutics has traditionally treated organic small molecules and metal-based coordination complexes as separate chemical domains, limiting knowledge transfer despite their shared biological objectives. This disparity is particularly pronounced in available data, with extensive screening databases for organic compounds compared to only a few thousand characterized metal complexes. Here, we introduce ChemCLIP, a dual-encoder contrastive learning framework that bridges this organic-inorganic divide by learning unified representations based on shared anticancer activities rather than structural similarity. We compiled complementary datasets comprising 44,854 unique organic compounds and 5,164 unique metal complexes, standardized across 60 cancer cell lines. By training parallel encoders with activity-aware hard negative mining, we mapped structurally distinct compounds into a shared 256-dimensional embedding space where biologically similar compounds cluster together regardless of chemical class. We systematically evaluated four molecular encoding strategies (Morgan fingerprints, ChemBERTa, MolFormer, and Chemprop) through quantitative alignment metrics, embedding visualizations, and downstream classification tasks. Morgan fingerprints achieved superior performance with an average alignment ratio of 0.899 and downstream classification AUCs of 0.859 (inorganic) and 0.817 (organic).
The high classification performance using frozen embeddings, without task-specific fine-tuning, demonstrates that biologically meaningful features are encoded in the learned representations, enabling practical applications in computational drug screening. This work establishes contrastive learning as an effective strategy for unifying disparate chemical domains and provides empirical guidance for encoder selection in multi-modal chemistry applications, with implications extending beyond anticancer drug discovery to any scenario requiring cross-domain chemical knowledge transfer.

Introduction

The development of anticancer therapeutics has traditionally progressed along two parallel but largely independent paths: organic small molecules and metal-based coordination complexes [1, 2]. While organic compounds dominate the pharmaceutical landscape due to decades of systematic screening and optimization, metal-based drugs, exemplified by the clinical success of platinum-based chemotherapeutics, offer distinct mechanisms of action and potential advantages in addressing drug resistance [3, 4]. However, the vast disparity in available data between these domains presents a significant challenge for computational drug discovery. The National Cancer Institute's NCI60 screening program has generated activity data for over 44,000 organic compounds [5], whereas comprehensive databases of metal complexes contain only a few thousand compounds. This data imbalance, combined with the fundamental structural differences between organic molecules and coordination complexes, has traditionally necessitated separate computational models and limited knowledge transfer between organic and inorganic domains [6-10].

Recent advances in contrastive learning, most notably demonstrated by CLIP (Contrastive Language-Image Pre-training) [11] in the vision-language domain, suggest a promising approach to bridging such disparate modalities [12].
By learning to align representations based on shared functional properties rather than structural similarity, contrastive learning frameworks can discover meaningful correspondences between fundamentally different data types. In the context of anticancer drug discovery, this paradigm shift offers an opportunity to create unified computational models that leverage the extensive knowledge embedded in organic compound databases to inform predictions about metal-based therapeutics, while simultaneously enabling direct comparison and ranking of structurally diverse candidates based on their biological activities.

This study introduces ChemCLIP, a dual-encoder contrastive learning framework specifically designed to bridge the organic-inorganic divide in anticancer compound space. We compiled complementary datasets comprising 44,854 unique organic compounds from the NCI60 [5] screening program and 5,164 unique metal complexes from MetalCytoToxDB [13], standardized across 60 shared cancer cell lines. By training parallel encoders to map these structurally distinct compound classes into a unified 256-dimensional embedding space, we hypothesized that compounds exhibiting similar biological activities would cluster together regardless of their chemical class. To rigorously evaluate this approach, we compared four molecular encoding strategies (Morgan fingerprints [14], ChemBERTa [15], MolFormer [16], and Chemprop [17]), assessing their ability to learn activity-relevant representations through quantitative alignment metrics, embedding space visualizations, and downstream classification performance.

Our work addresses three fundamental questions: (1) Can contrastive learning successfully align organic and inorganic chemical spaces based on shared biological activities? (2) Which molecular encoding strategies best support cross-domain generalization in the context of metal complexes and organic molecules?
(3) Does the learned embedding space provide effective representations for downstream tasks such as compound activity classification? The answers to these questions have direct implications for computational drug screening pipelines and broader applications of multi-modal learning in chemistry, where bridging diverse chemical domains could accelerate discovery across multiple therapeutic areas.

Method

Dataset Preparation

We compiled two complementary datasets to enable contrastive learning between organic and inorganic anticancer compounds. The organic compound dataset was derived from the National Cancer Institute's NCI60 Human Tumor Cell Lines Screen [5], a publicly available database containing dose-response data for thousands of small molecules tested against 60 human cancer cell lines representing nine tissue types. The inorganic compound dataset was obtained from MetalCytoToxDB, a specialized database of metal-based anticancer compounds across various cancer cell lines.

Organic Dataset Processing. Compounds were classified as active or inactive based on their mean growth inhibition percentage across tested concentrations (threshold < 50). Lower growth inhibition percentage values indicate stronger growth inhibitory activity, making this threshold appropriate for identifying compounds with meaningful anticancer effects. SMILES representations were retrieved from PubChem using NSC (National Service Center) identifiers. Compounds lacking valid SMILES structures were excluded from further analysis. Metal-containing compounds identified in the NCI60 database were extracted and transferred to the inorganic dataset to ensure clear separation between organic and inorganic chemical spaces. To ensure compatibility between organic and inorganic datasets, we filtered the NCI60 data to retain only the 60 cell lines that were present in both datasets.
This standardization was essential for enabling meaningful contrastive learning across chemical classes.

Inorganic Dataset Processing. Cell line names in MetalCytoToxDB [13] were standardized to match NCI60 nomenclature. We created a systematic mapping between database-specific cell line identifiers and standard NCI60 cell line names. Entries without valid NCI60 cell line mappings were removed to maintain dataset consistency. Metal complexes were classified by IC50: compounds with IC50 < 10 µM were labeled as active. For each metal complex, we extracted chemical features including metal type (one-hot encoded for Ru, Ir, Re, Os, Rh, Cu, Pt, Au, Co, and Ti), oxidation state, atomic number, and valence electron count. These features enabled the model to learn metal-specific pharmacological patterns.

Dataset Statistics

Inorganic Dataset. The processed inorganic dataset comprised 13,656 records representing 5,164 unique metal complexes tested across 60 cell lines. The activity distribution showed 3,165 active records (23.18%) and 10,491 inactive records (76.82%), yielding an active-to-inactive ratio of 1:3.3. The dataset was dominated by ruthenium complexes (52.05%), followed by titanium (17.57%) and iridium (9.89%), with ten transition metals represented in total. The 60 cell lines covered nine major cancer types, with substantial representation from colon cancer, melanoma, ovarian cancer, and leukemia. Notably, 60.65% of records originated from MetalCytoToxDB, while 39.35% came from metal compounds screened in the NCI60 program (Tables S1-S3).

Organic Dataset. The processed organic dataset contained 1,812,339 records representing 44,854 unique organic compounds tested across 71 cell lines. The activity distribution was more imbalanced than in the inorganic dataset, with 44,835 active records (2.47%) and 1,767,504 inactive records (97.53%), resulting in an active-to-inactive ratio of 1:39.4.
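As a quick arithmetic check on the class balance figures reported above, the sketch below recomputes the active percentage and the active-to-inactive ratio from the stated record counts. The helper name `class_balance` is purely illustrative and is not taken from the ChemCLIP code.

```python
from typing import Tuple


def class_balance(n_active: int, n_inactive: int) -> Tuple[float, float]:
    """Return (percentage of active records, inactive records per active record)."""
    total = n_active + n_inactive
    return 100.0 * n_active / total, n_inactive / n_active


# Record counts as stated in the dataset statistics.
inorganic_pct, inorganic_ratio = class_balance(3_165, 10_491)
organic_pct, organic_ratio = class_balance(44_835, 1_767_504)

print(f"inorganic: {inorganic_pct:.2f}% active, ratio 1:{inorganic_ratio:.1f}")
print(f"organic:   {organic_pct:.2f}% active, ratio 1:{organic_ratio:.1f}")
```

Rounded to the precision used in the text, the counts reproduce the reported 23.18% and 1:3.3 for the inorganic records and 2.47% and 1:39.4 for the organic records.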
The dataset showed broad coverage across cancer types, with colon cancer (17.37%), melanoma (16.56%), and ovarian cancer (14.16%) being the most represented. On average, each unique compound was tested against 40.4 cell lines, demonstrating extensive cross-panel screening that enables robust activity pattern learning. The dataset's compound diversity, with an average of approximately 631 compounds tested per cell line, provided comprehensive chemical space coverage for training the contrastive learning model (Tables S4 and S5).

Model Architecture and Training Methodology

ChemCLIP implements a dual-encoder contrastive learning architecture adapted from CLIP (Contrastive Language-Image Pre-training) for bridging inorganic and organic chemical compound spaces. The model learns a shared embedding space where compounds exhibiting similar biological activities across cancer cell lines achieve high similarity regardless of their chemical class (organic versus inorganic). This approach enables cross-domain retrieval, knowledge transfer from the larger organic compound database to metal-based drugs, and downstream activity prediction tasks.

Dual-Encoder Architecture. The model comprises two parallel encoding branches that process inorganic metal complexes and organic compounds separately before projecting them into a unified 256-dimensional embedding space. The inorganic branch incorporates specialized metal feature integration, while the organic branch processes standard molecular representations. Both branches employ identical projection architectures to ensure symmetry in the learned embedding space.

We evaluated four distinct molecular encoding strategies to identify the optimal representation for our contrastive learning framework. Morgan fingerprints (radius=2, 2048 bits) provided a cheminformatics baseline with fast computation but fixed representations.
ChemBERTa and MolFormer represented transformer-based approaches pre-trained on large chemical databases. For graph-based molecular encoding, we implemented Chemprop's Directed Message Passing Neural Network (D-MPNN) [18].

The inorganic encoder enhanced base molecular representations with metal-specific information critical for understanding metal complex pharmacology. For each metal complex, the ligand SMILES was processed through the base encoder, while metal features were separately encoded. Metal type was represented through one-hot encoding across 10 transition metals (Au, Co, Cu, Ir, Os, Pt, Re, Rh, Ru, Ti), supplemented by scalar features for oxidation state and atomic number. This architecture enabled the model to learn metal-specific pharmacological patterns while maintaining compatibility with organic compound representations.

Contrastive Learning with Hard Negative Mining. The model training employed a two-component loss function that combined standard contrastive learning with activity-aware hard negative mining. The first component was the bidirectional InfoNCE contrastive loss [19]. Within each training batch, we computed a similarity matrix between all inorganic and organic compound embeddings using normalized dot products scaled by a temperature parameter (τ = 0.07). Each inorganic compound was paired with exactly one organic compound from the same cell line, forming the positive pairs represented by diagonal elements in the similarity matrix. All other inorganic-organic combinations within the batch served as negative samples. The loss function encouraged high similarity between positive pairs while simultaneously pushing apart negative pairs through bidirectional cross-entropy computed in both inorganic-to-organic and organic-to-inorganic directions. The second component introduced activity-aware hard negative mining through triplet margin loss [20].
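The bidirectional InfoNCE term (the first loss component) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, embeddings are plain lists of floats, row i of each input is assumed to form a positive pair, and the two directions are averaged (the text does not say whether they are averaged or summed).

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def diagonal_cross_entropy(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def bidirectional_info_nce(
    inorganic: Sequence[Sequence[float]],
    organic: Sequence[Sequence[float]],
    temperature: float = 0.07,  # tau as stated in the text
) -> float:
    """Average of inorganic-to-organic and organic-to-inorganic InfoNCE."""
    a = [l2_normalize(v) for v in inorganic]
    b = [l2_normalize(v) for v in organic]
    # Similarity matrix: normalized dot products scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(u, w)) / temperature for w in b] for u in a]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


# Toy batch: positives on the diagonal versus deliberately mismatched pairs.
aligned = bidirectional_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = bidirectional_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

With a small temperature such as 0.07, well-aligned pairs drive the loss close to zero, while a batch in which all embeddings are identical yields log(batch size), the value of an uninformative similarity matrix.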
Unlike the random negatives in the contrastive loss, hard negatives were specifically selected based on activity labels to create more challenging learning scenarios. For each active inorganic compound, we selected an active organic compound from the same cell line as the positive example and an inactive organic compound from the same cell line as the hard negative. This pairing strategy forced the model to distinguish between active and inactive compounds within the same biological context, rather than relying on easier distinctions between compounds from different cell lines. The triplet loss enforced that the similarity between an inorganic compound and its active organic match should exceed the similarity to its inactive organic counterpart. The final training objective summed the contrastive loss and triplet loss with equal weighting, enabling the model to learn both broad chemical similarity patterns and fine-grained activity-specific discriminations.

Training Strategy and Data Management. To address the severe class imbalance in the organic dataset (2.47% active versus 97.53% inactive), we retained all active compounds while subsampling inactive compounds at a 5:1 ratio relative to active compounds. We implemented compound-based data splitting to ensure rigorous evaluation of generalization to unseen chemical structures. Unique compounds from both datasets were randomly partitioned into 70% training, 15% validation, and 15% test sets. All compound-cell line pairs were assigned to their respective compound's split, preventing data leakage through compound replication across cell lines. This strategy provided stricter evaluation of model performance on completely novel chemical structures compared to random splitting of compound-cell line pairs.

Model training employed the AdamW optimizer with learning rate 1×10⁻³, default weight decay (0.01), and beta parameters (0.9, 0.999).
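The activity-aware triplet term and the compound-level split described above can be sketched as follows. Both functions are hypothetical illustrations rather than the authors' code: the margin value of 0.2 and the use of cosine similarity are assumptions (the text gives neither), and the compound identifiers and the two cell line names in the usage lines are placeholders.

```python
import math
import random
from typing import Dict, Iterable, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm


def triplet_similarity_loss(
    anchor: Sequence[float],    # active inorganic compound
    positive: Sequence[float],  # active organic compound, same cell line
    negative: Sequence[float],  # inactive organic compound, same cell line
    margin: float = 0.2,        # assumed value, not stated in the text
) -> float:
    """Zero once sim(anchor, positive) exceeds sim(anchor, negative) by `margin`."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))


def split_by_compound(
    compound_ids: Iterable[str], train: float = 0.70, val: float = 0.15, seed: int = 0
) -> Dict[str, str]:
    """Assign each unique compound to exactly one of train / val / test."""
    unique = sorted(set(compound_ids))
    random.Random(seed).shuffle(unique)
    n_train = round(train * len(unique))
    n_val = round(val * len(unique))
    return {
        cid: "train" if i < n_train else "val" if i < n_train + n_val else "test"
        for i, cid in enumerate(unique)
    }


# Every (compound, cell line) record inherits the split of its compound,
# so no compound appears in more than one partition.
records = [(f"cpd{i}", cell) for i in range(20) for cell in ("A549", "MCF7")]
assignment = split_by_compound(cid for cid, _ in records)
record_splits = [assignment[cid] for cid, _ in records]
```

Splitting on compound identifiers first, and only then mapping records to partitions, is what prevents the same structure from being seen in training through one cell line and in testing through another.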
Training proceeded for 100 epochs with batch size 128, gradient clipping at maximum norm 1.0, and dropout rate 0.1. This configuration balanced training stability with efficient convergence across all encoder architectures evaluated.

Results

Embedding Space Visualization

To evaluate the quality of learned embeddings and assess the effectiveness of different molecular encoders, we performed t-SNE (t-distributed Stochastic Neighbor Embedding) [21] visualization of compound representations both before and after contrastive training. The visualizations compared four molecular encoding strategies (Morgan fingerprints, ChemBERTa, MolFormer, and Chemprop) across two conditions: initial frozen encoders, and trained models projecting both compound types into the shared 256-dimensional embedding space (Figure 1).

The four encoders exhibited markedly different capabilities in learning task-relevant representations through contrastive training. Morgan fingerprints demonstrated superior embedding quality, achieving clear separation between active and inactive compounds within both inorganic and organic chemical spaces after training. This separation was evident across both compound types, indicating that the model successfully learned activity-relevant patterns that generalized across the organic-inorganic divide.

ChemBERTa and MolFormer, both transformer-based architectures pre-trained on large chemical databases, showed intermediate performance. While these encoders achieved reasonable separation between active and inactive compounds after training, the clustering was less pronounced compared to Morgan fingerprints.

Chemprop's performance revealed a significant limitation in the contrastive learning paradigm. The trained Chemprop embeddings primarily separated inorganic compounds from organic compounds but failed to achieve meaningful separation between active and inactive compounds within each chemical class.
This suggests that the graph-based message-passing architecture, while effective for capturing molecular structure, struggled to learn fine-grained activity distinctions under the dual-encoder contrastive framework.

[Figure 1 panels: columns "Before Training" and "Trained Model"; rows MorganFingerprint, ChemBERTa, MolFormer, Chemprop]

Figure 1. Comparison of four molecular encoders. Left panels show encoder embeddings before training and right panels show trained ChemCLIP embeddings with joint t-SNE in the shared 256-dimensional space. Colors indicate compound type (inorganic: red squares, organic: blue circles) and activity (darker: active, lighter: inactive). MorganFingerprint shows the clearest separation between active and inactive compounds in both types.

Quantitative Assessment of Cross-Modal Alignment

To quantitatively assess the quality of cross-modal alignment in the learned embedding space, we performed a comprehensive statistical analysis using cluster center distances. This analysis directly tests the central hypothesis of ChemCLIP: compounds with similar biological activities should be closer in the shared embedding space, regardless of whether they are inorganic or organic. We computed centroids for four distinct groups, Inorganic Active (IA), Inorganic Inactive (II), Organic Active (OA), and Organic Inactive (OI), and defined two complementary metrics, where d(X, Y) denotes the distance between the centroids of groups X and Y.

Alignment Ratio (lower is better) measures whether same-activity compounds from different modalities are closer than different-activity compounds:

$$\text{Alignment Ratio} = \frac{1}{2}\left[\frac{d(\mathrm{IA},\mathrm{OA})}{d(\mathrm{IA},\mathrm{OI})} + \frac{d(\mathrm{II},\mathrm{OI})}{d(\mathrm{II},\mathrm{OA})}\right]$$

Values below 1.0 indicate successful alignment.

Cross-Modal Separation Ratio (higher is better) quantifies the overall separation between different-activity and same-activity pairs:

$$\text{Separation Ratio} = \frac{\operatorname{Avg}\left[d(\mathrm{IA},\mathrm{OI}),\ d(\mathrm{II},\mathrm{OA})\right]}{\operatorname{Avg}\left[d(\mathrm{IA},\mathrm{OA}),\ d(\mathrm{II},\mathrm{OI})\right]}$$

Values above 1.0 indicate meaningful activity-based clustering.
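The two centroid-based metrics defined above can be computed with a few lines of plain Python. The sketch below is illustrative only: the function names are hypothetical, Euclidean distance between centroids is an assumption (the text does not name the distance), and the toy 2-D points stand in for 256-dimensional embeddings.

```python
import math
from typing import Dict, List, Sequence


def centroid(points: Sequence[Sequence[float]]) -> List[float]:
    """Coordinate-wise mean of a group of embeddings."""
    return [sum(col) / len(points) for col in zip(*points)]


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def alignment_metrics(groups: Dict[str, Sequence[Sequence[float]]]) -> Dict[str, float]:
    """Alignment and separation ratios from the IA, II, OA and OI centroids."""
    c = {name: centroid(points) for name, points in groups.items()}

    def d(a: str, b: str) -> float:
        return euclidean(c[a], c[b])

    active_ratio = d("IA", "OA") / d("IA", "OI")
    inactive_ratio = d("II", "OI") / d("II", "OA")
    return {
        "alignment": 0.5 * (active_ratio + inactive_ratio),
        "separation": (d("IA", "OI") + d("II", "OA")) / (d("IA", "OA") + d("II", "OI")),
    }


# Toy embeddings in which activity (second coordinate) separates groups
# more strongly than compound type (first coordinate).
toy = {
    "IA": [(-1.0, 0.0), (1.0, 0.0)],
    "OA": [(0.0, 0.0), (2.0, 0.0)],
    "II": [(-1.0, 3.0), (1.0, 3.0)],
    "OI": [(0.0, 3.0), (2.0, 3.0)],
}
metrics = alignment_metrics(toy)
print(metrics["alignment"] < 1.0, metrics["separation"] > 1.0)
```

If the embedding space preserves compound type but ignores activity, so that the active and inactive centroids coincide within each modality, all four cross-modal distances are equal and both ratios come out as exactly 1.0.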
Figure 2 presents the alignment and separation ratios for all four encoders. MorganFingerprint achieved the best performance (average alignment ratio: 0.899, separation ratio: 1.127), followed by ChemBERTa (0.903, 1.119) and MolFormer (0.920, 1.093). All three encoders achieved alignment ratios below 1.0 and separation ratios above 1.0, confirming successful cross-modal alignment. In contrast, Chemprop showed ratios of exactly 1.000, indicating complete failure to organize the embedding space by activity.

The cross-modal distance matrix reveals that for MorganFingerprint, inorganic active compounds are 21.3% closer to organic active compounds (d = 1.147) than to organic inactive compounds (d = 1.457), yielding an active alignment ratio of 0.787. ChemBERTa exhibited similar behavior (0.796), while MolFormer showed weaker alignment (0.837). The uniform distances (~1.265) for Chemprop across all cluster pairs indicate embedding space collapse, where only compound type is preserved without activity discrimination.

The combined performance score integrates both metrics: MorganFingerprint (0.228) > ChemBERTa (0.216) > MolFormer (0.174) > Chemprop (0.000). While differences between MorganFingerprint and ChemBERTa are modest (5.6%), MorganFingerprint consistently outperforms across all metrics.

Figure 2: Quantitative Comparison of Cross-Modal Alignment Analysis. (A) Alignment ratios for all encoders; lower values indicate better alignment of same-activity compounds. (B) Cross-modal separation ratios; higher values indicate better separation of different-activity compounds. (C) Distance heatmap showing all pairwise cluster center distances. (D) Combined performance scores. MorganFingerprint consistently outperforms other encoders. Abbreviations: IA, Inorganic Active; II, Inorganic Inactive; OA, Organic Active; OI, Organic Inactive.
Downstream Classification Performance

To evaluate the discriminative power and practical utility of the learned embedding space, we trained binary classifiers on the frozen ChemCLIP embeddings to predict compound activity (active vs. inactive). This downstream task directly tests whether the 256-dimensional shared embedding space captures biologically relevant features that enable accurate activity prediction without further encoder fine-tuning. We trained separate classifiers for inorganic and organic compounds using the same test split employed during contrastive learning, ensuring that test set compounds were completely unseen during both contrastive training and classification training.

Classification Methodology. For each encoder, we trained two independent binary classifiers, one for inorganic compounds and one for organic compounds, using a simple three-layer multilayer perceptron architecture [22]. Classifiers were trained using the same compound-based splits employed during contrastive learning, ensuring test set compounds remained completely unseen during both training phases. Encoder weights were frozen throughout classifier training to evaluate embedding quality independently of task-specific adaptation.

The datasets exhibit severe class imbalance, with only 2.5% of organic compounds and 23.2% of inorganic compounds labeled as active. To address this challenge, we applied weighted binary cross-entropy loss where positive class weights equal the ratio of negative to positive samples, and optimized classification thresholds on the validation set using F1 score rather than default probability cutoffs. We employed AUC-ROC [23] as the primary evaluation metric due to its robustness to class imbalance, complemented by F1 score to assess precision-recall balance [24].

Table 1: Overall Classification Performance.
* = best in column; AUC = Area Under ROC Curve; F1 = F1 Score; Acc = Accuracy

Encoder            Inorganic AUC  Inorganic F1  Inorganic Acc  Organic AUC  Organic F1  Organic Acc  Avg. AUC
MorganFingerprint  0.859*         0.720*        0.841*         0.817        0.621       0.846        0.838*
ChemBERTa          0.834          0.669         0.811          0.780        0.623       0.842        0.807
MolFormer          0.729          0.497         0.717          0.874*       0.732*      0.852*       0.801
Chemprop           0.600          0.493         0.444          0.501        0.417       0.264        0.550

Classification Results and Analysis. Table 1 presents comprehensive classification performance across all four encoders. MorganFingerprint achieved the best overall performance with the highest average AUC of 0.838, demonstrating superior discriminative power across both compound types. The encoder particularly excelled on inorganic compounds, attaining 0.859 AUC and 0.720 F1 score while maintaining competitive organic performance at 0.817 AUC. Detailed analysis (Supplementary Tables S6 and S7) reveals that MorganFingerprint embeddings achieve exceptional precision of 0.884 on organic compounds, a critical advantage for drug screening applications where false positives incur substantial experimental costs.

MolFormer exhibited domain-specific excellence, achieving the best organic performance across all encoders with 0.874 AUC and 0.732 F1 score. However, this strength came at the cost of poor inorganic classification, with AUC dropping to 0.729. This 14.5-point gap between organic and inorganic performance reveals strong distributional bias from pre-training on 1.1 billion predominantly organic molecules from PubChem. While large-scale pre-training provides powerful inductive biases for organic chemistry, these advantages do not transfer to coordination complexes with distinct electronic structures and bonding patterns not well-represented in pre-training corpora.
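The class weighting and validation-set threshold selection described in the classification methodology can be sketched as follows. This is a hedged illustration, not the authors' code: the function names are hypothetical, the candidate threshold grid (0.01 to 0.99 in steps of 0.01) is an assumption, and the probabilities and labels in the usage lines are made-up toy values.

```python
import math
from typing import Sequence


def positive_class_weight(labels: Sequence[int]) -> float:
    """Ratio of negative to positive samples, used to up-weight positives."""
    n_pos = sum(labels)
    return (len(labels) - n_pos) / n_pos


def weighted_bce(probs: Sequence[float], labels: Sequence[int], pos_weight: float) -> float:
    """Mean binary cross-entropy with the positive term scaled by `pos_weight`."""
    eps = 1e-12  # clamp probabilities to avoid log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def f1_at_threshold(probs: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(val_probs: Sequence[float], val_labels: Sequence[int]) -> float:
    """Pick the cutoff that maximizes F1 on the validation set."""
    candidates = [i / 100 for i in range(1, 100)]
    return max(candidates, key=lambda t: f1_at_threshold(val_probs, val_labels, t))


# Toy validation predictions (illustrative values only).
val_probs = [0.10, 0.20, 0.35, 0.40, 0.80]
val_labels = [0, 0, 1, 0, 1]
weight = positive_class_weight(val_labels)
threshold = best_f1_threshold(val_probs, val_labels)
print(weight, threshold)
```

On imbalanced data the F1-optimal cutoff is often well below 0.5, which is why a threshold tuned on the validation set can differ substantially from the default probability cutoff.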
ChemBERTa provided balanced and reliable performance across both compound types, achieving 0.834 AUC on inorganic and 0.780 AUC on organic compounds with only 5.4% variation between domains. This consistency demonstrates robust cross-domain generalization without domain-specific overfitting. However, ChemBERTa's peak performance remained lower than specialized encoders, suggesting a tradeoff between generalization breadth and task-specific optimization.

Chemprop exhibited complete classification failure, with AUC scores barely above random chance at 0.600 for inorganic and 0.501 for organic compounds. Analysis of detailed metrics reveals a pathological failure mode: the classifier defaults to predicting all compounds as active, achieving perfect recall but near-zero precision. This behavior confirms the embedding space collapse previously observed in t-SNE visualizations and statistical alignment metrics.

Discussion

Principal Findings and Contributions

This study introduces ChemCLIP, a contrastive learning framework that bridges the traditionally separate domains of organic and inorganic anticancer compounds [25] through a unified embedding space. Our results demonstrate that cross-modal alignment between metal complexes and organic molecules can successfully capture shared biological activity patterns, enabling knowledge transfer from the extensively studied organic chemical space to the comparatively under-explored inorganic domain. Among four evaluated molecular encoders, MorganFingerprint emerged as the optimal choice, achieving superior performance across embedding quality (average alignment ratio: 0.899), visualization clarity, and downstream classification tasks (average AUC: 0.838).
The strong classification performance using frozen embeddings, without task-specific fine-tuning, validates that biologically relevant features are encoded in the learned representations, offering practical advantages for drug discovery applications where labeled training data may be limited.

Activity-Aware Hard Negative Mining and Its Role in Learning Discrimination

The incorporation of activity-aware hard negative mining through triplet margin loss proved essential for achieving the observed classification performance. Standard contrastive learning with random negative sampling encourages embeddings to separate compounds from different cell lines, which may correlate with activity differences but does not explicitly enforce activity-based discrimination. By specifically pairing each active inorganic compound with an active organic compound (positive) and an inactive organic compound (hard negative) from the same cell line, the triplet loss forces the model to learn subtle distinctions between active and inactive compounds within identical biological contexts.

This design choice addresses a fundamental challenge in multi-label biological activity prediction: compounds may exhibit cell line-specific effects, and simple cross-entropy losses can exploit cell line identity as a shortcut rather than learning generalizable activity patterns. The hard negative mining strategy eliminates this shortcut by ensuring that positive and negative pairs share the same cell line, compelling the model to discover chemical features that predict activity independent of the testing context. The clear separation between active and inactive compounds in the t-SNE visualizations, particularly for MorganFingerprint, validates that this approach successfully guides the contrastive learning process toward biologically meaningful representations.
The equal weighting of contrastive and triplet losses represents a balance between two complementary objectives: the bidirectional InfoNCE loss establishes broad cross-modal alignment, while the triplet loss refines this alignment to distinguish active from inactive compounds. Alternative weighting schemes or curriculum learning approaches, where the relative importance of each loss component varies during training, might further improve embedding quality, particularly in the early training phases where establishing basic cross-modal correspondence may be more important than fine-grained activity discrimination.

Implications for Drug Discovery and Screening

The strong classification performance achieved using frozen ChemCLIP embeddings (0.80-0.87 AUC for successful encoders) demonstrates practical utility for computational drug screening pipelines. Traditional approaches to anticancer drug discovery often treat organic and inorganic compounds as separate chemical spaces, requiring independent computational models, screening protocols, and optimization strategies. ChemCLIP's unified framework enables direct comparison and ranking of structurally diverse compounds based on predicted biological activity, potentially accelerating the identification of metal-based alternatives to organic drugs or revealing opportunities for hybrid therapeutic strategies.

The high precision achieved by MorganFingerprint on both organic (0.884) and inorganic (0.723) compounds proves particularly valuable for prioritizing compounds for experimental validation. In high-throughput screening campaigns where experimental capacity limits the number of compounds that can be tested, ranking compounds by predicted activity and selecting top candidates can dramatically reduce costs while maintaining high hit rates.
A precision of 0.88 implies that among compounds predicted as active, 88% truly exhibit anticancer activity, a hit rate far exceeding typical screening success rates and justifying the computational investment required for embedding-based prioritization.

The transfer learning capability demonstrated by training simple classifiers on frozen embeddings offers additional advantages for specialized applications. Researchers investigating specific cancer subtypes, resistance mechanisms, or combination therapies can fine-tune lightweight classifiers on small datasets without retraining the computationally expensive encoder. This approach proves especially valuable for inorganic compounds, where limited experimental data has historically constrained the application of machine learning methods. By leveraging the knowledge encoded in the organic compound training set through the shared embedding space, ChemCLIP enables effective activity prediction for metal complexes even when direct training examples are scarce.

Broader Implications for Chemical Representation Learning

Beyond the specific application to anticancer compounds, this work illustrates general principles for bridging chemical domains through contrastive learning. The success of domain-agnostic structural encodings suggests that multi-domain chemical applications, such as predicting environmental toxicity across organic pollutants and inorganic heavy metals [26] or optimizing catalytic activity across homogeneous and heterogeneous catalysts, would benefit from similar approaches. The failure of pre-trained transformers to achieve balanced cross-domain performance highlights the importance of matching encoder architectures to task requirements rather than defaulting to the largest available foundation models.
The ChemCLIP framework could extend naturally to other multi-modal chemistry problems where explicit similarity labels are unavailable but functional equivalence can be inferred from shared properties. Contrastive learning between molecules and their spectroscopic signatures (NMR, IR, mass spectra) could accelerate structure elucidation for unknown compounds. Alignment between chemical structures and natural language descriptions of their properties could enable text-based chemical search and improve human-AI interfaces for chemistry applications. The core principle, learning shared embeddings by enforcing similarity between functionally equivalent but structurally dissimilar entities, applies broadly across scientific domains wherever multiple representations of the same underlying phenomena exist.

Conclusion

This study demonstrates that contrastive learning can successfully bridge organic and inorganic chemical spaces, creating unified representations that capture shared biological activities despite fundamental structural differences. MorganFingerprint's domain-agnostic structural encoding emerges as the optimal choice for this task, achieving superior cross-modal alignment, embedding quality, and downstream classification performance. The strong results obtained with simple classifiers trained on frozen embeddings validate the practical utility of this approach for drug discovery applications, enabling knowledge transfer from extensively studied organic compounds to comparatively under-explored metal-based drugs. These findings establish contrastive learning as a promising strategy for multi-domain chemistry applications and provide guidance for encoder selection in scenarios requiring balanced performance across structurally diverse chemical classes.

Conflict of interest

The authors of this study declare that they do not have any conflict of interest.
Data availability statement

The data we used are publicly available and have already been referenced in our manuscript. The training and inference code is available at https://github.com/ .

References

[1] D. Bai et al., "The outcast of medicine: metals in medicine -- from traditional mineral medicine to metallodrugs," Frontiers in pharmacology, vol. 16, p. 1542560, 2025.
[2] T. Furuhashi, K. Toda, and W. Weckwerth, "Review of cancer cell volatile organic compounds: their metabolism and evolution," Frontiers in Molecular Biosciences, vol. 11, p. 1499104, 2025.
[3] L. Kelland, "The resurgence of platinum-based cancer chemotherapy," Nature Reviews Cancer, vol. 7, no. 8, pp. 573-584, 2007.
[4] S. Rottenberg, C. Disler, and P. Perego, "The rediscovery of platinum-based cancer therapy," Nature Reviews Cancer, vol. 21, no. 1, pp. 37-50, 2021.
[5] R. H. Shoemaker, "The NCI60 human tumour cell line anticancer drug screen," Nature Reviews Cancer, vol. 6, no. 10, pp. 813-823, 2006.
[6] S. Kim, W. Lee, H. I. Kim, M. K. Kim, and T. S. Choi, "Recent advances and future challenges in predictive modeling of metalloproteins by artificial intelligence," Molecules and cells, vol. 48, no. 4, p. 100191, 2025.
[7] E. López-López and J. L. Medina-Franco, "MetAP DB and Metal-FP: a database and fingerprint framework for advancing metal-based drug discovery," Journal of Computer-Aided Molecular Design, vol. 40, no. 1, p. 8, 2026.
[8] J. W. Toney, R. G. St. Michel, A. G. Garrison, I. Kevlishvili, and H. J. Kulik, "Graph neural networks for predicting metal–ligand coordination of transition metal complexes," Proceedings of the National Academy of Sciences, vol. 122, no. 41, p. e2415658122, 2025.
[9] L. Duo, Y. Liu, J. Ren, B. Tang, and J. D. Hirst, "Artificial intelligence for small molecule anticancer drug discovery," Expert Opinion on Drug Discovery, vol. 19, no. 8, pp. 933-948, 2024.
[10] S. Sarvepalli and S. Vadarevu, "Role of artificial intelligence in cancer drug discovery and development," Cancer letters, vol. 627, p. 217821, 2025.
[11] A. Radford et al., "Learning transferable visual models from natural language supervision," in International conference on machine learning, 2021: PMLR, pp. 8748-8763.
[12] P. Gao et al., "Clip-adapter: Better vision-language models with feature adapters," International journal of computer vision, vol. 132, no. 2, pp. 581-595, 2024.
[13] L. Krasnov, D. Malikov, M. Kiseleva, E. Nykhrikova, S. Tatarin, and S. Bezzubov, "Machine Learning for Anticancer Activity Prediction of Transition Metal Complexes," 2025.
[14] H. Zhou and J. Skolnick, "Utility of the Morgan fingerprint in structure-based virtual ligand screening," The Journal of Physical Chemistry B, vol. 128, no. 22, pp. 5363-5370, 2024.
[15] S. Chithrananda, G. Grand, and B. Ramsundar, "ChemBERTa: large-scale self-supervised pretraining for molecular property prediction," arXiv preprint arXiv:2010.09885, 2020.
[16] F. Wu, D. Radev, and S. Z. Li, "Molformer: Motif-based transformer on 3d heterogeneous molecular graphs," in Proceedings of the AAAI Conference on Artificial Intelligence, 2023, vol. 37, no. 4, pp. 5312-5320.
[17] E. Heid et al., "Chemprop: a machine learning package for chemical property prediction," Journal of chemical information and modeling, vol. 64, no. 1, pp. 9-17, 2023.
[18] K. Swanson, "Message passing neural networks for molecular property prediction," Massachusetts Institute of Technology, 2019.
[19] A. Forooghi, S. Sadeghi, L. Rueda, and A. Ngom, "A survey of contrastive learning methods in molecular representation," Briefings in Bioinformatics, vol. 27, no. 1, p. bbaf731, 2026.
[20] J. Robinson, C.-Y. Chuang, S. Sra, and S. Jegelka, "Contrastive learning with hard negative samples," arXiv preprint arXiv:2010.04592, 2020.
[21] S. Arora, W. Hu, and P. K. Kothari, "An analysis of the t-sne algorithm for data visualization," in Conference on learning theory, 2018: PMLR, pp. 1455-1462.
[22] J. Singh and R. Banerjee, "A study on single and multi-layer perceptron neural network," in 2019 3rd International Conference on Computing Methodologies and Communication (ICCMC), 2019: IEEE, pp. 35-40.
[23] A. M. Carrington et al., "Deep ROC analysis and AUC as balanced average accuracy, for improved classifier selection, audit and explanation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 1, pp. 329-341, 2022.
[24] R. Yacouby and D. Axman, "Probabilistic extension of precision, recall, and f1 score for more thorough evaluation of classification models," in Proceedings of the first workshop on evaluation and comparison of NLP systems, 2020, pp. 79-91.
[25] E. J. Anthony et al., "Metallodrugs are unique: opportunities and challenges of discovery and development," Chemical science, vol. 11, no. 48, pp. 12888-12917, 2020.
[26] R. V. Thurston, T. A. Gilfoil, E. L. Meyn, R. K. Zajdel, T. I. Aoki, and G. D. Veith, "Comparative toxicity of ten organic chemicals to ten common aquatic species," Water Research, vol. 19, no. 9, pp. 1145-1155, 1985.
