Local-Global Multimodal Contrastive Learning for Molecular Property Prediction
Accurate molecular property prediction requires integrating complementary information from molecular structure and chemical semantics. In this work, we propose LGM-CL, a local-global multimodal contrastive learning framework that jointly models molecular graphs and textual representations derived from SMILES and chemistry-aware augmented texts. Local functional group information and global molecular topology are captured using AttentiveFP and Graph Transformer encoders, respectively, and aligned through self-supervised contrastive learning. In addition, chemically enriched textual descriptions are contrasted with original SMILES to incorporate physicochemical semantics in a task-agnostic manner. During fine-tuning, molecular fingerprints are further integrated via Dual Cross-Attention multimodal fusion. Extensive experiments on MoleculeNet benchmarks demonstrate that LGM-CL achieves consistent and competitive performance across both classification and regression tasks, validating the effectiveness of unified local-global and multimodal representation learning.
💡 Research Summary
The paper introduces LGM‑CL (Local‑Global Multimodal Contrastive Learning), a novel self‑supervised framework for molecular property prediction that jointly leverages graph‑based structural information at two complementary scales and chemically enriched textual representations derived from SMILES. The core idea is to construct multiple “views” of the same molecule—(i) a local view captured by AttentiveFP, which focuses on short‑range atom neighborhoods and functional group patterns, (ii) a global view captured by a Graph Transformer, which uses adjacency‑aware multi‑head self‑attention to model long‑range dependencies across the entire molecular graph, and (iii) two textual views: the raw SMILES string and a chemistry‑aware description generated by a large language model (LLM) using a carefully designed prompt that minimizes hallucination.
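The adjacency-aware self-attention that gives the global view its long-range receptive field can be sketched as follows. This is a minimal single-head illustration, not the paper's implementation: the function name, the additive bias `alpha * adj`, and the single-matrix projections are assumptions, since the exact structural encoding of the Graph Transformer is not specified here.

```python
import numpy as np

def adjacency_aware_attention(x, adj, w_q, w_k, w_v, alpha=1.0):
    """Single-head self-attention over atom features with an additive
    adjacency bias (illustrative sketch; multi-head projections and the
    paper's exact structural encoding are omitted).

    x   : (n_atoms, d) node feature matrix
    adj : (n_atoms, n_atoms) 0/1 adjacency matrix
    alpha scales how strongly bonded atom pairs are favored; because the
    bias is additive rather than a hard mask, distant atoms can still
    exchange information, which is what lets the encoder model
    long-range dependencies across the whole graph.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1]) + alpha * adj
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

By contrast, a message-passing encoder like AttentiveFP only aggregates over bonded neighbors each layer, which is why the two encoders capture complementary local and global structure.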
During pre‑training, the authors apply a multi‑view contrastive learning strategy. Within each modality, positive pairs are formed between the two graph encoders (local vs. global) and between the two text encoders (SMILES vs. description). Cross‑modal positives are also created, aligning graph embeddings with text embeddings. Negative samples are taken from other molecules in the same mini‑batch, and an InfoNCE loss is used to pull together representations of the same molecule while pushing apart different molecules. This dual‑level alignment forces the model to learn embeddings that are simultaneously rich in local chemical detail, global topological context, and physicochemical semantics.
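The in-batch InfoNCE objective described above can be written compactly; the sketch below (function name and temperature value are illustrative) shows the symmetric form for one pair of views, where row `i` of each matrix embeds the same molecule and all other rows act as negatives.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE loss over a mini-batch of paired embeddings.

    z_a, z_b : (batch, d) embeddings of the same molecules under two
    views (e.g., local vs. global graph, or SMILES vs. description).
    Row i of z_a and row i of z_b form the positive pair; every other
    row in the batch serves as an in-batch negative.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature   # (batch, batch) cosine similarities

    def xent(l):
        # Cross-entropy with the diagonal (matching molecule) as target.
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))

    return 0.5 * (xent(logits) + xent(logits.T))
```

In LGM-CL this loss would be applied to each positive pairing (local/global, SMILES/description, and graph/text), pulling matched views together and pushing apart other molecules in the batch.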
After the self‑supervised stage, the model is fine‑tuned on downstream tasks. The pretrained graph and text embeddings are first aggregated within each modality to produce unified graph and text vectors. An additional molecular fingerprint modality (e.g., MACCS, PubChem, ErG) is introduced, and all three modalities are fused through a Dual Cross‑Attention module. This module learns dynamic attention weights that determine how much each modality contributes to the final representation, preserving complementary information while mitigating redundancy. The fused representation is then fed into a simple MLP head for either regression or classification.
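One direction of such cross-attention fusion can be sketched as below. The actual Dual Cross-Attention module is not specified in detail here, so this single-head, vector-level version (and the function name) is an assumption; the idea it illustrates is that one modality's embedding queries the others, producing normalized attention weights that decide how much each contributes.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(query_vec, context_vecs, w_q, w_k, w_v):
    """One direction of cross-attention fusion (illustrative sketch).

    query_vec    : (d,)   embedding of the querying modality (e.g., graph)
    context_vecs : (m, d) embeddings of the other modalities
                   (e.g., text and fingerprint)
    Returns the context summary weighted by learned attention, plus the
    per-modality weights themselves.
    """
    q = query_vec @ w_q                          # (d,)
    k = context_vecs @ w_k                       # (m, d)
    v = context_vecs @ w_v                       # (m, d)
    weights = softmax(k @ q / np.sqrt(len(q)))   # (m,) modality weights
    return weights @ v, weights
```

A "dual" module would presumably run this in both directions (e.g., graph attending to text/fingerprint and vice versa) and combine the outputs before the MLP head; because the weights are softmax-normalized per query, an uninformative modality can be down-weighted rather than diluting the fused representation.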
Extensive experiments on seven MoleculeNet benchmarks (PCBA, BACE, HIV, ESOL, Lipo, FreeSolv, QM9) demonstrate that LGM‑CL consistently outperforms state‑of‑the‑art graph neural networks (GIN, Graphormer, DimeNet, etc.) and multimodal baselines (MolBERT, ChemBERTa). Notably, the method yields 3–7 % absolute improvements in classification AUC and corresponding reductions in regression RMSE, with particularly strong gains on tasks where either global topology (QM9) or local functional groups (HIV) dominate. Ablation studies reveal that removing either the local or the global graph encoder, or omitting the text contrast, leads to a substantial drop in performance, confirming the necessity of each component. Visualization via UMAP shows that the learned embeddings form chemically meaningful clusters, indicating that the contrastive objectives successfully shape a semantically coherent latent space.
Key contributions of the work are: (1) a chemistry‑aware prompting template that generates reliable textual augmentations for SMILES, reducing LLM hallucination; (2) a dual‑encoder graph architecture that explicitly separates and then aligns local and global structural information; (3) a unified multimodal framework that integrates graph, text, and fingerprint modalities through Dual Cross‑Attention, enabling robust transfer to diverse property prediction tasks. By unifying structural scales and modalities, LGM‑CL provides a more comprehensive and transferable molecular representation, offering practical benefits for drug discovery, material design, and toxicity assessment.