RoFt-Mol: Benchmarking Robust Fine-Tuning with Molecular Graph Foundation Models
In the era of foundation models, fine-tuning pre-trained models for specific downstream tasks has become crucial. This drives the need for robust fine-tuning methods to address challenges such as model overfitting and sparse labeling. Molecular graph foundation models (MGFMs) face unique difficulties that complicate fine-tuning. These models are limited by smaller pre-training datasets and more severe data scarcity for downstream tasks, both of which require enhanced model generalization. Moreover, MGFMs must accommodate diverse objectives, including both regression and classification tasks. To better understand and improve fine-tuning techniques under these conditions, we classify eight fine-tuning methods into three mechanisms: weight-based, representation-based, and partial fine-tuning. We benchmark these methods on downstream regression and classification tasks across supervised and self-supervised pre-trained models in diverse labeling settings. This extensive evaluation provides valuable insights and informs the design of a refined robust fine-tuning method, ROFT-MOL. This approach combines the strengths of simple post-hoc weight interpolation with more complex weight ensemble fine-tuning methods, delivering improved performance across both task types while maintaining the ease of use inherent in post-hoc weight interpolation.
💡 Research Summary
In recent years, foundation models have demonstrated remarkable ability to learn high‑quality, general‑purpose representations through large‑scale pre‑training on diverse data. Extending this paradigm to chemistry, molecular graph foundation models (MGFMs) have emerged as powerful tools for predicting molecular properties. However, MGFMs face two fundamental constraints that differentiate them from vision or language models. First, the amount of publicly available molecular data for pre‑training is limited to on the order of 100 million molecules, far fewer than the billions of images or text snippets used elsewhere. This limits both the size of the models (typically ≤ 100 M parameters) and their capacity to acquire truly generic knowledge. Second, downstream chemistry tasks often have extremely scarce labeled data—sometimes only a few dozen experimentally verified molecules—making fine‑tuning prone to severe over‑fitting and catastrophic forgetting. Consequently, naïve full‑parameter fine‑tuning (full‑FT) is insufficient for robust transfer.
To address these challenges, the authors first categorize eight representative fine‑tuning techniques into three mechanistic families: (1) weight‑based methods, which combine the pre‑trained weights and the fine‑tuned weights after training; (2) representation‑based methods, which regularize the latent representations of the fine‑tuned model to stay close to those of the frozen pre‑trained model; and (3) partial fine‑tuning, which updates only a subset of the parameters while keeping the rest frozen. Weight‑based approaches include WiSE‑FT (linear interpolation between the two weight sets) and L2‑SP (an L2 penalty that keeps fine‑tuned weights near the pre‑trained ones). Representation‑based approaches comprise Feature‑Map (L2 distance between embeddings) and BSS (spectral regularization that suppresses small singular values). Partial fine‑tuning covers Linear Probing (train only the prediction head), Sur‑FT (fine‑tune a selected layer), and LP‑FT (a two‑stage scheme that first fixes the encoder then performs full‑FT).
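To make the weight‑based family concrete, the two mechanisms above can be sketched in a few lines. The helper names and the use of scalar weights per parameter are simplifications for illustration, not the paper's implementation; in practice these operations apply element‑wise to full parameter tensors.

```python
def wise_ft(pretrained, finetuned, alpha=0.5):
    """WiSE-FT: post-hoc linear interpolation between the pre-trained
    and fine-tuned weight sets, controlled by alpha in [0, 1]."""
    return {name: alpha * finetuned[name] + (1 - alpha) * pretrained[name]
            for name in pretrained}

def l2_sp_penalty(current, pretrained, beta=1e-3):
    """L2-SP: a regularization term added to the training loss that keeps
    the fine-tuned weights close to their pre-trained starting point."""
    return beta * sum((current[name] - pretrained[name]) ** 2
                      for name in pretrained)
```

With `alpha = 0` the merged model is exactly the pre‑trained one and with `alpha = 1` it is the fully fine‑tuned one, which is why the interpolation can trade off generic and task‑specific knowledge after training, at no extra training cost.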
The benchmark evaluates six pre‑trained models that span three pre‑training paradigms (self‑supervised, supervised multi‑task, and multimodal) and three architectural families (graph‑CNN, graph‑Transformer, and text‑augmented graph models). The self‑supervised models are GraphMAE (masked graph reconstruction) and Mole‑BERT (node masking combined with graph‑level contrastive learning), while MoleculeSTM (cross‑modal contrastive learning with textual descriptions) represents the multimodal paradigm. The supervised models are Graphium‑Toy and Graphium‑Large (both from the Graphium library) and GraphGPS (a large graph‑Transformer pre‑trained on PCQM4Mv2).
Downstream evaluation uses eight classification datasets (e.g., BBBP, Tox21, HIV, MUV) and four regression datasets (ESOL, Lipo, CEP, and Malaria). For each dataset, three split strategies are employed: random (in‑distribution), scaffold (structural out‑of‑distribution), and size‑based (molecule‑size out‑of‑distribution), capturing both ID and OOD scenarios. The authors additionally simulate label scarcity by defining a non‑few‑shot regime (the full training set) and three few‑shot regimes with 50, 100, and 500 training molecules.
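The size‑based OOD split and the few‑shot subsampling can be sketched as follows; the split fraction, seeding, and helper names are illustrative assumptions, not the benchmark's exact protocol (the scaffold split additionally requires chemistry tooling and is omitted here).

```python
import random

def size_split(mols, num_atoms, frac=0.8):
    """Size-based OOD split: train on the smallest molecules and
    test on the largest (the 80/20 fraction is an assumption)."""
    order = sorted(range(len(mols)), key=lambda i: num_atoms[i])
    cut = int(frac * len(mols))
    return ([mols[i] for i in order[:cut]],
            [mols[i] for i in order[cut:]])

def few_shot_subset(train_set, k, seed=0):
    """Simulate label scarcity: keep only k labeled training molecules,
    sampled with a fixed seed for reproducibility."""
    rng = random.Random(seed)
    return rng.sample(train_set, min(k, len(train_set)))
```

A size split stresses generalization because test molecules are systematically larger than anything seen in training, while the few‑shot subsets (k = 50, 100, 500) probe how quickly each fine‑tuning method over‑fits.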
Key empirical findings are as follows.

1) Impact of the pre‑training objective: in few‑shot settings, supervised pre‑training consistently outperforms self‑supervised pre‑training, even when the supervised pre‑training tasks are not perfectly aligned with the downstream task. In the non‑few‑shot regime, this advantage diminishes and only manifests when the pre‑training tasks are closely related to the target task.

2) Task type matters: regression tasks are less prone to over‑fitting than classification tasks, especially under extreme label scarcity, because the continuous loss provides smoother gradients. Consequently, weight‑based fine‑tuning methods tend to excel on regression, whereas classification benefits more from representation regularization or partial fine‑tuning.

3) Method‑by‑pre‑training interaction: for self‑supervised models, weight‑based methods (WiSE‑FT, L2‑SP) achieve the best performance by effectively merging generic pre‑training knowledge with task‑specific adaptation, while partial fine‑tuning often under‑fits, particularly on few‑shot regression. For supervised models, representation‑based methods (Feature‑Map) work best, as they preserve the domain‑specific embeddings learned during supervised pre‑training while still allowing task‑specific adjustment.
Guided by these observations, the authors propose a refined weight‑based technique called DWiSE‑FT (Dual‑WiSE‑FT). It combines the simple post‑hoc linear interpolation of WiSE‑FT with the strengths of more complex weight‑ensemble fine‑tuning methods, merging pre‑trained and fine‑tuned knowledge rather than relying on a single end‑of‑training checkpoint. This hybrid retains the plug‑and‑play simplicity of post‑hoc interpolation while gaining the robustness of weight ensembling.
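The abstract describes the proposed method as pairing post‑hoc weight interpolation with weight‑ensemble fine‑tuning. One minimal sketch of that pairing is below; the EMA form of the ensemble and the helper names are illustrative assumptions, not the authors' exact algorithm.

```python
def ensemble_update(ensemble, current, decay=0.99):
    """In-training weight ensemble: an exponential moving average over
    fine-tuning checkpoints (the EMA form is an assumption here)."""
    return {k: decay * ensemble[k] + (1 - decay) * current[k]
            for k in ensemble}

def post_hoc_merge(pretrained, ensemble, alpha=0.5):
    """Final WiSE-FT-style interpolation of the ensembled weights
    with the original pre-trained weights."""
    return {k: alpha * ensemble[k] + (1 - alpha) * pretrained[k]
            for k in pretrained}
```

The ensembling step smooths over noisy fine‑tuning trajectories, while the final merge reinstates generic pre‑trained knowledge; both steps are cheap weight arithmetic with no extra gradient computation.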
Across all experiments, DWiSE‑FT consistently outperforms the strongest baseline in each category, delivering average improvements of 2–4 points over full‑FT and over the best existing robust fine‑tuning method. The gains are especially pronounced on the OOD splits (scaffold and size) and in the most challenging few‑shot regimes (50–100 samples). Importantly, DWiSE‑FT requires only a post‑hoc weight‑combination step, incurring negligible additional training cost and minimal hyper‑parameter tuning (the interpolation coefficient α and a second combination weight δ).
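Because the interpolation coefficient only enters after training, it can be tuned by scoring each candidate merge on a validation set, with no retraining. The grid and the scoring callback below are illustrative assumptions rather than the paper's tuning protocol.

```python
def select_alpha(pretrained, finetuned, score,
                 alphas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Pick the interpolation coefficient maximizing a validation score;
    each candidate model is just a cheap post-hoc weight merge."""
    def merge(a):
        return {k: a * finetuned[k] + (1 - a) * pretrained[k]
                for k in pretrained}
    return max(alphas, key=lambda a: score(merge(a)))
```

Since every candidate reuses the same two weight sets, sweeping even a fine grid of α values costs only one validation pass per candidate.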
In summary, this work delivers the first systematic benchmark of robust fine‑tuning for molecular graph foundation models, elucidates how pre‑training objectives, downstream task types, and data scarcity jointly dictate the optimal fine‑tuning strategy, and introduces DWiSE‑FT as a practical, high‑performing solution. The benchmark, code, and data are publicly released, providing a valuable resource for the community to develop and evaluate future fine‑tuning methods for chemistry and related scientific domains.