How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?

Reading time: 5 minutes
...

📝 Original Info

  • Title: How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?
  • ArXiv ID: 2602.11618
  • Date: 2026-02-12
  • Authors: Not specified in the provided paper metadata.

📝 Abstract

Chemical Language Models (CLMs) pre-trained on large-scale molecular data are widely used for molecular property prediction. However, the common belief that increasing training resources such as model size, dataset size, and training compute improves both pre-training loss and downstream task performance has not been systematically validated in the chemical domain. In this work, we evaluate this assumption by pre-training CLMs while scaling training resources and measuring transfer performance across diverse molecular property prediction (MPP) tasks. We find that while pre-training loss consistently decreases with increased training resources, downstream task performance shows limited improvement. Moreover, alternative metrics based on the Hessian or the loss landscape also fail to predict downstream performance in CLMs. We further identify conditions under which downstream performance saturates or degrades despite continued improvements in pre-training metrics, and analyze the underlying task-dependent failure modes through parameter-space visualizations. These results expose a gap between pre-training-based evaluation and downstream performance, and emphasize the need for model selection and evaluation strategies that explicitly account for downstream task characteristics.

💡 Deep Analysis

📄 Full Content

Chemical language models (CLMs), which are pre-trained on string representations of molecules, have become increasingly important as foundation models that can transfer to a wide range of molecular property prediction (MPP) tasks in biology, chemistry, and drug discovery [Park et al., 2024; Ross et al., 2022; Edwards et al., 2022]. This is largely because labeled data are expensive to obtain, while the rapid growth of molecular databases has made billions of unlabeled structures available. By representing molecules as sequences of atomic and bond symbols, we can treat them as a form of language. This formulation makes it natural to adopt techniques from natural language processing (NLP), and a variety of NLP methods have been adopted in this line of work [Chithrananda et al., 2020; Li and Jiang, 2021; Irwin et al., 2022]. Motivated by scaling practices in NLP, CLMs have also been rapidly scaled, following the expectation that increasing model size, data size, and compute leads to better performance [Soares et al., 2025a, 2025b; Cai et al., 2025]. In NLP, pre-training loss has been shown to follow a power law with respect to training resources, and this relationship has served as a practical guideline for model design and training resource allocation [Kaplan et al., 2020; Hoffmann et al., 2022].
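
For reference, the power-law relationship reported in NLP can be written in the following form; the sketch below shows the functional form from Kaplan et al. [2020], where the constants N_c, D_c, C_c and exponents α_N, α_D, α_C are fit per domain and are not values fitted to CLMs.

```latex
% Power-law form of pre-training loss L as a function of model size N,
% dataset size D, and compute C (constants and exponents are domain-specific).
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```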

In recent years, studies in NLP have reported that improvements in pre-training loss do not necessarily translate into better downstream performance and can even lead to negative transfer in some cases [Zoph et al., 2020; Isik et al., 2024; Lourie et al., 2025]. This observation suggests that the implicit assumption that minimizing the pre-training loss is equivalent to acquiring useful representations may not be universal. To address this issue, prior work has proposed alternatives to pre-training loss, including Hessian- and loss-landscape-based measures, and has shown that these metrics can serve as indicators of transfer performance on downstream NLP tasks [Liu et al., 2023]. However, whether negative transfer occurs and whether such alternatives to pre-training loss are effective depend on the downstream tasks and application domains. In particular, these questions have not been sufficiently examined in the context of CLMs.
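
As a concrete illustration of what a Hessian-based measure involves, the sketch below estimates the trace of the loss Hessian with Hutchinson's method in PyTorch. This is one generic curvature estimate under minimal assumptions, not necessarily the exact metric used by Liu et al. [2023].

```python
import torch

def hutchinson_hessian_trace(loss, params, n_samples=10):
    """Estimate tr(H) of `loss` w.r.t. `params` via Hutchinson's method."""
    # First-order gradients, keeping the graph so we can differentiate again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    estimate = 0.0
    for _ in range(n_samples):
        # Rademacher probe vectors (entries are +1 or -1).
        vs = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]
        # Hessian-vector product: differentiating g.v w.r.t. the parameters gives Hv.
        hvps = torch.autograd.grad(grads, params, grad_outputs=vs, retain_graph=True)
        estimate += sum((v * hv).sum() for v, hv in zip(vs, hvps)).item()
    return estimate / n_samples

# Toy usage with a small model; a pre-trained CLM encoder would be used in practice.
model = torch.nn.Linear(8, 1)
x, y = torch.randn(32, 8), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
print(hutchinson_hessian_trace(loss, list(model.parameters())))
```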

Figure 1 illustrates that, for a typical Transformer-based CLM, lower pre-training loss does not necessarily translate into better downstream performance. As pre-training progresses, the loss decreases monotonically, but the downstream performance obtained by fine-tuning checkpoints from different pre-training stages on the HIV benchmark [Wu et al., 2018] is non-monotonic. This suggests that optimizing the pre-training objective can diverge from learning representations that transfer well. It also indicates that common CLM practices, such as using pre-training loss as the primary criterion for early stopping, model selection, and training resource allocation, may not be fully justified from the perspective of downstream applications.
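
One simple way to quantify the alignment shown in Figure 1 is a rank correlation between pre-training loss and downstream score across checkpoints. The sketch below uses placeholder numbers purely to illustrate the computation; they are not results from the paper.

```python
from scipy.stats import spearmanr

# Placeholder per-checkpoint measurements (illustrative only):
# pre-training loss decreases monotonically, downstream ROC-AUC does not.
pretrain_loss = [1.92, 1.71, 1.58, 1.49, 1.43, 1.40]
downstream_auc = [0.71, 0.75, 0.74, 0.77, 0.74, 0.73]

# If lower loss reliably implied better transfer, rho would be close to -1.
rho, pval = spearmanr(pretrain_loss, downstream_auc)
print(f"Spearman rho = {rho:.2f} (p = {pval:.2f})")
```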

Motivated by this observation, we investigate (i) how well pre-training loss correlates with downstream performance on MPP tasks, and (ii) how this relationship changes as we scale model size, data size, and training compute. To address these questions, we focus on encoder-based CLMs that are commonly used for MPP, pre-train them under different resource budgets, and evaluate transfer on MPP benchmarks. We then evaluate downstream performance under practical transfer settings that are widely used in CLM applications, including fine-tuning and linear probing, to identify when improvements in pre-training align with downstream gains and when they diverge. We further analyze task-dependent factors underlying this phenomenon in a comprehensive evaluation across 36 MPP tasks.
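
For context on the two transfer settings: fine-tuning updates the encoder jointly with a prediction head, while a linear probe freezes the encoder and trains only a linear classifier on its embeddings. The sketch below illustrates the latter; `encode_smiles` is a hypothetical helper standing in for a frozen pre-trained CLM encoder, and the scikit-learn classifier is one common choice rather than the paper's exact protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def linear_probe_auc(encode_smiles, train_smiles, y_train, test_smiles, y_test):
    """Linear probing: keep the pre-trained encoder frozen, fit only a linear head.

    `encode_smiles` is assumed to map a list of SMILES strings to a 2-D array
    of fixed-size embeddings (e.g., the pooled output of an encoder-based CLM).
    """
    X_train = np.asarray(encode_smiles(train_smiles))
    X_test = np.asarray(encode_smiles(test_smiles))

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)

    # Binary MPP benchmarks such as HIV are typically scored with ROC-AUC.
    return roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
```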

Our contributions are summarized as follows:

  • We show that scaling model size, data size, and training compute consistently reduces pre-training loss, suggesting scaling-law behavior for CLMs, at least with respect to loss.
  • We systematically evaluate the relationship between pre-training loss and downstream performance on MPP benchmarks and demonstrate that lower loss does not reliably imply better downstream performance in CLMs.
  • We show that, under standard CLM training and transfer settings, the relationship between pre-training loss and downstream performance depends strongly on the task and the training setup, which establishes the limitations of using pre-training loss as a single criterion for early stopping, model selection, and resource allocation. These limitations persist even when using alternatives to pre-training loss based on Hessian information and loss landscapes. Finally, we analyze this behavior using visualizations in parameter space.

2 Related Work

Chemical language models (CLMs) are typically pre-trained on large-scale unlabeled corpora of molecular strings and use representations such as SMILES (Simplified Molecular-Input Line-Entry System) as input [Weining

Reference

This content is AI-processed based on open access ArXiv data.
