Tokenization for Molecular Foundation Models


Text-based foundation models have become an important part of scientific discovery, with molecular foundation models accelerating advancements in materials science and molecular design. However, existing models are constrained by closed-vocabulary tokenizers that capture only a fraction of molecular space. In this work, we systematically evaluate 34 tokenizers, including 19 chemistry-specific ones, and reveal significant gaps in their coverage of the SMILES molecular representation. To assess the impact of tokenizer choice, we introduce n-gram language models as a low-cost proxy and validate their effectiveness by pretraining and finetuning 18 RoBERTa-style encoders for molecular property prediction. To overcome the limitations of existing tokenizers, we propose two new tokenizers – Smirk and Smirk-GPE – with full coverage of the OpenSMILES specification. The proposed tokenizers systematically integrate nuclear, electronic, and geometric degrees of freedom, facilitating applications in pharmacology, agriculture, biology, and energy storage. Our results highlight the need for open-vocabulary modeling and chemically diverse benchmarks in cheminformatics.


💡 Research Summary

This paper investigates the often‑overlooked yet pivotal role of tokenization in molecular foundation models that operate on SMILES strings. The authors begin by highlighting a fundamental limitation of the dominant “atom‑wise” tokenizers: they treat any bracketed atom—potentially encoding isotope, chirality, charge, hydrogen count, and class information—as a single indivisible token. According to the OpenSMILES specification, the combinatorial space of such bracketed atoms is astronomically large (exceeding 2.8 × 10¹³ distinct tokens), yet existing chemistry‑specific tokenizers maintain vocabularies of fewer than three thousand entries, leading to frequent out‑of‑vocabulary (UNK) occurrences.
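The atom-wise behavior described above can be sketched with the regex style commonly used by SMILES tokenizers. The pattern below is an illustrative simplification (real implementations vary); the key point is that the bracket-atom alternative captures the entire bracketed atom as one indivisible token.

```python
import re

# Simplified "atom-wise" SMILES pattern: the first alternative swallows any
# bracketed atom whole, so every isotope/chirality/charge/H-count combination
# would need its own vocabulary entry.
ATOM_WISE = re.compile(
    r"\[[^\]]+\]|Br|Cl|B|C|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d"
)

def tokenize(smiles: str) -> list[str]:
    return ATOM_WISE.findall(smiles)

print(tokenize("C[C@@H](N)C(=O)O"))  # L-alanine; note "[C@@H]" is one token
```

Because `[C@@H]` survives as a single token, a closed vocabulary of a few thousand entries cannot hope to cover the >2.8 × 10¹³ legal bracket atoms, which is exactly what drives the UNK rates reported next.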

To quantify this shortcoming, the authors systematically evaluate 34 tokenizers (19 chemistry‑specific, 15 generic NLP) across four benchmark datasets: raw SMILES, MoleculeNet, tmQM, and Enamine REALSpace. Four intrinsic metrics are employed: fertility (average token count per molecule), normalized entropy (distribution uniformity of token frequencies), imbalance (deviation from a uniform distribution), and UNK frequency (proportion of tokens mapped to the unknown token). Results reveal that chemistry‑specific tokenizers suffer from high fertility, low entropy, and substantial UNK rates (18 %–50 % depending on the dataset), whereas open‑vocabulary NLP tokenizers and the newly proposed Smirk family produce no UNK tokens and achieve competitive entropy scores.
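The four intrinsic metrics can be made concrete with a small sketch. The function below uses the metric names from the text, but the exact formulas (e.g., total-variation distance for imbalance) are illustrative assumptions, not the paper's code.

```python
import math
from collections import Counter

def intrinsic_metrics(token_lists, unk="[UNK]"):
    """Toy versions of fertility, normalized entropy, imbalance, and UNK
    frequency over a corpus of tokenized molecules (formulas are assumed)."""
    counts = Counter(t for toks in token_lists for t in toks)
    total = sum(counts.values())
    vocab_size = len(counts)
    # Fertility: average number of tokens emitted per molecule.
    fertility = total / len(token_lists)
    # Normalized entropy: Shannon entropy of token frequencies, scaled to [0, 1].
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    norm_entropy = entropy / math.log(vocab_size) if vocab_size > 1 else 0.0
    # Imbalance: total-variation distance from the uniform distribution.
    imbalance = 0.5 * sum(abs(p - 1 / vocab_size) for p in probs)
    # UNK frequency: fraction of tokens mapped to the unknown token.
    unk_freq = counts.get(unk, 0) / total
    return fertility, norm_entropy, imbalance, unk_freq
```

On this view, a closed-vocabulary tokenizer that maps many bracket atoms to `[UNK]` scores poorly on the last metric even if its fertility looks acceptable.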

Recognizing that full transformer pre‑training for each tokenizer is computationally prohibitive, the authors adopt n‑gram language models as a low‑cost proxy. They pre‑train 1‑ to 5‑gram models on 1.6 billion SMILES from REALSpace and compute cross‑entropy loss on validation splits of all datasets. Smirk and its compressed variant Smirk‑GPE consistently attain lower cross‑entropy than both the chemistry‑specific and generic tokenizers, indicating that they preserve more information about the underlying chemical syntax.
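The proxy itself is inexpensive to reproduce in miniature. The class below is a minimal add-one-smoothed n-gram language model with a per-token cross-entropy, a sketch of the methodology rather than the authors' implementation (their counting and smoothing choices may differ).

```python
import math
from collections import Counter

class NGramLM:
    """Minimal add-one-smoothed n-gram LM over token sequences."""

    def __init__(self, n: int):
        self.n = n
        self.ctx = Counter()     # counts of (n-1)-token contexts
        self.ngrams = Counter()  # counts of full n-grams
        self.vocab = set()

    def fit(self, sequences):
        for seq in sequences:
            toks = ["<s>"] * (self.n - 1) + list(seq) + ["</s>"]
            self.vocab.update(toks)
            for i in range(self.n - 1, len(toks)):
                ctx = tuple(toks[i - self.n + 1 : i])
                self.ctx[ctx] += 1
                self.ngrams[ctx + (toks[i],)] += 1

    def cross_entropy(self, sequences) -> float:
        """Average negative log-likelihood per token, in nats."""
        nll, count = 0.0, 0
        V = len(self.vocab)
        for seq in sequences:
            toks = ["<s>"] * (self.n - 1) + list(seq) + ["</s>"]
            for i in range(self.n - 1, len(toks)):
                ctx = tuple(toks[i - self.n + 1 : i])
                p = (self.ngrams[ctx + (toks[i],)] + 1) / (self.ctx[ctx] + V)
                nll -= math.log(p)
                count += 1
        return nll / count
```

Fitting one such model per tokenizer and comparing held-out cross-entropy is far cheaper than a transformer pre-training run, which is what makes the proxy attractive at the scale of 1.6 billion SMILES.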

To validate the proxy findings, the study pre‑trains 18 RoBERTa‑style encoder‑only models, each paired with a distinct tokenizer (including the three molecular encodings used in prior work). All models are trained from scratch, ensuring that performance differences can be attributed solely to tokenization. Downstream evaluation on molecular property prediction tasks (MAE, R²) shows that Smirk‑GPE, which applies a glyph‑level BPE merge on the fully decomposed tokens, matches the computational cost of standard BPE while delivering 4 %–7 % higher predictive accuracy than any existing chemistry‑specific tokenizer. The advantage is most pronounced on tmQM, a dataset rich in bracketed atoms, underscoring the importance of full OpenSMILES coverage.
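The glyph-level idea behind the Smirk family can be illustrated by splitting a bracket atom into its OpenSMILES fields (isotope, symbol, chirality, hydrogen count, charge, atom class) so that each field becomes a small reusable sub-token. The regex below is a simplified sketch of that decomposition, not the authors' tokenizer, and omits extended chirality forms such as `@TH1`.

```python
import re

# Hypothetical glyph-level splitter for bracket atoms (simplified OpenSMILES
# fields; real SMILES grammars have more cases than this pattern covers).
BRACKET = re.compile(
    r"\[(?P<isotope>\d+)?(?P<symbol>[A-Z][a-z]?|[a-z]{1,2}|\*)"
    r"(?P<chiral>@{1,2})?(?P<hcount>H\d*)?(?P<charge>[+-]\d*|[+-]+)?"
    r"(?P<cls>:\d+)?\]"
)

def decompose(bracket_atom: str) -> list[str]:
    """Split one bracket atom into field sub-tokens instead of one rare token."""
    m = BRACKET.fullmatch(bracket_atom)
    if m is None:
        raise ValueError(f"not a bracket atom: {bracket_atom!r}")
    return [g for g in m.groups() if g]

print(decompose("[13C@@H+]"))  # → ['13', 'C', '@@', 'H', '+']
```

Decomposing this way keeps the vocabulary small and closed over the fields while remaining open over their combinations; a glyph-level BPE pass (as in Smirk-GPE) can then re-merge frequent field sequences to recover the efficiency of standard BPE.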

The proposed tokenization pipeline consists of two stages. First, a SMILES string is split into atomic units (e.g., “OC

