Breakthrough Performance of a Foundation Model Using a SMILES-Based Polymer Graph Representation

📝 Abstract

From the relative scarcity of training data to the lack of standardized benchmarks, the development of foundation models for polymers faces significant and multi-faceted challenges. At their core, many of these issues are tied directly to the structural representation of polymers. Here, we present a new foundation model using a SMILES-based polymer graph representation. This approach allows representation of critical polymer architectural features and connectivity that are not available in other SMILES-based representations. The developed polymer foundation model exhibited excellent performance on 28 different benchmark datasets. Critical evaluation of the developed representation against other variations in control experiments reveals this approach to be a highly performant method of representing polymers in language-based foundation models. These control experiments also reveal a strong invariance of all SMILES representations, with many variations achieving state-of-the-art or near state-of-the-art performance, including those that are chemically or semantically invalid. Examination of error sources and attention maps for the evaluated representations corroborates the findings of the control experiments, showing that chemistry language models based on SMILES interpolate over all sequence space for prediction tasks, not only the space of semantically valid inputs. Overall, this work highlights the importance of control experiments as a check on human-imposed assumptions that can limit rational design of both chemistry foundation models and their underlying structural representations.

📄 Content

Understanding Structural Representation in Foundation Models for Polymers

Nathaniel H. Park¹*, Eduardo Soares², Victor Shirasuna², Tiffany J. Callahan¹, Sara Capponi¹, Emilio Vital Brazil²*

¹ IBM Research Almaden, 650 Harry Rd., San Jose, 95120, CA, United States of America.
² IBM Research Brazil, Street, Rio de Janeiro, 10587, RJ, Brazil.

* Corresponding authors. E-mail: npark@us.ibm.com; evital@br.ibm.com
Contributing authors: eduardo.soares@ibm.com; vshirasuna@ibm.com; tiffany.callahan@ibm.com; sara.capponi@ibm.com

arXiv:2512.11881v1 [cond-mat.soft] 8 Dec 2025

Abstract

From the relative scarcity of training data to the lack of standardized benchmarks, the development of foundation models for polymers faces significant and multi-faceted challenges. At their core, many of these issues are tied directly to the structural representation of polymers. Here, we present a new foundation model using a SMILES-based polymer graph representation. This approach allows representation of critical polymer architectural features and connectivity that are not available in other SMILES-based representations. The developed polymer foundation model exhibited excellent performance on 28 different benchmark datasets. Critical evaluation of the developed representation against other variations in control experiments reveals this approach to be a highly performant method of representing polymers in language-based foundation models. These control experiments also reveal a strong invariance of all SMILES representations, with many variations achieving state-of-the-art or near state-of-the-art performance, including those that are chemically or semantically invalid. Examination of error sources and attention maps for the evaluated representations corroborates the findings of the control experiments, showing that chemistry language models based on SMILES interpolate over all sequence space for prediction tasks, not only the space of semantically valid inputs. Overall, this work highlights the importance of control experiments as a check on human-imposed assumptions that can limit rational design of both chemistry foundation models and their underlying structural representations.

Keywords: Polymers, Deep-Learning, Foundation Models

1 Introduction

Foundational artificial intelligence (AI) models hold immense promise for revolutionizing rational design of polymeric materials owing to their ability to interpolate over large regions of chemical space, providing considerable predictive capabilities for a variety of downstream tasks. Despite numerous reports on the development of predictive and generative models for polymers[1–4], none have demonstrated significant academic and industrial impact in a manner analogous to AlphaFold[5]. In contrast to proteins, the creation of foundation models for polymers faces major and frequently intractable challenges. Data for training and benchmarking deep-learning models for polymers is scarce[6], restrictively licensed[7–10], and comes from only a handful of sources (Fig. 1c), limiting both model development and use in commercial applications. Moreover, existing datasets are often incomplete and lack many critical structural descriptors[11–15], such as dispersity, number-average molecular weight, or processing conditions, that play important roles in determining polymer properties. Structural representations in these datasets are overwhelmingly based on variations of SMILES (Fig. 1b)[12–15], which are then featurized using various strategies. These approaches include fingerprinting[8, 16, 17], conversion to graphs[18–22], tokenization of SMILES line-notations[11, 22–26], coarse-grained representations[27, 28], images[29], 3D conformations[24], and others depending on the model architecture[30].
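To make the "tokenization of SMILES line-notations" strategy concrete, the following is a minimal sketch of a regex-based tokenizer applied to two of the Nylon-6 strings that appear in Fig. 1a. The pattern and the `tokenize` helper are illustrative assumptions, not the tokenizer used in the paper: bracketed groups such as `[>]` or `[:1]` are kept as single tokens, and every other character becomes its own token.

```python
import re
from typing import List

# Hypothetical token pattern (an assumption for illustration only):
#   1. anything in square brackets, e.g. "[>]", "[<]", "[:1]"
#   2. two-letter halogens "Br" and "Cl"
#   3. any other single character (atoms, bonds, parentheses, braces)
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(line_notation: str) -> List[str]:
    """Split a SMILES-like string into a list of tokens."""
    return TOKEN_PATTERN.findall(line_notation)


if __name__ == "__main__":
    # Strings taken from the Fig. 1a text for Nylon-6.
    for s in ["C(CCCCN)=O", "{[>]C(CCCCN[<])=O}"]:
        tokens = tokenize(s)
        # Joining the tokens reproduces the input, so nothing is lost.
        assert "".join(tokens) == s
        print(len(tokens), tokens)
```

Because the tokenizer is purely lexical, it accepts chemically invalid strings just as readily as valid ones, which is one reason control experiments on the input representation are informative for language-based models.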
The lack of standardized benchmarks prevents accurate comparison of model architectures, and many reports do not or cannot publish their benchmark data. Consequently, nearly every reported predictive or generative model for polymers looks the same (Fig. 1c and Fig. 1d), as they are largely trained on the same data and use nearly equivalent polymer featurization strategies.

[Figure 1. Panel a labels: CMDL Graph v1, CMDL Graph v2, PSMILES, BigSMILES, PSMILES (Graph). Strings shown in panel a, as extracted:
C(CCCCN)=O
{[>]C(CCCCN[<])=O}
[:1]C(CCCCN[:2])=O|1->2|DP|D
<A|[Q]C(CCCCN[Q])=O; A.Q -> A.R>
[:1]C(CCCCN[*:2])=O;1->2]

Fig. 1 a. Visual comparison of different string text representations for Nylon-6. b. Survey of polymer data structural representations in training datasets in polymer ML papers. c. Dataset source for recent polymer machine learning papers. d. Model input format from recent polymer machine learning papers. Data for b, c, and d were sourced from a survey of 80 machine learning papers in polymers published between 2018–2025.

The core of many issues surrounding foundation models for chemistry and materials revolves arou

This content is AI-processed based on ArXiv data.
