Chemistry Integrated Language Model using Hierarchical Molecular Representation for Polymer Informatics
📝 Abstract
Machine learning has transformed material discovery for inorganic compounds and small molecules, yet polymers remain largely inaccessible to these methods. While data scarcity is often cited as the primary bottleneck, we demonstrate that strategic molecular representations can overcome this limitation. We introduce CI-LLM (Chemically Informed Language Model), a framework combining HAPPY (Hierarchically Abstracted rePeat unit of PolYmer), which encodes chemical substructures as tokens, with numerical descriptors within transformer architectures. For property prediction, De$^3$BERTa, our descriptor-enriched encoder, achieves 3.5x faster inference than SMILES-based models with improved accuracy ($R^2$ score gains of 0.9-4.1 percent across four properties), while providing interpretable structure-property insights at the subgroup level. For inverse design, our GPT-based generator produces polymers with targeted properties, achieving 100 percent scaffold retention and successful multi-property optimization for negatively correlated objectives. This comprehensive framework demonstrates both forward prediction and inverse design capabilities, showcasing how strategic molecular representation advances machine learning applications in polymer science.
📄 Content
The advent of machine learning has revolutionized materials discovery. The success of these models hinges on three pillars: quality data, robust network architectures, and effective material representation schemes. Traditional models employ handcrafted numerical descriptors, such as Morgan fingerprints, RDKit descriptors, and Mordred descriptors, that encode chemical and structural information based on domain expertise [1-3]. While these approaches leverage chemical intuition, they suffer from a fundamental inflexibility: descriptor sets require manual curation for each task, and the resulting representations fail to generalize across chemically diverse systems [4,5]. Rather than relying on predefined descriptors, modern approaches train neural networks to extract features directly from raw molecular input, in the form of graph-based molecular connectivity or string representations of molecules.
Among various approaches, Transformers, the foundational network architecture of Large Language Models (LLMs), have fundamentally transformed how we process sequential information across scientific domains due to their ability to capture “grammar” and “context”. This capability has led researchers to leverage Transformers as feature extractors, generating numerical embeddings that encode substantial material-related knowledge [6,7]. Transformers have enabled breakthrough applications from protein structure prediction (AlphaFold2 [8]) to drug discovery [9] and materials design [10]. Transformers require conversion of chemical structures into molecular string representations, with SMILES (Simplified Molecular Input Line Entry System) and related string notations [11-13] being the most commonly adopted formats. Decoder-only LLMs such as GPT (Generative Pre-trained Transformer) have therefore been employed to directly generate SMILES-like sequences, often conditioning on target properties by prepending property tokens at the beginning of the sequence [14,15]. Although this approach improves controllability, it still struggles with long sequence lengths, structural complexity, validity, and diversity [16,17].
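The property-conditioning scheme mentioned above can be sketched in a few lines: a continuous target value is discretized into a special token that is prepended to the molecular token sequence before training or sampling. The bin scheme, token names, and property ranges below are illustrative assumptions, not the actual implementation from any of the cited works.

```python
def property_token(name, value, low, high, n_bins=10):
    """Discretize a continuous property into one of n_bins tokens,
    e.g. <Tg_5>, so it can be prepended to a token sequence.
    (Bin count and naming are hypothetical choices for illustration.)"""
    frac = (value - low) / (high - low)
    idx = min(n_bins - 1, max(0, int(frac * n_bins)))
    return f"<{name}_{idx}>"

def condition_sequence(tokens, props):
    """Prepend one token per target property, then wrap with start/end tokens."""
    prop_tokens = [property_token(n, v, lo, hi) for n, v, lo, hi in props]
    return prop_tokens + ["<bos>"] + tokens + ["<eos>"]

# A glass-transition temperature of 105 °C, assuming a -100..300 °C range:
seq = condition_sequence(["C", "C", "O"], [("Tg", 105.0, -100.0, 300.0)])
# seq == ['<Tg_5>', '<bos>', 'C', 'C', 'O', '<eos>']
```

At sampling time, the generator is primed with the desired property tokens and decodes the rest of the sequence, which is what makes the generation controllable.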
Yet polymers remain largely inaccessible to these methods. Despite offering unparalleled tunability through functional group selection and chain architecture, their configurational freedom generates a combinatorially explosive search space [18-20]. Their hierarchical complexity, from repeat units to functional groups to long chains with varied sequence arrangements, resists the atomistic encodings effective for other materials. Moreover, for most properties, curated polymer databases contain orders of magnitude fewer structures than protein and small-molecule repositories. Consequently, applying data-hungry Transformers to polymer systems remains exceptionally challenging. The problem intensifies for inverse design, where models must generate chemically valid polymer structures that satisfy property constraints. While Transformer encoder-based models for polymer property prediction exist, such as polyBERT [21] and TransPolymer [22], Transformer decoder (GPT)-based models for polymer design remain virtually nonexistent.
This limitation stems partly from how polymers are represented. The dominant chemical string representation, SMILES, encodes molecules as character strings where individual symbols denote atoms (C, N, O), bonds (=, #), and grammatical constructs (parentheses for branching, digits for ring closures). Transformer models for small molecules such as SMILES-BERT, ChemBERTa, and MolGPT adopt character-level tokenization, treating each SMILES symbol as an independent token [15,23,24]. This approach succeeds for small molecules, where compact sequences enable transformers to learn from millions of examples. For polymers, however, atom-level tokenization produces sequences too long for effective learning, grammatical special characters (parentheses, ring closures) complicate validity constraints, and the models struggle to maintain consistency across multiscale structural features (e.g., ensuring stereochemistry while modifying functional groups). Reinforcement learning optimization, successfully applied to small-molecule property targeting [25], exacerbates these issues: models trained on character-level polymer SMILES disperse functional groups across dozens of non-contiguous tokens, obscure hierarchical structure, and generate sequences that strain Transformer capacity, often collapsing toward invalid structures or failing to explore diverse chemical regions [26,27].
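Character-level tokenization can be illustrated with a regex-style SMILES tokenizer of the kind commonly used for SMILES language models (the exact vocabularies of SMILES-BERT, ChemBERTa, and MolGPT differ; this pattern is a simplified approximation). Even a single small repeat unit already fragments into many tokens, and note how the phenyl ring becomes eight separate tokens:

```python
import re

# Simplified atom-level SMILES tokenizer: bracket atoms and two-letter
# elements are kept whole; every other symbol becomes its own token.
SMILES_TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|@@|%\d{2}|[A-Za-z0-9]|[=#\-\+\(\)/\\\.\*])"
)

def tokenize(smiles):
    return SMILES_TOKEN_RE.findall(smiles)

# Polystyrene repeat unit with [*] attachment points:
tokens = tokenize("[*]CC([*])c1ccccc1")
# ['[*]', 'C', 'C', '(', '[*]', ')', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1']
```

A single benzene ring already consumes eight tokens under this scheme; a realistic polymer backbone with several rings and side chains quickly produces the long, fragmented sequences the text describes.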
We address these challenges through hierarchical coarse-graining into chemically meaningful motifs, mirroring both how chemists reason about molecules and the shift from character-level to sub-word tokenization in natural language processing [28]. We implement this through HAPPY (Hierarchically Abstracted rePeat unit of PolYmer), a representation that decomposes polymer structures into chemically meaningful motifs.
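The sub-word analogy can be made concrete with a rough sketch of motif-level tokenization. HAPPY's actual vocabulary and segmentation rules are not detailed in this excerpt, so the motif dictionary below is hypothetical; the point is only that mapping chemically meaningful substrings to single tokens collapses sequence length dramatically:

```python
# Hypothetical motif vocabulary for illustration only (not HAPPY's actual tokens).
MOTIFS = {
    "c1ccccc1": "<benzene>",
    "C(=O)O": "<carboxyl>",
    "C(=O)N": "<amide>",
    "CC": "<ethylene>",
}

def motif_tokenize(smiles):
    """Greedy longest-match segmentation into motif tokens;
    unmatched characters fall back to single-character tokens."""
    tokens, i = [], 0
    keys = sorted(MOTIFS, key=len, reverse=True)  # try longest motifs first
    while i < len(smiles):
        for k in keys:
            if smiles.startswith(k, i):
                tokens.append(MOTIFS[k])
                i += len(k)
                break
        else:
            tokens.append(smiles[i])
            i += 1
    return tokens

motif_tokenize("CCc1ccccc1")  # ['<ethylene>', '<benzene>']: 2 tokens vs 10 characters
```

Beyond compression, each token now corresponds to a unit a chemist would name, which is what enables the subgroup-level interpretability and scaffold-preserving generation described in the abstract.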
This content is AI-processed based on ArXiv data.