Fast Supervised Fine-Tuning of Protein Language Models for Efficient Protein Design and Novel Sequence Exploration
📝 Abstract
Supervised fine-tuning (SFT) is a standard approach for adapting large language models to specialized domains, yet its application to protein sequence modeling and protein language models (PLMs) remains ad hoc. This is in part because high-quality annotated data are far more difficult to obtain for proteins than for natural language. We present a simple and general recipe for fast SFT of PLMs, designed to improve the fidelity, reliability, and novelty of generated protein sequences. Unlike existing approaches that require costly precompiled experimental datasets for SFT, our method leverages the PLM itself, integrating a lightweight curation pipeline with domain-specific filters to construct high-quality training data. These filters can independently refine a PLM’s output and identify candidates for in vitro evaluation; when combined with SFT, they enable PLMs to generate more stable and functional enzymes, while expanding exploration into protein sequence space beyond natural variants. Although our approach is agnostic to both the choice of PLM and the protein system, we demonstrate its effectiveness with a genome-scale PLM (GenSLM) applied to the tryptophan synthase enzyme family. The supervised fine-tuned model generates sequences that are not only more novel but also display improved characteristics across both targeted design constraints and emergent protein property measures.
📄 Content
Large protein language models (PLMs) have become a crucial tool in computational protein design by offering scalable generative frameworks for exploring protein sequence space. PLMs such as ESM [1], ProtGPT2 [2], ProGen [3], and GenSLM [4] are pretrained on millions of natural protein sequences and have shown that the transformer architecture can capture evolutionary constraints and structural regularities that underpin protein functionality. These models can generate vast numbers of unseen sequences in seconds, which accelerates the design process far beyond what is feasible with traditional experimental or simulation-based methods.

Yet important considerations remain regarding the practical utility of PLMs. First, the majority of generated sequences are not functionally viable, requiring additional mechanisms to prioritize promising candidates [5,6]. Second, pretrained PLMs are typically optimized for broad next-token prediction objectives and do not directly align with downstream design goals such as stability, catalytic activity, or specificity. Addressing these challenges requires methods to adapt pretrained PLMs toward targeted objectives while preserving their generalization ability.

Supervised fine-tuning (SFT) offers a principled way to adapt pretrained PLMs to specific design objectives [7,8]. By further training on curated datasets drawn from defined protein families and experimentally validated variants, SFT shifts model priors toward sequences consistent with structural and functional constraints. In large language models (LLMs), SFT has become the standard recipe for domain adaptation and task alignment [9,10], often serving as a precursor to reinforcement learning approaches [11]. In natural language, SFT is often performed via self-distillation, in which synthetic training data are generated by the model itself.
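The self-distillation loop described above (sample from the model, keep only candidates that pass domain filters, fine-tune on the survivors) can be sketched in a few lines. The sketch below is illustrative only: the random "model" and the two toy filters (a length window and a conserved-residue check) are stand-ins, not the paper's actual GenSLM components or TrpB criteria.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def self_distill_round(sample_fn, update_fn, filters, n_candidates=500):
    """One round of self-distillation SFT: sample candidate sequences from
    the model, keep only those passing every domain filter, then run a
    fine-tuning step on the curated survivors. All names are illustrative."""
    candidates = [sample_fn() for _ in range(n_candidates)]
    curated = [s for s in candidates if all(f(s) for f in filters)]
    if curated:
        update_fn(curated)  # SFT step on the curated synthetic data
    return curated

# Stand-in "model": uniform random sequences of varying length.
rng = random.Random(0)
def sample_fn():
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(rng.randint(380, 420)))

trained_on = []
def update_fn(batch):
    trained_on.extend(batch)  # stand-in for a gradient update

# Toy filters mirroring the spirit of the paper's criteria (values invented):
filters = [
    lambda s: 390 <= len(s) <= 410,  # length window filter
    lambda s: s[100] in "KR",        # "active-site conservation" proxy
]
curated = self_distill_round(sample_fn, update_fn, filters)
```

In a real pipeline, `sample_fn` would decode from the PLM, `update_fn` would run an SFT gradient step, and the filters would include structure-based checks such as a pLDDT threshold.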
In contrast, applications of SFT in protein modeling have remained inconsistent, with ad hoc choices in data selection, filtering, and evaluation [12,8], largely due to the difficulty of obtaining high-quality experimentally labeled data. Such variability undermines reproducibility and limits the systematic integration of SFT and its benefits into protein design pipelines.
A central challenge underlying this inconsistency is the reliance of SFT on large-scale, high-quality protein sequence annotations. While natural language annotations can be generated rapidly by non-experts, evaluating protein functionality requires labor-intensive and costly experimental assays. This fundamental barrier constrains the scalability of SFT for PLMs, particularly when exploring novel protein families for which annotated data are sparse or unavailable. Computational proxies for annotation can potentially address this bottleneck, but their inherent inaccuracies further complicate training, as they cannot guarantee the identification of functionally viable sequences.
We propose a simple and general recipe for filtering functionally viable, artificially generated protein sequences to enable self-distilling SFT. The framework emphasizes standardized computational data curation, the integration of lightweight sequence- and structure-based filters, and reproducible filtering procedures aligned with natural criteria for protein variant viability. The filters are broadly applicable across PLMs and protein families for (1) purifying final candidates for in vitro analysis, and (2) curating SFT data. The design of our filters allows flexible control over similarity measures by sampling from a wide range of proteins. This enables tuning the trade-off between sequence novelty and functionality, two essential considerations for evaluating PLM performance in protein design. To illustrate the approach, we focus on the β-chain of the tryptophan synthase complex (TrpB), a well-studied enzyme in amino acid biosynthesis. We fine-tune a 25M-parameter GenSLM model on TrpB sequences using filtered data and evaluate the resulting model on both explicitly enforced properties (including sequence length, active-site conservation, and predicted folding confidence, pLDDT) and emergent properties not directly optimized during filtering, such as predicted stability and docking scores to natural substrates. This case study demonstrates that filtering-based SFT improves not only targeted properties but also generalizes to unseen objectives, underscoring the broader utility of our recipe for protein engineering.
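One way the novelty-functionality trade-off described above can be exposed as a tunable knob is through a maximum-identity filter against natural reference sequences: lowering the identity ceiling pushes generation toward novelty, while raising it favors fidelity to known variants. The sketch below is a minimal illustration under that assumption; the `0.9` threshold and function names are invented for this example, not taken from the paper.

```python
def sequence_identity(a, b):
    """Fraction of matching positions between two sequences (ungapped,
    position-wise comparison; a real pipeline would align first)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def novelty_filter(candidate, natural_refs, max_identity=0.9):
    """Keep a candidate only if it is sufficiently dissimilar from every
    natural reference sequence. Threshold is illustrative."""
    return all(sequence_identity(candidate, r) <= max_identity
               for r in natural_refs)

refs = ["ACDEF", "ACDEG"]            # toy "natural variants"
novel = novelty_filter("WWWWW", refs)   # identity 0.0 to both refs
too_close = novelty_filter("ACDEF", refs)  # identity 1.0 with first ref
```

In practice one would compute identity from a proper pairwise alignment (e.g. with an alignment library) rather than position-wise matching, but the control knob is the same: `max_identity` directly trades novelty against similarity to natural sequence space.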
Here, we describe the protein language model (PLM) and the filtering pipeline used to curate high-quality data for supervised fine-tuning (SFT). Details on the studied protein family are presented in the Appendix. As discussed above, SFT improves the fidelity of generated sequences while also enhancing their novelty. Both aspects are critical for generative protein design: higher fidelity increases the likelihood of functional proteins, while greater novelty allows exploration beyond natural variants.
This content is AI-processed based on ArXiv data.