De novo generation of functional terpene synthases using TpsGPT
Terpene synthases (TPS) are a key family of enzymes responsible for generating the diverse terpene scaffolds that underpin many natural products, including front-line anticancer drugs such as Taxol. However, de novo TPS design through directed evolution is costly and slow. We introduce TpsGPT, a generative model for scalable TPS protein design, built by fine-tuning the protein language model ProtGPT2 on 79k TPS sequences mined from UniProt. TpsGPT generated de novo enzyme candidates in silico and we evaluated them using multiple validation metrics, including EnzymeExplorer classification, ESMFold structural confidence (pLDDT), sequence diversity, CLEAN classification, InterPro domain detection, and Foldseek structure alignment. From an initial pool of 28k generated sequences, we identified seven putative TPS enzymes that satisfied all validation criteria. Experimental validation confirmed TPS enzymatic activity in at least two of these sequences. Our results show that fine-tuning of a protein language model on a carefully curated, enzyme-class-specific dataset, combined with rigorous filtering, can enable the de novo generation of functional, evolutionarily distant enzymes.
💡 Research Summary
The paper presents TpsGPT, a generative framework for de novo design of terpene synthases (TPS), built by fine‑tuning the large protein language model ProtGPT2 on a curated dataset of 79,000 TPS sequences mined from UniProt. The authors start with a seed set of 1,125 experimentally validated TPS enzymes, expand it using HMMER searches against Pfam and SUPERFAMILY profiles, and apply stringent filters (length 300–1100 aa, removal of non‑TPS hits, presence of canonical DDXXD/NSE/DTE or DXDD motifs) to obtain a high‑quality training corpus. The corpus is split into six partitions such that sequences in different partitions share at most 30% pairwise identity; five partitions (≈63k sequences) are used for training and one for validation, preventing data leakage between the two sets.
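The length and motif filters described above can be sketched as a simple predicate. This is an illustrative reconstruction, not the authors' code: the motif regular expressions (especially the NSE/DTE approximation) and the helper name `passes_curation` are our own assumptions.

```python
import re

# Illustrative sketch of the curation filters described in the text;
# bounds and motif names follow the summary, patterns are approximations.
MIN_LEN, MAX_LEN = 300, 1100

# Canonical class I (DDXXD, NSE/DTE) and class II (DXDD) TPS motifs,
# written as regular expressions over one-letter amino-acid codes.
MOTIFS = [
    re.compile(r"DD..D"),            # DDXXD aspartate-rich motif
    re.compile(r"[ND]D..[ST]..E"),   # NSE/DTE triad (rough approximation)
    re.compile(r"D.DD"),             # DXDD (class II)
]

def passes_curation(seq: str) -> bool:
    """Keep sequences of plausible TPS length that contain at least one
    canonical metal-binding motif."""
    if not (MIN_LEN <= len(seq) <= MAX_LEN):
        return False
    return any(m.search(seq) for m in MOTIFS)
```

In practice such motif checks would complement, not replace, the HMMER profile searches, since motif regexes alone admit many false positives.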
Fine‑tuning is performed on the distilled “tiny” version of ProtGPT2 (38.9 M parameters) rather than the full 738 M model, enabling rapid training on a single NVIDIA L4 GPU with Lightning AI. After fine‑tuning, the model generates 28,000 candidate TPS sequences. A multi‑stage filtering pipeline reduces this pool to seven high‑confidence candidates:
- Sequence filters – top 10% of sequences ranked by perplexity (lower perplexity means the model finds the sequence more plausible), and maximum pairwise identity to the training set ≤60% (to enforce evolutionary distance).
- Functional filters – EnzymeExplorer TPS score ≥0.7, CLEAN EC prediction matching terpenoid pathways, InterPro domain annotation indicating TPS‑specific domains.
- Structural filters – ESMFold predicted pLDDT ≥70 (indicating reliable backbone prediction) and Foldseek TM‑score between 0.6 and 0.9 when aligned to the closest training‑set structure.
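The thresholds above can be collected into a single predicate applied to each generated sequence. This is a minimal sketch under stated assumptions: the `Candidate` fields are illustrative names for the per-sequence scores produced by the respective tools, and the boolean CLEAN/InterPro fields stand in for the actual tool outputs.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    # Illustrative per-sequence scores from the validation tools;
    # field names are our own, thresholds follow the text above.
    perplexity_percentile: float   # rank within generated pool (0.0 = best)
    max_train_identity: float      # max pairwise identity to training set (%)
    enzyme_explorer_score: float   # EnzymeExplorer TPS score
    clean_is_tps: bool             # CLEAN EC prediction maps to a terpenoid pathway
    interpro_has_tps_domain: bool  # InterPro reports a TPS-specific domain
    plddt: float                   # ESMFold mean pLDDT
    tm_score: float                # Foldseek TM-score vs. closest training structure

def passes_all_filters(c: Candidate) -> bool:
    """Apply the sequence, functional, and structural filters in one pass."""
    return (
        c.perplexity_percentile <= 0.10
        and c.max_train_identity <= 60.0
        and c.enzyme_explorer_score >= 0.7
        and c.clean_is_tps
        and c.interpro_has_tps_domain
        and c.plddt >= 70.0
        and 0.6 <= c.tm_score <= 0.9
    )
```

Note the upper bound on TM-score: candidates that align too closely (>0.9) to a training structure are excluded, which biases the surviving pool toward novel yet still TPS-like folds.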
All seven candidates meet these criteria, with pLDDT scores of 71–80, TM‑scores of 0.65–0.84, and low sequence identity to the training set (49.7–60%). CLEAN assigns each to a known TPS EC class, and InterPro detects at least one TPS‑related domain per sequence.
Experimental validation is carried out by heterologously expressing each candidate in a Saccharomyces cerevisiae strain engineered for high geranylgeranyl pyrophosphate production. LC‑MS analysis of culture extracts reveals diterpene‑like products (C₂₀H₃₆O₂, consistent with sclareol) for two candidates, TpsGPT1 and TpsGPT2, confirming functional terpene synthase activity. The other five candidates remain to be tested.
The authors emphasize the cost‑effectiveness of the approach: the entire computational pipeline required less than $200 in GPU time, a stark contrast to robotic continuous evolution platforms that can cost hundreds of thousands of dollars. They also note limitations: only two of seven candidates showed activity, and the detected products contain oxygen, suggesting that the enzymes may not follow canonical class I or II TPS mechanisms.
Future directions include conditioning generation on specific terpene subclasses to produce targeted products, integrating catalytic‑site structural data for more precise functional control, and extending the workflow to other under‑characterized enzyme families such as lysozymes. The study demonstrates that fine‑tuning a protein language model on a focused, high‑quality dataset, combined with rigorous multi‑modal validation, can explore protein sequence space far beyond natural diversity and yield functional, evolutionarily distant enzymes.