Exploring the limits of pre-trained embeddings in machine-guided protein design: a case study on predicting AAV vector viability

Reading time: 5 minutes
...

📝 Original Info

  • Title: Exploring the limits of pre-trained embeddings in machine-guided protein design: a case study on predicting AAV vector viability
  • ArXiv ID: 2602.14828
  • Date: 2026-02-16
  • Authors: Author information was not provided in the processed metadata (see the original paper for author names and affiliations).

📝 Abstract

Effective representations of protein sequences are widely recognized as a cornerstone of machine learning-based protein design. Yet, protein bioengineering poses unique challenges for sequence representation, as experimental datasets typically feature few mutations, which are either sparsely distributed across the entire sequence or densely concentrated within localized regions. This limits the ability of sequence-level representations to extract functionally meaningful signals. In addition, comprehensive comparative studies remain scarce, despite their crucial role in clarifying which representations best encode relevant information and ultimately support superior predictive performance. In this study, we systematically evaluate multiple ProtBERT and ESM2 embedding variants as sequence representations, using the adeno-associated virus capsid as a case study and prototypical example of bioengineering, where functional optimization is targeted through highly localized sequence variation within an otherwise large protein. Our results reveal that, prior to fine-tuning, amino acid-level embeddings outperform sequence-level representations in supervised predictive tasks, whereas the latter tend to be more effective in unsupervised settings. However, optimal performance is only achieved when embeddings are fine-tuned with task-specific labels, with sequence-level representations providing the best performance. Moreover, our findings indicate that the extent of sequence variation required to produce notable shifts in sequence representations exceeds what is typically explored in bioengineering studies, showing the need for fine-tuning in datasets characterized by sparse or highly localized mutations.

💡 Deep Analysis

📄 Full Content

Machine learning (ML)-based protein design has become a powerful strategy in modern protein engineering [1]. A critical aspect of this approach is selecting an appropriate format to represent the protein sequence as input for the ML model. The optimal representation depends on factors such as the specific task to be learnt and dataset characteristics (for comprehensive reviews on the topic, see, for example, Yue et al. (2023) [2] or Harding-Larsen et al. (2024) [3]). Since different representations capture distinct types of information and give rise to different data structures and properties, this decision can significantly impact computational complexity, model accuracy, and generalizability.

A relevant use case where optimal representation formats are critical is the ML-based protein bioengineering of viral vectors for gene therapy. The datasets used in this field tend to be either deep mutational scanning studies, which target the entire protein sequence but typically introduce only one mutation per variant (e.g., [4]), or high-intensity mutational studies, which explore a dense set of variations restricted to small regions of the protein. For example, in recent ML-based bioengineering studies of adeno-associated virus (AAV) vectors, researchers have typically concentrated on a 20-50 amino acid region of the capsid, a protein that exceeds 700 residues [5][6][7]. This challenge extends beyond viral vectors and reflects a general issue in protein bioengineering studies, including areas such as therapeutic enzyme optimization [8] and peptide or antibody design for cancer [9,10], where sequence variants rarely contain dense mutations spanning the entire length of the protein. This has important implications for representation formats, since many of them, including those considered state-of-the-art, are designed to capture broad sequence-level information at the scale of the entire protein. When functional optimization depends instead on small or highly localized sequence changes, these representations may dilute the signal of interest within a much larger context. Therefore, it is crucial to evaluate how the number and localization of sequence changes affect the various representation formats, how they perform under these constraints, and to identify strategies to adapt them to the data regimes characteristic of this field.

Protein representations can be broadly categorized into three main groups, based on the type of information they capture: primary amino acid sequence, three-dimensional structure, and molecular dynamics or activity [3]. Multimodal approaches that integrate various types of information are also gaining traction [11]. Representations based on primary amino acid sequences are often the first choice, as sequence data is more readily available and offers greater reliability. Traditionally, sequences are transformed into numerical formats based on predefined rules, such as positional encoding schemes (e.g., one-hot encoding (OHE) of amino acids) or descriptors that summarize physicochemical properties of amino acid residues (e.g., hydrophobicity, charge, or molecular weight) [12]. These formats are relatively simple to generate and have long served as a workhorse for classical ML approaches. However, their reliance on hand-crafted features requires the user to decide what information is most relevant, which can limit flexibility and bias the model with arbitrary assumptions. More recently, representations learnt directly from raw amino acid sequences, called embeddings, have gained relevance [11].
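
As a concrete illustration of such a rule-based encoding, the sketch below builds a one-hot matrix for a protein sequence. It is a minimal example: the amino-acid ordering and the short peptide are arbitrary choices for demonstration, not taken from the paper.

```python
import numpy as np

# Canonical 20 amino acids; the ordering here is an arbitrary convention.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence: str) -> np.ndarray:
    """Encode a protein sequence as an (L x 20) one-hot matrix."""
    encoding = np.zeros((len(sequence), len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(sequence):
        encoding[pos, AA_INDEX[aa]] = 1.0
    return encoding

# Example: a short, made-up peptide.
ohe = one_hot_encode("MKTAYIAK")
print(ohe.shape)  # (8, 20): one row per residue, one column per amino acid
```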

An embedding is a continuous vector that represents an entity (e.g., a protein, word, image). Embeddings can be generated using several methods, but modern approaches typically rely on deep learning models that learn representations from large datasets by mapping raw inputs into high-dimensional feature spaces. This enables features to be learned automatically and allows the representation to encode rich, high-level information beyond the raw data, without explicit user intervention [13]. Such representations are generally referred to as pre-trained embeddings, as they are learned prior to their application in specific downstream tasks. Because embeddings represent entities in the same n-dimensional space, they support similarity calculations between entities [13]. Another key advantage of embeddings is their adaptability through fine-tuning, a transfer learning strategy in which a pre-trained model is further trained on a smaller, task-specific dataset [14]. This process tailors the embeddings to the downstream task (e.g., protein function classification) by refining them to capture task-relevant patterns, enhancing their specificity, discriminative power, and overall predictive accuracy [15].
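
To make the distinction between amino acid-level and sequence-level embeddings concrete, here is a minimal sketch that extracts per-residue vectors from a pre-trained ESM2 checkpoint via the Hugging Face transformers library, mean-pools them into a single sequence embedding, and compares two variants by cosine similarity. The checkpoint name, the mean-pooling choice, and the example sequences are illustrative assumptions, not the paper's exact setup.

```python
import torch
from transformers import AutoTokenizer, EsmModel

# Assumption: the smallest public ESM2 checkpoint, used here only for illustration;
# the paper evaluates several ProtBERT and ESM2 variants.
MODEL_NAME = "facebook/esm2_t6_8M_UR50D"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = EsmModel.from_pretrained(MODEL_NAME).eval()

def embed(sequence: str):
    """Return per-residue embeddings (L x d) and a mean-pooled sequence embedding (d,)."""
    inputs = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (tokens, d)
    residue_emb = hidden[1:-1]                          # drop CLS/EOS special tokens
    sequence_emb = residue_emb.mean(dim=0)              # one common pooling choice
    return residue_emb, sequence_emb

# Similarity between two hypothetical variants of a short fragment in embedding space.
_, emb_a = embed("MAADGYLPDWLEDTLS")
_, emb_b = embed("MAADGYLPDWLEDNLS")
similarity = torch.nn.functional.cosine_similarity(emb_a, emb_b, dim=0)
print(float(similarity))
```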

Among embeddings, those generated with transformer-based models [16] have the added value of incorporating attention mechanisms [17]. Attention enables the model to dynamically focus on different regions of the input sequence when generating each position's representation.
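
For intuition on the attention mechanism itself, the following sketch implements standard single-head scaled dot-product self-attention as described in the transformer literature [16,17]. The toy shapes and random input are illustrative; real protein language models stack many multi-head layers of this kind.

```python
import torch
import torch.nn.functional as F

def single_head_attention(query, key, value):
    """Scaled dot-product attention: each output position is a weighted sum of
    `value` rows, weighted by how strongly its query matches each key."""
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / d_k ** 0.5   # (L, L) pairwise relevance
    weights = F.softmax(scores, dim=-1)                    # attention distribution per position
    return weights @ value, weights

# Toy example: 8 residue positions with 16-dimensional representations.
x = torch.randn(8, 16)
out, attn = single_head_attention(x, x, x)                 # self-attention over the sequence
print(out.shape, attn.shape)                               # torch.Size([8, 16]) torch.Size([8, 8])
```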

Reference

This content is AI-processed based on open access ArXiv data.
