Structure-Aligned Protein Language Model
Protein language models (pLMs) pre-trained on vast protein sequence databases excel at various downstream tasks but often lack the structural knowledge essential for some biological applications. To address this, we introduce a method to enrich pLMs with structural knowledge by leveraging pre-trained protein graph neural networks (pGNNs). First, a latent-level contrastive learning task aligns residue representations from pLMs with those from pGNNs across multiple proteins, injecting inter-protein structural information. Additionally, a physical-level task integrates intra-protein information by training pLMs to predict structure tokens. Together, the proposed dual-task framework effectively incorporates both inter- and intra-protein structural knowledge into pLMs. Given the variability in the quality of protein structures in PDB, we further introduce a residue loss selection module that uses a small model trained on high-quality structures to select reliable yet challenging residue losses for the pLM to learn. Applying our structure alignment method as a simple, lightweight post-training step to the state-of-the-art ESM2 and AMPLIFY yields notable performance gains. These improvements are consistent across a wide range of tasks, including substantial gains in deep mutational scanning (DMS) fitness prediction and a 59% increase in P@L for ESM2 650M contact prediction on CASP16. Furthermore, we demonstrate that these performance gains are robust, scaling with model sizes from 8M to 650M and extending to different downstream tasks.
💡 Research Summary
The paper introduces a lightweight post‑training method called Structure‑Aligned Protein Language Model (SAM) that injects structural knowledge into existing protein language models (pLMs) without sacrificing their sequence‑only nature. The authors leverage pre‑trained protein graph neural networks (pGNNs) as a source of structural information and design a dual‑task framework that jointly optimizes two complementary objectives.
The first objective, a latent‑level contrastive learning task, aligns residue‑level hidden representations from the pLM with frozen embeddings produced by a pGNN (GearNet) across a batch of proteins. Both sets of embeddings are projected into a common space with learnable linear maps and a temperature scaling factor; the method then maximizes the similarity of matched residue pairs (the same residue as seen by both models) while minimizing similarity to every other residue in the batch. This inter‑protein contrastive loss (InfoNCE‑style) transfers global structural patterns captured by the graph model into the language model.
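The inter‑protein contrastive objective can be sketched as a symmetric InfoNCE loss over matched residue embeddings. The function below is a minimal NumPy illustration, not the paper's implementation: the projection layers are assumed to have already been applied, and the temperature value is an assumption.

```python
import numpy as np

def info_nce_loss(plm_emb, gnn_emb, temperature=0.07):
    """Symmetric InfoNCE over matched residue embeddings.

    plm_emb, gnn_emb: (N, d) residue embeddings already projected into
    a shared space; row i of each array describes the same residue.
    Diagonal pairs are positives; all other rows act as negatives.
    """
    def log_softmax(x):
        m = x.max(axis=1, keepdims=True)
        return x - m - np.log(np.exp(x - m).sum(axis=1, keepdims=True))

    # L2-normalize so dot products become cosine similarities
    p = plm_emb / np.linalg.norm(plm_emb, axis=1, keepdims=True)
    g = gnn_emb / np.linalg.norm(gnn_emb, axis=1, keepdims=True)
    logits = p @ g.T / temperature  # (N, N) similarity matrix

    # Sequence-to-structure and structure-to-sequence directions
    loss_s2g = -np.mean(np.diag(log_softmax(logits)))
    loss_g2s = -np.mean(np.diag(log_softmax(logits.T)))
    return 0.5 * (loss_s2g + loss_g2s)
```

Perfectly aligned embedding pairs drive the diagonal similarities toward 1 and the loss toward zero, while mismatched pairs are pushed apart.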
The second objective, a physical‑level task, addresses the limitation that pure contrastive alignment may over‑emphasize inter‑protein patterns and ignore intra‑protein context. Here, each residue’s hidden state from the pLM is used to predict a discrete “structure token” that encodes the residue’s 3‑D conformation relative to its nearest neighbor (as defined by the Foldseek tokenizer). The prediction head is a multi‑layer perceptron trained with a standard cross‑entropy loss. By jointly training on both tasks with equal weighting (γ_latent = γ_physical = 0.5) and retaining the original masked language modeling (MLM) loss, SAM preserves the pLM’s ability to model sequence statistics while gaining structural awareness.
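A minimal sketch of the physical‑level objective and the combined loss, assuming a 20‑state Foldseek 3Di vocabulary; the helper names and shapes are illustrative, not the authors' code:

```python
import numpy as np

def structure_token_loss(logits, tokens):
    """Cross-entropy over discrete structure tokens.

    logits: (N, V) scores from an MLP head on the pLM hidden states;
    tokens: (N,) integer labels (V = 20 for the Foldseek 3Di alphabet).
    """
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(tokens)), tokens])

def total_loss(l_mlm, l_latent, l_physical, g_latent=0.5, g_physical=0.5):
    """Equal-weighted sum of the two alignment losses, with MLM retained."""
    return l_mlm + g_latent * l_latent + g_physical * l_physical
```

With uniform logits the token loss reduces to log V, and it approaches zero as the head becomes confident in the correct token.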
A key practical challenge is the variable quality of protein structures in the PDB. To avoid noisy supervision, the authors introduce a residue‑loss selection module. They curate a high‑quality reference set using resolution and R‑free thresholds, train a small reference model on this set, and compute an “excess loss” for each residue as the difference between the current model’s loss and the reference model’s loss. Only residues with high excess loss—those that are both reliable (high‑quality structure) and challenging (large current error)—are retained for each of the three loss components (sequence‑to‑structure contrast, structure‑to‑sequence contrast, and token prediction). This filtering discards inaccurate residues and easy examples, focusing learning on informative cases.
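The selection step can be sketched as follows; the top‑k rule and the `keep_ratio` hyperparameter are assumptions for illustration, since the summary only specifies that high‑excess‑loss residues are retained:

```python
import numpy as np

def select_residues(current_losses, reference_losses, keep_ratio=0.5):
    """Keep the residues with the largest excess loss.

    excess = current - reference: a large value means the small
    reference model (trained on high-quality structures) fits the
    residue well (reliable) while the current model still errs on it
    (challenging). keep_ratio is a hypothetical hyperparameter.
    """
    excess = np.asarray(current_losses) - np.asarray(reference_losses)
    k = max(1, int(len(excess) * keep_ratio))
    return np.argsort(excess)[::-1][:k]  # indices, highest excess first
```

In the paper's framework this filter would be applied separately to each of the three loss components before they are averaged into the training objective.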
Experiments apply SAM as a post‑training step to state‑of‑the‑art pLMs: ESM2 (sizes 8 M, 35 M, 650 M) and AMPLIFY. The pGNN is frozen, and only a few million parameters are updated, making the procedure computationally cheap. Evaluation spans a broad suite of benchmarks: deep mutational scanning (DMS) fitness prediction on ProteinGym, contact‑map prediction on a held‑out CASP16 set, nine tasks from xTrimoPGLM, nine from SaProt, and pseudo‑perplexity on a high‑quality validation set.
Results show consistent improvements across model scales and tasks. On DMS, SAM yields 7–10 % higher Spearman correlation compared to the baseline pLMs. For contact prediction, ESM2‑650 M’s precision‑at‑L (P@L) improves by 59 % after SAM alignment. The gains persist for the smallest 8 M model, demonstrating scalability. Importantly, pseudo‑perplexity remains essentially unchanged, indicating that the language modeling capability is not compromised. The residue‑loss selection module adds an extra 3–5 % boost on average, confirming its effectiveness.
The authors discuss several advantages: (1) the method is a lightweight post‑training that does not require retraining the massive pLM from scratch; (2) freezing the pGNN allows easy swapping of different graph encoders; (3) no explicit structural input is needed at inference time, so the model can be applied to proteins lacking experimental structures or reliable AlphaFold predictions, including intrinsically disordered regions. Limitations include dependence on the quality of the frozen pGNN and the relatively coarse granularity of current structure tokens, which capture only relative orientation to the nearest neighbor. Future work could explore more expressive VQ‑VAE tokenizers, multi‑scale graph embeddings, or end‑to‑end fine‑tuning of the pGNN.
In summary, SAM demonstrates that a simple contrastive alignment combined with token‑level supervision and a smart loss‑selection strategy can endow sequence‑only protein language models with rich structural knowledge. The approach yields substantial, robust performance gains on functional prediction, contact mapping, and language modeling benchmarks while remaining computationally efficient and broadly applicable across model sizes and downstream tasks.