Controlling Repetition in Protein Language Models


Protein language models (PLMs) have enabled advances in structure prediction and de novo protein design, yet they frequently collapse into pathological repetition during generation. Unlike in text, where repetition merely reduces readability, in proteins it undermines structural confidence and functional viability. To characterize this problem, we present the first systematic study of repetition in PLMs. We first propose quantitative metrics to characterize motif-level and homopolymer repetition and then demonstrate their negative impact on folding reliability. To address this challenge, we propose UCCS (Utility-Controlled Contrastive Steering), which steers protein generation using utility-constrained contrastive datasets. Instead of naively contrasting high- vs. low-repetition sequences, we construct contrastive sets that maximize differences in repetition while tightly controlling for structural utility. This disentanglement yields steering vectors that specifically target repetition without degrading foldability. Injected at inference, these vectors consistently reduce repetition without retraining or heuristic decoding. Experiments with ESM-3 and ProtGPT2 on CATH, UniRef50, and SCOP show that our method outperforms decoding penalties and other baselines, substantially lowering repetition while preserving AlphaFold confidence scores. Our results establish repetition control as a central challenge for PLMs and highlight dataset-guided steering as a principled approach for reliable protein generation.


💡 Research Summary

Protein language models (PLMs) have become powerful tools for structure prediction and de‑novo protein design, yet a pervasive failure mode—pathological repetition—has been largely ignored. This paper provides the first systematic investigation of this phenomenon, introduces quantitative metrics tailored to protein sequences, and proposes a novel inference‑time steering method that reduces repetition without sacrificing structural plausibility.

The authors first categorize repetition into two biologically meaningful forms: (1) motif‑level repetition, where short n‑grams (e.g., “AGAGAG”) recur cyclically, and (2) homopolymer repetition, where a single amino acid forms long runs (e.g., “AAAAAA”). Existing NLP diversity metrics (token entropy, distinct‑n) capture only part of the problem, so the paper defines three complementary scores: (i) normalized Shannon entropy (H_norm) to measure global amino‑acid imbalance, (ii) Distinct‑2 and Distinct‑3 to quantify local motif diversity, and (iii) a homopolymer diversity score (R_hpoly) that penalizes runs longer than a biologically motivated threshold (k = 4). These three components are combined into a unified repetition score R(x). In parallel, a utility score U(x) is derived from AlphaFold confidence metrics (pLDDT, pTM) to assess foldability. The authors show that natural proteins and PLM‑generated sequences are clearly separated in the (R, U) space, with high Jensen‑Shannon divergence between them.
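The three metric components can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the run threshold k = 4 comes from the summary above, but the exact form of R_hpoly and the weighting used to combine the components into R(x) are assumptions made here for concreteness.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues

def h_norm(seq):
    """Normalized Shannon entropy of the residue distribution
    (0 = a single residue, 1 = perfectly uniform usage)."""
    counts = Counter(seq)
    n = len(seq)
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h / math.log2(len(AMINO_ACIDS))

def distinct_n(seq, n):
    """Fraction of unique n-grams among all n-grams (local motif diversity)."""
    grams = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0

def r_hpoly(seq, k=4):
    """Fraction of residues inside homopolymer runs longer than k.
    (Illustrative form; the paper's exact definition may differ.)"""
    penalized, run = 0, 1
    for i in range(1, len(seq) + 1):
        if i < len(seq) and seq[i] == seq[i - 1]:
            run += 1
        else:
            if run > k:
                penalized += run
            run = 1
    return penalized / len(seq)

def repetition_score(seq, k=4):
    """Illustrative unified score R(x): higher = more repetitive.
    An unweighted average is assumed; the paper's weighting is not given here."""
    diversity = (h_norm(seq) + distinct_n(seq, 2) + distinct_n(seq, 3)
                 + (1 - r_hpoly(seq, k))) / 4
    return 1.0 - diversity
```

Under this sketch a homopolymer like "AAAA…" scores near 1, while a sequence using all twenty residues once scores near 0, matching the intended ordering of R(x).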

To control repetition while preserving utility, the paper introduces Utility‑Controlled Contrastive Steering (UCCS). The key idea is to construct two contrastive datasets that are matched in utility (similar U) but differ maximally in repetition (high‑R vs. low‑R). Hidden‑state activations are averaged over each set, and the difference vector v_UCCS is extracted. During generation, v_UCCS is added (scaled by a coefficient α) to the token embeddings at each step, nudging the model away from repetitive directions. This operation requires no model retraining and can be applied to any PLM at inference time.
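The extraction-and-injection step above can be sketched as follows. This is a schematic with plain Python lists standing in for hidden-state tensors; the function names, the sign convention (v_UCCS points from the high-repetition set toward the low-repetition one, so it is added during generation), the unit normalization, and the default α are all assumptions for illustration.

```python
import math

def mean_activation(activations):
    """Component-wise average of a list of hidden-state vectors."""
    dim = len(activations[0])
    return [sum(a[d] for a in activations) / len(activations) for d in range(dim)]

def steering_vector(low_rep_acts, high_rep_acts):
    """v_UCCS: unit vector from the high-repetition set's mean activation
    toward the low-repetition set's (both sets matched in utility U)."""
    lo, hi = mean_activation(low_rep_acts), mean_activation(high_rep_acts)
    v = [a - b for a, b in zip(lo, hi)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def steer(hidden, v_uccs, alpha=1.0):
    """Inference-time injection: add the scaled steering vector to a
    token embedding, nudging generation away from repetitive directions."""
    return [h + alpha * x for h, x in zip(hidden, v_uccs)]
```

In the actual method this addition happens at every generation step inside the model; no weights are updated, which is why the intervention applies to any PLM without retraining.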

Experiments are conducted on two representative PLMs—ESM‑3 (masked language model) and ProtGPT2 (autoregressive model)—across three benchmark protein datasets (CATH, UniRef50, SCOP) and both unconditional and conditional generation settings. Results demonstrate that UCCS consistently lowers the repetition score R by 15‑30 % compared to strong baselines (temperature = 0.8, top‑p = 0.9, repetition penalty = 1.2). Importantly, the utility score U is maintained or slightly improved (average pLDDT gains of 0.5‑2.0 points). For ESM‑3, the homopolymer metric R_hpoly drops dramatically, indicating near‑elimination of long homopolymer stretches; for ProtGPT2, Distinct‑2/3 recover to levels comparable with natural proteins. The method outperforms conventional decoding heuristics across all datasets, confirming that a representation‑level intervention can target repetition more precisely than token‑level penalties.

The contributions are threefold: (1) a principled set of protein‑specific repetition metrics that correlate with folding reliability, (2) the UCCS framework that disentangles repetition from structural utility via contrastive dataset construction and linear steering, and (3) extensive empirical validation showing that UCCS improves both diversity and foldability without any additional training. Limitations include the need to manually select the scaling factor α and the focus on a single control objective; extending the approach to multi‑objective steering (e.g., preserving active‑site motifs while reducing repetition) remains an open challenge. Future work could integrate meta‑learning or reinforcement learning to automatically tune α, or combine UCCS with other controllable generation techniques to achieve richer functional specifications.

Overall, the paper establishes pathological repetition as a central obstacle for reliable protein generation and provides a clear, data‑driven solution that can be adopted by the broader PLM community.

