BioBridge: Bridging Proteins and Language for Enhanced Biological Reasoning with LLMs

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original ArXiv source.

Existing Protein Language Models (PLMs) often suffer from limited adaptability to multiple tasks and exhibit poor generalization across diverse biological contexts. In contrast, general-purpose Large Language Models (LLMs) lack the capability to interpret protein sequences and fall short in domain-specific knowledge, limiting their capacity for effective biosemantic reasoning. To combine the advantages of both, we propose BioBridge, a domain-adaptive continual pretraining framework for protein understanding. This framework employs Domain-Incremental Continual Pre-training (DICP) to infuse protein domain knowledge and a general reasoning corpus into an LLM simultaneously, effectively mitigating catastrophic forgetting. Cross-modal alignment is achieved via a PLM-Projector-LLM pipeline, which maps protein sequence embeddings into the semantic space of the language model. Finally, end-to-end optimization is adopted to uniformly support various tasks, including protein property prediction and knowledge question-answering. Our proposed BioBridge demonstrates performance comparable to that of mainstream PLMs on multiple protein benchmarks, such as EC and BindingDB. It also achieves results on par with LLMs on general understanding tasks like MMLU and RACE. This showcases its innovative advantage of combining domain-specific adaptability with general-purpose language competency.


💡 Research Summary

BioBridge presents a novel framework that unifies protein language models (PLMs) and general-purpose large language models (LLMs) to achieve robust biological reasoning while preserving broad linguistic competence. The authors identify two fundamental obstacles in existing approaches: (1) the “biological knowledge barrier,” where LLMs lack systematic protein‑specific information, and (2) the “modality gap,” stemming from the distinct syntactic and structural properties of amino‑acid sequences compared to natural language. To overcome these challenges, BioBridge integrates three core components.

First, Domain‑Incremental Continual Pre‑training (DICP) adapts a strong LLM (Qwen2.5‑7B‑Instruct) to the biomedical domain without erasing its original language abilities. The authors compile a mixed corpus comprising (i) advanced biology textbooks, (ii) full‑text PubMed articles and abstracts, (iii) sequence‑augmented sentences where named entities are replaced or appended with explicit amino‑acid strings, and (iv) high‑quality protein‑description pairs from Swiss‑Prot. To mitigate catastrophic forgetting, a small “Mixture of Thoughts” (MoT) subset containing mathematical, coding, and scientific reasoning problems is interleaved, preserving general reasoning capacity. The DICP stage runs for one epoch on 24 NVIDIA A800 GPUs with a learning rate of 1e‑5 and a context window of 4096 tokens.
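The data-mixing step described above can be sketched as follows. This is a minimal illustration of interleaving a small reasoning "replay" subset into the domain corpus; the `mot_ratio` value and the function names are illustrative assumptions, not values taken from the paper.

```python
import random

def interleave_corpora(domain_docs, mot_docs, mot_ratio=0.05, seed=0):
    """Mix domain-pretraining documents with a small 'Mixture of Thoughts'
    (MoT) subset so the model keeps rehearsing general reasoning data
    during continual pretraining (mitigating catastrophic forgetting).

    mot_ratio: fraction of the domain corpus size to draw from mot_docs
    (an assumed hyperparameter, chosen here purely for illustration).
    """
    rng = random.Random(seed)
    n_mot = min(int(len(domain_docs) * mot_ratio), len(mot_docs))
    mixed = list(domain_docs) + rng.sample(list(mot_docs), n_mot)
    rng.shuffle(mixed)  # shuffle so MoT samples are spread across the epoch
    return mixed

# toy usage: 100 biomedical documents, 5% MoT replay
domain = [f"pubmed_{i}" for i in range(100)]
mot = [f"reasoning_{i}" for i in range(50)]
batch_stream = interleave_corpora(domain, mot)  # 105 documents total
```

The key design point is that the replay subset is small relative to the domain corpus, so domain adaptation dominates while general reasoning ability is rehearsed just enough to be preserved.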

Second, the PLM‑Projector module aligns protein representations with the LLM’s semantic space. ESM2, a state‑of‑the‑art protein encoder, is frozen to provide stable amino‑acid embeddings. A Q‑Former with K learnable query tokens extracts multiple latent vectors from the ESM2 output via cross‑attention, offering richer structural encoding than a single CLS token. These vectors are linearly projected into the LLM’s hidden dimension and paired with text embeddings obtained from the LLM’s CLS token. Alignment is enforced through a bidirectional contrastive loss (Lₚ₂ₜ and Lₜ₂ₚ) that maximizes cosine similarity for positive protein‑text pairs while minimizing it for negatives, complemented by a matching prediction loss (L_PTM) that explicitly discriminates matched versus mismatched pairs. Training on OntoProtein (422 K pairs) and Swiss‑Prot (430 K pairs) proceeds for 30 epochs on eight A800 GPUs.
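The bidirectional contrastive objective described above can be illustrated with a small NumPy sketch. This is an InfoNCE-style formulation assuming in-batch negatives with matched pairs on the diagonal; the temperature value is an assumed hyperparameter, and the matching-prediction loss (L_PTM) is omitted for brevity.

```python
import numpy as np

def bidirectional_contrastive_loss(protein_emb, text_emb, temperature=0.07):
    """Alignment loss L = (L_p2t + L_t2p) / 2 over a batch of matched
    protein-text pairs. Row i of protein_emb matches row i of text_emb;
    all other pairings in the batch serve as negatives.

    temperature is an illustrative value, not one reported in the paper.
    """
    # L2-normalize so dot products are cosine similarities
    p = protein_emb / np.linalg.norm(protein_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = p @ t.T / temperature  # (B, B) similarity matrix
    diag = np.arange(len(p))

    def cross_entropy(lg):
        # softmax cross-entropy with the diagonal as the positive class
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[diag, diag].mean()

    l_p2t = cross_entropy(logits)    # protein -> text retrieval
    l_t2p = cross_entropy(logits.T)  # text -> protein retrieval
    return (l_p2t + l_t2p) / 2
```

Minimizing this loss pulls each protein embedding toward its paired description and pushes it away from the other descriptions in the batch, which is what maps the Q-Former outputs into the LLM's semantic space.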

Third, an end‑to‑end fine‑tuning stage integrates the aligned protein embeddings directly into the LLM’s input stream. The projected protein representation is prepended to the tokenized natural‑language prompt, allowing the LLM to attend to protein information during generation without architectural changes. Supervised training on the same Swiss‑Prot pairs enables the model to generate biologically grounded responses, answer queries, and produce explanations.
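The prepending step above amounts to a simple concatenation in embedding space. The sketch below assumes illustrative shapes (K query tokens from the projector, T text tokens, hidden size d); the function name and mask handling are hypothetical, not from the paper.

```python
import numpy as np

def prepend_protein_tokens(protein_queries, text_token_embs, text_mask):
    """Form the LLM input by placing projected protein query embeddings
    in front of the text token embeddings, and extend the attention mask
    so the protein positions are attended to.

    protein_queries: (K, d) output of the Q-Former + linear projector
    text_token_embs: (T, d) embeddings of the tokenized prompt
    text_mask:       (T,)   1 for real tokens, 0 for padding
    """
    assert protein_queries.shape[1] == text_token_embs.shape[1]
    inputs = np.concatenate([protein_queries, text_token_embs], axis=0)
    # protein query positions are always "real" tokens in the mask
    mask = np.concatenate([np.ones(len(protein_queries)), text_mask])
    return inputs, mask  # shapes: (K + T, d) and (K + T,)

# toy usage: K=8 protein query tokens, T=20 prompt tokens, d=16
K, T, d = 8, 20, 16
seq, mask = prepend_protein_tokens(
    np.zeros((K, d)), np.ones((T, d)), np.ones(T))
```

Because the protein information enters purely through the input sequence, the LLM itself needs no architectural changes, which is the point the paragraph above makes.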

Evaluation uses PFMBench, a comprehensive suite of 16 protein‑related tasks spanning annotation, solubility, subcellular localization, mutation effect, interaction, and production. Metrics include accuracy, F1 score, and Spearman correlation as appropriate. BioBridge achieves performance comparable to dedicated PLMs such as ESM2, ProtT5, and SaProt on key tasks (EC classification, localization, BindingDB interaction), with accuracy in the 0.74–0.76 range. Notably, it improves multi‑class localization by ~7 % and metal‑ion binding classification by ~3.5 % over baselines, demonstrating strong generalization. On general language benchmarks (MMLU, RACE), the model retains competence similar to the original Qwen2.5‑7B‑Instruct, confirming that DICP and MoT data successfully prevent catastrophic forgetting. Ablation studies reveal that removing DICP or the MoT component degrades both language and protein tasks, while omitting the PLM‑Projector reduces cross‑modal alignment quality.

The paper’s contribution lies in a systematic method for injecting domain knowledge into LLMs while preserving their universal reasoning abilities, and in a practical pipeline for aligning protein sequences with textual semantics. Limitations include reliance on a frozen ESM2 encoder (precluding joint optimization of protein representations), evaluation confined to a single LLM backbone, and substantial computational resources required for training. Future work could explore lightweight protein encoders, multi‑LLM compatibility, and real‑world applications such as drug target discovery. Overall, BioBridge establishes a promising direction for integrating biological sequence understanding with the powerful generative and reasoning capabilities of modern LLMs.

