AFD-INSTRUCTION: A Comprehensive Antibody Instruction Dataset with Functional Annotations for LLM-Based Understanding and Design

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Large language models (LLMs) have significantly advanced protein representation learning. However, their capacity to interpret and design antibodies through natural language remains limited. To address this challenge, we present AFD-Instruction, the first large-scale instruction dataset with functional annotations tailored to antibodies. This dataset encompasses two key components: antibody understanding, which infers functional attributes directly from sequences, and antibody design, which enables de novo sequence generation under functional constraints. These components provide explicit sequence-function alignment and support antibody design guided by natural language instructions. Extensive instruction-tuning experiments on general-purpose LLMs demonstrate that AFD-Instruction consistently improves performance across diverse antibody-related tasks. By linking antibody sequences with textual descriptions of function, AFD-Instruction establishes a new foundation for advancing antibody modeling and accelerating therapeutic discovery.


💡 Research Summary

AFD‑Instruction introduces a large‑scale instruction‑following dataset that explicitly aligns antibody sequences with functional natural‑language descriptions, enabling large language models (LLMs) to both interpret existing antibodies and generate new ones under user‑specified functional constraints. The authors first curated 4,305 antibodies from SabDab and the Protein Data Bank, ensuring sequence diversity through MMseqs2‑based distance calculations and balanced sampling. For each antibody they retrieved the associated publication and employed a three‑stage multi‑agent pipeline—Mr. Extractor, Dr. Mechanism, and Prof. Function—to automatically extract basic metadata, enrich it with mechanistic details, and synthesize a coherent functional narrative.
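The diversity-balanced curation step can be illustrated with a small sketch. This is an assumption-laden toy, not the paper's implementation: it presumes pairwise sequence identities (such as those produced by an MMseqs2 all-vs-all search) are already available, and uses a standard greedy max-min heuristic to pick a diverse subset.

```python
# Hypothetical sketch of diversity-balanced sampling. Assumes pairwise
# sequence identities (e.g., from an MMseqs2 all-vs-all search) are given
# as a dict keyed by (id_a, id_b); function and variable names are
# illustrative, not from the paper.

def greedy_diverse_subset(ids, identity, k):
    """Greedily pick k sequences maximizing the minimum pairwise distance
    (distance = 1 - identity), a common diversity-sampling heuristic."""
    def dist(a, b):
        key = (a, b) if (a, b) in identity else (b, a)
        return 1.0 - identity.get(key, 0.0)

    selected = [ids[0]]                      # seed with the first sequence
    while len(selected) < k:
        # pick the candidate farthest from its nearest selected neighbor
        best = max((i for i in ids if i not in selected),
                   key=lambda i: min(dist(i, s) for s in selected))
        selected.append(best)
    return selected

# toy example: four antibodies with made-up pairwise identities
ident = {("A", "B"): 0.95, ("A", "C"): 0.40,
         ("A", "D"): 0.42, ("B", "C"): 0.38,
         ("B", "D"): 0.41, ("C", "D"): 0.90}
print(greedy_diverse_subset(["A", "B", "C", "D"], ident, 2))  # ['A', 'C']
```

Here "A" and "B" are near-duplicates (95 % identity), so the heuristic skips "B" in favor of the more distant "C".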

To transform these curated pairs into a training resource, the team applied a self‑questioning strategy. Seed prompts derived from the curated pairs were fed to a large language model to generate a wide variety of instruction‑response pairs covering both classification tasks (e.g., “Does this antibody neutralize IgE?”) and open‑ended reasoning tasks (e.g., “Explain the binding mechanism of this sequence”). The resulting instruction set exceeds 430,000 high‑quality examples.
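The expansion from curated facts to instruction-response pairs can be sketched as follows. In the paper this expansion is performed by an LLM fed with seed prompts; the template stand-in below keeps the example self-contained, and all field names and template wordings are assumptions.

```python
# Illustrative sketch of self-questioning expansion: seed facts about an
# antibody are turned into instruction-response pairs. Templates and
# record fields are hypothetical; the actual dataset uses LLM-generated
# questions rather than fixed templates.

TEMPLATES = [
    ("Does this antibody target {antigen}?", "Yes, it binds {antigen}."),
    ("Explain the function of this antibody.",
     "It {mechanism}, thereby {effect}."),
]

def expand(record):
    """Fill each question/answer template with fields from one record."""
    return [{"instruction": q.format(**record),
             "response": a.format(**record)}
            for q, a in TEMPLATES]

seed = {"antigen": "IgE",
        "mechanism": "blocks the IgE receptor interaction",
        "effect": "suppressing allergic signaling"}
for pair in expand(seed):
    print(pair["instruction"], "->", pair["response"])
```

Scaling a handful of templates (or LLM-generated question styles) across thousands of curated records is how a few thousand antibodies can yield hundreds of thousands of instruction examples.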

Beyond understanding, the dataset includes function‑guided design instructions. Users specify an antigen sequence and a desired functional outcome using tags; the model is then asked to output either a full antibody sequence or a CDR3 region, enclosed in tags, that satisfies those constraints. A template‑based conversion path and biological plausibility filters (e.g., checks for structural consistency) were incorporated to prevent unrealistic designs.
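A minimal sketch of what such a tag-delimited design instruction might look like is given below. The specific tag names (`<antigen>`, `<function>`, `<cdr3>`) are assumptions for illustration; the source only states that constraints and outputs are enclosed in tags.

```python
# Hypothetical tag format for a function-guided design instruction and a
# parser for the model's tagged output. Tag names are assumptions.
import re

def build_design_prompt(antigen_seq, function_text):
    """Assemble a design instruction with tag-delimited constraints."""
    return (f"Design a heavy-chain CDR3 for the antigen "
            f"<antigen>{antigen_seq}</antigen> that "
            f"<function>{function_text}</function>. "
            f"Answer with the CDR3 enclosed in <cdr3>...</cdr3> tags.")

def parse_cdr3(model_output):
    """Extract the CDR3 amino-acid string from a tagged model response."""
    m = re.search(r"<cdr3>([A-Z]+)</cdr3>", model_output)
    return m.group(1) if m else None

prompt = build_design_prompt("MKTIIALSYIFCLVFA", "neutralizes IgE")
print(parse_cdr3("<cdr3>ARDYYGSSYFDY</cdr3>"))  # ARDYYGSSYFDY
```

Parsed sequences would then pass through the plausibility filters described above before being accepted as designs.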

Quality control combined automated integrity checks, duplicate removal, semantic consistency verification, and expert review. A random 5 % sample of instructions was double‑checked by independent experts, yielding a Cohen's κ of 0.82, which indicates strong inter‑annotator agreement.
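For reference, the agreement statistic reported above can be computed directly from two annotators' labels. The labels below are toy data, not the dataset's actual review records.

```python
# Minimal Cohen's kappa for two annotators over categorical labels:
# kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
# p_e is the agreement expected by chance from each rater's label counts.
from collections import Counter

def cohens_kappa(a, b):
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[l] * cb[l] for l in set(a) | set(b)) / n**2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# toy review labels from two hypothetical annotators
r1 = ["ok", "ok", "bad", "ok", "bad", "ok"]
r2 = ["ok", "ok", "bad", "bad", "bad", "ok"]
print(round(cohens_kappa(r1, r2), 3))  # 0.667
```

A κ of 0.82, as reported for the dataset, falls in the range conventionally described as strong (almost-perfect agreement begins around 0.81).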

The authors fine‑tuned several open‑source LLMs (Llama‑2‑7B, Qwen‑1.8B, etc.) on AFD‑Instruction and evaluated them on three fronts: (i) functional classification accuracy, (ii) quality of functional explanations measured by BLEU/ROUGE, and (iii) success of CDR3 design in meeting target affinity and specificity. Across all models, instruction‑tuned versions outperformed sequence‑only baselines by 8–15 percentage points, with especially notable gains in complex functional reasoning such as “blocking a specific ion channel” or “neutralizing a cytokine.” In design experiments, generated CDR3 sequences showed high correlation with experimentally measured affinities, and human expert assessments rated the designs as comparable to manually engineered candidates.
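Two of the evaluation axes above can be sketched with simple metric functions. This is a hedged stand-in: real evaluations would use full BLEU/ROUGE implementations (e.g., from an NLP toolkit), and all data here is illustrative.

```python
# Toy versions of two evaluation metrics: classification accuracy and a
# unigram ROUGE-1 recall proxy for explanation quality. Illustrative only.
from collections import Counter

def accuracy(preds, golds):
    """Fraction of predicted labels matching the gold labels."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def rouge1_recall(candidate, reference):
    """Unigram overlap with the reference, divided by reference length."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / sum(ref.values())

print(accuracy(["binds", "blocks"], ["binds", "neutralizes"]))          # 0.5
print(rouge1_recall("the antibody binds IgE",
                    "the antibody neutralizes IgE"))                     # 0.75
```

The reported 8–15 percentage-point gains refer to improvements in metrics of this kind when comparing instruction-tuned models against sequence-only baselines.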

Limitations include a focus on human and mouse IgG1 antibodies, leaving non‑canonical formats (single‑domain antibodies, VHHs, bispecifics) under‑represented. Functional descriptions rely on literature summaries, so quantitative metrics like K_D values are sparsely annotated. Future work should expand the dataset to cover diverse antibody formats, integrate high‑resolution structural‑functional mappings, and close the loop with experimental validation to enable iterative design‑evaluate‑retrain cycles.

In summary, AFD‑Instruction provides the first comprehensive, function‑annotated instruction dataset for antibodies, bridging the gap between protein language models and natural‑language instruction following. By embedding functional supervision into LLM training, the work demonstrates that models can both explain antibody behavior and generate novel candidates guided by human‑readable functional goals, offering a promising route to accelerate antibody discovery and therapeutic development.

