Knowledge-Driven Agentic Scientific Corpus Distillation Framework for Biomedical Large Language Models Training

Notice: This research summary and analysis were automatically generated using AI. For absolute accuracy, please refer to the original arXiv source.

Corpus distillation for biomedical large language models (LLMs) seeks to address the pressing challenge of insufficient quantity and quality in open-source annotated scientific corpora, which remains a bottleneck for effective LLM training in biomedical research. This paper proposes a knowledge-driven, agentic framework for scientific corpus distillation, tailored explicitly for LLM training in the biomedical domain, addressing the challenge posed by the complex hierarchy of biomedical knowledge. Central to our approach is a collaborative multi-agent architecture, where specialized agents, each guided by the Medical Subject Headings (MeSH) hierarchy, work in concert to autonomously extract, synthesize, and self-evaluate high-quality textual data from vast scientific literature. This agentic framework collectively generates and refines domain-specific question-answer pairs, ensuring comprehensive coverage and consistency with biomedical ontologies while minimizing manual involvement. Extensive experimental results show that language models trained on our multi-agent distilled datasets achieve notable improvements in biomedical question-answering tasks, outperforming both strong life sciences LLM baselines and advanced proprietary models. Notably, our AI-Ready dataset enables Llama3-70B to surpass GPT-4 with MedPrompt and Med-PaLM-2, despite their larger scale. Detailed ablation studies and case analyses further validate the effectiveness and synergy of each agent within the framework, highlighting the potential of multi-agent collaboration in biomedical LLM training.


💡 Research Summary

The paper tackles the critical bottleneck in biomedical large language model (LLM) development: the scarcity of large‑scale, high‑quality, annotated corpora. While existing open‑source biomedical datasets are clean, they are limited in size and topical breadth, and the massive body of biomedical literature remains largely unstructured and unsuitable for direct training. To overcome these limitations, the authors propose a novel, fully automated, knowledge‑driven, multi‑agent framework called m‑KAILIN (multi‑agent enhanced Knowledge hierarchy guided biomedical dataset distillation).

Core Idea
The framework leverages the Medical Subject Headings (MeSH) ontology as a guiding knowledge hierarchy for a suite of specialized agents that work together to (1) generate domain‑specific questions from raw papers, (2) retrieve the most relevant textual contexts, (3) evaluate which question‑context pair best aligns with MeSH semantics, and (4) produce high‑quality answers. Two distinct question‑generation agents are instantiated: a domain‑specialized model (e.g., BioMistral) and a general‑purpose, powerful model (e.g., Llama‑3). This dual‑agent design ensures both fine‑grained biomedical terminology coverage and broader reasoning diversity.
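The four-step workflow above can be sketched as a plain orchestration loop. The agent classes and method names below are hypothetical illustrations of the described interfaces, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class QAExample:
    question: str
    context: str
    answer: str

def distill_document(doc, q_agents, retriever, evaluator, answerer):
    """One pass of an m-KAILIN-style loop over a single paper (sketch)."""
    # (1) Each question-generation agent proposes a candidate question.
    candidates = [agent.generate_question(doc) for agent in q_agents]
    # (2) Retrieve a supporting context for every candidate question.
    pairs = [(q, retriever.top_context(q)) for q in candidates]
    # (3) The MeSH-aware evaluator picks the best question-context pair.
    question, context = evaluator.pick_best(doc, pairs)
    # (4) A strong LLM produces the final answer.
    answer = answerer.generate_answer(question, context)
    return QAExample(question, context, answer)
```

The dual question agents simply appear as two entries in `q_agents`; everything downstream is agnostic to which agent produced the winning question.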

Technical Pipeline

  1. Question Generation Agent – Fine‑tuned on the BioASQ QA dataset, each of the two models receives a biomedical document (sampled from PubMed) and outputs a candidate question.

  2. Context Retrieval Agent – Employs Dense Passage Retrieval (DPR) with a shared BiomedBERT encoder to embed both the question and millions of candidate passages. Top‑k passages with the highest inner‑product similarity are selected as potential contexts.

  3. Question‑Context Evaluation Agent – Operates in two stages. First, a rule‑based, MeSH‑driven scoring function computes a semantic similarity score between the MeSH terms of the source document and each candidate context, using information‑content based common‑ancestor calculations. This yields “cold‑start” preference labels at scale without human annotation. Second, a large language model (ϕ) is fine‑tuned on these labels to predict the preferred pair among two candidates, effectively learning a MeSH‑aware preference classifier.

  4. Answer Generation Agent – Takes the selected question‑context pair and generates a concise, accurate answer using a state‑of‑the‑art LLM such as GPT‑4o.

  5. Agentic Collaborative Framework – The outputs are integrated, and the question generators are further refined via Direct Preference Optimization (DPO), creating a feedback loop that continuously improves question quality.
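Once passage embeddings are precomputed, step 2's dense retrieval reduces to an inner-product search. A minimal NumPy sketch, assuming the (BiomedBERT) encoder has already produced the vectors:

```python
import numpy as np

def top_k_passages(question_vec, passage_matrix, k=3):
    """Return indices of the k passages with the highest inner-product
    similarity to the question embedding, ordered best-first."""
    scores = passage_matrix @ question_vec        # (n_passages,)
    top = np.argpartition(-scores, k - 1)[:k]     # unordered top-k
    return top[np.argsort(-scores[top])]          # sort best-first
```

At the scale described (millions of passages), the same inner-product search would typically be served by an approximate-nearest-neighbor index rather than a dense matrix product, but the scoring rule is identical.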
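The cold-start scoring in step 3 rests on information-content similarity over the MeSH hierarchy. A toy Resnik-style sketch, assuming we only have dot-separated MeSH tree numbers and corpus term frequencies; the function names and the frequency scheme are illustrative, not the paper's exact formula:

```python
import math

def information_content(node, freq):
    """IC(t) = -log p(t), where p(t) counts the node and all descendants."""
    total = sum(freq.values())
    mass = sum(f for n, f in freq.items()
               if n == node or n.startswith(node + "."))
    return -math.log(mass / total)

def lowest_common_ancestor(a, b):
    """Deepest shared prefix of two dot-separated MeSH tree numbers."""
    common = []
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        common.append(x)
    return ".".join(common)

def resnik_similarity(a, b, freq):
    """Similarity of two MeSH terms = IC of their lowest common ancestor."""
    lca = lowest_common_ancestor(a, b)
    if not lca:
        return 0.0  # no shared ancestor: minimal similarity
    return information_content(lca, freq)
```

Rarer (more informative) common ancestors yield higher scores, which is what lets the rule-based evaluator prefer contexts that share specific, rather than generic, MeSH subtrees with the source document.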
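The DPO refinement in step 5 optimizes a logistic loss over preference pairs. For a single pair, given the policy and reference log-probabilities of the preferred (`w`) and rejected (`l`) questions, the standard DPO objective is (a sketch of the published loss, not the paper's training code):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Single-pair DPO loss:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference, the margin is zero and the loss is log 2; raising the preferred question's log-probability relative to the reference drives the loss down, which is the feedback signal that improves the question generators.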

Scale and Results
The system processes over 23 million PubMed articles, automatically constructing roughly one billion “question‑answer‑context” triples. Models trained on this distilled corpus (e.g., Llama‑3‑70B) achieve substantial gains on multiple biomedical QA benchmarks (BioASQ, MedQA, PubMedQA) compared with strong open‑source baselines (BioMistral, PubMed‑LLaMA) and even proprietary models such as Med‑PaLM‑2 and GPT‑4‑based MedPrompt. Notably, Llama‑3‑70B with the AI‑Ready dataset surpasses GPT‑4‑based systems despite the latter’s larger scale.

Ablation Studies
The authors systematically disable components: using a single question‑generation agent, removing MeSH‑guided evaluation, or omitting DPO. Each ablation leads to measurable performance drops, confirming that (a) dual‑agent question diversity, (b) knowledge‑driven evaluation, and (c) collaborative optimization are all essential for the observed improvements.

Case Analyses
Qualitative examples illustrate the framework’s ability to capture complex disease mechanisms, drug‑interaction queries, and up‑to‑date research findings, producing answers that are both factually accurate and contextually rich.

Contributions and Impact

  1. Automated Biomedical Corpus Distillation – Eliminates the need for costly human annotation while dramatically expanding dataset size and coverage.
  2. Knowledge‑Hierarchy‑Driven Evaluation – Introduces a scalable, ontology‑based supervision signal that aligns generated data with established biomedical vocabularies.
  3. Multi‑Agent Collaboration – Demonstrates that combining domain‑specialized and generalist LLMs yields superior question diversity and robustness.
  4. Empirical Validation – Provides extensive experiments, ablations, and scaling laws that guide future corpus‑distillation efforts.

Future Directions
The authors suggest extending the framework to other scientific domains by swapping MeSH with analogous ontologies (e.g., UMLS, SNOMED CT, chemical ontologies). Incorporating human‑in‑the‑loop verification, real‑time feedback, and cross‑modal data (e.g., figures, tables) could further enhance data quality.

In summary, m‑KAILIN presents a compelling, fully automated pipeline that leverages hierarchical biomedical knowledge to orchestrate multiple specialized agents, producing a massive, high‑quality training corpus that substantially advances the performance of biomedical LLMs.

