BioMedSearch: A Multi-Source Biomedical Retrieval Framework Based on LLMs
Biomedical queries often rely on a deep understanding of specialized knowledge, such as gene regulatory mechanisms and the pathological processes of diseases. They require detailed analysis of complex physiological processes and effective integration of information from multiple data sources to support accurate retrieval and reasoning. Although large language models (LLMs) perform well on general reasoning tasks, their generated biomedical content often lacks scientific rigor: unable to access authoritative biomedical databases, they frequently fabricate protein functions, interactions, and structural details that deviate from authentic information. We therefore present BioMedSearch, a multi-source biomedical information retrieval framework based on LLMs. The method integrates literature retrieval, protein database access, and web search to support accurate and efficient handling of complex biomedical queries. Through sub-query decomposition, keyword extraction, task graph construction, and multi-source information filtering, BioMedSearch generates high-quality question-answering results. To evaluate answer accuracy, we constructed a multi-level dataset, BioMedMCQs, consisting of 3,000 questions. The dataset covers three levels of reasoning: mechanistic identification, non-adjacent semantic integration, and temporal causal reasoning, and is used to assess the performance of BioMedSearch and other methods on complex QA tasks. Experimental results demonstrate that BioMedSearch consistently improves accuracy over all baseline models across all levels: at Level 1, average accuracy increases from 59.1% to 91.9%; at Level 2, it rises from 47.0% to 81.0%; and at the most challenging Level 3, it improves from 36.3% to 73.4%. The code and BioMedMCQs are available at: https://github.com/CyL-ucas/BioMed_Search
💡 Research Summary
The paper introduces BioMedSearch, a novel retrieval‑augmented framework that couples large language models (LLMs) with authoritative biomedical resources—literature databases (PubMed, PMC, ScienceDirect), protein repositories (UniProt, AlphaFold), and general web search engines—to address the chronic “hallucination” problem of LLMs in the life‑science domain. The authors observe that existing LLM‑driven agents (e.g., PaSa, MindSearch, DeepSearcher) excel at generic information tasks but lack real‑time access to domain‑specific databases, leading to erroneous protein function or interaction claims. Moreover, current retrieval‑augmented generation (RAG) approaches are evaluated on limited biomedical QA sets (MedQA, MedMCQA, MMLU), which do not reflect the complexity of open‑ended research queries.
Core Architecture
- Biomedical Search Planner – Upon receiving a natural‑language query, the system uses an LLM with a decomposition prompt to split the query into a set of sub‑queries S = {q_1, …, q_n} across predefined semantic dimensions (developmental, endocrine, clinical, molecular, etc.). For each sub‑query, a fine‑grained keyword set K_i is extracted.
- DAG‑Based Retrieval Planning – All keywords become nodes in a directed acyclic graph (DAG) G = (V, E). Edges encode logical dependencies and guide the assignment of retrieval tools to each sub‑query, ensuring that the most appropriate source (literature, protein DB, or web) is selected.
- Biomedical Retrieval Executor – The executor implements three specialized modules:
- Literature Retrieval: Simultaneously queries PubMed, PMC, and ScienceDirect (up to 100 hits each). Documents are filtered first by keyword coverage (≥ 80% of K_i appearing in the title/abstract) and then by semantic similarity using PubMedBERT embeddings; the top‑k (default k = 10) papers are retained.
- Protein Information Retrieval: Detects gene/protein mentions, performs UniProt ID lookup, extracts functional annotations, interaction partners, and sequence data, and, when structural information is required, calls the AlphaFold API to generate a PDB model.
- Web Search: Issues real‑time queries to major search engines to capture cutting‑edge findings, niche terminology, or patient‑reported outcomes. Results are re‑ranked with LLM‑driven summarization and relevance scoring.
- Answer Synthesis & Verification – Filtered evidence is fed back to the LLM, which composes a structured report and performs an answerability check, aligning the generated answer with the retrieved citations.
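The planner and DAG steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the node names, dependency edges, and the keyword-based routing heuristic are all assumptions made for the example.

```python
# Sketch of DAG-based retrieval planning: keywords become nodes,
# edges encode dependencies, and each node is routed to a tool.
from graphlib import TopologicalSorter

# Illustrative hint words for routing a node to the protein-database tool.
PROTEIN_HINTS = {"uniprot", "alphafold", "structure", "interaction", "binding"}

def route_tool(keyword: str) -> str:
    """Assign a retrieval tool to a keyword node (simplified heuristic)."""
    kw = keyword.lower()
    if any(h in kw for h in PROTEIN_HINTS):
        return "protein_db"    # UniProt / AlphaFold lookup
    if "latest" in kw:
        return "web_search"    # real-time web results
    return "literature"        # PubMed / PMC / ScienceDirect

# Edges encode logical dependencies: each node maps to the set of
# predecessors that must be retrieved first (hypothetical example).
dag = {
    "TP53 function": set(),
    "TP53 interaction partners": {"TP53 function"},
    "apoptosis pathway review": {"TP53 function"},
}

# Topological order yields an executable, dependency-respecting plan.
plan = [(node, route_tool(node)) for node in TopologicalSorter(dag).static_order()]
```

Topological ordering guarantees that, for instance, the protein itself is resolved before its interaction partners are queried.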
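The two-stage literature filter (keyword coverage first, then semantic re-ranking) could look like the sketch below. The coverage threshold (80%) and top-k default (10) come from the text; `embed` is a stand-in for the PubMedBERT encoder, and the document fields are assumed.

```python
# Two-stage filtering sketch: keyword coverage, then embedding similarity.
import math

def keyword_coverage(doc_text: str, keywords: list[str]) -> float:
    """Fraction of the sub-query keywords found in the document text."""
    text = doc_text.lower()
    hits = sum(1 for k in keywords if k.lower() in text)
    return hits / len(keywords) if keywords else 0.0

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def filter_papers(docs, keywords, query_vec, embed, top_k=10, min_cov=0.8):
    # Stage 1: drop documents covering < 80% of the sub-query keywords.
    covered = [
        d for d in docs
        if keyword_coverage(d["title"] + " " + d["abstract"], keywords) >= min_cov
    ]
    # Stage 2: rank survivors by embedding similarity to the sub-query.
    ranked = sorted(
        covered,
        key=lambda d: cosine(embed(d["abstract"]), query_vec),
        reverse=True,
    )
    return ranked[:top_k]
```

In the real system the embedding stage would batch documents through the encoder; here the callable is injected so the filter logic stays independent of any particular model.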
Benchmark – BioMedMCQs
To evaluate both retrieval fidelity and reasoning depth, the authors construct BioMedMCQs, a 3,000‑question multiple‑choice benchmark covering three reasoning levels:
- Level 1 – Mechanistic Identification (simple fact retrieval).
- Level 2 – Non‑adjacent Semantic Integration (requires linking concepts from disparate sources).
- Level 3 – Temporal Causal Reasoning (demands multi‑step, time‑ordered inference).
Experimental Results
BioMedSearch is compared against baseline LLMs, standard RAG pipelines, and domain‑specific agents (Self‑BioRAG, MedRAG). Across all three levels, BioMedSearch achieves substantial gains:
- Level 1 accuracy: 91.9% (+32.8 percentage points over the baseline average of 59.1%).
- Level 2 accuracy: 81.0% (+34.0 points over 47.0%).
- Level 3 accuracy: 73.4% (+37.1 points over 36.3%).
These improvements demonstrate that the DAG‑guided multi‑source retrieval dramatically enhances both the relevance of the evidence and the logical coherence of the final answer, especially for the most complex causal reasoning tasks.
Contributions & Impact
- Introduces a search planner that translates ambiguous biomedical queries into executable, source‑specific retrieval workflows.
- Provides a multi‑modal executor that seamlessly integrates literature, protein functional/structural data, and up‑to‑date web content without additional model training.
- Releases BioMedMCQs, the first benchmark explicitly designed for multi‑source biomedical search and reasoning, facilitating future research on retrieval‑augmented LLMs in life sciences.
- Publishes the source code and data on GitHub, ensuring reproducibility and encouraging community extensions (e.g., incorporating knowledge graphs or cost‑effective web crawling).
Limitations & Future Work
The current system relies on multiple external APIs, which may introduce latency and cost; scaling to high‑throughput settings will require more efficient caching or lightweight retrieval models. Additionally, integration with structured biomedical knowledge graphs (e.g., GO, Disease Ontology) could further improve evidence traceability. The authors propose exploring continual learning from user feedback and automated error detection to refine the planner’s routing decisions over time.
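One lightweight way to mitigate the API latency and cost noted above is a time-to-live (TTL) cache in front of the external lookups. The sketch below is a hypothetical illustration, not part of the released system; the key, TTL value, and `fetch` callable are assumptions.

```python
# Minimal TTL cache for external API results (e.g., UniProt lookups).
import time

class TTLCache:
    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (timestamp, value)

    def get_or_fetch(self, key, fetch):
        """Return a cached value if still fresh; otherwise call `fetch` and cache it."""
        entry = self._store.get(key)
        now = time.monotonic()
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]
        value = fetch()
        self._store[key] = (now, value)
        return value

# Example: cache UniProt records for 10 minutes (accession is hypothetical).
cache = TTLCache(ttl_seconds=600)
```

Repeated sub-queries touching the same protein entry would then hit the cache instead of the remote API within the TTL window.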
In summary, BioMedSearch represents a significant step toward trustworthy, evidence‑grounded AI assistants for biomedical research, combining the generative power of LLMs with rigorous, real‑time access to domain‑specific databases and achieving state‑of‑the‑art performance on a challenging, multi‑level reasoning benchmark.