Optimizing Medical Question-Answering Systems: A Comparative Study of Fine-Tuned and Zero-Shot Large Language Models with RAG Framework
📝 Original Info
- Title: Optimizing Medical Question-Answering Systems: A Comparative Study of Fine-Tuned and Zero-Shot Large Language Models with RAG Framework
- ArXiv ID: 2512.05863
- Date: 2025-12-05
- Authors: Tasnimul Hassan, Md Faisal Karim, Haziq Jeelani, Elham Behnam, Robert Green, Fayeq Jeelani Syed
📝 Abstract
Medical question-answering (QA) systems can benefit from advances in large language models (LLMs), but directly applying LLMs to the clinical domain poses challenges such as maintaining factual accuracy and avoiding hallucinations. In this paper, we present a retrieval-augmented generation (RAG) based medical QA system that combines domain-specific knowledge retrieval with open-source LLMs to answer medical questions. We fine-tune two state-of-the-art open LLMs (LLaMA 2 and Falcon) using Low-Rank Adaptation (LoRA) for efficient domain specialization. The system retrieves relevant medical literature to ground the LLM's answers, thereby improving factual correctness and reducing hallucinations. We evaluate the approach on benchmark datasets (PubMedQA and MedMCQA) and show that retrieval augmentation yields measurable improvements in answer accuracy compared to using LLMs alone. Our fine-tuned LLaMA 2 model achieves 71.8% accuracy on PubMedQA, substantially improving over the 55.4% zero-shot baseline, while maintaining transparency by providing source references. We also detail the system design and fine-tuning methodology, demonstrating that grounding answers in retrieved evidence reduces unsupported content by approximately 60%. These results highlight the potential of RAG-augmented open-source LLMs for reliable biomedical QA, pointing toward practical clinical informatics applications.
📄 Full Content
Yet significant challenges remain before LLMs can be safely and effectively used in patient care. A critical concern is the factual accuracy of model-generated answers. General-purpose LLMs often produce confidently stated incorrect or fabricated information (so-called “hallucinations”), which is unacceptable in medicine, where inaccuracies can be harmful. For example, models like ChatGPT or GPT-4 can output medical advice that sounds plausible but is not supported by evidence or clinical guidelines [7]. Moreover, because LLMs are usually trained on general internet text, they may lack up-to-date medical knowledge. They also often fail to use specialized terminology with precision. Fine-tuning LLMs on biomedical text can mitigate some of these issues [8], [9]. However, fully retraining or fine-tuning a large model is resource-intensive and might still not eliminate hallucinations.
One promising approach to improve factual accuracy is retrieval-augmented generation (RAG) [11]. In such a system, the model first retrieves relevant documents (e.g., medical literature, guidelines, electronic health records) from an external knowledge source, then generates an answer conditioned on that evidence. Grounding responses in actual text encourages the model to produce answers that are supported by references, thereby reducing the incidence of hallucinated facts. Retrieval-based QA has a rich history, from early pipeline systems like DrQA [10] to modern approaches that integrate neural retrievers with generators [11], [12]. This approach is especially well-suited for biomedicine, given the vast and constantly growing body of biomedical literature.
In this work, we propose a medical QA system that leverages RAG in combination with fine-tuned open-source LLMs to address these challenges. Our key contributions include:
• We design a retrieval-augmented generation architecture for medical QA that combines a document retriever with a large generative model. The system cites relevant medical literature to justify its answers, enhancing transparency.
• We fine-tune two high-performing open LLMs (Meta’s LLaMA 2 and TII’s Falcon) on medical QA data using LoRA [16], a parameter-efficient fine-tuning method. This enables effective domain adaptation at low computational cost, without full model retraining.
• We evaluate the system on multiple biomedical QA benchmarks, including PubMedQA [17] and MedMCQA [18]. Retrieval augmentation substantially improves answer accuracy and reduces hallucinations compared to generation without retrieval. For instance, our model based on LLaMA 2 reaches 71.8% accuracy on PubMedQA, improving significantly over the 55.4% zero-shot baseline.
• We analyze the system’s outputs and find that grounding answers in retrieved evidence reduces unsupported statements by approximately 60%. We also discuss the system’s potential clinical applications (as an assistive tool for healthcare professionals and patients) and outline remaining challenges, such as the need for rigorous validation and adherence to medical guidelines.
II. RELATED WORK
LLMs for Biomedical QA: Early transformer-based language models tailored to biomedicine (e.g., BioBERT [9]) improved tasks like clinical named entity recognition and QA, but these models were relatively small and task-specific. The advent of much larger LLMs has opened new possibilities for generative QA in medicine. Luo et al. introduced BioGPT [8], a 1.5B-parameter GPT model trained on PubMed, which achieved strong results on biomedical QA benchmarks (78.2% accuracy on PubMedQA). More recently, researchers have applied general LLMs to medical QA via fine-tuning or prompting. Google’s Med-PaLM [6], built on a fine-tuned 540B-parameter model (Flan-PaLM), was the first to exceed the passing score on USMLE-style questions. Its successor, Med-PaLM 2, achieved 86.5% on MedQA and showed substantial improvements in physician evaluations [6]. Nori et al. [7] evaluated GPT-4 on medical exams and found it could outperform many specialized models without any domain-specific training. While these works demonstrate the potential of LLMs in healthcare, they largely rely on proprietary models. In contrast, we focus on open-source LLMs (LLaMA 2 [2], Falcon [3]) that can be custom-tailored and deployed without such restrictions.
Retrieval-Augmented QA: Augmenting NLP models with retrieved knowledge is a well-established strategy to improve factual accuracy. Traditional open-domain QA systems (e.g., DrQA [10]) employed a two-stage pipeline: document retrieval (with methods like BM25) followed by a reading comprehension model to extract answers. More recently, neural retrievers and sequence-to-sequence generators have been integrated in end-to-end frameworks. Lewis et al. [11] introduced RAG, which combines a learned neural retriever with a parametric generator, and the Atlas model (T5-based, 11B parameters) achieved state-of-the-art open QA results by retrieving relevant passages even in few-shot settings [12]. In the biomedical domain, retrieval has long been used in challenges like BioASQ [13], where systems search PubMed articles to answer questions. Our approach follows this line of work by applying retrieval augmentation to a modern LLM for medical QA. We use a Dense Passage Retriever (DPR) [14] to find relevant snippets from a large corpus of medical literature, which then serve to ground the LLM’s answers in evidence. Providing source material to the generator helps curb the model’s tendency to produce unsupported claims, an effect also observed in retrieval-augmented models like RETRO [15].
Efficient Fine-Tuning of LLMs: Fine-tuning very large models on domain-specific data can be prohibitively expensive, but parameter-efficient techniques offer a solution. Low-Rank Adaptation (LoRA) [16] inserts small trainable matrices into each layer of the model while keeping the original weights frozen, drastically reducing the resources needed. Hu et al. showed that LoRA can match full fine-tuning performance while updating only about 0.1% of the parameters [16]. Other approaches (prompt tuning, adapters) similarly minimize training overhead, but LoRA has been particularly effective for LLMs. In our work, we use LoRA to fine-tune LLaMA 2 (13B) and Falcon (40B) on medical QA data, allowing us to specialize these models to the domain with relatively modest computational resources. We did not employ reinforcement learning from human feedback for alignment; instead, we rely on grounding the LLM’s outputs in retrieved evidence and supervised fine-tuning on correct QA examples to ensure reliability. The combination of retrieval augmentation and LoRA fine-tuning enables us to build a high-performing medical QA system efficiently.
Our system follows a retrieval-augmented generation pipeline for medical question answering, as illustrated in Figure 1. Given a user’s question, the system first retrieves relevant context from a knowledge repository, then generates an answer that incorporates both the question and the retrieved evidence. This design ensures that the answer is grounded in verifiable information. The knowledge repository consists of a large collection of biomedical documents (e.g., PubMed abstracts, clinical guidelines, and curated FAQs) indexed in a vector database for efficient semantic search.
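As a concrete illustration of the indexing step, the following is a minimal sketch of how such a document collection could be embedded and stored for inner-product search. The encoder name, example documents, and index path are illustrative assumptions; the paper uses a DPR-style bi-encoder trained on biomedical text.

```python
# Minimal sketch: embedding the document collection and building an
# inner-product index for semantic search. The encoder name, example
# documents, and index path are illustrative, not from the paper.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Aspirin irreversibly inhibits cyclooxygenase, reducing platelet aggregation ...",
    "Metformin is recommended as first-line pharmacotherapy for type 2 diabetes ...",
    # ... PubMed abstracts, guideline excerpts, curated FAQ entries
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder bi-encoder
embeddings = np.asarray(encoder.encode(documents), dtype="float32")

index = faiss.IndexFlatIP(embeddings.shape[1])  # inner-product similarity
index.add(embeddings)
faiss.write_index(index, "medical_corpus.index")
```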
The pipeline has two main stages:
1) Question Retrieval: The input question is encoded into a vector using a bi-encoder transformer model. We employ a dense passage retriever (DPR) [14] trained on biomedical text to embed the question and candidate passages in the same space. The top-k most relevant passages are retrieved based on inner-product similarity to the question embedding. We use k = 5, which provides sufficient context while balancing relevance and computational efficiency.
2) Answer Generation: The question and the retrieved passages are concatenated to form a prompt for the generative model. We prepend an instruction that the model should use the provided information and cite its sources when answering. The fine-tuned LLM (LLaMA 2 or Falcon) then generates a free-form answer, with an answer length limit of 256 tokens to ensure focused responses. The output is encouraged to include references to the retrieved documents when making specific claims.
The final answer presented to the user is a coherent explanation with references to the source material. For example, the system might respond: “The recommended dosage of Drug X for condition Y is 5-10 mg daily, based on clinical guidelines from the retrieved literature.” Providing such references enhances trustworthiness, as users can verify the information from the original sources.
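The two stages can be summarized in a short sketch. The prompt wording, citation style, and the `encoder`/`index`/`generator` objects (e.g., a transformers text-generation pipeline) are assumptions rather than the paper's exact implementation.

```python
# Minimal sketch of the two-stage retrieve-then-generate flow.
import numpy as np

def answer_question(question, encoder, index, documents, generator, k=5):
    # Stage 1: embed the question and retrieve the top-k passages by
    # inner-product similarity against the document index.
    q_emb = np.asarray(encoder.encode([question]), dtype="float32")
    _, ids = index.search(q_emb, k)
    passages = [documents[i] for i in ids[0]]

    # Stage 2: build a grounded prompt and generate a bounded-length answer.
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Use only the provided information and cite sources as [n].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    output = generator(prompt, max_new_tokens=256)[0]["generated_text"]
    return output[len(prompt):].strip()  # keep only the newly generated answer
```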
To adapt the generative models to the medical QA task, we fine-tuned the base LLMs on domain-specific QA data using supervised learning and LoRA [16]. LoRA inserts low-rank adapters into each transformer layer, allowing us to update only a small fraction of the model’s parameters during fine-tuning. We set the adapter rank to 16 and alpha to 32, updating fewer than 0.5% of the model’s weights, which significantly reduces GPU memory requirements.
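A minimal sketch of this configuration using the Hugging Face peft library is shown below. The dropout value and target modules are assumptions, since the paper specifies only the rank and alpha.

```python
# Minimal sketch of the LoRA setup (rank 16, alpha 32) with Hugging Face peft.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf")

lora_cfg = LoraConfig(
    r=16,                                  # adapter rank, as reported
    lora_alpha=32,                         # scaling factor, as reported
    lora_dropout=0.05,                     # assumed, not stated in the paper
    target_modules=["q_proj", "v_proj"],   # common choice for LLaMA-style models
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 0.5% of all weights
```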
We compiled a training set of approximately 15,000 question-answer pairs from several sources: the PubMedQA training split (8,000 pairs), the MedMCQA training subset (5,000 pairs), and a curated collection of medical FAQs (2,000 pairs).
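A sketch of how such a training mix might be assembled with the Hugging Face datasets library follows. The dataset identifiers, field names, and subset selection are illustrative assumptions, not the paper's exact preprocessing.

```python
# Minimal sketch of assembling the ~15,000-pair training mix.
from datasets import load_dataset, concatenate_datasets

pubmedqa = load_dataset("pubmed_qa", "pqa_labeled", split="train")
medmcqa = load_dataset("medmcqa", split="train").select(range(5000))

def to_qa(example, question_key, answer_key):
    # Map each source onto a shared (question, answer) schema.
    return {"question": example[question_key], "answer": str(example[answer_key])}

pubmedqa = pubmedqa.map(lambda ex: to_qa(ex, "question", "final_decision"),
                        remove_columns=pubmedqa.column_names)
medmcqa = medmcqa.map(lambda ex: to_qa(ex, "question", "cop"),
                      remove_columns=medmcqa.column_names)

# A locally curated FAQ collection would be mapped the same way and appended.
train_ds = concatenate_datasets([pubmedqa, medmcqa])
```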
Each training example included the question, a set of relevant context passages (retrieved from our document corpus), and the correct answer. We fine-tuned the LLaMA 2 and Falcon models on these examples so that they learned to (1) comprehend medical questions, (2) incorporate the provided evidence into their answers, and (3) produce accurate, concise explanations. Including retrieved context in training was important: it taught the model to rely on external information from documents rather than solely on its internal knowledge. Fine-tuning was performed using the AdamW optimizer with a learning rate of 2 × 10⁻⁴ and cosine scheduling for 3 epochs on a server with four NVIDIA A100 GPUs. Thanks to LoRA’s efficiency, adapting the 13B-parameter and 40B-parameter models was feasible, with training completing in approximately 48 hours. After fine-tuning, we integrated each LLM into the retrieval pipeline for inference. At test time, the model is given the top-5 retrieved passages along with the question, prefaced by an instruction to ground its answer in the provided information. While the model was trained to reference sources, the citation format varies and is not always consistent in the generated outputs. This attribution capability, when present, is valuable for clinical applications as it allows users to trace information back to sources.
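The reported training configuration roughly corresponds to the following Trainer setup. The batch size, precision, and data collation/tokenization details are assumptions not stated in the paper; `model` and `train_ds` refer to the earlier sketches.

```python
# Minimal sketch of the fine-tuning configuration (AdamW, lr 2e-4,
# cosine schedule, 3 epochs) with the Hugging Face Trainer API.
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="lora-medqa",
    num_train_epochs=3,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    optim="adamw_torch",
    per_device_train_batch_size=4,  # assumed
    bf16=True,                      # suited to A100 GPUs
    logging_steps=50,
)

trainer = Trainer(
    model=model,             # LoRA-wrapped LLaMA 2 or Falcon
    args=args,
    train_dataset=train_ds,  # tokenized (question, context, answer) examples
)
trainer.train()
```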
We evaluated our system on two benchmark datasets and conducted a comprehensive analysis of its outputs. The evaluation compares language models under two conditions: a standard closed-book setting (the model relies solely on internal knowledge) and a retrieval-augmented setting (the model uses retrieved documents during inference). Unless stated otherwise, “LLaMA 2” refers to our fine-tuned 13B-parameter model, and “Falcon” refers to our fine-tuned 40B-parameter model.
To assess the effectiveness of retrieval augmentation, we evaluated models on PubMedQA and MedMCQA using accuracy as the primary metric, following standard practice for these benchmarks. Table I summarizes the results. As expected, retrieval augmentation and fine-tuning significantly improve performance over zero-shot baselines. While GPT-4 achieves the highest scores, our fine-tuned LLaMA 2 with RAG shows substantial improvements, reaching 71.8% on PubMedQA compared to 55.4% without retrieval.
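The accuracy comparison between the two settings reduces to exact-match scoring over the benchmark labels, as in the sketch below. The wrapper functions that map each model's free-form output to a label (e.g., yes/no/maybe for PubMedQA) are assumed and not shown.

```python
# Minimal sketch of the closed-book vs. retrieval-augmented accuracy comparison.
# `closed_book_answer` and `rag_answer` are assumed wrappers that return a
# benchmark label; the label-extraction step is omitted here.
def accuracy(answer_fn, examples):
    correct = sum(
        answer_fn(ex["question"]).strip().lower() == ex["answer"].strip().lower()
        for ex in examples
    )
    return correct / len(examples)

closed_book_acc = accuracy(closed_book_answer, pubmedqa_test)
rag_acc = accuracy(rag_answer, pubmedqa_test)
print(f"closed-book: {closed_book_acc:.1%}  RAG: {rag_acc:.1%}")
```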
Table II outlines latency and memory usage for different models. While GPT-4 provides faster response times through API calls, it requires paid access. Among the open models, LLaMA 2 (13B) offers the best balance of performance and resource efficiency, while Falcon (40B) requires more memory but provides marginally better accuracy.
We compared our best-performing model (fine-tuned LLaMA 2 with RAG) to recent approaches from the literature. Table III shows that while our model does not exceed state-of-the-art proprietary systems like Med-PaLM 2, it offers a strong open-source alternative that significantly outperforms the zero-shot baseline and provides transparency through source attribution.
To assess the impact of retrieval augmentation on factual accuracy, we manually evaluated 100 randomly sampled QA pairs from the test set. Two medical professionals independently annotated each answer for factual errors and unsupported claims. Inter-annotator agreement was substantial (Cohen’s κ = 0.73). Results showed that retrieval augmentation reduced factual errors from 35% in the fine-tuned-only setting to 14% with RAG. The most common remaining errors were: (1) misinterpretation of statistical findings from retrieved papers (32% of errors), (2) overgeneralization from specific study populations (28%), and (3) outdated information from older retrieved documents (25%). The retrieved documents often provided specific phrasing that appeared verbatim in the model’s answers, improving verifiability. However, the model’s ability to provide consistent structured citations remained limited, with only 42% of answers including clear source attribution.
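For reference, the agreement statistic can be computed with scikit-learn as in the following sketch; the example labels are illustrative placeholders, not the study's annotations.

```python
# Minimal sketch of the inter-annotator agreement computation.
from sklearn.metrics import cohen_kappa_score

# 1 = answer flagged as containing a factual error or unsupported claim
annotator_a = [0, 1, 0, 0, 1, 0, 1, 0, 0, 0]
annotator_b = [0, 1, 0, 1, 1, 0, 1, 0, 0, 0]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa = {kappa:.2f}")  # the paper reports 0.73 over 100 answers
```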
Our system demonstrates reasonable performance with manageable resource demands, suggesting potential utility in clinical informatics settings. Possible applications include:
• Assisting medical students with literature review and exam preparation
• Providing clinicians with rapid access to relevant research findings
• Supporting evidence-based patient education materials
However, several limitations must be addressed before clinical deployment:
• The system is not intended for direct clinical decision-making without physician oversight
• Performance on rare diseases and specialized procedures remains limited
• The knowledge repository requires continuous updates to maintain currency
• Robust evaluation by medical professionals in real clinical workflows is essential
In this paper, we introduced a retrieval-augmented QA framework for biomedical applications, built on fine-tuned open-source LLMs. Extensive experiments demonstrate that our approach significantly improves factual accuracy, reduces hallucinations, and achieves performance approaching that of leading domain-specific and proprietary systems. Importantly, the system offers a transparent and resource-efficient alternative suitable for deployment in diverse settings.
Future work will explore clinician-in-the-loop feedback mechanisms, multi-modal reasoning (e.g., incorporating imaging and EHR data), and formal validation with healthcare professionals. Our findings advocate for the broader adoption of transparent, adaptable, and reproducible LLM-based systems in medical AI.