Leveraging LLMs for Title and Abstract Screening for Systematic Review: A Cost-Effective Dynamic Few-Shot Learning Approach
Systematic reviews are a key component of evidence-based medicine, playing a critical role in synthesizing existing research evidence and guiding clinical decisions. However, with the rapid growth of research publications, conducting systematic reviews has become increasingly burdensome, with title and abstract screening being one of the most time-consuming and resource-intensive steps. To mitigate this issue, we designed a two-stage dynamic few-shot learning (DFSL) approach aimed at improving the efficiency and performance of large language models (LLMs) in the title and abstract screening task. Specifically, this approach first uses a low-cost LLM for initial screening, then re-evaluates low-confidence instances using a high-performance LLM, thereby enhancing screening performance while controlling computational costs. We evaluated this approach across 10 systematic reviews, and the results demonstrate its strong generalizability and cost-effectiveness, with potential to reduce manual screening burden and accelerate the systematic review process in practical applications.
💡 Research Summary
This paper addresses a critical bottleneck in evidence-based medicine: the time-consuming and resource-intensive process of title and abstract screening in systematic reviews. To mitigate this burden, the authors propose a novel LLM-based approach called Dynamic Few-Shot Learning (DFSL), designed to enhance both the performance and cost-effectiveness of automated screening.
The core innovation lies in a two-stage, cascaded architecture. In the first stage, a lower-cost LLM (e.g., GPT-4.1-mini) performs an initial screening of all candidate studies. Crucially, the model is prompted to generate a confidence score (ranging from 0 to 1) alongside its inclusion/exclusion decision. In the second stage, only the instances flagged as “low-confidence” (with a score below a predefined threshold of 0.9) are re-evaluated using a higher-performance, more expensive LLM (e.g., GPT-4.1). This strategy ensures computational resources are allocated efficiently, focusing expensive model calls on the most ambiguous cases.
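The cascade described above can be sketched in a few lines. The function names, the stub heuristics, and the return shapes below are hypothetical stand-ins for the actual LLM calls (e.g., GPT-4.1-mini and GPT-4.1 prompted to emit a decision plus a 0–1 confidence score); only the threshold value of 0.9 comes from the paper.

```python
CONFIDENCE_THRESHOLD = 0.9  # predefined threshold from the paper

def screen_cheap(title_abstract: str) -> tuple[str, float]:
    """Hypothetical stand-in for the low-cost LLM (stage 1).

    Returns an (inclusion decision, confidence) pair. The keyword
    heuristic here is illustrative only; a real call would prompt the model.
    """
    if "randomized" in title_abstract.lower():
        return ("include", 0.95)
    return ("exclude", 0.6)

def screen_expensive(title_abstract: str) -> str:
    """Hypothetical stand-in for the high-performance LLM (stage 2)."""
    return "include"

def cascade_screen(title_abstract: str) -> str:
    """Two-stage screening: escalate only low-confidence instances."""
    decision, confidence = screen_cheap(title_abstract)
    if confidence < CONFIDENCE_THRESHOLD:
        # Ambiguous case: re-evaluate with the stronger, more expensive model.
        decision = screen_expensive(title_abstract)
    return decision
```

Because only the below-threshold fraction of studies ever reaches the expensive model, the per-study cost stays close to that of the cheap model while the hardest decisions get the stronger one.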
The “Dynamic Few-Shot” component further refines performance. Instead of using random or static examples for few-shot prompting, the DFSL method first clusters all studies based on the embeddings of their titles and abstracts (generated by the MedCPT encoder and reduced via UMAP). Representative inclusion and exclusion examples are then selected from each cluster to form an instance pool. When screening a specific study, the most similar examples are dynamically retrieved from its corresponding cluster to construct a tailored, context-aware prompt. This addresses the variability across different medical topics and provides more relevant guidance to the LLM.
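The retrieval step can be sketched as a nearest-cluster lookup followed by similarity ranking. The toy 2-D vectors, cluster centroids, and example texts below are illustrative assumptions; in the paper the embeddings come from the MedCPT encoder reduced via UMAP, and the pool is built from representative examples per cluster.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical instance pool: cluster id -> [(embedding, label, text), ...].
pool = {
    0: [((0.9, 0.1), "include", "RCT of drug A"),
        ((0.8, 0.2), "exclude", "Protocol for drug A trial")],
    1: [((0.1, 0.9), "include", "Diagnostic accuracy of test B"),
        ((0.2, 0.8), "exclude", "Editorial on test B")],
}
centroids = {0: (0.85, 0.15), 1: (0.15, 0.85)}

def retrieve_examples(query_emb, k=2):
    """Assign the query to its nearest cluster, then return the k most
    similar labeled examples from that cluster's pool."""
    cluster = max(centroids, key=lambda c: cosine(query_emb, centroids[c]))
    ranked = sorted(pool[cluster],
                    key=lambda ex: cosine(query_emb, ex[0]),
                    reverse=True)
    return [(label, text) for _, label, text in ranked[:k]]
```

The retrieved `(label, text)` pairs would then be formatted into the few-shot portion of the screening prompt, so each study sees examples drawn from its own topical neighborhood.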
The approach was rigorously evaluated on a curated dataset from CLEF 2019, comprising 10 systematic reviews across diagnostic, interventional, and prognostic domains, encompassing 9,515 total studies. The results demonstrated that DFSL significantly outperformed standard prompting strategies: zero-shot (avg. F1: 0.458), chain-of-thought (0.458), and static few-shot learning (0.508), achieving a superior average F1 score of 0.552 across all reviews. The study also included a comparison with several open-weight LLMs (e.g., Phi-3.5-mini, Gemma3-4B, and medical-specific models like Med-Gemma) under a zero-shot setting, finding that medical-specific models did not consistently outperform their general counterparts in this screening task.
In conclusion, the DFSL framework successfully balances two key challenges: it improves screening accuracy through dynamic, cluster-based example selection, and it controls operational costs through a confidence-driven cascade that reserves high-power LLMs for only the most uncertain decisions. The study provides strong evidence for the generalizability and practical potential of this approach to accelerate the systematic review process while reducing the manual screening burden.