Chained Prompting for Better Systematic Review Search Strategies

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Systematic reviews require rigorously designed search strategies to ensure comprehensive retrieval while minimizing bias. Conventional manual approaches, although methodologically systematic, are resource-intensive and susceptible to subjectivity, whereas heuristic and automated techniques frequently underperform on recall unless supplemented by extensive expert input. We introduce a Large Language Model (LLM)-based chained prompt engineering framework for the automated development of systematic review search strategies. The framework replicates the procedural structure of manual search design while leveraging LLMs to decompose review objectives, extract and formalize PICO elements, generate conceptual representations, expand terminology, and synthesize Boolean queries. Beyond query construction, the framework generates better-structured PICO elements than existing methods, strengthening the foundation for high-recall search strategies. Evaluation on a subset of the LEADSInstruct dataset demonstrates that the framework attains an average recall of 0.90, substantially exceeding the performance of existing approaches. Error analysis further highlights the critical role of precise objective specification and terminological alignment in optimizing retrieval effectiveness. These findings confirm the capacity of LLM-based pipelines to yield transparent, reproducible, and high-performing search strategies, and highlight their potential as scalable instruments for supporting evidence synthesis and evidence-based practice.


💡 Research Summary

The paper presents a novel pipeline that uses chained prompting with a large language model (LLM) to automate the creation of search strategies for systematic reviews (SRs). Traditional manual methods rely on librarians translating a review question into a PICO (Population, Intervention, Comparison, Outcome) framework, expanding each component with synonyms, controlled vocabulary (e.g., MeSH), and then combining them with Boolean operators. While systematic, this process is time‑intensive and prone to subjectivity. Existing automated tools such as litsearchr or the LEADS model either depend on heuristic keyword extraction or single‑prompt generation, resulting in modest recall unless they employ ensemble techniques.
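To make the manual starting point concrete, here is a hypothetical PICO decomposition for an illustrative review question. The topic, terms, and MeSH-style headings are invented for illustration and are not taken from the paper; they only show the structure a librarian would expand before building a query.

```python
# Hypothetical illustration of the PICO framework: each component is expanded
# with free-text synonyms and controlled-vocabulary (MeSH-style) terms.
# All terms below are illustrative, not drawn from the paper's data.
pico = {
    "Population":   {"free_text": ["adults with type 2 diabetes", "T2DM patients"],
                     "mesh":      ["Diabetes Mellitus, Type 2"]},
    "Intervention": {"free_text": ["metformin", "biguanide therapy"],
                     "mesh":      ["Metformin"]},
    "Comparison":   {"free_text": ["placebo", "usual care"],
                     "mesh":      ["Placebos"]},
    "Outcome":      {"free_text": ["glycemic control", "HbA1c"],
                     "mesh":      ["Glycated Hemoglobin"]},
}
```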

The authors propose a four‑stage chain‑prompting workflow implemented with GPT‑4o‑mini. Stage 1 extracts structured PICO elements directly from the review objective (title and abstract). Stage 2 refines each PICO component into precise domain‑specific concepts. Stage 3 generates a rich set of keywords for each concept, including synonyms, lexical variants, and related terms, thereby expanding the search vocabulary. Stage 4 assembles the final Boolean query: keywords within a concept are combined with OR, and concepts are linked with AND, respecting proximity and database‑specific syntax. Each stage is guided by a system prompt that defines the model’s role and constraints, and a user prompt that specifies the task, ensuring consistent, hierarchical output that can be parsed downstream.
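The four stages above can be sketched in code. The stage boundaries and the OR-within-concept / AND-across-concepts assembly rule follow the summary; the prompt wording, the function names, and the `call_llm` hook are illustrative assumptions, not the authors' actual prompts or implementation.

```python
from typing import Callable

# One (system, user) prompt pair per stage; {input} receives the previous
# stage's output (the review objective for stage 1). Wording is illustrative.
STAGE_PROMPTS = [
    ("You are a systematic-review methodologist.",
     "Extract structured PICO elements from this review objective:\n{input}"),
    ("You are a domain expert.",
     "Refine each PICO component into precise, domain-specific concepts:\n{input}"),
    ("You are a medical librarian.",
     "For each concept, list synonyms, lexical variants, and related terms:\n{input}"),
    ("You are a search-strategy specialist.",
     "Assemble a Boolean query from these concept/keyword groups:\n{input}"),
]

def run_chain(objective: str, call_llm: Callable[[str, str], str]) -> str:
    """Chain the stages: each stage's output becomes the next stage's input.
    call_llm(system_prompt, user_prompt) is a caller-supplied LLM hook."""
    text = objective
    for system, user in STAGE_PROMPTS:
        text = call_llm(system, user.format(input=text))
    return text

def assemble_query(concepts: dict[str, list[str]]) -> str:
    """Stage-4 logic in plain code: OR keywords within a concept,
    AND across concepts; multi-word terms are quoted."""
    groups = []
    for keywords in concepts.values():
        joined = " OR ".join(f'"{kw}"' if " " in kw else kw for kw in keywords)
        groups.append(f"({joined})")
    return " AND ".join(groups)
```

Keeping the assembly step as deterministic code (rather than free-form LLM output) is one way to guarantee the parsable, hierarchical structure the paper emphasizes; whether the authors do exactly this is not stated in the summary.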

Evaluation used the LEADSInstruct dataset, a collection of PubMed‑indexed systematic reviews with known included studies. After filtering for quality and API constraints, 81 reviews (published 2012‑2016) remained. The authors first ran the pipeline starting from the PICO elements supplied by the original LEADS approach, yielding an average recall of 0.62 across all reviews and 0.80 when low‑recall (<0.2) cases were excluded. When the full pipeline—including the objective reformulation step—was applied, recall rose to 0.87 overall and 0.90 after filtering, outperforming baseline GPT‑4o (0.10), the original LEADS model (0.24), and even LEADS+ensemble (0.82 under comparable conditions). The improvement is attributed primarily to better prompt engineering for PICO extraction; the authors’ custom prompt achieved 0.87 recall versus 0.62 with the LEADS prompt.
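The recall metric behind these numbers is straightforward: the fraction of a review's known included studies that the generated query retrieves, averaged across reviews. A minimal sketch, assuming per-review sets of study identifiers (function names are illustrative):

```python
def recall(retrieved: set, included: set) -> float:
    """Fraction of the known included studies that the query retrieved."""
    return len(retrieved & included) / len(included) if included else 0.0

def average_recall(per_review, exclude_below=None):
    """Mean recall over (retrieved, included) pairs; optionally drop
    low-recall cases, as in the paper's filtered (<0.2 excluded) figures."""
    scores = [recall(r, i) for r, i in per_review]
    if exclude_below is not None:
        scores = [s for s in scores if s >= exclude_below]
    return sum(scores) / len(scores) if scores else 0.0
```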

Error analysis on a random subset of low‑recall strategies identified three main failure modes: (1) terminology mismatch (e.g., non‑standard synonyms used in target articles), (2) ambiguous or incomplete objective statements leading to incomplete concept generation, and (3) insufficient coverage of domain‑specific vocabularies. The authors suggest incorporating medical ontologies and more rigorous objective normalization in future prompts to mitigate these issues.

The study demonstrates that a carefully designed, multi‑step prompting pipeline can replicate the logical flow of manual search‑strategy development while delivering higher recall with far less human effort. The structured outputs promote transparency and reproducibility, essential for evidence‑based practice. Moreover, the approach scales because it relies on a lightweight model (GPT‑4o‑mini) and does not require ensembling multiple queries. The authors envision extending this framework to the broader systematic‑review workflow, including title/abstract screening and full‑text extraction, thereby moving toward a fully automated, end‑to‑end evidence synthesis pipeline.

