AutoBool: A Reinforcement-Learning-Trained LLM for Effective Automated Boolean Query Generation for Systematic Reviews

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

We present AutoBool, a reinforcement learning (RL) framework that trains large language models (LLMs) to generate effective Boolean queries for medical systematic reviews. Boolean queries are the primary mechanism for literature retrieval in this domain and must achieve high recall while maintaining reasonable precision, a challenging balance that existing prompt-based LLM approaches often struggle to achieve. A major limitation in this space is the lack of high-quality ground-truth Boolean queries for each topic, which makes supervised fine-tuning impractical. AutoBool addresses this challenge by using RL to directly optimize query generation with retrieval measures, without requiring target queries. To support this effort, we create and release the largest dataset of its kind: 65,588 topics in total for training and evaluating the task of automatic Boolean query formulation. Experiments on our new dataset and two established datasets (CLEF TAR and Seed Collection) show that AutoBool significantly outperforms zero-shot and few-shot prompting and matches or exceeds the effectiveness of much larger GPT-based models (e.g., GPT-4o, O3) using smaller backbones. It also approaches the effectiveness of expert-authored queries while retrieving 10 to 16 times fewer documents. Ablation studies reveal the critical roles of model backbone, size, decoding temperature, and prompt design. Code and data are available at https://github.com/ielab/AutoBool.


💡 Research Summary

The paper introduces AutoBool, a reinforcement‑learning (RL) framework that fine‑tunes large language models (LLMs) to automatically generate Boolean search queries for medical systematic reviews. Boolean queries are the primary tool for literature retrieval in this domain; they must achieve very high recall (often > 80 %) while keeping the number of retrieved documents manageable (reasonable precision). Existing prompt‑based LLM approaches can produce syntactically correct queries but typically retrieve far fewer relevant studies than expert‑crafted queries, resulting in recall rates of only 10‑40 %—far below the thresholds required for systematic reviews.

A major obstacle to supervised fine‑tuning is the absence of a single high‑quality “ground‑truth” query for each topic. Expert queries are inconsistent, often sub‑optimal, and publicly available datasets contain fewer than 200 training pairs. AutoBool circumvents this limitation by optimizing directly for retrieval performance: the model generates a query, the query is executed against PubMed via the Entrez API, and the retrieved document set is compared to a gold‑standard set of included studies (extracted from the reference lists of systematic reviews). Recall and precision are computed and fed into a custom reward function; the model is then updated using the Group Relative Policy Optimization (GRPO) algorithm.
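The evaluation step in this loop can be sketched as a small scoring function: given the set of PMIDs retrieved by an executed query and the gold-standard PMIDs extracted from the review's reference list, compute recall and precision. (In the actual pipeline the retrieved set comes from executing the query against PubMed via the Entrez API; that network call is omitted here, and the PMIDs below are invented for illustration.)

```python
def score_query(retrieved: set, gold: set):
    """Compare retrieved PMIDs against the gold-standard included studies."""
    if not gold:
        return 0.0, 0.0
    hits = len(retrieved & gold)  # relevant studies the query found
    recall = hits / len(gold)     # fraction of the gold set retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Toy example: 3 of 4 gold studies found among 6 retrieved documents.
recall, precision = score_query(
    {"101", "102", "103", "900", "901", "902"},
    {"101", "102", "103", "104"},
)
print(recall, precision)  # 0.75 0.5
```

These two numbers are exactly what the reward function consumes, so the RL update never needs a reference query, only the relevance labels.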

Dataset

The authors mined the PubMed Central Open Access (PMC‑OA) corpus for articles labeled “Systematic Review”. From 75 676 such articles they extracted the PMIDs cited in the results sections, treating these as the gold‑standard included studies. After filtering for availability, 65 588 unique topics remained. The dataset is split temporally to avoid leakage:

  • Training set – 32 794 topics published between 2000‑07‑06 and 2021‑10‑30.
  • Test set – 32 794 topics published between 2021‑10‑31 and 2025‑03‑01.
  • PubT_emp set – 1 000 topics published after 2024‑11‑01, reserved for out‑of‑distribution evaluation (ensuring no overlap with the LLM’s pre‑training cutoff of October 2024).

All data, along with code, are released on GitHub and HuggingFace.
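The temporal split above amounts to filtering topics by publication date, which can be illustrated with a minimal sketch (the topic IDs and dates below are invented for the example; the real topics come from PMC-OA):

```python
from datetime import date

# Hypothetical (topic_id, publication_date) pairs.
topics = [
    ("t1", date(2015, 3, 1)),
    ("t2", date(2022, 6, 15)),
    ("t3", date(2021, 10, 30)),
    ("t4", date(2025, 1, 10)),
]

TRAIN_CUTOFF = date(2021, 10, 30)   # last training date per the split above
OOD_START = date(2024, 11, 1)       # after the LLM pre-training cutoff

train = [t for t, d in topics if d <= TRAIN_CUTOFF]
test = [t for t, d in topics if d > TRAIN_CUTOFF]
ood = [t for t, d in topics if d > OOD_START]  # out-of-distribution subset

print(train, test, ood)  # ['t1', 't3'] ['t2', 't4'] ['t4']
```

Splitting on publication date rather than at random ensures that no test-set review (or any study it includes) could have leaked into training.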

Reward Design

The total reward is a sum of three components:

  1. Formatting Reward (R_format) – +10 if the query follows expected conventions (capitalized Boolean operators, quoted terms, balanced parentheses), –10 otherwise.

  2. Validity Reward (R_validity) – +10 if the query parses correctly and returns at least one but fewer than 200 000 documents; –10 otherwise.

  3. Retrieval Reward (R_retrieval) – A composite function that prioritizes recall while allowing precision to matter increasingly as recall grows.
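Putting the three components together, the reward computation might look like the sketch below. The ±10 formatting and validity terms follow the description above; the exact functional form of R_retrieval is not reproduced in this summary, so the recall-gated blend shown here (recall plus recall-weighted precision) is only an illustrative assumption, not the paper's formula.

```python
def total_reward(format_ok: bool, n_retrieved: int,
                 recall: float, precision: float) -> float:
    """Sum of formatting, validity, and retrieval rewards for one query."""
    r_format = 10.0 if format_ok else -10.0
    # Valid iff the query parses and returns at least 1 but fewer than 200,000 docs.
    r_validity = 10.0 if 0 < n_retrieved < 200_000 else -10.0
    # ASSUMED form: recall dominates, and precision contributes
    # increasingly as recall grows (precision is gated by recall).
    r_retrieval = recall + recall * precision
    return r_format + r_validity + r_retrieval

# A well-formed, valid query with recall 0.8 and precision 0.1.
print(total_reward(True, 500, 0.8, 0.1))
# A malformed query that retrieved nothing.
print(total_reward(False, 0, 0.0, 0.0))
```

Because the retrieval term is bounded while the formatting and validity terms are large and discrete, the model is first pushed toward syntactically valid queries and only then refined for retrieval quality.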

