Automated Risk-of-Bias Assessment of Randomized Controlled Trials: A First Look at a GEPA-trained Programmatic Prompting Framework

Reading time: 6 minutes

📝 Original Info

  • Title: Automated Risk-of-Bias Assessment of Randomized Controlled Trials: A First Look at a GEPA-trained Programmatic Prompting Framework
  • ArXiv ID: 2512.01452
  • Date: 2025-12-01
  • Authors: Lingbo Li, Anuradha Mathrani, Teo Susnjak (Massey University, Auckland, New Zealand)

📝 Abstract

Assessing risk of bias (RoB) in randomized controlled trials is essential for trustworthy evidence synthesis, but the process is resource-intensive and prone to variability across reviewers. Large language models (LLMs) offer a route to automation, but existing methods rely on manually engineered prompts that are difficult to reproduce, generalize, or evaluate. This study introduces a programmable RoB assessment pipeline that replaces ad-hoc prompt design with structured, code-based optimization using DSPy and its GEPA module. GEPA refines LLM reasoning through Pareto-guided search and produces inspectable execution traces, enabling transparent replication of every step in the optimization process. We evaluated the method on 100 RCTs from published meta-analyses across seven RoB domains. GEPA-generated prompts were applied to both open-weight models (Mistral Small 3.1 and GPT-oss-20b) and commercial models (GPT-5 Nano and GPT-5 Mini). In domains with clearer methodological reporting, such as Random Sequence Generation, GEPA-generated prompts performed best, with similar results for Allocation Concealment and Blinding of Participants, while the commercial model performed slightly better overall. We also compared GEPA with three manually designed prompts using Claude 3.5 Sonnet. GEPA achieved the highest overall accuracy, improved performance by 30%–40% in Random Sequence Generation and Selective Reporting, and showed generally comparable performance in the other domains relative to manual prompts. These findings suggest that GEPA can produce consistent and reproducible prompts for RoB assessment, supporting the structured and principled use of LLMs in evidence synthesis.
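The "Pareto-guided search" mentioned in the abstract can be illustrated with a small, self-contained sketch: rather than keeping only the best-on-average prompt, GEPA retains a frontier of candidates that are each best on at least one training instance, so specialist prompts survive alongside generalists. This is an illustrative reimplementation of the selection idea only, not the authors' code; all names and scores below are made up.

```python
def pareto_frontier(candidates):
    """Instance-wise Pareto selection: keep every prompt candidate that
    matches the best achieved score on at least one training example.

    candidates: dict of name -> list of per-example scores (equal lengths).
    """
    n = len(next(iter(candidates.values())))
    # Best score achieved on each training example by any candidate.
    best = [max(scores[i] for scores in candidates.values()) for i in range(n)]
    return {name for name, scores in candidates.items()
            if any(scores[i] == best[i] for i in range(n))}

# Hypothetical per-example accuracies of three prompt candidates on
# four RoB training items (illustrative numbers, not from the paper).
scores = {
    "prompt_a": [0.9, 0.2, 0.8, 0.5],  # best on items 1 and 3
    "prompt_b": [0.3, 0.9, 0.4, 0.6],  # best on items 2 and 4
    "prompt_c": [0.5, 0.5, 0.5, 0.5],  # never best anywhere: pruned
}
frontier = pareto_frontier(scores)  # {"prompt_a", "prompt_b"}
```

The design point is that a uniformly mediocre candidate (`prompt_c`) is discarded even though its average score is respectable, which preserves diversity for later mutation steps.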


📄 Full Content

AUTOMATED RISK-OF-BIAS ASSESSMENT OF RANDOMIZED CONTROLLED TRIALS: A FIRST LOOK AT A GEPA-TRAINED PROGRAMMATIC PROMPTING FRAMEWORK

A PREPRINT · arXiv:2512.01452v1 [cs.AI] 1 Dec 2025

Lingbo Li∗, Anuradha Mathrani, Teo Susnjak
School of Mathematical and Computational Sciences, Massey University, Auckland, New Zealand
∗Corresponding author: L.Li5@massey.ac.nz

December 2, 2025

ABSTRACT

Assessing risk of bias (RoB) in randomized controlled trials is essential for trustworthy evidence synthesis, but the process is resource-intensive and prone to variability across reviewers. Large language models (LLMs) offer a route to automation, but existing methods rely on manually engineered prompts that are difficult to reproduce, generalize, or evaluate. This study introduces a programmable RoB assessment pipeline that replaces ad-hoc prompt design with structured, code-based optimization using DSPy and its GEPA module. GEPA refines LLM reasoning through Pareto-guided search and produces inspectable execution traces, enabling transparent replication of every step in the optimization process. We evaluated the method on 100 RCTs from published meta-analyses across seven RoB domains. GEPA-generated prompts were applied to both open-weight models (Mistral Small 3.1 and GPT-oss-20b) and commercial models (GPT-5 Nano and GPT-5 Mini). In domains with clearer methodological reporting, such as Random Sequence Generation, GEPA-generated prompts performed best, with similar results for Allocation Concealment and Blinding of Participants, while the commercial model performed slightly better overall. We also compared GEPA with three manually designed prompts using Claude 3.5 Sonnet. GEPA achieved the highest overall accuracy, improved performance by 30%–40% in Random Sequence Generation and Selective Reporting, and showed generally comparable performance in the other domains relative to manual prompts. These findings suggest that GEPA can produce consistent and reproducible prompts for RoB assessment, supporting the structured and principled use of LLMs in evidence synthesis.

Keywords: Risk of bias, Randomized controlled trials, Large language models, GEPA, Evidence Synthesis Automation, Programmatic Risk Assessment

1 Introduction

Meta-analyses serve as cornerstone methodologies for synthesizing clinical evidence [1, 2]. However, the credibility of such evidence syntheses fundamentally depends on rigorous risk of bias (RoB) assessments to evaluate the methodological quality of included studies [3, 4]. This challenge is particularly relevant for randomized controlled trials (RCTs), which constitute the primary evidence source for intervention effectiveness and require specialized assessment frameworks [5]. Currently, these assessments rely predominantly on manual evaluation processes that, while methodologically sound, face significant scalability challenges [6, 7, 8]. Manual RoB assessment is inherently time-intensive, requiring expert reviewers to extract and evaluate methodological details from each trial report [6, 9, 10, 8, 7]. Moreover, the process introduces unavoidable subjectivity, as different reviewers may interpret identical methodological descriptions differently, leading to inter-rater variability that can compromise the reliability of evidence syntheses [11, 12, 13, 14].

Large language models (LLMs) have opened new opportunities for automating evidence synthesis [15]. These models demonstrate strong capabilities in processing unstructured clinical text [16], understanding contextual dependencies [17, 18, 19], and adapting to specialized domains [20, 21], making them promising tools for RoB assessment. Several studies have explored the application of LLMs for automating RoB assessment in systematic reviews and meta-analyses. Lai et al. [22] applied GPT-series models and Claude to evaluate RoB in randomized trials using a structured prompt developed by experts, achieving high accuracy and substantial agreement with human reviewers. However, the study was limited by a small sample size, reliance on handcrafted prompts, and restricted validation within specific review domains, leaving its generalizability uncertain. In early 2025, Eisele-Metzger et al. [23] evaluated Claude 2 using the RoB 2 tool and reported only slight-to-fair agreement with human assessments, emphasising inconsistencies in LLM performance across tools and datasets. Lai et al. [24] further extended LLM applications to complementary medicine, demonstrating that LLM-assisted approaches achieved up to 97.9% accuracy and markedly reduced assessment time, yet the study remained limited to complementary medicine RCTs and required expert-designed prompts for reliable performance. Most recently, Xai et al. [25] applied GPT-4o, Moonshot-v1-128k, and De
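The seven RoB domains referenced throughout correspond to the Cochrane risk-of-bias framework (the paper explicitly names Random Sequence Generation, Allocation Concealment, Blinding of Participants, and Selective Reporting). A minimal sketch of a validated per-trial judgement record, assuming Cochrane RoB 1 domain names and a low/high/unclear scale — the field names, the remaining three domain names, and the trial identifier are illustrative assumptions, not the authors' schema:

```python
from dataclasses import dataclass

# Seven Cochrane-style RoB domains. The paper names four of these;
# the other three are assumed from the standard RoB 1 tool.
DOMAINS = (
    "random_sequence_generation",
    "allocation_concealment",
    "blinding_of_participants_and_personnel",
    "blinding_of_outcome_assessment",
    "incomplete_outcome_data",
    "selective_reporting",
    "other_bias",
)
JUDGEMENTS = {"low", "high", "unclear"}

@dataclass
class RoBAssessment:
    """One trial's domain-level judgements, validated on construction."""
    trial_id: str
    judgements: dict  # domain -> "low" | "high" | "unclear"

    def __post_init__(self):
        missing = set(DOMAINS) - set(self.judgements)
        if missing:
            raise ValueError(f"missing domains: {sorted(missing)}")
        bad = set(self.judgements.values()) - JUDGEMENTS
        if bad:
            raise ValueError(f"invalid judgements: {sorted(bad)}")

# Hypothetical trial id; all seven domains must be judged.
example = RoBAssessment(
    trial_id="NCT00000000",
    judgements={d: "unclear" for d in DOMAINS},
)
```

Enforcing the full domain set at construction time is one way an automated pipeline can guarantee that every trial receives a complete, machine-checkable assessment before aggregation.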

…(Full text truncated)…

📸 Image Gallery

compare.png orcid.png workflow.png

Reference

This content is AI-processed based on ArXiv data.
