RiskAgent: Synergizing Language Models with Validated Tools for Evidence-Based Risk Prediction
Large Language Models (LLMs) achieve competitive results compared to human experts on medical examinations. However, applying LLMs to complex clinical decision-making remains a challenge: it requires a deep understanding of medical knowledge and differs from the standardized, exam-style scenarios targeted by most current efforts. A common approach is to fine-tune LLMs for target tasks; this, however, not only requires substantial data and computational resources but also remains prone to generating 'hallucinations'. In this work, we present RiskAgent, which synergizes language models with hundreds of validated clinical decision tools supported by evidence-based medicine to provide generalizable and faithful recommendations. Our experiments show that RiskAgent not only achieves superior performance on a broad range of clinical risk predictions across diverse scenarios and diseases, but also demonstrates robust generalization in tool learning on the external MedCalc-Bench dataset, as well as in medical reasoning and question answering on three representative benchmarks: MedQA, MedMCQA, and MMLU.
💡 Research Summary
RiskAgent is a novel multi‑agent framework that augments a large language model (LLM) with hundreds of validated clinical decision tools to deliver evidence‑based risk predictions. The authors argue that existing medical LLMs excel at standardized examinations but falter on real‑world clinical decision‑making due to four key challenges: limited clinical efficacy, high resource demands for fine‑tuning, privacy constraints when using commercial APIs, and the propensity to generate hallucinated answers without verifiable sources. To address these issues, RiskAgent introduces three cooperating LLM agents—Decider, Executor, and Reviewer—plus an Environment module that hosts the tool registry and execution interfaces.
The Decider receives a patient’s electronic health record (EHR) or a clinical vignette, parses the relevant variables, and selects the most appropriate evidence‑based calculator (e.g., CHA₂DS₂‑VASc, Framingham risk score, CURB‑65). The Executor then formats the required parameters, calls the selected tool (via API or local implementation), and returns the raw numerical result. The Decider interprets this output, drafts an initial answer, and passes it to the Reviewer, which validates the reasoning chain, attaches citations to the specific tool version, and produces the final, clinician‑readable response. This architecture externalizes all deterministic, guideline‑driven computations, thereby eliminating the need for the LLM to “memorize” complex formulas and dramatically reducing hallucinations.
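The Decider→Executor→Reviewer flow described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the agent functions are trivial stand-ins for LLM calls, and `TOOL_REGISTRY` is a hypothetical stand-in for the Environment's tool registry. The CHA₂DS₂-VASc weights, however, are the standard published ones.

```python
# Illustrative sketch (not the authors' code): a validated calculator
# registered in an Environment-like registry, driven by minimal
# Decider / Executor / Reviewer stubs.

def cha2ds2_vasc(age, female, chf, hypertension, diabetes,
                 stroke_tia, vascular_disease):
    """CHA2DS2-VASc stroke-risk score (standard published weights)."""
    score = 0
    score += 2 if age >= 75 else (1 if 65 <= age <= 74 else 0)
    score += 1 if female else 0
    score += 1 if chf else 0
    score += 1 if hypertension else 0
    score += 1 if diabetes else 0
    score += 2 if stroke_tia else 0
    score += 1 if vascular_disease else 0
    return score

TOOL_REGISTRY = {"CHA2DS2-VASc": cha2ds2_vasc}  # hypothetical Environment

def decider(case):
    """Select a tool and extract its parameters from the parsed case.
    In RiskAgent this is an LLM agent; here it is a trivial stand-in."""
    return "CHA2DS2-VASc", case

def executor(tool_name, params):
    """Format the parameters and call the selected tool."""
    return TOOL_REGISTRY[tool_name](**params)

def reviewer(tool_name, result):
    """Attach the tool citation so the final answer is traceable."""
    return {"tool": tool_name, "score": result}

case = dict(age=78, female=True, chf=False, hypertension=True,
            diabetes=False, stroke_tia=True, vascular_disease=False)
tool, params = decider(case)
answer = reviewer(tool, executor(tool, params))
print(answer)  # {'tool': 'CHA2DS2-VASc', 'score': 6}
```

The point of the split is visible even in this toy: the score itself comes from a deterministic, externally validated function, so the language model never has to reproduce the formula from memory.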
Training leverages an 8‑billion‑parameter LLaMA foundation model, further instruction‑tuned on a mixture of medical texts, tool usage manuals, and synthetic risk‑prediction prompts. Reinforcement learning from human feedback (RLHF) is employed to fine‑tune the Decider’s tool‑selection policy, achieving >93 % accuracy in choosing the correct calculator for a given scenario. The system is deliberately lightweight: the LLM itself remains modest in size, while the heavy lifting is performed by the external tools, which are free of charge and continuously updated by the medical community.
To evaluate the approach, the authors construct MedRisk, a comprehensive benchmark comprising 12,352 risk‑prediction questions spanning 154 diseases, 86 symptoms, 50 specialties, and 24 organ systems. Each item is grounded in a real‑world clinical scenario and has a ground‑truth answer derived from the corresponding validated tool. RiskAgent attains an average accuracy of 78.4 % on MedRisk, substantially outperforming GPT‑4o (62.1 %), OpenAI’s o1 (58.7 %), and the state‑of‑the‑art medical LLM Meditron‑70B (65.3 %). The performance gap widens to 15‑20 % in domains that rely heavily on precise calculations, such as cardiovascular event risk, cancer incidence risk, and asthma exacerbation risk. Statistical significance is confirmed via paired t‑tests (p < 0.01).
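As a reminder of what the paired t‑test is doing here, the sketch below computes the paired t statistic over per‑domain accuracy differences using only the standard library. The accuracy numbers are made up for illustration; they are not the paper's data.

```python
# Illustrative only: computing a paired t statistic over per-domain
# accuracy differences. The input numbers are hypothetical, NOT the
# paper's actual results.
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """t = mean(d) / (sd(d) / sqrt(n)), where d are paired differences."""
    d = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

riskagent = [0.80, 0.76, 0.79, 0.81, 0.78]  # hypothetical per-domain accuracy
baseline  = [0.63, 0.60, 0.64, 0.61, 0.62]
t = paired_t_statistic(riskagent, baseline)
# With n = 5 (df = 4), the two-tailed critical value at p = 0.05 is 2.776;
# a t far above that indicates a significant paired difference.
```

In practice one would use `scipy.stats.ttest_rel` to get the p‑value directly; the hand‑rolled version just makes the arithmetic explicit.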
Generalization is further demonstrated on the external MedCalc‑Bench dataset, where RiskAgent, without any additional fine‑tuning, correctly invokes the appropriate tool in 85 % of zero‑shot cases, whereas baseline LLMs fail to use any tool at all. In three standard medical QA benchmarks—MedQA, MedMCQA, and MMLU—RiskAgent also yields 7‑12 % absolute gains over the best competing models, while uniquely providing traceable evidence (tool name, version, formula) alongside each answer.
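The traceable evidence attached to each answer (tool name, version, formula) might look like the record below. This schema is hypothetical: the paper specifies *which* fields are provided, not this exact layout, and the CURB‑65 criteria shown in the formula string are the standard published ones.

```python
# Hypothetical evidence record for a traceable answer. Field names and
# layout are illustrative; the paper only states that tool name, version,
# and formula accompany each answer.
import json
from dataclasses import dataclass, asdict

@dataclass
class Evidence:
    tool: str
    version: str
    formula: str
    result: float

ev = Evidence(
    tool="CURB-65",
    version="Lim et al., Thorax 2003",
    formula=("1 point each: Confusion; Urea > 7 mmol/L; "
             "Respiratory rate >= 30; SBP < 90 or DBP <= 60; Age >= 65"),
    result=2,
)
print(json.dumps(asdict(ev), indent=2))
```

Serializing the evidence alongside the answer is what lets a clinician audit the recommendation back to a specific, versioned tool rather than an opaque model output.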
The paper’s contributions are threefold: (1) the design of RiskAgent, a resource‑efficient, evidence‑centric multi‑agent system that couples LLM reasoning with validated clinical calculators; (2) the release of MedRisk, a large, diverse benchmark for generalist medical risk prediction; and (3) extensive empirical validation showing superior accuracy, robust tool‑learning transfer, and improved medical reasoning across multiple tasks.
Limitations include the reliance on up‑to‑date tool repositories—if guidelines change, the tool registry must be refreshed—and the current focus on quantitative risk scores, leaving qualitative treatment recommendations for future work. The authors propose extending the framework to therapeutic decision support and building automated pipelines for tool metadata updates, thereby moving toward fully trustworthy, AI‑augmented clinical practice.