Closing Reasoning Gaps in Clinical Agents with Differential Reasoning Learning
Clinical decision support requires not only correct answers but also clinically valid reasoning. We propose Differential Reasoning Learning (DRL), a framework that improves clinical agents by learning from reasoning discrepancies. From reference reasoning rationales (e.g., physician-authored clinical rationale, clinical guidelines, or outputs from more capable models) and the agent’s free-form chain-of-thought (CoT), DRL extracts reasoning graphs as directed acyclic graphs (DAGs) and performs a clinically weighted graph edit distance (GED)-based discrepancy analysis. An LLM-as-a-judge aligns semantically equivalent nodes and diagnoses discrepancies between graphs. These graph-level discrepancy diagnostics are converted into natural-language instructions and stored in a Differential Reasoning Knowledge Base (DR-KB). At inference, we retrieve top-$k$ instructions via Retrieval-Augmented Generation (RAG) to augment the agent prompt and patch likely logic gaps. Evaluation on open medical question answering (QA) benchmarks and a Return Visit Admissions (RVA) prediction task from internal clinical data demonstrates gains over baselines, improving both final-answer accuracy and reasoning fidelity. Ablation studies confirm gains from infusing reference reasoning rationales and the top-$k$ retrieval strategy. A clinician review of model outputs provides further assurance of the approach's validity. Together, the results suggest that DRL supports more reliable clinical decision-making in complex reasoning scenarios and offers a practical mechanism for deployment under limited token budgets.
💡 Research Summary
The paper introduces Differential Reasoning Learning (DRL), a two‑stage framework designed to close reasoning gaps in clinical language agents. The authors argue that clinical decision support must deliver not only correct answers but also clinically valid reasoning pathways. To achieve this, DRL first converts high‑quality reference rationales—drawn from physician‑authored explanations, clinical guidelines, or more capable teacher models—into structured directed acyclic graphs (DAGs). Nodes represent clinical entities (symptoms, labs, diagnoses, treatments) and edges encode inferential relationships such as “evidence supports hypothesis” or “hypothesis triggers test.” Using the same semantic parser, the agent’s free‑form chain‑of‑thought (CoT) is also transformed into a DAG.
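To make the graph representation concrete, here is a minimal sketch of such a reasoning DAG in Python. The node kinds and edge relations mirror the examples above (symptoms, labs, diagnoses, treatments; "supports", "triggers"), but the schema, class names, and the toy chest-pain example are illustrative assumptions, not the paper's actual data model.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    id: str
    kind: str   # e.g. "symptom", "lab", "diagnosis", "treatment"
    text: str   # surface form extracted from the rationale/CoT

@dataclass(frozen=True)
class Edge:
    src: str    # source node id
    dst: str    # target node id
    rel: str    # e.g. "supports", "triggers"

@dataclass
class ReasoningGraph:
    nodes: dict = field(default_factory=dict)  # id -> Node
    edges: list = field(default_factory=list)  # list of Edge

    def add_node(self, node: Node) -> None:
        self.nodes[node.id] = node

    def add_edge(self, edge: Edge) -> None:
        # DAG constraint: a full implementation would reject cycles here.
        self.edges.append(edge)

# Hypothetical reference graph: chest pain (symptom) supports the ACS
# hypothesis (diagnosis), which triggers a troponin test (lab).
ref = ReasoningGraph()
ref.add_node(Node("n1", "symptom", "chest pain"))
ref.add_node(Node("n2", "diagnosis", "acute coronary syndrome"))
ref.add_node(Node("n3", "lab", "troponin test"))
ref.add_edge(Edge("n1", "n2", "supports"))
ref.add_edge(Edge("n2", "n3", "triggers"))
print(len(ref.nodes), len(ref.edges))  # 3 2
```

The same structure would hold the agent's parsed CoT, so that reference and agent graphs can be compared node-for-node and edge-for-edge.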
The core of DRL is a clinically weighted graph edit distance (GED) analysis. By assigning higher costs to edits involving high‑risk factors or critical treatment decisions, the method quantifies three categories of discrepancy: missing important nodes (v_miss), hallucinated or irrelevant nodes (v_hallu), and incorrect or incomplete reasoning paths (e_diff). Because clinical language is highly paraphrastic, exact string matching is insufficient; therefore, an “LLM‑as‑a‑judge” module performs semantic node alignment and context‑based plausibility checks, ensuring that conceptually equivalent but lexically different nodes are matched correctly.
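The three discrepancy categories can be sketched as set differences over aligned graphs. This is a simplified stand-in: the concrete weight values and the set-difference formulation are assumptions, and the paper's GED may involve a full edit-path search rather than this direct comparison.

```python
# Assumed sketch of the clinically weighted discrepancy score; the
# weights and unit edge cost are illustrative, not the paper's values.

def discrepancy(ref_nodes, agent_nodes, ref_edges, agent_edges, weight):
    """ref_nodes/agent_nodes: sets of node labels after the
    LLM-as-a-judge has mapped paraphrases onto shared labels.
    ref_edges/agent_edges: sets of (src, dst, rel) triples.
    weight: maps a node label to its clinical edit cost (higher for
    high-risk factors or critical treatment decisions)."""
    v_miss = ref_nodes - agent_nodes    # important nodes the agent skipped
    v_hallu = agent_nodes - ref_nodes   # hallucinated / irrelevant nodes
    e_diff = ref_edges ^ agent_edges    # incorrect or incomplete paths
    cost = (sum(weight(v) for v in v_miss)
            + sum(weight(v) for v in v_hallu)
            + len(e_diff))              # unit cost per edge edit (assumed)
    return cost, v_miss, v_hallu, e_diff

# Toy weights: missing a critical lab is penalized more heavily.
w = lambda v: 3.0 if v == "troponin test" else 1.0
cost, miss, hallu, ediff = discrepancy(
    {"chest pain", "ACS", "troponin test"},
    {"chest pain", "ACS", "anxiety"},
    {("chest pain", "ACS", "supports"), ("ACS", "troponin test", "triggers")},
    {("chest pain", "ACS", "supports"), ("chest pain", "anxiety", "supports")},
    w)
print(cost)  # 6.0 = 3.0 (missed troponin) + 1.0 (anxiety) + 2 edge edits
```

Note that the alignment step is assumed to have already happened: the judge module, not string equality, decides which agent node corresponds to which reference node.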
The discrepancy report is then fed to an “Insight Generator” LLM, which produces a concise natural‑language instruction. Each instruction explicitly states what went wrong, why it matters clinically, the associated risks, preventive steps, and contextual triggers (keywords or patterns). These instructions are stored in a Differential Reasoning Knowledge Base (DR‑KB), effectively a repository of reusable “patches” that capture recurring logical failures.
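An individual DR-KB entry might look like the following sketch. The field names track the components listed above (what went wrong, clinical importance, risks, preventive steps, triggers), but the exact schema, class names, and the naive keyword matcher are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Instruction:
    what_went_wrong: str
    why_it_matters: str
    risks: str
    preventive_steps: str
    triggers: list  # keywords/patterns that make the patch relevant

class DRKnowledgeBase:
    def __init__(self):
        self._entries = []

    def add(self, instr: Instruction) -> None:
        self._entries.append(instr)

    def match(self, case_text: str):
        # Naive substring trigger matching; the paper retrieves via RAG.
        text = case_text.lower()
        return [i for i in self._entries
                if any(t.lower() in text for t in i.triggers)]

kb = DRKnowledgeBase()
kb.add(Instruction(
    what_went_wrong="Skipped troponin when chest pain suggested ACS",
    why_it_matters="The biomarker is needed to rule ACS in or out",
    risks="Undetected myocardial infarction",
    preventive_steps="Order troponin for suspected ACS presentations",
    triggers=["chest pain", "ACS"]))
print(len(kb.match("62M with chest pain radiating to the left arm")))  # 1
```

Because entries are plain natural-language records, clinicians can audit, edit, or retire individual patches without touching the model.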
During inference, the current case’s text is used to query DR‑KB; the top‑k most relevant instructions are retrieved and injected into the prompt of the clinical agent (a retrieval‑augmented generation step). This approach patches the agent’s reasoning without any parameter updates, making it suitable for environments with strict governance, limited compute, or token‑budget constraints.
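The retrieve-then-inject flow can be sketched as follows. The paper presumably uses dense embeddings for retrieval; this self-contained stand-in uses bag-of-words cosine similarity, and the prompt template and function names are assumptions made only to illustrate the mechanism.

```python
import math
from collections import Counter

def bow(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(case_text: str, instructions: list, k: int) -> list:
    # Rank DR-KB instructions by similarity to the case; keep top-k
    # to stay within the token budget.
    q = bow(case_text)
    ranked = sorted(instructions, key=lambda i: cosine(q, bow(i)), reverse=True)
    return ranked[:k]

def build_prompt(case_text: str, patches: list) -> str:
    # Inject the retrieved patches ahead of the case so the agent sees
    # them before it starts reasoning; no parameter updates involved.
    lines = ["Reasoning reminders from prior discrepancies:"]
    lines += [f"- {p}" for p in patches]
    lines += ["", "Case:", case_text]
    return "\n".join(lines)

kb = ["Order troponin when chest pain suggests ACS",
      "Check renal function before contrast imaging",
      "Screen for sepsis when fever and hypotension co-occur"]
case = "elderly patient with chest pain and shortness of breath"
prompt = build_prompt(case, retrieve_top_k(case, kb, k=1))
print(prompt.splitlines()[1])  # - Order troponin when chest pain suggests ACS
```

The choice of k matters: too few patches miss relevant fixes, too many dilute the prompt, which is exactly the trade-off the ablation on retrieval depth examines.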
Empirical evaluation covers two open‑source medical QA benchmarks (MedQA and MedMCQA) and an internal Return Visit Admission (RVA) prediction task derived from emergency department notes. The RVA‑QA dataset simulates a realistic domain shift: predicting whether a patient will be readmitted within nine days after an ED visit. DRL achieves 81.28 ± 0.47 % accuracy on RVA‑QA, outperforming the strongest baseline (Qwen‑3‑8B at 56.97 % accuracy) by more than 24 percentage points. On the open benchmarks, DRL yields modest but consistent gains over baseline LLMs, demonstrating that reasoning‑level supervision can complement knowledge‑level training.
Ablation studies confirm that (1) incorporating reference reasoning graphs is essential for performance, and (2) the retrieval depth k balances coverage against token budget; both under‑ and over‑retrieval degrade results. Human evaluation by clinicians shows that the generated instructions improve interpretability and trust, as they expose missing risk factors, prevent hallucinations, and align the agent’s reasoning with established guidelines.
The authors highlight three design goals: process‑level supervision (focusing on how an answer is derived), interpretability/auditability (allowing clinicians to inspect and edit patches), and practicality under domain shift (no fine‑tuning required). Limitations include reliance on the accuracy of the initial graph parser, the need to keep guideline‑derived graphs up‑to‑date, and scalability concerns for a large DR‑KB. Future work is proposed on multimodal integration, automated weighting of graph edits, and continuous KB maintenance.
In summary, DRL offers a novel, discrepancy‑driven alignment mechanism that transforms reasoning errors into actionable knowledge, stores this knowledge for reuse, and applies it at inference time to produce more reliable, clinically grounded AI assistants.