End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original ArXiv source.

The integration of Large Language Models (LLMs) into healthcare is constrained by knowledge limitations, hallucinations, and a disconnect from Evidence-Based Medicine (EBM). While Retrieval-Augmented Generation (RAG) offers a solution, current systems often rely on static workflows that miss the iterative, hypothetico-deductive reasoning of clinicians. To address this, we introduce Deep-DxSearch, an agentic RAG system trained end-to-end via reinforcement learning (RL) for traceable diagnostic reasoning. Deep-DxSearch acts as an active investigator, treating the LLM as an agent within an environment of 16,000+ guideline-derived disease profiles, 150,000+ patient records for case-based reasoning, and over 27 million biomedical documents. Using soft verifiable rewards that co-optimize retrieval and reasoning, the model learns to formulate queries, evaluate evidence, and refine searches to close diagnostic gaps. Experiments show our end-to-end RL framework consistently outperforms prompt-engineering and training-free RAG methods. On in-distribution (ID) and out-of-distribution (OOD) benchmarks for common and rare diseases, Deep-DxSearch surpasses strong baselines, including GPT-4o, DeepSeek-R1, and medical-specific frameworks, achieving an average accuracy gain of 22.7% over the second-best model. In validation with 150 real-world cases, Deep-DxSearch boosts physicians’ average diagnostic accuracy from 45.6% to 69.1%. These results indicate that evolving agentic systems to leverage statistical regularities in large-scale healthcare data is key for trustworthy diagnostic assistants. All data, code, and checkpoints are available at https://qiaoyu-zheng.github.io/Deep-DxSearch.


💡 Research Summary

The paper tackles the persistent gap between large language models (LLMs) and evidence‑based clinical practice. While Retrieval‑Augmented Generation (RAG) can ground LLMs in external knowledge, most medical RAG pipelines are static, relying on a single “query‑and‑answer” step that fails to capture the iterative, hypothesis‑driven reasoning clinicians use. To bridge this gap, the authors introduce Deep‑DxSearch, an agentic RAG system that treats the LLM as an autonomous decision‑making agent operating within a richly structured medical environment.

The environment aggregates three massive resources: (1) over 16,000 guideline‑derived disease profiles, (2) more than 150,000 curated patient records for case‑based reasoning, and (3) a corpus of more than 27 million biomedical documents (PubMed articles, textbooks, Wikipedia, etc.). Within this setting the agent can invoke five primitive actions: hypothesis generation, guideline verification, similar‑case retrieval, literature mining, and final decision.
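The agent loop implied by this action space can be sketched as follows. The `Action` names, the `run_episode` driver, and the scripted `toy_policy` are illustrative assumptions for exposition only, not the paper's actual interface:

```python
from enum import Enum

# Hypothetical action space mirroring the five primitives described above.
class Action(Enum):
    HYPOTHESIZE = "hypothesis generation"
    VERIFY_GUIDELINE = "guideline verification"
    RETRIEVE_CASES = "similar-case retrieval"
    MINE_LITERATURE = "literature mining"
    DIAGNOSE = "final decision"

def run_episode(policy, patient_note, max_steps=8):
    """Roll out one diagnostic trajectory: the policy keeps choosing
    actions until it commits to a final diagnosis or exhausts the budget."""
    trajectory = []
    for _ in range(max_steps):
        action, payload = policy(patient_note, trajectory)
        trajectory.append((action, payload))
        if action is Action.DIAGNOSE:
            break
    return trajectory

# Toy deterministic policy: hypothesize, verify against a guideline, commit.
def toy_policy(note, history):
    script = [
        (Action.HYPOTHESIZE, "suspect iron-deficiency anemia"),
        (Action.VERIFY_GUIDELINE, "anemia workup guideline"),
        (Action.DIAGNOSE, "iron-deficiency anemia"),
    ]
    return script[len(history)]

traj = run_episode(toy_policy, "fatigue, pallor, low ferritin")
print([a.name for a, _ in traj])
```

In the trained system the policy is the LLM itself and each non-terminal action triggers a retrieval call into one of the three resources above; the sketch only shows the control flow.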

Training is performed end‑to‑end with reinforcement learning (PPO). The reward function is multi‑faceted: (i) correctness of the final diagnosis, (ii) “evidence quality” rewards for retrieving high‑trust sources (guidelines, matched cases, peer‑reviewed literature), and (iii) trajectory‑level incentives that encourage diverse, efficient search strategies. This “process‑oriented” reward forces the model not only to be right but also to produce a traceable chain of evidence—what the authors call a “Chain of Evidence”—instead of a black‑box “Chain of Thought”.
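A minimal sketch of how such a composite reward might be assembled from the three facets above. The weights, source labels, and the diversity/efficiency terms are hypothetical stand-ins; the paper's actual reward terms and coefficients are not reproduced here:

```python
# Hypothetical weights for the three reward facets (not the paper's values).
W_CORRECT, W_EVIDENCE, W_PROCESS = 1.0, 0.3, 0.1

# Illustrative labels for high-trust evidence sources.
TRUSTED_SOURCES = {"guideline", "matched_case", "peer_reviewed"}

def trajectory_reward(final_dx, gold_dx, retrieved_sources, n_steps, max_steps=8):
    """Combine outcome, evidence-quality, and process rewards into one scalar."""
    # (i) outcome: exact-match correctness of the final diagnosis
    r_correct = 1.0 if final_dx == gold_dx else 0.0
    # (ii) evidence quality: fraction of retrievals from high-trust sources
    if retrieved_sources:
        r_evidence = sum(s in TRUSTED_SOURCES for s in retrieved_sources) / len(retrieved_sources)
    else:
        r_evidence = 0.0
    # (iii) process: reward diverse sources, penalize wasted search steps
    diversity = len(set(retrieved_sources)) / max(len(retrieved_sources), 1)
    efficiency = 1.0 - n_steps / max_steps
    r_process = 0.5 * diversity + 0.5 * efficiency
    return W_CORRECT * r_correct + W_EVIDENCE * r_evidence + W_PROCESS * r_process

r = trajectory_reward("lupus nephritis", "lupus nephritis",
                      ["guideline", "matched_case", "blog"], n_steps=4)
print(round(r, 3))
```

The key design point this illustrates is that the scalar fed to PPO depends on the whole trajectory (which sources were consulted, how many steps were spent), not just on whether the final answer matches the label.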

Empirical evaluation spans eight clinical centers and more than 24,000 cases, split into in‑distribution (ID) and out‑of‑distribution (OOD) test sets, with a special focus on rare diseases (≈27% of the data). Compared with strong baselines—including GPT‑4o, DeepSeek‑R1, MedRAG, MedCPT, Meditron, and several chain‑of‑thought or multi‑agent systems—Deep‑DxSearch achieves:

  • A 29.7% absolute gain in top‑1 accuracy on ID data and 9.7% on OOD data.
  • Up to 31.8% improvement on rare‑disease subsets.
  • Consistently high human‑rated reasoning quality (average score ≥ 4/4).

Ablation studies show that removing the process‑oriented reward reduces accuracy by 13.7%, confirming that the evidence‑centric objective is a key driver of performance.

In a physician‑in‑the‑loop study with 150 real‑world cases, physicians alone achieved 45.6% diagnostic accuracy; with Deep‑DxSearch assistance this rose to 69.1%. Clinicians highlighted the transparency of the evidence trail as a major advantage over competing models that provide only internal reasoning without verifiable citations.

Limitations include reliance on expert‑crafted reward components, the computational cost of maintaining a multi‑petabyte retrieval index, and potential latency in real‑time settings. The authors suggest future work on automated reward shaping, lightweight indexing, multimodal integration (e.g., imaging), and seamless coupling with hospital information systems.

Overall, Deep‑DxSearch demonstrates that end‑to‑end reinforcement‑learning of an LLM‑agent within a comprehensive medical retrieval environment can simultaneously boost diagnostic accuracy and provide a verifiable, traceable reasoning process. This represents a significant step toward trustworthy, evidence‑anchored AI assistants for high‑stakes clinical decision‑making.

