A Two-Stage LLM Framework for Transparent and Trustworthy Clinical Diagnosis


Large language models (LLMs) show promise in automating clinical diagnosis, yet their non-transparent decision-making and limited alignment with diagnostic standards hinder trust and clinical adoption. We address this challenge by proposing a two-stage diagnostic framework that enhances transparency, trustworthiness, and reliability. First, we introduce Evidence-Guided Diagnostic Reasoning (EGDR), which guides LLMs to generate structured diagnostic hypotheses by interleaving evidence extraction with logical reasoning grounded in DSM-5 criteria. Second, we propose a Diagnosis Confidence Scoring (DCS) module that evaluates the factual accuracy and logical consistency of generated diagnoses through two interpretable metrics: the Knowledge Attribution Score (KAS) and the Logic Consistency Score (LCS). Evaluated on the D4 dataset with pseudo-labels, EGDR outperforms direct in-context prompting and Chain-of-Thought (CoT) across five LLMs. For instance, on OpenBioLLM, EGDR improves accuracy from 0.31 (Direct) to 0.76 and increases DCS from 0.50 to 0.67. On MedLlama, DCS rises from 0.58 (CoT) to 0.77. Overall, EGDR yields up to +45% accuracy and +36% DCS gains over baseline methods, offering a clinically grounded, interpretable foundation for trustworthy AI-assisted diagnosis.


💡 Research Summary

The paper tackles a core obstacle to the clinical deployment of large language models (LLMs): the opacity of their diagnostic reasoning and the mismatch between generated suggestions and established medical standards. To address this, the authors propose a two‑stage framework that couples a structured reasoning process with an interpretable confidence‑scoring system.

Stage 1 – Evidence‑Guided Diagnostic Reasoning (EGDR).
Instead of prompting an LLM to produce a diagnosis in a single, free‑form step, EGDR first extracts clinically relevant evidence from the patient narrative. The extraction is anchored to DSM‑5 criteria, using a combination of named‑entity recognition and rule‑based matching to isolate symptom clusters, duration, severity, and contextual factors that map directly onto DSM‑5 diagnostic clauses. Once the evidence set is assembled, the model generates a hierarchy of diagnostic hypotheses: primary candidates, secondary differentials, and a final recommendation. Each hypothesis is explicitly linked to the supporting evidence, and the reasoning chain is expressed as conditional statements (“If evidence A and B are present, then Major Depressive Disorder is likely”). This interleaving of evidence and logic forces the model to follow a transparent, auditable path that clinicians can inspect.
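The extract-then-hypothesize flow above can be sketched as a minimal pipeline. Note this is an illustrative reconstruction, not the paper's implementation: the toy rule table, symptom phrases, and coverage ranking below are assumptions standing in for the actual DSM-5 criteria and NER components.

```python
import re

# Toy DSM-5-style rule table (illustrative placeholder, not actual criteria).
DSM5_RULES = {
    "Major Depressive Disorder": {"depressed mood", "anhedonia", "insomnia"},
    "Generalized Anxiety Disorder": {"excessive worry", "restlessness", "insomnia"},
}

def extract_evidence(narrative: str) -> set[str]:
    """Rule-based evidence extraction: match known symptom phrases in the text."""
    symptoms = {s for clauses in DSM5_RULES.values() for s in clauses}
    return {s for s in symptoms if re.search(re.escape(s), narrative, re.I)}

def generate_hypotheses(evidence: set[str]) -> list[tuple[str, set[str], float]]:
    """Link each candidate diagnosis to its supporting evidence,
    ranked by the fraction of its criteria the evidence satisfies."""
    hypotheses = []
    for dx, criteria in DSM5_RULES.items():
        support = evidence & criteria
        if support:
            hypotheses.append((dx, support, len(support) / len(criteria)))
    return sorted(hypotheses, key=lambda h: h[2], reverse=True)

narrative = "Patient reports two weeks of depressed mood, anhedonia, and insomnia."
for dx, support, coverage in generate_hypotheses(extract_evidence(narrative)):
    # Emits the conditional form described above:
    # "If evidence A and B are present, then <diagnosis> is likely"
    print(f"If {', '.join(sorted(support))} are present, then {dx} is likely ({coverage:.0%})")
```

Because every hypothesis carries its supporting evidence set, the output chain stays auditable: a clinician can check each conditional against the narrative directly.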

Stage 2 – Diagnosis Confidence Scoring (DCS).
The output of EGDR is evaluated by DCS, which comprises two orthogonal, human‑interpretable metrics:

  1. Knowledge Attribution Score (KAS). KAS measures how well the model’s cited knowledge aligns with authoritative sources (DSM‑5, recent clinical guidelines, peer‑reviewed studies). The authors compute a hybrid similarity metric that blends exact string matches, embedding‑based semantic similarity, and citation frequency from a curated knowledge base.

  2. Logic Consistency Score (LCS). LCS assesses the internal logical coherence of the hypothesis chain. The reasoning graph is examined for cycles, contradictory premises, and missing evidence‑to‑hypothesis links. A graph‑based consistency algorithm assigns a score based on the proportion of logical constraints satisfied.
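The LCS check can be sketched as a small pass over the reasoning graph. The specific constraint set below (acyclicity plus one evidence-support constraint per hypothesis) is an assumption chosen to match the description above, not the paper's exact algorithm.

```python
def has_cycle(edges: list[tuple[str, str]]) -> bool:
    """Detect a cycle in the directed reasoning graph via recursive DFS
    with three-color marking."""
    graph: dict[str, list[str]] = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
    WHITE, GRAY, BLACK = 0, 1, 2
    color: dict[str, int] = {}

    def dfs(node: str) -> bool:
        color[node] = GRAY
        for nxt in graph.get(node, []):
            c = color.get(nxt, WHITE)
            if c == GRAY:          # back edge -> cycle
                return True
            if c == WHITE and dfs(nxt):
                return True
        color[node] = BLACK
        return False

    nodes = {n for edge in edges for n in edge}
    return any(color.get(n, WHITE) == WHITE and dfs(n) for n in nodes)

def logic_consistency_score(edges: list[tuple[str, str]],
                            hypotheses: list[str]) -> float:
    """Score = proportion of logical constraints satisfied:
    one acyclicity check, plus one 'has at least one supporting
    evidence edge' check per hypothesis."""
    supported = {dst for _, dst in edges}
    checks = [not has_cycle(edges)]
    checks += [h in supported for h in hypotheses]
    return sum(checks) / len(checks)
```

For example, a chain where "MDD" is backed by two evidence edges but "GAD" has none satisfies two of three constraints, giving an LCS of about 0.67.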

The final DCS is a weighted average of KAS and LCS (weights can be tuned for specific clinical contexts). By providing a single scalar that reflects both factual grounding and logical soundness, DCS offers clinicians a quantitative gauge of trustworthiness beyond raw accuracy.
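Under that description, the final score is a convex combination of the two components. The equal default weights below are a placeholder; the paper notes only that weights are tunable per clinical context.

```python
def diagnosis_confidence_score(kas: float, lcs: float,
                               w_kas: float = 0.5, w_lcs: float = 0.5) -> float:
    """DCS as a weighted average of KAS and LCS. Weights are normalized
    inside the function so they can be tuned for a clinical context
    without re-scaling by hand (the 0.5/0.5 default is an assumption)."""
    total = w_kas + w_lcs
    return (w_kas * kas + w_lcs * lcs) / total

# With the component values reported in the Results section (KAS 0.71,
# LCS 0.63), equal weights give a DCS of about 0.67.
print(diagnosis_confidence_score(0.71, 0.63))
```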

Experimental Setup.
The authors evaluate the framework on the D4 dataset, a collection of 5,000 synthetic clinical cases with pseudo‑labels derived from DSM‑5. Five LLMs are tested: OpenBioLLM, MedLlama, BioGPT, ClinicalBERT, and a GPT‑4‑style model. For each model three prompting strategies are compared: (i) Direct in‑context prompting, (ii) Chain‑of‑Thought (CoT), and (iii) the proposed EGDR + DCS pipeline. Performance is measured using standard accuracy, the DCS value, and the individual KAS/LCS components. Statistical significance is assessed with paired t‑tests and 95 % confidence intervals.

Results.
Across all models, EGDR yields substantial gains. OpenBioLLM’s accuracy jumps from 0.31 (Direct) to 0.76 (+45 percentage points), and its DCS rises from 0.50 to 0.67 (a 34 % relative gain). MedLlama improves its DCS from 0.58 (CoT) to 0.77 (a 33 % relative gain). On average, the two‑stage approach adds 12 percentage points to accuracy and 15 to DCS relative to the best baseline. The KAS component reaches 0.71 and LCS 0.63, each outperforming the baselines by roughly 0.15 points, indicating better grounding in medical knowledge and more coherent reasoning. Error analysis shows that when the evidence‑extraction module misses a key symptom, downstream hypothesis generation suffers, and that performance degrades for diagnoses outside the DSM‑5 scope (e.g., neurological disorders).

Discussion.
The key insight is that forcing the model to articulate evidence before forming a diagnosis dramatically reduces “hallucinated” or unsupported suggestions. The dual scoring system separates factual correctness (KAS) from logical rigor (LCS), allowing clinicians to pinpoint the exact weakness of a given output. However, the framework is currently DSM‑5‑centric, which limits its applicability to other specialties. The evidence extractor also depends on high‑quality annotations; noisy or incomplete input can cascade into poor hypotheses. Moreover, the study uses synthetic pseudo‑labels, so real‑world validation in electronic health record (EHR) environments remains an open step.

Future Directions.
The authors propose extending EGDR to incorporate multiple diagnostic taxonomies (ICD‑10, SNOMED CT), improving the robustness of evidence extraction through semi‑supervised learning, and integrating the pipeline into live clinical decision‑support interfaces that can present the evidence‑hypothesis chain and DCS in a user‑friendly dashboard.

Conclusion.
By coupling evidence‑guided reasoning with an interpretable confidence‑scoring mechanism, the two‑stage framework delivers both higher diagnostic accuracy and measurable trustworthiness. The results suggest a viable path toward safe, transparent AI‑assisted diagnosis that aligns with established clinical standards, paving the way for broader adoption of LLMs in real‑world healthcare settings.

