MediEval and Safety Fine-Tuning for Evaluating Large Language Models in Clinical Settings

Reading time: 5 minutes

📝 Abstract

Large Language Models (LLMs) are increasingly applied to medicine, yet their adoption is limited by concerns over reliability and safety. Existing evaluations either test factual medical knowledge in isolation or assess patient-level reasoning without verifying correctness, leaving a critical gap. We introduce MediEval, a benchmark that links MIMIC-IV electronic health records (EHRs) to a unified knowledge base built from UMLS and other biomedical vocabularies. MediEval generates diverse factual and counterfactual medical statements within real patient contexts, enabling systematic evaluation across a 4-quadrant framework that jointly considers knowledge grounding and contextual consistency. Using this framework, we identify critical failure modes, including hallucinated support and truth inversion, that current proprietary, open-source, and domain-specific LLMs frequently exhibit. To address these risks, we propose Counterfactual Risk-Aware Fine-tuning (CoRFu), a DPO-based method with an asymmetric penalty targeting unsafe confusions. CoRFu improves by +16.4 macro-F1 points over the base model and eliminates truth inversion errors, demonstrating both higher accuracy and substantially greater safety.
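The 4-quadrant framework can be pictured as the cross of two binary axes: whether a model's judgment agrees with the knowledge base (knowledge grounding) and whether it agrees with the patient record (contextual consistency). The paper's exact definitions are not reproduced in this summary, so the sketch below is an assumed interpretation; in particular, the placement of "hallucinated support" and "truth inversion" in the two off-diagonal cells is a guess based on the failure-mode names in the abstract.

```python
def quadrant(knowledge_grounded: bool, context_consistent: bool) -> str:
    """Map the two binary evaluation axes to one of four quadrants.

    The labels for the two failure quadrants ("truth inversion",
    "hallucinated support") follow the failure modes named in the
    abstract; their placement here is an assumption, not the paper's
    exact definition.
    """
    if knowledge_grounded and context_consistent:
        return "safe: correct and grounded"
    if knowledge_grounded and not context_consistent:
        # Knows the fact in isolation but applies it wrongly in context.
        return "failure: truth inversion"
    if not knowledge_grounded and context_consistent:
        # Fluent, context-matching output unsupported by the knowledge base.
        return "failure: hallucinated support"
    return "failure: ungrounded and inconsistent"
```

Evaluating every model statement through such a grid is what lets the benchmark separate "knowing a fact" from "using it safely", rather than reporting a single accuracy number.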

📄 Content

MediEval: A Unified Medical Benchmark for Patient-Contextual and Knowledge-Grounded Reasoning in LLMs

Zhan Qu and Michael Färber
TU Dresden and ScaDS.AI, Germany
{zhan.qu, michael.faerber}@tu-dresden.de

Abstract

Large Language Models (LLMs) are increasingly applied to medicine, yet their adoption is limited by concerns over reliability and safety. Existing evaluations either test factual medical knowledge in isolation or assess patient-level reasoning without verifying correctness, leaving a critical gap. We introduce MediEval, a benchmark that links MIMIC-IV electronic health records (EHRs) to a unified knowledge base built from UMLS and other biomedical vocabularies. MediEval generates diverse factual and counterfactual medical statements within real patient contexts, enabling systematic evaluation across a 4-quadrant framework that jointly considers knowledge grounding and contextual consistency. Using this framework, we identify critical failure modes, including hallucinated support and truth inversion, that current proprietary, open-source, and domain-specific LLMs frequently exhibit. To address these risks, we propose Counterfactual Risk-Aware Fine-tuning (CoRFu), a DPO-based method with an asymmetric penalty targeting unsafe confusions. CoRFu improves by +16.4 macro-F1 points over the base model and eliminates truth inversion errors, demonstrating both higher accuracy and substantially greater safety.

1 Introduction

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, with medicine being among the most high-impact areas of application. In clinical contexts, LLMs have been explored for tasks such as summarizing electronic health records (EHRs), generating discharge instructions, providing clinical decision support, and answering medical questions (Rajpurkar et al., 2022; Singhal et al., 2025; Moor et al., 2023; Singhal et al., 2023; Pal et al., 2022).
Their appeal lies in the ability to integrate unstructured text with medical knowledge, potentially reducing documentation burden and assisting clinicians in decision-making.

Translating research prototypes into real-world deployment hinges critically on reliability and safety. Unlike generic NLP tasks, medical reasoning requires not only factual correctness but also contextual grounding in patient-specific data while adhering to verified medical knowledge. Errors in this setting are not mere degradations in performance but risks that can directly translate into patient harm (Thirunavukarasu et al., 2023; Yang et al., 2023; Haltaufderheide and Ranisch, 2024). A critical challenge is that LLMs often fail to apply medical knowledge consistently within the heterogeneous and noisy context of patient records (Zhou et al., 2025b). For example, a model may state that metformin, a therapy for type 2 diabetes, is contraindicated in severe renal impairment, yet fail to apply this knowledge when the condition appears in a noisy and heterogeneous patient record. The cause of such errors may be that current models are trained to recall facts in isolation rather than to integrate them with diverse patient information. Such inconsistencies expose a critical gap between knowing medical facts and using them safely.

Existing evaluation paradigms only partially address this gap. Medical data are uniquely challenging because patient records are heterogeneous with free-text clinical notes and tabular entries coded in different systems for diagnoses, procedures, and medications. Medical knowledge is hierarchical and ontology-driven (e.g., UMLS, SNOMED CT, RxNorm), but its large scale, noise, and limited cross-vocabulary connectivity make consistent reasoning difficult.
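The metformin example illustrates how small the gap between "knowing" and "applying" really is. As a hypothetical illustration (the record schema and the `egfr` field below are invented for this sketch; the eGFR < 30 threshold reflects common prescribing guidance for metformin), applying the contraindication to a patient record is a one-line rule, yet the point of the paper is that LLMs often miss exactly this step when the same information is buried in a noisy record:

```python
# Hypothetical patient record; the schema and field names are invented
# for this sketch and do not come from MIMIC-IV.
patient = {
    "medications": ["metformin", "lisinopril"],
    "labs": {"egfr": 24},  # mL/min/1.73 m^2 -- severe renal impairment
}

def metformin_contraindicated(record: dict) -> bool:
    """Flag metformin use when eGFR indicates severe renal impairment.

    The eGFR < 30 cutoff reflects common prescribing guidance; it is
    used here only to make the knowledge-vs-application gap concrete.
    """
    on_metformin = "metformin" in record["medications"]
    egfr = record["labs"].get("egfr")
    return on_metformin and egfr is not None and egfr < 30
```

A model that can recite the contraindication but answers inconsistently on records like this one is exactly the failure mode MediEval is designed to surface.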
Benchmarks based on EHRs assess the ability to extract or reason over structured data, but often reduce the task to retrieval or serialization without verifying medical soundness (Lovón-Melgarejo et al., 2025). In contrast, knowledge-based evaluations test whether LLMs can handle logical transformations of medical facts (Zhou et al., 2025a; Sung et al., 2021), but do not connect reasoning to real patient contexts. The field thus lacks a unified framework that probes whether LLMs can (i) remain faithful to medical knowledge and (ii) apply it consistently to individual patient records.

(arXiv:2512.20822v1 [cs.CL] 23 Dec 2025)

[Figure 1: Overview of the current work with a real example; texts in blue indicate the extracted sample.]

In this paper, we address this gap with MediEval (Figure 1), a benchmark and evaluation framework that links real patient records (MIMIC-IV) (Johnson et al., 2023) with a unified biomedical knowledge base built from UMLS (Bodenreider, 2004), SNOMED CT (Donnelly et al., 2006), and RxNorm (Liu et al., 2005; Nelson et al., 2011). To ensure rigorous construction, MediEval constructs evaluation statements by applying graph-guided substitutions and recombinations within biomedical ontologies, with pla
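The graph-guided substitution step can be sketched in miniature. This is an assumed mechanism, not the paper's implementation: a factual statement is turned into a counterfactual by swapping one entity for a sibling under the same ontology parent. The toy `ONTOLOGY` dictionary and the `counterfactual` helper below are invented stand-ins for the UMLS/SNOMED CT/RxNorm hierarchies the benchmark actually uses.

```python
import random

# Toy ontology fragment: parent concept -> sibling entities.
# A stand-in for UMLS/SNOMED CT/RxNorm, invented for illustration.
ONTOLOGY = {
    "antidiabetic_drug": ["metformin", "glipizide", "insulin glargine"],
}

def counterfactual(statement: str, entity: str, parent: str,
                   rng: random.Random) -> str:
    """Produce a counterfactual variant of a factual statement by
    replacing `entity` with a randomly chosen ontology sibling."""
    siblings = [e for e in ONTOLOGY[parent] if e != entity]
    return statement.replace(entity, rng.choice(siblings))

fact = "The patient was prescribed metformin for type 2 diabetes."
cf = counterfactual(fact, "metformin", "antidiabetic_drug", random.Random(0))
```

Because the substitute is drawn from the same ontological category, the counterfactual stays medically plausible on the surface, which is what makes it a demanding test of knowledge-grounded reasoning.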

This content is AI-processed based on ArXiv data.
