The Paradox of Large Language Model Alignment Strategies in Infertility Treatment: The Gap Between Accuracy and Clinician Trust

Reading time: 5 minutes

📝 Abstract

Large language models (LLMs) are increasingly adopted in clinical decision support, yet aligning them with the multifaceted reasoning pathways of real-world medicine remains a major challenge. Using more than 8,000 infertility treatment records, we systematically evaluate four alignment strategies: Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), Group Relative Policy Optimization (GRPO), and In-Context Learning (ICL) through a dual-layer framework combining automatic benchmarks with blinded doctor-in-the-loop assessments. GRPO achieves the highest algorithmic accuracy across multiple decision layers, confirming the value of reinforcement-based optimization for structured prediction tasks. However, clinicians consistently prefer the SFT model, citing clearer reasoning processes (p = 0.035) and higher therapeutic feasibility (p = 0.019). In blinded pairwise comparisons, SFT attains the highest winning rate (51.2%), outperforming both GRPO (26.2%) and even physicians’ original decisions (22.7%). These results reveal an alignment paradox: algorithmic improvements do not necessarily translate into higher clinical trust, and can diverge from human-centered preferences. Our findings highlight the need for alignment strategies that prioritize clinically interpretable and practically feasible reasoning, rather than solely optimizing decision-level accuracy.


📄 Content

Large language models (LLMs) have rapidly advanced across domains, yet how their post-training alignment interacts with the hierarchical, high-stakes reasoning of real-world medicine remains poorly understood. These alignment methods usually involve reinforcement learning variants such as Reinforcement Learning with Human Feedback (RLHF), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO), which have shown remarkable improvements in some general areas by using verifiable rewards or preference signals to optimize the policy model [1][2][3][4][5]. For example, DeepSeek-R1 adopted GRPO, and its performance on math questions significantly outperformed the baseline. However, clinical decision-making presents a fundamentally different challenge: reasoning chains are long and multi-dimensional, rewards are noisy or unverifiable, and clinicians rely heavily on explanation quality and feasibility rather than output correctness alone. These properties suggest a structural tension between algorithmic alignment and clinical alignment, a tension that has not been mechanistically examined. Domain-specific medical LLMs, such as Med-PaLM, MedGemma, and Lingshu, have shown strong potential in solving clinical problems, ranging from text classification to clinical image reports [6][7][8][9]. These models demonstrate capabilities that surpass human-level performance in answer correctness on benchmark datasets. Despite their outstanding performance, when these state-of-the-art (SOTA) models are applied to specialized real-world clinical cases rather than general consultations, where they could support realistic diagnostic operations, they encounter obstacles [10,11].
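The group-relative mechanism underlying GRPO can be illustrated with a minimal sketch: several candidate responses to the same prompt are scored by a verifiable reward, and each response's advantage is its reward relative to the group mean, normalized by the group's standard deviation. This is a simplified illustration of the advantage computation only, not the full policy-gradient update; the function name and example rewards are assumptions for exposition.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: score each sampled response relative to
    the mean reward of its group, normalized by the group's std.
    A positive advantage reinforces that response; a negative one
    suppresses it."""
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean_r) / std_r for r in rewards]

# Four candidate responses to one prompt, scored by a verifiable reward
advs = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Because advantages are computed within each sampled group rather than against a learned value model, GRPO avoids training a separate critic, which is part of its appeal for structured prediction tasks.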
Although SOTA model-generated responses were preferred to those provided by generalist physicians, they still did not outperform domain-specific specialists, indicating limited utility for highly specialized and practical clinical problems [12]. Moreover, their opaque reasoning processes limit clinical adoption, despite evidence that explanation-based systems (“XAI”) enhance trust and interpretability [13][14][15][16]. While supervised fine-tuning (SFT) remains the dominant paradigm in medical model development [10,17], data scarcity and incomplete supervision motivate increasing reliance on post-training alignment [8,18,19]. Yet, despite the rapid adoption of RLHF-style optimization, little is known about whether improvements in algorithmic metrics translate to enhanced interpretability, trust, or multi-step clinical reasoning. Even fewer studies have evaluated how different alignment paradigms behave in real-world, interdependent clinical decision environments where reasoning quality directly affects patient safety. This gap raises a fundamental scientific question: does stronger post-training alignment produce more clinically aligned models, or can optimization distort clinical reasoning, giving rise to an alignment paradox? Infertility, recognized as a medical disease and global health issue by the World Health Organization (WHO), is estimated to be experienced by one in six people at some stage of their lives globally [20]. Assisted Reproductive Technology (ART) has become a central clinical pathway for these patients, typically after the failure of conventional treatments such as cycle regulation or ovulation induction [21].
ART serves as an ideal testbed for reasoning alignment because it demands the rigorous integration of high-dimensional data, including age, Anti-Müllerian Hormone (AMH), Follicle-Stimulating Hormone (FSH), Body Mass Index (BMI), endocrine profiles, past medical history, and gynecological ultrasound findings, to formulate a safe and effective treatment plan. Beyond selecting the ART strategy, clinicians must also determine the controlled ovarian stimulation (COS) protocol and the gonadotropin (Gn) starting dose, decisions that are interdependent and sensitive to subtle variations in the patient’s physiological state. Consequently, ART decision-making is a multistage, high-dimensional, and evidence-driven process that is both time-consuming and cognitively demanding. Clinical outcomes are highly dependent on nuanced reasoning: inappropriate treatment selection can reduce cycle success rates, increase financial and emotional burden, and even elevate the risk of severe complications such as ovarian hyperstimulation syndrome (OHSS) [22]. These decisions rely heavily on individual clinical experience, leading to substantial inter-physician and inter-center variability and limiting the scalability of standardized training for novice practitioners. Moreover, high outpatient volumes can induce decision fatigue among experienced clinicians [23]. The inherent complexity, multi-layered structure, and safety-critical nature of ART therefore make it not only a representative real-world medical decision environment, but also an exceptionally sensitive setting in w
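The layered, interdependent nature of these decisions can be sketched with a toy structure: a patient profile of the features named above feeding a gonadotropin starting-dose heuristic. The field names, thresholds, and dose values here are illustrative assumptions chosen only to show how interdependent inputs drive a downstream decision; they are not drawn from the paper and are not clinical guidance.

```python
from dataclasses import dataclass

@dataclass
class PatientProfile:
    age: int
    amh_ng_ml: float   # Anti-Müllerian Hormone, ng/mL
    fsh_iu_l: float    # Follicle-Stimulating Hormone, IU/L
    bmi: float         # Body Mass Index, kg/m^2
    prior_cycles: int  # previous ART cycles

def gn_starting_dose(p: PatientProfile) -> int:
    """Toy heuristic for a gonadotropin (Gn) starting dose in IU/day.
    Illustrative only: shows how several features jointly shift one
    layer of a multistage decision; thresholds are made up."""
    dose = 150
    if p.amh_ng_ml < 1.0 or p.fsh_iu_l > 10.0:
        dose += 75   # markers suggesting reduced ovarian reserve
    if p.bmi >= 30.0:
        dose += 37   # higher BMI often requires dose adjustment
    return min(dose, 300)  # cap to limit overstimulation (OHSS) risk
```

Even this toy version makes the paper's point concrete: the dose decision depends on several features at once, and a model must expose why each factor moved the recommendation for clinicians to trust it.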

This content is AI-processed based on ArXiv data.
