Augmenting Clinical Decision-Making with an Interactive and Interpretable AI Copilot: A Real-World User Study with Clinicians in Nephrology and Obstetrics
Clinician skepticism toward opaque AI hinders adoption in high-stakes healthcare. We present AICare, an interactive and interpretable AI copilot for collaborative clinical decision-making. By analyzing longitudinal electronic health records, AICare grounds dynamic risk predictions in scrutable visualizations and LLM-driven diagnostic recommendations. Through a within-subjects, counterbalanced study with 16 clinicians across nephrology and obstetrics, we comprehensively evaluated AICare using objective measures (task completion time and error rate), subjective assessments (NASA-TLX, SUS, and confidence ratings), and semi-structured interviews. Our findings indicate that AICare reduced perceived cognitive workload. Beyond performance metrics, qualitative analysis reveals that trust is actively constructed through verification, with interaction strategies diverging by expertise: junior clinicians used the system as cognitive scaffolding to structure their analysis, while experts engaged in adversarial verification to challenge the AI’s logic. This work offers design implications for creating AI systems that function as transparent partners, accommodating diverse reasoning styles to augment rather than replace clinical judgment.
💡 Research Summary
This paper addresses the persistent “last‑mile” problem that hampers the deployment of artificial intelligence in high‑stakes clinical settings. The authors introduce AICare, an interactive and interpretable AI copilot designed to support, rather than replace, clinicians during the intermediate stages of diagnostic reasoning. AICare processes longitudinal electronic health records (EHRs) to generate dynamic risk predictions and grounds these predictions in four tightly integrated modules: (1) a risk‑trajectory visualization that plots a patient’s predicted risk over time, (2) an interactive feature‑importance list (computed with SHAP) that allows on‑demand drill‑down into the temporal evolution of key variables, (3) a large‑language‑model (LLM)‑driven narrative that translates the model’s findings into concise, clinically phrased text, and (4) a population‑level indicator comparison that situates the patient’s data within cohort trends. All components are embedded directly into the hospital’s information system, enabling real‑time data access without disrupting existing workflows.
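As a concrete illustration of module (2), the sketch below shows one way per‑visit SHAP attributions could back an on‑demand feature‑importance drill‑down. The model choice, feature names, and synthetic data are illustrative assumptions, not the paper’s implementation.

```python
# Minimal sketch: per-visit SHAP attribution for a risk model over EHR
# features. Everything here (model, features, data) is a stand-in.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Hypothetical longitudinal EHR snapshot: one row per patient visit,
# columns are lab values recorded at that visit.
feature_names = ["creatinine", "albumin", "hemoglobin", "systolic_bp"]
X = rng.normal(size=(500, len(feature_names)))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Any risk model could sit here; a gradient-boosted classifier is a stand-in.
model = GradientBoostingClassifier().fit(X, y)

# SHAP attributes each visit's predicted risk to its input features.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Rank features for a single visit, as an on-demand drill-down might.
visit = 0
order = np.argsort(-np.abs(shap_values[visit]))
for i in order:
    print(f"{feature_names[i]:>12}: {shap_values[visit][i]:+.3f}")
```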
To evaluate the system, the authors conducted a within‑subjects, counterbalanced user study with 16 practicing clinicians (8 senior nephrologists/obstetricians and 8 junior residents) across two high‑risk domains: chronic kidney disease management in nephrology and preterm‑birth risk assessment in obstetrics. Each participant completed diagnostic tasks under two conditions (using AICare and using their usual manual analysis) while objective metrics (task completion time, error rate) and subjective metrics (NASA‑TLX for cognitive workload, the System Usability Scale, and self‑reported confidence) were recorded. Semi‑structured interviews and interaction logs provided qualitative insight.
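For readers unfamiliar with counterbalancing, here is a minimal sketch of how condition order might be assigned so that half the participants encounter AICare first and half encounter manual analysis first; the participant IDs and stratification are illustrative assumptions.

```python
# Minimal sketch of counterbalanced condition ordering (AB/BA design).
from itertools import cycle

participants = [f"P{i:02d}" for i in range(1, 17)]  # 16 clinicians
orders = cycle([("AICare", "Manual"), ("Manual", "AICare")])

# Alternating AB/BA orders balances practice and fatigue effects across
# conditions; in practice this would be done within each stratum
# (e.g., seniority x specialty).
assignment = {p: next(orders) for p in participants}
for p, order in assignment.items():
    print(p, "->", " then ".join(order))
```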
Quantitative results show that AICare significantly reduced perceived cognitive workload (p = 0.023) and increased diagnostic confidence (p = 0.018) compared with the baseline condition. Diagnostic accuracy remained comparable, and overall task duration did not differ significantly. However, senior clinicians spent slightly more time on tasks when using AICare, a pattern linked to a higher frequency of data‑exploration actions. Log analysis revealed two distinct interaction strategies: senior clinicians engaged in “adversarial verification,” repeatedly interrogating the risk trajectory and feature‑importance explanations to challenge and align the AI’s reasoning with their own mental models; junior clinicians treated the system as a “cognitive scaffold,” using the visualizations to structure their analysis while performing fewer deep dives.
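The summary does not state which statistical tests produced these p‑values; the sketch below shows one plausible analysis for paired, within‑subjects ratings such as NASA‑TLX with n = 16 (a Wilcoxon signed‑rank test), using synthetic scores rather than the study’s data.

```python
# Minimal sketch of a within-subjects comparison of workload scores.
# The test choice and the synthetic scores are assumptions.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)

# Hypothetical NASA-TLX workload scores (0-100) per clinician, per condition.
tlx_manual = rng.uniform(40, 80, size=16)
tlx_aicare = tlx_manual - rng.uniform(0, 15, size=16)  # lower = less workload

# Paired test: each clinician serves as their own control.
stat, p = wilcoxon(tlx_manual, tlx_aicare)
print(f"Wilcoxon W={stat:.1f}, p={p:.3f}")
```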
Qualitative findings reconceptualize trust as an active, iterative process rather than a static attitude. Participants reported that moments of disagreement with the AI did not erode trust; instead, when the AI’s rationale was transparently presented, clinicians used those moments to reflect critically, sometimes revising their own conclusions or prompting the AI to clarify its reasoning. The LLM‑generated narrative was praised for reducing information overload and providing a familiar clinical language bridge between raw model outputs and actionable insights.
From these observations the authors derive several design implications: (1) interactive explanation interfaces must accommodate both novice “scaffolding” and expert “adversarial verification” workflows; (2) natural‑language summaries powered by LLMs should be concise yet evidence‑based to support rapid sense‑making; (3) seamless integration with existing EHRs is essential to avoid workflow friction; and (4) AI systems should be positioned as collaborative partners that expose their reasoning rather than as authoritative black boxes.
The paper’s contributions are threefold: (i) the end‑to‑end design and deployment of AICare, a real‑world, interpretable AI copilot integrated into hospital information systems; (ii) a mixed‑methods empirical evaluation across two distinct clinical domains and varying levels of expertise, revealing shared and context‑specific needs for AI adoption; and (iii) actionable design guidelines grounded in observed verification mechanisms, highlighting how AI can augment clinical judgment without undermining professional agency.
Limitations include the modest sample size, the restriction to two specialties (nephrology and obstetrics), and the absence of longitudinal outcome tracking. Future work should expand to additional specialties, conduct multi‑site trials to test generalizability, and explore tighter coupling of AI recommendations with concrete order‑entry or guideline‑linking functionalities.