AI chatbots versus human healthcare professionals: a systematic review and meta-analysis of empathy in patient care

Background: Empathy is widely recognized for improving patient outcomes, including reduced pain and anxiety and improved satisfaction, and its absence can cause harm. Meanwhile, use of artificial intelligence (AI)-based chatbots in healthcare is rapidly expanding, with one in five general practitioners using generative AI to assist with tasks such as writing letters. Some studies suggest AI chatbots can outperform human healthcare professionals (HCPs) in empathy, though findings are mixed and have not been synthesized.

Sources of data: We searched multiple databases for studies comparing AI chatbots using large language models with human HCPs on empathy measures. We assessed risk of bias with ROBINS-I and synthesized findings using random-effects meta-analysis where feasible, whilst avoiding double counting.

Areas of agreement: We identified 15 studies (2023–2024). Thirteen reported statistically significantly higher empathy ratings for AI; the remaining two, both in dermatology, favoured human responses. Of the 15 studies, 13 provided extractable data and were suitable for pooling. Meta-analysis of those 13 studies, all using ChatGPT-3.5/4, showed a standardized mean difference of 0.87 (95% CI 0.54–1.20) favouring AI (P < .00001), roughly equivalent to a two-point increase on a 10-point scale.

Areas of controversy: Studies relied on text-based assessments that overlook non-verbal cues and evaluated empathy through proxy raters rather than patients themselves.

Growing points: Our findings indicate that, in text-only scenarios, AI chatbots are frequently perceived as more empathic than human HCPs.

Areas timely for developing research: Future research should validate these findings with direct patient evaluations and assess whether emerging voice-enabled AI systems can deliver similar empathic advantages.
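As a quick check on the abstract's two-point figure: an SMD expresses the between-group difference in pooled standard-deviation units, so the raw-scale difference is the SMD multiplied by the pooled standard deviation. Assuming a pooled SD of roughly 2.3 points on a 10-point empathy scale (an illustrative value, not one reported in the abstract), the arithmetic works out as:

\[
\Delta_{\text{raw}} \;=\; \text{SMD} \times SD_{\text{pooled}} \;\approx\; 0.87 \times 2.3 \;\approx\; 2.0 \ \text{points}
\]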


💡 Research Summary

This paper presents a systematic review and meta-analysis evaluating whether AI chatbots based on large language models (LLMs) demonstrate greater empathy than human healthcare professionals (HCPs) in patient-care interactions. The authors searched seven major databases (PubMed, Cochrane Library, Embase, PsycINFO, CINAHL, Scopus, IEEE Xplore) plus clinical trial registries up to November 2024, following PRISMA 2020 guidelines. Inclusion criteria required empirical comparisons of empathy between AI chatbots (using LLMs such as GPT-3.5, GPT-4, Claude, Gemini, Med-PaLM 2) and human HCPs, with interactions derived from real patient-generated text (emails, portal messages, online forum posts). Studies using rule-based bots, hypothetical scenarios, or lacking original data were excluded.

Fifteen studies published between 2023 and 2024 met the criteria; 13 provided extractable quantitative data suitable for pooling. The studies spanned a wide range of clinical topics (general health queries, dermatology, oncology, endocrinology, rheumatology, neurology, surgery, and others) and compared AI responses with various human comparators (physicians, surgeons, nurses, reception staff). All but one study used text-only interaction; the exception transcribed speech to text for the LLM and converted the response back to audio, though empathy ratings were still based on the transcript.

Risk of bias was assessed with ROBINS‑I. Nine studies were rated as having moderate risk, six as serious. Major concerns included reliance on non‑validated, single‑item Likert scales (14 of 15 studies), heterogeneous raters (patient proxies, clinicians, psychology trainees, laypeople), selection bias from Reddit or other online forums, and, in a few cases, supervised AI outputs that could confound performance assessment. Only one study used a validated instrument (the CARE scale).

Meta-analysis of the 13 studies using GPT-3.5 or GPT-4 yielded a standardized mean difference (SMD) of 0.87 (95% CI 0.54–1.20, P < .00001), equivalent to roughly a two-point increase on a 10-point empathy scale. Subgroup analysis suggested slightly larger effects for GPT-4 than for GPT-3.5, but heterogeneity was substantial (I² > 70%). Two dermatology studies favoured human responses, indicating possible specialty-specific differences.
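For readers who want to see how such a pooled estimate is computed, the sketch below implements DerSimonian-Laird random-effects pooling along with Cochran's Q and the I² heterogeneity statistic. This is a common estimator for random-effects meta-analysis; the review does not state which estimator it used, and the per-study SMDs and standard errors here are hypothetical placeholders, not values extracted from the 13 studies.

```python
import numpy as np

# Hypothetical per-study standardized mean differences and standard errors
# (placeholders for illustration; not the review's extracted data).
smd = np.array([1.2, 0.9, 0.4, 1.1, 0.7])
se = np.array([0.30, 0.25, 0.20, 0.35, 0.22])

# Fixed-effect (inverse-variance) weights and pooled estimate
w = 1.0 / se**2
theta_fixed = np.sum(w * smd) / np.sum(w)

# Cochran's Q and the I^2 statistic (share of variability beyond chance)
q = np.sum(w * (smd - theta_fixed) ** 2)
df = len(smd) - 1
i2 = max(0.0, (q - df) / q) * 100

# DerSimonian-Laird estimate of between-study variance tau^2
c = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0.0, (q - df) / c)

# Random-effects weights, pooled SMD, and 95% confidence interval
w_re = 1.0 / (se**2 + tau2)
theta_re = np.sum(w_re * smd) / np.sum(w_re)
se_re = np.sqrt(1.0 / np.sum(w_re))
lo, hi = theta_re - 1.96 * se_re, theta_re + 1.96 * se_re

print(f"Pooled SMD = {theta_re:.2f} (95% CI {lo:.2f} to {hi:.2f}), I^2 = {i2:.0f}%")
```

With real extracted data one would typically use a dedicated package such as statsmodels' meta-analysis module or R's metafor, but the arithmetic above is the core of the computation.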

The authors conclude that, in purely textual exchanges, LLM-based chatbots are frequently perceived as more empathic than human providers. However, they caution that the evidence is limited by methodological weaknesses: the absence of non-verbal cues, proxy raters rather than actual patients, and non-standardized empathy measures. They recommend that future research (1) conduct randomized controlled trials with real patients, (2) evaluate voice-enabled or multimodal AI systems, (3) employ validated empathy scales (e.g. CARE, CEEQ), and (4) explore cultural and linguistic generalizability. Until such high-quality data are available, the clinical significance of AI-delivered empathy remains uncertain.

