Evaluation format, not model capability, drives triage failure in the assessment of consumer health AI
Ramaswamy et al. reported in Nature Medicine that ChatGPT Health under-triages 51.6% of emergencies, concluding that consumer-facing AI triage poses safety risks. However, their evaluation used an exam-style protocol – forced A/B/C/D output, knowledge suppression, and a prohibition on clarifying questions – that differs fundamentally from how consumers use health chatbots. We tested five frontier LLMs (GPT-5.2, Claude Sonnet 4.6, Claude Opus 4.6, Gemini 3 Flash, Gemini 3.1 Pro) on a 17-scenario partial replication bank under constrained (exam-style, 1,275 trials) and naturalistic (patient-style messages, 850 trials) conditions, with targeted ablations and prompt-faithful checks using the authors' released prompts. Naturalistic interaction improved triage accuracy by 6.4 percentage points ($p = 0.015$). Diabetic ketoacidosis was correctly triaged in 100% of trials across all models and conditions. Asthma triage improved from 48% to 80%. The forced A/B/C/D format was the dominant failure mechanism: three models scored 0–24% with forced choice but 100% with free text (all $p < 10^{-8}$), consistently recommending emergency care in their own words while the forced-choice format registered under-triage. Prompt-faithful checks on the authors' exact released prompts confirmed the scaffold produces model-dependent, case-dependent results. Our results suggest that the headline under-triage rate is highly contingent on evaluation format and may not generalize as a stable estimate of deployed triage behavior. Valid evaluation of consumer health AI requires testing under conditions that reflect actual use.
💡 Research Summary
This paper critically re‑examines the headline claim from Ramaswamy et al. (Nature Medicine) that “ChatGPT Health under‑triages 51.6 % of emergencies,” a finding that has generated widespread media attention and policy concern. The original study evaluated the model using an exam‑style scaffold: a forced A/B/C/D choice, a “base your answer only on the information in this message” instruction that suppresses the model’s background knowledge, and a prohibition on asking clarifying questions. While such a protocol mimics a multiple‑choice exam, it diverges sharply from how consumers actually interact with health chatbots, which involve free‑text input, iterative clarification, and the ability to draw on prior conversational context.
To assess whether the reported under‑triage rate is a stable property of modern large language models (LLMs) or an artifact of the evaluation design, the authors conducted a partial mechanistic replication using five frontier LLMs: GPT‑5.2 (OpenAI), Claude Sonnet 4.6 and Claude Opus 4.6 (Anthropic), Gemini 3 Flash and Gemini 3.1 Pro (Google). They built a bank of 17 clinical scenarios (including the two emergency cases from the original study: diabetic ketoacidosis and asthma exacerbation) and tested each model under two contrasting conditions:
- Constrained (exam-style) condition – identical to the original scaffold: structured vignette, system prompt, forced A/B/C/D output, knowledge suppression, and no clarifying questions. Three prompt variants (structured, patient-realistic, patient-minimal) were each run five times per model, yielding 1,275 trials (5 models × 17 cases × 3 variants × 5 runs).
- Naturalistic condition – patient-style free-text messages without any scaffolding, system prompt, or forced choice. Two patient-style variants (realistic and minimal) were each run five times, producing 850 trials (5 models × 17 cases × 2 variants × 5 runs). Free-text responses were adjudicated by two independent LLM adjudicators (GPT-5.2 and Claude Opus 4.6) using a standardized rubric mapping free-text recommendations back to the A–D triage categories; inter-rater agreement was 94.7% (κ = 0.921). A schematic of the two conditions and the adjudication step is sketched below.
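To make the two protocols concrete, the sketch below shows how constrained and naturalistic trial prompts could be assembled, how the reported trial counts follow from the design, and how agreement between the two LLM adjudicators can be summarized. The scenario identifiers, prompt wording, and helper names are illustrative assumptions, not the authors' released materials.

```python
from itertools import product
from sklearn.metrics import cohen_kappa_score  # adjudicator agreement (kappa)

MODELS = ["gpt-5.2", "claude-sonnet-4.6", "claude-opus-4.6",
          "gemini-3-flash", "gemini-3.1-pro"]
CASES = [f"case_{i:02d}" for i in range(1, 18)]            # 17-scenario bank
CONSTRAINED_VARIANTS = ["structured", "patient_realistic", "patient_minimal"]
NATURALISTIC_VARIANTS = ["patient_realistic", "patient_minimal"]
RUNS = 5

def constrained_prompt(vignette: str) -> str:
    """Exam-style scaffold: forced A/B/C/D, knowledge suppression, no questions.
    Option wording here is illustrative, not the original study's text."""
    return (
        f"{vignette}\n\n"
        "Base your answer only on the information in this message. "
        "Do not ask clarifying questions. Answer with a single letter:\n"
        "A) Call emergency services  B) Go to the emergency department today\n"
        "C) See a doctor within a week  D) Self-care at home"
    )

def naturalistic_prompt(patient_message: str) -> str:
    """Naturalistic condition: the patient-style message is sent unmodified,
    with no system prompt, scaffold, or forced choice."""
    return patient_message

# The reported trial counts fall directly out of the design arithmetic.
constrained_trials = list(product(MODELS, CASES, CONSTRAINED_VARIANTS, range(RUNS)))
naturalistic_trials = list(product(MODELS, CASES, NATURALISTIC_VARIANTS, range(RUNS)))
assert len(constrained_trials) == 1275    # 5 models x 17 cases x 3 variants x 5 runs
assert len(naturalistic_trials) == 850    # 5 models x 17 cases x 2 variants x 5 runs

def adjudicator_agreement(labels_a: list[str], labels_b: list[str]) -> tuple[float, float]:
    """Raw percent agreement and Cohen's kappa between two adjudicators'
    A-D labels for the same set of free-text responses."""
    raw = sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)
    kappa = cohen_kappa_score(labels_a, labels_b, labels=["A", "B", "C", "D"])
    return raw, kappa
```

Enumerating trials explicitly keeps the 1,275/850 totals auditable, and reporting both raw agreement and κ (as the paper does: 94.7%, κ = 0.921) guards against agreement inflated by an uneven label distribution.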
Key Findings
- Overall accuracy improves in the naturalistic setting – mean triage accuracy rose from 63.6% under the constrained protocol to 70.1% under naturalistic interaction (Δ = +6.4 pp, Wilcoxon p = 0.015), indicating that the exam-style constraints depress measured triage performance rather than reflect the models' underlying capability.
- Emergency cases are not inherently mis-triaged – diabetic ketoacidosis was correctly identified as an emergency in every trial (100% across all models, both conditions). Asthma, the scenario that drove most under-triage in the original study, improved dramatically from 48% (constrained) to 80% (naturalistic).
- Forced A/B/C/D discretization is the dominant failure mechanism – a targeted ablation on the asthma case showed that three models (GPT-5.2, Gemini 3 Flash, Gemini 3.1 Pro) scored 0–24% when forced to choose a letter but achieved 100% when allowed free-text output (p < 10⁻⁸). In free text, these models consistently recommended emergency care in their own words; the forced-choice format artificially recorded them as under-triaging. The Claude models behaved differently, achieving 100% under both formats, indicating that model architecture and prompt interpretation interact with the constraint. A sketch of the corresponding significance checks appears after this list.
- Prompt-faithful checks confirm scaffold dependence – using the exact prompts released by Ramaswamy et al., the authors observed model-dependent and case-dependent variation. For the "factor sweep" (16 demographic variants of asthma and DKA), GPT-5.2 remained stable (16/16 emergencies), whereas Claude Opus 4.6 varied widely (7/16 for asthma, 9/16 for DKA). The "naturalization ladder" (progressively stripping scaffolding) produced non-uniform performance shifts, reinforcing that the scaffold is not a neutral measurement tool.
- Exploratory addition of GPT-5.3 Instant – even with a sixth, newer model, the aggregate improvement persisted (+6.8 pp, p = 0.0043), suggesting the phenomenon is not limited to a particular model generation.
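As a rough illustration of how the headline statistics could be checked, the snippet below pairs per-case accuracies across conditions with a Wilcoxon signed-rank test and applies Fisher's exact test to forced-choice versus free-text counts for the asthma ablation. The accuracy values and trial counts are placeholders, and the specific test choices are assumptions about the analysis rather than the authors' code.

```python
import numpy as np
from scipy.stats import wilcoxon, fisher_exact

# --- Paired comparison across conditions ------------------------------------
# Placeholder per-case accuracies for the 17 scenarios (synthetic values, not
# the study's data); the paper reports means of 63.6% vs 70.1% and a Wilcoxon
# signed-rank p = 0.015 on the paired differences.
rng = np.random.default_rng(0)
acc_constrained = rng.uniform(0.4, 0.9, size=17)
acc_naturalistic = np.clip(acc_constrained + rng.normal(0.06, 0.05, size=17), 0.0, 1.0)

stat, p_paired = wilcoxon(acc_naturalistic, acc_constrained)
print(f"Wilcoxon signed-rank: statistic={stat:.1f}, p={p_paired:.4f}")

# --- Forced-choice vs free-text ablation (asthma case) -----------------------
# Illustrative counts for one model: correct/incorrect triage under each output
# format; the paper reports 0-24% (forced letter) vs 100% (free text) for three
# models, all p < 1e-8.
forced_correct, forced_total = 2, 25
free_correct, free_total = 25, 25
table = [[forced_correct, forced_total - forced_correct],
         [free_correct, free_total - free_correct]]
_, p_ablation = fisher_exact(table, alternative="two-sided")
print(f"Fisher exact test on ablation counts: p={p_ablation:.2e}")
```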
Interpretation and Implications
The study demonstrates that the headline under‑triage rate is highly contingent on the evaluation format rather than reflecting an intrinsic limitation of consumer‑facing health AI. Forced multiple‑choice output, knowledge suppression, and the prohibition of clarifying questions together create an artificial context that can invert a model’s true clinical recommendation. In real‑world deployments, chatbots operate in a multi‑turn, clarifying dialogue, often with persistent memory of prior interactions, allowing them to request missing information and refine their assessment. Consequently, a single‑turn, forced‑choice protocol cannot reliably predict safety performance in practice.
The authors caution against using the 51.6 % figure as a definitive safety metric for consumer health AI. Instead, they advocate for evaluation frameworks that (i) allow free‑text responses, (ii) permit iterative clarification, (iii) incorporate system‑prompt realism, and (iv) test across diverse demographic and linguistic variations. Such ecologically valid assessments will better capture the models’ ability to recognize emergencies and avoid both under‑ and over‑triage.
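One way to operationalize these recommendations is to make the evaluation protocol itself an explicit, declarative object, so a reported triage score always carries its constraints with it. The structure below is a hypothetical sketch of such a configuration, not an artifact from the paper; field names and example values are illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TriageEvalProtocol:
    """Declarative description of a triage-evaluation protocol (hypothetical
    structure; field names are illustrative, not taken from the paper)."""
    free_text_responses: bool = True          # (i) no forced A/B/C/D discretization
    allow_clarifying_questions: bool = True   # (ii) iterative, multi-turn clarification
    system_prompt: Optional[str] = None       # (iii) deployment-realistic system prompt
    demographic_variants: List[str] = field(default_factory=list)  # (iv) factor sweep
    language_variants: List[str] = field(default_factory=list)     # (iv) linguistic variation

# The exam-style and naturalistic protocols become two points in one space,
# so format effects can be isolated rather than baked into the headline number.
exam_style = TriageEvalProtocol(free_text_responses=False,
                                allow_clarifying_questions=False,
                                system_prompt="Base your answer only on this message.")
naturalistic = TriageEvalProtocol(demographic_variants=["age", "sex", "comorbidity"],
                                  language_variants=["en", "es"])
```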
Limitations
- Direct testing of ChatGPT Health was not possible due to regional availability restrictions; however, the inclusion of multiple frontier models partially mitigates this gap.
- The 17‑scenario bank, while covering key emergencies, does not span the full spectrum of possible consumer health queries.
- Human verification of the LLM‑generated scenario texts was limited; future work could involve clinician‑authored vignettes.
Conclusion
Evaluation design, particularly forced A/B/C/D discretization and knowledge suppression, drives the apparent triage failures observed in prior work. When models are assessed under conditions that mirror actual consumer interactions, triage accuracy improves and the dramatic under-triage rates largely disappear. The paper calls for a shift toward realistic, interactive testing to ensure that safety claims about consumer health AI are grounded in the contexts in which these systems will actually be used.