Evaluating the Presence of Sex Bias in Clinical Reasoning by Large Language Models

Large language models (LLMs) are increasingly embedded in healthcare workflows for documentation, education, and clinical decision support. However, these systems are trained on large text corpora that encode existing biases, including sex disparities in diagnosis and treatment, raising concerns that such patterns may be reproduced or amplified. We systematically examined whether contemporary LLMs exhibit sex-specific biases in clinical reasoning and how model configuration influences these behaviours. We conducted three experiments using 50 clinician-authored vignettes spanning 44 specialties in which sex was non-informative to the initial diagnostic pathway, evaluating four general-purpose LLMs: ChatGPT (gpt-4o-mini), Claude 3.7 Sonnet, Gemini 2.0 Flash and DeepSeek-chat. All models demonstrated significant sex-assignment skew, with the predicted sex differing by model. At temperature 0.5, ChatGPT assigned female sex in 70% of cases (95% CI 0.66-0.75), DeepSeek in 61% (0.57-0.65) and Claude in 59% (0.55-0.63), whereas Gemini showed a male skew, assigning female sex in 36% of cases (0.32-0.41). Contemporary LLMs exhibit stable, model-specific sex biases in clinical reasoning. Permitting abstention reduces explicit labelling but does not eliminate downstream diagnostic differences. Safe clinical integration requires conservative and documented configuration, specialty-level clinical data auditing, and continued human oversight when deploying general-purpose models in healthcare settings.


💡 Research Summary

This study systematically investigates whether contemporary large language models (LLMs) exhibit sex-based biases in clinical reasoning tasks, a critical concern as these models are increasingly integrated into healthcare workflows. The researchers evaluated four general-purpose LLMs—ChatGPT (gpt-4o-mini), Claude 3.7 Sonnet, Gemini 2.0 Flash, and DeepSeek-chat—across three experiments designed to probe model behavior under controlled conditions.

The core methodology involved using 50 clinician-authored clinical vignettes spanning 44 medical specialties. Crucially, these vignettes were designed so that the patient’s sex was non-informative to the correct diagnostic pathway. In the first experiment, models were prompted to assign a binary sex (male/female) to these sex-neutral vignettes. The second experiment allowed models to abstain from assigning sex. The third experiment compared the top-five differential diagnosis lists generated for otherwise identical vignette pairs that differed only by the stated patient sex (male or female). All queries were run at three different temperature settings (0.2, 0.5, 1.0) with ten repetitions per condition to assess stability and variability.
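The paper does not include code, but the experimental protocol is straightforward to picture. Below is a minimal Python sketch of how such a sweep over models, temperatures, and repetitions might be orchestrated; `query_model`, `MODELS`, and the prompt wording are hypothetical placeholders standing in for whatever provider-specific API calls and instructions the authors actually used.

```python
from itertools import product

# Assumed constants mirroring the study design described above.
MODELS = ["gpt-4o-mini", "claude-3.7-sonnet", "gemini-2.0-flash", "deepseek-chat"]
TEMPERATURES = [0.2, 0.5, 1.0]
N_REPS = 10

def run_sex_assignment_experiment(vignettes, query_model):
    """Collect one response per model / temperature / vignette / repetition.

    `query_model(model, prompt, temperature)` is a hypothetical wrapper around
    the relevant provider API; `vignettes` would be the 50 sex-neutral cases.
    """
    records = []
    for model, temp, vignette in product(MODELS, TEMPERATURES, vignettes):
        for rep in range(N_REPS):
            prompt = (
                "Assign a binary sex (male or female) to the patient in the "
                "following vignette, then explain your reasoning.\n\n" + vignette
            )
            response = query_model(model=model, prompt=prompt, temperature=temp)
            records.append({
                "model": model,
                "temperature": temp,
                "vignette": vignette,
                "repetition": rep,
                "response": response,
            })
    return records
```

The second and third experiments would reuse the same loop with a different prompt: one permitting an explicit abstention option, and one requesting a top-five differential diagnosis for male and female versions of each vignette.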

The findings revealed significant and model-specific sex assignment biases. At a temperature of 0.5, ChatGPT demonstrated a strong female skew, assigning female sex in 70% of cases. DeepSeek and Claude showed a moderate female bias (61% and 59% female assignments, respectively), while Gemini exhibited a male skew, assigning female sex in only 36% of cases. Temperature alone did not have a significant main effect on sex assignment, but a significant interaction between model and temperature was observed. Specialty-level analysis uncovered stark patterns: psychiatry, rheumatology, and hematology cases were almost exclusively labeled female across models, whereas cardiology and urology cases were uniformly labeled male.
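For readers who want to sanity-check the reported figures, a normal-approximation (Wald) confidence interval with an assumed denominator of 500 trials per model (50 vignettes × 10 repetitions at one temperature) reproduces intervals close to those quoted in the abstract. The paper does not state which interval method or exact denominator was used, so the sketch below is illustrative only.

```python
import math

def wald_ci(successes, trials, z=1.96):
    """Normal-approximation 95% CI for a binomial proportion."""
    p = successes / trials
    half_width = z * math.sqrt(p * (1 - p) / trials)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Assumed counts: 350 female assignments out of 500 trials (~70%), roughly
# matching the 0.66-0.75 interval reported for ChatGPT at temperature 0.5.
p, lo, hi = wald_ci(successes=350, trials=500)
print(f"female proportion {p:.2f}, 95% CI {lo:.2f}-{hi:.2f}")
```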

When permitted to abstain from sex assignment, ChatGPT abstained in 100% of cases, while other models showed varying abstention rates. However, this reduction in explicit labeling did not eliminate downstream effects. In the differential diagnosis experiment, the lists generated for male versus female versions of the same clinical scenario frequently diverged. At temperature 0.5, the proportion of diagnosis lists that were mismatched (differing in content) was highest for ChatGPT (78%) and lowest for DeepSeek and Gemini (58% each). Similarity metrics like Jaccard index confirmed that diagnostic overlap decreased as temperature increased, but sex-contingent differences persisted across all settings.
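The two comparison metrics used in this experiment are standard. Below is a minimal sketch of both, assuming the diagnosis lists have already been normalized to comparable strings (the paper's exact normalization procedure is not specified); the example diagnoses are invented for illustration, not drawn from the study.

```python
def jaccard(list_a, list_b):
    """Jaccard index between two top-5 differential diagnosis lists."""
    set_a, set_b = set(list_a), set(list_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def mismatch_rate(paired_lists):
    """Proportion of male/female diagnosis-list pairs that differ in content."""
    mismatches = sum(1 for male, female in paired_lists
                     if set(male) != set(female))
    return mismatches / len(paired_lists)

# Toy illustration with made-up diagnoses:
male_dx = ["acute coronary syndrome", "pulmonary embolism", "GERD",
           "musculoskeletal pain", "pericarditis"]
female_dx = ["anxiety disorder", "pulmonary embolism", "GERD",
             "musculoskeletal pain", "pericarditis"]
print(jaccard(male_dx, female_dx))            # ~0.67 for this toy pair
print(mismatch_rate([(male_dx, female_dx)]))  # 1.0: the single pair differs
```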

The study concludes that contemporary LLMs demonstrate stable, model-specific sex biases when performing clinical reasoning tasks. Allowing models to abstain from explicit demographic inference reduces overt labeling but does not mitigate the bias manifested in subsequent clinical outputs, such as differential diagnoses. The authors emphasize that the observed biases often mirror documented sex disparities in medical research and care delivery. For safe clinical integration, they recommend conservative and well-documented model configuration, specialty-level auditing of model outputs for biased patterns, and continued human oversight when deploying general-purpose LLMs in healthcare settings. The research underscores that technical interventions, such as abstention options, are insufficient on their own and that a multifaceted approach is necessary to address embedded biases.

