A Women's Health Benchmark for Large Language Models

Reading time: 5 minutes
...

📝 Original Info

  • Title: A Women’s Health Benchmark for Large Language Models
  • ArXiv ID: 2512.17028
  • Date: 2025-12-18
  • Authors: Victoria-Elisabeth Gruber, Razvan Marinescu, Diego Fajardo, Amin H. Nassar, Christopher Arkfeld, Alexandria Ludlow, Shama Patel, Mehrnoosh Samaei, Valerie Klug, Anna Huber, Marcel Gühner, Albert Botta i Orfila, Irene Lagoja, Kimya Tarr, Haleigh Larson, Mary Beth Howard

📝 Abstract

As large language models (LLMs) become primary sources of health information for millions, their accuracy in women's health remains critically unexamined. We introduce the Women's Health Benchmark (WHB), the first benchmark evaluating LLM performance specifically in women's health. Our benchmark comprises 96 rigorously validated model stumps covering five medical specialties (obstetrics and gynecology, emergency medicine, primary care, oncology, and neurology), three query types (patient query, clinician query, and evidence/policy query), and eight error types (dosage/medication errors, missing critical information, outdated guidelines/treatment recommendations, incorrect treatment advice, incorrect factual information, missing/incorrect differential diagnosis, missed urgency, and inappropriate recommendations). We evaluated 13 state-of-the-art LLMs and revealed alarming gaps: current models show approximately 60% failure rates on the women's health benchmark, with performance varying dramatically across specialties and error types. Notably, models universally struggle with "missed urgency" indicators, while newer models like GPT-5 show significant improvements in avoiding inappropriate recommendations. Our findings underscore that AI chatbots are not yet fully capable of providing reliable advice in women's health.
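The abstract describes the benchmark along three axes (specialty, query type, error type) and reports failure rates aggregated over them. The sketch below illustrates one way such a benchmark item could be represented and how per-axis failure rates might be tabulated. It is a minimal illustration, not the authors' schema or code: the class name `WHBItem`, its field names, and the `failure_rates` helper are assumptions for the sake of the example.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical record for one benchmark item ("model stump").
# Field names are illustrative only, not the paper's actual schema.
@dataclass
class WHBItem:
    item_id: str
    specialty: str    # e.g. "obstetrics and gynecology", "emergency medicine"
    query_type: str   # "patient", "clinician", or "evidence/policy"
    error_type: str   # e.g. "missed urgency", "dosage/medication errors"
    prompt: str       # the query shown to the model under test

def failure_rates(results: list[dict]) -> dict[str, float]:
    """Aggregate pass/fail judgments into a failure rate per error type.

    `results` is a list of {"item": WHBItem, "passed": bool} dicts,
    one entry per (model, item) evaluation.
    """
    totals, failures = defaultdict(int), defaultdict(int)
    for r in results:
        key = r["item"].error_type
        totals[key] += 1
        if not r["passed"]:
            failures[key] += 1
    return {k: failures[k] / totals[k] for k in totals}
```

Grouping by `specialty` or `query_type` instead of `error_type` would give the other breakdowns the abstract mentions (e.g. the per-specialty heatmaps listed in the Image Gallery below).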

💡 Deep Analysis

Figure 1 (image not reproduced in this text; see the Image Gallery below)

📄 Full Content

A Women’s Health Benchmark for Large Language Models

Victoria-Elisabeth Gruber1, Razvan Marinescu1, Diego Fajardo1, Amin H. Nassar2, Christopher Arkfeld3, Alexandria Ludlow4, Shama Patel5, Mehrnoosh Samaei6, Valerie Klug7, Anna Huber7, Marcel Gühner7, Albert Botta i Orfila7, Irene Lagoja7, Kimya Tarr8, Haleigh Larson9, Mary Beth Howard10

1 Lumos AI; 2 Medical Oncology, Yale Cancer Center; 3 Obstetrics and Gynecology, MGH, Harvard Medical School; 4 Obstetrics, Gynecology & Reproductive Sciences, UCSF; 5 Brown Division of Global Emergency Medicine; 6 Department of Emergency Medicine, Emory University; 7 Pharmacy Department, Clinic Ottakring; 8 Windrush Surgery, Buckinghamshire, Oxfordshire and Berkshire West Integrated Care Board, NHS; 9 Women’s Health Research, Yale School of Medicine; 10 Johns Hopkins University School of Medicine

{victoria,razvan,diego}@thelumos.ai

Abstract

As large language models (LLMs) become primary sources of health information for millions, their accuracy in women’s health remains critically unexamined. We introduce the Women’s Health Benchmark (WHB), the first benchmark evaluating LLM performance specifically in women’s health. Our benchmark comprises 96 rigorously validated model stumps covering five medical specialties (obstetrics and gynecology, emergency medicine, primary care, oncology, and neurology), three query types (patient query, clinician query, and evidence/policy query), and eight error types (dosage/medication errors, missing critical information, outdated guidelines/treatment recommendations, incorrect treatment advice, incorrect factual information, missing/incorrect differential diagnosis, missed urgency, and inappropriate recommendations). We evaluated 13 state-of-the-art LLMs and revealed alarming gaps: current models show approximately 60% failure rates on the women’s health benchmark, with performance varying dramatically across specialties and error types. Notably, models universally struggle with "missed urgency" indicators, while newer models like GPT-5 show significant improvements in avoiding inappropriate recommendations. Our findings underscore that AI chatbots are not yet fully capable of providing reliable advice in women’s health.

1 Introduction

Artificial intelligence (AI) has made significant progress in various domains, including healthcare, with the potential to significantly improve patient care and quality of life [1] [2]. Medical diagnosis is one application area where AI has already shown promising results, as algorithms are able to identify patterns in medical images and text that are not readily visible to the human eye [3]. Another field where AI is making significant progress is personalized medicine, where algorithms identify patterns in vast volumes of patient data and provide personalized treatment plans that take into account the patient’s unique genetics, lifestyle, and medical history [4]. AI is also being used to increase the effectiveness of clinical operations through the integration of chatbots and virtual assistants that help with patient intake and triage [5]. It is therefore not surprising that more and more patients and physicians turn to AI chatbots every day for health-related questions and for help with clinical decision making. Roughly one in six adults has used an AI chatbot for health-related questions in the last year [6]. This gives rise to the question: how accurate are AI models when it comes to women’s health-related questions?
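The abstract implies an evaluation harness in which each stump is sent to every model under test and the response is graded against the stump's planted error. The loop below is a minimal sketch of that shape only; the paper does not publish its harness, so the model identifiers, `query_model`, and `judge_response` are placeholders (here stubbed out so the snippet runs), not the authors' API calls or grading rubric. It assumes the hypothetical `WHBItem` records sketched after the abstract above.

```python
import random

# Illustrative model identifiers only; the paper evaluates 13 LLMs.
MODELS = ["model-a", "model-b", "model-c"]

def query_model(model: str, prompt: str) -> str:
    """Placeholder for a real chat-completion API call (assumption, not the authors' code)."""
    return f"[{model} response to: {prompt[:40]}...]"

def judge_response(item, answer: str) -> bool:
    """Placeholder grader. The real benchmark grades against expert-validated
    criteria for the planted error; this stand-in merely simulates the ~60%
    failure rate the abstract reports."""
    return random.random() > 0.6

def evaluate(items) -> list[dict]:
    """Run every (model, item) pair and collect pass/fail judgments."""
    results = []
    for model in MODELS:
        for item in items:
            answer = query_model(model, item.prompt)
            passed = judge_response(item, answer)
            results.append({"model": model, "item": item, "passed": passed})
    return results
```

The resulting list of judgments can be fed into a helper like the `failure_rates` sketch above to reproduce the per-error-type and per-specialty breakdowns the paper visualizes.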
Historically, women have been underrepresented in research and clinical trials, meaning that a majority of available data is biased toward male populations for a myriad of reasons [7] [8]. Major contributors include biological sex differences (physiological differences driven by chromosomal, hormonal, and anatomical factors), gender effects (differences arising from socially constructed roles, behaviors, and identities), and inadequate research in women, such as the decades-long exclusion of pregnant women and women of child-bearing age from clinical studies [9] [10] [11]. Bias in the training data of large language models arises not only from the historical underrepresentation of women in research, but also from shortcut learning and spurious correlations present in imbalanced or noisy medical training datasets. As a result, models may rely on simplified patterns or stereotypes instead of true medical reasoning, leading to confident but incorrect responses that can worsen sex- and gender-related gaps in healthcare [12] [13] [14]. A recent study examining online reproductive health misinformation across multiple social media platforms and websites found that 23% of the content included medical recommendations that do not align with professional guidelines [15]. It further found potentially misleading claims and narratives about reproductive topics such as contraception, abortion, fertility, chronic disease, breast cancer, maternal health, an

📸 Image Gallery

  • Model_stump_example.png
  • WHB_1.png
  • WHB_2.png
  • WHB_Overview_4.png
  • error_type_distribution.png
  • failure_rates_by_error_type.png
  • failure_rates_by_error_type_heatmap.png
  • failure_rates_by_specialty.png
  • failure_rates_by_specialty_heatmap.png
  • incorrect_cases_by_model_query_type.png
  • model_performance.png
  • query_type_distribution.png
  • speciality_distribution.png

Reference

This content is AI-processed based on open access ArXiv data.
