ADRD-Bench: A Preliminary LLM Benchmark for Alzheimer's Disease and Related Dementias


Large language models (LLMs) have shown great potential for healthcare applications. However, existing evaluation benchmarks provide minimal coverage of Alzheimer’s Disease and Related Dementias (ADRD). To address this gap, we introduce ADRD-Bench, the first ADRD-specific benchmark dataset designed for rigorous evaluation of LLMs. ADRD-Bench has two components: 1) ADRD Unified QA, a synthesis of 1,352 questions consolidated from seven established medical benchmarks, providing a unified assessment of clinical knowledge; and 2) ADRD Caregiving QA, a novel set of 149 questions derived from the Aging Brain Care (ABC) program, a widely used, evidence-based brain health management program. Guided by a program with national expertise in comprehensive ADRD care, this new set was designed to mitigate the lack of practical caregiving context in existing benchmarks. We evaluated 33 state-of-the-art LLMs on the proposed ADRD-Bench. Results showed that the accuracy of open-weight general models ranged from 0.63 to 0.93 (mean: 0.78; std: 0.09). The accuracy of open-weight medical models ranged from 0.48 to 0.93 (mean: 0.82; std: 0.13). The accuracy of closed-source general models ranged from 0.83 to 0.91 (mean: 0.89; std: 0.03). While top-tier models achieved high accuracies (>0.9), case studies revealed that inconsistent reasoning quality and stability limit their reliability, highlighting a critical need for domain-specific improvement to enhance LLMs’ knowledge and reasoning grounded in daily caregiving data. The entire dataset is available at https://github.com/IIRL-ND/ADRD-Bench.


💡 Research Summary

The paper introduces ADRD‑Bench, the first benchmark specifically designed to evaluate large language models (LLMs) on tasks related to Alzheimer’s disease and related dementias (ADRD). The authors identify a critical gap: existing medical LLM benchmarks contain only a tiny fraction of ADRD content (generally less than 1% of questions) and virtually no caregiving‑oriented items, despite the fact that ADRD care heavily involves daily caregiver decision‑making and behavioral management.

ADRD‑Bench consists of two complementary components. The first, “ADRD Unified QA,” aggregates 1,352 ADRD‑related questions drawn from seven well‑established medical benchmarks (PubMedQA, HEAD‑QA, MedBullets, MedMCQA, MedQA, MEDEC, and MedHallu). The questions are kept in their original formats, spanning multiple‑choice knowledge items and error‑detection (hallucination) items, which enables direct comparison with prior work.
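
A plausible way to reproduce this consolidation step is simple keyword filtering over the source benchmarks. The paper does not publish its exact pipeline, so the keyword list, record fields, and function names in this sketch are illustrative assumptions, not the authors' method:

```python
# Hypothetical reconstruction of the consolidation step; the keyword list
# and record fields below are illustrative assumptions.
ADRD_KEYWORDS = (
    "alzheimer", "dementia", "lewy body", "frontotemporal",
    "vascular cognitive", "mild cognitive impairment",
)

def is_adrd_related(item: dict) -> bool:
    """True if any ADRD keyword appears in the question text or its options."""
    text = " ".join([item.get("question", ""), *item.get("options", [])]).lower()
    return any(kw in text for kw in ADRD_KEYWORDS)

def consolidate(benchmarks: dict[str, list[dict]]) -> list[dict]:
    """Pool ADRD-related items from the source benchmarks, keeping provenance."""
    pooled = []
    for name, items in benchmarks.items():
        pooled.extend({**item, "source_benchmark": name}
                      for item in items if is_adrd_related(item))
    return pooled
```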

The second component, “ADRD Caregiving QA,” addresses the missing real‑world caregiving dimension. It is built from the Aging Brain Care (ABC) program, a nationally recognized, evidence‑based dementia care model with two decades of clinical validation. By abstracting recurring caregiving scenarios and recommended strategies from de‑identified ABC educational materials, the authors created 149 new QA pairs (120 true/false and 29 multiple‑choice). A senior clinician who helped design the ABC program reviewed all items for clinical accuracy, clarity, and relevance.
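
For concreteness, a caregiving item can be modeled with a small record type. The paper describes only the counts and formats (120 true/false, 29 multiple‑choice), so the field names below are assumptions for illustration:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CaregivingQAItem:
    """One ADRD Caregiving QA item; field names are illustrative assumptions."""
    question: str                                     # caregiving scenario question
    item_type: str                                    # "true_false" or "multiple_choice"
    options: list[str] = field(default_factory=list)  # empty for true/false items
    answer: str = ""                                  # "True"/"False" or an option letter
    source_note: Optional[str] = None                 # ABC material the item was abstracted from
```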

To assess the benchmark, the authors evaluated 33 state‑of‑the‑art LLMs, covering open‑weight general models (3.8B–235B parameters), open‑weight medical‑specialized models, and closed‑source general models such as ChatGPT, Claude, and Gemini. Smaller models were run on a local RTX 6000 Ada GPU and larger ones on an H100‑based cloud cluster, with the same prompting and evaluation protocol applied throughout. Accuracy was the primary metric.
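
A minimal evaluation harness in this spirit is sketched below. The prompt template, the `ask_model` callable, and the letter‑matching rule are assumptions standing in for the paper's unpublished protocol, and the sketch covers only the multiple‑choice items:

```python
from typing import Callable

def evaluate_model(ask_model: Callable[[str], str], items: list[dict]) -> float:
    """Accuracy of one model over multiple-choice items.

    `ask_model` maps a prompt string to the model's raw answer string; the
    prompt template and matching rule below are guesses, not the paper's
    protocol.
    """
    correct = 0
    for item in items:
        options = "\n".join(f"{chr(65 + i)}. {opt}"
                            for i, opt in enumerate(item["options"]))
        prompt = ("Answer with the letter of the correct option only.\n\n"
                  f"{item['question']}\n{options}")
        reply = ask_model(prompt).strip().upper()
        if reply[:1] == item["answer"].upper():
            correct += 1
    return correct / len(items)
```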

Results show that closed‑source general models achieved the highest mean accuracy (0.89 ± 0.03), followed by open‑weight medical models (0.82 ± 0.13) and open‑weight general models (0.78 ± 0.09). Larger models tended to perform better, but performance varied markedly across question types: multiple‑choice clinical knowledge items were generally answered correctly, whereas error‑detection and caregiving scenario items exposed notable weaknesses.

Beyond raw accuracy, the authors performed qualitative case studies. They found that even top‑performing models could produce inconsistent answers when the same question was paraphrased or when contextual cues changed, indicating that their reasoning is not stable. In hallucination detection tasks, models frequently misidentified correct statements as false or missed fabricated answers, highlighting limited meta‑reasoning capabilities. For caregiving questions, models often omitted empathetic nuance or cultural considerations, suggesting that current LLMs lack the depth required for real‑world caregiver support.
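
The stability issue can be probed with a simple paraphrase‑consistency check in the spirit of these case studies. The exact perturbation protocol the authors used is not specified, so this helper is a hedged sketch:

```python
from typing import Callable

def consistency_rate(ask_model: Callable[[str], str],
                     paraphrases: list[str], expected: str) -> float:
    """Fraction of paraphrased prompts answered with the expected label.

    A crude stability probe: `expected` is e.g. "TRUE" or an option letter,
    and `paraphrases` are reworded versions of one source question.
    """
    hits = sum(1 for p in paraphrases
               if ask_model(p).strip().upper().startswith(expected.upper()))
    return hits / len(paraphrases)
```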

The paper acknowledges limitations: the overall dataset size remains modest, especially the caregiving subset (149 items), which may not capture the full diversity of daily ADRD care scenarios. Moreover, the evaluation relies mainly on accuracy, without metrics for explanation quality, confidence calibration, or ethical safety. The authors propose future work including human‑in‑the‑loop assessments with professional caregivers, domain‑specific fine‑tuning using ABC program transcripts, and the development of richer evaluation metrics (e.g., factual consistency, trustworthiness, and sentiment alignment).

In summary, ADRD‑Bench fills a notable void in LLM evaluation by unifying scattered ADRD knowledge questions and introducing a novel caregiving‑focused component grounded in a validated clinical program. Initial experiments reveal that while contemporary LLMs can achieve high accuracy on knowledge‑centric items, they still struggle with consistent reasoning, hallucination detection, and nuanced caregiver support. The benchmark thus provides a valuable testbed for guiding the next generation of domain‑adapted LLMs toward safer, more reliable deployment in Alzheimer’s and dementia care.

