GAPS: A Clinically Grounded, Automated Benchmark for Evaluating AI Clinicians

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Current benchmarks for AI clinician systems, often based on multiple-choice exams or manual rubrics, fail to capture the depth, robustness, and safety required for real-world clinical practice. To address this, we introduce the GAPS framework, a multidimensional paradigm for evaluating Grounding (cognitive depth), Adequacy (answer completeness), Perturbation (robustness), and Safety. Critically, we developed a fully automated, guideline-anchored pipeline to construct a GAPS-aligned benchmark end-to-end, overcoming the scalability and subjectivity limitations of prior work. Our pipeline assembles an evidence neighborhood, creates dual graph and tree representations, and automatically generates questions across G-levels. Rubrics are synthesized by a DeepResearch agent that mimics GRADE-consistent, PICO-driven evidence review in a ReAct loop. Scoring is performed by an ensemble of large language model (LLM) judges. Validation confirmed our automated questions are of high quality and align with clinician judgment (90% agreement, Cohen's Kappa 0.77). Evaluating state-of-the-art models on the benchmark revealed key failure modes: performance degrades sharply with increased reasoning depth (G-axis), models struggle with answer completeness (A-axis), and they are highly vulnerable to adversarial perturbations (P-axis) as well as certain safety issues (S-axis). This automated, clinically grounded approach provides a reproducible and scalable method for rigorously evaluating AI clinician systems and guiding their development toward safer, more reliable clinical practice. The benchmark dataset GAPS-NSCLC-preview and evaluation code are publicly available at https://huggingface.co/datasets/AQ-MedAI/GAPS-NSCLC-preview and https://github.com/AQ-MedAI/MedicalAiBenchEval.


💡 Research Summary

The paper introduces GAPS, a multidimensional benchmark designed to evaluate AI clinician systems beyond the limited scope of traditional multiple‑choice exams and manually crafted rubrics. GAPS decomposes clinical competence into four orthogonal axes: Grounding (cognitive depth of reasoning, from factual recall G1 to inferential reasoning under uncertainty G4), Adequacy (completeness of the answer, split into Must‑have A1, Should‑have A2, and Nice‑to‑have A3 components), Perturbation (robustness to input variations, ranging from clean prompts P0 to linguistic noise P1, redundant context P2, and adversarial premises P3), and Safety (risk taxonomy from harmless S1 to catastrophic S4).
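The four axes lend themselves to a simple enumeration. A minimal Python sketch follows; the class and member names are ours rather than the paper's schema, and since the summary names only S1 (harmless) and S4 (catastrophic), the S2/S3 labels are assumptions:

```python
from enum import IntEnum

class Grounding(IntEnum):
    """Cognitive depth of the question (G-axis)."""
    G1_FACTUAL = 1        # factual recall
    G2_EXPLANATORY = 2    # explanatory reasoning
    G3_APPLIED = 3        # applied decision-making
    G4_INFERENTIAL = 4    # inferential reasoning under uncertainty

class Adequacy(IntEnum):
    """Tier of a positive (completeness) rubric element (A-axis)."""
    A1_MUST_HAVE = 1
    A2_SHOULD_HAVE = 2
    A3_NICE_TO_HAVE = 3

class Perturbation(IntEnum):
    """Input variant applied to a question (P-axis)."""
    P0_CLEAN = 0
    P1_LINGUISTIC_NOISE = 1
    P2_REDUNDANT_CONTEXT = 2
    P3_ADVERSARIAL_PREMISE = 3

class Safety(IntEnum):
    """Risk severity of a negative rubric element (S-axis)."""
    S1_HARMLESS = 1
    S2_MINOR = 2          # assumed label, not named in the summary
    S3_SEVERE = 3         # assumed label, not named in the summary
    S4_CATASTROPHIC = 4
```

Integer-valued enums make the ordinal structure of each axis explicit, so "performance degrades with G-level" can be expressed as a comparison over `Grounding` values.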

To operationalize this framework, the authors build a fully automated pipeline anchored in clinical practice guidelines—in this case the NCCN guideline for non‑small cell lung cancer (NSCLC). The pipeline first assembles a frozen “evidence neighbourhood” and constructs two complementary representations: a knowledge graph (KG) capturing relations among clinical concepts, and a hierarchical tree preserving the guideline’s structure. These structures drive systematic question generation across Grounding levels (G1‑G4). For each clean question (P0), perturbation variants (P1‑P3) are automatically created using predefined prompt‑transformation rules that preserve the clinical core while introducing controlled noise.
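The perturbation step can be pictured as a rule-per-level rewrite of the clean question. The concrete transformations below are illustrative stand-ins, not the paper's actual prompt-transformation rules:

```python
import random

def perturb(question: str, level: int, rng: random.Random) -> str:
    """Generate a P1-P3 variant of a clean (P0) question.

    The rules here are illustrative: P1 injects linguistic noise,
    P2 prepends clinically irrelevant context, and P3 plants a
    false (adversarial) premise while keeping the clinical core.
    """
    if level == 0:                       # P0: return the clean question
        return question
    if level == 1:                       # P1: linguistic noise (filler word)
        words = question.split()
        words.insert(rng.randrange(len(words)), "um,")
        return " ".join(words)
    if level == 2:                       # P2: redundant, irrelevant context
        return "The patient enjoys gardening and has two cats. " + question
    if level == 3:                       # P3: adversarial false premise
        return ("Given that targeted therapy is never used in NSCLC, "
                + question[0].lower() + question[1:])
    raise ValueError(f"unknown perturbation level: {level}")
```

Keeping the clean question recoverable inside each variant is what lets the benchmark attribute any score drop to the perturbation itself rather than to a change in the underlying clinical task.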

Rubric synthesis is performed by a purpose‑built DeepResearch agent that follows a ReAct‑style reasoning loop. The agent formulates PICO queries, retrieves guideline‑consistent evidence, synthesizes comprehensive answers, and extracts both positive (Adequacy) and negative (Safety) rubric elements with verifiable citations, ensuring transparency and reproducibility. The resulting benchmark, GAPS‑NSCLC‑preview, contains 92 questions and 1,691 rubric items, averaging about 12 Adequacy and 7 Safety elements per question, with balanced coverage across reasoning levels.
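The ReAct loop above can be sketched as a thin controller that alternates model reasoning with evidence retrieval. Here `llm` and `retrieve` are caller-supplied stubs, and the `Action:` / `Observation:` / `Final:` text protocol is an assumed convention, not the agent's actual interface:

```python
def synthesize_rubric(question, retrieve, llm, max_steps=6):
    """ReAct-style rubric synthesis sketch.

    The agent alternates reasoning steps with evidence retrieval:
    an 'Action:' line carries a PICO-style query, the retrieved
    guideline text is fed back as an 'Observation:', and a
    'Final:' line carries the synthesized rubric with citations.
    """
    trace = [f"Question: {question}"]
    for _ in range(max_steps):
        step = llm("\n".join(trace))      # next Thought/Action/Final line
        trace.append(step)
        if step.startswith("Action:"):    # issue a PICO query
            query = step.removeprefix("Action:").strip()
            trace.append(f"Observation: {retrieve(query)}")
        elif step.startswith("Final:"):   # rubric items with citations
            return step.removeprefix("Final:").strip()
    raise RuntimeError("rubric synthesis did not converge")
```

Because the full trace (queries, observations, final rubric) is retained, every rubric element can be traced back to the retrieved guideline passage that justifies it, which is the transparency property the paper emphasizes.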

Evaluation is carried out with an ensemble of large language model (LLM) judges that combine rule‑based scoring with rubric‑based assessment. Clinician audits on stratified subsets confirm the automated items are of high quality (90% agreement, Cohen's κ = 0.77). Five state‑of‑the‑art models—GPT‑5, Gemini 2.5 Pro, Claude Opus 4, DeepSeek‑V3.1‑Terminus, and Qwen3‑235B‑A22B‑Instruct‑2507—are benchmarked.

Results show a clear degradation along the Grounding axis: while all models achieve ~0.70 on G1 (factual) and G2 (explanatory) items, performance drops to ~0.45–0.68 at G3 (applied decision‑making) and falls below 0.35 at G4 (inferential reasoning). Adequacy analysis reveals high hit rates for Must‑have elements (A1) but steep declines for Should‑have (A2) and especially Nice‑to‑have (A3), indicating that models often omit contextual qualifiers and follow‑up recommendations.

Safety assessment uncovers model‑specific risk profiles: Claude Opus 4 exhibits a rise in catastrophic S4 errors from 3% at G1 to 25% at G4, whereas GPT‑5 and Gemini 2.5 Pro maintain near‑zero S4 rates even on the hardest items. Perturbation experiments demonstrate robustness to benign noise (P1, P2) but severe vulnerability to adversarial premises (P3), where scores collapse across all Grounding levels.
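The judging scheme, per-item verdicts from several LLM judges aggregated over a rubric of weighted Adequacy elements and penalized Safety elements, might look like the sketch below. The majority vote, tier weights, and risk penalties are our assumptions, not the paper's scoring formula:

```python
from collections import Counter

# Illustrative tier weights and risk penalties (assumed, not the paper's).
ADEQ = {"A1": 1.0, "A2": 0.6, "A3": 0.3}
RISK = {"S2": 0.2, "S3": 0.5, "S4": 1.0}

def ensemble_score(verdicts, rubric):
    """Aggregate per-item judge verdicts into one score in [0, 1].

    verdicts: one dict per judge mapping rubric-item id -> bool
              (Adequacy item covered / Safety violation triggered).
    rubric:   rubric-item id -> tier ("A1".."A3" or "S2".."S4").
    """
    # Majority vote across the judge ensemble, per rubric item.
    hit = {i: Counter(v[i] for v in verdicts).most_common(1)[0][0]
           for i in rubric}
    total = sum(ADEQ[t] for t in rubric.values() if t in ADEQ)
    gained = sum(ADEQ[t] for i, t in rubric.items() if t in ADEQ and hit[i])
    penalty = sum(RISK[t] for i, t in rubric.items() if t in RISK and hit[i])
    # Weighted Adequacy coverage, minus Safety penalties, floored at 0.
    return max(0.0, gained / total - penalty)
```

An odd number of judges avoids ties in the majority vote; giving an S4 violation a penalty as large as the entire Adequacy budget reflects the framework's premise that a catastrophic error should dominate an otherwise complete answer.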

The authors conclude that current LLM‑based AI clinicians function well as sophisticated medical encyclopedias (high factual recall) but lack the depth, completeness, robustness, and safety required for real‑world decision support. GAPS provides a reproducible, scalable, and clinically grounded evaluation methodology that quantifies these gaps and offers concrete directions for future research: strengthening inferential reasoning (G4), enriching answer completeness across all Adequacy tiers, hardening models against misleading inputs, and implementing safety safeguards to prevent S4‑level failures. The benchmark and code are publicly released, enabling the community to iterate toward trustworthy AI clinicians.

