Revisiting Performance Claims for Chest X-Ray Models Using Clinical Context
Public datasets of Chest X-Rays (CXRs) have long been a popular benchmark for developing machine learning (ML) computer vision models in healthcare. However, the reported strong average-case performance of these models does not necessarily reflect their actual utility when used in heterogeneous clinical settings, potentially masking weaker performance in medically significant scenarios. In this work we use clinical context to provide a more holistic evaluation of current “state-of-the-art” (SOTA) models for CXR diagnosis. In particular, we use discharge summaries, recorded prior to each CXR, to derive a “pre-CXR” probability of each CXR label, as a proxy for existing contextual knowledge available to clinicians when interpreting CXRs. We use this measure to probe model performance along two dimensions: First, using a stratified analysis, we show that models tend to have lower performance (as measured by AUROC and other metrics) among individuals with higher pre-CXR probability. Second, by controlling for pre-CXR probability via matching and re-weighting, we demonstrate that performance degrades when the correlation is broken between prior context and the current CXR label, suggesting that model performance may depend in part on inference of pre-CXR clinical context.
💡 Research Summary
The paper revisits the reported performance of state‑of‑the‑art chest X‑ray (CXR) classification models by incorporating clinical context that is available to clinicians before an image is taken. Using the publicly available MIMIC‑CXR dataset linked with MIMIC‑IV electronic health records, the authors extract discharge summaries from all prior admissions for each patient. From these notes they compute a “pre‑test probability” for each of the 13 disease labels (excluding “No Finding”) using a suite of language models (Mistral‑7B, PubMedBERT, BERT, ClinicalBERT, BioLinkBERT, RoBERTa) and a range of conventional classifiers (logistic regression, SVM, random forest, etc.). The best text‑only classifier for each label is calibrated with Platt scaling and then used to assign a probability that reflects the clinician’s prior belief about disease risk before the CXR is obtained.
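To make the calibration step concrete, here is a minimal sketch of Platt scaling via scikit-learn's `CalibratedClassifierCV` (with `method="sigmoid"`). The synthetic features and labels stand in for note embeddings and CXR labels; this is an illustration of the technique, not the authors' exact pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV

# Synthetic stand-ins: X_text would be features derived from discharge
# summaries, y the binary CXR label for one disease.
rng = np.random.default_rng(0)
X_text = rng.normal(size=(500, 16))
y = (X_text[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

# Platt scaling: fit a sigmoid to held-out scores of the base classifier
# so its outputs can be read as calibrated pre-test probabilities.
base = LogisticRegression(max_iter=1000)
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=5)
calibrated.fit(X_text, y)

pre_test_prob = calibrated.predict_proba(X_text)[:, 1]
```

In practice the calibrated probability would be computed per label, using whichever text-only classifier performed best for that label.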
Two complementary evaluation strategies are then applied. First, the test set is stratified into three groups by predicted pre‑test probability (bottom 25 %, middle 50 %, top 25 %). When standard SOTA vision models (e.g., CheXpert‑trained DenseNet‑121) are evaluated on these sub‑populations, AUROC and related metrics consistently drop for the high‑risk (top‑quartile) group, with reductions of 0.07–0.12 points across labels. This indicates that models perform worse when the prior probability of disease is already high, suggesting reliance on contextual cues rather than pure visual evidence.
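The stratified analysis can be sketched as follows: split the test set at the 25th and 75th percentiles of pre-test probability and compute AUROC within each stratum. The function name and stratum labels are ours, for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def stratified_auroc(pre_test_prob, y_true, y_score):
    """AUROC within bottom-25%, middle-50%, and top-25% strata of
    pre-test probability. Illustrative sketch of the stratified analysis."""
    q25, q75 = np.quantile(pre_test_prob, [0.25, 0.75])
    strata = {
        "bottom_25%": pre_test_prob <= q25,
        "middle_50%": (pre_test_prob > q25) & (pre_test_prob <= q75),
        "top_25%": pre_test_prob > q75,
    }
    # AUROC is only defined when both classes appear in a stratum.
    return {name: roc_auc_score(y_true[mask], y_score[mask])
            for name, mask in strata.items()
            if len(np.unique(y_true[mask])) == 2}
```

The reported pattern corresponds to the `top_25%` entry coming out systematically lower than the other strata for the vision models' scores.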
Second, the authors explicitly break the correlation between the image label and the pre‑test probability. They construct a “matched” evaluation set by pairing each positive CXR with a negative CXR that has a nearly identical pre‑test probability (using the Hungarian algorithm for optimal 1‑to‑1 matching). In this balanced set, model AUROC declines dramatically (by 0.15–0.20), demonstrating that when the text‑derived prior cannot be used to discriminate, the visual model’s diagnostic power is substantially weaker. A complementary “re‑weighted” set is created by assigning weights to examples so that the weighted distribution of pre‑test probabilities is the same for positives and negatives; performance also degrades, though less severely than in the matched case.
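The matched-set construction described above can be sketched with SciPy's `linear_sum_assignment` (an implementation of the Hungarian algorithm), pairing each positive CXR with the negative whose pre-test probability is closest under an optimal 1-to-1 assignment. The function name and cost choice (absolute difference in pre-test probability) are illustrative assumptions, not the paper's exact code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_positives_to_negatives(pre_pos, pre_neg):
    """Optimally pair each positive CXR with one negative CXR of similar
    pre-test probability. Requires len(pre_neg) >= len(pre_pos).
    Returns (positive indices, matched negative indices)."""
    cost = np.abs(np.asarray(pre_pos)[:, None] - np.asarray(pre_neg)[None, :])
    pos_idx, neg_idx = linear_sum_assignment(cost)
    return pos_idx, neg_idx

# Toy example: three positives matched against five candidate negatives.
pos = [0.9, 0.5, 0.2]
neg = [0.1, 0.55, 0.85, 0.4, 0.95]
pi, ni = match_positives_to_negatives(pos, neg)
```

Within the resulting matched set, pre-test probability is (near-)identical across positives and negatives, so any remaining discriminative power must come from the image itself; the re-weighted variant achieves the same balance in distribution rather than pairwise.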
The authors further analyze the text‑only predictors, showing that medically relevant terms (e.g., “clavicle”, “rib” for fracture) dominate feature importance, confirming that discharge summaries contain predictive signals for future CXR findings. They also discuss related work on dataset confounders, bias detection, and multimodal training, emphasizing that prior clinical notes have been under‑utilized as a diagnostic performance probe.
Overall, the study reveals that current CXR classification benchmarks may overstate true visual diagnostic ability because models can implicitly infer disease risk from correlated clinical context embedded in the images (e.g., scanner settings, patient demographics, prior treatments). The proposed framework—pre‑test probability estimation, stratified analysis, matched‑pair evaluation, and re‑weighting—offers a systematic way to assess and mitigate such contextual shortcuts. The authors argue that future model development should prioritize learning robust visual features that remain predictive even when prior clinical information is neutralized, and that similar context‑aware evaluation should become standard practice across medical imaging domains.