CONRep: Uncertainty-Aware Vision-Language Report Drafting Using Conformal Prediction
Automated radiology report drafting (ARRD) using vision-language models (VLMs) has advanced rapidly, yet most systems lack explicit uncertainty estimates, limiting trust and safe clinical deployment. We propose CONRep, a model-agnostic framework that integrates conformal prediction (CP) to provide statistically grounded uncertainty quantification for VLM-generated radiology reports. CONRep operates at both the label level, by calibrating binary predictions for predefined findings, and the sentence level, by assessing uncertainty in free-text impressions via image-text semantic alignment. We evaluate CONRep using both generative and contrastive VLMs on public chest X-ray datasets. Across both settings, outputs classified as high confidence consistently show significantly higher agreement with radiologist annotations and ground-truth impressions than low-confidence outputs. By enabling calibrated confidence stratification without modifying underlying models, CONRep improves the transparency, reliability, and clinical usability of automated radiology reporting systems.
💡 Research Summary
The paper introduces CONRep, a model‑agnostic framework that equips vision‑language models (VLMs) with statistically rigorous uncertainty estimates using conformal prediction (CP). The authors argue that while recent advances in automated radiology report drafting (ARRD) have dramatically improved the quality of generated reports, the lack of explicit confidence measures hampers clinical adoption. CONRep addresses this gap by providing two complementary pipelines.
- Label‑level uncertainty – The task is cast as ten independent binary classifications (e.g., consolidation, effusion, pneumothorax) on the ChestX‑Det10 dataset. Two VLM families are evaluated:
- MedGemma, a decoder‑only generative model, is prompted with a constrained “Yes/No” question for each pathology. Token‑level softmax probabilities for “Yes” and “No” are extracted and normalized to obtain a continuous confidence score.
- BiomedCLIP, a contrastive model, embeds images and two textual prompts (“with condition” and “without condition”). The difference in cosine similarities is passed through a sigmoid to produce a pseudo‑probability.
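The two scoring routes above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: function names, signatures, and the absence of a temperature term are our assumptions.

```python
import numpy as np

def yes_no_confidence(logit_yes, logit_no):
    """Softmax over the two answer tokens, renormalized to a positive-finding
    probability (MedGemma-style scoring; names are illustrative)."""
    logits = np.array([logit_yes, logit_no], dtype=float)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return float(probs[0])  # P("Yes") after renormalizing over {Yes, No}

def contrastive_confidence(img_emb, pos_emb, neg_emb):
    """Sigmoid of the cosine-similarity gap between the 'with condition' and
    'without condition' prompts (BiomedCLIP-style pseudo-probability)."""
    def cos(a, b):
        a, b = np.asarray(a, float), np.asarray(b, float)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    diff = cos(img_emb, pos_emb) - cos(img_emb, neg_emb)
    return 1.0 / (1.0 + np.exp(-diff))
```

Both routes end in a single scalar in [0, 1], which is what makes the downstream conformal calibration model-agnostic.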
For each model, a calibration split (30% of the data) is used to compute non‑conformity scores (NCS) for positive and negative cases. Given a target error level α (0.05, 0.1, 0.2), the (1 − α)‑quantile of the calibration NCS distribution defines a threshold. At test time, a prediction set is formed by including every class whose NCS ≤ threshold. Sets of size one are labeled “certain” (high confidence), while sets of size two are labeled “uncertain”. Performance metrics (accuracy, AUROC, AUPRC, sensitivity, specificity) are reported separately for the certain and uncertain subsets. Across both MedGemma and BiomedCLIP, the certain subset consistently achieves markedly higher scores (e.g., AUROC > 0.85 vs. < 0.75 for uncertain) and the differences are statistically significant (p < 0.01).
- Sentence‑level uncertainty – This pipeline evaluates the free‑text “impression” section, reflecting real‑world radiology workflow. MedGemma is few‑shot prompted to generate only the impression; BiomedCLIP computes embeddings for the generated impression and the corresponding X‑ray, and their cosine similarity is min‑max normalized to a pseudo‑probability. The same CP calibration procedure is applied, yielding three categories: certain, uncertain, and highly uncertain. Semantic fidelity is assessed by computing cosine similarity between embeddings of generated impressions and ground‑truth impressions. The certain group shows substantially higher similarity scores than the uncertain and highly uncertain groups (p < 0.01).
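The sentence-level scoring and three-way stratification can be sketched as follows; the normalization bounds and both cutoffs are illustrative placeholders that would in practice come from the calibration split, not values taken from the paper.

```python
import numpy as np

def minmax_pseudo_prob(cos_sim, cal_min, cal_max):
    """Min-max normalize an image-impression cosine similarity into [0, 1];
    cal_min/cal_max would be estimated on the calibration split."""
    return float(np.clip((cos_sim - cal_min) / (cal_max - cal_min), 0.0, 1.0))

def certainty_bucket(score, thr_certain, thr_uncertain):
    """Map a calibrated score to the paper's three categories; the two
    cutoff values passed in here are hypothetical."""
    if score >= thr_certain:
        return "certain"
    if score >= thr_uncertain:
        return "uncertain"
    return "highly uncertain"
```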
Statistical analysis includes Shapiro‑Wilk normality testing, Welch’s t‑test or the Mann‑Whitney U test as appropriate, and Pearson or Spearman correlation to link alignment scores with classification performance.
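This test-selection logic can be sketched with SciPy on toy data; the score distributions below are synthetic and serve only to illustrate the normality-gated choice between the parametric and non-parametric test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
certain = rng.normal(0.85, 0.05, 200)    # synthetic alignment scores
uncertain = rng.normal(0.70, 0.08, 200)

# Shapiro-Wilk normality check on each group
normal = all(stats.shapiro(g).pvalue > 0.05 for g in (certain, uncertain))

# Welch's t-test if both groups look normal, otherwise Mann-Whitney U
if normal:
    _, p = stats.ttest_ind(certain, uncertain, equal_var=False)
else:
    _, p = stats.mannwhitneyu(certain, uncertain, alternative="two-sided")

# Spearman correlation between alignment and a downstream performance proxy
rho, _ = stats.spearmanr(certain, certain + rng.normal(0, 0.01, 200))
```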
The key contribution of CONRep is that it provides distribution‑free coverage guarantees without altering the underlying VLM architecture. By post‑hoc calibration, it can be applied to any existing ARRD system, delivering interpretable confidence labels that clinicians can use to decide whether to accept a generated report or request further review. The experiments demonstrate that high‑confidence (certain) outputs are reliably more accurate, thereby enhancing transparency, trust, and potential safety of automated radiology reporting in clinical practice.