Calibrated Bayesian Deep Learning for Explainable Decision Support Systems Based on Medical Imaging
In critical decision support systems based on medical imaging, the reliability of AI-assisted decision-making is as important as predictive accuracy. Although deep learning models have demonstrated high accuracy, they frequently suffer from miscalibration, manifested as overconfidence in erroneous predictions. To facilitate clinical acceptance, models must quantify uncertainty in a manner that correlates with prediction correctness, allowing clinicians to flag unreliable outputs for further review. To address this need, the present paper proposes a generalizable probabilistic optimization framework grounded in Bayesian deep learning. Specifically, a novel Confidence-Uncertainty Boundary Loss (CUB-Loss) is introduced that penalizes high-certainty errors and low-certainty correct predictions, explicitly enforcing alignment between prediction correctness and uncertainty estimates. Complementing this training-time optimization, a Dual Temperature Scaling (DTS) strategy is devised for post-hoc calibration, further refining the posterior distribution to improve intuitive explainability. The proposed framework is validated on three distinct medical imaging tasks: automatic pneumonia screening, diabetic retinopathy detection, and skin-lesion identification. Empirical results demonstrate that the approach achieves consistent calibration improvements across diverse modalities, maintains robust performance in data-scarce scenarios, and remains effective on severely imbalanced datasets, underscoring its potential for real-world clinical deployment.
💡 Research Summary
The paper addresses a critical gap in medical‑image decision‑support systems: the need for models that are not only accurate but also well‑calibrated, i.e., their confidence scores must reliably indicate correctness. While deep neural networks achieve expert‑level performance, they often suffer from over‑confidence, especially under data scarcity or severe class imbalance, which hampers clinical adoption. To solve this, the authors propose a two‑stage probabilistic optimization framework built on Bayesian Neural Networks (BNNs) with variational inference (VI) and Monte‑Carlo sampling.
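Concretely, a VI-trained BNN (or an MC-Dropout model) forms its predictive distribution by averaging several stochastic forward passes, and the entropy of that average is a common total-uncertainty estimate. A minimal NumPy sketch of this Monte-Carlo step (the function name is mine, not the paper's):

```python
import numpy as np

def mc_predictive(prob_samples):
    """Average S stochastic softmax passes, shape (S, N, C), into a
    predictive posterior and use its entropy as total uncertainty."""
    p = np.asarray(prob_samples, dtype=float)
    mean = p.mean(axis=0)                                  # (N, C) predictive posterior
    entropy = -(mean * np.log(mean + 1e-12)).sum(axis=1)   # (N,) uncertainty per input
    return mean, entropy
```

An input whose sampled predictions average to a uniform distribution gets entropy near log C (maximally uncertain), while one the sampled networks agree on confidently gets entropy near zero.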
The first novelty is the Confidence‑Uncertainty Boundary Curve (CUBC), a theoretical construct that defines an ideal geometric relationship between a model's predictive confidence and its estimated uncertainty: high confidence should correspond to low uncertainty, and vice versa. Based on the CUBC, the authors introduce the Confidence‑Uncertainty Boundary Loss (CUB‑Loss). Unlike prior count‑based Accuracy‑vs‑Uncertainty (AvU) objectives, CUB‑Loss is a continuous, distance‑based loss that penalizes each individual sample's deviation from the CUBC. Consequently, high‑confidence errors (Inaccurate‑Certain) and low‑confidence correct predictions (Accurate‑Uncertain) receive large gradients, forcing the network to align uncertainty with correctness during training.
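The summary describes CUB-Loss only at a high level, so the paper's exact formula is not reproduced here. The sketch below is one plausible continuous, per-sample penalty in the same spirit: the function name, the linear penalties, and the confidence weighting on errors are my assumptions, not the authors' definition.

```python
import numpy as np

def cub_loss(confidence, uncertainty, correct):
    """Illustrative continuous CUB-style penalty (not the paper's exact loss).

    confidence, uncertainty: arrays of values in [0, 1]
    correct: boolean array, True where the prediction matched the label
    """
    conf = np.asarray(confidence, dtype=float)
    unc = np.asarray(uncertainty, dtype=float)
    ok = np.asarray(correct, dtype=bool)
    # Accurate-Uncertain: correct but uncertain -> penalty grows with u
    pen_correct = unc
    # Inaccurate-Certain: wrong but certain -> penalty grows with (1 - u),
    # scaled by confidence so high-confidence errors are punished hardest
    pen_wrong = conf * (1.0 - unc)
    return np.where(ok, pen_correct, pen_wrong).mean()
```

Under this toy formulation, a certain prediction (c = 0.9, u = 0.1) incurs a penalty of 0.1 when it is correct but 0.81 when it is wrong, reproducing the asymmetry the loss is meant to enforce.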
The second contribution is Dual Temperature Scaling (DTS), a post‑hoc calibration method that extends classic Temperature Scaling (TS). Standard TS rescales all logits with a single scalar temperature and therefore cannot differentiate between correct and incorrect predictions. DTS instead learns two separate temperatures on a validation set: T₁, applied to correctly classified samples, and T₂, applied to misclassified ones. This bidirectional scaling sharpens the separation in the uncertainty space without affecting classification accuracy, improving the model's ability to flag unreliable outputs.
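A hedged sketch of how the two temperatures might be fitted on a held-out validation set, where correctness is known. The grid search and the objectives (NLL for the correct group, maximum predictive entropy as a softening proxy for the incorrect group) are my assumptions; the summary does not state the paper's exact fitting criterion.

```python
import numpy as np

def softmax(z, T):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_dual_temperatures(logits, labels, grid=None):
    """Illustrative DTS fitting: one temperature per correctness group."""
    if grid is None:
        grid = np.linspace(0.25, 4.0, 76)
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    ok = logits.argmax(axis=1) == labels

    def nll(T, lg, y):
        p = softmax(lg, T)
        return -np.log(p[np.arange(len(y)), y] + 1e-12).mean()

    def neg_entropy(T, lg):
        p = softmax(lg, T)
        return (p * np.log(p + 1e-12)).sum(axis=1).mean()

    # T1 sharpens correct predictions (lower NLL on the true label);
    # T2 softens incorrect ones (proxy objective: maximise entropy).
    t1 = min(grid, key=lambda T: nll(T, logits[ok], labels[ok]))
    t2 = min(grid, key=lambda T: neg_entropy(T, logits[~ok]))
    return float(t1), float(t2)
```

On validation logits with a positive margin for correct samples, this yields T₁ < 1 (sharper) and T₂ > 1 (softer), widening the confidence gap between the two groups without changing any argmax prediction.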
The framework is evaluated on three heterogeneous medical imaging tasks: (1) pneumonia screening from chest X‑rays, (2) diabetic retinopathy grading from fundus photographs, and (3) multi‑class skin‑lesion classification. Each task is tested under realistic challenges—near‑out‑of‑distribution (OOD) inputs, extreme class imbalance (up to 1:100), and limited training data (as low as 5 % of the full set). Metrics include Expected Calibration Error (ECE), Uncertainty Calibration Error (UCE), and the AvU score (the proportion of samples that are either Accurate‑Certain or Inaccurate‑Uncertain). Compared with baselines such as MC‑Dropout, standard VI‑BNNs, and recent Soft‑AvUC methods, the CUB‑Loss + DTS pipeline consistently reduces ECE and UCE by roughly 30‑45 % and improves AvU scores across all domains. Notably, performance remains robust in data‑scarce regimes, and uncertainty spikes appropriately on near‑OOD samples, providing a practical safety signal for clinicians.
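The calibration metrics above have standard definitions. A compact sketch of ECE and of a thresholded AvU score as described in the summary (the default bin count and the 0.5 uncertainty threshold are illustrative choices, not the paper's settings):

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """Standard ECE: bin predictions by confidence, then average the
    |accuracy - mean confidence| gap per bin, weighted by bin size."""
    conf = np.asarray(confidence, dtype=float)
    ok = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(ok[mask].mean() - conf[mask].mean())
    return ece

def avu_score(uncertainty, correct, u_thresh=0.5):
    """AvU: fraction of samples that are Accurate-Certain or
    Inaccurate-Uncertain at a given uncertainty threshold."""
    unc = np.asarray(uncertainty, dtype=float)
    ok = np.asarray(correct, dtype=bool)
    certain = unc < u_thresh
    return ((ok & certain) | (~ok & ~certain)).mean()
```

A perfectly calibrated model has ECE near zero, and a model whose uncertainty cleanly separates its correct from incorrect predictions has an AvU score of 1.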
Strengths of the work include: (i) a mathematically grounded loss that is fully differentiable and can be dropped into any existing training pipeline; (ii) a lightweight post‑hoc calibration that requires only a validation set and does not degrade predictive accuracy; (iii) demonstration of generalizability across modalities, imbalance levels, and OOD conditions. Limitations are the computational overhead of VI‑based BNN training, the need to tune CUBC parameters (e.g., slope) per dataset, and the fact that DTS uses only two scalar temperatures, which may not capture more complex non‑linear calibration needs. Future directions suggested are automated, data‑driven CUBC parameter learning and extending DTS to multi‑temperature or sample‑wise scaling schemes.
In summary, the paper delivers a coherent, theoretically justified, and empirically validated solution for aligning predictive confidence with uncertainty in Bayesian deep learning for medical imaging. By jointly optimizing CUB‑Loss during training and applying Dual Temperature Scaling at inference, the proposed system offers clinicians calibrated, explainable predictions that can be trusted in high‑stakes diagnostic workflows.