Uncertainty Quantification for Machine Learning: One Size Does Not Fit All
Proper quantification of predictive uncertainty is essential for the use of machine learning in safety-critical applications. Various uncertainty measures have been proposed for this purpose, typically claiming superiority over other measures. In this paper, we argue that there is no single best measure. Instead, uncertainty quantification should be tailored to the specific application. To this end, we use a flexible family of uncertainty measures that distinguishes between total, aleatoric, and epistemic uncertainty of second-order distributions. These measures can be instantiated with specific loss functions, so-called proper scoring rules, to control their characteristics, and we show that different characteristics are useful for different tasks. In particular, we show that, for the task of selective prediction, the scoring rule should ideally match the task loss. On the other hand, for out-of-distribution detection, our results confirm that mutual information, a widely used measure of epistemic uncertainty, performs best. Furthermore, in an active learning setting, epistemic uncertainty based on zero-one loss is shown to consistently outperform other uncertainty measures.
💡 Research Summary
This paper presents a critical analysis and a novel framework for uncertainty quantification (UQ) in machine learning, challenging the prevailing “one-size-fits-all” approach. The core thesis is that there is no single best measure of predictive uncertainty; instead, the choice of UQ method must be tailored to the specific downstream task where the uncertainty estimate will be used.
The authors begin by highlighting the essential role of UQ in safety-critical applications and the common but flawed practice of proposing new uncertainty measures as universally superior alternatives. They adopt the standard distinction between aleatoric uncertainty (inherent data noise) and epistemic uncertainty (model ignorance), typically represented via second-order distributions (e.g., from Bayesian posteriors or deep ensembles).
The technical foundation of their work is a flexible family of uncertainty measures derived from proper scoring rules. A proper scoring rule (e.g., log loss, Brier loss, zero-one loss) is a loss function that incentivizes a forecaster to report their true belief. The expected loss of such a rule can be decomposed into a “divergence” term and an “entropy” term. The authors generalize this decomposition to second-order distributions, defining Total Uncertainty (TU), Aleatoric Uncertainty (AU), and Epistemic Uncertainty (EU) as functionals parameterized by the choice of scoring rule ℓ. This framework subsumes popular measures: the log loss recovers the familiar Shannon-entropy and mutual-information (expected KL divergence) measures, the Brier loss yields Gini-impurity-based measures, and the zero-one loss leads to novel measures based on the maximum class probability.
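To make the decomposition concrete, here is a minimal sketch (not the authors' code) of how TU, AU, and EU can be computed from an ensemble of first-order predictions, which serves as a finite approximation of the second-order distribution. It assumes the common form of the decomposition: TU is the generalized entropy of the mean prediction, AU is the mean generalized entropy of the members, and EU is their difference (a Jensen gap, which for the log loss is exactly the mutual information).

```python
import numpy as np

# Generalized entropy H_l(p): the expected loss of the Bayes-optimal
# prediction under p, for the three scoring rules named in the summary.
def entropy_log(p):       # Shannon entropy (log loss)
    p = np.clip(p, 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)))

def entropy_brier(p):     # Gini impurity (Brier loss)
    return float(1.0 - np.sum(np.asarray(p) ** 2))

def entropy_zero_one(p):  # 1 - maximum probability (zero-one loss)
    return float(1.0 - np.max(p))

def decompose(members, entropy):
    """TU/AU/EU from an ensemble of first-order predictions.

    members: (M, K) array of M class distributions over K classes.
    TU = entropy of the mean prediction,
    AU = mean entropy of the members,
    EU = TU - AU.
    """
    members = np.asarray(members, dtype=float)
    tu = entropy(members.mean(axis=0))
    au = float(np.mean([entropy(p) for p in members]))
    return tu, au, tu - au

# Two members that agree exactly -> EU = 0 for every scoring rule.
agree = [[0.7, 0.3], [0.7, 0.3]]
# Two confident members that disagree -> large EU.
disagree = [[0.9, 0.1], [0.1, 0.9]]

for name, H in [("log", entropy_log), ("brier", entropy_brier),
                ("zero-one", entropy_zero_one)]:
    tu, au, eu = decompose(disagree, H)
    print(f"{name:8s} TU={tu:.3f} AU={au:.3f} EU={eu:.3f}")
```

Note how the zero-one instantiation only reacts to disagreement about the most likely class, which is the intuition the paper exploits for active learning.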
The paper’s primary contribution is rigorously demonstrating that different instantiations of this framework (i.e., different choices of ℓ) are optimal for different practical tasks. The authors introduce a key conceptual distinction between the task loss (used to evaluate performance on the downstream objective) and the uncertainty loss (the scoring rule ℓ used to compute the uncertainty measure). They argue theoretically and validate empirically that aligning these losses leads to optimal task performance.
The empirical analysis focuses on three canonical downstream tasks:
- Selective Prediction: The system can abstain from prediction when uncertain. The results show that the scoring rule used for UQ should ideally match the task loss. For instance, if the goal is to minimize classification error (zero-one loss), then using the zero-one loss to quantify uncertainty yields the best risk-coverage curve.
- Out-of-Distribution (OoD) Detection: Identifying samples from a different distribution than the training data. Here, the experiments confirm that mutual information—a measure of epistemic uncertainty derived from the log loss—performs best, aligning with the intuition that OoD samples increase model parameter uncertainty.
- Active Learning: Selecting the most informative unlabeled samples for annotation. In this setting, epistemic uncertainty as measured by the zero-one loss consistently outperforms other measures, effectively pinpointing samples where model ensemble members disagree on the most likely class label.
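The selective-prediction protocol in the first bullet can be sketched as follows. This is an illustrative toy example with synthetic data, not the paper's experimental setup: points are sorted by an uncertainty score (here, the zero-one total uncertainty, 1 minus the maximum probability of the ensemble mean), and the error rate (risk) is measured on the most certain fraction (coverage) that is retained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: M ensemble members predict over K classes for N points;
# labels are sampled from the ensemble-mean distribution.
M, N, K = 5, 200, 3
members = rng.dirichlet(np.ones(K) * 2.0, size=(M, N))  # (M, N, K)
mean_pred = members.mean(axis=0)                        # (N, K)
labels = np.array([rng.choice(K, p=p) for p in mean_pred])

# Zero-one total uncertainty of the mean prediction.
uncertainty = 1.0 - mean_pred.max(axis=1)
errors = (mean_pred.argmax(axis=1) != labels).astype(float)

# Risk-coverage trade-off: abstain on the most uncertain points.
risks = {}
order = np.argsort(uncertainty)  # most certain first
for coverage in (0.5, 0.8, 1.0):
    kept = order[: int(coverage * N)]
    risks[coverage] = errors[kept].mean()
    print(f"coverage={coverage:.1f} risk={risks[coverage]:.3f}")
```

The paper's claim is that which uncertainty score produces the best risk-coverage curve depends on the task loss: when the risk is classification error, the zero-one instantiation used above is the matching choice.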
In conclusion, the paper advocates for a paradigm shift from seeking a universal uncertainty metric to a principled, task-aware approach. By explicitly considering the end goal of UQ and customizing the measure through the lens of proper scoring rules, practitioners can achieve significantly better performance on their specific application, moving beyond the limitations of a one-size-fits-all methodology.