Credal Ensemble Distillation for Uncertainty Quantification
📝 Abstract
Deep ensembles (DE) have emerged as a powerful approach for quantifying predictive uncertainty and distinguishing its aleatoric and epistemic components, thereby enhancing model robustness and reliability. However, their high computational and memory costs during inference pose significant challenges for wide practical deployment. To overcome this issue, we propose credal ensemble distillation (CED), a novel framework that compresses a DE into a single model, CREDIT, for classification tasks. Instead of a single softmax probability distribution, CREDIT predicts class-wise probability intervals that define a credal set, a convex set of probability distributions, for uncertainty quantification. Empirical results on out-of-distribution detection benchmarks demonstrate that CED achieves superior or comparable uncertainty estimation compared to several existing baselines, while substantially reducing inference overhead compared to DE.
📄 Content
Uncertainty quantification (UQ) in neural networks (NNs) has gained increasing attention, with two primary types of uncertainty distinguished: aleatoric uncertainty (AU), which stems from the inherent randomness in the data generation process and models the stochasticity in the output given an input (i.e., via a conditional distribution p(output|input)); and epistemic uncertainty (EU), which is caused by a lack of evidence and reflects the model’s imprecise knowledge of the true conditional distribution (Hüllermeier and Waegeman 2021; Hüllermeier, Destercke, and Shaker 2022; Wang et al. 2024, 2025b). The effective estimation and differentiation of AU and EU can improve a model’s trustworthiness and robustness (Senge et al. 2014; Kendall and Gal 2017; Sale, Caprio, and Hüllermeier 2023; Manchingal et al. 2025). For example, proper EU estimates can help avoid misclassifying ambiguous in-distribution (ID) samples as out-of-distribution (OOD), since their ambiguity does not necessarily correspond to regions of high EU within the ID distribution (Mukhoti et al. 2023; Wang et al. 2025a).
To quantify both AU and EU, recent studies propose training NNs to predict a second-order representation, capable of expressing the uncertainty about a prediction’s uncertainty itself (Malinin, Mlodozeniec, and Gales 2019; Hüllermeier and Waegeman 2021; Caprio et al. 2024; Wang et al. 2024). Bayesian neural networks (BNNs) (Blundell et al. 2015; Gal and Ghahramani 2016; Krueger et al. 2017; Mobiny et al. 2021), in particular, learn posterior distributions over their weights and enable predictions in the form of second-order distributions (Caprio et al. 2024). However, BNNs generally face significant challenges in scaling to large datasets and complex architectures due to high computational demands (Mukhoti et al. 2023). Their performance is also sensitive to the choice of prior, likelihood, and training objectives (Henning, D’Angelo, and Grewe 2021; Knoblauch, Jewson, and Damoulas 2022).
As an alternative to BNNs, deep ensembles (DE), which combine multiple standard neural networks (SNNs) to predict a finite set of distributions (Lakshminarayanan, Pritzel, and Blundell 2017), have been treated as a strong UQ baseline (Ovadia et al. 2019; Gustafsson, Danelljan, and Schon 2020; Abe et al. 2022; Mucsányi, Kirchhof, and Oh 2024). Nevertheless, a key limitation of DEs is their substantial demand for memory and computational resources. To address this, ensemble distillation (ED) has become a popular way of significantly reducing inference costs (Hinton, Vinyals, and Dean 2015; Lin et al. 2020) by distilling a DE into an SNN that approximates the mean of the DE’s predictive distributions. However, one drawback of ED is that the distilled SNN only generates a single predictive distribution, limiting its ability to quantify EU. This is because a single distribution captures randomness in the mapping between input and output while assuming precise knowledge of this dependency (Hüllermeier, Destercke, and Shaker 2022).
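To make this limitation concrete, the sketch below shows the standard entropy decomposition commonly applied to DE predictions: total uncertainty (entropy of the averaged prediction) splits into an aleatoric part (average member entropy) and an epistemic part (their difference, the mutual information). A single distilled mean distribution retains only the first quantity and can no longer separate the two components. Function names here are illustrative, not from the paper.

```python
import math

def entropy(p):
    """Shannon entropy in nats; skips zero-probability classes."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def decompose(ensemble_probs):
    """Split the total predictive uncertainty of a deep ensemble.

    ensemble_probs: list of per-member softmax vectors for one input.
    Returns (total, aleatoric, epistemic), where
      total     = entropy of the averaged prediction,
      aleatoric = average entropy of the members,
      epistemic = total - aleatoric (the mutual information).
    """
    m = len(ensemble_probs)
    k = len(ensemble_probs[0])
    mean = [sum(p[c] for p in ensemble_probs) / m for c in range(k)]
    total = entropy(mean)
    aleatoric = sum(entropy(p) for p in ensemble_probs) / m
    return total, aleatoric, total - aleatoric

# Members that disagree strongly yield high epistemic uncertainty,
# even though the averaged prediction looks maximally "uncertain".
t, au, eu = decompose([[0.9, 0.1], [0.1, 0.9]])
```

Note that the disagreeing ensemble above has the same mean ([0.5, 0.5]) as two members that both output [0.5, 0.5]; only the decomposition, not the mean alone, tells these two cases apart.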
To address this problem, ensemble distribution distillation (EDD) (Malinin, Mlodozeniec, and Gales 2019) has been proposed to distill a DE into a single model that outputs a Dirichlet distribution as its second-order prediction. Yet, one practical challenge in EDD and other Dirichlet-based methods (DBMs) (Malinin and Gales 2018, 2019; Charpentier, Zügner, and Günnemann 2020) is the absence of ground-truth Dirichlet labels for training. In addition, DBMs have recently faced criticism for departing from the theoretical tenets of epistemic uncertainty (Ulmer, Hardmeier, and Frellsen 2023) and for failing to provide a meaningful quantitative interpretation of EU (Juergens et al. 2024).
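The intuition behind a Dirichlet second-order prediction can be sketched via moment matching: the Dirichlet mean tracks the ensemble average, while its precision (the sum of the concentration parameters) shrinks as members disagree. The sketch below is a simplified moment-matching illustration only; EDD itself fits the Dirichlet by maximizing the likelihood of the ensemble's outputs, and all names here are hypothetical.

```python
def fit_dirichlet_moments(ensemble_probs):
    """Moment-match a Dirichlet to ensemble softmax outputs for one input.

    For Dirichlet(alpha) with precision a0 = sum(alpha):
      E[p_c]   = alpha_c / a0
      Var[p_c] = E[p_c] * (1 - E[p_c]) / (a0 + 1)
    Each class therefore yields an estimate of a0; we average
    them for stability and return alpha = a0 * mean.
    """
    m = len(ensemble_probs)
    k = len(ensemble_probs[0])
    mean = [sum(p[c] for p in ensemble_probs) / m for c in range(k)]
    var = [sum((p[c] - mean[c]) ** 2 for p in ensemble_probs) / m
           for c in range(k)]
    a0_estimates = [mean[c] * (1 - mean[c]) / var[c] - 1
                    for c in range(k) if var[c] > 0]
    a0 = sum(a0_estimates) / len(a0_estimates)
    return [a0 * mc for mc in mean]

# Mildly disagreeing members -> moderate precision around the mean [0.6, 0.4].
alpha = fit_dirichlet_moments([[0.6, 0.4], [0.5, 0.5], [0.7, 0.3]])
```

A more spread-out ensemble would produce larger per-class variances and hence a smaller precision a0, which is exactly the behavior a Dirichlet second-order prediction uses to encode epistemic uncertainty.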
In an alternative approach, credal sets, i.e., convex sets of probability distributions (Levi 1980), have been employed for UQ in a broader machine learning context (Zaffalon 2002; Corani and Zaffalon 2008; Corani, Antonucci, and Zaffalon 2012; Mauá et al. 2017). This approach has recently garnered renewed attention in deep learning. Recent advances include modeling both NN weights and outputs as credal sets (Caprio et al. 2024), deriving credal set predictions from predicted probability intervals (Wang et al. 2024, 2025c), and wrapping the predictive probabilities of BNNs and DEs into a credal set (Wang et al. 2025a), to name a few. Although these credal predictors offer improved UQ compared to BNN and DE baselines, they generally demand even greater computational resources for inference.
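One simple way to wrap a DE's predictions into a credal set, in the spirit of the credal wrapping mentioned above, is to take per-class lower and upper envelopes over the ensemble members: the induced credal set contains every distribution whose class probabilities lie inside all intervals. A minimal sketch follows; the interval-width score is an illustrative EU proxy for this sketch, not a measure taken from the paper.

```python
def credal_intervals(ensemble_probs):
    """Class-wise probability intervals wrapping ensemble predictions.

    ensemble_probs: list of per-member softmax vectors for one input.
    Returns [(lower_c, upper_c)] per class; the credal set is every
    distribution q with lower_c <= q_c <= upper_c for all classes c.
    """
    k = len(ensemble_probs[0])
    return [(min(p[c] for p in ensemble_probs),
             max(p[c] for p in ensemble_probs)) for c in range(k)]

def epistemic_width(intervals):
    """Illustrative epistemic-uncertainty proxy: total interval width.

    Wide intervals mean the members disagree (large credal set);
    zero width means they all agree (a single precise distribution).
    """
    return sum(upper - lower for lower, upper in intervals)

# Strong member disagreement -> wide intervals, large credal set.
intervals = credal_intervals([[0.9, 0.1], [0.1, 0.9]])
```

When all members agree, every interval collapses to a point and the credal set reduces to one precise distribution, recovering the single-distribution case that ED distills to.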
In this context, a research question arises: Can a single NN that predicts a credal set as a second-order representation be distilled from a DE, thereby improving the UQ performance of existing distillation frameworks?
Novelty and Contributions In response, this paper proposes an innovative distillation framework, termed credal ensemble distillation (CED), to distill a DE teacher into a single model, called CREDIT. The distilled CREDIT can predict class-wise probability intervals that define a credal set, serving as a second-order representation for uncertainty quantification.
This content is AI-processed based on ArXiv data.