Robust Uncertainty Quantification Using Conformalised Monte Carlo Prediction


Daniel Bethell, Simos Gerasimou, Radu Calinescu
Department of Computer Science, University of York
York, United Kingdom
{daniel.bethell, simos.gerasimou, radu.calinescu}@york.ac.uk
Abstract

Deploying deep learning models in safety-critical applications remains a very challenging task, mandating the provision of assurances for the dependable operation of these models. Uncertainty quantification (UQ) methods estimate the model's confidence per prediction, informing decision-making by considering the effect of randomness and model misspecification. Despite the advances of state-of-the-art UQ methods, they are computationally expensive or produce conservative prediction sets/intervals. We introduce MC-CP, a novel hybrid UQ method that combines a new adaptive Monte Carlo (MC) dropout method with conformal prediction (CP). MC-CP adaptively modulates the traditional MC dropout at runtime to save memory and computation resources, enabling predictions to be consumed by CP and yielding robust prediction sets/intervals. Through comprehensive experiments, we show that MC-CP delivers significant improvements over comparable UQ methods, like MC dropout, RAPS and CQR, in both classification and regression benchmarks. MC-CP can be easily added to existing models, making its deployment simple. The MC-CP code and replication package is available at https://github.com/team-daniel/MC-CP.

Introduction

Advances in Deep Learning (DL) enable its employment in diverse and challenging tasks, including speech recognition (Kumar et al. 2020) and image annotation (Barnard et al. 2003). Despite its numerous potential applications, using DL in safety-critical applications (e.g., medical imaging/diagnosis) mandates ensuring its dependable and robust operation (Pereira and Thomas 2020; Gerasimou et al. 2020). Uncertainty quantification (UQ) is crucial in assessing the DL model's confidence for input-prediction pairs and establishing the potential impact of noisy, sparse, or low-quality inputs and of misspecification in DL models (Kendall, Badrinarayanan, and Cipolla 2016). Ultimately, UQ enables understanding situations where the model is particularly uncertain, instrumenting uncertainty-aware decision-making (Calinescu et al. 2018).

DL-focused methods for UQ aim at assessing the model and data uncertainty of DL models (Abdar et al. 2021). In particular, Monte Carlo (MC) dropout (Gal and Ghahramani 2016) elegantly quantifies uncertainty within DL models by outputting the standard deviation of predictions from an ensemble of networks using dropout layers. Running numerous forward passes is, however, computationally expensive. Similarly, Bayesian Neural Networks (BNNs) (MacKay 1992) constitute a more natural UQ method that can estimate both epistemic and aleatoric uncertainty. However, BNNs are computationally intensive both during training and inference and require substantial fine-tuning. Finally, conformal prediction (CP) (Vovk, Gammerman, and Shafer 2005) produces prediction sets/intervals instead of singletons. The larger the set/interval, the more unsure the model is about its prediction, with a singleton prediction/narrow interval typically signifying large confidence.
Despite their merits, CP methods are over-conservative, producing larger sets/intervals than necessary (Fan, Ge, and Mukherjee 2023).

Driven by these advances, we introduce Monte Carlo-Conformal Prediction (MC-CP), a novel hybrid method that comprises adaptive MC dropout and conformal predictive techniques, inheriting both the statistical efficiency of the former and the distribution-free coverage guarantee of conformal prediction. MC-CP dynamically adapts the conventional MC dropout with a convergence assessment, saving memory and computational resources during inference where possible. The predictions are then consumed by advanced CP techniques to synthesize robust prediction sets/intervals. Our experimental evaluation shows that the hybrid MC-CP approach overestimates less than regular CP methods. Despite its simplicity, it outperforms state-of-the-art CP- and MC-based methods, e.g., traditional MC dropout, RAPS (Angelopoulos et al. 2022) and CQR (Romano, Patterson, and Candes 2019), in both classification and regression benchmarks. While RAPS and CQR quantify uncertainty by increasing the prediction set/interval size, MC-CP does this and also outputs an exact quantification in the form of the variance of the prediction distribution. Our MC-CP method is designed to be implemented at inference time, in contrast to evidential deep learning and Bayesian neural networks. Whilst these methods provide salient and informative UQ estimations, MC-CP is realised post-training. Our contributions are:
• An adaptive MC dropout method that can save computational resources compared to the original method;
• The hybrid MC-CP method that addresses major issues common with CP methods, yielding significant improvements across several metrics and datasets;
• A comprehensive empirical MC-CP evaluation against state-of-the-art UQ methods (MC dropout, RAPS, CQR) on various benchmarks, including CIFAR-10, CIFAR-100, MNIST, Fashion-MNIST, and Tiny ImageNet.

Paper Structure: The next two sections discuss related UQ work and background material. The two sections that follow present MC-CP and its empirical evaluation, and the final section concludes the paper.

Related Work

Uncertainty Quantification (UQ) in DL indicates how uncertain a model is about its predictions. The most common uncertainty types are aleatoric and epistemic. The former surrounds the irreducible uncertainty within data (e.g., random noise). The latter is the model's lack of knowledge or poor training, which can be reduced with more data or better training. MC-CP focuses on quantifying epistemic uncertainty.

Deep ensembles is a straightforward method to quantify uncertainty in DL (Lakshminarayanan, Pritzel, and Blundell 2017). The method involves training an ensemble of networks with the same or similar architecture, initialised with different weights. After training, the ensemble predicts on the same input data, using the mean of the predictions as the final prediction and the variance as the uncertainty.

Monte Carlo (MC) dropout (Gal and Ghahramani 2016) is a simple and effective method to compute epistemic uncertainty in DL models by exploiting dropout (Srivastava et al. 2014), a regularization technique that randomly drops units of the neural network to prevent reliance on certain weights. Although dropout is typically used during training, MC dropout keeps this feature active during inference and performs several forward passes to devise a prediction distribution. The final prediction is the mean of the distribution, and the variance signifies the uncertainty.
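To make this concrete, the following PyTorch sketch shows plain (non-adaptive) MC dropout inference. It is a minimal illustration rather than the authors' implementation (the replication package linked above contains theirs), and it assumes the model contains dropout layers but no batch-normalisation layers, whose running statistics would also be affected by train mode.

```python
import torch

def mc_dropout_predict(model, x, num_passes=100):
    """Plain MC dropout: keep dropout active at inference time and
    aggregate the outputs of num_passes stochastic forward passes."""
    model.train()  # re-enables nn.Dropout; assumption: no BatchNorm present
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(num_passes)])
    model.eval()
    # Mean of the prediction distribution is the final prediction;
    # its variance is the epistemic uncertainty estimate.
    return preds.mean(dim=0), preds.var(dim=0)
```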
Gaussian dropout (Kingma, Salimans, and Welling 2015) complements regular dropout by adding noise drawn from a Gaussian distribution instead of setting the unit's value to zero.

Bayesian Neural Networks (BNNs) (Kendall, Badrinarayanan, and Cipolla 2016) realise UQ directly in the model's architecture. While in traditional DL networks weights are singleton variables, in BNNs weights are represented as distributions. Although BNNs produce probabilistic predictions that naturally capture uncertainty, they are computationally intensive and require substantially more training than standard networks, resorting to approximate Bayesian computation techniques like variational inference.

Conformal prediction (CP) (Vovk, Gammerman, and Shafer 2005) is a framework that uses validity to quantify a model's prediction confidence. Validity encodes that, on average, a model's predictions will be correct within a guaranteed confidence level (e.g., 90% of the time). The method then alters the prediction from a singleton/point to a set/interval that indicates the confidence level of the model. The larger the set/interval, the more uncertain the model is, and vice versa. CP involves splitting the test data into two sets: a calibration set and a test set. The calibration set is used to estimate the thresholds needed to achieve the desired confidence levels. CP has been applied to a diverse set of applications, e.g., image classification (Angelopoulos et al. 2022), regression (Romano, Patterson, and Candes 2019), and object detection (de Grancey et al. 2022).

An orthogonal method is test-time augmentation (Wang et al. 2019; Moshkov et al. 2020), which alters the data at inference time instead of the model or predictions. Given an input, the method creates multiple augmented inputs using various augmentation techniques. The DL model then makes predictions for the augmented inputs; their distribution and variance represent the model's uncertainty. Data augmentation using generative AI has also been proposed to enhance the inference capabilities of DL models (Missaoui, Gerasimou, and Matragkas 2023).

Preliminaries

Given a coverage level $\alpha \in (0, 1)$, signifying a probability guarantee of $(1-\alpha)$ that the true label/point is in the prediction set/interval, conformal prediction (CP) constructs a prediction set/interval instead of a singleton/point. To achieve this, CP splits the test dataset into a calibration set $c$ and a validation set $v$. Next, conformal scores $s(f(x_i), y_i) \in \mathbb{R}$ are calculated for each $(x_i, y_i) \in c$. This score is high when the model $f(\cdot)$ produces a low softmax output for the true class, i.e., when the model is very wrong. A quantile threshold $\hat{q}$, set to the $\lceil (n+1)(1-\alpha) \rceil / n$ empirical quantile of the calibration scores, is calculated using the desired coverage $\alpha$, the calibration set $c$, and the size $n$ of the calibration set. This threshold is used to form prediction sets $C(x_j) = \{\, y : f(x_j)_y \ge 1 - \hat{q} \,\}$ for each new input $x_j$ (e.g., from the validation set $v$). For quantile regression, prediction intervals are formed by $C(x_j) = [\, t_{\alpha/2}(x_j) - \hat{q},\; t_{1-\alpha/2}(x_j) + \hat{q} \,]$, where $t$ are the $\alpha$-informed quantiles produced by the trained model.
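The split-conformal recipe just described can be sketched in a few lines of numpy for a softmax classifier. The function names and the use of the "naive" score $1 - f(x)_y$ are illustrative assumptions consistent with the definitions above, not code taken from the paper:

```python
import numpy as np

def conformal_threshold(softmax_cal, labels_cal, alpha=0.05):
    """Scores are 1 - softmax of the true class (high when the model is
    badly wrong); the threshold is the ceil((n+1)(1-alpha))/n quantile."""
    n = len(labels_cal)
    scores = 1.0 - softmax_cal[np.arange(n), labels_cal]
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, level, method="higher")

def prediction_set(softmax_new, q_hat):
    """C(x) = { y : f(x)_y >= 1 - q_hat } for a new input's softmax vector."""
    return np.where(softmax_new >= 1.0 - q_hat)[0].tolist()
```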
Coverage is a key metric for assessing CP, measuring how often the predicted set/interval contains the ground truth. Coverage is expected to reflect the desired coverage property $1-\alpha$. Given model $f$, coverage is calculated by:

$$\mathit{Coverage}(f_\alpha) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{\, y_i \in f_\alpha(x_i) \,\} \qquad (1)$$

where $\alpha$ is the user-defined coverage level, $n$ is the size of the validation set, $y_i$ is the true label/value, and $f_\alpha(x_i)$ is the prediction set/interval made by the model for input $x_i$. This equation reflects the percentage of true labels/values captured by the respective prediction sets/intervals.

Efficiency is another important CP metric. While including all possible classes in a prediction set would, by default, yield a perfect accuracy score, it is impractical. Thus, a DL model that achieves the desired coverage efficiently is preferred. Efficiency is calculated as the average expected size of the set/interval, given by:

$$\mathit{Size}(f_\alpha) = \frac{1}{n} \sum_{i=1}^{n} |\, f_\alpha(x_i) \,| \qquad (2)$$
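Both metrics are straightforward to compute over a validation set. A small numpy sketch follows (names are illustrative; the membership test is shown for classification sets, and for regression intervals it would be replaced by a lower <= y <= upper check and len() by the interval width):

```python
import numpy as np

def empirical_coverage(pred_sets, y_true):
    """Eq. (1): fraction of points whose ground truth lies in the set."""
    return float(np.mean([y in s for s, y in zip(pred_sets, y_true)]))

def average_size(pred_sets):
    """Eq. (2): mean prediction-set size."""
    return float(np.mean([len(s) for s in pred_sets]))
```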
MC-CP

Our Monte Carlo-Conformal Prediction (MC-CP) method for UQ incorporates adaptive MC dropout and conformal prediction, leveraging their low computational cost and finite-sample distribution-free coverage guarantees, respectively. Figure 1 shows a high-level overview of our MC-CP method for image classification. We discuss next adaptive MC dropout, followed by an exposition of MC-CP for classification and regression. This novel combination of adaptive MC dropout and CP, albeit straightforward, results in a hybrid MC-CP method that yields significant improvements compared to state-of-the-art UQ techniques (see the Evaluation section).

[Figure 1: High-level overview of the MC-CP method for image classification: input data passes through adaptive MC dropout and then conformal prediction, yielding an MC-CP prediction set with per-class probabilities and uncertainties.]

Adaptive Monte Carlo Dropout

The competitive predictive performance of MC dropout largely depends on the execution of multiple stochastic forward passes of each input through the DL model at inference time. The number of forward passes K the model should perform is defined a priori and is fixed. Since, for any new input, the dropout layers of the DL model are kept on during inference, the ensemble of these K forward passes produces a distribution of predictions. This distribution enables quantifying uncertainty by computing metrics such as the expected (average) value, standard deviation and entropy.

The motivation underpinning adaptive MC dropout originates from the observation that each forward pass corresponds to a particular DL model instantiation that adds unique variance to the prediction distribution. Some of these DL model instantiations, informed by MC dropout forward passes, can produce similar or even the exact same prediction. Hence, although the prediction variance might be large initially, as the number of forward passes increases, the variance value becomes smaller, indicating that the inference process has converged. If the current number of forward passes is substantially less than the maximum number of forward passes K when this event occurs, the remaining forward passes incur only additional overheads but add little to no value. Adaptive MC dropout leverages this observation to reduce the number of wasted forward passes once convergence is diagnosed, thus yielding significant computational savings without impacting the prediction effectiveness.

Algorithm 1 shows our adaptive MC dropout method. Given a new input x, the method performs up to K forward passes over model f to produce the predictive posterior mean as the final prediction and the variance of the predictive posterior as the prediction uncertainty. Unlike conventional MC dropout, our algorithm uses the hyperparameters threshold δ and patience P to detect convergence and terminate early. The threshold parameter δ denotes the maximum difference in variance required to trigger that the class/quantile prediction has likely converged. Patience P signifies the number of consecutive forward passes where all classes/quantiles must be below δ to stop the execution early. The criterion of performing P successive forward passes that meet the threshold δ is important in determining convergence and mitigating the potential effect of randomness.

Algorithm 1: Adaptive Monte Carlo Dropout
Input: Model f, Input x, Maximum forward passes K, Threshold δ, Patience P
Output: Mean prediction µ and variance σ
 1: Count ← 0
 2: Predictions ← []
 3: while Count < P and size(Predictions) < K do
 4:   y ← f(x)
 5:   Predictions ← Predictions ∪ y
 6:   σ ← Var(Predictions)
 7:   if size(Predictions) > 1 then
 8:     diff ← |σ_{i-1} − σ|    ▷ list of per-class/quantile differences
 9:     if ∀z ∈ diff. z ≤ δ then
10:       Count ← Count + 1
11:     else
12:       Count ← 0
13:   σ_{i-1} ← σ
14: return mean(Predictions), σ

Adaptive MC dropout works as follows. While the current forward pass counter is less than K and the current patience counter is less than P (line 3), the model predicts the input data with dropout layers switched on (line 4). The prediction is added to a list, and the variance of that list is estimated (lines 5-6). From the second forward pass onward, the difference between the current variance σ and the last estimated variance σ_{i-1} is calculated (line 8). If the difference for all classes/quantiles is below the threshold δ, then the current patience counter is increased (lines 9-10); otherwise, it is reset (line 12). Once all classes/quantiles converge below δ for P consecutive forward passes, the predictive posterior mean and variance are output as the prediction and its measured uncertainty, respectively (line 14).

The user-defined parameters threshold δ ∈ (0, 1) and patience P ∈ Z+ enable controlling the sensitivity of adaptive MC dropout to changes in prediction variance. When δ approaches 1, our method becomes less sensitive, allowing it to stop earlier. In contrast, the closer δ is to 0, the more sensitive it becomes, requiring the execution of more forward passes until convergence is diagnosed. It can be easily seen that selecting a small δ and a large patience P instruments the conventional MC dropout method. We demonstrate this remark later in Tables 4 and 6.

We also provide a sketch of the proof for the adaptive MC dropout method. The MC dropout process is a Bernoulli process: each MC dropout forward pass is independent of the others, and the model parameters are fixed during our adaptive MC dropout approach. According to the Law of Large Numbers, as the number of Predictions from line 5 of Algorithm 1 increases, the sample variance σ from line 6 converges to the true variance σ_true of the MC dropout output population, and there exists a number of forward passes N = #Predictions such that for all i ≥ N, |σ − σ_true| < δ/2. We show that the while loop of lines 3-13 terminates after fewer than K iterations if N < K − P. To that end, we note that, since the σ value computed in iterations N, N+1, ..., N+P of the while loop is within δ/2 of σ_true, in each of these successive iterations diff = |σ_{i-1} − σ| < δ in line 8, and therefore Count is incremented in line 10, reaching the value P and ending the while loop before K iterations.
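Algorithm 1 translates almost line for line into Python. The sketch below is our illustrative numpy rendering, with `stochastic_forward` standing for one dropout-enabled forward pass (e.g., a closure over the PyTorch model from the earlier sketch); it is not the authors' code.

```python
import numpy as np

def adaptive_mc_dropout(stochastic_forward, x, K=1000, delta=5e-4, patience=10):
    """Algorithm 1: run dropout-enabled forward passes until the per-class
    (or per-quantile) sample variance stabilises for `patience` passes."""
    predictions, count, prev_var = [], 0, None
    while count < patience and len(predictions) < K:
        predictions.append(stochastic_forward(x))    # line 4: one stochastic pass
        var = np.var(np.stack(predictions), axis=0)  # line 6: running variance
        if prev_var is not None:                     # from the 2nd pass onward
            diff = np.abs(prev_var - var)            # line 8
            count = count + 1 if np.all(diff <= delta) else 0  # lines 9-12
        prev_var = var
    preds = np.stack(predictions)
    return preds.mean(axis=0), preds.var(axis=0)     # line 14: mean, variance
```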
MC-CP for Image Classification

For image classification, we combine our adaptive Monte Carlo dropout method with conformal prediction to form MC-CP, shown in Algorithm 2. MC-CP is split into two steps: conformal calibration and conformal prediction. First, a test dataset is split into calibration and validation sets. Platt scaling is then performed on the pre-trained model using the calibration dataset. Next, we calculate the conformal scores for each calibration image, which are then used to calculate the quantile threshold.

Algorithm 2: MC-CP for image classification
Input: Model f, Test set, Maximum forward passes K, Threshold δ, Patience P
Output: Prediction set and variance set
Conformal Calibration
 1: Split test set: split the test set into a calibration set c and a validation set v.
 2: Calibrate: perform Platt scaling on the model using c.
 3: Calculate conformal score: for each image x_j in c, define E_j = Σ_{i=1}^{k'} ( π̂_(i)(x_j) + λ·1[i > k_reg] ), where k' is the model's ranking of the true class y_j and π̂_(i)(x_j) is the i-th largest softmax score for the j-th image.
 4: Find the threshold: assign T̂_ccal to the 1−α quantile of the E_j.
Conformal Prediction
 1: Mean softmax: retrieve the mean softmax and variance from AdaptiveMonteCarloDropout(f, v, K, δ, P).
 2: Prediction set: output the k* highest-score classes, where k* is the smallest k such that Σ_{i=1}^{k} ( π̂_(i)(x_{n+1}) + λ·1[i > k_reg] ) ≥ T̂_ccal.

During the prediction stage of MC-CP, we invoke the adaptive MC dropout method, with the selected hyperparameters, for each new input image. This invocation returns the mean prediction and variance over the possible classes of the image. The final prediction set is then determined by calculating the cumulative softmax output over all classes and including classes, from most to least likely, until the regularised cumulative score reaches the quantile threshold. In the Evaluation section, we show how MC-CP outperforms other state-of-the-art conformal prediction techniques, with modest computational overheads.
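A numpy sketch of both stages follows, reusing `adaptive_mc_dropout` from above. The regularisation constants `lam` and `k_reg`, the omission of RAPS's optional randomisation step, and all names are our illustrative assumptions; Platt/temperature scaling is assumed to have already been applied to the calibration softmax outputs.

```python
import numpy as np

def raps_threshold(softmax_cal, labels_cal, alpha=0.05, lam=0.01, k_reg=5):
    """Steps 3-4: regularised cumulative score down to the true class's rank,
    then the (1 - alpha) quantile of those scores."""
    scores = []
    for p, y in zip(softmax_cal, labels_cal):
        order = np.argsort(-p)                        # most likely class first
        k_true = int(np.where(order == y)[0][0]) + 1  # rank of the true class
        scores.append(p[order][:k_true].sum() + lam * max(0, k_true - k_reg))
    return np.quantile(scores, 1 - alpha, method="higher")

def mc_cp_classify(stochastic_forward, x, T_hat, lam=0.01, k_reg=5, **mc_kwargs):
    """Prediction stage: adaptive MC dropout, then grow the prediction set
    until the regularised cumulative softmax reaches the threshold."""
    mean_softmax, var = adaptive_mc_dropout(stochastic_forward, x, **mc_kwargs)
    order, cum, pred_set = np.argsort(-mean_softmax), 0.0, []
    for i, cls in enumerate(order, start=1):
        cum += mean_softmax[cls] + lam * (i > k_reg)
        pred_set.append(int(cls))
        if cum >= T_hat:
            break
    return pred_set, var[pred_set]
```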
MC-CP for Regression

We also develop an extension of MC-CP for deep quantile regression, shown in Algorithm 3. This is likewise split into calibration and prediction steps. To calculate the conformal scores, the magnitude of error relative to the desired quantiles is estimated for each calibration point. Next, the threshold is calculated using the calibration dataset.

Algorithm 3: MC-CP for deep quantile regression
Input: Model f, Test set, Maximum forward passes K, Threshold δ, Patience P
Output: Prediction interval and variance
Conformal Calibration
 1: Split test set: split the test set into a calibration set c and a validation set v.
 2: Calculate conformal score: for each data point in c, define E_i := max{ q̂_{α_lo}(x_i) − y_i, y_i − q̂_{α_hi}(x_i) }.
 3: Find the threshold: compute Q_{1−α}(E, c), the (1−α)(1 + 1/|c|)-th empirical quantile of {E_i : i ∈ c}.
Conformal Prediction
 1: Mean quantiles: retrieve the mean quantile predictions and variance from AdaptiveMonteCarloDropout(f, v, K, δ, P).
 2: Prediction interval: output the prediction interval C(v) = [ q̂_{α_lo}(v) − Q_{1−α}(E, c), q̂_{α_hi}(v) + Q_{1−α}(E, c) ] for unseen validation data v.

For the prediction stage of MC-CP for deep quantile regression, the adaptive MC dropout method is once again called, with the desired hyperparameters, for each data point in the validation dataset. Finally, a prediction interval is calculated around both quantiles for an unseen data point using the calculated threshold. In the Evaluation section, we show how MC-CP outperforms regular deep quantile regression and the CQR method.
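The CQR-style calibration and the conformalised interval can be sketched as follows; again this reuses `adaptive_mc_dropout`, assumes the quantile model outputs a (q_lo, q_hi) pair per pass, and all names are illustrative rather than the authors' API:

```python
import numpy as np

def cqr_margin(q_lo_cal, q_hi_cal, y_cal, alpha=0.1):
    """Algorithm 3 calibration: score each point by how far it falls outside
    the predicted [q_lo, q_hi] band, then take the corrected quantile."""
    E = np.maximum(q_lo_cal - y_cal, y_cal - q_hi_cal)
    level = min((1 - alpha) * (1 + 1 / len(y_cal)), 1.0)
    return np.quantile(E, level, method="higher")

def mc_cp_interval(stochastic_forward, x, Q, **mc_kwargs):
    """Prediction stage: mean lower/upper quantiles from adaptive MC dropout,
    widened symmetrically by the calibrated margin Q."""
    mean_q, var = adaptive_mc_dropout(stochastic_forward, x, **mc_kwargs)
    q_lo, q_hi = mean_q
    return (q_lo - Q, q_hi + Q), var
```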
Evaluation

Experimental Setup

Benchmarks. For classification, we evaluate MC-CP on five image datasets: CIFAR-10 and CIFAR-100 (Krizhevsky 2009), MNIST (LeCun et al. 1998), Fashion-MNIST (Xiao, Rasul, and Vollgraf 2017), and Tiny ImageNet (Wu, Zhang, and Xu 2017). CIFAR-10 and CIFAR-100 contain 60,000 32x32 colour images with 10 and 100 classes, respectively. MNIST and Fashion-MNIST contain 60,000 28x28 greyscale images with 10 classes each. Tiny ImageNet is a small version of the well-known ImageNet dataset, containing 100,000 64x64 colour images with 200 classes.

For regression, we use the following five benchmarks: Boston Housing (Harrison and Rubinfield 1978), Abalone (Nash et al. 1995), Blog Feedback (Buza 2014), Concrete Compressive Strength (Yeh 2007), and the Physicochemical Properties of Protein Tertiary Structure dataset (Rana 2013). The Boston Housing dataset contains 506 data points with 14 attributes, the Abalone dataset has 4,180 data points with 9 attributes, the Blog Feedback dataset contains 60,021 data points with 281 attributes, the Concrete dataset contains 1,030 data points with 9 attributes, and the Physicochemical Properties of Protein Tertiary Structure dataset contains 45,730 data points with 9 attributes.

UQ Methods Configuration. In our classification experiments, all methods use a basic convolutional neural network (CNN) architecture comprising two hidden layers, two pooling layers, and two dropout layers with a frequency of 50%. All models are trained with a batch size of 128 for 10 epochs, using the categorical cross-entropy loss function and a stochastic gradient descent optimiser with a learning rate of 0.1 and momentum of 0.9. Each experiment is repeated five times to account for stochasticity. For CP methods, the calibration set size is 25% of the test set and α = 0.05.

We do not consider Deep Ensembles or Bayesian Neural Networks within our experiments. These techniques require heavy fine-tuning between datasets, preventing us from establishing a clear baseline: it would not be evident whether performance differences were due to hyperparameter tuning or the method itself. To enable a fair comparison, we use the same network architecture and hyperparameters for all classification-based UQ methods, i.e., a standard CNN, a CNN with MC dropout, Naive CP, RAPS, and MC-CP (instrumented with RAPS).

In our regression experiments, all methods use a deep quantile regression model comprising two hidden layers and two dropout layers with a frequency of 25%. The learning rate of the Adam optimiser is 0.001, and a custom multi-quantile loss function is used with the quantiles 0.05 and 0.95. Each model is trained for 100 epochs with a batch size of 32, with experiments repeated five times to account for stochasticity. For CP methods, the calibration set size is 2% of the test set and α = 0.1. As before, the same DL model is used for comparing the following regression-based UQ methods: a deep quantile regressor, a deep quantile regressor with MC dropout, CQR, and MC-CP instrumented with CQR.
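In standard deep quantile regression, a multi-quantile loss of this kind is the pinball (quantile) loss summed over the target quantiles. The paper does not spell out its custom loss, so the PyTorch sketch below is a plausible stand-in rather than the authors' implementation:

```python
import torch

def multi_quantile_loss(pred, target, quantiles=(0.05, 0.95)):
    """Pinball loss summed over quantiles; pred holds one column per quantile."""
    losses = []
    for i, q in enumerate(quantiles):
        err = target - pred[:, i]
        # Penalise under-prediction with weight q, over-prediction with 1 - q.
        losses.append(torch.maximum(q * err, (q - 1) * err).mean())
    return sum(losses)
```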
Image Classification Results

Classification Accuracy. The accuracy results of the five different methods on each of the datasets are shown in Table 1. The methods tested against MC-CP were a baseline CNN, the same CNN with MC dropout applied, Naive conformal prediction (Angelopoulos and Bates 2022), and RAPS. Results show that not only does MC-CP have increased accuracy in comparison to baseline and state-of-the-art conformal prediction methods, but it also does so with less deviation between runs. In particular, we emphasise that our method consistently increases accuracy and yields a lower standard deviation on difficult datasets such as CIFAR-10, CIFAR-100 and Tiny ImageNet. Further, and as expected, conformal prediction methods can drastically improve accuracy compared to baseline methods, such as a regular CNN and MC dropout. However, MC-CP improves accuracy substantially with less deviation between runs, highlighting its consistency compared with Naive CP and RAPS.

Dataset | Technique | Test Error | Pred Sizes
CIFAR-10 | Baseline | 38.05 ± 0.36 | 1.00 ± 0.00
CIFAR-10 | MC | 35.54 ± 0.36 | 1.00 ± 0.00
CIFAR-10 | Naive | 5.07 ± 0.28 | 3.53 ± 1.63
CIFAR-10 | RAPS | 3.29 ± 0.30 | 4.35 ± 1.86
CIFAR-10 | MC-CP | 1.47 ± 0.15 | 4.11 ± 1.81
CIFAR-100 | Baseline | 72.14 ± 0.87 | 1.00 ± 0.00
CIFAR-100 | MC | 69.46 ± 0.60 | 1.00 ± 0.00
CIFAR-100 | Naive | 4.92 ± 0.42 | 39.25 ± 11.43
CIFAR-100 | RAPS | 4.83 ± 0.25 | 41.65 ± 4.01
CIFAR-100 | MC-CP | 3.54 ± 0.20 | 39.26 ± 3.95
MNIST | Baseline | 1.10 ± 0.06 | 1.00 ± 0.00
MNIST | MC | 1.11 ± 0.04 | 1.00 ± 0.00
MNIST | Naive | 5.01 ± 0.58 | 0.95 ± 0.22
MNIST | RAPS | 1.10 ± 0.04 | 1.08 ± 0.35
MNIST | MC-CP | 0.32 ± 0.01 | 1.06 ± 0.33
Fashion-MNIST | Baseline | 12.01 ± 0.26 | 1.00 ± 0.00
Fashion-MNIST | MC | 12.08 ± 0.23 | 1.00 ± 0.00
Fashion-MNIST | Naive | 4.93 ± 0.40 | 1.20 ± 0.42
Fashion-MNIST | RAPS | 1.12 ± 0.07 | 1.80 ± 0.99
Fashion-MNIST | MC-CP | 0.82 ± 0.06 | 1.76 ± 1.00
Tiny ImageNet | Baseline | 81.07 ± 1.05 | 1.00 ± 0.00
Tiny ImageNet | MC | 78.60 ± 2.37 | 1.00 ± 0.00
Tiny ImageNet | Naive | 4.85 ± 0.34 | 97.53 ± 29.64
Tiny ImageNet | RAPS | 4.57 ± 0.09 | 107.78 ± 2.06
Tiny ImageNet | MC-CP | 3.99 ± 0.41 | 97.17 ± 3.67

Table 1: Test errors (%) and prediction sizes per UQ method on the five classification benchmarks (δ = 5e-4, P = 10).

Singleton and Mixed Predictions. Next, we compare the percentage and accuracy of singleton and non-singleton (mixed) predictions for all three conformal prediction methods on CIFAR-10 (Figure 2). Naive CP is most likely to predict singleton values, whereas our method is least likely. When a model is not confident about its prediction, CP-based methods should desirably increase the prediction set size to account for this uncertainty and, hopefully, include the correct class in the larger prediction set. The comparison of singleton and non-singleton results in Figure 2 provides evidence that our method correctly increases the set size to improve accuracy. In fact, for both singleton and non-singleton set sizes, our method performs with the highest accuracy, also exhibiting consistent behaviour, as indicated by the low variance between runs. An argument can be made that making the set size large enough could cover nearly all the classes, and this behaviour could reflect a higher accuracy. Comparing these results with the mean set sizes in Table 1, however, we can see that all methods cover only a portion of the classes in their mean set sizes.

[Figure 2: Percentage and accuracy of singleton and mixed predictions for Naive CP, RAPS and MC-CP on CIFAR-10.]

Confidence of Predictions. We evaluated whether MC-CP could result in a more confident model than traditional conformal prediction methods, thus providing improved accuracy. Figure 3 shows the mean highest softmax output of every CP method for CIFAR-10, CIFAR-100, and Tiny ImageNet. Compared to Naive CP and RAPS, our method shows an increase in confidence across all benchmarks. Looking closely at larger-scale datasets, such as CIFAR-100 and Tiny ImageNet, MC-CP is substantially more confident in its predictions. We also observe, in Figure 3, that MC-CP consistently has a smaller standard deviation between runs than the other methods.

[Figure 3: Mean confidence of top predictions for Naive CP, RAPS and MC-CP on CIFAR-10, CIFAR-100 and Tiny ImageNet.]

Prediction Sets Size. We have already shown how the accuracy of each method behaves at scale using CIFAR-100. However, this reflects only a portion of the performance of each method at scale and does not highlight its weaknesses. The 'Pred Sizes' column in Table 1 shows the mean set size and variance for Naive CP, RAPS, and MC-CP on the five datasets. The results on CIFAR-10 show that Naive CP has the smallest mean set size, but this is not reflected in its accuracy. Looking at the CIFAR-100 results, we can see that Naive CP has the smallest mean again, but its variance is substantially larger than the other results. In fact, we observed that Naive CP had set sizes ranging from 1 to 86, which indicates that the method cannot cope effectively with large-scale datasets with many (potential) classes. For both datasets, MC-CP achieves a smaller mean than RAPS and has less deviation around the mean. For CIFAR-100, RAPS has set sizes ranging from 33 to 59, whereas MC-CP has set sizes ranging from 30 to 52. These results show how MC-CP can boost confidence in conformal prediction algorithms and achieve better results. Overall, we observe that advanced CP algorithms, like RAPS, tend to overestimate their prediction sets, and MC-CP reduces this overestimation.

We also demonstrate that our MC-CP method works well with models at scale by assessing its capabilities using the VGG16 and VGG19 models on the Tiny ImageNet dataset. Table 2 shows the reduced prediction set sizes for these models. The results on the larger DL models align with those shown in Table 1, except in smaller magnitudes.

Model | Technique | Test Error | Prediction Sizes
VGG16 | Naive | 5.16 ± 0.28 | 32.39 ± 23.44
VGG16 | RAPS | 4.90 ± 0.37 | 42.18 ± 5.17
VGG16 | MC-CP | 3.86 ± 0.26 | 40.44 ± 5.09
VGG19 | Naive | 4.84 ± 0.34 | 28.40 ± 22.53
VGG19 | RAPS | 4.77 ± 0.19 | 36.50 ± 4.90
VGG19 | MC-CP | 3.78 ± 0.14 | 32.48 ± 5.22

Table 2: Test errors (%) and prediction sizes per UQ method for two large DL models on the Tiny ImageNet dataset.

Accuracy of Classes. We next validated that MC-CP does not just perform significantly better than other methods in one or two classes, but indeed performs better for nearly all classes. Table 3 shows the mean accuracy of all methods for each class in the CIFAR-10 dataset. We again see the trend where MC-CP increases the accuracy in comparison to the other methods, with reduced deviation between runs. MC-CP consistently achieves an accuracy of approximately 97-99%, showing that it does improve general accuracy, not just that of a few classes.
The Frog class is the sole exception, where Naive CP achieves a higher accuracy, but this appears to be an outlier for that model; MC-CP still achieves a high mean accuracy of 99.02% ± 0.59.

Class | Baseline | MC | Naive | RAPS | MC-CP
Airplane | 62.9 ± 4.5 | 65.9 ± 5.3 | 93.5 ± 1.0 | 95.6 ± 1.2 | 97.8 ± 0.9
Automobile | 76.3 ± 6.9 | 75.4 ± 7.9 | 96.8 ± 0.7 | 97.7 ± 0.7 | 99.3 ± 0.4
Bird | 39.8 ± 4.1 | 45.3 ± 5.1 | 92.4 ± 2.2 | 94.7 ± 1.2 | 98.2 ± 0.8
Cat | 40.8 ± 2.9 | 39.4 ± 3.2 | 93.1 ± 2.2 | 97.8 ± 0.5 | 98.3 ± 0.5
Deer | 59.8 ± 9.2 | 57.3 ± 3.3 | 97.6 ± 0.8 | 98.6 ± 0.6 | 99.0 ± 0.3
Dog | 49.5 ± 6.2 | 59.2 ± 7.9 | 93.1 ± 0.8 | 97.2 ± 0.9 | 98.2 ± 0.3
Frog | 87.8 ± 2.9 | 76.4 ± 6.9 | 99.3 ± 0.4 | 98.8 ± 0.4 | 99.0 ± 0.6
Horse | 61.2 ± 5.3 | 72.1 ± 3.1 | 91.4 ± 2.9 | 95.5 ± 0.9 | 98.6 ± 0.4
Ship | 76.8 ± 5.1 | 77.1 ± 6.4 | 96.6 ± 0.8 | 97.2 ± 1.0 | 98.2 ± 0.5
Truck | 66.7 ± 3.4 | 73.1 ± 6.9 | 94.8 ± 1.4 | 95.9 ± 1.7 | 98.5 ± 0.3

Table 3: Mean accuracy (%) per class for each method on the CIFAR-10 dataset.

Adaptive MC Dropout. Figure 4 shows the convergence of each class's variance for an example image from the CIFAR-10 dataset. We observe that at approximately 200 forward passes, the variance difference of all classes is below the δ threshold, and the patience counter starts increasing with every new iteration. However, at approximately 205 forward passes, the variance difference for the classes Ship and Automobile spikes above the threshold; this is due to the stochastic nature of MC dropout. After 246 forward passes, all classes drop below the threshold, and the MC-CP procedure finishes early ten iterations later.

[Figure 4: Convergence of the variance for each class during the adaptive MC dropout procedure.]

We also performed a sensitivity analysis of adaptive MC dropout to assess the impact of the threshold δ and patience P on its performance. Table 4 shows the various combinations of δ and P values used in these experiments. As P increases and δ decreases (from top left to bottom right), we notice an increase in the mean number of forward passes, yielding a corresponding reduction in test error (i.e., an accuracy increase) and prediction set size. As expected, for δ = 0.00001, P = 100 (bottom right) we obtain the traditional MC dropout, where the number of forward passes equals K = 1000.

P | Metric | δ=0.1 | δ=0.01 | δ=0.001 | δ=0.0001 | δ=0.00001
1 | Test Error | 6.25 ± 0.43 | 4.00 ± 0.50 | 3.60 ± 0.48 | 2.00 ± 0.50 | 2.00 ± 0.50
1 | Mean fwd passes | 3.77 ± 0.82 | 9.22 ± 3.16 | 52.71 ± 18.59 | 387.71 ± 150.13 | 985.02 ± 75.78
1 | Mean set size | 5.99 ± 1.32 | 5.35 ± 1.39 | 5.34 ± 1.21 | 5.01 ± 150.13 | 4.21 ± 1.33
10 | Test Error | 3.80 ± 0.40 | 3.40 ± 0.49 | 2.40 ± 0.49 | 1.67 ± 0.47 | 1.33 ± 0.47
10 | Mean fwd passes | 13.10 ± 1.50 | 31.45 ± 10.84 | 142.36 ± 56.79 | 812.34 ± 211.21 | 1000.00 ± 0.00
10 | Mean set size | 4.80 ± 1.44 | 5.30 ± 1.21 | 5.26 ± 1.34 | 4.11 ± 1.81 | 4.06 ± 1.48
100 | Test Error | 2.67 ± 0.47 | 2.50 ± 0.50 | 1.75 ± 0.43 | 1.38 ± 0.48 | 1.50 ± 0.48
100 | Mean fwd passes | 103.09 ± 1.60 | 156.77 ± 35.56 | 672.57 ± 180.16 | 1000.00 ± 0.00 | 1000.00 ± 0.00
100 | Mean set size | 5.01 ± 1.43 | 5.27 ± 1.20 | 4.75 ± 1.32 | 4.08 ± 1.33 | 4.19 ± 1.20

Table 4: Sensitivity analysis for various threshold δ and patience P combinations on the CIFAR-10 dataset (K = 1000).

Finally, we demonstrate that adaptive MC dropout can save resources by comparing its execution overheads against traditional MC dropout for K = 1000, δ = 5e-4, P = 10. Traditional MC dropout performed all 1000 forward passes on CIFAR-10, and each image inference took an average of 35.52 ± 0.42 seconds. Adaptive MC dropout averaged 500.21 ± 196.37 passes over all images and took an average of 17.99 ± 7.09 seconds.
The ability of our method to diagnose convergence led to ≈50% faster execution, meaning that the remaining ≈500 forward passes were not needed. Considering memory consumption, as expected, both methods use similar amounts of memory (≈1.07 GB for regular and ≈1.08 GB for adaptive MC dropout) when training a full model and running inference on a dataset.

Regression Results

Regression Accuracy and Coverage. In deep quantile regression, the mean absolute error (MAE) provides the magnitude of errors between the predicted quantiles and the true quantiles. Since MAE is less sensitive to outliers, we use it instead of (root) mean squared error. We also compute the empirical coverage, which measures how often the predicted quantiles contain the true statistical quantile. Similarly to image classification, the objective is for the posterior prediction set to contain the true quantile. Table 5 shows the MAE and empirical coverage of four different methods on the Boston Housing, Abalone, Blog Feedback, Concrete Strength and Protein datasets. We evaluated MC-CP against a baseline deep quantile regressor, the same deep quantile regressor with MC dropout, and conformalized quantile regression (CQR), the state-of-the-art CP regression method.

Dataset | Technique | MAE | E. Coverage
Boston Housing | Baseline | 0.30 ± 0.02 | 23.52 ± 3.18
Boston Housing | MC | 0.37 ± 0.02 | 72.83 ± 2.75
Boston Housing | CQR | 0.31 ± 0.61 | 95.97 ± 5.10
Boston Housing | MC-CP | 0.35 ± 0.20 | 98.46 ± 4.83
Abalone | Baseline | 0.62 ± 0.04 | 47.86 ± 2.34
Abalone | MC | 0.64 ± 0.02 | 85.96 ± 1.82
Abalone | CQR | 0.62 ± 0.11 | 92.94 ± 2.36
Abalone | MC-CP | 0.64 ± 0.04 | 95.98 ± 3.07
Blog Feedback | Baseline | 2.12 ± 0.08 | 70.32 ± 5.70
Blog Feedback | MC | 2.61 ± 0.08 | 86.09 ± 5.80
Blog Feedback | CQR | 2.21 ± 0.10 | 90.73 ± 0.34
Blog Feedback | MC-CP | 2.40 ± 0.12 | 95.73 ± 0.34
Concrete | Baseline | 0.37 ± 0.01 | 20.55 ± 1.41
Concrete | MC | 0.54 ± 0.01 | 71.54 ± 5.51
Concrete | CQR | 0.37 ± 0.02 | 90.34 ± 3.69
Concrete | MC-CP | 0.44 ± 0.01 | 93.36 ± 2.49
Protein | Baseline | 1.35 ± 0.01 | 49.10 ± 1.75
Protein | MC | 1.49 ± 0.02 | 81.87 ± 0.21
Protein | CQR | 1.40 ± 0.02 | 94.79 ± 0.01
Protein | MC-CP | 1.45 ± 0.01 | 96.06 ± 0.73

Table 5: Mean absolute error (MAE) and empirical coverage (%) for each method on the Boston Housing, Abalone, Blog Feedback, Concrete Strength and Protein datasets.

Looking at MAE, the traditional deep quantile regression model performs best across the five datasets. However, it also has a very low empirical coverage percentage across all five datasets. For example, in the Boston Housing dataset, the true data points are included in the predicted quantile band only 22% of the time. Similarly, although MC dropout increases the coverage by a considerable amount across all datasets, this method consistently leads to a worse MAE overall. In fact, we observe a trade-off between these two methods: a low MAE comes with a low coverage, whereas a high coverage induces a high MAE.

Considering the CP-based methods, we observe that CQR provides the specified 1 − α coverage guarantee for all datasets, i.e., approximately 90%. Furthermore, CQR achieves this coverage with an MAE comparable to the baseline method in our experiments. Our MC-CP method reaches the highest empirical coverage across all five datasets, but it does this with a slightly higher overall MAE (but lower standard deviation) on average than CQR. Given, however, the improved empirical coverage of MC-CP and its very close MAE results, we can conclude that MC-CP delivers very competitive results against the state-of-the-art CP method for regression. This is a particularly important insight, especially in safety-critical applications where higher coverage is vital.
We conclude our evaluation with Figure 5, which shows the predicted quantiles and the coverage of the true values on an excerpt of the Boston Housing dataset. As expected, MC-CP yields slightly larger quantile bands than CQR but has higher empirical coverage and misses fewer points.

[Figure 5: Predicted quantiles (95%, 5%) of all four methods on a sample of the Boston Housing dataset.]

Adaptive MC Dropout for Regression. Similarly to Table 4, we performed a sensitivity analysis of various combinations of δ and the patience value on deep quantile regression. Table 6 shows how different combinations affect MAE and coverage. We also visualised the quantiles for the various combinations in Figure 6. Similarly to the results shown in Table 4, a small δ and a large patience show results comparable to traditional MC dropout. It can be seen that with δ = 1e-5, P = 10, we obtain considerable computational-time savings with an MAE comparable to δ = 1e-5, P = 100.

P | Metric | δ=1e-1 | δ=1e-2 | δ=1e-3 | δ=1e-4 | δ=1e-5
1 | Mean MAE | 0.41 ± 0.02 | 0.40 ± 0.01 | 0.40 ± 0.02 | 0.40 ± 0.01 | 0.41 ± 0.03
1 | Mean fwd passes | 3.00 ± 0.00 | 3.15 ± 0.84 | 4.83 ± 2.70 | 5.03 ± 3.03 | 5.14 ± 3.47
1 | Mean coverage | 84.85 ± 3.81 | 87.83 ± 2.17 | 88.85 ± 1.63 | 83.70 ± 9.78 | 87.50 ± 2.72
10 | Mean MAE | 0.39 ± 0.02 | 0.38 ± 0.02 | 0.37 ± 0.01 | 0.36 ± 0.02 | 0.36 ± 0.01
10 | Mean fwd passes | 82.38 ± 209.43 | 339.28 ± 368.87 | 382.57 ± 393.45 | 403.24 ± 399.50 | 459.85 ± 383.65
10 | Mean coverage | 84.78 ± 14.13 | 89.67 ± 0.54 | 97.28 ± 0.54 | 96.74 ± 2.17 | 97.28 ± 2.72
100 | Mean MAE | 0.37 ± 0.01 | 0.34 ± 0.01 | 0.35 ± 0.01 | 0.35 ± 0.03 | 0.35 ± 0.01
100 | Mean fwd passes | 484.33 ± 432.50 | 984.32 ± 98.02 | 974.85 ± 127.46 | 977.25 ± 113.95 | 982.10 ± 108.46
100 | Mean coverage | 97.83 ± 2.17 | 98.37 ± 1.63 | 93.76 ± 5.98 | 92.93 ± 5.98 | 95.65 ± 2.17

Table 6: Sensitivity analysis for various hyperparameter combinations on the Boston Housing dataset.

[Figure 6: Predicted quantiles (95%, 5%) for the Table 6 configurations on a sample of the Boston Housing dataset.]

Similarly to the computational overheads investigation performed for image classification, we evaluated the overheads of traditional MC dropout against adaptive MC dropout with the same parameters. Traditional MC dropout performed all 1000 forward passes on the Boston Housing dataset, and each inference took an average of 34.08 ± 1.51 seconds. Adaptive MC dropout averaged 502.58 ± 56.94 forward passes and took an average of 16.58 ± 2.91 seconds. Accordingly, we obtained evidence that adaptive MC dropout was again ≈50% faster.

Conclusion and Future Work

Quantifying uncertainty in Deep Learning models is vital, especially when they are deployed in safety-critical applications. We introduced MC-CP, a hybrid uncertainty quantification method that combines a novel adaptive Monte Carlo dropout, informed by a convergence criterion to save resources during inference, with conformal prediction. MC-CP delivers robust prediction sets/intervals by exploiting the statistical efficiency of MC dropout and the distribution-free coverage guarantees of conformal prediction. Our evaluation on classification and regression benchmarks showed that MC-CP offers significant improvements over advanced methods, like MC dropout, RAPS and CQR. Our future work includes: (i) enhancing MC-CP to support object detection and segmentation tasks; (ii) performing a more extensive evaluation using larger benchmarks and DL models; and (iii) extending MC-CP to encode risk-related aspects in its analysis.

Acknowledgments

This research has received funding from the Doctoral Centre for Safe, Ethical and Secure Computing (SEtS) at the University of York, UK, the European Union's Horizon projects SESAME and SOPRANO (grant agreements No 101017258 and 101120990, respectively), the EPSRC project 'UKRI TAS Node in Resilience' (EP/V026747/1), and the Assuring Autonomy International Programme. RC's work has also been funded by the Institute for Software Engineering and Software Technology 'José María Troya Linero' at the University of Málaga.

References
Abdar, M.; Pourpanah, F.; Hussain, S.; Rezazadegan, D.; Liu, L.; Ghavamzadeh, M.; Fieguth, P.; Cao, X.; Khosravi, A.; Acharya, U. R.; et al. 2021. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information Fusion, 76: 243–297.

Angelopoulos, A.; Bates, S.; Malik, J.; and Jordan, M. I. 2022. Uncertainty Sets for Image Classifiers using Conformal Prediction.

Angelopoulos, A. N.; and Bates, S. 2022. A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification.

Barnard, K.; Duygulu, P.; Forsyth, D.; De Freitas, N.; Blei, D. M.; and Jordan, M. I. 2003. Matching words and pictures. The Journal of Machine Learning Research, 3: 1107–1135.

Buza, K. 2014. BlogFeedback. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C58S3F.

Calinescu, R.; Češka, M.; Gerasimou, S.; Kwiatkowska, M.; and Paoletti, N. 2018. Efficient synthesis of robust models for stochastic systems. Journal of Systems and Software, 143: 140–158.

de Grancey, F.; Adam, J.-L.; Alecu, L.; Gerchinovitz, S.; Mamalet, F.; and Vigouroux, D. 2022. Object Detection with Probabilistic Guarantees: A Conformal Prediction Approach. In Trapp, M.; Schoitsch, E.; Guiochet, J.; and Bitsch, F., eds., Computer Safety, Reliability, and Security. SAFECOMP 2022 Workshops, Lecture Notes in Computer Science, 316–329. Cham: Springer International Publishing. ISBN 978-3-031-14862-0.

Fan, J.; Ge, J.; and Mukherjee, D. 2023. UTOPIA: Universally Trainable Optimal Prediction Intervals Aggregation.

Gal, Y.; and Ghahramani, Z. 2016. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 1050–1059. PMLR.

Gerasimou, S.; Eniser, H. F.; Sen, A.; and Cakan, A. 2020. Importance-driven deep learning system testing. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, 702–713.

Harrison, D.; and Rubinfield, D. L. 1978. The Boston house-price data. http://lib.stat.cmu.edu/datasets/boston. Accessed: 2023-07-04.

Kendall, A.; Badrinarayanan, V.; and Cipolla, R. 2016. Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding. arXiv:1511.02680 [cs].

Kingma, D. P.; Salimans, T.; and Welling, M. 2015. Variational Dropout and the Local Reparameterization Trick. In Cortes, C.; Lawrence, N.; Lee, D.; Sugiyama, M.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.

Krizhevsky, A. 2009. Learning Multiple Layers of Features from Tiny Images. Ph.D. thesis, University of Toronto.
Kumar, Y.; Sahrawat, D.; Maheshwari, S.; Mahata, D.; Stent, A.; Yin, Y.; Shah, R. R.; and Zimmermann, R. 2020. Harnessing GANs for zero-shot learning of new classes in visual speech recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2645–2652.

Lakshminarayanan, B.; Pritzel, A.; and Blundell, C. 2017. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In Guyon, I.; Luxburg, U. V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11): 2278–2324.

MacKay, D. J. 1992. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3): 448–472.

Missaoui, S.; Gerasimou, S.; and Matragkas, N. 2023. Semantic Data Augmentation for Deep Learning Testing using Generative AI. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), 1694–1698. IEEE.

Moshkov, N.; Mathe, B.; Kertesz-Farkas, A.; Hollandi, R.; and Horvath, P. 2020. Test-time augmentation for deep learning-based cell segmentation on microscopy images. Scientific Reports, 10(1): 5068.

Nash, W.; Sellers, T.; Talbot, S.; Cawthorn, A.; and Ford, W. 1995. Abalone. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C55C7W.

Pereira, A.; and Thomas, C. 2020. Challenges of machine learning applied to safety-critical cyber-physical systems. Machine Learning and Knowledge Extraction, 2(4): 579–602.

Rana, P. 2013. Physicochemical Properties of Protein Tertiary Structure. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5QW3H.

Romano, Y.; Patterson, E.; and Candes, E. 2019. Conformalized quantile regression. Advances in Neural Information Processing Systems, 32.

Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15: 1929–1958.

Vovk, V.; Gammerman, A.; and Shafer, G. 2005. Algorithmic Learning in a Random World, volume 29. Springer.

Wang, G.; Li, W.; Ourselin, S.; and Vercauteren, T. 2019. Automatic Brain Tumor Segmentation Using Convolutional Neural Networks with Test-Time Augmentation. In Crimi, A.; Bakas, S.; Kuijf, H.; Keyvan, F.; Reyes, M.; and van Walsum, T., eds., Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, 61–72. Cham: Springer International Publishing. ISBN 978-3-030-11726-9.

Wu, J.; Zhang, Q.; and Xu, G. 2017. Tiny ImageNet Challenge. Technical report.

Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms.

Yeh, I.-C. 2007. Concrete Compressive Strength. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5PK67.
