Deep Evidential Regression
Alexander Amini¹, Wilko Schwarting¹, Ava Soleimany², Daniela Rus¹
¹ Computer Science and Artificial Intelligence Lab (CSAIL), Massachusetts Institute of Technology (MIT)
² Harvard Graduate Program in Biophysics

Abstract

Deterministic neural networks (NNs) are increasingly being deployed in safety critical domains, where calibrated, robust, and efficient measures of uncertainty are crucial. In this paper, we propose a novel method for training non-Bayesian NNs to estimate a continuous target as well as its associated evidence in order to learn both aleatoric and epistemic uncertainty. We accomplish this by placing evidential priors over the original Gaussian likelihood function and training the NN to infer the hyperparameters of the evidential distribution. We additionally impose priors during training such that the model is regularized when its predicted evidence is not aligned with the correct output. Our method does not rely on sampling during inference or on out-of-distribution (OOD) examples for training, thus enabling efficient and scalable uncertainty learning. We demonstrate learning well-calibrated measures of uncertainty on various benchmarks, scaling to complex computer vision tasks, as well as robustness to adversarial and OOD test samples.

1 Introduction

Figure 1: Evidential regression simultaneously learns a continuous target along with aleatoric (data) and epistemic (model) uncertainty. Given an input (data such as images, timeseries, or feature vectors), the network is trained to predict the parameters of an evidential distribution, which models a higher-order probability distribution over the individual likelihood parameters, (μ, σ²).
Regression-based neural networks (NNs) are being deployed in safety critical domains in computer vision [15] as well as in robotics and control [1, 6], where the ability to infer model uncertainty is crucial for eventual wide-scale adoption. Furthermore, precise and calibrated uncertainty estimates are useful for interpreting confidence, capturing domain shift of out-of-distribution (OOD) test samples, and recognizing when the model is likely to fail. There are two axes of NN uncertainty that can be modeled: (1) uncertainty in the data, called aleatoric uncertainty, and (2) uncertainty in the prediction, called epistemic uncertainty. While representations of aleatoric uncertainty can be learned directly from data, there exist several approaches for estimating epistemic uncertainty, such as Bayesian NNs, which place probabilistic priors over network weights and use sampling to approximate output variance [25]. However, Bayesian NNs face several limitations, including the intractability of directly inferring the posterior distribution of the weights given data, the requirement and computational expense of sampling during inference, and the question of how to choose a weight prior.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada. Code available at https://github.com/aamini/evidential-deep-learning.

In contrast, evidential deep learning formulates learning as an evidence acquisition process [42, 32]. Every training example adds support to a learned higher-order, evidential distribution. Sampling from this distribution yields instances of lower-order likelihood functions from which the data was drawn. Instead of placing priors on network weights, as is done in Bayesian NNs, evidential approaches place priors directly over the likelihood function.
By training a neural network to output the hyperparameters of the higher-order evidential distribution, a grounded representation of both epistemic and aleatoric uncertainty can then be learned without the need for sampling. To date, evidential deep learning has been targeted towards discrete classification problems [42, 32, 22] and has required either a well-defined distance measure to a maximally uncertain prior [42] or relied on training with OOD data to inflate model uncertainty [32, 31]. In contrast, continuous regression problems present the complexity of lacking a well-defined distance measure to regularize the inferred evidential distribution. Further, pre-defining a reasonable OOD dataset is non-trivial in the majority of applications; thus, methods to obtain calibrated uncertainty on OOD data from only an in-distribution training set are required.

We present a novel approach that models the uncertainty of regression networks via learned evidential distributions (Fig. 1). Specifically, this work makes the following contributions:

1. A novel and scalable method for learning epistemic and aleatoric uncertainty on regression problems, without sampling during inference or training with out-of-distribution data;
2. Formulation of an evidential regularizer for continuous regression problems, necessary for penalizing incorrect evidence on errors and OOD examples;
3. Evaluation of epistemic uncertainty on benchmark and complex vision regression tasks along with comparisons to state-of-the-art NN uncertainty estimation techniques; and
4. Robustness and calibration evaluation on OOD and adversarially perturbed test input data.
2 Modelling uncertainties from data

2.1 Preliminaries

Consider the following supervised optimization problem: given a dataset, D, of N paired training examples, D = {x_i, y_i}_{i=1}^N, we aim to learn a functional mapping f, parameterized by a set of weights, w, which approximately solves the following optimization problem:

    min_w J(w);   J(w) = (1/N) Σ_{i=1}^N L_i(w),   (1)

where L_i(·) describes a loss function. In this work, we consider deterministic regression problems, which commonly optimize the sum of squared errors, L_i(w) = ½ ‖y_i − f(x_i; w)‖². In doing so, the model is encouraged to learn the average correct answer for a given input, but does not explicitly model any underlying noise or uncertainty in the data when making its estimation.

2.2 Maximum likelihood estimation

One can approach this problem from a maximum likelihood perspective, where we learn model parameters that maximize the likelihood of observing a particular set of training data. In the context of deterministic regression, we assume our targets, y_i, were drawn i.i.d. from a distribution such as a Gaussian with mean and variance parameters θ = (μ, σ²). In maximum likelihood estimation (MLE), we aim to learn a model to infer θ that maximizes the likelihood of observing our targets, y, given by p(y_i | θ). This is achieved by minimizing the negative log likelihood loss function:

    L_i(w) = −log p(y_i | θ) = ½ log(2πσ²) + (y_i − μ)² / (2σ²).   (2)

In learning θ, this likelihood function successfully models the uncertainty in the data, also known as the aleatoric uncertainty. However, our model is oblivious to its predictive epistemic uncertainty [25].
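As a concrete reference point, the per-sample Gaussian negative log likelihood in Eq. (2) can be written in a few lines of Python (the function name below is ours, for illustration, and is not taken from the paper's codebase):

```python
import numpy as np

def gaussian_nll(y, mu, sigma2):
    """Per-sample negative log likelihood of y under N(mu, sigma2) -- Eq. (2)."""
    return 0.5 * np.log(2.0 * np.pi * sigma2) + (y - mu) ** 2 / (2.0 * sigma2)
```

Minimizing this loss jointly over (μ, σ²) is what lets the model capture aleatoric uncertainty: the σ² term penalizes both over- and under-estimating the noise level.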
In this paper, we present a novel approach for estimating the evidence supporting network predictions in regression by directly learning both the aleatoric uncertainty present in the data as well as the model's underlying epistemic uncertainty. We achieve this by placing higher-order prior distributions over the learned parameters governing the distribution from which our observations are drawn.

Figure 2: Normal Inverse-Gamma distribution. Different realizations of our evidential distribution (A) correspond to different levels of confidence in the parameters (e.g. μ, σ²). Sampling from a single realization of a higher-order evidential distribution (B) yields lower-order likelihoods (C) over the data (e.g. p(y | μ, σ²)). Darker shading indicates higher probability mass. We aim to learn a model that predicts the target, y, from an input, x, with an evidential prior imposed on our likelihood to enable uncertainty estimation.

3 Evidential uncertainty for regression

3.1 Problem setup

We consider the problem where the observed targets, y_i, are drawn i.i.d. from a Gaussian distribution, as in standard MLE (Sec. 2.2), but now with unknown mean and variance (μ, σ²), which we seek to also probabilistically estimate. We model this by placing a prior distribution on (μ, σ²). If we assume observations are drawn from a Gaussian, in line with the assumptions of Sec. 2.2, this leads to placing a Gaussian prior on the unknown mean and an Inverse-Gamma prior on the unknown variance:

    (y_1, …, y_N) ~ N(μ, σ²),   μ ~ N(γ, σ² υ⁻¹),   σ² ~ Γ⁻¹(α, β),   (3)

where Γ(·) is the gamma function, m = (γ, υ, α, β), and γ ∈ ℝ, υ > 0, α > 1, β > 0.
Our aim is to estimate a posterior distribution q(μ, σ²) = p(μ, σ² | y_1, …, y_N). To obtain an approximation for the true posterior, we assume that the estimated distribution can be factorized [39] such that q(μ, σ²) = q(μ) q(σ²). Thus, our approximation takes the form of the Gaussian conjugate prior, the Normal Inverse-Gamma (NIG) distribution:

    p(μ, σ² | γ, υ, α, β) = (β^α √υ) / (Γ(α) √(2πσ²)) · (1/σ²)^(α+1) · exp(−(2β + υ(γ − μ)²) / (2σ²)).   (4)

A popular interpretation of the parameters of this conjugate prior distribution is in terms of "virtual-observations" in support of a given property [23]. For example, the mean of a NIG distribution can be intuitively interpreted as being estimated from υ virtual-observations with sample mean γ, while its variance is estimated from 2α virtual-observations with sample mean γ and sum of squared deviations 2β. Following from this interpretation, we define the total evidence, Φ, of our evidential distribution as the sum of all inferred virtual-observation counts: Φ = 2υ + α.

Drawing a sample θ_j from the NIG distribution yields a single instance of our likelihood function, namely N(μ_j, σ²_j). Thus, the NIG hyperparameters, (γ, υ, α, β), determine not only the location but also the dispersion, or uncertainty, associated with our inferred likelihood function. Therefore, we can interpret the NIG distribution as the higher-order, evidential distribution on top of the unknown lower-order likelihood distribution from which observations are drawn.

For example, in Fig. 2A we visualize different evidential NIG distributions with varying model parameters. We illustrate that by increasing the evidential parameters (i.e. υ, α) of this distribution, the p.d.f. becomes tightly concentrated about its inferred likelihood function.
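The hierarchical sampling illustrated in Fig. 2 (a NIG realization yields Gaussian likelihood parameters, Eq. 3) can be sketched with NumPy; the function name and the check constants are ours, for illustration:

```python
import numpy as np

def sample_nig(gamma, upsilon, alpha, beta, rng, n=1):
    """Draw likelihood parameters theta = (mu, sigma2) from NIG(gamma, upsilon, alpha, beta).

    sigma2 ~ Inv-Gamma(alpha, beta) is drawn as the reciprocal of a
    Gamma(shape=alpha, scale=1/beta) sample, then mu ~ N(gamma, sigma2 / upsilon).
    """
    sigma2 = 1.0 / rng.gamma(shape=alpha, scale=1.0 / beta, size=n)
    mu = rng.normal(gamma, np.sqrt(sigma2 / upsilon))
    return mu, sigma2
```

Each returned (μ_j, σ²_j) pair is one lower-order likelihood N(μ_j, σ²_j) (Fig. 2C); larger υ and α concentrate the draws around (γ, β/(α−1)).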
Considering a single parameter realization of this higher-order distribution (Fig. 2B), we can subsequently sample many lower-order realizations of our likelihood function, as shown in Fig. 2C. In this work, we use neural networks to infer, given an input, the hyperparameters, m, of this higher-order, evidential distribution. This approach presents several distinct advantages compared to prior work. First, our method enables simultaneous learning of the desired regression task, along with aleatoric and epistemic uncertainty estimation, by enforcing evidential priors and without leveraging any out-of-distribution data during training. Second, since the evidential prior is a higher-order NIG distribution, the maximum likelihood Gaussian can be computed analytically from the expected values of the (μ, σ²) parameters, without the need for sampling. Third, we can effectively estimate the epistemic or model uncertainty associated with the network's prediction by simply evaluating the variance of our inferred evidential distribution.

3.2 Prediction and uncertainty estimation

The aleatoric uncertainty, also referred to as statistical or data uncertainty, is representative of unknowns that differ each time we run the same experiment. The epistemic (or model) uncertainty describes the estimated uncertainty in the prediction. Given a NIG distribution, we can compute the prediction, aleatoric, and epistemic uncertainty as

    E[μ] = γ (prediction),   E[σ²] = β / (α − 1) (aleatoric),   Var[μ] = β / (υ(α − 1)) (epistemic).   (5)

Complete derivations for these moments are available in Sec. S1.1. Note that Var[μ] = E[σ²]/υ, which is expected as υ is one of our two evidential virtual-observation counts.
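The three quantities in Eq. (5) are closed-form functions of the four NIG parameters, so extracting them requires no sampling; a minimal sketch (function name ours):

```python
def nig_moments(gamma, upsilon, alpha, beta):
    """Prediction, aleatoric, and epistemic uncertainty of a NIG -- Eq. (5)."""
    prediction = gamma                            # E[mu]
    aleatoric = beta / (alpha - 1.0)              # E[sigma^2], requires alpha > 1
    epistemic = beta / (upsilon * (alpha - 1.0))  # Var[mu] = E[sigma^2] / upsilon
    return prediction, aleatoric, epistemic
```

This is why a single forward pass suffices at inference time: the network emits (γ, υ, α, β), and the uncertainties fall out analytically.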
3.3 Learning the evidential distribution

Having formalized the use of an evidential distribution to capture both aleatoric and epistemic uncertainty, we next describe our approach for learning a model to output the hyperparameters of this distribution. For clarity, we structure the learning process as a multi-task learning problem with two distinct parts: (1) acquiring or maximizing model evidence in support of our observations, and (2) minimizing evidence or inflating uncertainty when the prediction is wrong. At a high level, we can think of (1) as a way of fitting our data to the evidential model, while (2) enforces a prior to remove incorrect evidence and inflate uncertainty.

(1) Maximizing the model fit. From Bayesian probability theory, the "model evidence", or marginal likelihood, is defined as the likelihood of an observation, y_i, given the evidential distribution parameters m, and is computed by marginalizing over the likelihood parameters θ:

    p(y_i | m) = [p(y_i | θ, m) p(θ | m)] / p(θ | y_i, m) = ∫₀^∞ ∫_{−∞}^{∞} p(y_i | μ, σ²) p(μ, σ² | m) dμ dσ².   (6)

The model evidence is, in general, not straightforward to evaluate, since computing it involves integrating out the dependence on latent model parameters. However, in the case of placing a NIG evidential prior on our Gaussian likelihood function, an analytical solution does exist:

    p(y_i | m) = St(y_i; γ, β(1 + υ)/(υα), 2α),   (7)

where St(y; μ_St, σ²_St, υ_St) is the Student-t distribution evaluated at y with location μ_St, scale σ²_St, and υ_St degrees of freedom. We denote the loss, L_i^NLL(w), as the negative logarithm of the model evidence:

    L_i^NLL(w) = ½ log(π/υ) − α log(Ω) + (α + ½) log((y_i − γ)² υ + Ω) + log(Γ(α)/Γ(α + ½)),   (8)

where Ω = 2β(1 + υ). Complete derivations for Eq. 7 and Eq. 8 are provided in Sec. S1.2.
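Eq. (8) translates directly into code using the log-gamma function; a per-sample sketch (function name ours, scalar for readability rather than batched tensors):

```python
import math

def evidential_nll(y, gamma, upsilon, alpha, beta):
    """Negative log marginal likelihood of y under the NIG evidential model -- Eq. (8).

    Equivalent to -log St(y; gamma, beta*(1+upsilon)/(upsilon*alpha), 2*alpha) of Eq. (7).
    """
    omega = 2.0 * beta * (1.0 + upsilon)
    return (0.5 * math.log(math.pi / upsilon)
            - alpha * math.log(omega)
            + (alpha + 0.5) * math.log((y - gamma) ** 2 * upsilon + omega)
            + math.lgamma(alpha) - math.lgamma(alpha + 0.5))
```

Because Eq. (8) is just the negative log of the Student-t in Eq. (7), the two forms can be cross-checked numerically against each other.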
This loss provides an objective for training a NN to output parameters of a NIG distribution to fit the observations by maximizing the model evidence.

(2) Minimizing evidence on errors. Next, we describe how to regularize training by applying an incorrect evidence penalty (i.e., a high uncertainty prior) to try to minimize evidence on incorrect predictions. This has been demonstrated with success in the classification setting, where non-misleading evidence is removed from the posterior and the uncertain prior is set to a uniform Dirichlet [42]. The analogous minimization in the regression setting involves KL[ p(θ | m) ‖ p(θ | m̃) ], where m̃ are the parameters of the uncertain NIG prior with zero evidence (i.e., {α, υ} = 0). Unfortunately, the KL between any NIG and the zero evidence NIG prior is undefined (1). Furthermore, this loss should not be enforced everywhere, but instead specifically where the posterior is "misleading". Past works in classification [42] accomplish this by using the ground truth likelihood classification (the one-hot encoded labels) to remove "non-misleading" evidence. However, in regression, it is not possible to penalize evidence everywhere except our single label point estimate, as this space is infinite and unbounded. Thus, these previous approaches for regularizing evidential learning are not applicable.

To address these challenges in the regression setting, we formulate a novel evidence regularizer, L_i^R, scaled on the error of the i-th prediction:

    L_i^R(w) = |y_i − E[μ_i]| · Φ = |y_i − γ| · (2υ + α).   (9)

This loss imposes a penalty whenever there is an error in the prediction and scales with the total evidence of our inferred posterior. Conversely, large amounts of predicted evidence will not be penalized as long as the prediction is close to the target.
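Eq. (9) is a one-liner in practice: the absolute prediction error gates the total evidence Φ = 2υ + α (function name ours, for illustration):

```python
def evidential_reg(y, gamma, upsilon, alpha):
    """Evidence regularizer -- Eq. (9): prediction error scaled by total evidence."""
    phi = 2.0 * upsilon + alpha  # total evidence, Phi = 2*upsilon + alpha
    return abs(y - gamma) * phi
```

When the prediction γ matches the target, the penalty vanishes regardless of how much evidence the model claims; when the prediction is wrong, confident (high-Φ) outputs are penalized most.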
A naïve alternative to directly penalizing evidence would be to soften the zero-evidence prior to instead have ε-evidence, such that the KL is finite and defined. However, doing so results in hypersensitivity to the selection of ε, as it should be small yet KL → ∞ as ε → 0. We demonstrate the added value of our evidential regularizer through ablation analysis (Sec. 4.1), the limitations of the soft KL regularizer (Sec. S2.1.3), and the ability to learn disentangled aleatoric and epistemic uncertainty (Sec. S2.1.4).

Summary and implementation details. The total loss, L_i(w), consists of the two loss terms for maximizing and regularizing evidence, scaled by a regularization coefficient, λ:

    L_i(w) = L_i^NLL(w) + λ L_i^R(w).   (10)

Here, λ trades off uncertainty inflation with model fit. Setting λ = 0 yields an over-confident estimate, while setting λ too high results in over-inflation (2). In practice, our NN is trained to output the parameters, m, of the evidential distribution: m_i = f(x_i; w). Since m is composed of 4 parameters, f has 4 output neurons for every target y. We enforce the constraints on (υ, α, β) with a softplus activation (with an additional +1 added to α, since α > 1). A linear activation is used for γ ∈ ℝ.

4 Experiments

4.1 Predictive accuracy and uncertainty benchmarking

We first qualitatively compare the performance of our approach against a set of baselines on a one-dimensional cubic regression dataset (Fig. 3). Following [20, 28], we train models on y = x³ + ε, where ε ~ N(0, 3), within ±4 and test within ±6. We compare aleatoric (A) and epistemic (B) uncertainty estimation for baseline methods (left), evidence without regularization (middle), and with regularization (right). Gaussian MLE [36] and Ensembling [28] are used as respective baseline methods. All aleatoric methods (A) accurately capture uncertainty within the training distribution, as expected.
Epistemic uncertainty (B) captures uncertainty on OOD data; our proposed evidential method estimates uncertainty appropriately, growing on OOD data without dependence on sampling. Training details and additional experiments for this example are available in Sec. S2.1.

Additionally, we compare our approach to baseline methods for NN predictive uncertainty estimation on real world datasets used in [20, 28, 9]. We evaluate our proposed evidential regression method against results presented for model ensembles [28] and dropout [9] based on root mean squared error (RMSE), negative log-likelihood (NLL), and inference speed. Table 1 indicates that even though, unlike the competing approaches, the loss function for evidential regression does not explicitly optimize accuracy, it remains competitive with respect to RMSE while being the top performer on all datasets for NLL and speed. To give the two baseline methods maximum advantage, we parallelize their sampled inference (n = 5). Dropout requires additional multiplications with the sampled mask, resulting in slightly slower inference compared to ensembles, whereas evidence only requires a single forward pass and network. Training details for Table 1 are available in Sec. S2.2.

(1) Please refer to Sec. S1.3 for a derivation of the KL between two NIGs, along with a no-evidence NIG prior.
(2) Experiments demonstrating the effect of λ on a learning problem are provided in Sec. S2.1.3.

Table 1: Benchmark regression tests. RMSE, negative log-likelihood (NLL), and inference speed for dropout sampling [9], model ensembling [28], and evidential regression. Top scores for each metric and dataset are bolded (within statistical significance); n = 5 for sampling baselines. Evidential models outperform baseline methods for NLL and inference speed on all datasets.

Dataset     RMSE                                       NLL                                         Inference Speed (ms)
            Dropout      Ensembles    Evidential      Dropout       Ensembles     Evidential     Dropout  Ensembles  Evidential
Boston      2.97 ± 0.19  3.28 ± 1.00  3.06 ± 0.16     2.46 ± 0.06   2.41 ± 0.25   2.35 ± 0.06    3.24     3.35       0.85
Concrete    5.23 ± 0.12  6.03 ± 0.58  5.85 ± 0.15     3.04 ± 0.02   3.06 ± 0.18   3.01 ± 0.02    2.99     3.43       0.94
Energy      1.66 ± 0.04  2.09 ± 0.29  2.06 ± 0.10     1.99 ± 0.02   1.38 ± 0.22   1.39 ± 0.06    3.08     3.80       0.87
Kin8nm      0.10 ± 0.00  0.09 ± 0.00  0.09 ± 0.00    −0.95 ± 0.01  −1.20 ± 0.02  −1.24 ± 0.01    3.24     3.79       0.97
Naval       0.01 ± 0.00  0.00 ± 0.00  0.00 ± 0.00    −3.80 ± 0.01  −5.63 ± 0.05  −5.73 ± 0.07    3.31     3.37       0.84
Power       4.02 ± 0.04  4.11 ± 0.17  4.23 ± 0.09     2.80 ± 0.01   2.79 ± 0.04   2.81 ± 0.07    2.93     3.36       0.85
Protein     4.36 ± 0.01  4.71 ± 0.06  4.64 ± 0.03     2.89 ± 0.00   2.83 ± 0.02   2.63 ± 0.00    3.45     3.68       1.18
Wine        0.62 ± 0.01  0.64 ± 0.04  0.61 ± 0.02     0.93 ± 0.01   0.94 ± 0.12   0.89 ± 0.05    3.00     3.32       0.86
Yacht       1.11 ± 0.09  1.58 ± 0.48  1.57 ± 0.56     1.55 ± 0.03   1.18 ± 0.21   1.03 ± 0.19    2.99     3.36       0.87
Figure 3: Toy uncertainty estimation. Aleatoric (A) and epistemic (B) uncertainty estimates on the dataset y = x³ + ε, ε ~ N(0, 3). Regularized evidential regression (right) enables precise prediction within the training regime and conservative epistemic uncertainty estimates in regions with no training data. Baseline results (Gaussian maximum likelihood; ensembles, n = 5) are also illustrated.

4.2 Monocular depth estimation

After establishing benchmark comparison results, in this subsection we demonstrate the scalability of our evidential learning approach by extending it to the complex, high-dimensional task of depth estimation. Monocular end-to-end depth estimation is a central problem in computer vision and involves learning a representation of depth directly from an RGB image of the scene. This is a challenging learning task, as the target y is very high-dimensional, with predictions at every pixel. Our training data consists of over 27k RGB-to-depth, H × W, image pairs of indoor scenes (e.g. kitchen, bedroom, etc.) from the NYU Depth v2 dataset [35]. We train a U-Net style NN [41] for inference and test on a disjoint test set of scenes (3). The final layer outputs a single H × W activation map in the case of vanilla regression, dropout, and ensembling. Spatial dropout uncertainty sampling [2, 45] is used for the dropout implementation. Evidential regression outputs four of these output maps, corresponding to (γ, υ, α, β), with constraints according to Sec.
3.3. We evaluate the models in terms of their accuracy and their predictive epistemic uncertainty on unseen test data. Fig. 4A visualizes the predicted depth, absolute error from ground truth, and predictive entropy across two randomly picked test images. Ideally, a strong epistemic uncertainty measure would capture errors in the prediction (i.e., roughly correspond to where the model is making errors). Compared to dropout and ensembling, evidential modeling captures the depth errors while providing clear and localized predictions of confidence. In general, dropout drastically underestimates the amount of uncertainty present, while ensembling occasionally overestimates the uncertainty. Fig. 4B shows how each model performs as pixels with uncertainty greater than certain thresholds are removed. Evidential models exhibit strong performance, as error steadily decreases with increasing confidence. Fig. 4C additionally evaluates the calibration of our uncertainty estimates. Calibration curves are computed according to [27], and ideally follow y = x to represent, for example, that a target falls in a 90% confidence interval approximately 90% of the time. Again, we see that dropout overestimates confidence when considering low confidence scenarios (calibration error: 0.126). Ensembling exhibits better calibration error (0.048) but is still outperformed by the proposed evidential method (0.033).

Figure 4: Epistemic uncertainty in depth estimation. (A) Example pixel-wise depth predictions and uncertainty for each model. (B) Relationship between prediction confidence level and observed error; a strong inverse trend is desired. (C) Model uncertainty calibration [27] (ideal: y = x). Inset shows calibration errors.

Figure 5: Uncertainty on out-of-distribution (OOD) data. Evidential models estimate low uncertainty (entropy) on in-distribution (ID) data (NYU Depth v2) and inflate uncertainty on OOD data (ApolloScape). (A) Cumulative density function (CDF) of ID and OOD entropy for tested methods; OOD detection assessed via AUC-ROC (Dropout: 0.99, Ensembles: 1.0, Evidential: 0.99). (B) Uncertainty (entropy) comparisons across methods. (C) Full density histograms of entropy estimated by evidential regression on ID and OOD data, along with sample images (D). All data has not been seen during training.

(3) Full dataset, model, training, and performance details for depth models are available in Sec. S3.
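The calibration curves above follow the procedure of [27]: for each expected confidence level p, measure the fraction of targets actually falling inside the model's central p-confidence interval. A minimal coverage check for Gaussian predictive distributions might look like the following (function names are ours, not from the paper's codebase):

```python
import math

def gauss_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def observed_confidence(ys, mus, sigmas, p):
    """Fraction of targets inside the central p-confidence interval of each
    predicted N(mu, sigma^2); for a well-calibrated model this is ~p."""
    inside = [2.0 * gauss_cdf(abs(y - m) / s) - 1.0 <= p
              for y, m, s in zip(ys, mus, sigmas)]
    return sum(inside) / len(inside)
```

Sweeping p from 0 to 1 and plotting observed against expected confidence yields the calibration curve; the calibration error is the deviation of that curve from y = x.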
Results show evaluations from multiple trials, with individual trials available in Sec. S3.3. In addition to epistemic uncertainty experiments, we also evaluate aleatoric uncertainty estimates, with comparisons to Gaussian MLE learning. Since evidential models fit the data to a higher-order Gaussian distribution, it is expected that they can accurately learn aleatoric uncertainty (as is also shown in [42, 18]). Therefore, we present these aleatoric results in Sec. S3.4 and focus the remainder of the results on evaluating the harder task of epistemic uncertainty estimation in the context of out-of-distribution (OOD) and adversarially perturbed samples.

4.3 Out-of-distribution testing

A key use of uncertainty estimation is to understand when a model is faced with test samples that fall out-of-distribution (OOD) or when the model's output cannot be trusted. In this subsection, we investigate the ability of evidential models to capture increased epistemic uncertainty on OOD data by testing on images from ApolloScape [21], an OOD dataset of diverse outdoor driving. It is crucial to note here that related methods, such as Prior Networks in classification [32, 33], explicitly require OOD data during training to supervise instances of high uncertainty. Our evidential method, like Bayesian NNs, does not have this limitation and sees only in-distribution (ID) data during training.

For each method, we feed in the ID and OOD test sets and record the mean predicted entropy for every test image. Fig. 5A shows the cumulative density function (CDF) of entropy for each of the methods and test sets. A distinct positive shift in the entropy CDFs can be seen for evidential models on OOD data and is competitive across methods. Fig. 5B summarizes these entropy distributions as interquartile boxplots to again show clear separation in the uncertainty distribution on OOD data.
We focus on the distribution from our evidential models in Fig. 5C and provide sample predictions (ID and OOD) in Fig. 5D. These results show that evidential models, without training on OOD data, capture increased uncertainty on OOD data on par with epistemic uncertainty estimation baselines.

4.3.1 Robustness to adversarial samples

Next, we consider the extreme case of OOD detection where the inputs are adversarially perturbed to inflict error on the predictions. We compute adversarial perturbations to our test set using the Fast Gradient Sign Method (FGSM) [16], with increasing scales, ε, of noise. Note that the purpose of this experiment is not to propose a defense against state-of-the-art adversarial attacks, but rather to demonstrate that evidential models accurately capture increased predictive uncertainty on samples which have been adversarially perturbed. Fig. 6A confirms that the absolute error of all methods increases as adversarial noise is added. We also observe a positive effect of noise on our predictive

Figure 6: Evidential robustness under adversarial noise. Relationship between adversarial noise and predictive error (A) and estimated epistemic uncertainty (B). (C) CDF of entropy estimated by evidential regression under the presence of increasing ε. (D) Visualization of the effects of increasing adversarial perturbation on the predictions, error, and uncertainty for evidential regression. Results of a sample test-set image are shown.

uncertainty estimates in Fig. 6B.
Furthermore, we observe that the entropy CDF steadily shifts towards higher uncertainties as the noise in the input sample increases (Fig. 6C). The robustness of evidential uncertainty against adversarial perturbations is visualized in greater detail in Fig. 6D, which illustrates the predicted depth, error, and estimated pixel-wise uncertainty as we perturb the input image with greater amounts of noise (left to right). Not only does the predictive uncertainty steadily increase with increasing noise, but the spatial concentrations of uncertainty throughout the image also maintain tight correspondence with the error.

5 Related work

Our work builds on a large history of uncertainty estimation [25, 38, 37, 19] and modelling probability distributions using NNs [36, 4, 14, 26].

Prior networks and evidential models. A large focus within Bayesian inference is on placing prior distributions over hierarchical models to estimate uncertainty [12, 13]. Our methodology closely relates to evidential deep learning [42] and Prior Networks [32, 33], which place Dirichlet priors over discrete classification predictions. However, these works either rely on regularizing divergence to a fixed, well-defined prior [42, 46], require OOD training data [32, 31, 7, 19], or can only estimate aleatoric uncertainty by performing density estimation [11, 18]. Our work tackles these limitations with a focus on continuous regression learning tasks, where this divergence regularizer is not well-defined, without requiring any OOD training data to estimate both aleatoric and epistemic uncertainty.

Bayesian deep learning. In Bayesian deep learning, priors are placed over network weights that are estimated using variational inference [26]. Approximations via dropout [9, 34, 10, 2], ensembling [28, 40], or other approaches [5, 20] rely on expensive samples to estimate predictive variance.
In contrast, we train a deterministic NN to place uncertainty priors over the predictive distribution, requiring only a single forward pass to estimate uncertainty. Additionally, our approach to uncertainty estimation proved to be well calibrated and was capable of detecting OOD and adversarial data.

6 Conclusions, limitations, and scope

In this paper, we develop a novel method for learning uncertainty in regression problems by placing evidential priors over the likelihood output. We demonstrate combined prediction with aleatoric and epistemic uncertainty estimation, scalability to complex vision tasks, and calibrated uncertainty on OOD data. This method is widely applicable across regression tasks including temporal forecasting [17], property prediction [8], and control learning [1, 30]. While our method presents several advantages over existing approaches, its primary limitations are in tuning the regularization coefficient and in effectively removing non-misleading evidence when calibrating the uncertainty. While dual-optimization formulations [47] could be explored for balancing regularization, we believe further investigation is warranted to discover alternative ways to remove non-misleading evidence. Future analysis using other choices of the variance prior distribution, such as the log-normal or the heavy-tailed log-Cauchy distribution, will be critical to determine the effects of the choice of prior on the estimated likelihood parameters. The efficiency, scalability, and calibration of our approach could enable the precise and fast uncertainty estimation required for robust NN deployment in safety-critical prediction domains.

Broader Impact

Uncertainty estimation for neural networks has very significant societal impact.
Neural networks are increasingly being trained as black-box predictors and placed in larger decision systems, where errors in their predictions can pose immediate threats to downstream tasks. Systematic methods for calibrated uncertainty estimation under these conditions are needed, especially as these systems are deployed in safety-critical domains, such as autonomous vehicle control [29] and medical diagnosis [43], or in settings with large dataset imbalances and bias, such as crime forecasting [24] and facial recognition [3]. This work is complementary to a large portion of machine learning research, which is continually pushing the boundaries on neural network precision and accuracy. Instead of solely optimizing larger models for increased performance, our method focuses on how these models can be equipped with the ability to estimate their own confidence. Our results demonstrating superior calibration of our method over baselines are also critical in ensuring that we can place a certain level of trust in these algorithms and in understanding when they say "I don't know".

While there are clear and broad benefits of uncertainty estimation in machine learning, we believe it is also important to recognize potential societal challenges that may arise. With increased performance and uncertainty estimation capabilities, humans will inevitably become increasingly trusting of a model's predictions, as well as of its ability to catch dangerous or uncertain decisions before they are executed. Thus, it is important to continue to pursue redundancy in such learning systems to increase the likelihood that mistakes can be caught and corrected independently.

Acknowledgments and Disclosure of Funding

This research is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. 1122374 and by the Toyota Research Institute (TRI).
We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Volta V100 GPU used for this research.

References

[1] Alexander Amini, Guy Rosman, Sertac Karaman, and Daniela Rus. Variational end-to-end navigation and localization. In 2019 International Conference on Robotics and Automation (ICRA), pages 8958–8964. IEEE, 2019.

[2] Alexander Amini, Ava Soleimany, Sertac Karaman, and Daniela Rus. Spatial uncertainty sampling for end-to-end control. arXiv preprint arXiv:1805.04829, 2018.

[3] Alexander Amini, Ava P Soleimany, Wilko Schwarting, Sangeeta N Bhatia, and Daniela Rus. Uncovering and mitigating algorithmic bias through learned latent structure. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 289–295, 2019.

[4] Christopher M Bishop. Mixture density networks. Tech. Rep. NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[5] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint, 2015.

[6] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.

[7] Wenhu Chen, Yilin Shen, Hongxia Jin, and William Wang. A variational Dirichlet framework for out-of-distribution detection. arXiv preprint arXiv:1811.07308, 2018.

[8] Connor W Coley, Regina Barzilay, William H Green, Tommi S Jaakkola, and Klavs F Jensen. Convolutional embedding of attributed molecular graphs for physical property prediction. Journal of Chemical Information and Modeling, 57(8):1757–1772, 2017.

[9] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
[10] Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In Advances in Neural Information Processing Systems, pages 3581–3590, 2017.

[11] Jochen Gast and Stefan Roth. Lightweight probabilistic deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3369–3378, 2018.

[12] Andrew Gelman et al. Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis, 1(3):515–534, 2006.

[13] Andrew Gelman, Aleks Jakulin, Maria Grazia Pittau, Yu-Sung Su, et al. A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2(4):1360–1383, 2008.

[14] Igor Gilitschenski, Roshni Sahoo, Wilko Schwarting, Alexander Amini, Sertac Karaman, and Daniela Rus. Deep orientation uncertainty learning based on a Bingham loss. In International Conference on Learning Representations, 2019.

[15] Clément Godard, Oisin Mac Aodha, and Gabriel J Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 270–279, 2017.

[16] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint, 2014.

[17] Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10):2222–2232, 2016.

[18] Pavel Gurevich and Hannes Stuke. Gradient conjugate priors and multi-layer neural networks. Artificial Intelligence, 278:103184, 2020.

[19] Danijar Hafner, Dustin Tran, Timothy Lillicrap, Alex Irpan, and James Davidson. Noise contrastive priors for functional uncertainty. In Uncertainty in Artificial Intelligence, pages 905–914. PMLR, 2020.

[20] José Miguel Hernández-Lobato and Ryan Adams.
Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pages 1861–1869, 2015.

[21] Xinyu Huang, Xinjing Cheng, Qichuan Geng, Binbin Cao, Dingfu Zhou, Peng Wang, Yuanqing Lin, and Ruigang Yang. The ApolloScape dataset for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 954–960, 2018.

[22] Taejong Joo, Uijung Chung, and Min-Gwan Seo. Being Bayesian about categorical probability. arXiv preprint arXiv:2002.07965, 2020.

[23] Michael I Jordan. The exponential family: Conjugate priors, 2009.

[24] Hyeon-Woo Kang and Hang-Bong Kang. Prediction of crime occurrence from multi-modal data using deep learning. PLoS ONE, 12(4):e0176244, 2017.

[25] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pages 5574–5584, 2017.

[26] Durk P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.

[27] Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. Accurate uncertainties for deep learning using calibrated regression. arXiv preprint, 2018.

[28] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017.

[29] Mathias Lechner, Ramin Hasani, Alexander Amini, Thomas A Henzinger, Daniela Rus, and Radu Grosu. Neural circuit policies enabling auditable autonomy. Nature Machine Intelligence, 2(10):642–652, 2020.

[30] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.

[31] Andrey Malinin.
Uncertainty Estimation in Deep Learning with Application to Spoken Language Assessment. PhD thesis, University of Cambridge, 2019.

[32] Andrey Malinin and Mark Gales. Predictive uncertainty estimation via prior networks. In Advances in Neural Information Processing Systems, pages 7047–7058, 2018.

[33] Andrey Malinin and Mark Gales. Reverse KL-divergence training of prior networks: Improved uncertainty and adversarial robustness. In Advances in Neural Information Processing Systems, pages 14520–14531, 2019.

[34] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 2498–2507. JMLR.org, 2017.

[35] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.

[36] David A Nix and Andreas S Weigend. Estimating the mean and variance of the target probability distribution. In Proceedings of the 1994 IEEE International Conference on Neural Networks (ICNN'94), volume 1, pages 55–60. IEEE, 1994.

[37] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pages 4026–4034, 2016.

[38] Harris Papadopoulos and Haris Haralambous. Reliable prediction intervals with regression neural networks. Neural Networks, 24(8):842–851, 2011.

[39] Giorgio Parisi. Statistical Field Theory. Addison-Wesley, 1988.

[40] Tim Pearce, Mohamed Zaki, Alexandra Brintrup, N Anastassacos, and A Neely. Uncertainty in neural networks: Bayesian ensembling. stat, 1050:12, 2018.

[41] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[42] Murat Sensoy, Lance Kaplan, and Melih Kandemir. Evidential deep learning to quantify classification uncertainty. In Advances in Neural Information Processing Systems, pages 3179–3189, 2018.

[43] Li Shen, Laurie R Margolies, Joseph H Rothstein, Eugene Fluder, Russell McBride, and Weiva Sieh. Deep learning to improve breast cancer detection on screening mammography. Scientific Reports, 9(1):1–12, 2019.

[44] Joram Soch and Carsten Allefeld. Kullback-Leibler divergence for the normal-gamma distribution. arXiv preprint arXiv:1611.01437, 2016.

[45] Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 648–656, 2015.

[46] Theodoros Tsiligkaridis. Information robust Dirichlet networks for predictive uncertainty estimation. arXiv preprint arXiv:1910.04819, 2019.

[47] Shengjia Zhao, Jiaming Song, and Stefano Ermon. The information autoencoding family: A Lagrangian perspective on latent variable generative models. arXiv preprint arXiv:1806.06514, 2018.

Supplementary Materials

S1 Derivations

S1.1 Normal Inverse-Gamma moments

We assume our data was drawn from a Gaussian with unknown mean and variance, $(\mu, \sigma^2)$. We probabilistically model these parameters, $\theta$, according to

$$\mu \sim \mathcal{N}\!\left(\gamma,\, \sigma^2 \nu^{-1}\right), \tag{S1}$$
$$\sigma^2 \sim \Gamma^{-1}(\alpha, \beta). \tag{S2}$$

Therefore, the prior joint distribution can be written as:

\begin{align}
p(\underbrace{\mu, \sigma^2}_{\theta} \mid \underbrace{\gamma, \nu, \alpha, \beta}_{\mathbf{m}}) &= p(\mu)\, p(\sigma^2) \tag{S3} \\
&= \mathcal{N}\!\left(\gamma, \sigma^2 \nu^{-1}\right)\, \Gamma^{-1}(\alpha, \beta) \tag{S4} \\
&= \frac{\beta^{\alpha}\sqrt{\nu}}{\Gamma(\alpha)\sqrt{2\pi\sigma^2}} \left(\frac{1}{\sigma^2}\right)^{\alpha+1} \exp\left\{ -\frac{2\beta + \nu(\gamma - \mu)^2}{2\sigma^2} \right\}. \tag{S5}
\end{align}

The first-order moments of this distribution represent the maximum likelihood prediction as well as the uncertainty (both aleatoric and epistemic).
$$\mathbb{E}[\mu] = \int_{-\infty}^{\infty} \mu\, p(\mu)\, d\mu = \gamma \tag{S6}$$

$$\mathbb{E}[\sigma^2] = \int_{0}^{\infty} \sigma^2\, p(\sigma^2)\, d\sigma^2 = \frac{\beta}{\alpha - 1}, \qquad \forall\, \alpha > 1 \tag{S7--S9}$$

$$\mathrm{Var}[\mu] = \int_{-\infty}^{\infty} \mu^2\, p(\mu)\, d\mu - (\mathbb{E}[\mu])^2 = \gamma^2 + \frac{\mathbb{E}[\sigma^2]}{\nu} - \gamma^2 = \frac{\beta}{\nu(\alpha - 1)}, \qquad \forall\, \alpha > 1 \tag{S10--S13}$$

In summary,

$$\underbrace{\mathbb{E}[\mu] = \gamma}_{\text{prediction}}, \qquad \underbrace{\mathbb{E}[\sigma^2] = \frac{\beta}{\alpha - 1}}_{\text{aleatoric}}, \qquad \underbrace{\mathrm{Var}[\mu] = \frac{\beta}{\nu(\alpha - 1)}}_{\text{epistemic}}. \tag{S14}$$

S1.2 Model evidence & Type II Maximum Likelihood loss

In this subsection, we derive the posterior predictive, or model evidence (i.e., Eq. 7), of a NIG distribution. Marginalizing out $\mu$ and $\sigma^2$ gives our desired result:

\begin{align}
p(y_i \mid \mathbf{m}) &= \int_{\theta} p(y_i \mid \theta)\, p(\theta \mid \mathbf{m})\, d\theta \tag{S15} \\
&= \int_{\sigma^2=0}^{\infty} \int_{\mu=-\infty}^{\infty} p(y_i \mid \mu, \sigma^2)\, p(\mu, \sigma^2 \mid \gamma, \nu, \alpha, \beta)\, d\mu\, d\sigma^2 \tag{S16--S17} \\
&= \int_{\sigma^2=0}^{\infty} \int_{\mu=-\infty}^{\infty} \left[ \sqrt{\frac{1}{2\pi\sigma^2}} \exp\left\{ -\frac{(y_i - \mu)^2}{2\sigma^2} \right\} \right] \left[ \frac{\beta^{\alpha}\sqrt{\nu}}{\Gamma(\alpha)\sqrt{2\pi\sigma^2}} \left(\frac{1}{\sigma^2}\right)^{\alpha+1} \exp\left\{ -\frac{2\beta + \nu(\gamma - \mu)^2}{2\sigma^2} \right\} \right] d\mu\, d\sigma^2 \tag{S18--S19} \\
&= \int_{\sigma^2=0}^{\infty} \frac{\beta^{\alpha}\, \sigma^{-3-2\alpha}}{\sqrt{2\pi}\sqrt{1 + 1/\nu}\; \Gamma(\alpha)} \exp\left\{ -\frac{2\beta + \frac{\nu(y_i - \gamma)^2}{1+\nu}}{2\sigma^2} \right\} d\sigma^2 \tag{S20--S21} \\
&= \frac{\Gamma(1/2 + \alpha)}{\Gamma(\alpha)} \sqrt{\frac{\nu}{\pi}}\; \big(2\beta(1+\nu)\big)^{\alpha} \Big( \nu(y_i - \gamma)^2 + 2\beta(1+\nu) \Big)^{-\left(\frac{1}{2} + \alpha\right)} \tag{S22} \\
&= \mathrm{St}\!\left( y_i;\; \gamma,\; \frac{\beta(1+\nu)}{\nu\,\alpha},\; 2\alpha \right). \tag{S23}
\end{align}

Here $\mathrm{St}\left(y;\, \mu_{\mathrm{St}},\, \sigma^2_{\mathrm{St}},\, \nu_{\mathrm{St}}\right)$ is the Student-t distribution evaluated at $y$ with location parameter $\mu_{\mathrm{St}}$, scale parameter $\sigma^2_{\mathrm{St}}$, and $\nu_{\mathrm{St}}$ degrees of freedom. Using this result, we can compute the negative log-likelihood loss, $\mathcal{L}^{\mathrm{NLL}}_i$, for sample $i$ as:

\begin{align}
\mathcal{L}^{\mathrm{NLL}}_i &= -\log p(y_i \mid \mathbf{m}) \tag{S24} \\
&= -\log \mathrm{St}\!\left( y_i;\; \gamma,\; \frac{\beta(1+\nu)}{\nu\,\alpha},\; 2\alpha \right) \tag{S25} \\
&= \frac{1}{2}\log\frac{\pi}{\nu} - \alpha \log(\Omega) + \left(\alpha + \frac{1}{2}\right) \log\!\big( (y_i - \gamma)^2 \nu + \Omega \big) + \log\frac{\Gamma(\alpha)}{\Gamma(\alpha + \frac{1}{2})}, \tag{S26}
\end{align}

where $\Omega = 2\beta(1 + \nu)$.
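The closed-form moments (Eq. S14) and the negative log-likelihood (Eq. S26) are straightforward to implement. A sketch using NumPy/SciPy (function names are our own):

```python
import numpy as np
from scipy.special import gammaln

def nig_moments(gamma, v, alpha, beta):
    """Prediction and uncertainties of a Normal Inverse-Gamma (Eq. S14).
    Assumes alpha > 1 so the moments exist."""
    prediction = gamma                       # E[mu]
    aleatoric = beta / (alpha - 1.0)         # E[sigma^2]
    epistemic = beta / (v * (alpha - 1.0))   # Var[mu]
    return prediction, aleatoric, epistemic

def nig_nll(y, gamma, v, alpha, beta):
    """Negative log-likelihood of y under the NIG evidence (Eq. S26)."""
    omega = 2.0 * beta * (1.0 + v)
    return (0.5 * np.log(np.pi / v)
            - alpha * np.log(omega)
            + (alpha + 0.5) * np.log(v * (y - gamma) ** 2 + omega)
            + gammaln(alpha) - gammaln(alpha + 0.5))
```

As a sanity check, `nig_nll` should agree with the negative log-density of the Student-t of Eq. S23 with location γ, scale² β(1+ν)/(να), and 2α degrees of freedom.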
S1.3 KL-divergence of the Normal Inverse-Gamma

The KL-divergence between two Normal Inverse-Gamma distributions is given by [44]:

\begin{align}
&\mathrm{KL}\big( p(\mu, \sigma^2 \mid \gamma_1, \nu_1, \alpha_1, \beta_1)\; \big\|\; p(\mu, \sigma^2 \mid \gamma_2, \nu_2, \alpha_2, \beta_2) \big) \tag{S27} \\
&= \mathrm{KL}\big( \mathrm{NIG}(\gamma_1, \nu_1, \alpha_1, \beta_1)\; \big\|\; \mathrm{NIG}(\gamma_2, \nu_2, \alpha_2, \beta_2) \big) \tag{S28} \\
&= \frac{1}{2}\frac{\alpha_1}{\beta_1}(\gamma_1 - \gamma_2)^2 \nu_2 + \frac{1}{2}\frac{\nu_2}{\nu_1} - \frac{1}{2}\log\frac{\nu_2}{\nu_1} - \frac{1}{2} + \alpha_2 \log\frac{\beta_1}{\beta_2} - \log\frac{\Gamma(\alpha_1)}{\Gamma(\alpha_2)} \tag{S29} \\
&\qquad + (\alpha_1 - \alpha_2)\,\Psi(\alpha_1) - (\beta_1 - \beta_2)\frac{\alpha_1}{\beta_1}, \tag{S30}
\end{align}

where $\Gamma(\cdot)$ is the Gamma function and $\Psi(\cdot)$ is the digamma function.

For zero evidence, both $\alpha = 0$ and $\nu = 0$. To compute the KL-divergence between one NIG distribution and another with zero evidence, we can set either $\nu_2 = \alpha_2 = 0$ (i.e., reverse-KL), in which case $\Gamma(0)$ is not well defined, or $\nu_1 = \alpha_1 = 0$ (i.e., forward-KL), which causes a divide-by-zero error in $\nu_1$. In either approach, the KL-divergence between an arbitrary NIG and one with zero evidence cannot be evaluated. Instead, we briefly consider a naive alternative, which can be obtained by considering an $\epsilon$ amount of evidence, where $\epsilon$ is a small constant (instead of strictly zero evidence). This approach yields a well-defined KL-divergence (with fixed $\gamma, \beta$), at the cost of a hyper-sensitive $\epsilon$ parameter:

\begin{align}
&\mathrm{KL}\big( \mathrm{NIG}(\gamma, \nu, \alpha, \beta)\; \big\|\; \mathrm{NIG}(\gamma, \epsilon, 1 + \epsilon, \beta) \big) \tag{S31} \\
&= \frac{1}{2}\frac{\epsilon}{\nu} - \frac{1}{2}\log\frac{\epsilon}{\nu} - \frac{1}{2} - \log\frac{\Gamma(\alpha)}{\Gamma(1 + \epsilon)} + \big(\alpha - (1 + \epsilon)\big)\Psi(\alpha). \tag{S32}
\end{align}

In Fig. S2, we compare the performance of the KL-divergence regularizer to that of our more direct evidence regularizer, for several realizations of the regularization coefficient, $\lambda$. We observed extreme sensitivity to the setting of $\epsilon$ for different datasets, such that we could not achieve the desired regularizing effect for any regularization amount, $\lambda$. Unless otherwise stated, all results were obtained using our direct evidence regularizer (Eq. 9).
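For context, the direct evidence regularizer referenced above (Eq. 9) scales the prediction error by the total evidence, avoiding the ill-defined zero-evidence KL entirely. A minimal NumPy sketch (the helper name is ours):

```python
import numpy as np

def evidence_regularizer(y, gamma, v, alpha):
    """Direct evidence regularizer (cf. Eq. 9): the prediction error
    |y - gamma| is scaled by the total evidence 2*v + alpha, so the
    model is penalized most for being confidently wrong."""
    return np.abs(y - gamma) * (2.0 * v + alpha)

# The full objective combines this with the NIG negative log-likelihood:
#   L_i = L_i^NLL + lambda * evidence_regularizer(y_i, gamma, v, alpha),
# where lambda is the coefficient swept in Fig. S2.
```

Note that when the evidence is zero the penalty vanishes regardless of the error, which is exactly the behavior the ill-defined zero-evidence KL was meant to encode.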
S2 Benchmark regression tasks

S2.1 Cubic toy examples

S2.1.1 Dataset and experimental setup

The training set consists of training examples drawn from $y = x^3 + \epsilon$, where $\epsilon \sim \mathcal{N}(0, 3)$, in the region $-4 \leq x \leq 4$, whereas the test data is unbounded (we show the region $-6 \leq x \leq 6$). This problem setup is identical to that presented in [20, 28]. All models consisted of 100 neurons with 3 hidden layers and were trained to convergence. The data presented in Fig. S1 illustrates the estimated epistemic uncertainty and predicted mean across the entire test set. Sampling-based models [5, 9, 28] used $n = 5$ samples. The evidential model used $\lambda = 0.01$. All models were trained with the Adam optimizer ($\eta$ = 5e-3) for 5000 iterations with a batch size of 128.

S2.1.2 Baselines

Figure S1: Epistemic uncertainty estimation baselines (Evidential with regularization, Dropout sampling, Bayes by backpropagation, and Ensembles with $n = 5$) on the dataset $y = x^3 + \epsilon$, $\epsilon \sim \mathcal{N}(0, 3)$.

S2.1.3 Impact of the evidential regularizer

In the following experiment, we demonstrate the importance of augmenting the training objective with our evidential regularizer $\mathcal{L}_R$, as introduced in Sec. 3.3. Fig. S2 provides quantitative results on epistemic uncertainty estimation after training on the same regression problem presented in S2.1 with different realizations of the regularization coefficient, $\lambda$. We show that our ability to calibrate uncertainty on OOD data depends heavily on the regularizer. As we decrease the regularizer weight, uncertainty on OOD examples decays to zero; stronger regularization inflates the uncertainty ($\lambda = 0.01$ is a good choice for this problem), while aleatoric uncertainty remains constant.
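For reference, the toy dataset of S2.1.1 can be generated in a few lines. A sketch, under the reading that $\mathcal{N}(0, 3)$ denotes a noise standard deviation of 3:

```python
import numpy as np

rng = np.random.default_rng(0)

def cubic_dataset(n=1000, x_min=-4.0, x_max=4.0, noise_std=3.0):
    """Draw y = x^3 + eps, eps ~ N(0, 3), on the training
    interval [-4, 4] (cf. S2.1.1)."""
    x = rng.uniform(x_min, x_max, size=n)
    y = x ** 3 + rng.normal(0.0, noise_std, size=n)
    return x, y

x_train, y_train = cubic_dataset()
# Test inputs extend past the training support, so |x| > 4 is OOD.
x_test = np.linspace(-6.0, 6.0, 200)
```

A well-regularized evidential model should report growing epistemic uncertainty precisely on the |x| > 4 portion of `x_test`.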
Please refer to Fig. 3 for the regularization effect on both aleatoric and epistemic uncertainty.

S2.1.4 Disentanglement of aleatoric and epistemic uncertainty

In the following experiment, we provide results to suggest that the evidential regularizer is capable of disentangling aleatoric and epistemic uncertainties by capturing incorrect evidence. Specifically, we construct a synthetic toy dataset with high data noise (aleatoric uncertainty) in the center of the in-distribution region. Rather than using the L1 error in the regularization term, as in previous experiments, we regularize the standard score and estimate epistemic and aleatoric uncertainty (Fig. S3). This analysis suggests that the method is capable of disentangling epistemic and aleatoric uncertainties in a region that is in-distribution but has high data noise.

Figure S2: Impact of regularization strength on epistemic uncertainty estimates. Epistemic uncertainty estimates on the dataset $y = x^3 + \epsilon$, $\epsilon \sim \mathcal{N}(0, 3)$, for evidential regression models regularized with the evidential regularizer $\mathcal{L}_R$ (A) or with the KL divergence (B) between the inferred NIG and another with zero evidence, for varying regularization coefficients $\lambda$.

Figure S3: Disentangled uncertainties. Epistemic and aleatoric uncertainty estimates on a synthetic dataset based on $y = x^3$, where data noise increases towards the center of the in-distribution region.
The evidential regularizer $\mathcal{L}_R$ is calculated based on the standard score.

S2.2 Benchmark regression problems

S2.2.1 Datasets and experimental setup

This subsection describes the setup used to create Table 1. We follow an identical experimental setup and training process as presented in [20]. All dataset features are normalized to have zero mean and unit standard deviation. Features with no variance are only normalized to have zero mean. The same normalization process is also performed on the target variables; however, this is undone at inference time such that predictions are in the original scale of the targets. Datasets are split randomly into training and testing sets a total of 20 times. Each time we retrain the model and compute the desired metrics (RMSE, NLL, and speed). The results presented in the table represent the average and standard error across all 20 runs for every method and dataset. Following the lead of [28], we also directly compare against the other training methods by directly using their reported results, since they followed an identical training procedure.

S3 Depth estimation evaluations

S3.1 Experimental details

We evaluate depth estimation on the NYU-Depth-v2 dataset [35]. For every image scan in the dataset, we fill in the missing holes in the depth map using the Levin colorization method. The resulting depth map is converted to be proportional to disparity by taking its inverse. This is common in the depth learning literature, as it ensures that far-away objects result in numerically stable neural network outputs (very large depths have close to zero disparity). Objects closer than 1/255 meters to the camera would therefore be clipped due to the uint8 restriction on image precision. The resulting images are saved and used for supervising the learning algorithm. Training, validation, and test sets were randomly split (80-10-10) with no overlap in scans.
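The depth-to-disparity conversion described above can be sketched as follows. The exact encoding used by the authors is not specified, so the direct uint8 disparity storage here is an assumption consistent with the 1/255 m clipping noted above:

```python
import numpy as np

def depth_to_disparity_uint8(depth_m):
    """Convert a metric depth map to a uint8 disparity image (cf. S3.1).

    Disparity is proportional to 1/depth, so far objects map near zero,
    which keeps supervision targets numerically stable. Depths below
    1/255 m saturate at 255 due to the 8-bit quantization.
    """
    disparity = 1.0 / np.maximum(depth_m, 1.0 / 255.0)  # clip near range
    return np.round(np.clip(disparity, 0.0, 255.0)).astype(np.uint8)
```

Under this encoding, a depth of 1 m maps to disparity 1, and anything closer than 1/255 m saturates at the maximum uint8 value.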
All trained depth models have a U-Net [41] backbone, with five convolutional and pooling blocks down (and then back up). The input and target images had shape (160, 128), with inputs having 3 feature maps (RGB), while targets only had a single feature map (disparity). The dropout variants were trained with spatial dropout [45] over the convolutional blocks ($p = 0.1$). Evidential models additionally had four output target maps, one corresponding to each evidential parameter $\gamma, \nu, \alpha, \beta$, with activations as described in Sec. 3.3. All models were trained with the following hyperparameters: batch size of 32, Adam optimization with learning rate 5e-5, over 60000 iterations. The best model according to validation-set RMSE is saved and used for testing. Evidential models additionally used $\lambda = 0.1$. Each model was trained 3 times from random initialization to produce all presented results.

S3.2 Depth estimation performance metrics

Table S1 summarizes the size and speed of all models. Evidential models contain significantly fewer trainable parameters than ensembles (where the number of parameters scales linearly with the size of the ensemble). Since evidential regression models do not require sampling in order to estimate their uncertainty, their forward-pass inference times are also significantly more efficient. Finally, we demonstrate comparable predictive accuracy (through RMSE and NLL) to the other models.
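The four evidential output maps must satisfy $\nu, \beta > 0$ and $\alpha > 1$ for the NIG moments to exist. A common parameterization, assumed here as a sketch of the activations referenced above, applies a softplus (with +1 added for $\alpha$):

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)   # log(1 + e^x), numerically stable

def evidential_head(raw):
    """Map 4 raw network outputs to valid NIG parameters (cf. Sec. 3.3).

    gamma is unconstrained; v and beta must be positive; alpha must
    exceed 1 so that E[sigma^2] and Var[mu] are finite.
    """
    g, v_raw, a_raw, b_raw = raw
    return g, softplus(v_raw), softplus(a_raw) + 1.0, softplus(b_raw)
```

For the depth task, the same mapping would be applied per pixel over the four output feature maps.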
Method             N    # Parameters             Inference Speed        RMSE              NLL
                        Absolute      Relative   Seconds    Relative
Evidential (Ours)  -    7,846,776     1.00       0.003      1.00        0.024 ± 0.032     -1.128 ± 0.290
Spatial Dropout    2    7,846,657     1.00       0.028      10.20       0.033 ± 0.037     -0.564 ± 0.231
Spatial Dropout    5    7,846,657     1.00       0.031      11.48       0.031 ± 0.033     -1.227 ± 0.374
Spatial Dropout    10   7,846,657     1.00       0.037      13.69       0.035 ± 0.042     -1.139 ± 0.379
Spatial Dropout    25   7,846,657     1.00       0.065      23.99       0.032 ± 0.035     -1.137 ± 0.327
Spatial Dropout    50   7,846,657     1.00       0.107      39.36       0.032 ± 0.036     -1.110 ± 0.381
Ensembles          2    15,693,314    2.00       0.005      1.94        0.026 ± 0.032     -1.080 ± 3.334
Ensembles          5    39,233,285    5.00       0.010      3.72        0.023 ± 0.027     -1.077 ± 0.298
Ensembles          10   78,466,570    10.00      0.019      6.82        0.025 ± 0.038     -0.980 ± 0.298
Ensembles          25   196,166,425   25.00      0.045      16.45       0.022 ± 0.029     -1.000 ± 0.259
Ensembles          50   392,332,850   50.00      0.112      41.26       0.022 ± 0.031     -0.996 ± 0.275

Table S1: Depth estimation performance metrics. Comparison of different uncertainty estimation algorithms and predictive performance on an unseen test set. Dropout and ensembles were sampled N times on parallel threads. The evidential method outperforms all other algorithms in terms of space (# Parameters) and inference speed while maintaining competitive RMSE and NLL.

S3.3 Epistemic uncertainty estimation on depth

Fig. S4 shows individual trial runs for each method on RMSE cutoff plots, as summarized in Fig. 4B. Fig. S5 shows individual trial runs for each method on their respective calibration plots, as summarized in Fig. 4C. Fig. S6 shows individual trial runs for each method on their respective entropy (uncertainty) CDFs as a function of the amount of adversarial noise. We present the evidential portion of this figure in Fig. 6C, but also provide baseline results here.
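The calibration assessment referenced above (Fig. 4C and Fig. S5) compares expected and observed confidence levels. A sketch for Gaussian predictive distributions, assuming the interval-coverage measure in the style of [27] (function name is ours):

```python
import numpy as np
from scipy.stats import norm

def calibration_curve(y, mu, sigma, levels=np.linspace(0.1, 0.9, 9)):
    """Observed vs. expected confidence for predictions N(mu, sigma^2).

    For each expected level p, measure the fraction of targets falling
    inside the central p-probability interval of the predicted Gaussian.
    Perfect calibration gives observed == expected (the line y = x).
    """
    observed = []
    for p in levels:
        lo = norm.ppf(0.5 - p / 2.0, loc=mu, scale=sigma)
        hi = norm.ppf(0.5 + p / 2.0, loc=mu, scale=sigma)
        observed.append(np.mean((y >= lo) & (y <= hi)))
    return levels, np.array(observed)
```

Plotting `observed` against `levels` reproduces the style of curve shown in the calibration figures; deviation below the diagonal indicates overconfidence.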
Figure S4: Relationship between prediction confidence level and observed error for different uncertainty estimation methods (Evidential, Ensembles, Dropout). A strong inverse trend is desired, to demonstrate that the uncertainty estimates effectively capture accuracy. Plots show results from the depth estimation task.

Figure S5: Uncertainty calibration plots for depth estimation. Calibration of epistemic uncertainty estimates for dropout, ensembling, and evidential methods, assessed as the relationship between expected and observed predictive confidence levels. Perfect calibration corresponds to the line $y = x$ (black).

Figure S6: Effect of adversarial noise on uncertainty estimates. Cumulative distribution functions (CDFs) of entropy (uncertainty) estimated by dropout, ensembling, and evidential regression methods, under the presence of increasing adversarial noise $\epsilon$.

S3.4 Aleatoric uncertainty estimation on depth

Fig. S7 compares the evidential aleatoric uncertainty to that obtained by Gaussian likelihood optimization in several domains with high data uncertainty (mirror reflections and poor illumination). The results of both methods are in strong agreement, identifying mirror reflections and dark regions without visible geometry as sources of high uncertainty.
These results are expected, since evidential models fit the data to a higher-order Gaussian distribution, and therefore it is expected that they can accurately learn aleatoric uncertainty (as is also shown in [42, 18]). While the main text focuses on the more challenging problem of epistemic uncertainty estimation (especially on OOD data), we provide these sample aleatoric uncertainty examples for depth as supplemental material.

Figure S7: Aleatoric uncertainty in depth. Visualizing predicted aleatoric uncertainty in challenging reflection and illumination scenes (small mirror, dark wall, mirror noise). Comparison between the evidential method and [25] shows strong semantic agreement.

Author Response

We thank the reviewers for their very constructive and detailed feedback on our manuscript. In this work, we propose a novel and scalable method for inferring a continuous target as well as representations for epistemic and aleatoric uncertainty, without sampling during inference. Our method does not require any out-of-distribution (OOD) data during training (unlike Dirichlet Prior Networks [24]) and performs on par with or better than state-of-the-art (SoA) approaches. We demonstrate learning well-calibrated measures of uncertainty on various benchmarks, scaling to high-dimensional vision tasks, as well as robustness to new OOD and adversarial test samples. As a sampling-free and performant method, our work will enable key advances in resource-constrained areas, such as robotics, where sampling is infeasible.

Fig. 1: Disentangled uncertainty. Epistemic and aleatoric uncertainty estimates on a synthetic cubic dataset.

R1: 3.1. Pseudo-counts: The overall evidence is presented as the sum of pseudo-counts [32].
We could equivalently average, by applying a constant re-scaling directly captured by λ, without changing any of our results. 3.2a. Regularizer and results: We agree that our method provides no guarantees that it will definitively yield high epistemic uncertainty far from in-domain regions; however, we believe that the extensive empirical results achieved with our method, and results from similar related approaches which also train on only in-distribution data (i.e., [32], [Joo, T. et al. '20]), support the claim that uncertainty increases on OOD data. Our approach will undoubtedly improve by leveraging the supervision of OOD data during training, closer to [24, 3]; however, as we (and R2) point out, the need for OOD training data is often a critically limiting assumption. 3.2b. "Confused evidence": As R1 correctly states, the regularizer captures scenarios where the evidence is leading to the incorrect target (i.e., incorrect or "confused" evidence, not lack of evidence). We fully agree with this excellent point and have updated the manuscript to reflect this. However, we do not believe that the approach "conflates aleatoric and epistemic uncertainty" and provide results from the suggested experiment to support our claim (Fig. 1), using the standard score instead of L1 error. Further details and analysis are added to the manuscript. 3.3. Other metrics: Leveraging evidential distributions to compute M.I. or even differential entropy is a great idea, as these are rich statistics that our method captures. We focus on first-order moments for more direct comparability to existing SoA baselines and leave further analysis of richer statistics to future work. 4.1. Performance: RMSE for our method (and baselines) is in fact provided in Tab. S1 and Figs. 4B, 6A.
We observed little to no performance loss based on RMSE and will certainly include the other metrics as suggested.

4.2. AUC: The histograms (and CDFs) provided in Figs. 5, 6, and S5 (as in [21], [Nalisnick, E. et al. '18], and others) are richer performance statistics and directly reduce to the requested AUC-ROC scores. To address these concerns, we have added all AUC-ROC values to our performance charts.

4.3. Adversarial: We updated the implementation details of the attack method (FGSM). While we can evaluate additional attacks, our paper does not propose a new defense (nor do any of our baselines), so this would be out of scope. The goal of Sec. 4.3.1 is solely to evaluate on additional OOD samples generated by a basic adversarial method.

R2: 1. Fig. 3 aleatoric: Within the training region there are very few differences, which can be attributed to intrinsic randomness and initialization during training. OOD, there is much more variability, aligning with MVE [18, 28]. Since there is no training data in this region, we do not expect consistent results for aleatoric uncertainty, unlike epistemic uncertainty, as is pointed out.

2. Relation to Kendall et al.: This is an excellent and very important point; we apologize for the confusion. To clarify, estimating aleatoric uncertainty using NNs without sampling has a well-accepted solution dating back to 1994 with MVE (see [28]). This is the same approach used in Kendall et al. [18] and is what we compare against in our work (Fig. 3, and elsewhere when evaluating aleatoric uncertainty). Further, [18] proposes jointly learning MVE with a sampling-based epistemic uncertainty estimator (in their paper, dropout [5]). Thus, in [18], aleatoric uncertainty can be estimated sampling-free, but epistemic uncertainty requires sampling. We believe all our provided benchmarks do indeed accurately compare against [18].
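To make the MVE baseline concrete: a network predicts (μ, σ²) per input and is trained by minimizing the per-sample Gaussian negative log-likelihood, so the predicted σ² is a sampling-free aleatoric estimate. A minimal sketch, with an illustrative function name not taken from [18] or [28]:

```python
import math

def gaussian_nll(y, mu, var):
    """Negative log-likelihood of target y under N(mu, var).
    Minimized w.r.t. a network's predicted (mu, var), as in MVE [28]
    and the aleatoric head of [18]; var is the aleatoric estimate."""
    return 0.5 * math.log(2.0 * math.pi * var) + (y - mu) ** 2 / (2.0 * var)

# The loss balances data fit against the claimed noise level: for a
# fixed error, understating the variance inflates the squared term.
overconfident = gaussian_nll(y=1.0, mu=0.0, var=0.1)
calibrated = gaussian_nll(y=1.0, mu=0.0, var=1.0)
assert overconfident > calibrated
```

Because the loss is a proper likelihood, the network cannot simply report a tiny variance everywhere; it must trade the log-variance penalty against the error-scaled term.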
The majority of our results focus on epistemic comparisons, since our method uses a Gaussian lower-order distribution which achieves aleatoric estimation similar to [18] via MLE. In contrast, we jointly learn a sampling-free epistemic estimate, which is not the case in [18], representing a key contribution of our work. Sampling approaches, including [18], are the current SoA, and we agree with R2 that the benchmarking we provide against these methods is absolutely critical.

3. Intuition of parameters: Thank you; additional exposition has been added.

4. Baselines: Please refer to #2 above, which clarifies the incorrect point about missing baselines.

R3: 1. Gaussian assumption: Thank you; explanations will be added.

2. Additional comparisons and prior work: Discussion of these works will be added. Note that [20] proposes a way to calibrate a given uncertainty method, as opposed to a new uncertainty estimator, and can be used to strengthen any uncertainty estimator; it is not a competing method.

R4: 1. Other distributions: Excellent point, to be included in the manuscript.

2. Performance on OOD: Results for a variant of the proposed experiment can be found in Fig. 5. Further metrics have also been added via R1 #4.2.

Summary: Thank you for running our software. R3: "This is so far the only code I was able to run among the ones I have to review. Authors really went to a great length to provide runnable code, and this is commendable." We believe this work supports new research through its broad applicability and accessible, easy-to-use code. We hope the rebuttal and new experiments address all concerns (esp. R1), and we appreciate all comments, which have improved the manuscript.