Assessing Reliability and Challenges of Uncertainty Estimations for Medical Image Segmentation

Alain Jungo¹,² (✉) [0000-0001-8327-4653] and Mauricio Reyes¹,²

¹ Insel Data Science Center, Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland
² ARTORG Center, University of Bern, Bern, Switzerland
alain.jungo@artorg.unibe.ch

Abstract. Despite the recent improvements in overall accuracy, deep learning systems still exhibit low levels of robustness. Detecting possible failures is critical for a successful clinical integration of these systems, where each data point corresponds to an individual patient. Uncertainty measures are a promising direction to improve failure detection since they provide a measure of a system's confidence. Although many uncertainty estimation methods have been proposed for deep learning, little is known on their benefits and current challenges for medical image segmentation. Therefore, we report results of evaluating common voxel-wise uncertainty measures with respect to their reliability, and limitations on two medical image segmentation datasets. Results show that current uncertainty methods perform similarly and although they are well-calibrated at the dataset level, they tend to be miscalibrated at subject-level. Therefore, the reliability of uncertainty estimates is compromised, highlighting the importance of developing subject-wise uncertainty estimations. Additionally, among the benchmarked methods, we found auxiliary networks to be a valid alternative to common uncertainty methods since they can be applied to any previously trained segmentation model.

Keywords: Uncertainty · Segmentation · Deep Learning

1 Introduction

Deep learning-based methods have led to impressive improvements in medical image segmentation over the past years.
For many tasks, the performance is comparable to human-level performance, or even surpasses it [11]. Nonetheless, despite improvements in accuracy, the robustness of these systems needs to improve significantly for a successful clinical integration of these technologies, where each data point corresponds to an individual patient. This highlights the importance of having mechanisms to effectively monitor computer results in order to detect and react to system failures at the patient level. Among others, uncertainty measures are a promising direction since uncertainties can provide information as to how confident the system was in performing a given task on a given patient. This information in turn can be used to support the decision-making process of a user, as well as to enable time-effective corrections of computer results by, for instance, focusing on areas of high uncertainty.

Different approaches have been proposed to quantify uncertainties in deep learning models. Among the most popular approaches are: a) Bayesian uncertainty estimation via test-time dropout [5], b) aleatoric uncertainty estimation via a second network output [9], and c) uncertainty estimation via ensembling of networks [10]. In medical image segmentation, uncertainty measures are of interest at three levels. The first, most fine-grained level, is the voxel¹-wise uncertainty, which provides a measure of uncertainty for the predicted class of each voxel. This level of uncertainty is especially useful for the interaction with humans, be it by providing additional information to foster comprehensibility or as guidance for correction tasks. The second level is the uncertainty at the level of a segmented instance (or object). Nair et al. [12] and Graham et al. [6] used instance-level uncertainty to reduce the false discovery rate of brain lesions and cells, respectively. In both approaches voxel-wise uncertainties were aggregated to obtain an instance-wise uncertainty. Similarly, Eaton-Rosen et al. [4] aggregated voxel-wise uncertainties of brain tumor segmentations to obtain confidence intervals for tumor volumes. The third level is the subject-level uncertainty, which informs us whether the segmentation task was successful (e.g., above a certain metric). Having information about success or failure would be sufficient for many tasks, e.g., high-throughput analysis or selection of cases for expert review. As proposed by Jungo et al. [8], task-specific aggregation of the voxel-wise uncertainties could be used to obtain subject-level uncertainties. In contrast, DeVries et al. [3] and Robinson et al. [13] proposed an auxiliary neural network that predicts segmentation performance at the subject level. A current challenge in using this latter type of approach is that considerably large training datasets are necessary in practice to ensure their reliability [3].

In order to better understand the benefits and current challenges in uncertainty estimation for medical image segmentation, we evaluated common uncertainty measures with respect to their reliability, their benefit, and limitations. Additionally, we analyzed the requirements for uncertainties in medical image segmentation and we make practical recommendations for their evaluation.

2 Material & Methods²

2.1 Data

We selected two publicly available and distinct datasets for the experiments. The first dataset is the brain tumor segmentation (BraTS) challenge dataset 2018 [1] consisting of 285 subjects.
Each subject features four magnetic resonance images (T1-weighted, T1-weighted post-contrast, T2-weighted, FLAIR) of a size of 240 × 240 × 155 isotropic (1 mm³) voxels. We split the dataset into 100 training, 25 validation, and 160 testing subjects, combined the three tumor sub-compartment labels to segment the whole tumor, and performed a z-score intensity normalization ($\mu = 0$, $\sigma = 1$) on each subject and image individually. The second dataset is the International Skin Imaging Collaboration (ISIC) lesion segmentation dataset 2017 [2] consisting of 2000 training, 150 validation, and 600 testing images. We resized the color images to a size of 256 × 192 pixels and normalized the intensities to the range [0, 1].

¹ For simplicity, we use voxel even if it could be a two-dimensional image.
² Code available at https://github.com/alainjungo/reliability-challenges-uncertainty

2.2 Experimental setting

Our aim is to evaluate the reliability of uncertainty measures for deep learning-based segmentation of medical images. Rather than building a specific fine-tuned, top-performing segmentation model, we used a U-Net-like architecture [14] due to its popularity and simplicity, and to minimize architectural influences on the outcomes³. The architecture consists of four pooling/upsampling steps and has dropout regularization (p = 0.05) and batch normalization after each convolution. We used a common training scheme consisting of a cross-entropy loss with Adam optimizer (learning rate: $10^{-4}$), and applied early stopping with respect to the validation set Dice coefficient. Any adaptation to this architecture and training scheme was performed to fit the needs of each studied uncertainty approach.

³ We also conducted experiments with a DenseNet-like architecture with no notable differences in the outcome and therefore omit it here for space and clarity reasons.

2.3 Uncertainty methods

We evaluated the following five different uncertainty measures:

Baseline uncertainty: Softmax entropy. Although the status of the softmax output of a model as a probability measure is debatable [5], we considered it as a reference for comparison since it is implicitly generated by segmentation networks. We named this strategy baseline. We used the normalized entropy $H = -\sum_{c \in \mathcal{C}} p_c \log(p_c) / \log(|\mathcal{C}|) \in [0, 1]$ as a measure of uncertainty, where $p_c$ is the softmax output for class $c$ and $\mathcal{C}$ is the set of classes ($\mathcal{C} = \{0, 1\}$ in our case).

MC dropout. Test-time dropout can be viewed as an approximation of a Bayesian neural network [5]. $T$ stochastic network samples can be interpreted as Monte Carlo samples of the posterior distribution of the network's weights and result in a class probability of $p_c = \frac{1}{T} \sum_{t=1}^{T} p_{t,c}$. We employed the normalized entropy of these probabilities as a measure of uncertainty. For the experiments, we used $T = 20$ and considered two different dropout layer positioning strategies. First, we applied MC dropout on the base model (see Sec. 2.2), which uses minimal dropout (p = 0.05) after each convolution. Second, we applied more prominent dropout (p = 0.5) at the center positions (i.e., before pooling and after upsampling, similar to [12]). Accordingly, we name these two strategies baseline+MC and center+MC.

Aleatoric uncertainty. In contrast to the model uncertainty (captured by e.g. MC dropout), the aleatoric uncertainty is said to capture noise inherent in the observation [9]. It is obtained by defining a network $f$ with two outputs $[\hat{x}, \sigma^2] = f(x)$ and input $x$, where the outputs $\hat{x}$ and $\sigma^2$ are the mean and variance of the logits perturbed with Gaussian noise.
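The entropy-based measures used by the baseline and MC dropout strategies above can be illustrated with a minimal NumPy sketch. It is not taken from the authors' code: the function names, the array layout (classes along the first axis), the small `eps` constant, and the toy numbers are all assumptions made for illustration.

```python
import numpy as np

def normalized_entropy(probs, eps=1e-12):
    """Normalized entropy H in [0, 1] of per-voxel class probabilities.

    probs: array of shape (C, ...) whose entries sum to one along axis 0.
    eps is an assumed numerical guard against log(0).
    """
    probs = np.asarray(probs, dtype=float)
    n_classes = probs.shape[0]
    entropy = -np.sum(probs * np.log(probs + eps), axis=0)
    # Divide by log(|C|) so that a uniform distribution yields H = 1.
    return np.clip(entropy / np.log(n_classes), 0.0, 1.0)

def mean_probability(samples):
    """Average class probabilities over T stochastic forward passes.

    samples: array of shape (T, C, ...).
    """
    return np.mean(np.asarray(samples, dtype=float), axis=0)

# Toy example (made-up numbers): T = 2 passes, C = 2 classes, 3 voxels.
samples = np.array([
    [[0.9, 0.5, 0.0], [0.1, 0.5, 1.0]],
    [[0.7, 0.5, 0.0], [0.3, 0.5, 1.0]],
])
uncertainty = normalized_entropy(mean_probability(samples))
```

In this toy case the second voxel (probabilities 0.5/0.5) receives the maximal uncertainty and the third voxel (probabilities 0/1) receives none; the baseline strategy corresponds to calling `normalized_entropy` on a single softmax output.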
The aleatoric loss optimizes both outputs simultaneously by MC sampling (ten samples in our case) of the perturbed logits. We used $\hat{x}$ for the class predictions and the variance $\sigma^2$ as a measure of uncertainty. We normalized the variance to [0, 1] over all predictions.

Ensembles. Another way of quantifying uncertainties is by ensembling multiple models [10]. We combined the class probabilities of each network $k$ by the average $p_c = \frac{1}{K} \sum_{k=1}^{K} p_{k,c}$ over all $K = 10$ networks and used the normalized entropy as uncertainty measure. The individual networks share the same architecture (see Sec. 2.2) but were trained on different subsets (90%) of the training dataset and with different random initializations to enforce variability.

Auxiliary network. Inspired by [3,13], where an auxiliary network is used to predict segmentation performance at the subject level, we apply an auxiliary network to predict voxel-wise uncertainties of the segmentation model by learning from the segmentation errors (i.e., false positives and false negatives). For the experiments, we considered two opposing types of auxiliary networks. The first one, named auxiliary feat., consists of three consecutive 1 × 1 convolution layers cascaded after the last feature maps of the segmentation network. The second auxiliary network, named auxiliary segm., is a completely independent network (same U-Net as described in Sec. 2.2) that uses as input the original images and the segmentation masks produced by the segmentation model (generated by five-fold cross-validation). We normalized the output uncertainty subject-wise to [0, 1] for comparability purposes.

2.4 Assessing quality of uncertainties

We adopted three metrics to evaluate the quality of uncertainties.
Additionally, we computed the Dice coefficient to also verify segmentation performance, as uncertainty methods typically link both tasks.

Calibration. Model calibration is important when not only the predicted class but also its corresponding confidence is of interest. In this regard, calibration has been used as a surrogate to assess the reliability of uncertainties [9]. A model is said to be perfectly calibrated if its predictions $f(x)$ with confidence $p$ occur a fraction $p$ of the time ($P(y = 1 \mid f(x) = p) = p$ for the binary case). This means, for example, that of 100 predictions with a confidence of 0.7, 70 are expected to be correct [7]. We assessed calibration of uncertainties by reliability diagrams and expected calibration error (ECE) [7]. Reliability diagrams show the deviation from perfect calibration by plotting the binned predicted confidences against the accuracy obtained for each bin (fraction of positives). The ECE is defined as the absolute error of these bins (i.e., the gap between confidence and accuracy) weighted by the number of samples in the bins, where a lower ECE (close to zero) indicates a better calibration. In our experiments, we used ten bins and used the model output probabilities as confidence. For methods not providing segmentation probabilities but direct segmentation uncertainty estimates (i.e., auxiliary and aleatoric), we translated the uncertainties to confidences by $y(1 - 0.5q) + (1 - y)(0.5q)$, where $y \in \{0, 1\}$ is the segmentation label and $q \in [0, 1]$ is the normalized uncertainty.

Uncertainty-error overlap. In a practical setting, perfect calibration of a model is impossible [7].
Often, segmentation tasks do not require perfect calibration but it would be sufficient for a model to be uncertain where it makes mistakes and certain where it is correct. To assess this condition, we used the overlap (determined by the Dice coefficient) between the segmentation error and the thresholded uncertainty, termed uncertainty-error overlap (U-E). This metric is not influenced by the true negatives from background areas, which are typically enormous in medical image segmentation. It is therefore an alternative to the ECE, which includes foreground as well as background areas.

Corrections. Motivated by previous works using uncertainty estimations, we assessed the quality of uncertainties by evaluating their benefit to correct segmentations. We define TPU, TNU, FPU, FNU as the uncertain voxels among the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). A beneficial correction is said to improve the Dice coefficient. Hence, to benefit from removal of false positives, the relation $\mathrm{FPU} \cdot \mathrm{TP} > \mathrm{TPU} \cdot (\mathrm{TP} + \mathrm{FP} + \mathrm{FN})$ needs to be satisfied (for the accuracy, $\mathrm{FPU} > \mathrm{TPU}$ is sufficient). Similarly, in order to benefit from adding voxels (i.e., correcting false negatives), the relation $\mathrm{FNU} \cdot (\mathrm{TP} + \mathrm{FP} + \mathrm{FN}) > \mathrm{TNU} \cdot \mathrm{TP}$ needs to be satisfied. However, the latter relation is not practically applicable due to large backgrounds and thus typically large TNU. Since voxel-wise corrections (as opposed to instance-wise corrections) might be more harmful than beneficial, we calculated the proportion of subjects that fulfill the benefit condition for false positive removal, BnF, as a means of comparison to other methods.

3 Results

Fig. 1 compares the calibration at the dataset level (i.e., all voxels in the dataset) with the calibration at the subject level (i.e., voxels of one subject).
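The difference between dataset-level and subject-level calibration can be made concrete with a small sketch of the ECE from Sec. 2.4. This is an illustration under stated assumptions, not the authors' implementation: confidence is taken to be the predicted foreground probability, accuracy the fraction of positive voxels per bin, bins are equal-width, and the two "subjects" are made-up toy data. The helper `uncertainty_to_confidence` mirrors the translation $y(1 - 0.5q) + (1 - y)(0.5q)$; its name is hypothetical.

```python
import numpy as np

def expected_calibration_error(confidence, labels, n_bins=10):
    """ECE: gap between mean confidence and fraction of positives per bin,
    weighted by the fraction of samples falling into the bin."""
    confidence = np.asarray(confidence, dtype=float).ravel()
    labels = np.asarray(labels, dtype=float).ravel()
    # Equal-width bins on [0, 1]; a confidence of exactly 1.0 goes to the last bin.
    bin_ids = np.minimum((confidence * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(labels[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

def uncertainty_to_confidence(segmentation, uncertainty):
    """Translate normalized uncertainty q to confidence: y(1 - 0.5q) + (1 - y)(0.5q)."""
    y = np.asarray(segmentation, dtype=float)
    q = np.asarray(uncertainty, dtype=float)
    return y * (1.0 - 0.5 * q) + (1.0 - y) * (0.5 * q)

# Two made-up subjects, 8 voxels each, all predicted with confidence 0.75.
conf_a = np.full(8, 0.75)
true_a = np.ones(8)                              # all positive: underconfident
conf_b = np.full(8, 0.75)
true_b = np.array([1, 1, 1, 1, 0, 0, 0, 0])      # half positive: overconfident

ece_a = expected_calibration_error(conf_a, true_a)
ece_b = expected_calibration_error(conf_b, true_b)
ece_dataset = expected_calibration_error(
    np.concatenate([conf_a, conf_b]), np.concatenate([true_a, true_b])
)
```

Each toy subject has an ECE of 0.25, yet pooling their voxels gives an ECE of 0 because the under- and overconfidence cancel, which is the effect a dataset-level evaluation can hide.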
The figure shows the miscalibration that can occur at the subject level (S1 and S2) while the calibration at the dataset level is good. We found approximately 28%/46% underconfident and 32%/18% overconfident calibrations for the subjects of the BraTS/ISIC dataset. This underlines the special caution needed when using calibration-based metrics (e.g., ECE) at the dataset level, as they can lead to a misperception of the actual calibration quality of a model, and hence, of the reliability of its uncertainty estimations. Also noticeable is the agreement among the uncertainty methods at the subject level, suggesting only little benefit in selecting one uncertainty method over another.

Fig. 1. Calibration at the dataset level (D) compared to the (mis)calibration at the subject level (S1, S2, S3) for the different uncertainty methods. S1, S2, S3 correspond to exemplary subjects for which the models are underconfident (S1), overconfident (S2), and well-calibrated (S3). Rows correspond to results on the BraTS and ISIC datasets.

In Table 1, we report for the BraTS and ISIC datasets the following metrics: average subject-level ECE, uncertainty-error overlap (U-E), proportion of correction-benefiting test subjects (BnF), and Dice coefficient. For a fair comparison, we selected the best-performing threshold for each method whenever the metric required an uncertainty threshold (i.e., U-E and BnF). Overall, for both datasets, no uncertainty method stands out over the others. In particular, the aleatoric method and the methods with large dropout (center and center+MC) yield the worst performance.
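The two threshold-dependent metrics, U-E and the benefit condition behind BnF, can be sketched for a single subject as follows. This is a minimal illustration rather than the authors' code: the function names, the `>=` thresholding convention, the handling of empty masks, and the toy arrays are assumptions.

```python
import numpy as np

def dice(a, b):
    """Dice coefficient of two boolean masks (assumed to be 1.0 if both are empty)."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom > 0 else 1.0

def uncertainty_error_overlap(prediction, reference, uncertainty, threshold):
    """U-E: Dice overlap between segmentation error and thresholded uncertainty."""
    error = np.asarray(prediction, dtype=bool) != np.asarray(reference, dtype=bool)
    return dice(error, np.asarray(uncertainty) >= threshold)

def benefits_from_fp_removal(prediction, reference, uncertainty, threshold):
    """Check the condition FPU * TP > TPU * (TP + FP + FN) for one subject."""
    pred = np.asarray(prediction, dtype=bool)
    ref = np.asarray(reference, dtype=bool)
    uncertain = np.asarray(uncertainty) >= threshold
    tp = np.sum(pred & ref)
    fp = np.sum(pred & ~ref)
    fn = np.sum(~pred & ref)
    tpu = np.sum(pred & ref & uncertain)   # uncertain true positives
    fpu = np.sum(pred & ~ref & uncertain)  # uncertain false positives
    return bool(fpu * tp > tpu * (tp + fp + fn))

# Toy subject with 8 voxels (made-up values): TP = 3, FN = 1, FP = 2, TN = 2.
reference = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=bool)
prediction = np.array([1, 1, 1, 0, 1, 1, 0, 0], dtype=bool)
uncertainty = np.array([0.1, 0.1, 0.2, 0.2, 0.9, 0.8, 0.1, 0.1])

ue = uncertainty_error_overlap(prediction, reference, uncertainty, threshold=0.5)
benefit = benefits_from_fp_removal(prediction, reference, uncertainty, threshold=0.5)

# Removing the uncertain foreground voxels should then raise the Dice coefficient.
corrected = prediction & ~(uncertainty >= 0.5)
dice_before = dice(prediction, reference)
dice_after = dice(corrected, reference)
```

BnF would then be the fraction of test subjects for which `benefits_from_fp_removal` returns `True`, evaluated at the threshold chosen per method.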
The aleatoric method fails to produce uncertainty at the locations of segmentation errors (i.e., low U-E) and is therefore unable to improve segmentation results through corrections, whereas the large dropout mainly negatively affects segmentation performance and ECE. The results further show that MC dropout (baseline+MC and center+MC) typically improves ECE, U-E, and Dice coefficient over the non-MC versions (baseline and center), but larger amounts of dropout (baseline < center and baseline+MC < center+MC) result in worse performance, which suggests using MC dropout in the regimes where the benefit with respect to the uncertainty is minimal compared to standard softmax. We could confirm this finding through intermediate dropout strategies (not shown). We also observe good performance of the auxiliary networks, which are typically well-calibrated and profit from the good segmentation performance of their segmentation network (i.e., the baseline model).

With regard to the metrics, we note that low ECE values stem from the large amount of low-confidence background areas, which positively affects the ECE. This also explains the lower ECE values for the BraTS dataset, which contains more background (even with applied brain mask) than the ISIC dataset, due to the additional image dimension. Additionally, the BnF only considers TPU and FPU uncertainties and is therefore favorable for methods with low precision (more FP typically yields more FPU). We found this to be the reason for the bad correction performance of the ensemble on the BraTS dataset, even though the uncertainty-error overlap was good.

Table 1. Performances of the different uncertainties with respect to expected calibration error (ECE), uncertainty-error overlap (U-E), proportion of correction-benefiting test subjects (BnF), and Dice coefficient. Values are presented as mean (rank). Standard deviation is omitted due to marginal differences. Upward and downward arrows indicate desired higher and lower metric values, respectively. Horizontal separation groups types of uncertainty methods.

                | BraTS                                         | ISIC
Method          | ECE % ↓   | U-E ↑     | BnF ↑     | Dice ↑    | ECE % ↓   | U-E ↑     | BnF ↑     | Dice ↑
----------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+----------
baseline        | 0.925 (4) | 0.432 (2) | 0.39 (3)  | 0.874 (2) | 7.256 (4) | 0.424 (4) | 0.26 (4)  | 0.814 (3)
center          | 1.758 (7) | 0.409 (5) | 0.5 (1)   | 0.866 (5) | 9.415 (8) | 0.411 (6) | 0.27 (3)  | 0.78 (6)
----------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+----------
baseline+MC     | 0.9 (1)   | 0.433 (1) | 0.36 (4)  | 0.874 (2) | 7.36 (5)  | 0.428 (3) | 0.24 (5)  | 0.813 (4)
center+MC       | 1.233 (6) | 0.433 (1) | 0.27 (6)  | 0.868 (4) | 8.766 (7) | 0.428 (3) | 0.17 (6)  | 0.794 (5)
----------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+----------
ensemble        | 0.919 (2) | 0.433 (1) | 0.32 (5)  | 0.879 (1) | 7.131 (1) | 0.431 (2) | 0.31 (2)  | 0.831 (1)
----------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+----------
auxiliary feat. | 0.923 (3) | 0.427 (3) | 0.48 (2)  | 0.874 (2) | 7.216 (3) | 0.421 (5) | 0.33 (1)  | 0.814 (3)
auxiliary segm. | 0.925 (4) | 0.412 (4) | 0.48 (2)  | 0.874 (2) | 7.212 (2) | 0.433 (1) | 0.27 (3)  | 0.814 (3)
----------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+----------
aleatoric       | 1.134 (5) | 0.054 (6) | 0.06 (7)  | 0.872 (3) | 7.837 (6) | 0.058 (7) | 0.12 (7)  | 0.82 (2)

4 Discussion

The results show that although current voxel-wise uncertainty measures are rather well-calibrated at the dataset level (i.e., all voxels in the dataset), they tend to fail at the subject level (Fig. 1). This observation is to be expected since subject-level calibration errors (under- or overconfidence) can average out at the dataset level. Based on the proposed calibration-based metric, no overall best uncertainty measure was found among the studied methods. From our experiments we can conclude that methods that aggregate voxel-wise uncertainty to provide subject-level estimations are not reliable enough to be used as a mechanism to detect failed segmentations.
We thus conclude on the importance of developing subject-level uncertainty estimation in medical image segmentation that can cope with the issue of High-Dimension-Low-Sample-Size (HDLSS) to ensure their reliability in practice.

Unsurprisingly, the ensemble method yields rank-wise the most reliable results (Tab. 1) and would typically be a good choice (if the resources allow it). The results also revealed that methods based on MC dropout are heavily dependent on the influence of dropout on the segmentation performance. In contrast, auxiliary networks turned out to be a promising alternative to existing uncertainty measures. They perform comparably to other methods but have the benefit of being applicable to any high-performing segmentation network not optimized to predict reliable uncertainty estimates. No significant differences were found between using auxiliary feat. and auxiliary segm. Through a sensitivity analysis performed over all studied uncertainty methods (not shown), we could confirm our observations that different uncertainty estimation methods yield different levels of precision and recall. Furthermore, we observed that when using current uncertainty methods for correcting segmentations, a maximum benefit can be attained when preferring a combination of low-precision segmentation models and uncertainty-based false positive removal.

Our evaluation has several limitations worth mentioning. First, although the experiments were performed on two typical and distinctive datasets, they feature large structures to segment. The findings reported herein may differ for other datasets, especially if these consist of very small structures to be segmented. Second, the assessment of the uncertainty is influenced by the segmentation performance.
Even though we succeeded in building similarly performing models, their differences cannot be fully decoupled and neglected when analyzing the uncertainty.

Overall, we aim with these results to point to the existing challenges for a reliable utilization of voxel-wise uncertainties in medical image segmentation, and to foster the development of subject/patient-level uncertainty estimation approaches under the condition of HDLSS. We recommend that the utilization of uncertainty methods ideally be coupled with an assessment of model calibration at the subject/patient level. The proposed conditions, along with the threshold-free ECE metric, can be adopted to test whether uncertainty estimations can be of benefit for a given task.

Acknowledgments. This work was supported by the Swiss National Foundation by grant number 169607. The authors thank Fabian Balsiger for the valuable discussions.

References

1. Bakas, S., et al.: Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the BRATS challenge. arXiv preprint arXiv:1811.02629 (2018)
2. Codella, N.C., et al.: Skin lesion analysis toward melanoma detection: A challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), hosted by the International Skin Imaging Collaboration (ISIC). In: ISBI. pp. 168–172. IEEE (2018)
3. DeVries, T., Taylor, G.W.: Leveraging uncertainty estimates for predicting segmentation quality. arXiv preprint arXiv:1807.00502 (2018)
4. Eaton-Rosen, Z., et al.: Towards safe deep learning: accurately quantifying biomarker uncertainty in neural network predictions. In: MICCAI. pp. 691–699. Springer (2018)
5. Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In: ICML. pp. 1050–1059 (2016)
6. Graham, S., et al.: MILD-Net: Minimal information loss dilated network for gland instance segmentation in colon histology images. Medical Image Analysis 52, 199–211 (2019)
7. Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q.: On calibration of modern neural networks. In: ICML. pp. 1321–1330. JMLR.org (2017)
8. Jungo, A., et al.: Uncertainty-driven sanity check: Application to postoperative brain tumor cavity segmentation. MIDL (2018)
9. Kendall, A., Gal, Y.: What uncertainties do we need in Bayesian deep learning for computer vision? In: NIPS. pp. 5574–5584 (2017)
10. Lakshminarayanan, B., Pritzel, A., Blundell, C.: Simple and scalable predictive uncertainty estimation using deep ensembles. In: NIPS. pp. 6402–6413 (2017)
11. Litjens, G., et al.: A survey on deep learning in medical image analysis. Medical Image Analysis 42, 60–88 (2017)
12. Nair, T., et al.: Exploring uncertainty measures in deep networks for multiple sclerosis lesion detection and segmentation. In: MICCAI. pp. 655–663. Springer (2018)
13. Robinson, R., et al.: Real-time prediction of segmentation quality. In: MICCAI. pp. 578–585. Springer (2018)
14. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: MICCAI. pp. 234–241. Springer (2015)
