Concept-Centric Visual Turing Tests for Method Validation


Authors: Tatiana Fountoukidou, Raphael Sznitman

ARTORG Center, University of Bern, Bern, Switzerland
{tatiana.fountoukidou, raphael.sznitman}@artorg.unibe.ch

Abstract. Recent advances in machine learning for medical imaging have led to impressive increases in model complexity and overall capabilities. However, the ability to discern the precise information a machine learning method is using to make decisions has lagged behind, and it is often unclear how these performances are in fact achieved. Conventional evaluation metrics that reduce method performance to a single number or a curve only provide limited insights. Yet, systems used in clinical practice demand thorough validation that such crude characterizations miss. To this end, we present a framework to evaluate classification methods based on a number of interpretable concepts that are crucial for a clinical task. Our approach is inspired by the Turing Test and devises a test that adaptively questions a method on its ability to interpret medical images. To do this, we make use of a Twenty Questions paradigm, whereby we use a probabilistic model to characterize the method's capacity to grasp task-specific concepts, and we introduce a strategy to sequentially query the method according to its previous answers. The results show that the probabilistic model is able to expose both the dataset's and the method's biases, and can be used to reduce the number of queries needed for confident performance evaluation.

1 Introduction

The field of medical image computing (MIC) has radically changed with the emergence of large neural networks, or Deep Learning (DL).
For MIC tasks that were long considered extremely challenging, such as image-based pathology classification and segmentation, DL methods have now reached human-level performances on a variety of benchmarks.

Yet, as these methods have become increasingly powerful, the overall methodology to validate them has largely remained intact. For instance, challenge competitions compare different methods on a common dataset by using metrics most often borrowed from the computer vision literature. As recently noted in [9], challenge competition rankings and outcomes are very often highly skewed to the dataset or metrics used, and rarely relate to the clinical task. To tackle this, recent developments in visual question answering (VQA) methods [1,5,8,14], which answer questions related to image content, show the ability to infer concepts beyond traditional classification. Here again, however, the metrics used to evaluate VQA methods remain inadequate.

Fig. 1: Visual Turing Test (VTT) fundus image screening. Green arrows correspond to selected questions, and orange lines to the answers given by the MuE.

Instead, we consider an alternative approach to validating MIC methods, one inspired by Alan Turing's Turing Test [13], where a human unknowingly communicates either with another human or an Artificial Intelligence system that produces answers. The aim of the test is to distinguish between the two based on a set of asked questions. Turing tests have been used in medical imaging to evaluate the quality of adversarial attacks, by seeing if an expert can distinguish between a real and an adversarial example [3,12]. Another approach, focusing on the interpretability of methods that infer semantic information from images (e.g., classification, segmentation, etc.), is seen in the work of Geman et al. [4] on automated Visual Turing Tests (VTT).
In their work, an algorithm adaptively selects images and questions to pose to a method under evaluation (MuE) such that the answers cannot be predicted from the history of answers. While this approach has increased explanatory power, it is limited to manually fabricated story lines to guide questioning. This makes it ill-suited for medical applications, where such story lines are hard to formalize.

For this reason, we propose a novel VTT framework to evaluate MIC methods (see Fig. 1). In particular, our approach focuses on evaluating MIC classification methods, and we present a framework that tests whether the method has correctly understood the relevant medical concepts when inferring test data. We do this by formulating our problem as a Twenty Questions game [2,7] in which we model the likelihood of a given MuE to provide correct answers for different concepts present in test images. Our framework then sequentially picks test images and concepts such that the uncertainty of this model is reduced as quickly as possible. We demonstrate our framework in the context of three different multi-label classification problems where each concept is encoded by a given binary label.

2 Method

Our proposed VTT framework evaluates how a MuE, a MIC classification method in this case, performs with respect to core concepts relevant to the task for which it was trained. To do this, we make use of a validation image dataset that the MuE has never had access to, D = {s_i}, i = 1..N_S, where s_i is an available test sample, such as an image or an arbitrary region within an image.

Fig. 2: Left: Overview of the proposed scheme. Right: Examples of GPs for 3 different concepts (points are the observations, the dotted line is the mean of a GP, and the shaded area corresponds to the 95% confidence region).
As such, N_S can be excessively large, as potentially millions of regions can be extracted from a single test image. For each s_i, we denote the potential concepts that could be present in the sample as C = {c_j}, j = 1..N_C. From this, we define a "question", q = (s_q, c_q) ∈ D × C = Q, of the form "Is concept c_q present in sample s_q?". We let q_gt ∈ {0, 1} be the true answer to question q. In this work, we consider MuEs that perform multi-label classification tasks, f : Q → [0, 1], taking as input the question q and producing the probability of the answer being "Yes" (see Fig. 1).

Given that evaluating all elements of Q may be computationally intractable, our VTT framework instead only evaluates a subset of Q. We do this iteratively and adaptively, where we use the history of previously asked questions and their answers to build a performance model (Sec. 2.1). We then use a questioning strategy to determine which element of Q should be asked to the MuE. In particular, we propose a novel strategy that selects the element that maximally reduces the uncertainty in the performance model (Sec. 2.2). The process terminates after a fixed number of questions has been asked, or when the uncertainty in the model has been reduced to an acceptable level. Fig. 2 (Left) illustrates our framework, and we detail our performance model and questioning strategy next.

2.1 Performance Model

From a set of questions and the corresponding MuE responses, we aim to model the MuE performance with respect to the concepts C. While the relations between concepts could in practice be complex, we model them as independent here. By its definition, the MuE provides answers to binary questions, and for any concept there are 4 possible outcomes to a question: a True Negative (TN), a False Positive (FP), a False Negative (FN) or a True Positive (TP).
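To make the notation concrete, the question space Q = D × C and the four outcome types can be sketched as follows. MockMuE, the sample and concept names, and the hard 0.5 decision threshold are illustrative assumptions, not part of the paper's method:

```python
import random

def make_questions(samples, concepts):
    """Question space Q = D x C: q = (s_q, c_q) asks
    'Is concept c_q present in sample s_q?'."""
    return [(s, c) for s in samples for c in concepts]

class MockMuE:
    """Hypothetical stand-in for a multi-label classifier f: Q -> [0, 1]."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def __call__(self, q):
        # Probability that the answer to q is "Yes" (random here).
        return self.rng.random()

def outcome(f_q, q_gt, threshold=0.5):
    """Classify one answer as TN / FP / FN / TP.
    Thresholding at 0.5 is an illustrative simplification."""
    pred = f_q >= threshold
    if q_gt == 0:
        return "FP" if pred else "TN"
    return "TP" if pred else "FN"

D = ["fundus_001", "fundus_002"]     # hypothetical sample ids
C = ["hemorrhage", "hard_exudate"]   # hypothetical concepts
Q = make_questions(D, C)
f = MockMuE()
print(outcome(f(Q[0]), q_gt=1))      # "TP" or "FN", depending on f
```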
Our goal then is to model the relation between the frequency of these outcomes and the inputs to the MuE. To do this, we define a discrete random variable Y_c for every concept c ∈ C, which encodes the counts of outcomes given by f. (The approach would be unchanged if the outcome frequency was used.) We achieve this by means of a Gaussian Process (GP) [11] of the form

    f_c^GP(a_c) ~ GP(μ_c(a_c), k_c(a_c, a_c')),    (1)

where μ_c(·) and k_c(·,·) are the mean and kernel functions of the GP, respectively, and

    a_c = (f(q) + q_gt) / 2    (2)

describes the answer of f with respect to the question q. The mean function μ_c is initialized to 0, and the covariance function (kernel) k_c is a squared exponential,

    k_c(a_{c,m}, a_{c,n}) = σ_f · exp(−(a_{c,m} − a_{c,n})² / (2l²)) + σ_n δ_mn,

with characteristic length-scale l = 0.1, initial signal variance σ_f = 1 and noise variance σ_n = 0.025. To then infer the value of Y_c for any a_c, we store the pairs {(a_c^(i), y_c^(i))}, where y_c^(i) is the number of times f has given a = a_c^(i), and use these as observations to infer the complete model f_c^GP using standard inference [11]. In practice, we discretized the range of a in bins of Δa = 0.01.

Algorithm 1: Concept-Centric VTT
Require: dataset D, concepts C, stopping criterion τ, and MuE f
  Q ← D × C
  Initialize f_c^GP for all c ∈ C
  while #questions asked < τ do
    Q_candidates ← {q ∈ Q : c_q ∈ argmax_{c ∈ C} max(u_c^-, u_c^+) and q_gt = o_{c_q}}
    if selecting based only on uncertainty then
      q* ← random q ∈ Q_candidates
    else if selecting based on uncertainty & unpredictability then
      q* ← random q ∈ {q ∈ Q_candidates : |p(f(q) = "Yes" | H) − 0.5| < ε}
    end if
    Compute f(q*) and a_c
    Update f_c^GP
    H ← H ∪ {(q*, f(q*))}
    Q ← Q \ {q*}
  end while
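The performance model of Eqs. (1)-(2) amounts to standard GP regression with the stated hyperparameters (l = 0.1, σ_f = 1, σ_n = 0.025). A minimal numerical sketch for one concept follows; the observation counts are illustrative, not values from the paper:

```python
import numpy as np

def se_kernel(a, b, sigma_f=1.0, length=0.1):
    """Squared-exponential kernel with the paper's hyperparameters."""
    d = np.asarray(a)[:, None] - np.asarray(b)[None, :]
    return sigma_f * np.exp(-0.5 * d ** 2 / length ** 2)

def gp_posterior(a_obs, y_obs, a_query, sigma_n=0.025):
    """Zero-mean GP regression: posterior mean and variance at a_query."""
    K = se_kernel(a_obs, a_obs) + sigma_n * np.eye(len(a_obs))
    K_s = se_kernel(a_obs, a_query)
    alpha = np.linalg.solve(K, y_obs)
    v = np.linalg.solve(K, K_s)
    mean = K_s.T @ alpha
    var = np.diag(se_kernel(a_query, a_query)) - np.sum(K_s * v, axis=0)
    return mean, var

# Observations for one concept: a = (f(q) + q_gt) / 2, binned at 0.01;
# y = number of answers falling in each bin (illustrative values).
a_obs = np.array([0.05, 0.10, 0.45, 0.55, 0.90, 0.95])
y_obs = np.array([8.0, 5.0, 1.0, 2.0, 6.0, 9.0])
a_grid = np.linspace(0.0, 1.0, 101)
mu, var = gp_posterior(a_obs, y_obs, a_grid)  # f_c^GP over the support set
```

The posterior variance collapses near bins with observations and stays near the prior elsewhere, which is exactly the signal the questioning strategy of Sec. 2.2 exploits.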
A consequence of this model is that we can now visualize the performance of f with respect to the concepts in C, the selected q's, and the dataset D. In Fig. 2 (Right), we illustrate f_c^GP in terms of a for each c. Such a visualization depicts any bias that both f and the dataset D may have (e.g., D contains few samples regarding a specific concept). Note that by summing the observations over concepts, or integrating over the four different subregions of the support set, one retrieves the total TN, FP, FN and TP counts, and other subsequent metrics.

2.2 Questioning Strategy

With the performance model above, and the fact that the validation dataset D may be intractably large, we now describe how to select samples from Q to verify that the MuE has grasped relevant concepts.

To do this, we present a strategy that looks to select samples s and concepts c that are likely to reduce our model's uncertainty. We do this by computing the uncertainty of a concept as the integral of the 95% confidence region of the GP over its support set (i.e., 2 standard deviations). That is, for concept c ∈ C,

    u_c = 4 ∫[0,1] k_c(a, a) da,    (3)

which can be decomposed as u_c^- = 4 ∫[0,0.5] k_c(a, a) da and u_c^+ = 4 ∫[0.5,1] k_c(a, a) da, corresponding to the uncertainty associated with negative and positive samples in D, respectively. Visually, this corresponds to the area over the intervals a ∈ [0, 0.5] and a ∈ [0.5, 1] (see Fig. 2, Right). Our strategy then chooses which concept to ask about, and whether to ask about a sample that does or does not contain this concept. This is performed by selecting

    q* ∈ {q ∈ Q : c_q ∈ argmax_{c ∈ C} max(u_c^-, u_c^+) and q_gt = o_{c_q}},    (4)

where o_{c_q} = 0 if max(u_c^-, u_c^+) = u_c^-, and 1 if max(u_c^-, u_c^+) = u_c^+. From this, q* is selected either randomly, or based on its unpredictability.
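Eqs. (3)-(4) can be sketched with trapezoidal quadrature over the posterior variance k_c(a, a); the per-concept uncertainty values at the bottom are illustrative:

```python
import numpy as np

def _trapz(y, x):
    """Trapezoidal quadrature (avoids NumPy version differences)."""
    return 0.5 * float(np.sum((y[1:] + y[:-1]) * np.diff(x)))

def split_uncertainty(var_grid, a_grid):
    """u_c^- and u_c^+ from Eq. (3): 4x the integral of the posterior
    variance k_c(a, a) over a in [0, 0.5] and [0.5, 1]."""
    neg, pos = a_grid <= 0.5, a_grid >= 0.5
    return (4.0 * _trapz(var_grid[neg], a_grid[neg]),
            4.0 * _trapz(var_grid[pos], a_grid[pos]))

def pick_concept(u):
    """Eq. (4): concept with the largest one-sided uncertainty, plus the
    required ground truth o_c (0 -> negative side, 1 -> positive side)."""
    best_c, (u_neg, u_pos) = max(u.items(), key=lambda kv: max(kv[1]))
    return best_c, (0 if u_neg >= u_pos else 1)

# Illustrative per-concept (u^-, u^+) values:
u = {"Drusen": (0.10, 0.20), "GA": (0.90, 0.30)}
print(pick_concept(u))  # ('GA', 0): ask about negative GA samples next
```

Ties between the negative and positive sides are broken here toward the negative side; the paper does not specify a tie-breaking rule.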
As in [4], unpredictability is computed using the same dataset D and the history of already answered questions H. An overview of the questioning strategy can be seen in Algorithm 1.

3 Experiments and Results

We choose to evaluate our framework on multi-label classification tasks where concepts are directly linked to ground-truth labels. In the following, we outline the different datasets and MuEs we use, and our experimental setup.

3.1 Datasets and MuE

Indian Diabetic Retinopathy Image Dataset (IDRiD) [10]: 143 fundus images from both healthy and diabetic retinopathy subjects, with the task of identifying 4 different lesion types (hemorrhage, hard and soft exudates, microaneurysm). The multi-label MuE is a ResNet [6] with pre-trained weights, trained with 100 images. The remaining 43 images were used as samples in D.

ISIC 2018 Skin Lesion Analysis (https://challenge2018.isic-archive.com/): 1,876 dermoscopic images of skin lesions. Here N_C = 5, consisting of different skin lesion types: pigment network, negative network, globules, milia-like cysts and streaks. The MuE is a pre-trained ResNet [6] trained with 70% of the data. A hundred randomly chosen images from the 30% test images were used to populate D.

OCT: 200 OCT cross-sectional slices from Age-Related Macular Degeneration and Diabetic Macular Edema patients. Eleven different biomarkers can potentially be present in any cross-section (N_C = 11). All 200 images are used for D, and a separately trained Dilated Residual Network [15] is used for the MuE.

Fig. 3: Performance model for the 4 questioning strategies ((a) random, (b) unpredictability, (c) uncertainty, (d) uncertainty & unpredictability). The most common concept (Drusen) and a rare concept (Geographic Atrophy - GA) are shown after 100 questions are posed to the multi-label MuE.
Given that we are not focused on optimizing the performance of a specific MuE, but rather on evaluating relative behavior with respect to concepts, we also provide a set of synthetically generated MuEs whose performance we understand fully. That is, given that the distribution of concepts in D is not uniform for any of the datasets, we wish to compare each MuE to a "biased" MuE. To do this, we simulate a biased MuE that answers questions regarding the most common concept with 90% accuracy, and all others with 50%. Similarly, we simulate a MuE with 50% accuracy for the most common concept, and 90% on all others. Last, we also simulate an unbiased algorithm, with a 70% accuracy regardless of the concept or class imbalance.

3.2 Experiments

To evaluate our framework, we compare four questioning strategies: (a) random, (b) based on the predictability of the question given the previous questions and the dataset [4], (c) based on the uncertainty of the question as defined in Sec. 2.2, and (d) based on the combination of the above two, as described in Sec. 2.2.

Fig. 4: Uncertainty of the performance model with respect to the number of questions asked for different MuEs. Each column corresponds to a dataset. Rows, from top to bottom: image example, uncertainty for the unbiased MuE, uncertainty for the MuE biased toward the most common concept, uncertainty for the MuE negatively biased toward the most common concept, and uncertainty for the trained multi-label classifier.

Fig. 3 depicts the state of the performance model for two concepts (GA and Drusen) after a hundred questions have been asked on the OCT dataset using the trained MuE. In all plots the bias in concept occurrence can be observed, and we see that both the random strategy and the one relying solely on the unpredictability criterion are prone to not asking many questions about rare concepts (the GA concept), thus delaying the assessment of the MuE on this concept.
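The biased and unbiased behaviors compared in Fig. 4 come from the synthetic MuEs of Sec. 3.1, which can be sketched as simple stochastic oracles; the concept name and seed below are illustrative:

```python
import random

def make_synthetic_mue(common_concept, acc_common, acc_other, seed=0):
    """Simulated MuE that answers a (concept, ground truth) question
    correctly with a concept-dependent accuracy."""
    rng = random.Random(seed)

    def answer(concept, q_gt):
        acc = acc_common if concept == common_concept else acc_other
        return q_gt if rng.random() < acc else 1 - q_gt

    return answer

# "Drusen" stands in for the most common concept (illustrative):
biased_to_common = make_synthetic_mue("Drusen", 0.9, 0.5)
biased_against   = make_synthetic_mue("Drusen", 0.5, 0.9)
unbiased         = make_synthetic_mue("Drusen", 0.7, 0.7)
```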
The two uncertainty-based strategies, in contrast, manage to sample rare concepts adequately, and are more confident in the performance model.

In another experiment, we ask all possible questions following each one of the four described strategies. That is, the state of the performance model is the same after all questions have been asked. We repeat the experiments 10 times, and monitor the average uncertainty as questions are asked. The results are shown in Fig. 4. Here we observe that the proposed questioning strategies reach saturated levels of uncertainty quicker than the other strategies, and would require fewer questions for confident assessments.

4 Discussion

To summarize, we present a more informative and interpretable performance model for evaluating closed-ended, "Yes/No" inference methods. To this end, we propose a strategy to sample from the (possibly intractable) set of questions, in order to reach high certainty in the performance model characterizing the MuE and the validation data. We assess our method on three different medical imaging datasets and show that the performance model is able to capture the data distribution information and the MuE biases. Moreover, the questioning strategy allows for faster convergence of the performance model to a low-uncertainty state. We will look to extend this concept to segmentation problems in future work.

References

1. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., Parikh, D.: VQA: Visual question answering. In: The IEEE International Conference on Computer Vision (ICCV) (December 2015)
2. Bendig, A.: Twenty questions: an information analysis. Journal of Experimental Psychology (5), 345-348 (1953)
3. Chuquicusma, M.J., Hussein, S., Burt, J., Bagci, U.: How to fool radiologists with generative adversarial networks? A visual Turing test for lung cancer diagnosis.
In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 240-244. IEEE (2018)
4. Geman, D., Geman, S., Hallonquist, N., Younes, L.: Visual Turing test for computer vision systems. Proceedings of the National Academy of Sciences 112(12), 3618-3623 (2015)
5. Hasan, S.A., Ling, Y., Farri, O., Liu, J., Lungren, M., Müller, H.: Overview of the ImageCLEF 2018 medical domain visual question answering task. In: CLEF 2018 Working Notes (2018)
6. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778 (2016)
7. Jedynak, B., Frazier, P., Sznitman, R.: Twenty questions with noise: Bayes optimal policies for entropy loss. Journal of Applied Probability (1), 114-136 (2012)
8. Lau, J.J., Gayen, S., Abacha, A.B., Demner-Fushman, D.: A dataset of clinically generated visual questions and answers about radiology images. Scientific Data 5, 180251 (2018)
9. Maier-Hein, L., Eisenmann, M., Reinke, A., Onogur, S., Stankovic, M., Scholz, P., Arbel, T., Bogunovic, H., Bradley, A.P., Carass, A., et al.: Author correction: Why rankings of biomedical image analysis competitions should be interpreted with care. Nature Communications 10(1), 588 (2019)
10. Prasanna, P., Samiksha, P., Ravi, K., Manesh, K., Girish, D., Vivek, S., Meriaudeau, F.: Indian diabetic retinopathy image dataset (IDRiD) (2018)
11. Rasmussen, C.E.: Gaussian Processes for Machine Learning. MIT Press (2006)
12. Schlegl, T., Seeböck, P., Waldstein, S.M., Langs, G., Schmidt-Erfurth, U.: f-AnoGAN: Fast unsupervised anomaly detection with generative adversarial networks. Medical Image Analysis 54, 30-44 (2019)
13. Turing, A.: Computing machinery and intelligence. Mind 59(236), 433-460 (1950)
14.
Wu, Q., Teney, D., Wang, P., Shen, C., Dick, A., van den Hengel, A.: Visual question answering: A survey of methods and datasets. Computer Vision and Image Understanding 163, 21-40 (2017)
15. Yu, F., Koltun, V., Funkhouser, T.: Dilated residual networks (May 2017)
