Interpretable Image Recognition with Hierarchical Prototypes
Vision models are interpretable when they classify objects on the basis of features that a person can directly understand. Recently, methods relying on visual feature prototypes have been developed for this purpose. However, in contrast to how humans…
Authors: Peter Hase, Chaofan Chen, Oscar Li, Cynthia Rudin
Peter Hase,¹ ² Chaofan Chen,² Oscar Li,² Cynthia Rudin²
¹UNC Chapel Hill, ²Duke University
peter@cs.unc.edu, {chaofan.chen, rl144, cynthia.rudin}@duke.edu

Abstract

Vision models are interpretable when they classify objects on the basis of features that a person can directly understand. Recently, methods relying on visual feature prototypes have been developed for this purpose. However, in contrast to how humans categorize objects, these approaches have not yet made use of any taxonomical organization of class labels. With such an approach, for instance, we may see why a chimpanzee is classified as a chimpanzee, but not why it was considered to be a primate or even an animal. In this work we introduce a model that uses hierarchically organized prototypes to classify objects at every level in a predefined taxonomy. Hence, we may find distinct explanations for the prediction an image receives at each level of the taxonomy. The hierarchical prototypes enable the model to perform another important task: interpretably classifying images from previously unseen classes at the level of the taxonomy to which they correctly relate, e.g. classifying a hand gun as a weapon, when the only weapons in the training data are rifles. With a subset of ImageNet, we test our model against its counterpart black-box model on two tasks: 1) classification of data from familiar classes, and 2) classification of data from previously unseen classes at the appropriate level in the taxonomy. We find that our model performs approximately as well as its counterpart black-box model while allowing for each classification to be interpreted.
1 Introduction

What is clear from the study of human vision is that we organize the world into "inductively rich" categories that relate to each other taxonomically: we glean useful information from recognizing that something is an animal, and we draw even more useful information from seeing that it is a tiger rather than a cat (Bloom 2001). Further, we can explain our visual judgments by pointing to prototypical features that an object possesses as evidence for its membership in the class to which those features correspond (Salakhutdinov, Tenenbaum, and Torralba 2012); a certain animal is a tiger because it is a large cat with black stripes, approximately orange fur, menacing teeth, etc.

Figure 1: The capuchin shown here is classified at three levels in a pre-defined taxonomy: animal, primate, and capuchin. The classification is made based on similarities between the latent representation of the capuchin and learned prototypes corresponding to each of these three hierarchically organized classes. Beside each prototype, there is a heat map showing the localized areas in the test image that highly activated the prototype. Hence, we see the parts of the image that led the model to classify it as an animal (rather than, e.g., a vehicle), a primate (rather than a non-primate), and finally as a capuchin (rather than an orangutan or gibbon). For the full class taxonomy, see Figure 3.

Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Using some internal taxonomy, people can explain their visual judgments at multiple levels of abstraction, and these explanations may differ across each level. What distinguishes animals from vehicles is different from what distinguishes lions and tigers, for instance. People can also tell when an object resides within a coarse class (like animals) but does not belong to any familiar fine class (like deer).
That is, we can tell when something is a kind of animal we have never seen before.

For computer vision models to replace human judgment on important image recognition tasks, these models should fulfill the same functions and exhibit the same transparency as we do, which is to suggest that vision models should 1) classify images on the basis of human-interpretable features such as visual feature prototypes, 2) predict image classes not just at the level of the dataset labels, but also at each level of a taxonomy that is known to organize the classes, and 3) detect when images belong to never-before-seen subclasses within some coarse class in the known taxonomy.

Why add a hierarchical class structure and novel class detection to an interpretable model? First, such an approach makes explicit the trade-off between information gain and accuracy, since it is easier to distinguish among objects at less informative levels of a taxonomy (Deng et al. 2012). Users can elect to make their decision using only the more reliable but more coarse-grained classification. This option is useful when policy responses to a prediction do not change after a certain level of specificity, or do not differ between two "sibling" classes. For example, a decision-maker who is unsure whether an object is a pistol or an assault rifle might nonetheless produce an appropriate policy response simply by virtue of knowing the object is a gun.

Second, explanations for predictions can be tailored to their corresponding taxonomical level. We may identify the reasons for an ambulance being an automobile (which could include the presence of wheels), and then for it being an ambulance (which could include the presence of sirens, not simply the presence of wheels). This helps us focus our understanding of each image's prediction to the most specific level at which we wish to distinguish its class from other classes.
Further, when the predictions are wrong, we can see at what level of specificity they went wrong.

Lastly, this approach enables the interpretation of predictions for images from novel (never-before-seen) classes, as long as these novel classes fall under other broader classes in the model's known taxonomy. Vision models deployed in live environments will inevitably encounter such images, and it will be useful for them to recognize both that these images belong to novel classes and that they are instances of some familiar but more coarse-grained class.

In this paper, we present an algorithm that performs the three functions described above. In doing so, we draw upon work from three frameworks in computer vision: 1) interpretability through feature prototypes, 2) hierarchical class organization, and 3) novel class detection. Our algorithm combines and builds upon insights from each framework, allowing for the interpretable classification of images at multiple levels of taxonomical specificity, even when these images come from novel classes (that fall under a broader class in the known taxonomy).

With a subset of ImageNet, we test our model against its counterpart black-box vision model on two tasks: classification of data from familiar classes and classification of data from previously unseen classes at the appropriate level in the model taxonomy. We find that our model performs approximately as well as its counterpart black-box model while producing interpretable predictions. We report several accuracy metrics here, including 1) fine-grained accuracy on in-distribution data, 2) coarse-grained accuracy on novel data, and 3) novel class detection accuracy. We also give a quantitative evaluation of the quality of our interpretable model's learned latent space.
2 Related Work

Within computer vision, there are long lines of prior research in each of hierarchical classification, interpretable modeling, and novel class detection. To date, however, the approaches have not been successfully combined.

Interpretable Models. There is no shortage of post-hoc interpretations of CNN-based vision models (Erhan et al. 2009; Lee et al. 2009; Simonyan, Vedaldi, and Zisserman 2014; Sundararajan, Taly, and Yan 2017), but there are fewer methods where an attempt is made to learn explicitly interpretable features. A few identify subsections of an image that were important to a classification, e.g., the class-attention maps of Pinheiro and Collobert (2015) and Zhou et al. (2016). Others feed only a localized portion of the image to the model that is selected in a supervised manner, with densely labeled data (Huang et al. 2016; Zhou et al. 2018), or in an unsupervised manner with auxiliary networks pre-trained for this purpose (Simon and Rodner 2015). The shortcomings of class attention and saliency-based approaches in particular are exemplified in Figure 2.

The most interpretable methods include the prototype-based approaches of Branson et al. (2014), Li et al. (2018), and Chen et al. (2018). Our approach differs from each along several important dimensions. Using the prototype-specific heat map method from Chen et al. (2018), we show that our prototypes encode for local information (i.e., image parts), a quality of prototypes that Branson et al. (2014) do not demonstrate for their model. Li et al. (2018) use a decoder to visualize prototypes of MNIST classes, which does not work for complex naturalistic images.

Our approach differs from Chen et al. (2018) by organizing the prototypes hierarchically rather than in a "flat" manner.
Whereas past work involves learning prototypes particular to each class in the dataset, our method learns both an analogous set of prototypes, which correspond to the leaf nodes in the class taxonomy, and additional sets of prototypes for each related group of classes, which correspond to parent nodes in the class taxonomy. This enables us to interpret image classifications at each parent node in the taxonomy, e.g., what makes a panda an animal rather than a vehicle as well as what distinguishes the panda from a lion.

Hierarchical Classification. Hierarchical classification has been performed with SVMs (Deng et al. 2012), Bayesian graphical models (Salakhutdinov, Tenenbaum, and Torralba 2012), CNNs (Ahmed and Torresani 2017; Redmon and Farhadi 2017; Zhu and Bain 2017; Kuang et al. 2018; Yan et al. 2015), and the use of a CNN and RNN together (Guo et al. 2017). Typically the problem is entirely supervised, but inferring the tree structure of classes has been done in an unsupervised fashion as well (Zhang et al. 2016). Our work falls into the supervised CNN category.

Figure 2: Saliency maps show where the model is looking, but they do not tell why the model classifies an image as it does. Our prototype-based model provides more localized features, leading to better explanations of model classifications (see Figure 6).

Footnote: We are making our code publicly available at: https://github.com/peterbhase/interpretable-image.

Within this category, some approaches use a single CNN and construct predictions over the class taxonomy from a single network output (Redmon and Farhadi 2017), while others branch their networks to produce representations unique to each sub-classification task (Zhu and Bain 2017; Yan et al. 2015). Our work is an instance of the latter, as our network branches at a particular point before any classifications are made.
Our approach departs from previous work in hierarchical classification through our introduction of prototypes that encode for image parts, allowing for model classifications to be directly interpreted. None of the prior CNN-based approaches make use of prototypes in the latent space; they introduce hierarchical class labeling chiefly for purposes of increasing accuracy or dealing with labels of varying specificity. There is a Bayesian graphical model that utilizes prototypes (Salakhutdinov, Tenenbaum, and Torralba 2012), but the prototypes in this model are in pixel space and each represents an entire class, while our prototypes reside in a latent space and represent parts of a class.

Novel Class Detection. Lastly, Shafaei, Schmidt, and Little (2018) review a variety of novel class detection methods for vision models, which are known variously as out-of-distribution detection, outlier detection, or novelty detection methods. These methods include both unsupervised and supervised approaches and predominantly operate by using information from the logits that a model outputs. Unsupervised methods here rely on standard statistical outlier detection techniques (Bendale and Boult 2015), and they perform consistently worse than supervised techniques. The supervised approaches make use of classifiers on a model's logits, while of course requiring that some data are withheld from the modeling process to serve as out-of-distribution data.

We extend this body of work by adapting past methods to the context of hierarchical class organization. We introduce a method for solving the problem of detecting when an instance resides within a known coarse class (like animals) but not within any of the known sub-classes (like panda or lion).

3 Problem Descriptions

We describe the three problems treated by our approach.

Ensuring Interpretability. Interpretable vision models classify images on the basis of directly interpretable features.
From the work of Bloom (2001), we identify two paths toward ensuring that features are interpretable. First, one could produce features corresponding to object properties like redness or having legs. Second, one could produce features from measures of similarity between new instances and representative instances of each class. Note that as a point of terminology, Bloom identifies the first of these approaches as the prototype approach, while he identifies the second as the exemplar approach. The model we introduce here is best considered to follow the prototype approach, since the model learns features that are image parts rather than entire instances of classes. Our model could also be described as following case-based reasoning, since even as the model learns feature prototypes that represent abstract object properties, the feature prototypes are always drawn from concrete instances in the training data.

Figure 3: Our class taxonomy defined over a subset of 15 ImageNet classes, where each of the fifteen classes is represented as a leaf node.

Can we ensure the model learns meaningful features? Quantitative metrics for interpretability have been developed (Bau et al. 2017), but they rely on particular densely labeled datasets that still may not capture all of the relevant, meaningful features for a distinct setting of deployment. Consequently, domain experts must check that features are meaningful for applications in their domain. For domains like those captured in the ImageNet data used in this paper, a layperson can check if features are meaningful; they are not forced to trust a black box.

Hierarchical Classification. The task of hierarchical classification is to predict an image's class at each level of a taxonomy (i.e. tree). Suppose we have sample images from the data space $\mathcal{X}$ and labels from the space $\mathcal{Y}$.
The key difference with a standard classification framework is that each label $y_i$ has $K$ elements, with $y_i^{(k)}$ representing an image's label at the $k$th level in the tree. Here, $\mathcal{Y}^{(0)}$ denotes the root, $\mathcal{Y}^{(1)}$ represents the first set of children below the root, which correspond to the coarsest classes, and $\mathcal{Y}^{(K)}$ represents the finest level.

We seek to learn a function $f : \mathcal{X} \to \mathcal{Y}$ that approximates the true distribution $P(\mathcal{Y} \mid \mathcal{X})$ over paths in the tree; each path corresponds to an image's full label, e.g. {animal, cat}. Physically impossible paths, like {animal, truck}, are known a priori to have 0 probability. Not all branches need to be the same depth, though for notational convenience we will always write label sequences through $\mathcal{Y}^{(K)}$.

This task is accomplished by learning each of the conditional distributions within a factorization of the full joint distribution $P(\mathcal{Y} \mid \mathcal{X}) = P(\mathcal{Y}^{(1)}, \ldots, \mathcal{Y}^{(K)} \mid \mathcal{X})$:

$$P(\mathcal{Y} \mid \mathcal{X}) = P(\mathcal{Y}^{(1)} \mid \mathcal{X}) \times \cdots \times P(\mathcal{Y}^{(K)} \mid \mathcal{Y}^{(K-1)}, \mathcal{X}).$$

Then there are as many distributions to learn as there are parent nodes, counting one root node corresponding to the distribution over $\mathcal{Y}^{(1)}$. Each distribution $P(\mathcal{Y}^{(k+1)} \mid \mathcal{Y}^{(k)} = c^{(k)}, \mathcal{X})$ represents the multinomial distribution over the children classes of the parent node $c^{(k)}$ on the $k$th tree level, where $c^{(0)}$ is the class of all known entities; learning each such distribution is a typical objective for multinomial classification.

Novel Class Detection. Novel class and out-of-distribution detection methods provide a mechanism for identifying data not from a known distribution. This can either be considered an unsupervised problem, or a binary classification problem, if one is willing and able to set aside some data that is "novel," while considering the remaining data to come from the known distribution (Shafaei, Schmidt, and Little 2018).
When combined with a hierarchical classification approach, we can apply novelty detection mechanisms to two different kinds of problems: detecting some entirely new classes (e.g., animals, when the known distribution contains only vehicles), and detecting some classes that are novel at a specific level but not at a broader one, e.g., taxi cabs when the known distribution contains only pickup trucks and sports cars. We will use the term "novel class" to refer to both forms of novelty described above, while we will use "out-of-distribution" to refer strictly to instances that do not belong to any class in a known taxonomy, no matter how coarse-grained.

We can formalize these tasks as estimating the probability of membership in one set of classes, with the option of conditioning on membership in another set of classes. The standard view of out-of-distribution detection is to estimate the probability that data of fine-grained class $Y^*$ belongs to the distribution of known fine-grained classes, $P(Y^* \in \mathcal{Y}^{(K)} \mid \mathcal{X})$. Here, we introduce the problem as novel class detection, where we estimate the probability that the fine-grained label of data is in the known children of the parent, conditioned on it being a member of the parent class,

$$P\big(Y^* \in c^{(k)}_{\text{children}} \mid Y^* \in c^{(k)}, \mathcal{X}\big),$$

where $c^{(k)}_{\text{children}} = \{ c^{(k+1)} \mid c^{(k+1)} \in c^{(k)} \}$ and child-parent relationships like $\{\text{primate}\} \in \{\text{animal}\}$ hold. Our description is thus a generalization of the standard view, which reduces to the standard problem when $k = 0$ and we consider $c^{(0)}$ to be the class of all known entities.

4 Model

In this section we describe our image recognition model and the novel class detection method. Denote the data as $D = [X, Y] = \{(x_i, y_i)\}_{i=1}^{n}$, with hierarchical labels $y_i \in \mathcal{Y}$ for $i \in \{1, \ldots, n\}$.

Image Recognition Model

The architecture of our image recognition model is represented in Figure 4.
We term our model HPnet, for Hierarchical Prototype network. Our recognition model is a generalization of the model of Chen et al. (2018); with a pre-defined taxonomy consisting of only one level of fine-grained classes, our HPnet model reduces to their model, which we label Pnet. The core components of the Pnet model, including the design of the prototype layer and loss terms, are delineated by Chen et al. (2018), and are described again here for convenience.

First, a CNN $f$ maps images to a latent space. In our experiments, we use the VGG-16 network (Simonyan and Zisserman 2014), with the fully connected layers and classifier removed. We append two 1x1 convolution layers to the end of the network to reduce the dimensionality of the convolutional output from $H \times W \times D$ to $H \times W \times D'$, where $D' = 32 < D = 512$ and the second activation function is a sigmoid. Let the convolutional output be $z$. Here, the convolutional output is considered as a set of $HW$ patch vectors each of size $D'$, $\{\tilde{z}_i\}_{i=1}^{HW}$. By virtue of the sigmoid activation, the patch vectors are in the unit hypercube in $\mathbb{R}^{D'}$.

For each parent node in the class taxonomy, there is a prototype layer that operates directly on $z$. At a high level, prototypes in the latent space will be used to compute feature vectors from the latent representation $z$. Each element of a feature vector will be a similarity score for a particular prototype that will be higher when some patch vector $\tilde{z} \in z$ is close to that prototype.

During training, a set of $m_{c^{(k)}}$ prototypes is learned for each prototype layer, denoted as $P_{c^{(k)}} = \{ p_j^{c^{(k)}} \}_{j=1}^{m_{c^{(k)}}}$, where $c^{(k)}$ gives the parent class. The prototype layer of class $c^{(k)}$ first transforms $z$ into a set of $m_{c^{(k)}}$ matrices of size $H \times W$, where the $i$th matrix is the activation map corresponding to prototype $p_i^{c^{(k)}}$.
Then the max of each activation map is taken to produce a feature vector $v_{c^{(k)}}$ in $\mathbb{R}^{m_{c^{(k)}}}$. Together, similarity score $j$ of the layer's final feature vector is computed as

$$g_{p_j^{c^{(k)}}}(z) = \max_{\tilde{z} \in \text{patches}(z)} \log\left(1 + \frac{1}{\|\tilde{z} - p_j^{c^{(k)}}\|_2^2 + \epsilon}\right).$$

Finally, a fully connected layer $h_{c^{(k)}}$ transforms the feature vector into a distribution over the classes under that parent node.

Figure 4: The HPnet architecture.

There are a few motivations for the design of the prototype layer. By enforcing a constraint that each prototype $p_j^{c^{(k)}}$ be equal to some patch $\tilde{z}$ obtained from an image in the training set, we obtain a correspondence between that latent prototype and a visually inspectable receptive field from the training set. Then by upsampling the activation map associated with that prototype to the size of the input image, we get a heat map that shows the localized portions of the input image that highly activate the prototype (Chen et al. 2018). Consequently, we can understand a particular classification by checking the prototypes that were highly activated and viewing their corresponding heat maps overlaid on the input image.

Further, our prototypes encode for conditional information. To give an example, our prototypes for ambulances are used only to distinguish ambulances from other vehicles (here, pickups and sports cars). It is the vehicle prototypes that are used to classify a particular ambulance as a vehicle rather than an animal.

Note that within each set of prototypes $P_{c^{(k)}}$, we allocate a pre-determined number of prototypes evenly to each child class, so that every child class will be represented in the set of prototypes. The mechanism for this allocation operates in such a way that two prototype vectors can be equivalent, in which case the child node can be considered to have fewer unique prototypes than the pre-determined number.
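The similarity score defined above can be illustrated with a short NumPy sketch. This is not the authors' implementation: the function name `similarity_scores`, the array layout, and the default value of `eps` are assumptions made for illustration. It computes one activation map per prototype and takes the max over patches.

```python
import numpy as np

def similarity_scores(z, prototypes, eps=1e-4):
    """Sketch of one prototype layer's forward pass.

    z          : (H, W, D') latent output of the CNN (layout is an assumption)
    prototypes : (m, D') prototype vectors of one parent node
    eps        : small positive constant (value chosen here for illustration)

    Returns (scores, maps): scores has shape (m,), maps has shape (m, H, W).
    """
    H, W, D = z.shape
    patches = z.reshape(H * W, D)                    # the HW patch vectors
    # Squared L2 distance between every prototype and every patch: (m, HW).
    diffs = prototypes[:, None, :] - patches[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    # log(1 + 1 / (distance + eps)) grows as a patch approaches a prototype.
    activations = np.log(1.0 + 1.0 / (sq_dists + eps))
    maps = activations.reshape(len(prototypes), H, W)
    scores = maps.max(axis=(1, 2))                   # max over all patches
    return scores, maps

# Illustrative usage with random data in the unit hypercube.
rng = np.random.default_rng(0)
z_demo = rng.random((7, 7, 32))
protos_demo = rng.random((4, 32))
scores_demo, maps_demo = similarity_scores(z_demo, protos_demo)
```

Each `maps[j]` is the low-resolution activation map that would be upsampled to the input size to obtain the heat map for prototype `j`; the upsampling step is omitted here.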
In the experiments to follow, we allocate 8 prototypes per child class. We denote the subset of prototypes under this parent layer that are allocated to child $c^{(k+1)}$ as $P_{c^{(k+1)}}$.

Lastly, if needed, a distribution over the most fine-grained classes in the taxonomy is obtained by computing the probability of an instance belonging to each class in a path down the taxonomy, for every path down the taxonomy. That is, one computes

$$P\big(y_i^{(K)} = c^{(K)} \mid x_i\big) = P\big(y_i^{(1)} = c^{(1)} \mid x_i\big) \prod_{k=1}^{K-1} P\big(y_i^{(k+1)} = c^{(k+1)} \mid y_i^{(k)} = c^{(k)}, x_i\big),$$

for each path in the taxonomy, $\{c^{(1)}, \ldots, c^{(K)}\}$.

Training Algorithm

Similar to Chen et al. (2018), we train the model by alternating between optimization of the layers and a projection phase, wherein prototypes are projected onto the closest patches $\tilde{z}$ in the latent space. This projection phase is necessary since we could not optimize the objective via gradient descent methods while enforcing the constraint that each prototype is always equal to some latent patch from the data.

Note that we do not use a VGG-16 model pre-trained on ImageNet, as this would confound our later novel class detection testing, since our novel class test set also comes from ImageNet. Rather, our VGG-16 is trained from random initialization, and the Pnet and HPnet models are trained with their convolutional layers initialized from the most accurate of the trained VGG-16 models.

Objective Function. The objective function that we aim to minimize is the sum across prototype layers of four terms: a cross entropy between predictions and labels, a clustering term, a separation term, and a regularization term. We use the same clustering and separation terms as Chen et al. (2018), though we adapt them to be specific to each prototype layer.
With the set of all parent nodes as $C$, we minimize

$$\sum_{c^{(k)} \in C} \Bigg[ \sum_{i : y_i^{(k)} = c^{(k)}} \text{CrossEntropy}\big(h_{c^{(k)}} \circ g_{P_{c^{(k)}}} \circ f(x_i),\, y_i\big) + \lambda_1 \text{Clust}\big(P_{c^{(k)}}, X, Y\big) + \lambda_2 \text{Sep}\big(P_{c^{(k)}}, X, Y\big) + \lambda_3 \text{Reg}\big(h_{c^{(k)}}\big) \Bigg].$$

Each term is explained in turn. The cross entropy encourages accuracy of predictions, and, notably, the sum of the cross entropies over parent classes is equivalent to a single cross entropy between the fine-grained labels of the data and the model's joint distribution over fine-grained classes, since the conditional probabilities decouple through the logarithm.

Let us explain the clustering and separation costs. The clustering cost is designed to encourage the model to map at least one patch vector of each image close to a prototype corresponding to its class. For a given layer, the term is the sum over images of the minimum distance between some patch vector and some prototype of that input image's class.

$$\text{Clust}\big(P_{c^{(k)}}, X, Y\big) = \sum_{i : y_i^{(k)} = c^{(k)}} \; \min_{j : p_j \in P_{c_i^{(k+1)}}} \; \min_{\tilde{z} \in \text{patches}(f(x_i))} \|\tilde{z} - p_j\|_2^2.$$

The separation cost is designed to encourage the model to avoid mapping any patch vectors of an image close to a prototype corresponding to a different class. For a given layer, the term is the negative sum over images of the minimum distance between some patch vector and some prototype not belonging to that input image's class.

$$\text{Sep}\big(P_{c^{(k)}}, X, Y\big) = - \sum_{i : y_i^{(k)} = c^{(k)}} \; \min_{j : p_j \notin P_{c_i^{(k+1)}}} \; \min_{\tilde{z} \in \text{patches}(f(x_i))} \|\tilde{z} - p_j\|_2^2.$$

These additional cost terms induce a clustering structure in the latent space, ensuring that prototypes encode for information that is specific to the class they correspond to. We confirm this empirically by checking which patch vectors are close to each prototype, as elaborated on in Section 5.

Lastly, Reg is a regularization term on the fully connected layers of each prototype layer.
With respect to a given class, the term imposes $\ell_2$ regularization on weights that connect to similarity scores of prototypes belonging to that class, while imposing $\ell_1$ regularization on the weights that connect to similarity scores of prototypes belonging to other classes. Thus as the model tallies the evidence for an instance belonging to a certain class, the Reg term encourages the model to rely only on similarity scores of prototypes belonging to that class, as the weights connecting to scores of other classes' prototypes will be sparse. This greatly simplifies model interpretation.

We give an approximate algorithm for minimizing the objective function. The algorithm proceeds through several training phases that are described below. The number of epochs spent in each phase is given in Table 3 in Appendix A.

Convolutional Layers and Prototypes. In the first optimization phase, the objective is optimized with respect to the weights of $f$ and each $P_{c^{(k)}}$ via stochastic gradient descent, while each $h_{c^{(k)}}$ remains fixed.

To initialize the weights of each $h_{c^{(k)}}$ layer, we adopt the technique of Chen et al. (2018), which is to set class connections to 1 when they correspond to similarity scores of prototypes belonging to that class, and −0.5 otherwise. That is, supposing that the logit for a particular class $c$ with parent $c^{(k)}$ is computed as $\alpha_c^T v_{c^{(k)}}$, we set the $j$th element of $\alpha_c$ to 1 if $p_j^{c^{(k)}}$ is a prototype allocated to class $c$ and to −0.5 if the prototype was allocated to another class. Thus evidence for an image belonging to a certain class accrues as it activates prototypes belonging to that class and diminishes as it activates prototypes belonging to other classes.

Optimization of All Layers. In this phase, we optimize all layers of the network at once, including the fully connected layers in each prototype layer. It is in this training phase that the weights of each $h_{c^{(k)}}$ layer become sparse.
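As a rough illustration of the per-layer quantities used in these phases, the following NumPy sketch computes the clustering and separation costs for one prototype layer and builds the initial class-connection weights (1 for a class's own prototypes, −0.5 otherwise). The function names, the list-of-arrays data layout, and the integer child-class encoding are assumptions made here, not the authors' code.

```python
import numpy as np

def min_sq_dist(patches, prototypes):
    """Smallest squared L2 distance between any patch and any prototype."""
    diffs = patches[:, None, :] - prototypes[None, :, :]
    return float(np.min(np.sum(diffs ** 2, axis=-1)))

def cluster_and_separation(patches_per_image, child_labels, prototypes, proto_child):
    """Clust and Sep for a single parent node (one prototype layer).

    patches_per_image : list of (HW, D') arrays, one per image under this parent
    child_labels      : child-class index of each image (hypothetical encoding)
    prototypes        : (m, D') prototypes of this layer
    proto_child       : (m,) integer array, child class each prototype is allocated to
    """
    clust, sep = 0.0, 0.0
    for patches, label in zip(patches_per_image, child_labels):
        own = proto_child == label
        clust += min_sq_dist(patches, prototypes[own])   # pull toward own-class prototypes
        sep -= min_sq_dist(patches, prototypes[~own])    # negative distance to other classes
    return clust, sep

def init_class_connections(proto_child, n_children):
    """Initial weights alpha: 1 for a class's own prototypes, -0.5 otherwise."""
    alpha = np.full((n_children, len(proto_child)), -0.5)
    alpha[proto_child, np.arange(len(proto_child))] = 1.0
    return alpha

# Illustrative usage: two child classes with two prototypes each.
proto_child_demo = np.array([0, 0, 1, 1])
alpha_demo = init_class_connections(proto_child_demo, n_children=2)
```

With these weights, the logit for class `c` would be `alpha[c] @ v`, where `v` is the feature vector of similarity scores for the layer.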
Projection of Prototypes. Every five epochs of the above two phases, we project prototypes onto the patch vectors from the training data that they are closest to, with the constraint that prototypes can only be projected onto patch vectors from instances belonging to the class to which the prototypes have been pre-allocated. Since we do not restrict distinct prototypes from being projected onto the same patch vector, multiple prototype vectors may be equal to each other after the projection. This phase is necessary to achieve a direct correspondence between prototypes and latent representations from the training data.

Convex Optimization of Fully Connected Layers. Following each projection phase, we perform the same convex optimization described by Chen et al. (2018), but now we do so for the fully connected layer in each prototype layer.

Figure 5: The 5 nearest neighbors for this vehicle prototype from the test set. From the neighbors, it appears that this prototype encodes for wheels and wheel wells (the wheel edges). Notice that this prototype encodes for these properties in a vehicle-general manner. Among the 5 nearest neighbors, the wheels of both pickups and sports cars are activated.

Novel Class Detection Model

Since we are treating a problem not considered by Shafaei, Schmidt, and Little (2018), we must adapt the model-fitting scheme to the problem at hand. We fit a novel class detector for every parent node to discriminate between children of that node from the known distribution and not-yet-seen children of that node. For instance, a model is fit on a dataset consisting of vehicles seen during training of the image recognition model, vehicles = {ambulance, sports car, pickup}, as well as novel vehicles not seen during training of the recognition model, vehicles* = {cab, forklift, tractor, mountain bike}. The goal is to discriminate between these sets.
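The prototype projection phase described earlier in this section can be sketched as a nearest-neighbor search restricted by class. The sketch below is a minimal illustration under assumed names and data layout (`project_prototypes`, lists of per-image patch arrays, integer child-class labels); it also assumes every child class has at least one training image.

```python
import numpy as np

def project_prototypes(prototypes, proto_child, patches_per_image, child_labels):
    """Replace each prototype with its nearest training patch of the allocated class.

    prototypes        : (m, D') current prototype vectors
    proto_child       : (m,) child class each prototype is pre-allocated to
    patches_per_image : list of (HW, D') latent patch arrays, one per training image
    child_labels      : child-class index of each training image
    """
    projected = prototypes.copy()
    for j, p in enumerate(prototypes):
        # Only patches from images of the prototype's own class are candidates.
        candidates = np.concatenate(
            [patches for patches, label in zip(patches_per_image, child_labels)
             if label == proto_child[j]],
            axis=0,
        )
        sq_dists = np.sum((candidates - p) ** 2, axis=1)
        projected[j] = candidates[np.argmin(sq_dists)]
    return projected
```

Because nothing prevents two prototypes from selecting the same patch, the returned array can contain duplicate rows, which matches the remark that multiple prototype vectors may be equal after projection.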
We test a number of classifiers from Shafaei, Schmidt, and Little (2018). By fitting a probabilistic classifier for each parent node, we are able to produce nuanced predictions such as "novel vehicle", which follows formally from the highest predicted probability being

$$P(y^* \notin \text{vehicles},\ c^{(1)} = \text{vehicle} \mid x) = P(y^* \notin \text{vehicles} \mid c^{(1)} = \text{vehicle},\ x)\, P(c^{(1)} = \text{vehicle} \mid x),$$

where the first factor on the right-hand side is obtained from the novel vehicle detector and the second factor is obtained from the image recognition model.

5 Experiments

For a model trained on a subset of ImageNet data, we interpret a model prototype, show a case study of the model classifying a novel image, and give image recognition and novel class detection accuracies.

For training our recognition model, we select a subset of 15 ImageNet classes and define a taxonomy over them as shown in Figure 3. Note that we hold out 50 images from the 1300 training images for each class to create a validation set used for early stopping in training. The test accuracies we report are calculated only after training is complete.

For later novel class detection, we hold out 15 additional classes that fall under the same coarse classes in $Y^{(1)}$, though in the novel data we have four kinds of vehicles and no scuba divers. The taxonomy for this data is shown in Figure 7 in Appendix A.

Data Augmentation. We implement the CEDA data augmentation technique of Hein, Andriushchenko, and Bitterwolf (2018). This technique involves including with every training batch an equal number of random noise images. The only loss term these images are implicated in is a cross entropy between their predicted class distributions and a uniform distribution over classes, thus encouraging the model to be maximally uncertain over random noise. The CEDA technique may improve the clustering quality of the HPnet latent space (see Table 1), and it does not lower model accuracy (Hein, Andriushchenko, and Bitterwolf 2018).

                 % Correct Neighbors
Model            Train    Test
HPnet            79.24    76.20
HPnet + CEDA     84.90    79.24

Table 1: (Latent-space clustering quality.) Proportion of the nearest neighbors to the prototypes that belong to the same class as their neighboring prototype. Prototypes tend to be surrounded by patch vectors from images that belong to the correct classes, indicating a learned clustering quality to the latent space. Note that these two models are initialized with VGG-16 base models that achieve differing accuracy, so the change in clustering quality could result from the differing initializations rather than the CEDA technique. We give confidence intervals in Table 4 in Appendix A.

Besides this technique, we use the standard ImageNet cropping procedure: for training, random resized crops of size 224 by 224 and, for testing, resizing to 256 by 256 then cropping to 224 by 224.

Interpreting the Recognition Model's Latent Space

How do we interpret the features learned by the model? We can inspect each prototype along with the images of patch vectors that are closest to it in the latent space. In Figure 5, we show a vehicle prototype, the test images whose patch vectors are closest to the prototype in the latent space, and those images with prototype activation maps overlaid on them. Notice that this prototype encodes for wheels in a vehicle-general manner. Among the 5 nearest neighbors, the wheels of both pickups and sports cars are activated.

We would also like to capture a global perspective on the clustering quality of the latent space in order to check that prototypes tend to be surrounded by patch vectors from images of the same class. To do so, for each prototype we compute the percentage of its five nearest neighbors that belong to the correct class (e.g., a vehicle for one of the vehicle prototypes); we average these percentages to obtain a metric of clustering quality, which we show in Table 1.

Figure 6: A forklift is classified as a novel vehicle. Note that the strongest evidence for this forklift being a vehicle is its possession of a wheel, as evidenced by the most activated prototype. For context, the image recognition model used here obtained 61% accuracy on classifying the novel forklifts as vehicles, and the logistic model for novel class detection obtained 69% test accuracy on discriminating familiar vehicles from novel vehicles. These top four prototypes accounted for 74% of the magnitude of the vehicle logit.

Classifying a Test Image

In this section, we give an example of a forklift being classified as a novel vehicle. A diagram is shown in Figure 6. By relying on learned concepts such as wheels, the model successfully classifies the forklift as a vehicle. The novel vehicle detector also successfully classifies it as a novel vehicle, i.e. not in the set of {ambulance, pickup, sports car}. For additional examples, see Figures 8 and 9 in Appendix A.

Image Classification Accuracy

We give the test accuracies for each model in Table 2. The models include the VGG-16 network, our HPnet, and a flat version of our model, Pnet. See Appendix A for the method of computing coarse-grained predicted probabilities for the VGG-16 network.

Test Accuracies by Model
Model            F-ID     C-ID     C-Novel
VGG-16 + CEDA    82.19    92.83    62.31
Pnet + CEDA      81.60    92.56    60.17
HPnet + CEDA     82.61    93.57    62.16

Table 2: (Test accuracies for each model.) The accuracies include the fine-grained accuracy on in-distribution data (F-ID), the coarse-grained accuracy on in-distribution data (C-ID), and the coarse-grained accuracy on novel data (C-Novel). Accuracies are averages across five models per method. We give confidence intervals in Table 5 in Appendix A. Note that the weights of the convolutional layers for the Pnet and HPnet models are always initialized from the trained weights of the most accurate VGG-16 model, which achieves 84.93% F-ID accuracy.

Our model attains on average similar accuracy to its black-box and flat counterparts. However, relative to the pre-trained VGG-16 model used to initialize HPnet, there is an average drop in fine-grained accuracy of 2.32%. This drop is to be expected given the additional constraints introduced for purposes of interpretability; it is left to practitioners to assess whether a gain in accuracy of this size is worth sacrificing interpretability. This gap will also likely narrow as the space of training approaches for prototype-based models like HPnet is explored to the extent that it has been for the VGG class of models.

Finally, we observe that it is possible to correctly classify novel data at the coarse level at far above the chance rate. Random performance on the novel data would yield about 17% accuracy.

Novel Class Detection

We test three methods from Shafaei, Schmidt, and Little (2018): ScoreSVM, PbThreshold, and a logistic regression. The first of these is an SVM (with linear kernel) on the model logits, the second is a simple threshold on the max predicted probability, and the last is a logistic regression on the model logits. We apply each of these three methods to two image recognition models, one with CEDA and one without, for a total of six methods.

The accuracies obtained are in line with those reported by Shafaei, Schmidt, and Little (2018), and there is no clearly superior method. The logistic regression for the HPnet + CEDA model, which is used in Figure 6, obtains 73.72% accuracy on average over five recognition models. We give the accuracies with confidence intervals in Table 6 in Appendix A, where the testing procedure is described as well.
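The way a detector output and a recognition-model output combine into a prediction such as "novel vehicle" can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function names, the three coarse class names used in the example, and all probability values are hypothetical. It multiplies each parent node's novelty probability by that node's coarse class probability and returns the most probable (coarse class, novelty) pair.

```python
from typing import Dict, Tuple

JointKey = Tuple[str, str]


def joint_probabilities(
    coarse_probs: Dict[str, float],
    novel_given_coarse: Dict[str, float],
) -> Dict[JointKey, float]:
    """Return P(novelty, coarse class | x) for every coarse class.

    coarse_probs[c]       -- P(c | x), from the image recognition model.
    novel_given_coarse[c] -- P(novel | c, x), from the detector for node c.
    """
    joint: Dict[JointKey, float] = {}
    for coarse_class, p_coarse in coarse_probs.items():
        p_novel = novel_given_coarse[coarse_class]
        joint[(coarse_class, "novel")] = p_novel * p_coarse
        joint[(coarse_class, "familiar")] = (1.0 - p_novel) * p_coarse
    return joint


def predict(
    coarse_probs: Dict[str, float],
    novel_given_coarse: Dict[str, float],
) -> JointKey:
    """Pick the (coarse class, novelty) pair with the highest joint probability."""
    joint = joint_probabilities(coarse_probs, novel_given_coarse)
    return max(joint, key=joint.get)


# Made-up numbers for a single image.
coarse = {"vehicle": 0.7, "weapon": 0.2, "everyday object": 0.1}
novelty = {"vehicle": 0.8, "weapon": 0.5, "everyday object": 0.5}
prediction = predict(coarse, novelty)
```

With these made-up numbers the pair ("vehicle", "novel") has joint probability of about 0.56, the largest of the six, so the image would be labeled a novel vehicle.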
6 Conclusion

We provide a model that classifies objects at each level in a semantic taxonomy, identifies when objects are novel at fine-grained levels of its taxonomy, and uses directly interpretable features that are tailored to each parent node in the taxonomy. In general, this is the first vision model to accomplish all three tasks in a synchronized manner. In particular, it is the first CNN-based model to organize prototypes hierarchically and the first to perform novel class detection across levels of the class taxonomy. As a result, the explanations are much richer, leading to dramatically improved interpretability. An application to ImageNet demonstrates the viability of the method.

7 Acknowledgements

The authors would like to thank Alina Barnett and Chaofan Tao at Duke University, as well as Jonathan Su at MIT Lincoln Laboratory, for their helpful contributions and feedback on this work.

A Appendix

Coarse Predictions from Flat Models

There are two ways to compute coarse-grained predictions from a flat model, as is needed to obtain the C-Novel metric for the VGG and Pnet models in Table 2. First, one can take the fine-grained prediction of the model and check if it is in the same coarse class as the true label (e.g., predicting the fine class to be ambulance yields the correct coarse prediction if the true label is sports car). Alternatively, one could sum up the predicted probabilities over the fine-grained classes within each coarse class to get predicted coarse class probabilities, and then check that the max probability is for the true coarse class (e.g., p(vehicle) is obtained as p(ambulance) + p(pickup) + p(sports car), and the coarse prediction for an instance of vehicle is only correct if this p(vehicle) is the max of these coarse-grained probabilities).
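The two approaches just described can be sketched as below. This is a hedged illustration, not the authors' code: the function names, the `parent_of` mapping, the fine class name "rifle", and the probability values are assumptions chosen so that the two approaches disagree.

```python
from typing import Dict


def coarse_from_fine_argmax(
    fine_probs: Dict[str, float], parent_of: Dict[str, str]
) -> str:
    """First approach: take the top fine-grained class and look up its parent."""
    top_fine = max(fine_probs, key=fine_probs.get)
    return parent_of[top_fine]


def coarse_probabilities(
    fine_probs: Dict[str, float], parent_of: Dict[str, str]
) -> Dict[str, float]:
    """Second approach: sum fine-grained probabilities within each coarse class."""
    coarse: Dict[str, float] = {}
    for fine_class, p in fine_probs.items():
        parent = parent_of[fine_class]
        coarse[parent] = coarse.get(parent, 0.0) + p
    return coarse


# Hypothetical taxonomy fragment and made-up flat-model probabilities.
parent_of = {
    "ambulance": "vehicle",
    "pickup": "vehicle",
    "sports car": "vehicle",
    "rifle": "weapon",
}
fine_probs = {"ambulance": 0.2, "pickup": 0.2, "sports car": 0.2, "rifle": 0.4}

by_argmax = coarse_from_fine_argmax(fine_probs, parent_of)
coarse = coarse_probabilities(fine_probs, parent_of)
by_sum = max(coarse, key=coarse.get)
```

In this toy case the single most probable fine class is "rifle", so the first approach predicts weapon, while the summed vehicle probability of about 0.6 makes the second approach predict vehicle.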
Empirically, we observe that the second approach involving sums over fine-grained probabilities outperforms the first approach, so we present accuracies computed using the second approach.

Testing Procedure for Novel Class Detection

We present accuracies of novel class detection methods in Table 6. Accuracies are averaged across models fit to logits from five image recognition models. An accuracy metric for a single recognition model is obtained as follows: for each parent node, a test accuracy is obtained by an average across leave-one-class-out tests. Thus for the vehicle detector, four tests are performed, where each class in {cab, forklift, tractor, mountain bike} serves as the test novel dataset once, and each time the remaining three novel classes are included in the training data. This split is justified in (Shafaei, Schmidt, and Little 2018), since the classes in the test data should not have been seen by the novel class detector at all during training. Training data are drawn from the training partition of the ImageNet data, while testing data are drawn from the testing partition. Data are always balanced to consist of half familiar classes and half novel classes. Hyperparameters for the models are tuned with 50 held out images from the training data.

Tables and Figures

Figure 7: The novel data taxonomy. The image recognition model does not see these data during training.

Phase           Num. Epochs
Conv. Layers    5
All Layers      45
Convex Opt.     2

Table 3: (Number of epochs spent in each training phase.) The projection phase occurs every 5 epochs of the "Conv. Layers" and "All Layers" phases. Projection phases are always followed by convex optimization phases. The very last convex optimization phase lasts for 10 epochs rather than 2.

                 % Correct Neighbors
Model            Train            Test
HPnet            79.24 ± 1.65     76.20 ± 1.56
HPnet + CEDA     84.90 ± 2.2      79.24 ± 1.83

Table 4: (Latent-space clustering quality.)
Proportion of the nearest neighbors to the prototypes that belong to the same class as their neighboring prototype. We report the average proportion across five trained models for each method, as well as the 95% confidence intervals.

Figure 8: An assault rifle is classified as a weapon. The second most activated prototype detected the weapon's trigger (see Figure 10 in Appendix A), and the fourth most activated prototype detected the hand holding the weapon. Together, the top four prototypes account for 75% of the magnitude of the weapon logit. For context, the model obtained 83% accuracy on the weapon class.

Figure 9: A revolver is classified as an everyday object. For context, the image recognition model obtained 53% accuracy on classifying the novel revolvers as weapons. The availability of the prototypes here is helpful for diagnosing what went wrong in the prediction; an investigation of each prototype's nearest neighbors would help reveal what led to the mistake. With this information, one gains an idea as to what data collection or augmentation would help prevent this kind of model error.

Figure 10: One model's 5 nearest neighbors for this weapon prototype from the test dataset. The learned prototype encodes for the presence of a trigger and the trigger guard, or a similar pattern appearing in the fourth nearest neighbor.

Figure 11: One model's 5 nearest neighbors for this ambulance prototype from the train and test datasets. The learned prototype encodes for a red-on-white pattern exemplified in the training nearest neighbors, or, judging by the last two neighbors from the test data, a more general color-on-white pattern.

Figure 12: One model's 5 nearest neighbors for this ambulance prototype from the train and test datasets. The learned prototype encodes for text of various kinds, or, judging by some of the nearest test neighbors, a dark-lines-on-light-background pattern.
While the prototype is highly activated by the assault rifle and wine bottle images here, it is possible that ambulances that highly activate this prototype are never classified as assault rifles or wine bottles, by virtue of always being classified as vehicles to begin with, rather than weapons or everyday objects.

Test Accuracies by Model (w/ CEDA)
Model      F-ID             C-ID             C-Novel
VGG-16     82.19 ± 1.53     92.83 ± 0.78     62.31 ± 1.72
Pnet       80.67 ± 0.78     91.79 ± 0.45     60.58 ± 0.32
HPnet      82.61 ± 0.44     93.57 ± 0.43     62.16 ± 0.70

Table 5: (Test accuracies for each model.) The accuracies include the fine-grained accuracy on in-distribution data (F-ID), the coarse-grained accuracy on in-distribution data (C-ID), and the coarse-grained accuracy on novel data (C-Novel). The accuracies reported are averages across five trained models per method, with 95% confidence intervals. Our interpretable model, HPnet, performs similarly to its black-box counterpart VGG-16 as well as its flat counterpart Pnet. Note that the weights of the convolutional layers for the Pnet and HPnet models are always initialized from the trained weights of the best VGG-16 model, which achieves 84.93% F-ID accuracy.

Classifier                Accuracy
PbThreshold               52.05 ± 1.86
PbThreshold + CEDA        51.22 ± 1.50
ScoreSVM                  74.69 ± 0.96
ScoreSVM + CEDA           73.94 ± 1.13
Logistic Reg.             74.75 ± 1.05
Logistic Reg. + CEDA      73.72 ± 0.95

Table 6: (Novel class detection accuracies.) Accuracies are averaged across five models per method. In-distribution and novel data are balanced in testing. For each method, the test detection accuracies are averaged across parent nodes.

References

[Ahmed and Torresani 2017] Ahmed, K., and Torresani, L. 2017. Branchconnect: Large-scale visual recognition with learned branch connections. CoRR abs/1704.06010.

[Bau et al. 2017] Bau, D.; Zhou, B.; Khosla, A.; Oliva, A.; and Torralba, A. 2017. Network Dissection: Quantifying Interpretability of Deep Visual Representations. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, 3319–3327. IEEE.

[Bendale and Boult 2015] Bendale, A., and Boult, T. E. 2015. Towards open set deep networks. CoRR abs/1511.06233.

[Bloom 2001] Bloom, P. 2001. Précis of how children learn the meanings of words. Behavioral and Brain Sciences 24:1095–1103.

[Branson et al. 2014] Branson, S.; Van Horn, G.; Belongie, S.; and Perona, P. 2014. Bird species categorization using pose normalized deep convolutional nets. arXiv preprint arXiv:1406.2952.

[Chen et al. 2018] Chen, C.; Li, O.; Tao, C.; Barnett, A.; Su, J.; and Rudin, C. 2018. This looks like that: deep learning for interpretable image recognition. CoRR abs/1806.10574.

[Deng et al. 2012] Deng, J.; Krause, J.; Berg, A. C.; and Fei-Fei, L. 2012. Hedging your bets: Optimizing accuracy-specificity trade-offs in large scale visual recognition. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, 3450–3457. IEEE.

[Erhan et al. 2009] Erhan, D.; Bengio, Y.; Courville, A.; and Vincent, P. 2009. Visualizing Higher-Layer Features of a Deep Network. Technical Report 1341, University of Montreal. Also presented at the ICML 2009 Workshop on Learning Feature Hierarchies, Montreal, Canada.

[Guo et al. 2017] Guo, Y.; Liu, Y.; Bakker, E.; Guo, Y.; and Lew, M. S. 2017. Cnn-rnn: a large-scale hierarchical image classification framework. Multimedia Tools and Applications.

[Hein, Andriushchenko, and Bitterwolf 2018] Hein, M.; Andriushchenko, M.; and Bitterwolf, J. 2018. Why relu networks yield high-confidence predictions far away from the training data and how to mitigate the problem. CoRR abs/1812.05720.

[Huang et al. 2016] Huang, S.; Xu, Z.; Tao, D.; and Zhang, Y. 2016. Part-stacked cnn for fine-grained visual categorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1173–1182.

[Kuang et al. 2018] Kuang, Z.; Yu, J.; Li, Z.; Zhang, B.; and Fan, J. 2018. Integrating multi-level deep learning and concept ontology for large-scale visual recognition. Pattern Recognition 78.

[Lee et al. 2009] Lee, H.; Grosse, R.; Ranganath, R.; and Ng, A. Y. 2009. Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations. In Proceedings of the 26th International Conference on Machine Learning (ICML), 609–616.

[Li et al. 2018] Li, O.; Liu, H.; Chen, C.; and Rudin, C. 2018. Deep Learning for Case-Based Reasoning through Prototypes: A Neural Network that Explains Its Predictions. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI).

[Pinheiro and Collobert 2015] Pinheiro, P. O., and Collobert, R. 2015. From Image-Level to Pixel-Level Labeling With Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1713–1721.

[Redmon and Farhadi 2017] Redmon, J., and Farhadi, A. 2017. Yolo9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, 7263–7271.

[Salakhutdinov, Tenenbaum, and Torralba 2012] Salakhutdinov, R.; Tenenbaum, J.; and Torralba, A. 2012. One-shot learning with a hierarchical nonparametric bayesian model. In Guyon, I.; Dror, G.; Lemaire, V.; Taylor, G.; and Silver, D., eds., Proceedings of ICML Workshop on Unsupervised and Transfer Learning, volume 27 of Proceedings of Machine Learning Research, 195–206. Bellevue, Washington, USA: PMLR.

[Shafaei, Schmidt, and Little 2018] Shafaei, A.; Schmidt, M.; and Little, J. J. 2018. Does your model know the digit 6 is not a cat? A less biased evaluation of "outlier" detectors. CoRR abs/1809.04729.

[Simon and Rodner 2015] Simon, M., and Rodner, E. 2015. Neural activation constellations: Unsupervised part model discovery with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, 1143–1151.

[Simonyan and Zisserman 2014] Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.

[Simonyan, Vedaldi, and Zisserman 2014] Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2014. Deep inside convolutional networks: Visualising Image Classification Models and Saliency Maps. In International Conference on Learning Representations (ICLR) Workshop.

[Sundararajan, Taly, and Yan 2017] Sundararajan, M.; Taly, A.; and Yan, Q. 2017. Axiomatic Attribution for Deep Networks. In Precup, D., and Teh, Y. W., eds., Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, 3319–3328. International Convention Centre, Sydney, Australia: PMLR.

[Yan et al. 2015] Yan, Z.; Zhang, H.; Piramuthu, R.; Jagadeesh, V.; DeCoste, D.; Di, W.; and Yu, Y. 2015. Hd-cnn: hierarchical deep convolutional neural networks for large scale visual recognition. In Proceedings of the IEEE international conference on computer vision, 2740–2748.

[Zhang et al. 2016] Zhang, H.; Hu, Z.; Deng, Y.; Sachan, M.; Yan, Z.; and Xing, E. P. 2016. Learning concept taxonomies from multi-modal data. CoRR abs/1606.09239.

[Zhou et al. 2016] Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; and Torralba, A. 2016. Learning deep features for discriminative localization. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, 2921–2929. IEEE.

[Zhou et al. 2018] Zhou, B.; Sun, Y.; Bau, D.; and Torralba, A. 2018. Interpretable basis decomposition for visual explanation. In Proceedings of the European Conference on Computer Vision (ECCV), 119–134.

[Zhu and Bain 2017] Zhu, X., and Bain, M. 2017. B-CNN: branch convolutional neural network for hierarchical classification. CoRR abs/1709.09890.