Object Boundary Detection and Classification with Image-level Labels
Authors: Jing Yu Koh, Wojciech Samek, Klaus-Robert Müller, Alexander Binder
Jing Yu Koh (1), Wojciech Samek (2), Klaus-Robert Müller (3,4), Alexander Binder (1)

1 ISTD Pillar, Singapore University of Technology and Design, Singapore
2 Department of Video Coding & Analytics, Fraunhofer Heinrich Hertz Institute, Berlin, Germany
3 Department of Computer Science, TU Berlin, Germany
4 Department of Brain and Cognitive Engineering, Korea University, Seoul, Republic of Korea

Abstract. Semantic boundary and edge detection aims at simultaneously detecting object edge pixels in images and assigning class labels to them. Systematic training of predictors for this task requires the labeling of edges in images, which is a particularly tedious task. We propose a novel strategy for solving this task when pixel-level annotations are not available, performing it in an almost zero-shot manner by relying on conventional whole-image neural net classifiers that were trained using large bounding boxes. Our method performs the following two steps at test time. Firstly, it predicts the class labels by applying the trained whole-image network to the test images. Secondly, it computes pixel-wise scores from the obtained predictions by applying backprop gradients as well as recent visualization algorithms such as deconvolution and layer-wise relevance propagation. We show that high pixel-wise scores are indicative of the location of semantic boundaries, which suggests that the semantic boundary problem can be approached without using edge labels during the training phase.

1 Introduction

Neural net based predictors achieve excellent results in many data-driven tasks, examples among the newer being [6,15,14,10,17], while others such as video detection or machine translation [21,8] are equally impressive.
Rather than extending neural networks to a new application, we focus here on the question whether a neural network can solve problems which are harder than the one for which the network was trained. In particular, we consider the task of semantic boundary detection, which we aim to solve without appropriately fine-grained training labels. The problem of semantic boundary detection (SBD) [5] can be defined as the simultaneous detection of object edge pixels and the assignment of class labels to such edge pixels in images. Recently, the work of [3,16,24,25] showed substantial improvement using neural nets; however, the approach relied on end-to-end training with a dataset for which semantic boundary labels were available.

When trying to build a predictor for SBD, practitioners face the problem that the classical inductive machine learning paradigm requires creating a dataset with semantic boundary labels, that is, for each image a subset of pixels corresponding to object edges is labeled with class indices. Creating such labelling is a particularly tedious task, unlike labelling whole images or drawing bounding boxes, both of which can be done very quickly. The best proof for this difficulty is the fact that we are aware of only one truly semantic boundary dataset [5]. Note that SBD is different from contour detection tasks [23], which aim at finding contours of objects without assigning class labels to them. In that sense the scope of our proposed work is different from unsupervised contour detection as in [13]. The main question in this paper is to what extent it is possible to solve the semantic boundary or edge detection task without having appropriately fine-grained labels, i.e., pixel-level ground truth, which are required for classical training paradigms.
We do not intend to replace the usage of pixel-wise boundary labels when they are available. We aim at use cases in which pixel-wise boundary labels are not available during the training phase. One example of using weaker annotations for semantic boundary detection is [9], where bounding box labels are used to learn semantic boundaries. We propose a novel strategy to tackle a problem requiring fine-grained labels, namely semantic boundary detection, with a classifier trained for image classification using only image-wise labels. For that we use neural nets that classify an image, and apply existing visualization methods that are able to assign class-specific scores to single pixels. These class-specific pixel scores can then be used to define semantic boundary predictions.

The contribution of this paper is as follows. We demonstrate that classifier visualization methods are useful beyond producing nice-to-look-at images, namely for approaching prediction tasks on the pixel level in the absence of appropriately fine-grained training labels. As an example, we apply and evaluate the performance of classifier visualization methods on the SBD task. We show that these visualization methods can be used for producing quantifiably meaningful predictions at a higher spatial resolution than the labels which were the basis for training the classifiers. We discuss the shortcomings of such approaches when compared to the proper training paradigm that makes use of pixel-level labels. We do not expect such methods to beat baselines that employ the proper training paradigm and thus use pixel-level labels during training, but rather aim at the practitioner's case in which fine-grained training data is too costly in terms of money or time.
2 Obtaining Pixel-level Scores from Image-wise Predictions

In the following we introduce the methods that we will use for producing pixel-level scores without pixel-level labels during training time. It is common to all these methods that they take a classifier prediction $f_c(x)$ on an image $x$ and produce scores $s_c(p)$ for pixels $p \in x$. Suppose we have classifiers $f_c(x)$ for multiple classes $c$. Then we can tackle the SBD problem by (1) classifying an image, i.e., determining those classes that are present in the image, and (2) computing pixel-wise scores for those classes using one of the following methods.

2.1 Gradient

Probably the most obvious idea to tackle the SBD problem is to run a forward prediction with a classifier, and compute the gradient for each pixel. Let $x$ be an input image, $f_1, \ldots, f_C$ be the $C$ outputs of a multi-class classifier and $x_p$ be the $p$-th pixel. Computing pixel-wise scores for a class $c$ and pixel $p$ can be achieved using

    s(p) = \left\| \frac{\partial f_c}{\partial x_p}(x) \right\|_2    (1)

The norm here runs over the partial derivatives for the $(r,g,b)$-subpixels of a pixel $p$. Alternatively, one can sum up the subpixel scores in order to obtain a pixel score. Using gradients for visualizing sensitivities of neural networks has been shown in [20]. A high score in this sense indicates that the output $f_c$ has high sensitivity under small changes of the input $x_p$, i.e., there exists a direction in the tangent space located at $x$ for which the slope of the classifier $f_c$ is very high. In order to see the impact of partial derivatives, consider the case of a simple linear mapping that takes subpixels $x_{p,s}$ of pixel $p$ as input.
    f(x) = \sum_p \sum_{s \in \{r,g,b\}} w_{p,s} \, x_{p,s}    (2)

In this case backpropagation combined with an $\ell_2$-norm yields:

    s(p) = \left( w_{p,r}^2 + w_{p,g}^2 + w_{p,b}^2 \right)^{1/2}    (3)

Note that the input $x_{p,s}$, and in particular its sign, plays no role in a visualization achieved by backpropagation, although obviously the sign of $x_{p,s}$ does matter for deciding whether to detect an object ($f(x) > 0$) or not ($f(x) < 0$). This is a limiting factor when one aims to explain which pixels are relevant for the prediction $f(x) > 0$.

2.2 Deconvolution

Deconvolution [26] is an alternative method to compute pixel-wise scores. Starting with scores given at the top of a convolutional layer, it applies the transposed filter weights to compute scores at the bottom of the same layer. Another important feature is used in max-pooling layers, where scores from the top are distributed down to the input that yielded the maximum value in the max pooling. Consider the linear mapping case again. Then deconvolution in the sense of multiplying the transposed weights $w$ (as it is, for example, implemented in the Caffe package) yields for subpixel $s$ of pixel $p$

    s(p,s) = f_c(x) \, w_{p,s}    (4)

This score can be summed across subpixels, or one can again take an $\ell_p$-norm. When using summation across subpixels, deconvolution is proportional to the prediction $f_c(x)$; in particular it expresses the dominating terms $w_{p,s} x_{p,s} \approx f_c(x)$ correctly, which contribute most to the prediction $f(x)$.

2.3 Layer-wise Relevance Propagation

Layer-wise Relevance Propagation (LRP) [2] is a principled method for explaining neural network predictions in terms of pixel-wise scores. LRP reversely propagates a numerical quantity, named relevance, in a way that preserves the sum of the total relevance at each layer.
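Before turning to LRP in detail, the linear-case contrast between the gradient score of Eq. (3) and the deconvolution score of Eq. (4) can be made concrete with a small numpy sketch (our illustration, not code from the paper): the gradient score depends only on the weights and is blind to the sign of the input, while the summed deconvolution score is proportional to the prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "classifier" f(x) = sum_{p,s} w[p,s] * x[p,s]
# over 4 pixels with 3 (r,g,b) subpixels each, as in Eq. (2).
w = rng.normal(size=(4, 3))
x = rng.uniform(size=(4, 3))
f_x = float(np.sum(w * x))

# Gradient score, Eq. (3): l2-norm of the partial derivatives over subpixels.
# For a linear map the gradient equals w, so the score ignores the input x.
grad_scores = np.linalg.norm(w, axis=1)

# Deconvolution score, Eq. (4): the output score times the transposed weights,
# here summed across subpixels to obtain one score per pixel.
deconv_scores = (f_x * w).sum(axis=1)

# Flipping the input flips the prediction and hence the deconvolution scores,
# while the gradient scores stay the same -- the sign-blindness noted above.
f_neg = float(np.sum(w * (-x)))
assert np.isclose(f_neg, -f_x)
assert np.allclose((f_neg * w).sum(axis=1), -deconv_scores)
```

The summed deconvolution scores add up to $f(x) \sum_{p,s} w_{p,s}$, so they inherit the sign of the prediction, which the gradient norm discards.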
The relevance is initialized at the output as the prediction score $f_c(x)$ and propagated down to the inputs (i.e., pixels), so that the relevance conservation property holds at each layer

    f_c(x) = \ldots = \sum_j R_j^{(l+1)} = \ldots = \sum_i R_i^{(l)} = \ldots = \sum_p R_p^{(1)}    (5)

where $\{R_j^{(l+1)}\}$ and $\{R_i^{(l)}\}$ denote the relevances at layers $l+1$ and $l$, respectively, and $\{R_p^{(1)}\}$ represents the pixel-wise relevance scores. Let us consider the neural network as a feed-forward graph of elementary computational units (neurons), each of them realizing a simple function of the type

    x_j^{(l+1)} = g\!\left( \sum_i x_i^{(l)} w_{ij}^{(l,l+1)} + b_j^{(l+1)} \right), \quad \text{e.g. } g(z) = \max(0, z)    (6)

where $j$ denotes a neuron at a particular layer $l+1$, and where $\sum_i$ runs over all lower-layer neurons connected to neuron $j$. $w_{ij}^{(l,l+1)}$ and $b_j^{(l+1)}$ are the parameters of a neuron. The prediction of a deep neural network is obtained by computing these neurons in a feed-forward pass. Conversely, [2] have shown that the same graph structure can be used to redistribute the relevance $f(x)$ at the output of the network onto pixel-wise relevance scores $\{R_p^{(1)}\}$, by using the local redistribution rule

    R_i^{(l)} = \sum_j \frac{z_{ij}}{\sum_{i'} z_{i'j}} R_j^{(l+1)} \quad \text{with} \quad z_{ij} = x_i^{(l)} w_{ij}^{(l,l+1)}    (7)

where $i$ indexes a neuron at a particular layer $l$, and where $\sum_j$ runs over all upper-layer neurons to which neuron $i$ contributes. Application of this rule in a backward pass produces a relevance map (heatmap) that satisfies the desired conservation property.

We consider two other LRP algorithms introduced in [2], namely the $\epsilon$-variant and the $\beta$-variant. The first rule is given by:

    R_i^{(l)} = \sum_j \frac{z_{ij}}{\sum_{i'} z_{i'j} + \epsilon \, \mathrm{sign}\!\left(\sum_{i'} z_{i'j}\right)} R_j^{(l+1)}    (8)

Here, for $\epsilon > 0$ the conservation idea is relaxed in order to gain better numerical stability.
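As an illustration (ours, not the authors' toolbox implementation [12]), one LRP backward step through a dense layer can be sketched in numpy as follows, implementing the redistribution rule of Eq. (7) with the $\epsilon$-stabilizer of Eq. (8) as an option; biases are omitted, as in Eq. (7):

```python
import numpy as np

rng = np.random.default_rng(1)

def lrp_linear(x_lower, W, R_upper, eps=0.0):
    """One LRP backward step through a dense layer (Eq. (7); eps > 0 gives Eq. (8)).

    z_ij = x_i * w_ij is the contribution of input i to neuron j; the relevance
    R_j of each upper-layer neuron is redistributed proportionally to the z_ij.
    """
    z = x_lower[:, None] * W                 # contributions z_ij
    denom = z.sum(axis=0)                    # sum_i' z_i'j per upper neuron j
    denom = denom + eps * np.sign(denom)     # epsilon stabilizer of Eq. (8)
    return (z / denom[None, :] * R_upper[None, :]).sum(axis=1)

# Tiny two-layer ReLU network: x -> h = relu(x W1) -> f = h W2 (single output).
x = rng.uniform(size=5)
W1 = rng.normal(size=(5, 4))
W2 = rng.normal(size=(4, 1))
h = np.maximum(0.0, x @ W1)
f = h @ W2                                   # prediction score f_c(x)

# Backward relevance pass, initialized with f at the output as in Eq. (5).
R_h = lrp_linear(h, W2, f)
R_x = lrp_linear(x, W1, R_h)

# Conservation: for eps = 0 the total relevance is preserved at every layer.
assert np.isclose(R_x.sum(), f.item())
```

For `eps = 0` the assertion reflects the conservation property of Eq. (5) exactly; with `eps > 0` a small amount of relevance is absorbed by the stabilizer.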
The second formula is given by:

    R_i^{(l)} = \sum_j \left( \alpha \cdot \frac{z_{ij}^+}{\sum_{i'} z_{i'j}^+} + \beta \cdot \frac{z_{ij}^-}{\sum_{i'} z_{i'j}^-} \right) R_j^{(l+1)}    (9)

Here, $z_{ij}^+$ and $z_{ij}^-$ denote the positive and negative parts of $z_{ij}$, respectively, such that $z_{ij}^+ + z_{ij}^- = z_{ij}$. We enforce $\alpha + \beta = 1$, $\alpha > 0$, $\beta \leq 0$ in order for the relevance propagation equations to be conservative layer-wise. Note that for $\alpha = 1$ this redistribution rule is equivalent (for ReLU nonlinearities $g$) to the $z^+$-rule of [18].

In contrast to the gradient, LRP recovers the natural decomposition of a linear mapping

    f(x) = \sum_{i=1}^{D} w_i x_i    (10)

i.e., the pixel-level score

    R_i = w_i x_i    (11)

not only depends on whether the classifier reacts to this input dimension ($w_i > 0$), but also on whether that feature is actually present ($x_i > 0$). An implementation of LRP can be found in [12].

3 Experiments

We perform the experiments on the SBD dataset with a Pascal VOC multilabel classifier from [11] that is available in the BVLC model zoo of the Caffe [7] package. This classifier was trained using the 4 edge crops and the center crops of the ground truth bounding boxes of the Pascal VOC dataset [4]. We do not use pixel labels at training time; however, for evaluation at test time we use the pixel-wise ground truth, in order to be able to compare all methods quantitatively. As in [5], we report the maximal F-score and the average precision on the pixel level of an image. We stick to the same convention regarding counting true positives in a neighborhood as introduced in [5].

3.1 Performance on the SBD task

Table 1 shows the average precision (AP) scores for all methods.

Table 1. Average precision (AP) and maximal F-score (MF) of various methods to compute pixel-wise scores from whole-image classifiers without pixel labels at training time, compared against the original method InverseDetectors [5] and boundary detection using neural nets, HFL [3]. Only the last two use pixel labels at training time; all others use no pixel-level labels during training. Grad denotes Gradient, Deconv denotes [26], and epsilon and beta refer to the LRP variants given in equations (8) and (9), taken from [2].

training phase:  image-level labels                                       pixel-level labels
Method:   Gradient  Deconv  β = 0  β = −1  ε = 1  ε = 0.01                InvDet [5]  HFL [3]
AP        22.5      25.0    28.4   27.3    31.4   31.2                    19.9        54.6
MF        31.0      33.3    35.1   34.1    38.0   38.1                    28.0        62.5

We can see from the table that the neural-network based method [3], which uses pixel-level ground truth at training time, performs best by a large margin. Methods that do not employ pixel-level labels at training time perform far worse. However, we can see a certain surprise: all the methods perform better than the method [5] for semantic boundary detection that was the best baseline before the work of [3] replaced it. Note that [5] as well as [3] relies on pixel-wise labels during training, whereas the proposed methods require only image-wise labels. This result gives a realistic comparison of how well methods for pixel-wise prediction without pixel labels in the training phase can perform.

The pixel-wise scores for LRP are computed by summing over subpixels. For Gradient and Deconvolution, using the negative sum over subpixels performed better than using the sum or the $\ell_2$-norm. In both cases negative pixel scores were set to zero. This follows our experience with Deconvolution and LRP that wave-like low-level image filters, which are typically present in deep neural nets, receive equally wave-like scores with positive and negative half-waves. Removing the negative half-waves improves the prediction quality.

Table 2 shows the comparison of AP scores for various ways to compute a pixel-wise score from subpixel scores.

Table 2. Comparison of various ways to combine subpixel scores into a pixel-wise score.

subpixel aggregation method   sum    sum of negative scores   l2-norm
Gradient AP                   22.0   22.5                     18.8
Deconvolution AP              22.9   25.0                     21.9

Note that we do not show the $\ell_2$-norm or the summed negative scores for the LRP methods, as LRP preserves the sign of the prediction and thus using the sum of negative scores or the $\ell_2$-norm has no meaningful interpretation for LRP.

3.2 Shortcomings of visualization methods

Semantic boundaries are not the most relevant regions for the decision of the above-mentioned classifiers trained on images of natural scenes. This does not devaluate models trained on shapes. It merely says that, given RGB images of natural scenes as input, the above object class predictors put considerable weight on internal edges and textures rather than outer boundaries, an effect which can be observed in the heatmaps in Figures 1 and 2. This is our primary hypothesis for why the visualization methods above partially mismatch the semantic boundaries. We demonstrate this hypothesis quantitatively by an experiment. For this we need to introduce a measure of relevance of a set of pixels which is independent of the computed visualizations.

Perturbation analysis. We can measure the relevance of a set of pixels $S \subset x$ of an image $x$ for the prediction of a classifier by replacing the pixels in this set by some values, and comparing the prediction on the modified image $\tilde{x}_S$ against the prediction score $f(x)$ for the original image [19] (a similar approach has been applied to text in [1]).
This idea follows the intuition that most random perturbations in a region that is important for classification will lead to a decline of the prediction score for the image as a whole: $f(\tilde{x}_S) < f(x)$. It is clear that there exist perturbations of a region that yield an increase of the prediction score: for example, a change that follows the gradient direction locally. Thus we will draw many perturbations of the set $S$ from a random distribution $P$ and measure an approximation of the expected decline of the prediction

    m = f(x) - \mathbb{E}_{S \sim P}\left[ f(\tilde{x}_S) \right]    (12)

We intend to measure the expected decrease for the set $S$ being the ground truth pixels for the SBD task, and compare it against the set of highest-scoring pixels. For a fair comparison the set of highest-scoring pixels is limited to have the same size as the number of ground truth pixels. The highest-scoring pixels are defined by the pixel-wise scores from the above methods. We will show that the expected decrease is higher for the pixel-wise scores, which indicates that ground truth pixels representing semantic boundaries are not the most relevant subset for the classifier prediction.

The experiment to demonstrate this is designed as follows. For each test image and each ground truth class we take the set of ground truth pixels and randomly perturb them. For an $(r,g,b)$-pixel we draw the values from a uniform distribution on $[0,1]^3 \subset \mathbb{R}^3$. For each image and present class of the semantic boundary task ground truth we repeat 200 random perturbations of the set in order to compute an approximation to Equation (12). We compute the average over all images to obtain the average decrease on ground truth pixels, $m_{GT}$. $m_{GT}$ is an average measure of relevance of the ground truth pixels. It is to be compared against the analogous quantity $m_V$ derived from the top-scoring pixels of a visualization method.
For a given visualization method $V \in \{Gradient, Deconv, LRP\text{-}\beta, LRP\text{-}\epsilon\}$, we define the set of pixels to be perturbed as the pixels with the highest pixel-wise scores computed from the visualization method. The size of this set is the same as the number of ground truth pixels of the semantic boundary task for the same image and class. Running the same perturbation idea according to Equation (12) on this set yields a measure $m_V$ of the average decrease of the classifier prediction that is specific to the most relevant pixels of the given visualization method.

Table 3. Comparison of the averaged prediction scores. $f(x)$ denotes the average prediction for the unperturbed images for all ground truth classes. $m_{GT}$ denotes the average prediction for images with perturbed ground truth pixels. $m_{Deconv}$ and $m_{LRP,\epsilon=1}$ denote the average prediction for images with perturbed highest-scoring pixels having the same cardinality as the ground truth pixels, using Deconvolution and LRP.

f_c(x)       m_GT          m_Deconv      m_LRP,eps=1
10.20 ± 0    7.73 ± 0.36   5.68 ± 0.38   1.73 ± 0.34

Table 3 shows the results of the comparison. Note that we take the ground truth in the image that has been resized to match the receptive field of the neural net (227 × 227), and apply one step of classical morphological thickening; this thickened ground truth set is what we use. The standard deviation was computed for the 200 random perturbations and averaged over images and classes. We can see from the table that the decrease is stronger for the visualization methods compared to the ground truth pixels. This holds for Deconvolution as well as for LRP. The pixels highlighted by these methods are more relevant for the classifier prediction, even though they disagree with the boundary pixel labels.
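The perturbation measure of Eq. (12) can be sketched as follows. This is a self-contained toy example with a synthetic stand-in classifier, not the VOC network used in the paper; it shows that perturbing a pixel set the classifier actually relies on produces a larger expected decline than perturbing an equally sized set it ignores.

```python
import numpy as np

rng = np.random.default_rng(2)

def expected_decline(f, x, mask, n_samples=200):
    """Approximate Eq. (12): m = f(x) - E[f(x~_S)] for a fixed pixel set S.

    `mask` is a boolean (H, W) array marking the set S; each of the n_samples
    perturbations redraws the (r,g,b) values of S uniformly from [0,1]^3.
    """
    scores = []
    for _ in range(n_samples):
        x_pert = x.copy()
        x_pert[mask] = rng.uniform(size=(int(mask.sum()), 3))
        scores.append(f(x_pert))
    return f(x) - float(np.mean(scores))

# Toy stand-in "classifier": responds to total intensity inside a fixed region.
H = W = 8
relevant = np.zeros((H, W), dtype=bool)
relevant[2:6, 2:6] = True                      # 16 pixels the classifier uses
f = lambda img: float(img[relevant].sum())

x = np.full((H, W, 3), 0.9)

# An equally sized, disjoint pixel set that the classifier ignores.
irrelevant = np.zeros((H, W), dtype=bool)
irrelevant[0:2, :] = True                      # also 16 pixels

m_relevant = expected_decline(f, x, relevant)
m_irrelevant = expected_decline(f, x, irrelevant)
assert m_relevant > m_irrelevant               # larger decline on relevant pixels
```

In the paper's experiment, `mask` would be either the thickened ground-truth boundary pixels (yielding $m_{GT}$) or the top-scoring heatmap pixels of equal cardinality (yielding $m_V$).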
In summary, this supports our initially stated hypothesis that boundary pixels are not the most relevant for classification, and our explanation for why these methods partially mismatch the set of boundary ground truth labels.

We can support this numerical observation also by example images. We can observe two error cases. Firstly, the pixel-wise predictions may miss semantic boundaries that are deemed to be less discriminative for the classification task. This adds to the false negatives. Secondly, the pixel-wise predictions may assign high scores to pixels that are relevant for the classification of an object and lie inside the object. Figure 1 shows some examples. We can clearly see false negatives and false positives in these examples, for example for the car and LRP-$\epsilon = 1$, where the window regions are deemed to be highly relevant for the classifier decision, but the outer boundary on the car top is considered irrelevant, which is a bad result with respect to boundary detection. For the cat, most of the methods focus on its face rather than the cat boundaries. The bird is an example where deconvolution gives a good result. For the people with the boats, the heatmap is shown for the people class. In this example LRP-$\epsilon = 1$ focuses correctly and most selectively on the people, same as for the tiny car example.

We can observe from these figures a common sparsity of the pixel-wise prediction methods. This motivates why we did not aim at solving segmentation tasks with these methods.

Fig. 1. Heatmaps of pixel-wise scores compared against the ground truth. From left to right: original image, pixel-level ground truth, gradient (negative scores), Deconvolution (negative scores), LRP with β = 0 and with ε = 1.
Finally, we remark that this sparsity is not an artefact of the particular deep neural network from [11] tuned for PASCAL VOC. Figure 2 shows the same effect for the GoogleNet Reference Classifier [22] of the Caffe package [7]. As an example, for the wolf, parts of the body on the right have missing boundaries. Indeed, this part is not very discriminative. A similar interpretation can be made for the lower right side of the dog, which has a strong image gradient but not much dog-specific evidence.

Fig. 2. Heatmaps of pixel-wise scores computed for the GoogleNet Reference Classifier of the Caffe package show the sparsity of pixel-wise prediction methods. The used classes were: Timber wolf, Bernese mountain dog and Ram. Left column: image as it enters the deep neural net. Middle: pixel-wise scores computed by LRP with ε = 1. Right: pixel-wise scores computed by LRP with β = 0.

4 Conclusion

We presented here several methods for zero-shot learning of semantic boundary detection and evaluated them quantitatively. These methods are useful when pixel-level labels are unavailable at training time, and they perform reasonably against the previous state of the art. It would be interesting to evaluate these methods on other datasets with class-specifically labeled edges, should such datasets become available in the future. Furthermore, we have shown that classifier visualization methods [20,26,2] have applications beside pure visualization due to their property of computing predictions at a finer scale.

References

1. Arras, L., Horn, F., Montavon, G., Müller, K.R., Samek, W.: Explaining predictions of non-linear classifiers in NLP. In: Proc. of the 1st Workshop on Representation Learning for NLP. pp. 1–7. Association for Computational Linguistics (2016)
2. Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.R., Samek, W.: On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE 10(7), e0130140 (2015)
3. Bertasius, G., Shi, J., Torresani, L.: High-for-low and low-for-high: Efficient boundary detection from deep object features and its applications to high-level vision. In: IEEE ICCV. pp. 504–512 (2015)
4. Everingham, M., Van Gool, L., Williams, C., Winn, J., Zisserman, A.: The pascal visual object classes (VOC) challenge. IJCV 88(2), 303–338 (2010)
5. Hariharan, B., Arbelaez, P., Bourdev, L.D., Maji, S., Malik, J.: Semantic contours from inverse detectors. In: IEEE ICCV. pp. 991–998 (2011)
6. Hariharan, B., Arbeláez, P.A., Girshick, R.B., Malik, J.: Hypercolumns for object segmentation and fine-grained localization. In: IEEE CVPR. pp. 447–456 (2015)
7. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R.B., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. In: Proc. of the ACM Int. Conf. on Multimedia. pp. 675–678 (2014)
8. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Li, F.: Large-scale video classification with convolutional neural networks. In: IEEE CVPR. pp. 1725–1732 (2014)
9. Khoreva, A., Benenson, R., Omran, M., Hein, M., Schiele, B.: Weakly supervised object boundaries. In: IEEE CVPR (2016)
10. Koutník, J., Cuccu, G., Schmidhuber, J., Gomez, F.J.: Evolving large-scale neural networks for vision-based reinforcement learning. In: GECCO. pp. 1061–1068 (2013)
11. Lapuschkin, S., Binder, A., Montavon, G., Müller, K.R., Samek, W.: Analyzing classifiers: Fisher vectors and deep neural networks. In: IEEE CVPR. pp. 2912–2920 (2016)
12. Lapuschkin, S., Binder, A., Montavon, G., Müller, K.R., Samek, W.: The layer-wise relevance propagation toolbox for artificial neural networks. Journal of Machine Learning Research 17(114), 1–5 (2016)
13. Li, Y., Paluri, M., Rehg, J.M., Dollar, P.: Unsupervised learning of edges. In: IEEE CVPR (2016)
14. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: IEEE CVPR. pp. 3431–3440 (2015)
15. Malinowski, M., Rohrbach, M., Fritz, M.: Ask your neurons: A neural-based approach to answering questions about images. In: IEEE ICCV. pp. 1–9 (2015)
16. Maninis, K.K., Pont-Tuset, J., Arbelaez, P., Gool, L.V.: Convolutional oriented boundaries: From image segmentation to high-level tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence PP(99), 1–1 (2017)
17. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., Hassabis, D.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
18. Montavon, G., Bach, S., Binder, A., Samek, W., Müller, K.R.: Explaining nonlinear classification decisions with deep taylor decomposition. Pattern Recognition 65, 211–222 (2017)
19. Samek, W., Binder, A., Montavon, G., Lapuschkin, S., Müller, K.R.: Evaluating the visualization of what a deep neural network has learned. IEEE Transactions on Neural Networks and Learning Systems (2016)
20. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR abs/1312.6034 (2013)
21. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Adv. in NIPS. pp. 3104–3112 (2014)
22. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. CoRR abs/1409.4842 (2014)
23. Xie, S., Tu, Z.: Holistically-nested edge detection. International Journal of Computer Vision (2017), http://dx.doi.org/10.1007/s11263-017-1004-z
24. Yang, J., Price, B., Cohen, S., Lee, H., Yang, M.H.: Object contour detection with a fully convolutional encoder-decoder network. In: IEEE CVPR (2016)
25. Yu, Z., Feng, C., Liu, M.Y., Ramalingam, S.: CASENet: Deep category-aware semantic edge detection. ArXiv e-prints (2017)
26. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: ECCV. pp. 818–833 (2014)