Object Boundary Detection and Classification with Image-level Labels
Authors: Jing Yu Koh, Wojciech Samek, Klaus-Robert Müller, Alexander Binder
Jing Yu Koh (1), Wojciech Samek (2), Klaus-Robert Müller (3,4), Alexander Binder (1)

1 ISTD Pillar, Singapore University of Technology and Design, Singapore
2 Department of Video Coding & Analytics, Fraunhofer Heinrich Hertz Institute, Berlin, Germany
3 Department of Computer Science, TU Berlin, Germany
4 Department of Brain and Cognitive Engineering, Korea University, Seoul, Republic of Korea

Abstract. Semantic boundary and edge detection aims at simultaneously detecting object edge pixels in images and assigning class labels to them. Systematic training of predictors for this task requires the labeling of edges in images, which is a particularly tedious task. We propose a novel strategy for solving this task when pixel-level annotations are not available, performing it in an almost zero-shot manner by relying on conventional whole-image neural net classifiers that were trained using large bounding boxes. Our method performs the following two steps at test time. Firstly, it predicts the class labels by applying the trained whole-image network to the test images. Secondly, it computes pixel-wise scores from the obtained predictions by applying backprop gradients as well as recent visualization algorithms such as deconvolution and layer-wise relevance propagation. We show that high pixel-wise scores are indicative of the location of semantic boundaries, which suggests that the semantic boundary problem can be approached without using edge labels during the training phase.

1 Introduction

Neural net based predictors achieve excellent results in many data-driven tasks, examples among the newer being [6,15,14,10,17], while others such as video detection or machine translation [21,8] are equally impressive.
Rather than extending neural networks to a new application, we focus here on the question whether a neural network can solve problems which are harder than the one for which the network was trained. In particular, we consider the task of semantic boundary detection, which we aim to solve without appropriately fine-grained training labels. The problem of semantic boundary detection (SBD) [5] can be defined as the simultaneous detection of object edge pixels and the assignment of class labels to such edge pixels in images. Recently, the work of [3,16,24,25] showed substantial improvement using neural nets; however, the approach relied on end-to-end training with a dataset for which semantic boundary labels were available.

When trying to build a predictor for SBD, practitioners face the problem that the classical inductive machine learning paradigm requires creating a dataset with semantic boundary labels, that is, for each image a subset of pixels corresponding to object edges is labeled with class indices. Creating such labelling is a particularly tedious task, unlike labelling whole images or drawing bounding boxes, both of which can be done very quickly. The best proof for this difficulty is the fact that we are aware of only one truly semantic boundary dataset [5]. Note that SBD is different from contour detection tasks [23], which aim at finding contours of objects without assigning class labels to them. In that sense the scope of our proposed work is different from unsupervised contour detection as in [13]. The main question in this paper is to what extent it is possible to solve the semantic boundary or edge detection task without having appropriately fine-grained labels, i.e., pixel-level ground truth, which are required for classical training paradigms.
We do not intend to replace the usage of pixel-wise boundary labels when they are available. We aim at use cases in which pixel-wise boundary labels are not available during the training phase. One example of using weaker annotations for semantic boundary detection is [9], where bounding box labels are used to learn semantic boundaries. We propose a novel strategy to tackle a problem requiring fine-grained labels, namely semantic boundary detection, with a classifier trained for image classification using only image-wise labels. For that we use neural nets that classify an image, and apply existing visualization methods that are able to assign class-specific scores to single pixels. These class-specific pixel scores can then be used to define semantic boundary predictions.

The contribution of this paper is as follows. We demonstrate that classifier visualization methods are useful beyond producing nice-to-look-at images, namely for approaching prediction tasks on the pixel level in the absence of appropriately fine-grained training labels. As an example, we apply and evaluate the performance of classifier visualization methods on the SBD task. We show that these visualization methods can be used for producing quantifiably meaningful predictions at a higher spatial resolution than the labels which were the basis for training the classifiers. We discuss the shortcomings of such approaches when compared to the proper training paradigm that makes use of pixel-level labels. We do not expect such methods to beat baselines that employ the proper training paradigm and thus use pixel-level labels during training, but rather aim at the practitioner's case in which fine-grained training data is too costly in terms of money or time.
2 Obtaining Pixel-level Scores from Image-wise Predictions

In the following we introduce the methods that we will use for producing pixel-level scores without pixel-level labels during training time. It is common to all these methods that they take a classifier prediction $f_c(x)$ on an image $x$ and produce scores $s_c(p)$ for pixels $p \in x$. Suppose we have classifiers $f_c(x)$ for multiple classes $c$. Then we can tackle the SBD problem by (1) classifying an image, i.e., determining those classes that are present in the image, and (2) computing pixel-wise scores for those classes using one of the following methods.

2.1 Gradient

Probably the most obvious idea to tackle the SBD problem is to run a forward prediction with a classifier, and compute the gradient for each pixel. Let $x$ be an input image, $f_1, \ldots, f_C$ be the $C$ outputs of a multi-class classifier and $x_p$ be the $p$-th pixel. Computing pixel-wise scores for a class $c$ and pixel $p$ can be achieved using

    s(p) = \left\| \frac{\partial f_c}{\partial x_p}(x) \right\|_2    (1)

The norm here runs over the partial derivatives for the $(r,g,b)$-subpixels of a pixel $p$. Alternatively, one can sum up the subpixel scores in order to obtain a pixel score. Using gradients for visualizing sensitivities of neural networks has been shown in [20]. A high score in this sense indicates that the output $f_c$ has high sensitivity under small changes of the input $x_p$, i.e., there exists a direction in the tangent space located at $x$ for which the slope of the classifier $f_c$ is very high. In order to see the impact of partial derivatives, consider the case of a simple linear mapping that takes subpixels $x_{p,s}$ of pixel $p$ as input.
    f(x) = \sum_p \sum_{s \in \{r,g,b\}} w_{p,s} \, x_{p,s}    (2)

In this case backpropagation combined with an $\ell_2$-norm yields:

    s(p) = \left( w_{p,r}^2 + w_{p,g}^2 + w_{p,b}^2 \right)^{1/2}    (3)

Note that the input $x_{p,s}$, and in particular its sign, plays no role in a visualization achieved by backpropagation, although obviously the sign of $x_{p,s}$ does matter for deciding whether to detect an object ($f(x) > 0$) or not ($f(x) < 0$). This is a limiting factor when one aims to explain which pixels are relevant for the prediction $f(x) > 0$.

2.2 Deconvolution

Deconvolution [26] is an alternative method to compute pixel-wise scores. Starting with scores given at the top of a convolutional layer, it applies the transposed filter weights to compute scores at the bottom of the same layer. Another important feature is used in max-pooling layers, where scores from the top are distributed down to the input that yielded the maximum value in the max pooling. Consider the linear mapping case again. Then deconvolution in the sense of multiplying the transposed weights $w$ (as it is, for example, implemented in the Caffe package) yields for subpixel $s$ of pixel $p$

    s(p,s) = f_c(x) \, w_{p,s}    (4)

This score can be summed across subpixels, or one can again take an $\ell_p$-norm. When using summation across subpixels, deconvolution is proportional to the prediction $f_c(x)$; in particular it expresses the dominating terms $w_{p,s} x_{p,s} \approx f_c(x)$ correctly, which contribute most to the prediction $f(x)$.

2.3 Layer-wise Relevance Propagation

Layer-wise Relevance Propagation (LRP) [2] is a principled method for explaining neural network predictions in terms of pixel-wise scores. LRP reversely propagates a numerical quantity, named relevance, in a way that preserves the sum of the total relevance at each layer.
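Before turning to LRP in detail, the linear-case contrast between the gradient score of Eq. (3) and the deconvolution score of Eq. (4) can be made concrete with a small numpy sketch (our illustration, not code from the paper): the gradient score depends only on the weights and is blind to the sign of the input, while the summed deconvolution score is proportional to the prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "classifier" f(x) = sum_{p,s} w[p,s] * x[p,s]
# over 4 pixels with 3 (r,g,b) subpixels each, as in Eq. (2).
w = rng.normal(size=(4, 3))
x = rng.uniform(size=(4, 3))
f_x = float(np.sum(w * x))

# Gradient score, Eq. (3): l2-norm of the partial derivatives over subpixels.
# For a linear map the gradient equals w, so the score ignores the input x.
grad_scores = np.linalg.norm(w, axis=1)

# Deconvolution score, Eq. (4): the output score times the transposed weights,
# here summed across subpixels to obtain one score per pixel.
deconv_scores = (f_x * w).sum(axis=1)

# Flipping the input flips the prediction and hence the deconvolution scores,
# while the gradient scores stay the same -- the sign-blindness noted above.
f_neg = float(np.sum(w * (-x)))
assert np.isclose(f_neg, -f_x)
assert np.allclose((f_neg * w).sum(axis=1), -deconv_scores)
```

The summed deconvolution scores add up to $f(x) \sum_{p,s} w_{p,s}$, so they inherit the sign of the prediction, which the gradient norm discards.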
The relevance is initialized at the output as the prediction score $f_c(x)$ and propagated down to the inputs (i.e., pixels), so that the relevance conservation property holds at each layer

    f_c(x) = \ldots = \sum_j R_j^{(l+1)} = \ldots = \sum_i R_i^{(l)} = \ldots = \sum_p R_p^{(1)}    (5)

where $\{R_j^{(l+1)}\}$ and $\{R_i^{(l)}\}$ denote the relevances at layers $l+1$ and $l$, respectively, and $\{R_p^{(1)}\}$ represents the pixel-wise relevance scores. Let us consider the neural network as a feed-forward graph of elementary computational units (neurons), each of them realizing a simple function of the type

    x_j^{(l+1)} = g\!\left( \sum_i x_i^{(l)} w_{ij}^{(l,l+1)} + b_j^{(l+1)} \right), \quad \text{e.g. } g(z) = \max(0, z)    (6)

where $j$ denotes a neuron at a particular layer $l+1$, and where $\sum_i$ runs over all lower-layer neurons connected to neuron $j$. $w_{ij}^{(l,l+1)}$ and $b_j^{(l+1)}$ are the parameters of a neuron. The prediction of a deep neural network is obtained by computing these neurons in a feed-forward pass. Conversely, [2] have shown that the same graph structure can be used to redistribute the relevance $f(x)$ at the output of the network onto pixel-wise relevance scores $\{R_p^{(1)}\}$, by using the local redistribution rule

    R_i^{(l)} = \sum_j \frac{z_{ij}}{\sum_{i'} z_{i'j}} R_j^{(l+1)} \quad \text{with} \quad z_{ij} = x_i^{(l)} w_{ij}^{(l,l+1)}    (7)

where $i$ indexes a neuron at a particular layer $l$, and where $\sum_j$ runs over all upper-layer neurons to which neuron $i$ contributes. Application of this rule in a backward pass produces a relevance map (heatmap) that satisfies the desired conservation property.

We consider two other LRP algorithms introduced in [2], namely the $\epsilon$-variant and the $\beta$-variant. The first rule is given by:

    R_i^{(l)} = \sum_j \frac{z_{ij}}{\sum_{i'} z_{i'j} + \epsilon \, \mathrm{sign}\!\left(\sum_{i'} z_{i'j}\right)} R_j^{(l+1)}    (8)

Here, for $\epsilon > 0$ the conservation idea is relaxed in order to gain better numerical stability.
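As an illustration (ours, not the authors' toolbox implementation [12]), one LRP backward step through a dense layer can be sketched in numpy as follows, implementing the redistribution rule of Eq. (7) with the $\epsilon$-stabilizer of Eq. (8) as an option; biases are omitted, as in Eq. (7):

```python
import numpy as np

rng = np.random.default_rng(1)

def lrp_linear(x_lower, W, R_upper, eps=0.0):
    """One LRP backward step through a dense layer (Eq. (7); eps > 0 gives Eq. (8)).

    z_ij = x_i * w_ij is the contribution of input i to neuron j; the relevance
    R_j of each upper-layer neuron is redistributed proportionally to the z_ij.
    """
    z = x_lower[:, None] * W                 # contributions z_ij
    denom = z.sum(axis=0)                    # sum_i' z_i'j per upper neuron j
    denom = denom + eps * np.sign(denom)     # epsilon stabilizer of Eq. (8)
    return (z / denom[None, :] * R_upper[None, :]).sum(axis=1)

# Tiny two-layer ReLU network: x -> h = relu(x W1) -> f = h W2 (single output).
x = rng.uniform(size=5)
W1 = rng.normal(size=(5, 4))
W2 = rng.normal(size=(4, 1))
h = np.maximum(0.0, x @ W1)
f = h @ W2                                   # prediction score f_c(x)

# Backward relevance pass, initialized with f at the output as in Eq. (5).
R_h = lrp_linear(h, W2, f)
R_x = lrp_linear(x, W1, R_h)

# Conservation: for eps = 0 the total relevance is preserved at every layer.
assert np.isclose(R_x.sum(), f.item())
```

For `eps = 0` the assertion reflects the conservation property of Eq. (5) exactly; with `eps > 0` a small amount of relevance is absorbed by the stabilizer.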
The second formula is given by:

    R_i^{(l)} = \sum_j \left( \alpha \cdot \frac{z_{ij}^+}{\sum_{i'} z_{i'j}^+} + \beta \cdot \frac{z_{ij}^-}{\sum_{i'} z_{i'j}^-} \right) R_j^{(l+1)}    (9)

Here, $z_{ij}^+$ and $z_{ij}^-$ denote the positive and negative parts of $z_{ij}$, respectively, such that $z_{ij}^+ + z_{ij}^- = z_{ij}$. We enforce $\alpha + \beta = 1$, $\alpha > 0$, $\beta \leq 0$ in order for the relevance propagation equations to be conservative layer-wise. Note that for $\alpha = 1$ this redistribution rule is equivalent (for ReLU nonlinearities $g$) to the $z^+$-rule of [18].

In contrast to the gradient, LRP recovers the natural decomposition of a linear mapping

    f(x) = \sum_{i=1}^{D} w_i x_i    (10)

i.e., the pixel-level score

    R_i = w_i x_i    (11)

not only depends on whether the classifier reacts to this input dimension ($w_i > 0$), but also on whether that feature is actually present ($x_i > 0$). An implementation of LRP can be found in [12].

3 Experiments

We perform the experiments on the SBD dataset with a Pascal VOC multilabel classifier from [11] that is available in the BVLC model zoo of the Caffe [7] package. This classifier was trained using the 4 edge crops and the center crops of the ground truth bounding boxes of the Pascal VOC dataset [4]. We do not use pixel labels at training time; however, for evaluation at test time we use the pixel-wise ground truth, in order to be able to compare all methods quantitatively. As in [5], we report the maximal F-score and the average precision on the pixel level of an image. We stick to the same convention regarding counting true positives in a neighborhood as introduced in [5].

3.1 Performance on the SBD task

Table 1 shows the average precision (AP) scores for all methods.

Table 1. Average precision (AP) and maximal F-score (MF) of various methods to compute pixel-wise scores from whole-image classifiers without pixel labels at training time, compared against the original method InverseDetectors [5] and boundary detection using neural nets, HFL [3]. Only the last two use pixel labels at training time; all others use no pixel-level labels during training. Grad denotes Gradient, Deconv denotes [26], and epsilon and beta refer to the LRP variants given in equations (8) and (9), taken from [2].

training phase:  image-level labels                                       pixel-level labels
Method:   Gradient  Deconv  β = 0  β = −1  ε = 1  ε = 0.01                InvDet [5]  HFL [3]
AP        22.5      25.0    28.4   27.3    31.4   31.2                    19.9        54.6
MF        31.0      33.3    35.1   34.1    38.0   38.1                    28.0        62.5

We can see from the table that the neural-network based method [3], which uses pixel-level ground truth at training time, performs best by a large margin. Methods that do not employ pixel-level labels at training time perform far worse. However, we can see a certain surprise: all the methods perform better than the method [5] for semantic boundary detection that was the best baseline before the work of [3] replaced it. Note that [5] as well as [3] relies on pixel-wise labels during training, whereas the proposed methods require only image-wise labels. This result gives a realistic comparison of how well methods for pixel-wise prediction without pixel labels in the training phase can perform.

The pixel-wise scores for LRP are computed by summing over subpixels. For Gradient and Deconvolution, using the negative sum over subpixels performed better than using the sum or the $\ell_2$-norm. In both cases negative pixel scores were set to zero. This follows our experience with Deconvolution and LRP that wave-like low-level image filters, which are typically present in deep neural nets, receive equally wave-like scores with positive and negative half-waves. Removing the negative half-waves improves the prediction quality.

Table 2 shows the comparison of AP scores for various ways to compute a pixel-wise score from subpixel scores.

Table 2. Comparison of various ways to combine subpixel scores into a pixel-wise score.

subpixel aggregation method   sum    sum of negative scores   l2-norm
Gradient AP                   22.0   22.5                     18.8
Deconvolution AP              22.9   25.0                     21.9

Note that we do not show the $\ell_2$-norm or the summed negative scores for the LRP methods, as LRP preserves the sign of the prediction and thus using the sum of negative scores or the $\ell_2$-norm has no meaningful interpretation for LRP.

3.2 Shortcomings of visualization methods

Semantic boundaries are not the most relevant regions for the decision of the above-mentioned classifiers trained on images of natural scenes. This does not devaluate models trained on shapes. It merely says that, given RGB images of natural scenes as input, the above object class predictors put considerable weight on internal edges and textures rather than outer boundaries, an effect which can be observed in the heatmaps in Figures 1 and 2. This is our primary hypothesis for why the visualization methods above partially mismatch the semantic boundaries. We demonstrate this hypothesis quantitatively by an experiment. For this we need to introduce a measure of relevance of a set of pixels which is independent of the computed visualizations.

Perturbation analysis. We can measure the relevance of a set of pixels $S \subset x$ of an image $x$ for the prediction of a classifier by replacing the pixels in this set by some values, and comparing the prediction on the modified image $\tilde{x}_S$ against the prediction score $f(x)$ for the original image [19] (a similar approach has been applied to text in [1]).
This idea follows the intuition that most random perturbations in a region that is important for classification will lead to a decline of the prediction score for the image as a whole: $f(\tilde{x}_S) < f(x)$. It is clear that there exist perturbations of a region that yield an increase of the prediction score: for example, a change that follows the gradient direction locally. Thus we will draw many perturbations of the set $S$ from a random distribution $P$ and measure an approximation of the expected decline of the prediction

    m = f(x) - \mathbb{E}_{S \sim P}\left[ f(\tilde{x}_S) \right]    (12)

We intend to measure the expected decrease for the set $S$ being the ground truth pixels for the SBD task, and compare it against the set of highest-scoring pixels. For a fair comparison the set of highest-scoring pixels is limited to have the same size as the number of ground truth pixels. The highest-scoring pixels are defined by the pixel-wise scores from the above methods. We will show that the expected decrease is higher for the pixel-wise scores, which indicates that ground truth pixels representing semantic boundaries are not the most relevant subset for the classifier prediction.

The experiment to demonstrate this is designed as follows. For each test image and each ground truth class we take the set of ground truth pixels and randomly perturb them. For an $(r,g,b)$-pixel we draw the values from a uniform distribution on $[0,1]^3 \subset \mathbb{R}^3$. For each image and present class of the semantic boundary task ground truth we repeat 200 random perturbations of the set in order to compute an approximation to Equation (12). We compute the average over all images to obtain the average decrease on ground truth pixels, $m_{GT}$. $m_{GT}$ is an average measure of relevance of the ground truth pixels. It is to be compared against the analogous quantity $m_V$ derived from the top-scoring pixels of a visualization method.
For a given visualization method $V \in \{Gradient, Deconv, LRP\text{-}\beta, LRP\text{-}\epsilon\}$, we define the set of pixels to be perturbed as the pixels with the highest pixel-wise scores computed from the visualization method. The size of this set is the same as the number of ground truth pixels of the semantic boundary task for the same image and class. Running the same perturbation idea according to Equation (12) on this set yields a measure $m_V$ of the average decrease of the classifier prediction that is specific to the most relevant pixels of the given visualization method.

Table 3. Comparison of the averaged prediction scores. $f(x)$ denotes the average prediction for the unperturbed images for all ground truth classes. $m_{GT}$ denotes the average prediction for images with perturbed ground truth pixels. $m_{Deconv}$ and $m_{LRP,\epsilon=1}$ denote the average prediction for images with perturbed highest-scoring pixels having the same cardinality as the ground truth pixels, using Deconvolution and LRP.

f_c(x)       m_GT          m_Deconv      m_LRP,eps=1
10.20 ± 0    7.73 ± 0.36   5.68 ± 0.38   1.73 ± 0.34

Table 3 shows the results of the comparison. Note that we take the ground truth in the image that has been resized to match the receptive field of the neural net (227 × 227), and apply one step of classical morphological thickening; this thickened ground truth set is what we use. The standard deviation was computed for the 200 random perturbations and averaged over images and classes. We can see from the table that the decrease is stronger for the visualization methods compared to the ground truth pixels. This holds for Deconvolution as well as for LRP. The pixels highlighted by these methods are more relevant for the classifier prediction, even though they disagree with the boundary pixel labels.
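The perturbation measure of Eq. (12) can be sketched as follows. This is a self-contained toy example with a synthetic stand-in classifier, not the VOC network used in the paper; it shows that perturbing a pixel set the classifier actually relies on produces a larger expected decline than perturbing an equally sized set it ignores.

```python
import numpy as np

rng = np.random.default_rng(2)

def expected_decline(f, x, mask, n_samples=200):
    """Approximate Eq. (12): m = f(x) - E[f(x~_S)] for a fixed pixel set S.

    `mask` is a boolean (H, W) array marking the set S; each of the n_samples
    perturbations redraws the (r,g,b) values of S uniformly from [0,1]^3.
    """
    scores = []
    for _ in range(n_samples):
        x_pert = x.copy()
        x_pert[mask] = rng.uniform(size=(int(mask.sum()), 3))
        scores.append(f(x_pert))
    return f(x) - float(np.mean(scores))

# Toy stand-in "classifier": responds to total intensity inside a fixed region.
H = W = 8
relevant = np.zeros((H, W), dtype=bool)
relevant[2:6, 2:6] = True                      # 16 pixels the classifier uses
f = lambda img: float(img[relevant].sum())

x = np.full((H, W, 3), 0.9)

# An equally sized, disjoint pixel set that the classifier ignores.
irrelevant = np.zeros((H, W), dtype=bool)
irrelevant[0:2, :] = True                      # also 16 pixels

m_relevant = expected_decline(f, x, relevant)
m_irrelevant = expected_decline(f, x, irrelevant)
assert m_relevant > m_irrelevant               # larger decline on relevant pixels
```

In the paper's experiment, `mask` would be either the thickened ground-truth boundary pixels (yielding $m_{GT}$) or the top-scoring heatmap pixels of equal cardinality (yielding $m_V$).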
In summary, this supports our initially stated hypothesis that boundary pixels are not the most relevant for classification, and our explanation for why these methods partially mismatch the set of boundary ground truth labels.

We can support this numerical observation also by example images. We can observe two error cases. Firstly, the pixel-wise predictions may miss semantic boundaries that are deemed to be less discriminative for the classification task. This adds to the false negatives. Secondly, the pixel-wise predictions may assign high scores to pixels that are relevant for the classification of an object and lie inside the object. Figure 1 shows some examples. We can clearly see false negatives and false positives in these examples, for example for the car and LRP-$\epsilon = 1$, where the window regions are deemed to be highly relevant for the classifier decision, but the outer boundary on the car top is considered irrelevant, which is a bad result with respect to boundary detection. For the cat, most of the methods focus on its face rather than the cat boundaries. The bird is an example where deconvolution gives a good result. For the people with the boats, the heatmap is shown for the people class. In this example LRP-$\epsilon = 1$ focuses correctly and most selectively on the people, same as for the tiny car example.

We can observe from these figures a common sparsity of the pixel-wise prediction methods. This motivates why we did not aim at solving segmentation tasks with these methods.

Fig. 1. Heatmaps of pixel-wise scores compared against the ground truth. From left to right: original image, pixel-level ground truth, gradient (negative scores), Deconvolution (negative scores), LRP with β = 0 and with ε = 1.
Finally, we remark that this sparsity is not an artefact of the particular deep neural network from [11] tuned for PASCAL VOC. Figure 2 shows the same effect for the GoogleNet Reference Classifier [22] of the Caffe package [7]. As an example, for the wolf, parts of the body on the right have missing boundaries. Indeed, this part is not very discriminative. A similar interpretation can be made for the lower right side of the dog, which has a strong image gradient but not much dog-specific evidence.

Fig. 2. Heatmaps of pixel-wise scores computed for the GoogleNet Reference Classifier of the Caffe package show the sparsity of pixel-wise prediction methods. The used classes were: Timber wolf, Bernese mountain dog and Ram. Left column: image as it enters the deep neural net. Middle: pixel-wise scores computed by LRP with ε = 1. Right: pixel-wise scores computed by LRP with β = 0.

4 Conclusion

We presented here several methods for zero-shot learning of semantic boundary detection and evaluated them quantitatively. These methods are useful when pixel-level labels are unavailable at training time, and they perform reasonably against the previous state of the art. It would be interesting to evaluate these methods on other datasets with class-specifically labeled edges, should such datasets become available in the future. Furthermore, we have shown that classifier visualization methods [20,26,2] have applications beside pure visualization due to their property of computing predictions at a finer scale.

References

1. Arras, L., Horn, F., Montavon, G., Müller, K.R., Samek, W.: Explaining predictions of non-linear classifiers in NLP. In: Proc. of the 1st Workshop on Representation Learning for NLP. pp. 1–7. Association for Computational Linguistics (2016)
2. Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.R., Samek, W.: On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE 10(7), e0130140 (2015)
3. Bertasius, G., Shi, J., Torresani, L.: High-for-low and low-for-high: Efficient boundary detection from deep object features and its applications to high-level vision. In: IEEE ICCV. pp. 504–512 (2015)
4. Everingham, M., Van Gool, L., Williams, C., Winn, J., Zisserman, A.: The pascal visual object classes (VOC) challenge. IJCV 88(2), 303–338 (2010)
5. Hariharan, B., Arbelaez, P., Bourdev, L.D., Maji, S., Malik, J.: Semantic contours from inverse detectors. In: IEEE ICCV. pp. 991–998 (2011)
6. Hariharan, B., Arbeláez, P.A., Girshick, R.B., Malik, J.: Hypercolumns for object segmentation and fine-grained localization. In: IEEE CVPR. pp. 447–456 (2015)
7. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R.B., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. In: Proc. of the ACM Int. Conf. on Multimedia. pp. 675–678 (2014)
8. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Li, F.: Large-scale video classification with convolutional neural networks. In: IEEE CVPR. pp. 1725–1732 (2014)
9. Khoreva, A., Benenson, R., Omran, M., Hein, M., Schiele, B.: Weakly supervised object boundaries. In: IEEE CVPR (2016)
10. Koutník, J., Cuccu, G., Schmidhuber, J., Gomez, F.J.: Evolving large-scale neural networks for vision-based reinforcement learning. In: GECCO. pp. 1061–1068 (2013)
11. Lapuschkin, S., Binder, A., Montavon, G., Müller, K.R., Samek, W.: Analyzing classifiers: Fisher vectors and deep neural networks. In: IEEE CVPR. pp. 2912–2920 (2016)
12. Lapuschkin, S., Binder, A., Montavon, G., Müller, K.R., Samek, W.: The layer-wise relevance propagation toolbox for artificial neural networks. Journal of Machine Learning Research 17(114), 1–5 (2016)
13. Li, Y., Paluri, M., Rehg, J.M., Dollar, P.: Unsupervised learning of edges. In: IEEE CVPR (2016)
14. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: IEEE CVPR. pp. 3431–3440 (2015)
15. Malinowski, M., Rohrbach, M., Fritz, M.: Ask your neurons: A neural-based approach to answering questions about images. In: IEEE ICCV. pp. 1–9 (2015)
16. Maninis, K.K., Pont-Tuset, J., Arbelaez, P., Gool, L.V.: Convolutional oriented boundaries: From image segmentation to high-level tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence PP(99), 1–1 (2017)
17. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., Hassabis, D.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
18. Montavon, G., Bach, S., Binder, A., Samek, W., Müller, K.R.: Explaining nonlinear classification decisions with deep taylor decomposition. Pattern Recognition 65, 211–222 (2017)
19. Samek, W., Binder, A., Montavon, G., Lapuschkin, S., Müller, K.R.: Evaluating the visualization of what a deep neural network has learned. IEEE Transactions on Neural Networks and Learning Systems (2016)
20. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR abs/1312.6034 (2013)
21. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Adv. in NIPS. pp. 3104–3112 (2014)
22. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. CoRR abs/1409.4842 (2014)
23. Xie, S., Tu, Z.: Holistically-nested edge detection. International Journal of Computer Vision (2017), http://dx.doi.org/10.1007/s11263-017-1004-z
24. Yang, J., Price, B., Cohen, S., Lee, H., Yang, M.H.: Object contour detection with a fully convolutional encoder-decoder network. In: IEEE CVPR (2016)
25. Yu, Z., Feng, C., Liu, M.Y., Ramalingam, S.: CASENet: Deep category-aware semantic edge detection. ArXiv e-prints (2017)
26. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: ECCV. pp. 818–833 (2014)