How to Explain Individual Classification Decisions
David Baehrens* (baehrens@cs.tu-berlin.de), Technische Universität Berlin, Franklinstr. 28/29, FR 6-9, 10587 Berlin, Germany
Timon Schroeter* (timon@cs.tu-berlin.de), Technische Universität Berlin, Franklinstr. 28/29, FR 6-9, 10587 Berlin, Germany
Stefan Harmeling* (stefan.harmeling@tuebingen.mpg.de), MPI for Biological Cybernetics, Spemannstr. 38, 72076 Tübingen, Germany
Motoaki Kawanabe (motoaki.kawanabe@first.fraunhofer.de), Fraunhofer Institute FIRST.IDA, Kekulestr. 7, 12489 Berlin, Germany, and Technische Universität Berlin
Katja Hansen (khansen@cs.tu-berlin.de), Technische Universität Berlin
Klaus-Robert Müller (klaus-robert.mueller@tu-berlin.de), Technische Universität Berlin

Editor: Carl Edward Rasmussen

Abstract

After building a classifier with modern tools of machine learning we typically have a black box at hand that is able to predict well for unseen data. Thus, we get an answer to the question what is the most likely label of a given unseen data point. However, most methods will provide no answer why the model predicted the particular label for a single instance and what features were most influential for that particular instance. The only methods that are currently able to provide such explanations are decision trees. This paper proposes a procedure which (based on a set of assumptions) allows us to explain the decisions of any classification method.

Keywords: explaining, nonlinear, black box model, kernel methods, Ames mutagenicity

*The first three authors contributed equally.

1. Introduction

Automatic nonlinear classification is a common and powerful tool in data analysis.
Machine learning research has created methods that are practically useful and that can classify unseen data after being trained on a limited training set of labeled examples. Nevertheless, most of the algorithms do not explain their decisions. In practical data analysis, however, it is essential to obtain an instance-based explanation, i.e., we would like to understand what input features made the nonlinear machine give its answer for each individual data point.

Typically, explanations are provided jointly for all instances of the training set; for example, feature selection methods (including Automatic Relevance Determination) find out which inputs are salient for good generalization (for a review see Guyon and Elisseeff, 2003). While this can give a coarse impression of the global usefulness of each input dimension, it is still an ensemble view and does not provide an answer on an instance basis.¹ In the neural network literature, algorithms such as input pruning likewise take a solely ensemble view (e.g., Bishop, 1995; LeCun, Bottou, Orr, and Müller, 1998). The only classification method that does provide individual explanations is the decision tree (e.g., Hastie, Tibshirani, and Friedman, 2001).

This paper proposes a simple framework that provides local explanation vectors applicable to any classification method, in order to help understand prediction results for single data instances. The local explanation yields the features that are relevant for the prediction at the very points of interest in the data space, and it is able to spot local peculiarities which are neglected in the global view, e.g., due to cancellation effects.

The paper is organized as follows: We define local explanation vectors as class probability gradients in Section 2 and give an illustration for Gaussian Process Classification (GPC).
Some methods output a prediction without a direct probability interpretation. For these we propose in Section 3 a way to estimate local explanations. In Section 4 we apply our methodology to learn distinguishing properties of Iris flowers by estimating explanation vectors for a k-NN classifier applied to the classic Iris data set. Section 5 discusses how our approach, applied to an SVM classifier, allows us to explain how the digit "two" is distinguished from the digit "eight" in the USPS data set. In Section 6 we discuss a more real-world application scenario where the proposed explanation capabilities prove useful in drug discovery: human experts regularly decide how to modify existing lead compounds in order to obtain new compounds with improved properties. Models capable of explaining predictions can help in the process of choosing promising modifications. Our automatically generated explanations match chemical domain knowledge about toxifying functional groups of the compounds in question. Section 7 contrasts our approach with related work and Section 8 discusses characteristic properties and limitations of our approach, before we conclude the paper in Section 9.

1. This point is illustrated in Figure 1 (Section 2). Applying feature selection methods to the training set (a) will lead to the (correct) conclusion that both dimensions are equally important for accurate classification. As an alternative to this ensemble view, one may ask: which features (or combinations thereof) are most influential in the vicinity of each particular instance? As can be seen in Figure 1 (c), the answer depends on where the respective instance is located. On the hypotenuse and at the corners of the triangle, both features contribute jointly, whereas along each of the remaining two edges the classification depends almost completely on just one of the features.
2. Definitions of Explanation Vectors

In this section we give definitions for our approach of local explanation vectors in the classification setting. We start with a theoretical definition for multi-class Bayes classification and then give a specialized definition that is more practical for the binary case.

For the multi-class case, suppose we are given data points x_1, …, x_n ∈ ℝ^d with labels y_1, …, y_n ∈ {1, …, C}, and we intend to learn a function that predicts the labels of unlabeled data points. Assuming that the data can be modeled as being IID-sampled from some unknown joint distribution P(X, Y), in theory we can define the Bayes classifier

g*(x) = arg min_{c ∈ {1,…,C}} P(Y ≠ c | X = x),

which is optimal for the 0-1 loss function (see Devroye, Györfi, and Lugosi, 1996). For the Bayes classifier we define the explanation vector of a data point x_0 to be the derivative with respect to x, at x = x_0, of the conditional probability of Y ≠ g*(x_0) given X = x. Formally:

Definition 1

ζ(x_0) := (∂/∂x) P(Y ≠ g*(x_0) | X = x) |_{x = x_0}

Note that ζ(x_0) is a d-dimensional vector just like x_0. The classifier g* partitions the data space ℝ^d into up to C parts on which g* is constant. We assume that the conditional distribution P(Y = c | X = x) is first-order differentiable w.r.t. x for all classes c and over the entire input space. For instance, this assumption holds if P(X = x | Y = c) is first-order differentiable in x for all c and the supports of the class densities overlap around the border of every neighboring pair in the partition induced by the Bayes classifier. The vector ζ(x_0) defines on each of those parts a vector field that characterizes the flow away from the corresponding class.
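Definition 1 can be made concrete with a small numerical sketch. The setup below (two equal-prior one-dimensional Gaussian classes) is entirely our own toy example, not one of the experiments in this paper; its posterior has a closed form, so ζ can be approximated by a finite difference of P(Y ≠ g*(x_0) | X = x):

```python
import math

# Toy setup (assumed densities): two equal-prior 1-D Gaussian classes
# N(-1, 1) and N(+1, 1). The posterior of class 1 is sigmoid(2x).

def posterior(c, x):
    """P(Y = c | X = x) for classes c in {0, 1}."""
    p1 = 1.0 / (1.0 + math.exp(-2.0 * x))
    return p1 if c == 1 else 1.0 - p1

def bayes_class(x):
    """g*(x) = arg min_c P(Y != c | X = x), i.e. the most probable class."""
    return max((0, 1), key=lambda c: posterior(c, x))

def explanation(x0, h=1e-6):
    """zeta(x0): finite-difference derivative of P(Y != g*(x0) | X = x) at x = x0."""
    c = bayes_class(x0)
    wrong = lambda x: 1.0 - posterior(c, x)
    return (wrong(x0 + h) - wrong(x0 - h)) / (2.0 * h)

print(bayes_class(-0.5), explanation(-0.5))
```

At x_0 = −0.5 the Bayes class is 0 and ζ(x_0) > 0: moving to the right flows away from class 0 towards class 1, matching the "flow away from the corresponding class" reading above.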
Thus entries in ζ(x_0) with large absolute values highlight features that influence the class label decision at x_0. A positive sign of such an entry implies that increasing that feature would lower the probability that x_0 is assigned to g*(x_0). Ignoring the orientations of the explanation vectors, ζ forms a continuously changing (orientation-less) vector field along which the class labels change. This vector field lets us locally understand the Bayes classifier. We remark that ζ(x_0) becomes a zero vector, e.g., when P(Y ≠ g*(x_0) | X = x) is constant in some neighborhood of x_0. Our explanation vector fits well to probabilistic classifiers such as Gaussian Process Classification (GPC), where the conditional distribution P(Y = c | X = x) is usually not completely flat in any region. For deterministic classifiers, despite this issue, Parzen window estimators with appropriate widths (Section 3) can provide meaningful explanation vectors for many samples in practice (see also Section 8).

For the case of binary classification, we directly define local explanation vectors as local gradients of the probability function p(x) = P(Y = 1 | X = x) of the learned model for the positive class. So, for a probability function p : ℝ^d → [0, 1] of a classification model learned from examples {(x_1, y_1), …, (x_n, y_n)} ∈ ℝ^d × {−1, +1}, the explanation vector for a classified test point x_0 is the local gradient of p at x_0:

Definition 2

η_p(x_0) := ∇p(x) |_{x = x_0}

By this definition the explanation η is again a d-dimensional vector just like the test point x_0.
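Definition 2 can be sketched numerically for a toy kernel model. Here we assume, purely for illustration, a probability p(x) obtained by logistic squashing of an RBF expansion f(x) = Σ_i α_i exp(−w‖x − x_i‖²) with random stand-in weights (the paper's GPC experiments use a probit link instead); the analytic gradient of f follows the RBF derivative formula used later in this section, and η = p(1 − p)∇f by the chain rule:

```python
import numpy as np

# Toy model (our assumption): f(x) = sum_i alpha_i exp(-w ||x - x_i||^2),
# squashed through a logistic function as a stand-in probability p(x).
rng = np.random.default_rng(0)
w = 0.5
X = rng.normal(size=(20, 3))      # stand-in training points x_1..x_n in R^3
alpha = rng.normal(size=20)       # stand-in learned weights

def f_and_grad(x0):
    d2 = ((x0 - X) ** 2).sum(axis=1)       # squared distances ||x0 - x_i||^2
    kvals = np.exp(-w * d2)
    f = alpha @ kvals
    # df/dx_{0,j} = -2w sum_i alpha_i exp(-w ||x0 - x_i||^2) (x_{0,j} - x_{i,j})
    grad_f = -2.0 * w * ((alpha * kvals)[:, None] * (x0 - X)).sum(axis=0)
    return f, grad_f

def explanation(x0):
    """eta(x0) = grad p(x)|_{x=x0} with p = logistic(f), so eta = p(1-p) grad f."""
    f, grad_f = f_and_grad(x0)
    p = 1.0 / (1.0 + np.exp(-f))
    return p * (1.0 - p) * grad_f

# Sanity check: the analytic gradient matches a central finite difference of p.
x0 = np.array([0.1, -0.2, 0.3])
eta = explanation(x0)
for j in range(3):
    e = np.zeros(3); e[j] = 1e-6
    p_plus = 1.0 / (1.0 + np.exp(-f_and_grad(x0 + e)[0]))
    p_minus = 1.0 / (1.0 + np.exp(-f_and_grad(x0 - e)[0]))
    assert abs(eta[j] - (p_plus - p_minus) / 2e-6) < 1e-6
print("eta(x0) =", np.round(eta, 4))
```

The sign and magnitude of each entry of η can then be read off exactly as described next.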
The sign of each of its individual entries indicates whether the prediction would increase or decrease when the corresponding feature of x_0 is increased locally, and each entry's absolute value gives the amount of influence on the change in prediction. As a vector, η gives the direction of steepest ascent from the test point towards higher probabilities for the positive class. For binary classification the negative version −η_p(x_0) indicates the changes in features needed to increase the probability for the negative class, which may be especially useful for points x_0 predicted to be in the positive class.

As an example we apply Definition 2 to model predictions learned by Gaussian Process Classification (GPC), see Rasmussen and Williams (2006). GPC is used here for three reasons:

(i) In our real-world application we are interested in classifying data from drug discovery, which is an area where Gaussian processes have proven to show state-of-the-art performance, see, e.g., Obrezanova, Csányi, Gola, and Segall (2007); Schroeter, Schwaighofer, Mika, ter Laak, Sülzle, Ganzer, Heinrich, and Müller (2007c); Schroeter, Schwaighofer, Mika, ter Laak, Sülzle, Ganzer, Heinrich, and Müller (2007a,b); Schwaighofer, Schroeter, Mika, Laub, ter Laak, Sülzle, Ganzer, Heinrich, and Müller (2007); Schwaighofer, Schroeter, Mika, Hansen, ter Laak, Lienau, Reichel, Heinrich, and Müller (2008); Obrezanova, Gola, Champness, and Segall (2008). It is natural to expect a model with high prediction accuracy on a complex problem to capture relevant structure of the data, which is worth explaining and may give domain-specific insights in addition to the values predicted. For an evaluation of the explaining capabilities of our approach on a complex problem from chemoinformatics see Section 6.
(ii) GPC models the class probability function used in Definition 2 directly. For other classification methods, such as Support Vector Machines, which do not provide a probability function as their output, we give in Section 3 an estimation method starting from Definition 1.

(iii) The local gradients of the probability function can be calculated analytically for differentiable kernels, as we discuss next.

Let f(x) = Σ_{i=1}^n α_i k(x, x_i) be a GP model trained on sample points x_1, …, x_n ∈ ℝ^d, where k is a kernel function and the α_i are the learned weights of the sample points. For a test point x_0 ∈ ℝ^d, let var_f(x_0) be the variance of f(x_0) under the GP posterior for f. Because the posterior cannot be calculated analytically for GP classification models, we used an approximation by expectation propagation (EP) (Kuss and Rasmussen, 2005). In the case of the probit likelihood term defined by the error function, the probability of the positive class p(x_0) can be computed easily from this approximated posterior as

p(x_0) = (1/2) erfc( −f(x_0) / (√2 · √(1 + var_f(x_0))) ),

where erfc denotes the complementary error function (see Equation 6 in Schwaighofer, Schroeter, Mika, Hansen, ter Laak, Lienau, Reichel, Heinrich, and Müller, 2008). Then the local gradient of p(x_0) is given by

∇p(x)|_{x=x_0} = ∇ (1/2) erfc( −f(x) / (√2 · √(1 + var_f(x))) ) |_{x=x_0}
  = ∇ (1/2) ( 1 − erf( −f(x) / (√2 · √(1 + var_f(x))) ) ) |_{x=x_0}
  = −(1/2) ∇ erf( −f(x) / (√2 · √(1 + var_f(x))) ) |_{x=x_0}
  = −( exp( −f(x_0)² / (2(1 + var_f(x_0))) ) / √π ) · ∇ ( −f(x) / (√2 · √(1 + var_f(x))) ) |_{x=x_0}
  = −( exp( −f(x_0)² / (2(1 + var_f(x_0))) ) / √π ) · ( −(1/√2) · ∇ ( f(x) / √(1 + var_f(x)) ) |_{x=x_0} )
  = ( exp( −f(x_0)² / (2(1 + var_f(x_0))) ) / √(2π) ) · ( ∇f(x)|_{x=x_0} / √(1 + var_f(x_0)) + f(x_0) · ∇var_f(x)|_{x=x_0} · (−1/2)(1 + var_f(x_0))^{−3/2} )
  = ( exp( −f(x_0)² / (2(1 + var_f(x_0))) ) / √(2π) ) · ( ∇f(x)|_{x=x_0} / √(1 + var_f(x_0)) − (1/2) · f(x_0) · (1 + var_f(x_0))^{−3/2} · ∇var_f(x)|_{x=x_0} ).

As a kernel function choose, e.g., the RBF kernel k(x_0, x_1) = exp(−w‖x_0 − x_1‖²), which has the derivative (∂/∂x_{0,j}) k(x_0, x_1) = −2w exp(−w‖x_0 − x_1‖²)(x_{0,j} − x_{1,j}) for j ∈ {1, …, d}. Then the elements of the local gradient ∇f(x)|_{x=x_0} are

∂f/∂x_{0,j} = −2w Σ_{i=1}^n α_i exp(−w‖x_0 − x_i‖²)(x_{0,j} − x_{i,j})   for j ∈ {1, …, d}.

For var_f(x_0) = k(x_0, x_0) − k_*^⊤ (K + Σ)^{−1} k_*, the elements of the derivative are given by²

∂var_f/∂x_{0,j} = (∂/∂x_{0,j}) k(x_0, x_0) − 2 k_*^⊤ (K + Σ)^{−1} (∂/∂x_{0,j}) k_*   for j ∈ {1, …, d}.

2. Here k_* = (k(x_0, x_1), …, k(x_0, x_n))^⊤ is the evaluation of the kernel function between the test point x_0 and every training point, and Σ is the diagonal matrix of the variance site parameters. For details see Rasmussen and Williams (2006, Chapter 3).
3. Hyperparameters were tuned by gradient ascent on the evidence.

Panel (a) of Figure 1 shows the training data of a simple object classification task and panel (b) shows the model learned using GPC.³ The data is labeled −1 for the blue points and +1 for the red points. As illustrated in panel (b), the model is a probability function for the positive class which gives every data point a probability of being in this
class.

Figure 1: Explaining simple object classification with Gaussian Processes. Panels: (a) object, (b) model, (c) local explanation vectors, (d) direction of explanation vectors.

Panel (c) shows the probability gradient of the model together with the local gradient explanation vectors. On the hypotenuse and at the corners of the triangle, explanations from both features interact towards the triangle class, while along the edges the importance of one of the two feature dimensions dominates. At the transition from the negative to the positive class, the length of the local gradient vectors reflects the increased importance of the relevant features. In panel (d) we see that explanations close to the edges of the plot (especially in the right-hand corner) point away from the positive class. However, panel (c) shows that their magnitude is very small. For a discussion of this issue, see Section 8.

3. Estimating Explanation Vectors

Several classification methods estimate the decision rule directly, which often has no interpretation as a probability function as used in Definition 2 in Section 2. For example, Support Vector Machines estimate a decision function of the form

f(x) = Σ_{i=1}^n α_i k(x_i, x) + b,   α_i, b ∈ ℝ.

Suppose we have two classes (each with one cluster) in one dimension (see Figure 2) and train an SVM with RBF kernel. For points outside the data clusters, f(x) tends to zero. Thus, the derivative of f(x) (shown as arrows above the curves) for points on the very left or on the very right side of the axis will point to the wrong side.
In the following, we will explain how explanations can be obtained for such classifiers.

Figure 2: Classifier output of an SVM (top) compared to p(y = 1 | x) (bottom).

In practice we do not have access to the true underlying distribution P(X, Y). Consequently, we have no access to the Bayes classifier as defined in Section 2. Instead we can apply sophisticated learning machinery like Support Vector Machines (Vapnik, 1995; Schölkopf and Smola, 2002; Müller, Mika, Rätsch, Tsuda, and Schölkopf, 2001) that estimates some classifier g that tries to mimic g*. For test data points z_1, …, z_m ∈ ℝ^d, which are assumed to be sampled from the same unknown distribution as the training data, g estimates labels g(z_1), …, g(z_m). Now, instead of trying to explain g*, to which we have no access, we will define explanation vectors that help us understand the classifier g on the test data points. Since we do not assume that we have access to some intermediate real-valued classifier output (of which g might be a thresholded version, and which further might not be an estimate of P(Y = c | X = x)), we suggest approximating g by another classifier ĝ whose form resembles the Bayes classifier. There are several choices for ĝ, e.g., GPC, logistic regression, and Parzen windows.⁴ In this paper we apply Parzen windows to the training points to estimate the weighted class densities P(Y = c) · P(X | Y = c),

p̂_σ(x, y = c) = (1/n) Σ_{i ∈ I_c} k_σ(x − x_i)   (1)

for the index set I_c = {i | g(x_i) = c} and with k_σ(z) being a Gaussian kernel, k_σ(z) = exp(−0.5 z^⊤ z / σ²) / √(2πσ²) (as always, other kernels are also possible).
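Equations (1) and (2) can be sketched directly in code. The data below is a hypothetical two-blob sample, and `g_labels` stands in for the labels g(x_i) assigned by the classifier to be mimicked; note that the Gaussian normalization constant cancels in the posterior ratio, so it only matters for the joint density itself:

```python
import numpy as np

def parzen_joint(x, X, g_labels, c, sigma):
    """Eq. (1): hat-p_sigma(x, y=c) = (1/n) sum_{i in I_c} k_sigma(x - x_i)."""
    n, d = X.shape
    diffs = x - X[g_labels == c]
    k = np.exp(-0.5 * (diffs ** 2).sum(axis=1) / sigma ** 2)
    k /= (2 * np.pi * sigma ** 2) ** (d / 2)   # d-dimensional Gaussian normalization
    return k.sum() / n

def parzen_posterior(x, X, g_labels, c, sigma):
    """Eq. (2): hat-p_sigma(y=c | x) = joint(c) / (joint(c) + joint(y != c))."""
    pc = parzen_joint(x, X, g_labels, c, sigma)
    pnc = sum(parzen_joint(x, X, g_labels, cc, sigma)
              for cc in np.unique(g_labels) if cc != c)
    return pc / (pc + pnc)

# Hypothetical 2-D data: two blobs, labeled 0 and 1 by some classifier g.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 0.5, (30, 2)), rng.normal(+1, 0.5, (30, 2))])
g_labels = np.array([0] * 30 + [1] * 30)
print(parzen_posterior(np.array([-1.0, -1.0]), X, g_labels, 0, sigma=0.5))
```

Deep inside the class-0 blob the estimated posterior for class 0 is close to one, as expected.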
This estimates P(Y = c | X = x) for all c,

p̂_σ(y = c | x) = p̂_σ(x, y = c) / ( p̂_σ(x, y = c) + p̂_σ(x, y ≠ c) ) = Σ_{i ∈ I_c} k_σ(x − x_i) / Σ_i k_σ(x − x_i),   (2)

and thus an estimate of the Bayes classifier (that mimics g),

ĝ_σ(x) = arg min_{c ∈ {1,…,C}} p̂_σ(y ≠ c | x).

This approach has the advantage that we can use our estimated classifier g to generate any amount of labeled data for constructing ĝ. The single hyper-parameter σ is chosen such that ĝ approximates g (which we want to explain), i.e.,

σ̂ := arg min_σ Σ_{j=1}^m I{ g(z_j) ≠ ĝ_σ(z_j) },

where I{·} is the indicator function. σ is assigned the constant value σ̂ from here on and omitted as a subscript. For ĝ it is straightforward to define explanation vectors:

Definition 3

ζ̂(z) := (∂/∂x) p̂(y ≠ g(z) | x) |_{x = z}
  = [ Σ_{i∉I_{g(z)}} k(z − x_i) · Σ_{i∈I_{g(z)}} k(z − x_i)(z − x_i) ] / [ σ² ( Σ_{i=1}^n k(z − x_i) )² ]
  − [ Σ_{i∉I_{g(z)}} k(z − x_i)(z − x_i) · Σ_{i∈I_{g(z)}} k(z − x_i) ] / [ σ² ( Σ_{i=1}^n k(z − x_i) )² ],

which is easily derived using Eq. (2) and the derivative of Eq. (1); see Appendix A.2.1. Note that we use g instead of ĝ. This choice ensures that the orientation of ζ̂(z) fits the labels assigned by g, which allows better interpretations.

In summary, we imitate the classifier g, which we would like to explain locally, by a Parzen window classifier ĝ that has the same form as the Bayes estimator and for which we can thus easily estimate the explanation vectors using Definition 3. Practically, there are some caveats: the mimicking classifier ĝ has to be estimated from g even in high dimensions, and this needs to be done with care. However, in principle we have an arbitrary amount of training data available for constructing ĝ, since we may use our estimated classifier g to generate labeled data.

4.
For Support Vector Machines, Platt (1999) fits a sigmoid function to map the outputs to probabilities. In the following, we present a more general method for estimating explanation vectors.

4. Explaining Iris Flower Classification by k-Nearest Neighbors

The Iris flower data set (introduced in Fisher, 1936) describes 150 flowers from the genus Iris by 4 features: sepal length, sepal width, petal length, and petal width, which are easily measured properties of certain leaves of the corolla of the flower. There are three clusters in the data which correspond to three different species: Iris setosa, Iris virginica, and Iris versicolor. Let us consider the problem of classifying the data points of Iris versicolor (class 0) against the other two species (class 1). We applied standard classification machinery to this problem, as detailed in the following:

• Class 0 consists of all examples of Iris versicolor.
• Class 1 consists of all examples of Iris setosa and Iris virginica.
• Randomly split the 150 data points into 100 training and 50 test examples.
• Normalize training and test set using the mean and variance of the training set.
• Apply k-nearest neighbor classification with k = 4 (chosen by leave-one-out cross validation on the training data).
• Training error is 3% (i.e., 3 mistakes in 100).
• Test error is 8% (i.e., 4 mistakes in 50).

In order to estimate explanation vectors we mimic the classification results with a Parzen window classifier. The best fit (3% error) is obtained with a kernel width of σ = 0.26 (chosen by leave-one-out cross validation on the training data). Since the explanation vectors live in the input space, we can visualize them with scatter plots of the initially measured features. The resulting explanations (i.e., vectors) for the test set are shown in Figure 3.
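The procedure above can be sketched end-to-end on synthetic stand-in data (the actual experiment uses the Iris set and finds σ = 0.26; the two-blob data, the grid of σ values, and the helper names below are our own assumptions): a k-NN classifier g is mimicked by a Parzen window classifier, σ is chosen to minimize disagreement with g on the test points, and ζ̂ is evaluated via the closed form of Definition 3.

```python
import numpy as np

# Synthetic stand-in for the experiment: two 2-D blobs instead of Iris.
rng = np.random.default_rng(2)
Xtr = np.vstack([rng.normal(-1, 0.6, (50, 2)), rng.normal(+1, 0.6, (50, 2))])
ytr = np.array([0] * 50 + [1] * 50)
Zte = np.vstack([rng.normal(-1, 0.6, (20, 2)), rng.normal(+1, 0.6, (20, 2))])

def knn(z, k=4):
    """The classifier g to be explained: majority vote of the k nearest neighbors."""
    idx = np.argsort(((z - Xtr) ** 2).sum(axis=1))[:k]
    return int(ytr[idx].mean() > 0.5)

g_tr = np.array([knn(x) for x in Xtr])   # labels g assigns to the training points

def zeta_hat(z, sigma):
    """Closed form of Definition 3 for the Parzen mimic of g."""
    k = np.exp(-0.5 * ((z - Xtr) ** 2).sum(axis=1) / sigma ** 2)
    inside = g_tr == knn(z)              # index set I_{g(z)}
    s_in, s_out = k[inside].sum(), k[~inside].sum()
    v_in = (k[inside, None] * (z - Xtr[inside])).sum(axis=0)
    v_out = (k[~inside, None] * (z - Xtr[~inside])).sum(axis=0)
    return (s_out * v_in - v_out * s_in) / (sigma ** 2 * k.sum() ** 2)

def parzen_label(z, sigma):
    k = np.exp(-0.5 * ((z - Xtr) ** 2).sum(axis=1) / sigma ** 2)
    return int(k[g_tr == 1].sum() > k[g_tr == 0].sum())

# Choose sigma so the Parzen mimic agrees with g on the test points.
sigmas = [0.1, 0.2, 0.3, 0.5, 0.8]
sigma = min(sigmas, key=lambda s: sum(parzen_label(z, s) != knn(z) for z in Zte))
zeta = zeta_hat(Zte[0], sigma)
print("sigma =", sigma, " zeta(z_0) =", np.round(zeta, 3))
```

The closed form can be cross-checked against a finite difference of p̂(y ≠ g(z) | x), which is how we validated this sketch.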
The blue dots correspond to explanation vectors for Iris setosa and the red dots to those for Iris virginica (both class 1). Both groups of dots point to the green dots of Iris versicolor. The most important feature is the combination of petal length and petal width (see the corresponding panel), the product of which corresponds roughly to the area of the petals. However, the resulting explanations for the two species in class 1 are different:

• Iris setosa (class 1) differs from Iris versicolor (class 0) because its petal area is smaller.
• Iris virginica (class 1) differs from Iris versicolor (class 0) because its petal area is larger.

The dimensions of the sepal (another part of the blossom) are also relevant, but not as distinguishing.

Figure 3: Scatter plots of the explanation vectors for the test data. Shown are all explanation vectors for both classes: class 1, containing Iris setosa (shown in blue) and Iris virginica (shown in red), versus class 0, containing only one species, Iris versicolor (shown in green). Note that the explanation why an Iris flower is not an Iris versicolor is different for Iris setosa and Iris virginica.

Figure 4: USPS digits (training set): 'twos' (left) and 'eights' (right) with correct classification. For each digit from left to right: (i) explanation vector (with black being negative, white being positive), (ii) the original digit, (iii-end) artificial digits along the explanation vector towards the other class.

Figure 5: USPS digits (test set, bottom part): 'twos' (left) and 'eights' (right) with correct classification.
For each digit from left to right: (i) explanation vector (with black being negative, white being positive), (ii) the original digit, (iii-end) artificial digits along the explanation vector towards the other class.

5. Explaining USPS Digit Classification by Support Vector Machine

We now apply the framework for estimating explanation vectors to a high-dimensional data set, the USPS digits. The classification problem that we designed for illustration purposes is detailed in the following list:

• All digits are 16 × 16 images, which are reshaped to 256 × 1 dimensional column vectors.
• Classifier: SVM from Schwaighofer (2002) with RBF-kernel width σ = 1 and regularization constant C = 10 (chosen by grid search in cross validation on the training data).
• Training set: 47 'twos', 53 'eights'; training error 0.00.
• Test set: 48 'twos', 52 'eights'; test error 0.05.

We approximated the estimated class labels obtained by the SVM with the Parzen window classifier (Parzen window size σ = 10.2505, chosen by grid search in cross validation on the training data). The SVM and the Parzen window classifier disagreed on only 2% of the test examples, so a good fit was achieved.

Figures 4 and 5 show our results. All parts show three examples per row. For each example we display from left to right: (i) the explanation vector, (ii) the original digit, (iii-end) artificial digits along the explanation vector towards the other class.⁵ These artificial digits should help to understand and interpret the explanation vector. Let us first have a look at the results on the training set.

Figure 4 (left panel): Let us focus on the top example framed in red. The line that forms the 'two' is part of some 'eight' from the data set.
Thus the parts of the lines that are missing show up in the explanation vector: if the dark parts (which correspond to the missing lines) are added to the 'two' digit, then it will be classified as an 'eight'. In other words, because of the lack of those parts the digit was classified as a 'two' and not as an 'eight'. A similar explanation holds for the middle example framed in red in the same figure. Not all examples transform easily to 'eights': besides adding parts of black lines, some existing black spots (which the digit has because it is a 'two') must be removed. This is reflected in the explanation vector by white spots/lines. Curious is the bottom 'two' framed in red, which is actually a dash and is in the data set by mistake. However, its explanation vector shows nicely which parts have to be added and which have to be removed.

Figure 4 (right panel): We see similar results for the 'eights' class. The explanation vectors again tell us how the 'eights' must change to become classified as 'twos'. However, sometimes the transformation does not reach the 'twos'. This is probably due to the fact that some of the 'eights' are inside the cloud of 'eights'.

On the test set the explanation vectors are not as pronounced as on the training set. However, they show similar tendencies:

5. For the sake of simplicity, no intermediate updates were performed, i.e., artificial digits were generated by taking equal-sized steps in the direction given by the original explanation vector calculated for the original digit.

Figure 5 (left panel): We see the correctly classified 'twos'. Let us focus on the example framed in red. Again the explanation vector shows us how to edit the image of the 'two' to make it into one of the 'eights', i.e., exactly what parts of the digit have been important for the classification result.
For several other 'twos' the explanation vectors do not directly lead to the 'eights' but weight the different parts of the digits which have been relevant for the classification.

Figure 5 (right panel): Similarly to the training data, we see that these explanation vectors also do not bring all 'eights' to 'twos'. The explanation vectors mainly suggest removing most of the eight (the dark parts) and adding some strokes in the lower part (the light parts, which look like a white shadow).

Overall, our findings can be summarized as follows: the explanation vectors tell us how to edit our example digits to change the assigned class label. Hereby, we get a better understanding of the reasons why the chosen classifier classified the way it did.

6. Explaining Mutagenicity Classification by Gaussian Processes

In the following section we describe an application of our local gradient explanation methodology to a complex real-world data set. Our aim is to find structure specific to the problem domain that has not been fed into training explicitly but is captured implicitly by the GPC model in the high-dimensional feature space used to determine its prediction.

We investigate the task of predicting Ames mutagenic activity of chemical compounds. Not being mutagenic (i.e., not able to cause mutations in the DNA) is an important requirement for compounds under investigation in drug discovery and design. The Ames test (Ames, Gurney, Miller, and Bartsch, 1972) is a standard experimental setup for measuring mutagenicity. The following experiments are based on a set of Ames test results for 6512 chemical compounds that we published previously.⁶
GPC was applied as detailed in the following:

• Class 0 consists of non-mutagenic compounds.
• Class 1 consists of mutagenic compounds.
• Randomly split the 6512 data points into 2000 training and 4512 test examples such that:
  – The training set consists of equally many class 0 and class 1 examples.
  – For the steroid compound class the balance in the training and test set is enforced.
• 10 additional random splits were investigated individually. This confirmed the results presented below.
• Each example (chemical compound) is represented by a vector of counts of 142 molecular substructures calculated using the Dragon software (Todeschini, Consonni, Mauri, and Pavan, 2006).
• Normalize training and test set using the mean and variance of the training set.
• Apply a GPC model with RBF kernel.
• Performance (84% area under curve) confirms our previous results (Hansen, Mika, Schroeter, Sutter, Laak, Steger-Hartmann, Heinrich, and Müller, 2009). Error rates can be obtained from Figure 6.

6. See Hansen, Mika, Schroeter, Sutter, Laak, Steger-Hartmann, Heinrich, and Müller (2009) for results of modeling this set using different machine learning methods. The data itself is available online at http://ml.cs.tu-berlin.de/toxbenchmark

Figure 6: Receiver operating curve of the GPC model for mutagenicity prediction (AUC = 0.84).

Together with the prediction we calculated the explanation vector (as introduced in Section 2 with Definition 2) for each test point. The remainder of this section is an evaluation of these local explanations.
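The per-feature histograms discussed next can be produced from the matrix of explanation vectors alone. The sketch below uses a random stand-in matrix `E` and hypothetical feature names in place of the real 4512 × 142 gradients and Dragon descriptors:

```python
import numpy as np

# Stand-in for the matrix of explanation vectors: one row per test compound,
# one column per input feature (random data for illustration only).
rng = np.random.default_rng(3)
E = rng.normal(loc=[0.3, -0.2, 0.0], scale=0.1, size=(1000, 3))
names = ["toxicophore_A", "detoxicophore_B", "irrelevant_C"]  # hypothetical

for j, name in enumerate(names):
    # One histogram per feature over the corresponding explanation entries.
    counts, edges = np.histogram(E[:, j], bins=20, range=(-1, 1))
    m = counts.argmax()
    mode = 0.5 * (edges[m] + edges[m + 1])
    frac_pos = (E[:, j] > 0).mean()
    print(f"{name}: modal local gradient {mode:+.2f}, "
          f"{100 * frac_pos:.0f}% of test points have positive influence")
```

A feature whose histogram mass sits clearly above zero acts toxifying in the model's view, one below zero detoxifying, and one centered at zero is mostly irrelevant, which is exactly the reading applied to Figures 7 and 8.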
In Figures 7 and 8 we show the distribution of the local importance of selected features across the test set: for each input feature we generate a histogram of local importance values, as indicated by its corresponding entry in the explanation vector of each of the 4512 test compounds. The features examined in Figure 7 are counts of substructures known to cause mutagenicity. We show all approved "specific toxicophores" introduced by Kazius, McGuire, and Bursi (2005) that are also represented in the Dragon set of features. The features shown in Figure 8 are known to detoxify certain toxicophores (again see Kazius, McGuire, and Bursi, 2005). With the exception of 7(e), the toxicophores also have a toxifying influence according to our GPC prediction model. Feature 7(e) seems to be mostly irrelevant for the prediction of the GPC model on the test points. In contrast, the detoxicophores show an overall negative influence on the prediction outcome of the GPC model. Modifying the test compounds by adding toxicophores will increase the probability of being mutagenic as predicted by the GPC model, while adding detoxicophores will decrease this predicted probability. So we have seen that the conclusions drawn from our explanation vectors agree with established knowledge about toxicophores and detoxicophores. While this is reassuring, such a sanity check required existing knowledge about which substructures are toxicophores and detoxicophores and which are not. Thus it is interesting to ask whether we could also have discovered that knowledge from the explanation vectors.
To answer this question we ranked all 142 features by the means of their local gradients.[7] Clear trends result: nine out of ten known toxicophores can be found close to the top of the list (mean rank of 19). The only exception (rank 81) is the aromatic nitrosamine feature.[8] This trend is even stronger for the detoxicophores: the mean rank of these five features is 138 (out of 142), i.e., they consistently exhibit the largest negative local gradients. Consequently, the established knowledge about toxicophores and detoxicophores could indeed have been discovered using our methodology.

[Figure 7: Distribution of local importance of selected features across the test set of 4512 compounds. Panels: (a) aromatic nitro, (b) aromatic amine, (c) aromatic nitroso, (d) aliphatic nitrosamine, (e) aromatic nitrosamine, (f) epoxide, (g) aziridine, (h) azide, (i) aromatic hydroxylamine, (j) aliphatic halide. Nine out of ten known toxicophores (Kazius, McGuire, and Bursi, 2005) indeed exhibit positive local gradients.]

[Figure 8: Distribution of local importance of selected features across the test set of 4512 compounds. Panels: (a) sulfonamide, (b) sulfonic acid, (c) arylsulfonyl, (d) aliphatic carboxylic acid, (e) aromatic carboxylic acid. All five known detoxicophores exhibit negative local gradients.]

In the following paragraph we will discuss steroids[9] as an example of an important compound class for which the meaning of features differs from this global trend, so that local explanation vectors are needed to correctly identify relevant features. Figure 9 displays the difference in relevance of epoxide (a) and aliphatic nitrosamine (c) substructures for the predicted mutagenicity of steroids and non-steroid compounds. For comparison we also show the distributions for compounds chosen at random from the test set (b, d). Each subfigure contains two measures of (dis-)similarity for each pair of distributions. The p-value of the Kolmogorov-Smirnov test (KS) gives the probability of error when rejecting the hypothesis that both relative frequencies are drawn from the same underlying distribution. The symmetrized Kullback-Leibler divergence (KLD) gives a measure of the distance between the two distributions.[10]

[Figure 9: The local distribution of feature importance for steroids and random non-steroid compounds significantly differs for two known toxicophores. Panels: (a) epoxide feature, steroid vs. non-steroid (KS p-value = 0.00, symm. KLD = 23.14); (b) epoxide feature, random compounds vs. the rest (KS p-value = 0.12, symm. KLD = 11.08); (c) aliphatic nitrosamine feature, steroid vs. non-steroid (KS p-value = 0.00, symm. KLD = 12.51); (d) aliphatic nitrosamine feature, random compounds vs. the rest (KS p-value = 0.16, symm. KLD = 5.07). The small local gradients found for the steroids (shown in blue) indicate that the presence of each toxicophore is irrelevant to the molecule's toxicity. For non-steroids (shown in red) the known toxicophores indeed exhibit positive local gradients.]

While containing epoxides generally tends to make molecules mutagenic (see the discussion above), we do not observe this effect for steroids: in Figure 9(a), almost all epoxide-containing non-steroids exhibit positive gradients, thereby following the global distribution of epoxide-containing compounds as shown in Figure 7(f). In contrast, almost all epoxide-containing steroids exhibit gradients just below zero. "Immunity" of steroids to the epoxide toxicophore is an established fact and was first discussed by Glatt, Jung, and Oesch (1983). This peculiarity in chemical space is clearly exhibited by the local explanations given by our approach. For aliphatic nitrosamine, the situation in the GPC model is less clear, but still the toxifying influence seems to be smaller in steroids than in many other compounds. To our knowledge, this phenomenon has not yet been discussed in the pharmaceutical literature.

7. Tables resulting from this ranking are made available as a supplement to this paper and can be downloaded from the journal's website.

8. This finding agrees with the result obtained by visually inspecting Figure 7(e). We found that only very few compounds with this feature are present in the data set. Consequently, detection of this feature is only possible if enough of these few compounds are included in the training data. This was not the case in the random split used to produce the results presented above.

9. Steroids are natural products and occur in humans, animals and plants. They have a characteristic backbone containing four fused carbon rings. Many hormones important to the development of the human body are steroids, including androgens, estrogens, progestagens, cholesterol and natural anabolics. These have been used as starting points for the development of many different drugs, including the most reliable contraceptives currently on the market.
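The ranking described above reduces, per feature, to averaging that feature's entry of the explanation vector over the test set and sorting. A minimal sketch, with made-up gradient values and Dragon-style feature names used purely for illustration:

```python
def rank_features(expl_vectors, names):
    """Rank features by the mean of their local gradients across a
    test set: strongly positive means suggest toxifying features,
    strongly negative means suggest detoxifying ones."""
    d = len(names)
    n = len(expl_vectors)
    means = [sum(v[k] for v in expl_vectors) / n for k in range(d)]
    order = sorted(range(d), key=lambda k: means[k], reverse=True)
    return [(names[k], means[k]) for k in order]

# three features, two compounds; the values are illustrative only
vectors = [[0.40, -0.30, 0.05],
           [0.60, -0.50, -0.05]]
ranking = rank_features(vectors, ["nArNO2", "nSO2OH", "nArNNOx"])
# a toxicophore-like feature ends up at the top of the list,
# a detoxicophore-like feature at the bottom
```

Note that this global ranking deliberately averages away local structure; the steroid discussion above is exactly the case where such an average hides a locally different behavior.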
In conclusion, we can learn from the explanation vectors that:

• toxicophores tend to make compounds mutagenic (class 1),
• detoxicophores tend to make compounds non-mutagenic (class 0),
• steroids are immune to the presence of some toxicophores (epoxide, possibly also aliphatic nitrosamine).

7. Related Work

Assigning potentially different explanations to individual data points distinguishes our approach from conventional feature extraction methods that extract global features relevant for all data points, i.e., those features that allow one to achieve a small overall prediction error. Our notion of explanation is not related to the prediction error, but only to the label provided by the prediction algorithm. Even if the error is large, our framework is able to answer the question why the algorithm has decided on a data point the way it did.

The explanation vector proposed here is similar in spirit to sensitivity analysis, which is common to various areas of information science. A classical example is outlier sensitivity in statistics (Hampel, Ronchetti, Rousseeuw, and Stahel, 1986). In this case, the effects of removing single data points on estimated parameters are evaluated by an influence function. If the influence of a data point is significantly large, it is detected as an outlier and should be removed for the following analysis. In regression problems, leverage analysis is a procedure along similar lines. It detects leverage points which have the potential to exert a large impact on the estimate of the regression function. In contrast to the influential points (outliers), removing a leverage sample may not actually change the regressor, if its response is very close to the predicted value. E.g., for linear regression the samples whose inputs are far from the mean are the leverage points. Our framework of explanation vectors takes a different view. It describes the influence of moving single data points locally, and it thus answers the question which directions are locally most influential to the prediction. The explanation vectors are used for extracting sensitive features which are relevant to the prediction results, rather than for detecting/eliminating influential samples.

In recent decades, the explanation of results by expert systems has been an important topic in the AI community. Especially for those based on Bayesian belief networks, such explanation is crucial in practical use. In this context sensitivity analysis has also been used as a guiding principle (Horvitz, Breese, and Henrion, 1988). There the influence is evaluated by removing a set of variables (features) from the evidence, and the explanation is constructed from those variables which affect inference (relevant variables). For example, Suermondt (1992) measures the cost of omitting a single feature E_i by the cross-entropy

    H^-(E_i) = H( P(D | E); P(D | E \ E_i) ) = Σ_{j=1}^{N} P(d_j | E) log [ P(d_j | E) / P(d_j | E \ E_i) ],

where E denotes the evidence and D = (d_1, ..., d_N)^T is the target variable. The cost of a subset F ⊂ E can be defined similarly. This line of research is more closely connected to our work, because the explanation can depend on the assigned values of the evidence E, and is thus local.

10. Symmetry is achieved by averaging the two Kullback-Leibler divergences: (KL(P_1, P_2) + KL(P_2, P_1)) / 2, cf. Johnson and Sinanovic (2000). To prevent zero values in the histograms, which would lead to infinite KL distances, an ε > 0 has been added to each bin count.
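The ε-smoothed symmetrized divergence of footnote 10 is straightforward to compute from two bin-count histograms. A minimal sketch (the bin counts and the value of ε are illustrative):

```python
import math

def _normalize(counts, eps):
    """Add eps to every bin count (avoiding empty bins, which would
    make the KL divergence infinite) and normalize to a distribution."""
    smoothed = [c + eps for c in counts]
    total = sum(smoothed)
    return [v / total for v in smoothed]

def kl(p, q):
    """Kullback-Leibler divergence KL(p, q) of two distributions
    given as equal-length probability vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def sym_kld(counts1, counts2, eps=1e-6):
    """Symmetrized KLD: (KL(P1, P2) + KL(P2, P1)) / 2."""
    p = _normalize(counts1, eps)
    q = _normalize(counts2, eps)
    return 0.5 * (kl(p, q) + kl(q, p))

d = sym_kld([5, 0, 1], [1, 3, 2])   # two example histograms
```

Symmetrizing makes the divergence independent of the argument order, but it is still not a metric in the strict sense (the triangle inequality can fail).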
Similarly, Robnik-Šikonja and Kononenko (2008) and Štrumbelj and Kononenko (2008) try to explain the decisions of trained kNN, SVM and ANN models for individual instances by measuring the difference in their predictions with sets of features omitted. The cost of omitting features is evaluated as the information difference, the log-odds ratio or the difference of probabilities between the model with knowledge about all features and with omissions, respectively. To know what the prediction would be without the knowledge of a certain feature, the model is retrained for every choice of features whose influence is to be explained. To save the time of combinatorial retraining, Robnik-Šikonja and Kononenko (2008) propose to use neutral values which have to be estimated from a known prior distribution of all possible parameter values. As a theoretical framework for considering feature interactions, Štrumbelj and Kononenko (2008) propose to calculate the differences between model predictions for every choice of feature subset.

For multi-layer perceptrons, Féraud and Clérot (2002) measure the importance of individual input variables on clusters of test points. To this end, the change in the model output is evaluated for the change of a single input variable in a chosen interval while all other input variables are fixed. Lemaire and Féraud (2007) use a similar approach on an instance-by-instance basis. By considering each input variable in turn, there is no way to measure the effect of input feature interactions on the model output (see LeCun, Bottou, Orr, and Müller, 1998).
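In its simplest form, the omission idea above compares the model's prediction on the full instance with its prediction when one feature is replaced by a neutral value. The sketch below shows only that probability-difference variant; it does not retrain the model, and the toy model and neutral values are made up for illustration:

```python
def omission_importance(predict, x, neutral):
    """Per-feature importance as the difference between the predicted
    probability with all features known and with feature k replaced
    by a 'neutral' value (no retraining; a simplification of the
    omission-based approaches discussed above)."""
    base = predict(x)
    scores = []
    for k in range(len(x)):
        x_omit = list(x)
        x_omit[k] = neutral[k]
        scores.append(base - predict(x_omit))
    return scores

# toy model: the predicted probability depends on the first feature only
toy = lambda x: min(1.0, max(0.0, 0.5 + 0.4 * x[0]))
imp = omission_importance(toy, [1.0, 5.0], neutral=[0.0, 0.0])
# only the first feature receives non-zero importance
```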
The principal differences between our approach and these frameworks are: (i) we consider continuous features, and no structure among them is required, while some other frameworks start from binary features and may require discretization steps with the need to estimate parameters for them; (ii) we allow changes in any direction, i.e., any weighted combination of variables, while other approaches only consider one feature at a time or the omission of a set of variables.

[Figure 10: ζ(x) is the zero vector at the center of the middle cluster. Lower panel: the two-dimensional data (axes x_1 and x_2); upper panel: p(y = 1 | x) plotted over x_1, with the 0.5 level marked.]

8. Discussion

By now we have shown that our methods for calculating/estimating explanation vectors are useful in a variety of situations. In the following we discuss their limitations.

What can we do if the derivative is zero? This situation is depicted in Figure 10. In the lower panel we see a two-dimensional data set consisting of three clusters. The middle cluster has a different class than the clusters on the left and on the right. Relevant for the classification is only the horizontal coordinate (i.e., x_1). The upper panel shows the projected data and a representative slice of ζ(x). However, the explanation ζ(x) for the center point of the middle cluster is the zero vector, because at that point p(Y = 1 | X = x) is maximal. What can we do in such situations? Actually, the (normalized) explanation vector is derived from the following optimization problem for finding the locally most influential direction:

    argmax_{‖ε‖=1} { p(Y ≠ g*(x_0) | X = x_0 + ε) − p(Y ≠ g*(x_0) | X = x_0) }.

In case the first derivative of the above criterion is zero, its Taylor expansion starts from the second-order term, which is a quadratic form in its Hessian matrix.
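This second-order fallback can be sketched numerically: estimate the Hessian of the predicted probability by finite differences and take the eigenvector of the eigenvalue with the largest magnitude as the locally most informative direction. The probability surface below, a bump that varies only along x_1, is a made-up stand-in for the middle cluster of Figure 10, and power iteration is just one simple way to obtain the dominant eigenvector:

```python
import math

def hessian(f, x, h=1e-4):
    """Finite-difference Hessian of a scalar function f at x."""
    d = len(x)
    H = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            xpp = list(x); xpp[i] += h; xpp[j] += h
            xpm = list(x); xpm[i] += h; xpm[j] -= h
            xmp = list(x); xmp[i] -= h; xmp[j] += h
            xmm = list(x); xmm[i] -= h; xmm[j] -= h
            H[i][j] = (f(xpp) - f(xpm) - f(xmp) + f(xmm)) / (4.0 * h * h)
    return H

def dominant_direction(H, iters=100):
    """Eigenvector of the largest-magnitude eigenvalue of a symmetric
    2x2 matrix via power iteration (the orientation, i.e. the sign,
    stays arbitrary, matching the discussion above)."""
    v = [1.0, 1.0]
    for _ in range(iters):
        w = [H[0][0] * v[0] + H[0][1] * v[1],
             H[1][0] * v[0] + H[1][1] * v[1]]
        norm = math.hypot(w[0], w[1])
        v = [w[0] / norm, w[1] / norm]
    return v

# p(Y=1|x) peaks at the origin and varies only along x_1, so the
# gradient there is zero but the Hessian is still informative
p = lambda x: math.exp(-5.0 * x[0] ** 2)
v = dominant_direction(hessian(p, [0.0, 0.0]))
# v lies (up to sign) along the x_1 axis
```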
In the example data set with three clusters, the explanation vector is constant along the second dimension. The most interesting direction is given by the eigenvector corresponding to the largest eigenvalue of the Hessian. This direction will be, in our example, along the first dimension. Thus, we can learn from the Hessian that the first coordinate is relevant for the classification, but we do not obtain an orientation for it. Instead, it means that both directions (left and right) will influence the classification. However, if the conditional distribution P(Y = 1 | X = x) is flat in some regions, no meaningful explanation can be obtained by the gradient-based approach with the remedy mentioned above. Practically, by using Parzen window estimators with larger widths, the explanation vector can capture coarse structures of the classifier at points which are not too far from the borders. In A.2.2 we give an illustration of this point. In the future, we would like to work on global approaches, e.g. based on distances to the borders, or extensions of the approach by Robnik-Šikonja and Kononenko (2008). Since these procedures are expected to be computationally demanding, our proposal is useful in practice, in particular for probabilistic classifiers.

Does our framework generate different explanations for different prediction models? When using the local gradient of the model prediction directly, as in Definition 2 and Section 6, the explanation follows the given model precisely by definition. For the estimation framework this depends on whether the different classifiers classify the data differently. In that case the explanation vectors will be different, which makes sense, since they should explain the classifier at hand, even if its estimated labels were not all correct.
On the other hand, if the different classifiers agree on all labels, the explanations will be exactly equal.

Which implicit limitations do analytical gradients inherit from Gaussian Process models? A particular phenomenon can be observed at the boundaries of the training data: far from the training data, Gaussian Process classification models predict a probability of 0.5 for the positive class. When querying the model in an area of the feature space where predictions are negative, and one approaches the boundaries of the space populated with training data, explanation vectors will point away from any training data and therefore also away from areas of positive prediction. This behavior can be observed in Figure 1(d), where unit-length vectors indicate the direction of explanation vectors. In the right-hand corner, arrows point away from the triangle. However, we can see that the length of these vectors is so small that they are not even visible in Figure 1(c). Consequently, this property of GPC models does not pose a restriction for identifying the locally most influential features by investigating the features with the highest absolute values in the respective partial derivatives, as shown in Section 6.

Stationarity of the data. Since explanation vectors are defined as local gradients of the model prediction (see Definition 2), no assumption on the data is made: the local gradients follow the predictive model in any case. If, however, the model to be explained assumes stationarity of the data, the explanation vectors will inherit this limitation and reflect any shortcomings of the model (e.g. when the model is applied to non-stationary data). Our method for estimating explanation vectors, on the other hand, does assume stationarity of the data.
When modeling data that is in fact non-stationary, appropriate measures to deal with such data sets should be taken. One option is to separate the feature space into stationary and non-stationary parts using Stationary Subspace Analysis as introduced by von Bünau, Meinecke, Király, and Müller (2009). For further approaches to data set shift see Sugiyama, Nakajima, Kashima, von Buenau, and Kawanabe (2007b), Sugiyama, Krauledat, and Müller (2007a) and the book by Quionero-Candela, Sugiyama, Schwaighofer, and Lawrence (2009).

9. Conclusion

This paper proposes a method that sheds light into the black boxes of nonlinear classifiers. In other words, we introduce a method that can explain the local decisions taken by arbitrary (possibly) nonlinear classification algorithms. In a nutshell, the estimated explanations are local gradients that characterize how a data point has to be moved to change its predicted label. For models where such gradient information cannot be calculated explicitly, we employ a probabilistic approximate mimic of the learning machine to be explained. To validate our methodology we show how it can be used to draw new conclusions on how the various Iris flowers in Fisher's famous data set differ from each other and how to identify the features with which certain types of digits 2 and 8 in the USPS data set can be distinguished. Furthermore, we applied our method to a challenging drug discovery problem. The results on that data fully agree with existing domain knowledge, which was not available to our method. Even local peculiarities in chemical space (the extraordinary behavior of steroids) were discovered using the local explanations given by our approach.
Future directions are two-fold: first, we believe that our method will find its way into the toolboxes of practitioners who not only want to automatically classify their data but who also would like to understand the learned classifier. Thus, using our explanation framework in computational biology (see Sonnenburg, Zien, Philips, and Rätsch, 2008) and in decision-making experiments in psychophysics (e.g. Kienzle, Franz, Schölkopf, and Wichmann, 2009) seems most promising. The second direction is to generalize our approach to other prediction problems such as regression.

Acknowledgments

This work was supported in part by the FP7-ICT Programme of the European Community, under the PASCAL2 Network of Excellence, ICT-216886, and by DFG Grant MU 987/4-1. We would like to thank Andreas Sutter, Antonius Ter Laak, Thomas Steger-Hartmann and Nikolaus Heinrich for publishing the Ames mutagenicity data set (Hansen et al., 2009).

A. Appendix

A.1 Illustration of direct local gradients

In the following we give some illustrative examples of our method to explain models using local gradients. Since the explanation is derived directly from the respective model, it is interesting to investigate its acuteness depending on different model parameters and in instructive scenarios. We examine the effects that local gradients exhibit when choosing different kernel functions, when introducing outliers, and when the classes are not linearly separable locally.

A.1.1 Choice of kernel function

Figure 11 shows the effect of different kernel functions on the triangle toy data from Figure 1 in Section 2. The following observations can be made:

• In any case, note that the local gradients explain the model, which in turn may or may not capture the true situation.
• In Subfigure 11(a) the linear kernel leads to a model which fails to capture the non-linear class separation. This model misspecification is reflected by the explanations given for this model in Subfigure 11(b).

[Figure 11: The effect of different kernel functions on the local gradient explanations. Panels: (a) linear model, (b) linear explanation, (c) rational quadratic model, (d) rational quadratic explanation.]

• The rational quadratic kernel is able to model the non-linear separation more accurately. In Subfigure 11(c) a non-optimal degree parameter has been chosen for illustrative purposes. For other parameter values the rational quadratic kernel leads to similar results as the RBF kernel function used in Figure 1.

• The explanations in Subfigure 11(d) obtained for this model show local perturbations at the small "bumps" of the model, but the trends towards the positive class are still clear. As previously observed in Figure 1, the explanations make clear that both features interact at the corners and on the hypotenuse of the triangle class.

A.1.2 Outliers

In Figure 12 the effects of two outliers in the classification data on GPC with RBF kernel are shown. Once more, note that the local gradients explain the model, which in turn may or may not capture the true situation.

[Figure 12: The effect of outliers on the local gradient explanations. Panels: (a) outliers in classes, (b) outliers in model, (c) outlier explanation.]

The size of the region affected by the outliers depends on the kernel width parameter. We consider the following items:

• Local gradients are sensitive to outliers in the same way as the model which they try to explain. Here a single outlier deforms the model, and with it the explanation which may be extracted from it.

• Being derivatives, the sensitivity of local gradients to a nearby outlier is increased over the sensitivity of the model prediction itself.

• Thus the local gradient of a point near an outlier may not reflect a true explanation of the features important in reality. Nevertheless, it is the model that is wrong around an outlier in the first place.

• The histograms in Figures 7, 8, and 9 in Section 6 show the trends of the respective features in the distribution of all test points and are thus not affected by single outliers.

To compensate for the effect of outliers on the local gradients of points in the affected region, we propose to use a sliding window method to smooth the gradients around each point of interest. Thus, for each point, use the mean of all local gradients in the hypercube centered at this point and of appropriate size. This way the disrupting effect of an outlier is averaged out for an appropriately chosen window size.

A.1.3 Local non-linearity

[Figure 13: The effect of local non-linearity on the local gradient explanations. Panels: (a) locally non-linear object, (b) locally non-linear model, (c) locally non-linear explanation.]

The effect of locally non-linear class boundaries in the data is shown in Figure 13, again for GPC with an RBF kernel.
The following points can be observed:

• All the non-linear class boundaries are accurately followed by the local gradients.

• The circle-shaped region of negative examples surrounded by positive ones shows the full range of feature interactions towards the positive class.

• On the ridge of single positive instances the model introduces small valleys which are reflected by the local gradients.

A.2 Estimating by Parzen window

Finally, we elaborate on some details of our approach of estimating local gradients by Parzen window approximation. First we give the derivation used to obtain the explanation vector, and second we examine how the explanation varies with the goodness of fit of the Parzen window method.

A.2.1 Derivation of explanation vectors

These are more details on the derivation of Eq. (3). We use the index set I_c = {i | g(x_i) = c}:

    ∂/∂x k_σ(x) = −(x / σ²) k_σ(x)

    ∂/∂x p̂_σ(x, y ≠ c) = −(1/n) Σ_{i∉I_c} k_σ(x − x_i) (x − x_i) / σ²

    ∂/∂x p̂_σ(y ≠ c | x)
    = [ Σ_{i∉I_c} k_σ(x − x_i) · Σ_{i=1}^{n} k_σ(x − x_i)(x − x_i) − Σ_{i∉I_c} k_σ(x − x_i)(x − x_i) · Σ_{i=1}^{n} k_σ(x − x_i) ] / [ σ² ( Σ_{i=1}^{n} k_σ(x − x_i) )² ]
    = [ Σ_{i∉I_c} k_σ(x − x_i) · Σ_{i∈I_c} k_σ(x − x_i)(x − x_i) − Σ_{i∉I_c} k_σ(x − x_i)(x − x_i) · Σ_{i∈I_c} k_σ(x − x_i) ] / [ σ² ( Σ_{i=1}^{n} k_σ(x − x_i) )² ]

and thus, for the index set I_{g(z)} = {i | g(x_i) = g(z)},

    ζ̂(z) = ∂/∂x p̂(y ≠ g(z) | x) |_{x=z}
    = [ Σ_{i∉I_{g(z)}} k_σ(z − x_i) · Σ_{i∈I_{g(z)}} k_σ(z − x_i)(z − x_i) − Σ_{i∉I_{g(z)}} k_σ(z − x_i)(z − x_i) · Σ_{i∈I_{g(z)}} k_σ(z − x_i) ] / [ σ² ( Σ_{i=1}^{n} k_σ(z − x_i) )² ].

A.2.2 Goodness of fit by Parzen window

In our estimation framework the quality of the local gradients depends on the approximation of the classifier we want to explain by Parzen windows, for which we can calculate the explanation vectors as given by Definition 3.

[Figure 14: Goodness of fit of the Parzen window approximation affects the quality of the estimated explanation vectors. Panels: (a) SVM model, (b) estimated explanation with σ = 0.00069, (c) estimated explanation with σ = 0.1, (d) estimated explanation with σ = 1.0.]

Figure 14(a) shows an SVM model trained on the classification data from Figure 13(a). The local gradients estimated for this model by different Parzen window approximations are depicted in Subfigures 14(b), 14(c), and 14(d). We observe the following points:

• The SVM model was trained with C = 10 and using an RBF kernel of width σ = 0.01.

• In Subfigure 14(b) a small window width has been chosen by minimizing the mean absolute error over the validation set of labels predicted by the SVM classifier. Thus we obtain explaining local gradients on the class boundaries but zero vectors in the inner class regions. While this resembles the piecewise flat SVM model most accurately, it may be more useful practically to choose a larger width to obtain non-zero gradients pointing to the borders in these regions as well. For a more detailed discussion of zero gradients see Section 8.

• A larger, practically useful width for this example is shown in Subfigure 14(c). Here the local gradients in the inner class regions point to the other class as well.
• For a window width that is too large, as in Subfigure 14(d), the approximation fails to obtain local gradients which closely follow the model. Here only two directions are left, and the gradients for the blue class on the left and at the bottom point in the wrong direction.
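The Parzen-window procedure discussed above can be illustrated in code. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a Gaussian kernel and binary labels in {0, 1}, and the helper names `parzen_posterior`, `parzen_gradient`, and `select_width` are hypothetical. The window width σ is chosen by minimizing the mean absolute error between the Parzen-based posterior and the classifier's predicted labels on a validation set (as described for Figure 14(b)), and the local gradient of the estimated posterior at a point z serves as the explanation vector.

```python
import numpy as np

def parzen_posterior(z, X, y, sigma):
    """Parzen-window estimate of P(y = 1 | z) from labelled points (X, y)."""
    k = np.exp(-np.sum((X - z) ** 2, axis=1) / (2.0 * sigma ** 2))
    return k[y == 1].sum() / k.sum()

def parzen_gradient(z, X, y, sigma):
    """Analytic gradient of the Parzen posterior at z (the local
    explanation vector for the approximated classifier)."""
    diff = X - z
    k = np.exp(-np.sum(diff ** 2, axis=1) / (2.0 * sigma ** 2))
    dk = k[:, None] * diff / sigma ** 2           # gradient of each kernel w.r.t. z
    s, s_pos = k.sum(), k[y == 1].sum()
    ds, ds_pos = dk.sum(axis=0), dk[y == 1].sum(axis=0)
    return (ds_pos * s - s_pos * ds) / s ** 2     # quotient rule on s_pos / s

def select_width(widths, X_val, y_val_pred, X, y):
    """Pick the sigma minimizing mean absolute error against the labels
    the classifier predicted on a validation set."""
    def mae(sigma):
        p = np.array([parzen_posterior(z, X, y, sigma) for z in X_val])
        return np.mean(np.abs(p - y_val_pred))
    return min(widths, key=mae)
```

As the discussion of Figure 14 suggests, a width chosen purely by validation error can yield near-zero gradients inside the class regions, so in practice one may prefer a somewhat larger σ that produces informative gradients throughout the input space.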