On the Product Rule for Classification Problems
Marcelo Cicconet
New York University
cicconet@gmail.com

Keywords: Supervised Learning; Classification; Product Rule.

Abstract: We discuss theoretical aspects of the product rule for classification problems in supervised machine learning for the case of combining classifiers. We show that (1) the product rule arises from the MAP classifier supposing equivalent priors and conditional independence given a class; (2) under some conditions, the product rule is equivalent to minimizing the sum of the squared distances to the respective centers of the classes related with different features, such distances being weighted by the spread of the classes; (3) under some hypotheses, the product rule is equivalent to concatenating the vectors of features.

1 Introduction

With the advance of the machine learning field, and the discovery of many different techniques, the subject of combining multiple learners [2] eventually drew attention, in particular the problem of combining classifiers. Many different methods appeared, and soon they were compared in terms of efficiency in solving problems.

The product rule has been present in some of these works (e.g., [1, 7, 3, 6, 5, 4, 8]), in contexts ranging from the accuracy of the different combination rules to some analytical properties of the different methods. In [3] it was shown that, in the context of handwritten digit recognition, the product rule performs better for combining linear classifiers. In general, however, the product rule does not stand out from competitors [6]. For the problem of combining audio and video signals in guitar-chord recognition, the product rule is better than the sum rule [5], but on the problem of identity verification using face and voice profiles, the sum rule wins [7].

On the theoretical realm, [1] shows that for problems with two classes, the sum and product rules are equivalent when using two classifiers and the sum of the estimates of the a posteriori probabilities is equal to one. In [7], the product rule is derived from the hypothesis of conditional statistical independence between different representations of the data. There are also some intuitive explanations for the choice of the product rule, as for instance the fact that the product ("AND" operator) is preferred with respect to the sum rule ("OR" operator) because it enforces all qualities defined by the measures at once [9].

In this text, analytical properties of the product rule are further analyzed, in the contexts of two or more classifiers. We show that (1) the product rule arises from the MAP classifier supposing equivalent priors and conditional independence given a class; (2) under some conditions, the product rule is equivalent to minimizing the sum of the squared distances to the respective centers of the classes related with different features, such distances being weighted by the spread of the classes; (3) under some hypotheses, the product rule is equivalent to concatenating the vectors of features.

Our work extends the current theoretical understanding of the product rule provided by Alexandre et al. [1] and Kittler et al. [7], as was done in the direction of the sum rule by Li and Zong [8].

2 Theoretical Facts

Definition 1. Let $X$, $Y$ be (continuous) random variables corresponding to two distinct feature vectors, and $C$ the (discrete) random variable corresponding to the class, whose output can be $c_1, \ldots, c_K$. For any $Z \in \{X, Y\}$ and $k \in \{1, \ldots, K\}$, let $p_{Z,k}$ be a function that outputs the confidence that the class is $c_k$ considering that the features variable is $Z$. Supposing that the features are $X = x$ and $Y = y$, the product rule for classification will assign $C = c_{\hat{k}}$ provided

$$p_{X,\hat{k}}(x) \cdot p_{Y,\hat{k}}(y) = \max_{k=1,\ldots,K} p_{X,k}(x) \cdot p_{Y,k}(y).$$

In this definition and in the following results we use, for simplicity, only two random variables, named $X$ and $Y$. We could have used, instead, a set of $N$ random variables, say $X_1, \ldots, X_N$, but that would unnecessarily overload the notation.
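To make Definition 1 concrete, here is a minimal sketch (our illustration, not code from the paper; the names `product_rule`, `p_X`, `p_Y` are ours). It accepts a list of confidence vectors, one per feature variable, mirroring the remark that the definition extends to $N$ variables.

```python
import numpy as np

def product_rule(confidences):
    """Combine per-feature confidence vectors by the product rule.

    confidences: list of length-K arrays, one per feature variable,
    holding p_{Z,k} evaluated at the observed value of Z, k = 1..K.
    Returns the index k-hat maximizing the product over features.
    """
    scores = np.ones_like(confidences[0], dtype=float)
    for conf in confidences:
        scores *= conf          # p_{X,k}(x) * p_{Y,k}(y) * ...
    return int(np.argmax(scores))

# Example: K = 3 classes, two feature variables X and Y.
p_X = np.array([0.2, 0.5, 0.3])   # p_{X,k}(x) for k = 1, 2, 3
p_Y = np.array([0.6, 0.1, 0.3])   # p_{Y,k}(y) for k = 1, 2, 3
print(product_rule([p_X, p_Y]))   # -> 0, i.e., class c_1 (0.12 beats 0.05 and 0.09)
```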
Definition 2. Let $(X, Y)$ be the random variable obtained by concatenating the features $X$ and $Y$, and $p(\cdot \mid C = c_k)$ the density function for the variable $(X, Y)$ conditioned to $C = c_k$. We will denote the value of this function at the point $(x, y)$ by $p(X = x, Y = y \mid C = c_k)$. Let $P(C = c_k)$ be the prior probability that the class is $C = c_k$. Finally, let us define $p_{(X,Y),k}(x, y)$ as follows:

$$p_{(X,Y),k}(x, y) = p(X = x, Y = y \mid C = c_k) \cdot P(C = c_k).$$

Given a sampled value $(X, Y) = (x, y)$, the MAP (maximum a posteriori) classifier will assign $C = c_{\hat{k}}$ provided

$$p_{(X,Y),\hat{k}}(x, y) = \max_{k=1,\ldots,K} p_{(X,Y),k}(x, y).$$

Fact 1. When using the MAP classifier, the product rule arises under the hypotheses of (1) conditional independence given the class and (2) same prior probability for the classes.

Proof. The MAP classifier maximizes $p(X = x, Y = y \mid C = c_k) \cdot P(C = c_k)$ over $k$. Now hypothesis 1 means

$$p(X = x, Y = y \mid C = c_k) = p(X = x \mid C = c_k) \cdot p(Y = y \mid C = c_k),$$

and hypothesis 2 implies that $P(C = c_{\tilde{k}}) = P(C = c_{\hat{k}})$ for all $\tilde{k}, \hat{k} = 1, \ldots, K$, so the prior is a constant factor that does not affect the maximization. Therefore

$$\max_{k=1,\ldots,K} p_{(X,Y),k}(x, y) = \max_{k=1,\ldots,K} p(X = x \mid C = c_k) \cdot p(Y = y \mid C = c_k),$$

which is the product rule (see Definition 1) for $p_{X,k}(x) = p(X = x \mid C = c_k)$ and $p_{Y,k}(y) = p(Y = y \mid C = c_k)$.

Fact 2. For each $Z \in \{X, Y\}$, let $d_Z$ be the (finite) dimension of the variable $Z$, $I_{d_Z}$ the identity matrix of dimensions $d_Z \times d_Z$, and $\Sigma_{Z,k} = \sigma^2_{Z,k} I_{d_Z}$ (where $\sigma_{Z,k}$ is a positive number). Also, for each $k = 1, \ldots, K$, let $\mu_{Z,k}$ be fixed points in $\mathbb{R}^{d_Z}$. Defining confidence functions (see Definition 1)

$$p_{X,k}(x) = e^{-\frac{1}{2}(x - \mu_{X,k})^\top \Sigma^{-1}_{X,k}(x - \mu_{X,k})}, \quad (1)$$

$$p_{Y,k}(y) = e^{-\frac{1}{2}(y - \mu_{Y,k})^\top \Sigma^{-1}_{Y,k}(y - \mu_{Y,k})}, \quad (2)$$

the product rule is equivalent to

$$\min_{k=1,\ldots,K} \frac{1}{\sigma^2_{X,k}} \|x - \mu_{X,k}\|^2 + \frac{1}{\sigma^2_{Y,k}} \|y - \mu_{Y,k}\|^2.$$

That is, supposing Gaussian-like classifiers with covariances parallel to the axes, the product rule minimizes the sum of the squared distances to the respective "centers" of the classes for $X$ and $Y$, such distances being weighted by the inverse of the "spread" of the classes (an intuitively reasonable strategy, in fact).

Proof. Under the mentioned hypotheses, we have

$$\max_{k=1,\ldots,K} p_{X,k}(x) \cdot p_{Y,k}(y) = \max_{k=1,\ldots,K} e^{-\left(\frac{1}{2\sigma^2_{X,k}} \|x - \mu_{X,k}\|^2 + \frac{1}{2\sigma^2_{Y,k}} \|y - \mu_{Y,k}\|^2\right)}.$$

Applying the logarithm and multiplying by $-2$ (the logarithm is increasing, and the negative factor turns the maximization into a minimization), the maximizer on the left coincides with

$$\arg\min_{k=1,\ldots,K} \frac{1}{\sigma^2_{X,k}} \|x - \mu_{X,k}\|^2 + \frac{1}{\sigma^2_{Y,k}} \|y - \mu_{Y,k}\|^2.$$
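Fact 2 can be checked numerically. The sketch below is our own illustration (not part of the original text), assuming random class centers and spreads: the class chosen by the product rule with confidences (1) and (2) coincides with the class minimizing the variance-weighted sum of squared distances.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d_X, d_Y = 4, 3, 2
mu_X = rng.normal(size=(K, d_X))        # class centers mu_{X,k}
mu_Y = rng.normal(size=(K, d_Y))        # class centers mu_{Y,k}
sig_X = rng.uniform(0.5, 2.0, size=K)   # spreads sigma_{X,k}
sig_Y = rng.uniform(0.5, 2.0, size=K)   # spreads sigma_{Y,k}
x, y = rng.normal(size=d_X), rng.normal(size=d_Y)  # observed features

# Product rule with the Gaussian-like confidences (1) and (2).
conf = np.exp(-0.5 * np.sum((x - mu_X)**2, axis=1) / sig_X**2) \
     * np.exp(-0.5 * np.sum((y - mu_Y)**2, axis=1) / sig_Y**2)

# Sum of squared distances weighted by the inverse spreads.
dist = np.sum((x - mu_X)**2, axis=1) / sig_X**2 \
     + np.sum((y - mu_Y)**2, axis=1) / sig_Y**2

assert np.argmax(conf) == np.argmin(dist)  # Fact 2: same chosen class
```

Since `conf` equals `exp(-dist / 2)` and the exponential is strictly decreasing in `dist`, the two criteria pick the same class for any input, which is exactly the content of Fact 2.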
Fact 3. Let us now define confidence functions as follows:

$$p_{X,k}(x) = \frac{1}{(2\pi)^{d_X/2} |\Sigma_{X,k}|^{1/2}} e^{-\frac{1}{2}(x - \mu_{X,k})^\top \Sigma^{-1}_{X,k}(x - \mu_{X,k})}$$

and

$$p_{Y,k}(y) = \frac{1}{(2\pi)^{d_Y/2} |\Sigma_{Y,k}|^{1/2}} e^{-\frac{1}{2}(y - \mu_{Y,k})^\top \Sigma^{-1}_{Y,k}(y - \mu_{Y,k})},$$

where, for each $Z \in \{X, Y\}$, $|\Sigma_{Z,k}|$ is the determinant of $\Sigma_{Z,k}$. Let us suppose also that, conditioned to the class $c_k$, $X$ and $Y$ are uncorrelated; that is, being $\Sigma_k$ the covariance of $(X, Y) \mid C = c_k$, we can write

$$\Sigma_k = \begin{bmatrix} \Sigma_{X,k} & 0 \\ 0 & \Sigma_{Y,k} \end{bmatrix},$$

where, for each $Z \in \{X, Y\}$, $\Sigma_{Z,k}$ is the covariance of $Z \mid C = c_k$. Then, putting $\mu_k = (\mu_{X,k}, \mu_{Y,k})$, we have

$$p_{X,k}(x) \cdot p_{Y,k}(y) = \frac{1}{(2\pi)^{(d_X + d_Y)/2} |\Sigma_k|^{1/2}} e^{-\frac{1}{2}((x,y) - \mu_k)^\top \Sigma^{-1}_k ((x,y) - \mu_k)}.$$

That is, supposing Gaussian classifiers, the product rule is equivalent to learning using the concatenated vectors of features.

Proof. The inverse of $\Sigma_k$ is

$$\Sigma^{-1}_k = \begin{bmatrix} \Sigma^{-1}_{X,k} & 0 \\ 0 & \Sigma^{-1}_{Y,k} \end{bmatrix}.$$

This way, the expression

$$(x - \mu_{X,k})^\top \Sigma^{-1}_{X,k}(x - \mu_{X,k}) + (y - \mu_{Y,k})^\top \Sigma^{-1}_{Y,k}(y - \mu_{Y,k})$$

reduces to

$$((x, y) - \mu_k)^\top \Sigma^{-1}_k ((x, y) - \mu_k).$$

Moreover, since $|\Sigma_k| = |\Sigma_{X,k}| \cdot |\Sigma_{Y,k}|$,

$$\frac{1}{(2\pi)^{d_X/2} |\Sigma_{X,k}|^{1/2}} \cdot \frac{1}{(2\pi)^{d_Y/2} |\Sigma_{Y,k}|^{1/2}} = \frac{1}{(2\pi)^{(d_X + d_Y)/2} |\Sigma_k|^{1/2}}.$$

Therefore

$$p_{X,k}(x) \cdot p_{Y,k}(y) = \frac{1}{(2\pi)^{(d_X + d_Y)/2} |\Sigma_k|^{1/2}} e^{-\frac{1}{2}((x,y) - \mu_k)^\top \Sigma^{-1}_k ((x,y) - \mu_k)}.$$
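The identity in Fact 3 can also be verified numerically. The sketch below is our illustration (not code from the paper), assuming SciPy is available: it builds a block-diagonal covariance and compares the product of the two per-feature Gaussian densities with the joint density on the concatenated vector.

```python
import numpy as np
from scipy.linalg import block_diag
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
d_X, d_Y = 3, 2
mu_X, mu_Y = rng.normal(size=d_X), rng.normal(size=d_Y)

# Random symmetric positive-definite covariances for X|C=c_k and Y|C=c_k.
A = rng.normal(size=(d_X, d_X)); Sigma_X = A @ A.T + d_X * np.eye(d_X)
B = rng.normal(size=(d_Y, d_Y)); Sigma_Y = B @ B.T + d_Y * np.eye(d_Y)

x, y = rng.normal(size=d_X), rng.normal(size=d_Y)  # observed features

# Left-hand side: product of the per-feature Gaussian densities.
lhs = multivariate_normal.pdf(x, mu_X, Sigma_X) \
    * multivariate_normal.pdf(y, mu_Y, Sigma_Y)

# Right-hand side: joint Gaussian on the concatenated vector (x, y)
# with block-diagonal covariance Sigma_k = diag(Sigma_X, Sigma_Y).
rhs = multivariate_normal.pdf(np.concatenate([x, y]),
                              np.concatenate([mu_X, mu_Y]),
                              block_diag(Sigma_X, Sigma_Y))

assert np.isclose(lhs, rhs)  # Fact 3: the two densities agree
```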
3 Discussion

According to Fact 1, the product rule arises when maximizing the posterior under the hypotheses of equivalent priors and conditional independence given a class. We have just seen (Fact 3) that, supposing only uncorrelation (which is weaker than independence), the product rule appears as well. But in fact we have used Gaussian classifiers, i.e., we supposed the data was normally distributed. This is in accordance with the fact that normality and uncorrelation imply independence.

An important consequence of Fact 3 has to do with the curse of dimensionality. If there is strong evidence that the conditional joint distribution of $(X, Y)$ given any class $C = c_k$ is well approximated by a normal distribution, and that $X \mid C = c_k$ and $Y \mid C = c_k$ are uncorrelated, then the product rule is an interesting option, because we do not have to deal with a feature vector of dimension larger than the largest of the dimensions of the original descriptors. Besides, the product rule allows parallelization.

REFERENCES

[1] L. Alexandre, A. Campilho, and M. Kamel. On Combining Classifiers Using Sum and Product Rules. Pattern Recognition Letters 22, pp. 1283-1289, 2001.

[2] E. Alpaydin. Introduction to Machine Learning. The MIT Press, Cambridge, MA, 2004.

[3] M. van Breukelen, R. Duin, D. Tax, and J. Hartog. Handwritten Digit Recognition by Combined Classifiers. Kybernetika, Vol. 34, No. 4, pp. 381-386, 1998.

[4] M. Cicconet. The Guitar as a Human-Computer Interface. D.Sc. Thesis. National Institute of Pure and Applied Mathematics, Rio de Janeiro, 2010.

[5] M. Cicconet, P. Carvalho, and L. Velho. On Bimodal Guitar-Chord Recognition. International Computer Music Conference, New York, 2010.

[6] R. Duin and D. Tax. Experiments with Classifier Combining Rules. 1st Int. Workshop on Multiple Classifier Systems, pp. 16-29, London, UK, 2000.

[7] J. Kittler, M. Hatef, R. Duin, and J. Matas. On Combining Classifiers. IEEE TPAMI, Vol. 20, No. 3, March 1998.

[8] S. Li and C. Zong. Classifier Combining Rules Under Independence Assumptions. 7th International Conference on Multiple Classifier Systems. Springer-Verlag, Berlin Heidelberg, 2007.

[9] T. Mertens, J. Kautz, and F. Van Reeth. Exposure Fusion. 15th Pacific Conference on Computer Graphics and Applications, pp. 382-390, Washington, DC, USA, 2007.