On the Product Rule for Classification Problems
Marcelo Cicconet
New York University
cicconet@gmail.com

Keywords: Supervised Learning; Classification; Product Rule.

Abstract: We discuss theoretical aspects of the product rule for classification problems in supervised machine learning for the case of combining classifiers. We show that (1) the product rule arises from the MAP classifier supposing equivalent priors and conditional independence given a class; (2) under some conditions, the product rule is equivalent to minimizing the sum of the squared distances to the respective centers of the classes related with different features, such distances being weighted by the spread of the classes; (3) under some hypotheses, the product rule is equivalent to concatenating the vectors of features.

1 Introduction

With the advance of the machine learning field, and the discovery of many different techniques, the subject of combining multiple learners [2] eventually drew attention, in particular the problem of combining classifiers. Many different methods appeared, and soon they were compared in terms of efficiency in solving problems.

The product rule has been present in some of these works (e.g., [1, 7, 3, 6, 5, 4, 8]), in contexts ranging from the accuracy of the different combination rules to some analytical properties of the different methods. In [3] it was shown that, in the context of handwritten digit recognition, the product rule performs better for combining linear classifiers. In general, however, the product rule does not stand out from competitors [6]. For the problem of combining audio and video signals in guitar-chord recognition, the product rule is better than the sum rule [5], but on the problem of identity verification using face and voice profiles, the sum rule wins [7].

On the theoretical realm, [1] shows that for problems with two classes, the sum and product rules are equivalent when using two classifiers and the sum of the estimates of the a posteriori probabilities is equal to one. In [7], the product rule is derived from the hypothesis of conditional statistical independence between different representations of the data. There are also some intuitive explanations for the choice of the product rule, as for instance the fact that the product ("AND" operator) is preferred with respect to the sum rule ("OR" operator) because it enforces all qualities defined by the measures at once [9].

In this text, analytical properties of the product rule are further analyzed, in the contexts of two or more classifiers. We show that (1) the product rule arises from the MAP classifier supposing equivalent priors and conditional independence given a class; (2) under some conditions, the product rule is equivalent to minimizing the sum of the squared distances to the respective centers of the classes related with different features, such distances being weighted by the spread of the classes; (3) under some hypotheses, the product rule is equivalent to concatenating the vectors of features.

Our work extends the current theoretical understanding of the product rule provided by Alexandre et al. [1] and Kittler et al. [7], as was done in the direction of the sum rule by Li and Zong [8].

2 Theoretical Facts

Definition 1. Let $X$, $Y$ be (continuous) random variables corresponding to two distinct feature vectors, and $C$ the (discrete) random variable corresponding to the class, whose output can be $c_1, \ldots, c_K$. For any $Z \in \{X, Y\}$ and $k \in \{1, \ldots, K\}$, let $p_{Z,k}$ be a function that outputs the confidence that the class is $c_k$ considering that the features variable is $Z$. Supposing that the features are $X = x$ and $Y = y$, the product rule for classification will assign $C = c_{\hat{k}}$ provided

$$p_{X,\hat{k}}(x) \cdot p_{Y,\hat{k}}(y) = \max_{k=1,\ldots,K} p_{X,k}(x) \cdot p_{Y,k}(y).$$

In this definition and in the following results we use, for simplicity, only two random variables, named $X$ and $Y$. We could have used, instead, a set of $N$ random variables, say $X_1, \ldots, X_N$, but that would unnecessarily overload the notation.
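To make Definition 1 concrete, here is a minimal sketch (our illustration, not code from the paper; the names `product_rule`, `p_X`, `p_Y` are ours). It accepts a list of confidence vectors, one per feature variable, mirroring the remark that the definition extends to $N$ variables.

```python
import numpy as np

def product_rule(confidences):
    """Combine per-feature confidence vectors by the product rule.

    confidences: list of length-K arrays, one per feature variable,
    holding p_{Z,k} evaluated at the observed value of Z, k = 1..K.
    Returns the index k-hat maximizing the product over features.
    """
    scores = np.ones_like(confidences[0], dtype=float)
    for conf in confidences:
        scores *= conf          # p_{X,k}(x) * p_{Y,k}(y) * ...
    return int(np.argmax(scores))

# Example: K = 3 classes, two feature variables X and Y.
p_X = np.array([0.2, 0.5, 0.3])   # p_{X,k}(x) for k = 1, 2, 3
p_Y = np.array([0.6, 0.1, 0.3])   # p_{Y,k}(y) for k = 1, 2, 3
print(product_rule([p_X, p_Y]))   # -> 0, i.e., class c_1 (0.12 beats 0.05 and 0.09)
```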
Definition 2. Let $(X, Y)$ be the random variable obtained by concatenating the features $X$ and $Y$, and $p(\cdot \mid C = c_k)$ the density function for the variable $(X, Y)$ conditioned to $C = c_k$. We will denote the value of this function at the point $(x, y)$ by $p(X = x, Y = y \mid C = c_k)$. Let $P(C = c_k)$ be the prior probability that the class is $C = c_k$. Finally, let us define $p_{(X,Y),k}(x, y)$ as follows:

$$p_{(X,Y),k}(x, y) = p(X = x, Y = y \mid C = c_k) \cdot P(C = c_k).$$

Given a sampled value $(X, Y) = (x, y)$, the MAP (maximum a posteriori) classifier will assign $C = c_{\hat{k}}$ provided

$$p_{(X,Y),\hat{k}}(x, y) = \max_{k=1,\ldots,K} p_{(X,Y),k}(x, y).$$

Fact 1. When using the MAP classifier, the product rule arises under the hypotheses of (1) conditional independence given the class and (2) same prior probability for the classes.

Proof. The MAP classifier maximizes $p(X = x, Y = y \mid C = c_k) \cdot P(C = c_k)$ over $k$. Now hypothesis 1 means

$$p(X = x, Y = y \mid C = c_k) = p(X = x \mid C = c_k) \cdot p(Y = y \mid C = c_k),$$

and hypothesis 2 implies that $P(C = c_{\tilde{k}}) = P(C = c_{\hat{k}})$ for all $\tilde{k}, \hat{k} = 1, \ldots, K$, so the prior is a constant factor that does not affect the maximization. Therefore

$$\max_{k=1,\ldots,K} p_{(X,Y),k}(x, y) = \max_{k=1,\ldots,K} p(X = x \mid C = c_k) \cdot p(Y = y \mid C = c_k),$$

which is the product rule (see Definition 1) for $p_{X,k}(x) = p(X = x \mid C = c_k)$ and $p_{Y,k}(y) = p(Y = y \mid C = c_k)$.

Fact 2. For each $Z \in \{X, Y\}$, let $d_Z$ be the (finite) dimension of the variable $Z$, $I_{d_Z}$ the identity matrix of dimensions $d_Z \times d_Z$, and $\Sigma_{Z,k} = \sigma^2_{Z,k} I_{d_Z}$ (where $\sigma_{Z,k}$ is a positive number). Also, for each $k = 1, \ldots, K$, let $\mu_{Z,k}$ be fixed points in $\mathbb{R}^{d_Z}$. Defining confidence functions (see Definition 1)

$$p_{X,k}(x) = e^{-\frac{1}{2}(x - \mu_{X,k})^\top \Sigma^{-1}_{X,k}(x - \mu_{X,k})}, \quad (1)$$

$$p_{Y,k}(y) = e^{-\frac{1}{2}(y - \mu_{Y,k})^\top \Sigma^{-1}_{Y,k}(y - \mu_{Y,k})}, \quad (2)$$

the product rule is equivalent to

$$\min_{k=1,\ldots,K} \frac{1}{\sigma^2_{X,k}} \|x - \mu_{X,k}\|^2 + \frac{1}{\sigma^2_{Y,k}} \|y - \mu_{Y,k}\|^2.$$

That is, supposing Gaussian-like classifiers with covariances parallel to the axes, the product rule minimizes the sum of the squared distances to the respective "centers" of the classes for $X$ and $Y$, such distances being weighted by the inverse of the "spread" of the classes (an intuitively reasonable strategy, in fact).

Proof. Under the mentioned hypotheses, we have

$$\max_{k=1,\ldots,K} p_{X,k}(x) \cdot p_{Y,k}(y) = \max_{k=1,\ldots,K} e^{-\left(\frac{1}{2\sigma^2_{X,k}} \|x - \mu_{X,k}\|^2 + \frac{1}{2\sigma^2_{Y,k}} \|y - \mu_{Y,k}\|^2\right)}.$$

Applying the logarithm and multiplying by $-2$ (the logarithm is increasing, and the negative factor turns the maximization into a minimization), the maximizer on the left coincides with

$$\arg\min_{k=1,\ldots,K} \frac{1}{\sigma^2_{X,k}} \|x - \mu_{X,k}\|^2 + \frac{1}{\sigma^2_{Y,k}} \|y - \mu_{Y,k}\|^2.$$
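Fact 2 can be checked numerically. The sketch below is our own illustration (not part of the original text), assuming random class centers and spreads: the class chosen by the product rule with confidences (1) and (2) coincides with the class minimizing the variance-weighted sum of squared distances.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d_X, d_Y = 4, 3, 2
mu_X = rng.normal(size=(K, d_X))        # class centers mu_{X,k}
mu_Y = rng.normal(size=(K, d_Y))        # class centers mu_{Y,k}
sig_X = rng.uniform(0.5, 2.0, size=K)   # spreads sigma_{X,k}
sig_Y = rng.uniform(0.5, 2.0, size=K)   # spreads sigma_{Y,k}
x, y = rng.normal(size=d_X), rng.normal(size=d_Y)  # observed features

# Product rule with the Gaussian-like confidences (1) and (2).
conf = np.exp(-0.5 * np.sum((x - mu_X)**2, axis=1) / sig_X**2) \
     * np.exp(-0.5 * np.sum((y - mu_Y)**2, axis=1) / sig_Y**2)

# Sum of squared distances weighted by the inverse spreads.
dist = np.sum((x - mu_X)**2, axis=1) / sig_X**2 \
     + np.sum((y - mu_Y)**2, axis=1) / sig_Y**2

assert np.argmax(conf) == np.argmin(dist)  # Fact 2: same chosen class
```

Since `conf` equals `exp(-dist / 2)` and the exponential is strictly decreasing in `dist`, the two criteria pick the same class for any input, which is exactly the content of Fact 2.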
Fact 3. Let us now define confidence functions as follows:

$$p_{X,k}(x) = \frac{1}{(2\pi)^{d_X/2} |\Sigma_{X,k}|^{1/2}} e^{-\frac{1}{2}(x - \mu_{X,k})^\top \Sigma^{-1}_{X,k}(x - \mu_{X,k})}$$

and

$$p_{Y,k}(y) = \frac{1}{(2\pi)^{d_Y/2} |\Sigma_{Y,k}|^{1/2}} e^{-\frac{1}{2}(y - \mu_{Y,k})^\top \Sigma^{-1}_{Y,k}(y - \mu_{Y,k})},$$

where, for each $Z \in \{X, Y\}$, $|\Sigma_{Z,k}|$ is the determinant of $\Sigma_{Z,k}$. Let us suppose also that, conditioned to the class $c_k$, $X$ and $Y$ are uncorrelated; that is, being $\Sigma_k$ the covariance of $(X, Y) \mid C = c_k$, we can write

$$\Sigma_k = \begin{bmatrix} \Sigma_{X,k} & 0 \\ 0 & \Sigma_{Y,k} \end{bmatrix},$$

where, for each $Z \in \{X, Y\}$, $\Sigma_{Z,k}$ is the covariance of $Z \mid C = c_k$. Then, putting $\mu_k = (\mu_{X,k}, \mu_{Y,k})$, we have

$$p_{X,k}(x) \cdot p_{Y,k}(y) = \frac{1}{(2\pi)^{(d_X + d_Y)/2} |\Sigma_k|^{1/2}} e^{-\frac{1}{2}((x,y) - \mu_k)^\top \Sigma^{-1}_k ((x,y) - \mu_k)}.$$

That is, supposing Gaussian classifiers, the product rule is equivalent to learning using the concatenated vectors of features.

Proof. The inverse of $\Sigma_k$ is

$$\Sigma^{-1}_k = \begin{bmatrix} \Sigma^{-1}_{X,k} & 0 \\ 0 & \Sigma^{-1}_{Y,k} \end{bmatrix}.$$

This way, the expression

$$(x - \mu_{X,k})^\top \Sigma^{-1}_{X,k}(x - \mu_{X,k}) + (y - \mu_{Y,k})^\top \Sigma^{-1}_{Y,k}(y - \mu_{Y,k})$$

reduces to

$$((x, y) - \mu_k)^\top \Sigma^{-1}_k ((x, y) - \mu_k).$$

Moreover, since $|\Sigma_k| = |\Sigma_{X,k}| \cdot |\Sigma_{Y,k}|$,

$$\frac{1}{(2\pi)^{d_X/2} |\Sigma_{X,k}|^{1/2}} \cdot \frac{1}{(2\pi)^{d_Y/2} |\Sigma_{Y,k}|^{1/2}} = \frac{1}{(2\pi)^{(d_X + d_Y)/2} |\Sigma_k|^{1/2}}.$$

Therefore

$$p_{X,k}(x) \cdot p_{Y,k}(y) = \frac{1}{(2\pi)^{(d_X + d_Y)/2} |\Sigma_k|^{1/2}} e^{-\frac{1}{2}((x,y) - \mu_k)^\top \Sigma^{-1}_k ((x,y) - \mu_k)}.$$
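The identity in Fact 3 can also be verified numerically. The sketch below is our illustration (not code from the paper), assuming SciPy is available: it builds a block-diagonal covariance and compares the product of the two per-feature Gaussian densities with the joint density on the concatenated vector.

```python
import numpy as np
from scipy.linalg import block_diag
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
d_X, d_Y = 3, 2
mu_X, mu_Y = rng.normal(size=d_X), rng.normal(size=d_Y)

# Random symmetric positive-definite covariances for X|C=c_k and Y|C=c_k.
A = rng.normal(size=(d_X, d_X)); Sigma_X = A @ A.T + d_X * np.eye(d_X)
B = rng.normal(size=(d_Y, d_Y)); Sigma_Y = B @ B.T + d_Y * np.eye(d_Y)

x, y = rng.normal(size=d_X), rng.normal(size=d_Y)  # observed features

# Left-hand side: product of the per-feature Gaussian densities.
lhs = multivariate_normal.pdf(x, mu_X, Sigma_X) \
    * multivariate_normal.pdf(y, mu_Y, Sigma_Y)

# Right-hand side: joint Gaussian on the concatenated vector (x, y)
# with block-diagonal covariance Sigma_k = diag(Sigma_X, Sigma_Y).
rhs = multivariate_normal.pdf(np.concatenate([x, y]),
                              np.concatenate([mu_X, mu_Y]),
                              block_diag(Sigma_X, Sigma_Y))

assert np.isclose(lhs, rhs)  # Fact 3: the two densities agree
```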
3 Discussion

According to Fact 1, the product rule arises when maximizing the posterior under the hypotheses of equivalent priors and conditional independence given a class. We have just seen (Fact 3) that, supposing only uncorrelation (which is weaker than independence), the product rule appears as well. But in fact we have used Gaussian classifiers, i.e., we supposed the data was normally distributed. This is in accordance with the fact that normality and uncorrelation imply independence.

An important consequence of Fact 3 has to do with the curse of dimensionality. If there is strong evidence that the conditional joint distribution of $(X, Y)$ given any class $C = c_k$ is well approximated by a normal distribution, and that $X \mid C = c_k$ and $Y \mid C = c_k$ are uncorrelated, then the product rule is an interesting option, because we do not have to deal with a feature vector of dimension larger than the largest of the dimensions of the original descriptors. Besides, the product rule allows parallelization.

REFERENCES

[1] L. Alexandre, A. Campilho, and M. Kamel. On Combining Classifiers Using Sum and Product Rules. Pattern Recognition Letters 22, pp. 1283-1289, 2001.

[2] E. Alpaydin. Introduction to Machine Learning. The MIT Press, Cambridge, MA, 2004.

[3] M. van Breukelen, R. Duin, D. Tax, and J. Hartog. Handwritten Digit Recognition by Combined Classifiers. Kybernetika, Vol. 34, No. 4, pp. 381-386, 1998.

[4] M. Cicconet. The Guitar as a Human-Computer Interface. D.Sc. Thesis. National Institute of Pure and Applied Mathematics, Rio de Janeiro, 2010.

[5] M. Cicconet, P. Carvalho, and L. Velho. On Bimodal Guitar-Chord Recognition. International Computer Music Conference, New York, 2010.

[6] R. Duin and D. Tax. Experiments with Classifier Combining Rules. 1st Int. Workshop on Multiple Classifier Systems, pp. 16-29, London, UK, 2000.

[7] J. Kittler, M. Hatef, R. Duin, and J. Matas. On Combining Classifiers. IEEE TPAMI, Vol. 20, No. 3, March 1998.

[8] S. Li and C. Zong. Classifier Combining Rules Under Independence Assumptions. 7th International Conference on Multiple Classifier Systems. Springer-Verlag, Berlin Heidelberg, 2007.

[9] T. Mertens, J. Kautz, and F. Van Reeth. Exposure Fusion. 15th Pacific Conference on Computer Graphics and Applications, pp. 382-390, Washington, DC, USA, 2007.