A Quadratic Loss Multi-Class SVM

A Quadratic Loss Multi-Class SVM Emmanuel Monfrini — Y ann Guermeur November 1, 2021 A Quadrati Loss Multi-Class SVM Emman uel Monfrini ∗ , Y ann Guermeur †  No v em b er 1, 2021  24 pages Abstrat: Using a supp ort v etor ma hine requires to set t w o t yp es of h yp erparameters: the soft margin parameter C and the parameters of the k ernel. T o p erform this mo del seletion task, the metho d of  hoie is ross-v alidation. Its lea v e-one-out v arian t is kno wn to pro due an estimator of the generalization error whi h is almost un biased. Its ma jor dra wba k rests in its time requiremen t. T o o v erome this diult y , sev eral upp er b ounds on the lea v e-one-out error of the pattern reognition SVM ha v e b een deriv ed. Among those b ounds, the most p opular one is probably the radius-margin b ound. It applies to the hard margin pattern reognition SVM, and b y extension to the 2 -norm SVM. In this rep ort, w e in tro due a quadrati loss M-SVM, the M-SVM 2 , as a diret extension of the 2 -norm SVM to the m ulti-lass ase. F or this ma hine, a generalized radius-margin b ound is then established. Key-w ords: M-SVMs, mo del seletion, lea v e-one-out error, radius-margin b ound. ∗ UMR 7503-UHP † UMR 7503-CNRS Une SVM m ulti-lasse à oût quadratique Résumé : La mise en ÷uvre d'une ma hine à v eteurs supp ort requiert la détermination des v aleurs de deux t yp es d'h yp er-paramètres : le paramètre de marge doue C et les paramètres du no y au. P our eetuer ette tâ he de séletion de mo dèle, la métho de de  hoix est la v alidation roisée. Sa v arian te lea v e-one-out est onn ue p our fournir un estimateur de l'erreur en généralisation presque sans biais. Son défaut premier réside dans le temps de alul qu'elle néessite. An de surmon ter ette diulté, plusieurs ma joran ts de l'erreur lea v e-one-out de la SVM alulan t des di hotomies on t été prop osés. La plus p opulaire de es b ornes sup érieures est probablemen t la b orne ra y on-marge. Elle s'applique à la v ersion à marge dure de la ma hine, et par extension à la v arian te dite de norne 2 . Ce rapp ort in tro duit une M-SVM à oût quadratique, la M-SVM 2 , omme une extension direte de la SVM de norne 2 au as m ulti-lasse. P our ette ma hine, une b orne ra y on-marge généralisée est ensuite établie. Mots-lés : M-SVM, séletion de mo dèle, erreur lea v e-one-out, b orne ra y on-marge. A Quadr ati L oss Multi-Class SVM 3 1 In tro dution Using a supp ort v etor ma hine (SVM) [2, 4 ℄ requires to set t w o t yp es of h yp erparameters: the soft margin parameter C and the parameters of the k ernel. T o p erform this mo del seletion task, sev eral approa hes are a v ailable (see for instane [9 , 12 ℄). The solution of  hoie onsists in applying a ross-v alidation pro edure. Among those pro edures, the lea v e- one-out one app ears esp eially attrativ e, sine it is kno wn to pro due an estimator of the generalization error whi h is almost un biased [ 11 ℄. The seam y side of things is that it is highly time onsuming. This is the reason wh y , in reen t y ears, a n um b er of upp er b ounds on the lea v e-one-out error of pattern reognition SVMs ha v e b een prop osed in literature (see [3 ℄ for a surv ey). Among those b ounds, the tigh test one is the span b ound [16 ℄. Ho w ev er, the results of Chap elle and o-w ork ers presen ted in [3 ℄ sho w that another b ound, the radius- margin one [15 ℄, a hiev es equiv alen t p erformane for mo del seletion while b eing far simpler to ompute. This is the reason wh y it is urren tly the most p opular b ound. It applies to the hard margin ma hine and, b y extension, to the 2 -norm SVM (see for instane Chapter 7 in [13 ℄). In this rep ort, a m ulti-lass extension of the 2 -norm SVM is in tro dued. This ma hine, named M-SVM 2 , is a quadrati loss m ulti-lass SVM, i.e., a m ulti-lass SVM (M-SVM) in whi h the ℓ 1 -norm on the v etor of sla k v ariables has b een replaed with a quadrati form. The standard M-SVM on whi h it is based is the one of Lee, Lin and W ah ba [10 ℄. As the 2 -norm SVM, its training algorithm is equiv alen t to the training algorithm of a hard margin ma hine obtained b y a simple  hange of k ernel. W e then establish a generalized radius- margin b ound on the lea v e-one-out error of the hard margin v ersion of the M-SVM of Lee, Lin and W ah ba. The organization of this pap er is as follo ws. Setion 2 presen ts the m ulti-lass SVMs, b y desribing their ommon ar hiteture and the general form tak en b y their dieren t training algorithms. It fo uses on the M-SVM of Lee, Lin and W ah ba. In Setion 3, the M-SVM 2 is in tro dued as a partiular ase of quadrati loss M-SVM. Its onnetion with the hard margin v ersion of the M-SVM of Lee, Lin and W ah ba is highligh ted, as w ell as the fat that it onstitutes a m ulti-lass generalization of the 2 -norm SVM. Setion 4 is dev oted to the form ulation and pro of of the orresp onding m ulti-lass radius-margin b ound. A t last, w e dra w onlusions and outline our ongoing resear h in Setion 5. 4 Monfrini & Guermeur 2 Multi-Class SVMs 2.1 F ormalization of the learning problem W e are in terested here in m ulti-lass pattern reognition problems. F ormally , w e onsider the ase of Q -ategory lassiation problems with 3 ≤ Q < ∞ , but our results extend to the ase of di hotomies. Ea h ob jet is represen ted b y its desription x ∈ X and the set Y of the ategories y an b e iden tied with the set of indexes of the ategories: [ [ 1 , Q ] ] . W e assume that the link b et w een ob jets and ategories an b e desrib ed b y an unkno wn probabilit y measure P on the pro dut spae X × Y . The aim of the learning problem onsists in seleting in a set G of funtions g = ( g k ) 1 ≤ k ≤ Q from X in to R Q a funtion lassifying data in an optimal w a y . The riterion of optimalit y m ust b e sp eied. The funtion g assigns x ∈ X to the ategory l if and only if g l ( x ) > max k 6 = l g k ( x ) . In ase of ex æquo, x is assigned to a dumm y ategory denoted b y ∗ . Let f b e the deision funtion (from X in to Y S {∗} ) asso iated with g . With these denitions at hand, the ob jetiv e funtion to b e minimized is the probabilit y of error P ( f ( X ) 6 = Y ) . The optimization pro ess, alled tr aining , is based on empirial data. More preisely , w e assume that there exists a random pair ( X, Y ) ∈ X × Y , distributed aording to P , and w e are pro vided with a m -sample D m = (( X i , Y i )) 1 ≤ i ≤ m of indep enden t opies of ( X, Y ) . There are t w o questions raised b y su h problems: ho w to prop erly  ho ose the lass of funtions G and ho w to determine the b est andidate g ∗ in this lass, using only D m . This rep ort addresses the rst question, named mo del sele tion , in the partiular ase when the mo del onsidered is a M-SVM. The seond question, named funtion sele tion , is addressed for instane in [8 ℄. 2.2 Ar hiteture and training algorithms M-SVMs, lik e all the SVMs, b elong to the family of k ernel ma hines. As su h, they op erate on a lass of funtions indued b y a p ositiv e semidenite (Merer) k ernel. This alls for the form ulation of some denitions and prop ositions. Denition 1 (P ositiv e semidenite k ernel) A p ositiv e semidenite k ernel κ on the set X is a  ontinuous and symmetri funtion κ : X 2 → R verifying: ∀ n ∈ N ∗ , ∀ ( x i ) 1 ≤ i ≤ n ∈ X n , ∀ ( a i ) 1 ≤ i ≤ n ∈ R n , n X i =1 n X j =1 a i a j κ ( x i , x j ) ≥ 0 . Denition 2 (Repro duing k ernel Hilb ert spae [1 ℄) L et ( H , h· , ·i H ) b e a Hilb ert sp a e of funtions on X ( H ⊂ R X ). A funtion κ : X 2 → R is a repro duing k ernel of H if and only if: 1. ∀ x ∈ X , κ x = κ ( x, · ) ∈ H ; 2. ∀ x ∈ X , ∀ h ∈ H , h h, κ x i H = h ( x ) (r epr o duing pr op erty). A Quadr ati L oss Multi-Class SVM 5 A Hilb ert sp a e of funtions whih p ossesses a r epr o duing kernel is  al le d a repro duing k ernel Hilb ert spae (RKHS). Prop osition 1 L et ( H κ , h· , ·i H κ ) b e a RKHS of funtions on X with r epr o duing kernel κ . Then, ther e exists a map Φ fr om X into a Hilb ert sp a e  E Φ( X ) , h· , ·i  suh that: ∀ ( x, x ′ ) ∈ X 2 , κ ( x, x ′ ) = h Φ ( x ) , Φ ( x ′ ) i . (1) Φ is  al le d a feature map and E Φ( X ) a feature spae . The onnetion b et w een p ositiv e semidenite k ernels and RKHS is the follo wing. Prop osition 2 If κ is a p ositive semidenite kernel on X , then ther e exists a RKHS ( H , h· , ·i H ) of funtions on X suh that κ is a r epr o duing kernel of H . Let κ b e a p ositiv e semidenite k ernel on X and let ( H κ , h· , ·i H κ ) b e the RKHS spanned b y κ . Let ¯ H = ( H κ , h· , ·i H κ ) Q and let H = (( H κ , h· , ·i H κ ) + { 1 } ) Q . By onstrution, H is the lass of v etor-v alued funtions h = ( h k ) 1 ≤ k ≤ Q on X su h that h ( · ) = m k X i =1 β ik κ ( x ik , · ) + b k ! 1 ≤ k ≤ Q where the x ik are elemen ts of X , as w ell as the limits of these funtions when the sets { x ik : 1 ≤ i ≤ m k } b eome dense in X in the norm indued b y the dot pro dut (see for instane [17 ℄). Due to Equation 1, H an b e seen as a m ultiv ariate ane mo del on Φ ( X ) . F untions h an then b e rewritten as: h ( · ) = ( h w k , ·i + b k ) 1 ≤ k ≤ Q where the v etors w k are elemen ts of E Φ( X ) . They are th us desrib ed b y the pair ( w , b ) with w = ( w k ) 1 ≤ k ≤ Q ∈ E Q Φ( X ) and b = ( b k ) 1 ≤ k ≤ Q ∈ R Q . As a onsequene, ¯ H an b e seen as a m ultiv ariate linear mo del on Φ ( X ) , endo w ed with a norm k . k ¯ H giv en b y: ∀ ¯ h ∈ ¯ H ,   ¯ h   ¯ H = v u u t Q X k =1 k w k k 2 = k w k , where k w k k = p h w k , w k i . With these denitions and prop ositions at hand, a generi denition of the M-SVMs an b e form ulated as follo ws. Denition 3 (M-SVM, Denition 42 in [ 8 ℄) L et (( x i , y i )) 1 ≤ i ≤ m ∈ ( X × [ [ 1 , Q ] ]) m and λ ∈ R ∗ + . A Q -ategory M-SVM is a lar ge mar gin disriminant mo del obtaine d by minimizing over the hyp erplane P Q k =1 h k = 0 of H a p enalize d risk J M-SVM of the form: J M-SVM ( h ) = m X i =1 ℓ M-SVM ( y i , h ( x i )) + λ   ¯ h   2 ¯ H wher e the data t  omp onent involves a loss funtion ℓ M-SVM whih is  onvex. 6 Monfrini & Guermeur Three main mo dels of M-SVMs an b e found in literature. The oldest one is the mo del of W eston and W atkins [ 19 ℄, whi h orresp onds to the loss funtion ℓ WW giv en b y: ℓ WW ( y , h ( x )) = X k 6 = y (1 − h y ( x ) + h k ( x )) + , where the hinge loss funtion ( · ) + is the funtion max(0 , · ) . The seond one is due to Crammer and Singer [5℄ and orresp onds to the loss funtion ℓ CS giv en b y: ℓ CS ( y , ¯ h ( x )) =  1 − ¯ h y ( x ) + max k 6 = y ¯ h k ( x )  + . The most reen t mo del is the one of Lee, Lin and W ah ba [10 ℄ whi h orresp onds to the loss funtion ℓ LL W giv en b y: ℓ LL W ( y , h ( x )) = X k 6 = y  h k ( x ) + 1 Q − 1  + . (2) Among the three mo dels, the M-SVM of Lee, Lin and W ah ba is the only one that implemen ts asymptotially the Ba y es deision rule. It is Fisher  onsistent [ 20 , 14 ℄. 2.3 The M-SVM of Lee, Lin and W ah ba The substitution in Denition 3 of ℓ M-SVM with the expression of the loss funtion ℓ LL W giv en b y Equation 2 pro vides us with the expressions of the quadrati programming (QP) problems orresp onding to the training algorithms of the hard margin and soft margin v ersions of the M-SVM of Lee, Lin and W ah ba. Problem 1 (Hard margin M-SVM) min w , b J HM ( w , b ) s.t.      h w k , Φ( x i ) i + b k ≤ − 1 Q − 1 , (1 ≤ i ≤ m ) , (1 ≤ k 6 = y i ≤ Q ) P Q k =1 w k = 0 P Q k =1 b k = 0 wher e J HM ( w , b ) = 1 2 Q X k =1 k w k k 2 . Problem 2 (Soft margin M-SVM) min w , b J SM ( w , b ) A Quadr ati L oss Multi-Class SVM 7 s.t.          h w k , Φ( x i ) i + b k ≤ − 1 Q − 1 + ξ ik , (1 ≤ i ≤ m ) , (1 ≤ k 6 = y i ≤ Q ) ξ ik ≥ 0 , (1 ≤ i ≤ m ) , (1 ≤ k 6 = y i ≤ Q ) P Q k =1 w k = 0 P Q k =1 b k = 0 wher e J SM ( w , b ) = 1 2 Q X k =1 k w k k 2 + C m X i =1 X k 6 = y i ξ ik . In Problem 2, the ξ ik are slak variables in tro dued in order to relax the onstrain ts of orret lassiation. The o eien t C , whi h  haraterizes the trade-o b et w een predition auray on the training set and smo othness of the solution, an b e expressed in terms of the regularization o eien t λ as follo ws: C = (2 λ ) − 1 . It is alled the soft mar gin p ar ameter . Instead of diretly solving Problems 1 and 2 , one usually solv es their W olfe dual [ 6℄. W e no w deriv e the dual problem of Problem 1. Giving the details of the implemen tation of the Lagrangian dualit y will pro vide us with partial results whi h will pro v e useful in the sequel. Let α = ( α ik ) 1 ≤ i ≤ m, 1 ≤ k ≤ Q ∈ R Qm + b e the v etor of Lagrange m ultipliers asso iated with the onstrain ts of go o d lassiation. It is for on v eniene of notation that this v e- tor is expressed with double subsript and that the dumm y v ariables α iy i , all equal to 0 , are in tro dued. Let δ ∈ E Φ( X ) b e the Lagrange m ultiplier asso iated with the onstrain t P Q k =1 w k = 0 and β ∈ R the Lagrange m ultiplier asso iated with the onstrain t P Q k =1 b k = 0 . The Lagrangian funtion of Problem 1 is giv en b y: L ( w , b , α, β , δ ) = 1 2 Q X k =1 k w k k 2 − h δ, Q X k =1 w k i − β Q X k =1 b k + m X i =1 Q X k =1 α ik  h w k , Φ( x i ) i + b k + 1 Q − 1  . (3) Setting the gradien t of the Lagrangian funtion with resp et to w k equal to the n ull v etor pro vides us with Q alternativ e expressions for the optimal v alue of v etor δ : δ ∗ = w ∗ k + m X i =1 α ∗ ik Φ( x i ) , (1 ≤ k ≤ Q ) . (4) Sine b y h yp othesis, P Q k =1 w ∗ k = 0 , summing o v er the index k pro vides us with the expression of δ ∗ as a funtion of dual v ariables only: δ ∗ = 1 Q m X i =1 Q X k =1 α ∗ ik Φ( x i ) . (5) 8 Monfrini & Guermeur By substitution in to ( 4), w e get the expression of the v etors w k at the optim um: w ∗ k = 1 Q m X i =1 Q X l =1 α ∗ il Φ( x i ) − m X i =1 α ∗ ik Φ( x i ) , (1 ≤ k ≤ Q ) whi h an also b e written as w ∗ k = m X i =1 Q X l =1 α ∗ il  1 Q − δ k,l  Φ( x i ) , (1 ≤ k ≤ Q ) (6) where δ is the Krone k er sym b ol. Let us no w set the gradien t of (3) with resp et to b equal to the n ull v etor. It omes: β ∗ = m X i =1 α ∗ ik , (1 ≤ k ≤ Q ) and th us m X i =1 Q X l =1 α ∗ il  1 Q − δ k,l  = 0 , (1 ≤ k ≤ Q ) . Giv en the onstrain t P Q k =1 b k = 0 , this implies that: m X i =1 Q X k =1 α ∗ ik b ∗ k = β ∗ Q X k =1 b ∗ k = 0 . (7) By appliation of (6), Q X k =1 k w ∗ k k 2 = Q X k =1 h m X i =1 Q X l =1 α ∗ il  1 Q − δ k,l  Φ( x i ) , m X j =1 Q X n =1 α ∗ j n  1 Q − δ k,n  Φ( x j ) i = m X i =1 m X j =1 Q X l =1 Q X n =1 α ∗ il α ∗ j n h Φ( x i ) , Φ( x j ) i Q X k =1  1 Q − δ k,l   1 Q − δ k,n  = m X i =1 m X j =1 Q X l =1 Q X n =1 α ∗ il α ∗ j n  δ l,n − 1 Q  κ ( x i , x j ) . (8) Still b y appliation of (6 ), m X i =1 Q X k =1 α ∗ ik h w ∗ k , Φ( x i ) i = m X i =1 Q X k =1 α ∗ ik h m X j =1 Q X l =1 α ∗ j l  1 Q − δ k,l  Φ( x j ) , Φ( x i ) i A Quadr ati L oss Multi-Class SVM 9 = m X i =1 m X j =1 Q X k =1 Q X l =1 α ∗ ik α ∗ j l  1 Q − δ k,l  κ ( x i , x j ) . (9) Com bining (8 ) and (9) giv es: 1 2 Q X k =1 k w ∗ k k 2 + m X i =1 Q X k =1 α ∗ ik h w ∗ k , Φ( x i ) i = − 1 2 Q X k =1 k w ∗ k k 2 = − 1 2 m X i =1 m X j =1 Q X k =1 Q X l =1 α ∗ ik α ∗ j l  δ k,l − 1 Q  κ ( x i , x j ) . (10) In what follo ws, w e use the notation e n to designate the v etor of R n su h that all its omp onen ts are equal to e . Let H b e the matrix of M Qm,Qm ( R ) of general term: h ik,j l =  δ k,l − 1 Q  κ ( x i , x j ) . With these notations at hand, rep orting (7) and (10 ) in (3 ) pro vides us with the algebrai expression of the Lagrangian funtion at the optim um: L ( α ∗ ) = − 1 2 α ∗ T H α ∗ + 1 Q − 1 1 T Qm α ∗ . This ev en tually pro vides us with the W olfe dual form ulation of Problem 1: Problem 3 (Hard margin M-SVM, dual form ulation) max α J LL W,d ( α ) s.t. ( α ik ≥ 0 , (1 ≤ i ≤ m ) , (1 ≤ k 6 = y i ≤ Q ) P m i =1 P Q l =1 α il  1 Q − δ k,l  = 0 , (1 ≤ k ≤ Q ) wher e J LL W,d ( α ) = − 1 2 α T H α + 1 Q − 1 1 T Qm α, with the gener al term of the Hessian matrix H b eing h ik,j l =  δ k,l − 1 Q  κ ( x i , x j ) . Let the ouple  w 0 , b 0  denote the optimal solution of Problem 1 and equiv alen tly , let α 0 =  α 0 ik  1 ≤ i ≤ m, 1 ≤ k ≤ Q ∈ R Qm + b e the optimal solution of Problem 3. A ording to ( 6), the expression of w 0 k is then: w 0 k = m X i =1 Q X l =1 α 0 il  1 Q − δ k,l  Φ( x i ) . 10 Monfrini & Guermeur 2.4 Geometrial margins F rom a geometrial p oin t of view, the algorithms desrib ed ab o v e tend to onstrut a set of h yp erplanes { ( w k , b k ) : 1 ≤ k ≤ Q } that maximize globally the C 2 Q mar gins b et w een the dieren ts ategories. If these margins are dened as in the bi-lass ase, their analytial expression is more omplex. Denition 4 (Geometrial margins, Denition 7 in [7 ℄) L et us  onsider a Q - ate gory M-SVM (a funtion of H ) lassifying the examples of its tr aining set { ( x i , y i ) : 1 ≤ i ≤ m } without err or. γ kl , its margin b et w een ategories k and l , is dene d as the smal lest distan e of a p oint either in k or l to the hyp erplane sep ar ating those  ate gories. L et us denote d M-SVM = min 1 ≤ k 0 . W e kno w that su h a mapping exists, otherwise, giv en the equalit y onstrain ts of Problem 3, v etor α p w ould b e equal to the n ull v etor. F or K 2 ∈ R ∗ + , let µ p b e the v etor of R Qm that only diers from the n ull v etor in the follo wing w a y: ( µ p pn = K 2 ∀ k ∈ [ [ 1 , Q ] ] \ { n } , µ p I ( k ) k = K 2 . Ob viously , this solution is feasible (satises the onstrain ts 17 ). Indeed, 1 Q P m i =1 P Q k =1 µ p ik = K 2 and P m i =1 µ p ik = K 2 , (1 ≤ k ≤ Q ) . With this denition of v etor µ p , the righ t-hand side of (23) simplies in to: K 2   h p n ( x p ) + X k 6 = n h p k  x I ( k )  + Q Q − 1   . V etor µ p has b een sp eied so as to mak e it p ossible to exhibit a non trivial lo w er b ound on this last expression. By denition of n , h p n ( x p ) ≥ 0 . F urthermore, the Kuhn-T u k er optimalit y onditions: α p ik  h w p k , Φ( x i ) i + b p k + 1 Q − 1  = 0 , (1 ≤ i 6 = p ≤ m ) , (1 ≤ k 6 = y i ≤ Q ) imply that  h p k  x I ( k )  1 ≤ k 6 = n ≤ Q = − 1 Q − 1 1 Q − 1 . As a onsequene, a lo w er b ound on the righ t-hand side of ( 23 ) is pro vided b y: m X i =1 X k 6 = y i  h p k ( x i ) + 1 Q − 1  µ p ik ≥ K 2 Q − 1 . It springs from this b ound and (22 ) that J ( α p + K 1 µ p ) − J ( α p ) ≥ K 1 K 2 Q − 1 − K 2 1 2 Q X k =1      m X i =1 Q X l =1 µ p il  1 Q − δ k,l  Φ( x i )      2 . (24) Com bining (18), (21 ) and (24 ) nally giv es: 1 2 Q X k =1      m X i =1 Q X l =1 λ p il  1 Q − δ k,l  Φ( x i )      2 ≥ A Quadr ati L oss Multi-Class SVM 19 K 1 K 2 Q − 1 − K 2 1 2 Q X k =1      m X i =1 Q X l =1 µ p il  1 Q − δ k,l  Φ( x i )      2 . (25) Let ν p = ( ν p ik ) 1 ≤ i ≤ m, 1 ≤ k ≤ Q b e the v etor of R Qm + su h that µ p = K 2 ν p . The v alue of the salar K 3 = K 1 K 2 maximizing the righ t-hand side of (25 ) is: K ∗ 3 = 1 Q − 1 P Q k =1    P m i =1 P Q l =1 ν p il  1 Q − δ k,l  Φ( x i )    2 . By substitution in ( 25 ), this means that: ( Q − 1 ) 2 Q X k =1      m X i =1 Q X l =1 λ p il  1 Q − δ k,l  Φ( x i )      2 Q X k =1      m X i =1 Q X l =1 ν p il  1 Q − δ k,l  Φ( x i )      2 ≥ 1 . F or η in R Qm , let K ( η ) = 1 Q P m i =1 P Q k =1 η p ik . W e ha v e:      1 Q m X i =1 Q X l =1 λ p il Φ( x i ) − m X i =1 λ p ik Φ( x i )      2 = K ( λ p ) 2 k on v 1 (Φ( x i )) − on v 2 (Φ( x i )) k 2 where on v 1 (Φ( x i )) and on v 2 (Φ( x i )) are t w o on v ex om binations of the Φ( x i ) . As a onsequene, k on v 1 (Φ( x i )) − on v 2 (Φ( x i )) k 2 an b e b ounded from ab o v e b y D 2 m . Sine the same reasoning applies to ν p , w e get: ( Q − 1 ) 2 Q 2 K ( λ p ) 2 K ( ν p ) 2 D 4 m ≥ 1 . (26) By onstrution, K ( ν p ) = 1 . W e no w onstrut a v etor λ p minimizing the ob jetiv e funtion K . First, note that due to the equalit y onstrain ts satised b y this v etor, ∀ k ∈ [ [ 1 , Q ] ] , m X i =1 λ p ik = 1 Q m X i =1 Q X l =1 λ p il . As a onsequene, ∀ ( k , l ) ∈ [ [ 1 , Q ] ] 2 , m X i =1 λ p ik = m X i =1 λ p il . This implies that: ∀ k ∈ [ [ 1 , Q ] ] , m X i =1 λ p ik ≥ max l ∈ [ [ 1 ,Q ] ] α 0 pl . 20 Monfrini & Guermeur Ob viously , b oth the b o x onstrain ts in (16 ) and the nature of K all for the  hoie of small v alues for the omp onen ts λ p ik . Th us, there is a feasible solution λ p ∗ su h that: ∀ k ∈ [ [ 1 , Q ] ] , m X i =1 λ p ik ∗ = max l ∈ [ [ 1 ,Q ] ] α 0 pl . This solution is su h that K ( λ p ∗ ) = max k ∈ [ [ 1 ,Q ] ] α 0 pk . The substitution of the v alues of K ( ν p ) and K ( λ p ∗ ) in (26) pro vides us with:  max k ∈ [ [ 1 ,Q ] ] α 0 pk  2 ≥ 1 ( Q − 1) 2 Q 2 D 4 m . T aking the square ro ot of b oth sides onludes the pro of of the lemma. 4.3 Multi-lass radius-margin b ound Theorem 2 (Multi-lass radius-margin b ound) L et us  onsider a Q - ate gory har d mar- gin M-SVM of L e e, Lin and W ahb a on a domain X . L et d m = { ( x i , y i ) : 1 ≤ i ≤ m } b e its tr aining set, L m the numb er of err ors r esulting fr om applying a le ave-one-out r oss-validation pr o  e dur e to this mahine, and D m the diameter of the smal lest spher e of the fe atur e sp a e  ontaining the set { Φ( x i ) : 1 ≤ i ≤ m } . Then the fol lowing upp er b ound holds true: L m ≤ Q 2 D 2 m X k

Original Paper

Loading high-quality paper...

Comments & Academic Discussion

Loading comments...

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment