New Generalization Bounds for Learning Kernels


Authors: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh

Abstract

This paper presents several novel generalization bounds for the problem of learning kernels based on the analysis of the Rademacher complexity of the corresponding hypothesis sets. Our bound for learning kernels with a convex combination of $p$ base kernels has only a $\log p$ dependency on the number of kernels, $p$, which is considerably more favorable than the previous best bound given for the same problem. We also give a novel bound for learning with a linear combination of $p$ base kernels with an $L_2$ regularization whose dependency on $p$ is only in $p^{1/4}$.

1 Introduction

Kernel methods are widely used in statistical learning [17, 18]. Positive definite symmetric (PDS) kernels specify an inner product in an implicit Hilbert space where large-margin methods are used for learning and estimation. They can be combined with algorithms such as support vector machines (SVMs) [5, 10, 20] or other kernel-based algorithms to form powerful learning techniques. But the choice of the kernel, which is critical to the success of the algorithm, is typically left to the user. Rather than requesting the user to commit to a specific kernel, which may not be optimal for the task, especially if the user's prior knowledge about the task is poor, learning kernel methods require the user only to specify a family of kernels. The learning algorithm then selects both the specific kernel out of that family and the hypothesis defined with respect to that kernel.

There is a large body of literature dealing with various aspects of the problem of learning kernels, including theoretical questions, optimization problems related to this problem, and experimental results [13, 15, 2, 1, 19, 16, 14, 23, 11, 3, 8, 22, 9]. Some of this previous work considers families of Gaussian kernels [15] or hyperkernels [16]. Non-linear combinations of kernels have recently been considered by [21, 3, 9]. But the most common family of kernels examined is that of non-negative combinations of some fixed kernels constrained by a trace condition, which can be viewed as an $L_1$ regularization [13], or by an $L_2$ regularization [8].

This paper presents several novel generalization bounds for the problem of learning kernels for the family of convex combinations of base kernels or linear combinations with an $L_2$ constraint. One of the first learning bounds, given by Lanckriet et al. [13] for the family of convex combinations of $p$ base kernels, is similar to that of Bousquet and Herrmann [6] and has the following form:
$$R(h) \leq \widehat{R}_\rho(h) + O\bigg(\frac{1}{\sqrt{m}}\sqrt{\max_{k=1}^{p} \mathrm{Tr}(K_k)\,\max_{k=1}^{p}\big(\|K_k\|/\mathrm{Tr}(K_k)\big)/\rho^2}\bigg),$$
where $R(h)$ is the generalization error of a hypothesis $h$, $\widehat{R}_\rho(h)$ is the fraction of training points with margin less than or equal to $\rho$, and $K_k$ is the kernel matrix associated to the $k$th base kernel. This bound was later shown by Srebro and Ben-David [19] to be always larger than one. Another bound by Lanckriet et al. [13], for the family of linear combinations of base kernels, was also shown by the same authors to be always larger than one.
But Lanckriet et al. [13] also presented a multiplicative bound for convex combinations of base kernels that is of the form
$$R(h) \leq \widehat{R}_\rho(h) + O\bigg(\sqrt{\frac{p/\rho^2}{m}}\bigg).$$
This bound converges and can perhaps be viewed as the first informative generalization bound for this family of kernels. However, the dependence of the bound on the number of kernels $p$ is multiplicative, which therefore does not encourage the use of too many base kernels. Srebro and Ben-David [19] presented a generalization bound based on the pseudo-dimension of the family of kernels that significantly improved on this bound. Their bound has the form
$$R(h) \leq \widehat{R}_\rho(h) + \widetilde{O}\bigg(\sqrt{\frac{p + R^2/\rho^2}{m}}\bigg),$$
where the notation $\widetilde{O}(\cdot)$ hides logarithmic terms and where $R^2$ is an upper bound on $K_k(x, x)$ for all points $x$ and base kernels $K_k$, $k \in [1, p]$. Thus, disregarding logarithmic terms, their bound is only additive in $p$. Their analysis also applies to other families of kernels. Ying and Campbell [22] also give generalization bounds for learning kernels based on the notion of Rademacher chaos complexity and the pseudo-dimension of the family of kernels used. It is not clear, however, how their bound compares to that of Srebro and Ben-David.

We present new generalization bounds for the family of convex combinations of base kernels that have only a logarithmic dependency on $p$. Our learning bound is based on a careful analysis of the Rademacher complexity of the hypothesis set considered and has the form
$$R(h) \leq \widehat{R}_\rho(h) + O\bigg(\sqrt{\frac{(\log p)\, R^2/\rho^2}{m}}\bigg).$$
Our bound is simpler and contains no other extra logarithmic term. Thus, this represents a substantial improvement over the previous best bounds for this problem. Our bound is also valid for a very large number of kernels, in particular for $p \gg m$, while the previous bounds were not informative in that case.

We also present new generalization bounds for the family of linear combinations of base kernels with an $L_2$ regularization. We had previously given a stability bound for an algorithm extending kernel ridge regression to learning kernels that had an additive dependency with respect to $p$ [8], assuming a technical condition of orthogonality on the base kernels. The complexity term of that bound was of the form $O(1/\sqrt{m} + \sqrt{p/m})$. Our new learning bound admits only a mild dependency of $p^{1/4}$ on the number of base kernels.

The next section (Section 2) defines the family of kernels and hypothesis sets we examine. Section 3 presents a bound on the Rademacher complexity of the class of convex combinations of base kernels with an $L_1$ constraint and a generalization bound for binary classification directly derived from that result. Similarly, Section 4 first presents a bound on the Rademacher complexity, then a generalization bound, for the case of an $L_2$ regularization.

2 Preliminaries

Most learning kernel algorithms are based on a hypothesis set derived from convex combinations of a fixed set of kernels $K_1, \ldots, K_p$:
$$H_p = \bigg\{ \sum_{i=1}^{m} \alpha_i K(x_i, \cdot) : K = \sum_{k=1}^{p} \mu_k K_k,\ \mu_k \geq 0,\ \sum_{k=1}^{p} \mu_k = 1,\ \alpha^\top K \alpha \leq 1/\rho^2 \bigg\}. \tag{1}$$
Note that linear combinations with possibly negative mixture weights have also been considered in the literature, e.g., [13]; however, these combinations do not ensure that the combined kernel is PDS.
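To make the hypothesis set $H_p$ of Eq. (1) concrete, here is a minimal numerical sketch (not part of the paper; the Gaussian base kernels, sample size, and mixture weights are illustrative assumptions). It builds a convex combination of two base kernel matrices and rescales the coefficient vector $\alpha$ so that the constraint $\alpha^\top K \alpha \leq 1/\rho^2$ holds.

```python
import numpy as np

def combined_kernel(base_kernels, mu):
    """Convex combination K = sum_k mu_k K_k of base kernel matrices (cf. Eq. (1))."""
    mu = np.asarray(mu, dtype=float)
    assert np.all(mu >= 0) and np.isclose(mu.sum(), 1.0), "mu must lie on the simplex"
    return sum(m_k * K_k for m_k, K_k in zip(mu, base_kernels))

rng = np.random.default_rng(0)
m, rho = 20, 1.0
X = rng.normal(size=(m, 3))                                       # a toy sample of m points
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
base_kernels = [np.exp(-sq / (2.0 * s**2)) for s in (0.5, 2.0)]   # two Gaussian base kernels

K = combined_kernel(base_kernels, mu=[0.3, 0.7])

# A hypothesis h = sum_i alpha_i K(x_i, .), rescaled so that alpha^T K alpha <= 1/rho^2.
alpha = rng.normal(size=m)
alpha /= rho * np.sqrt(alpha @ K @ alpha)
print("alpha^T K alpha:", alpha @ K @ alpha)                 # sits on the constraint boundary 1/rho^2
print("h on the first sample points:", (K @ alpha)[:3])      # h(x_1), h(x_2), h(x_3)
```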
We also consider the hypothesis set $H'_p$ based on an $L_2$ condition on the vector $\mu$, defined as follows:
$$H'_p = \bigg\{ \sum_{i=1}^{m} \alpha_i K(x_i, \cdot) : K = \sum_{k=1}^{p} \mu_k K_k,\ \mu_k \geq 0,\ \sum_{k=1}^{p} \mu_k^2 = 1,\ \alpha^\top K \alpha \leq 1/\rho^2 \bigg\}. \tag{2}$$

We bound the empirical Rademacher complexity $\widehat{\mathfrak{R}}_S(H_p)$ or $\widehat{\mathfrak{R}}_S(H'_p)$ of these families for an arbitrary sample $S$ of size $m$, which immediately yields a generalization bound for learning kernels based on this family of hypotheses. For a fixed sample $S = (x_1, \ldots, x_m)$, the empirical Rademacher complexity of a hypothesis set $H$ is defined as
$$\widehat{\mathfrak{R}}_S(H) = \frac{1}{m}\, \mathbb{E}_\sigma\Big[\sup_{h \in H} \sum_{i=1}^{m} \sigma_i h(x_i)\Big]. \tag{3}$$
The expectation is taken over $\sigma = (\sigma_1, \ldots, \sigma_m)$, where the $\sigma_i$ are independent uniform random variables taking values in $\{-1, +1\}$.

Let $h \in H_p$. Then
$$h(x) = \sum_{i=1}^{m} \alpha_i K(x_i, x) = \sum_{k=1}^{p} \sum_{i=1}^{m} \mu_k \alpha_i K_k(x_i, x) = \mathbf{w} \cdot \mathbf{\Phi}(x), \tag{4}$$
where $\mathbf{w} = (\mathbf{w}_1, \ldots, \mathbf{w}_p)$ with $\mathbf{w}_k = \mu_k \sum_{i=1}^{m} \alpha_i \Phi_k(x_i)$, and $\mathbf{\Phi}(x) = (\Phi_1(x), \ldots, \Phi_p(x))$ with $\Phi_k(x) = K_k(x, \cdot)$, for all $k \in [1, p]$.

3 Rademacher complexity bound for $H_p$

Theorem 1 For any sample $S$ of size $m$, the Rademacher complexity of the hypothesis set $H_p$ can be bounded as follows:
$$\widehat{\mathfrak{R}}_S(H_p) \leq \frac{\|\tau\|_r}{m\rho} \quad \text{with} \quad \tau = \big(\sqrt{r\,\mathrm{Tr}[K_1]}, \ldots, \sqrt{r\,\mathrm{Tr}[K_p]}\big)^\top, \tag{5}$$
for any even integer $r > 0$. If additionally $K_k(x, x) \leq R^2$ for all $x \in X$ and $k \in [1, p]$, then, for $p > 1$,
$$\widehat{\mathfrak{R}}_S(H_p) \leq \sqrt{\frac{2e\,\lceil \log p \rceil\, R^2/\rho^2}{m}}.$$

Proof: Fix a sample $S$. Then $\widehat{\mathfrak{R}}_S(H_p)$ can be bounded as follows for the hypothesis set of kernel learning algorithms, for any $q, r > 1$ with $1/q + 1/r = 1$:
$$\widehat{\mathfrak{R}}_S(H_p) = \frac{1}{m}\,\mathbb{E}_\sigma\Big[\sup_{h \in H_p} \sum_{i=1}^{m} \sigma_i h(x_i)\Big] = \frac{1}{m}\,\mathbb{E}_\sigma\Big[\sup_{\mathbf{w}} \mathbf{w} \cdot \sum_{i=1}^{m} \sigma_i \mathbf{\Phi}(x_i)\Big]$$
$$\leq \frac{1}{m}\,\mathbb{E}_\sigma\bigg[\sup_{\mathbf{w}} \Big(\sum_{k=1}^{p} \|\mathbf{w}_k\|^q\Big)^{1/q} \Big(\sum_{k=1}^{p} \Big\|\sum_{i=1}^{m} \sigma_i \Phi_k(x_i)\Big\|^r\Big)^{1/r}\bigg] \qquad (\text{Lemma 5})$$
$$= \frac{1}{m}\,\bigg(\sup_{\mathbf{w}} \Big(\sum_{k=1}^{p} \|\mathbf{w}_k\|^q\Big)^{1/q}\bigg)\; \mathbb{E}_\sigma\bigg[\Big(\sum_{k=1}^{p} \Big\|\sum_{i=1}^{m} \sigma_i \Phi_k(x_i)\Big\|^r\Big)^{1/r}\bigg].$$
We bound each of these two factors separately. The first term can be bounded as follows:
$$\Big(\sum_{k=1}^{p} \|\mathbf{w}_k\|^q\Big)^{1/q} \leq \sum_{k=1}^{p} \big(\|\mathbf{w}_k\|^q\big)^{1/q} \qquad (\text{sub-additivity of } x \mapsto x^{1/q},\ 1/q < 1)$$
$$= \sum_{k=1}^{p} \mu_k \Big\|\sum_{i=1}^{m} \alpha_i \Phi_k(x_i)\Big\| \leq \sqrt{\sum_{k=1}^{p} \mu_k \Big\|\sum_{i=1}^{m} \alpha_i \Phi_k(x_i)\Big\|^2} \qquad (\text{convexity})$$
$$= \sqrt{\sum_{k=1}^{p} \mu_k\, \alpha^\top K_k \alpha} = \sqrt{\alpha^\top K \alpha} \leq 1/\rho.$$
We bound the second term as follows:
$$\mathbb{E}_\sigma\bigg[\Big(\sum_{k=1}^{p} \Big\|\sum_{i=1}^{m} \sigma_i \Phi_k(x_i)\Big\|^r\Big)^{1/r}\bigg] \leq \bigg(\mathbb{E}_\sigma\Big[\sum_{k=1}^{p} \Big\|\sum_{i=1}^{m} \sigma_i \Phi_k(x_i)\Big\|^r\Big]\bigg)^{1/r} \qquad (\text{Jensen's inequality})$$
$$= \bigg(\sum_{k=1}^{p} \mathbb{E}_\sigma\Big[\Big\|\sum_{i=1}^{m} \sigma_i \Phi_k(x_i)\Big\|^r\Big]\bigg)^{1/r}.$$
Suppose that $r$ is an even integer, $r = 2r'$. Then, we can bound the expectation as follows:
$$\mathbb{E}_\sigma\Big[\Big\|\sum_{i=1}^{m} \sigma_i \Phi_k(x_i)\Big\|^r\Big] = \mathbb{E}_\sigma\Big[\Big(\sum_{i,j=1}^{m} \sigma_i \sigma_j K_k(x_i, x_j)\Big)^{r'}\Big]$$
$$\leq \sum_{\substack{1 \leq i_1, \ldots, i_{r'} \leq m \\ 1 \leq j_1, \ldots, j_{r'} \leq m}} \Big|\mathbb{E}_\sigma\big[\sigma_{i_1}\sigma_{j_1} \cdots \sigma_{i_{r'}}\sigma_{j_{r'}}\big]\Big|\, \big|K_k(x_{i_1}, x_{j_1}) \cdots K_k(x_{i_{r'}}, x_{j_{r'}})\big|$$
$$\leq \sum_{\substack{1 \leq i_1, \ldots, i_{r'} \leq m \\ 1 \leq j_1, \ldots, j_{r'} \leq m}} \Big|\mathbb{E}_\sigma\big[\sigma_{i_1}\sigma_{j_1} \cdots \sigma_{i_{r'}}\sigma_{j_{r'}}\big]\Big|\, \big(K_k(x_{i_1}, x_{i_1}) \cdots K_k(x_{i_{r'}}, x_{i_{r'}})\big)^{1/2} \big(K_k(x_{j_1}, x_{j_1}) \cdots K_k(x_{j_{r'}}, x_{j_{r'}})\big)^{1/2} \quad (\text{Cauchy-Schwarz})$$
$$= \sum_{s_1 + \cdots + s_m = 2r'} \binom{2r'}{s_1, \ldots, s_m} \Big|\mathbb{E}_\sigma\big[\sigma_1^{s_1} \cdots \sigma_m^{s_m}\big]\Big|\, K_k(x_1, x_1)^{s_1/2} \cdots K_k(x_m, x_m)^{s_m/2}.$$
Since $\mathbb{E}[\sigma_i] = 0$ for all $i$ and since the Rademacher variables are independent, we can write $\mathbb{E}[\sigma_{i_1} \cdots \sigma_{i_l}] = \mathbb{E}[\sigma_{i_1}] \cdots \mathbb{E}[\sigma_{i_l}] = 0$ for any $l$ distinct variables $\sigma_{i_1}, \ldots, \sigma_{i_l}$. Thus, $\mathbb{E}_\sigma[\sigma_1^{s_1} \cdots \sigma_m^{s_m}] = 0$ unless all the $s_i$ are even, in which case $\mathbb{E}_\sigma[\sigma_1^{s_1} \cdots \sigma_m^{s_m}] = 1$. Therefore, the following inequality holds:
$$\mathbb{E}_\sigma\Big[\Big\|\sum_{i=1}^{m} \sigma_i \Phi_k(x_i)\Big\|^r\Big] \leq \sum_{2t_1 + \cdots + 2t_m = 2r'} \binom{2r'}{2t_1, \ldots, 2t_m} K_k(x_1, x_1)^{t_1} \cdots K_k(x_m, x_m)^{t_m}$$
$$\leq (2r')^{r'} \sum_{t_1 + \cdots + t_m = r'} \binom{r'}{t_1, \ldots, t_m} K_k(x_1, x_1)^{t_1} \cdots K_k(x_m, x_m)^{t_m} = \big(2r'\, \mathrm{Tr}[K_k]\big)^{r'} = \big(r\, \mathrm{Tr}[K_k]\big)^{r/2},$$
where we used the following rather rough inequality for the multinomial coefficients:
$$\binom{2r'}{2t_1, \ldots, 2t_m} = \frac{(2r')!}{(2t_1)! \cdots (2t_m)!} \leq \frac{(2r')!}{t_1! \cdots t_m!} = (2r') \cdots (r'+1)\, \frac{r'!}{t_1! \cdots t_m!} \leq (2r')^{r'} \binom{r'}{t_1, \ldots, t_m}.$$
Thus, the Rademacher complexity is bounded by
$$\widehat{\mathfrak{R}}_S(H_p) \leq \frac{\|\tau\|_r}{m\rho} \quad \text{with} \quad \tau = \big(\sqrt{r\,\mathrm{Tr}[K_1]}, \ldots, \sqrt{r\,\mathrm{Tr}[K_p]}\big)^\top, \tag{6}$$
for any even integer $r$. Assume that $K_k(x, x) \leq R^2$ for all $x \in X$ and $k \in [1, p]$. Then $\mathrm{Tr}[K_k] \leq mR^2$ for any $k \in [1, p]$, thus the Rademacher complexity can be bounded as follows:
$$\widehat{\mathfrak{R}}_S(H_p) \leq \frac{1}{m\rho}\Big(p\big(\sqrt{r m R^2}\big)^{r}\Big)^{1/r} = p^{1/r}\, r^{1/2}\, \sqrt{\frac{R^2/\rho^2}{m}}.$$
For $p > 1$, the function $r \mapsto p^{1/r} r^{1/2}$ reaches its minimum at $r_0 = 2\log p$. This gives
$$\widehat{\mathfrak{R}}_S(H_p) \leq \sqrt{\frac{2e\,\lceil \log p \rceil\, R^2/\rho^2}{m}}.$$

It is likely that the constants in the bound of the theorem can be further improved. We used a very rough upper bound for the multinomial coefficients; a finer bound using Stirling's approximation should provide a better result. Remarkably, the bound of the theorem has a very mild dependency with respect to $p$.

The theorem can be used to derive generalization bounds for learning kernels in classification, regression, and other tasks. We briefly illustrate its application to binary classification, where the labels $y$ are in $\{-1, +1\}$. Let $R(h)$ denote the generalization error of $h \in H_p$, that is $R(h) = \Pr[y h(x) < 0]$. For a training sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$ and any $\rho > 0$, let $\widehat{R}_\rho(h)$ denote the fraction of the training points with margin less than or equal to $\rho$, that is $\widehat{R}_\rho(h) = \frac{1}{m}\sum_{i=1}^{m} 1_{y_i h(x_i) \leq \rho}$. Then, the following result holds.

Corollary 2 For any $\delta > 0$, with probability at least $1 - \delta$, the following bound holds for any $h \in H_p$:
$$R(h) \leq \widehat{R}_\rho(h) + 2\,\frac{\|\tau\|_r}{m\rho} + 2\sqrt{\frac{\log\frac{2}{\delta}}{2m}}, \tag{7}$$
with $\tau = \big(\sqrt{r\,\mathrm{Tr}[K_1]}, \ldots, \sqrt{r\,\mathrm{Tr}[K_p]}\big)^\top$, for any even integer $r > 0$. If additionally $K_k(x, x) \leq R^2$ for all $x \in X$ and $k \in [1, p]$, then, for $p > 1$,
$$R(h) \leq \widehat{R}_\rho(h) + 2\sqrt{\frac{2e\,\lceil \log p \rceil\, R^2/\rho^2}{m}} + 2\sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$

Proof: With our definition of the Rademacher complexity, for any $\delta > 0$, with probability at least $1 - \delta$, the following bound holds for any $h \in H_p$ [12, 4]:
$$R(h) \leq \widehat{R}_\rho(h) + 2\,\widehat{\mathfrak{R}}_S(H_p) + 2\sqrt{\frac{\log\frac{2}{\delta}}{2m}}. \tag{8}$$
Plugging in the bound on the empirical Rademacher complexity given by Theorem 1 yields the statement of the corollary.

The corollary gives a generalization bound for learning kernels with $H_p$ that is in
$$O\bigg(\sqrt{\frac{(\log p)\, R^2/\rho^2}{m}}\bigg). \tag{9}$$
In comparison, the bound for this problem given by Srebro and Ben-David [19] using the pseudo-dimension has a stronger dependency with respect to $p$ and is more complex:
$$O\Bigg(\sqrt{\frac{8\Big(2 + p\log\frac{128\,e\,m^3 R^2}{\rho^2 p} + 256\,\frac{R^2}{\rho^2}\log\frac{\rho e m}{8R}\,\log\frac{128\,m R^2}{\rho^2}\Big)}{m}}\Bigg). \tag{10}$$
This bound is also not informative for $p > m$.
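As a rough numerical illustration of Theorem 1 (not from the paper; the sample, the base kernels, and all parameter values below are assumptions), the following sketch evaluates the bound $\|\tau\|_r/(m\rho)$ over even values of $r$ together with the closed form $\sqrt{2e\lceil\log p\rceil R^2/\rho^2/m}$, and compares them with a Monte Carlo estimate of $\widehat{\mathfrak{R}}_S(H_p)$. The closed-form expression of the supremum used in the estimate (the best single base kernel for each draw of $\sigma$) is a standard computation assumed here, not a result stated in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
m, p, rho, R = 50, 200, 1.0, 1.0

# p hypothetical Gaussian base kernels on a toy sample, so K_k(x, x) = 1 = R^2.
X = rng.normal(size=(m, 5))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Ks = [np.exp(-sq / (2.0 * rng.uniform(0.3, 3.0) ** 2)) for _ in range(p)]
traces = np.array([np.trace(K) for K in Ks])

def theorem1_bound(r):
    """||tau||_r / (m rho) with tau_k = sqrt(r Tr[K_k]) (Eq. (5)), for an even integer r."""
    tau = np.sqrt(r * traces)
    return np.linalg.norm(tau, ord=r) / (m * rho)

best_r = min(range(2, 41, 2), key=theorem1_bound)              # search over even r
closed_form = np.sqrt(2 * np.e * np.ceil(np.log(p)) * R**2 / rho**2 / m)

def rademacher_mc(n_draws=2000):
    """Monte Carlo estimate of the empirical Rademacher complexity of H_p.

    For a fixed combined kernel K, the supremum over {alpha : alpha^T K alpha <= 1/rho^2}
    of sum_i sigma_i h(x_i) equals sqrt(sigma^T K sigma)/rho, and the supremum over the
    simplex of mixture weights is attained at a single base kernel (assumed derivation).
    """
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)
        total += max(np.sqrt(sigma @ K @ sigma) for K in Ks) / rho
    return total / (n_draws * m)

print("Theorem 1 bound at the best even r =", best_r, ":", theorem1_bound(best_r))
print("Closed-form bound sqrt(2e ceil(log p) R^2/rho^2 / m):", closed_form)
print("Monte Carlo estimate of the empirical Rademacher complexity:", rademacher_mc())
```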
4 Rademacher complexity bound for $H'_p$

Theorem 3 For any sample $S$ of size $m$, the Rademacher complexity of the hypothesis set $H'_p$ can be bounded as follows:
$$\widehat{\mathfrak{R}}_S(H'_p) \leq \frac{\|\tau\|_r}{m\rho} \quad \text{with} \quad \tau = \big(\sqrt{r\,\mathrm{Tr}[K_1]}, \ldots, \sqrt{r\,\mathrm{Tr}[K_p]}\big)^\top, \tag{11}$$
for any even integer $0 < r \leq 4$. If additionally $K_k(x, x) \leq R^2$ for all $x \in X$ and $k \in [1, p]$, then, for any $p \geq 1$,
$$\widehat{\mathfrak{R}}_S(H'_p) \leq 2\, p^{1/4} \sqrt{\frac{R^2/\rho^2}{m}}.$$
This bound also holds without the condition $\mu_k \geq 0$, $k \in [1, p]$, on the hypothesis set $H'_p$.

Proof: We can proceed as in the proof for bounding the Rademacher complexity of $H_p$, except for bounding the following term:
$$\Big(\sum_{k=1}^{p} \|\mathbf{w}_k\|^q\Big)^{1/q} = \Big[\sum_{k=1}^{p} \mu_k^q\, (\alpha^\top K_k \alpha)^{q/2}\Big]^{1/q} = \Big[\sum_{k=1}^{p} \mu_k^2\, \big(\mu_k^{\frac{2(q-2)}{q}}\, \alpha^\top K_k \alpha\big)^{q/2}\Big]^{1/q}$$
$$\leq \Big[\Big(\sum_{k=1}^{p} \mu_k^2\, \mu_k^{\frac{2(q-2)}{q}}\, \alpha^\top K_k \alpha\Big)^{q/2}\Big]^{1/q} \qquad (\text{convexity}) \qquad = \sqrt{\sum_{k=1}^{p} \mu_k^{\frac{4(q-1)}{q}}\, \alpha^\top K_k \alpha}.$$
Assume now that $q > 4/3$, which implies $\frac{4(q-1)}{q} > 1$. Then, since $\mu_k \in [0, 1]$, this implies $\mu_k^{\frac{4(q-1)}{q}} \leq \mu_k$. Thus, for any $q > 4/3$, we can write
$$\Big(\sum_{k=1}^{p} \|\mathbf{w}_k\|^q\Big)^{1/q} \leq \sqrt{\sum_{k=1}^{p} \mu_k\, \alpha^\top K_k \alpha} = \sqrt{\alpha^\top K \alpha} \leq 1/\rho.$$
Taking the limit $q \to 4/3$ shows that the inequality is also verified for $q = 4/3$. Thus, as in the proof for $H_p$, the Rademacher complexity can be bounded as follows:
$$\widehat{\mathfrak{R}}_S(H'_p) \leq \frac{\|\tau\|_r}{m\rho} \quad \text{with} \quad \tau = \big(\sqrt{r\,\mathrm{Tr}[K_1]}, \ldots, \sqrt{r\,\mathrm{Tr}[K_p]}\big)^\top, \tag{12}$$
but here $r$ is an even integer such that $1/r = 1 - 1/q \geq 1 - 3/4 = 1/4$, that is $r \leq 4$. Assume that $K_k(x, x) \leq R^2$ for all $x \in X$ and $k \in [1, p]$. Then $\mathrm{Tr}[K_k] \leq mR^2$ for any $k \in [1, p]$; thus, for $r = 4$, the Rademacher complexity can be bounded as follows:
$$\widehat{\mathfrak{R}}_S(H'_p) \leq \frac{1}{m\rho}\Big(p\big(\sqrt{4 m R^2}\big)^{4}\Big)^{1/4} = 2\, p^{1/4} \sqrt{\frac{R^2/\rho^2}{m}}.$$
Thus, in this case, the bound has a mild dependence ($\sqrt[4]{\cdot}$) on the number of kernels $p$.

Proceeding as in the $L_1$ case leads to the following margin bound in binary classification.

Corollary 4 For any $\delta > 0$, with probability at least $1 - \delta$, the following bound holds for any $h \in H'_p$:
$$R(h) \leq \widehat{R}_\rho(h) + 2\,\frac{\|\tau\|_r}{m\rho} + 2\sqrt{\frac{\log\frac{2}{\delta}}{2m}}, \tag{13}$$
with $\tau = \big(\sqrt{r\,\mathrm{Tr}[K_1]}, \ldots, \sqrt{r\,\mathrm{Tr}[K_p]}\big)^\top$, for any even integer $r \in \{2, 4\}$. If additionally $K_k(x, x) \leq R^2$ for all $x \in X$ and $k \in [1, p]$, then, for any $p \geq 1$,
$$R(h) \leq \widehat{R}_\rho(h) + 4\, p^{1/4} \sqrt{\frac{R^2/\rho^2}{m}} + 2\sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$
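To see how the two regularizations compare, this small sketch (illustrative only; the values of $m$, $R$, and $\rho$ are arbitrary choices) tabulates the Rademacher complexity bounds of Theorem 1, $\sqrt{2e\lceil\log p\rceil R^2/\rho^2/m}$, and of Theorem 3, $2p^{1/4}\sqrt{R^2/\rho^2/m}$, as the number of base kernels $p$ grows.

```python
import math

m, R, rho = 1000, 1.0, 1.0

def l1_bound(p):
    """Theorem 1 (convex/L1 combinations): sqrt(2e * ceil(log p) * R^2/rho^2 / m)."""
    return math.sqrt(2 * math.e * math.ceil(math.log(p)) * R**2 / rho**2 / m)

def l2_bound(p):
    """Theorem 3 (L2 combinations): 2 * p^(1/4) * sqrt(R^2/rho^2 / m)."""
    return 2 * p**0.25 * math.sqrt(R**2 / rho**2 / m)

for p in (10, 100, 1000, 10000):
    print(f"p = {p:>6}:  L1 bound = {l1_bound(p):.3f}   L2 bound = {l2_bound(p):.3f}")
```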
5 Conclusion

We presented several new generalization bounds for the problem of learning kernels with non-negative combinations of base kernels. Our bounds are simpler and significantly improve over previous bounds. Their very mild dependency on the number of kernels seems to suggest the use of a large number of kernels for this problem. Our experiments with this problem in regression using a large number of kernels seem to corroborate this idea [8]. Much needs to be done, however, to combine these theoretical findings with the somewhat disappointing performance observed in practice in most learning kernel experiments [7].

References

[1] Andreas Argyriou, Raphael Hauser, Charles Micchelli, and Massimiliano Pontil. A DC-programming algorithm for kernel selection. In ICML, 2006.
[2] Andreas Argyriou, Charles Micchelli, and Massimiliano Pontil. Learning convex combinations of continuously parameterized basic kernels. In COLT, 2005.
[3] F. Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In NIPS, 2008.
[4] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3, 2002.
[5] Bernhard Boser, Isabelle Guyon, and Vladimir Vapnik. A training algorithm for optimal margin classifiers. In COLT, volume 5, 1992.
[6] Olivier Bousquet and Daniel J. L. Herrmann. On the complexity of learning the kernel matrix. In NIPS, 2002.
[7] Corinna Cortes. Invited talk: Can learning kernels help performance? In ICML, page 161, 2009.
[8] Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. L2 regularization for learning kernels. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009), Montréal, Canada, June 2009.
[9] Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Learning non-linear combinations of kernels. In Advances in Neural Information Processing Systems (NIPS 2009), Vancouver, Canada, 2009. MIT Press.
[10] Corinna Cortes and Vladimir Vapnik. Support-Vector Networks. Machine Learning, 20(3), 1995.
[11] Tony Jebara. Multi-task feature and kernel selection for SVMs. In ICML, 2004.
[12] V. Koltchinskii and D. Panchenko. Rademacher processes and bounding the risk of function learning. In High Dimensional Probability II, pages 443–459, 2000.
[13] Gert Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael Jordan. Learning the kernel matrix with semidefinite programming. JMLR, 5, 2004.
[14] Darrin P. Lewis, Tony Jebara, and William Stafford Noble. Nonstationary kernel combination. In ICML, 2006.
[15] Charles Micchelli and Massimiliano Pontil. Learning the kernel function via regularization. JMLR, 6, 2005.
[16] Cheng Soon Ong, Alexander Smola, and Robert Williamson. Learning the kernel with hyperkernels. JMLR, 6, 2005.
[17] Bernhard Schölkopf and Alex Smola. Learning with Kernels. MIT Press: Cambridge, MA, 2002.
[18] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge Univ. Press, 2004.
[19] Nathan Srebro and Shai Ben-David. Learning bounds for support vector machines with learned kernels. In COLT, 2006.
[20] Vladimir N. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.
[21] Manik Varma and Bodla Rakesh Babu. More generality in efficient multiple kernel learning. In International Conference on Machine Learning, 2009.
[22] Yiming Ying and Colin Campbell. Generalization bounds for learning the kernel problem. In COLT, 2009.
[23] Alexander Zien and Cheng Soon Ong. Multiclass multiple kernel learning. In ICML, 2007.

A Lemma 5

The following lemma is a straightforward version of Hölder's inequality.

Lemma 5 Let $q, r > 1$ with $1/q + 1/r = 1$. Then, the following result similar to Hölder's inequality holds:
$$|\mathbf{w} \cdot \mathbf{\Phi}(x)| \leq \Big(\sum_{k=1}^{p} \|\mathbf{w}_k\|^q\Big)^{1/q} \Big(\sum_{k=1}^{p} \|\Phi_k(x)\|^r\Big)^{1/r}. \tag{14}$$

Proof: Let $\Psi_q(\mathbf{w}) = \big(\sum_{k=1}^{p} \|\mathbf{w}_k\|^q\big)^{1/q}$ and $\Psi_r(\mathbf{\Phi}(x)) = \big(\sum_{k=1}^{p} \|\Phi_k(x)\|^r\big)^{1/r}$. Then
$$\frac{|\mathbf{w} \cdot \mathbf{\Phi}(x)|}{\Psi_q(\mathbf{w})\,\Psi_r(\mathbf{\Phi}(x))} = \bigg|\sum_{k=1}^{p} \frac{\mathbf{w}_k}{\Psi_q(\mathbf{w})} \cdot \frac{\Phi_k(x)}{\Psi_r(\mathbf{\Phi}(x))}\bigg| \leq \sum_{k=1}^{p} \bigg|\frac{\mathbf{w}_k}{\Psi_q(\mathbf{w})} \cdot \frac{\Phi_k(x)}{\Psi_r(\mathbf{\Phi}(x))}\bigg|$$
$$\leq \sum_{k=1}^{p} \frac{\|\mathbf{w}_k\|}{\Psi_q(\mathbf{w})} \cdot \frac{\|\Phi_k(x)\|}{\Psi_r(\mathbf{\Phi}(x))} \qquad (\text{Cauchy-Schwarz})$$
$$\leq \sum_{k=1}^{p} \bigg(\frac{1}{q}\, \frac{\|\mathbf{w}_k\|^q}{\Psi_q(\mathbf{w})^q} + \frac{1}{r}\, \frac{\|\Phi_k(x)\|^r}{\Psi_r(\mathbf{\Phi}(x))^r}\bigg) \qquad (\text{Young's inequality})$$
$$= \frac{1}{q} + \frac{1}{r} = 1.$$
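As a quick sanity check of Lemma 5 (illustrative only; the block dimensions, the exponent $q$, and the random vectors are arbitrary assumptions), the following sketch verifies the block Hölder-type inequality (14) on random finite-dimensional blocks.

```python
import numpy as np

rng = np.random.default_rng(2)

def block_holder_check(p=8, dim=5, q=3.0):
    """Check |w . Phi| <= (sum_k ||w_k||^q)^(1/q) * (sum_k ||Phi_k||^r)^(1/r) numerically."""
    r = q / (q - 1.0)                                   # conjugate exponent: 1/q + 1/r = 1
    w = [rng.normal(size=dim) for _ in range(p)]        # blocks w_1, ..., w_p
    phi = [rng.normal(size=dim) for _ in range(p)]      # blocks Phi_1(x), ..., Phi_p(x)
    lhs = abs(sum(wk @ pk for wk, pk in zip(w, phi)))
    rhs = (sum(np.linalg.norm(wk) ** q for wk in w) ** (1 / q)
           * sum(np.linalg.norm(pk) ** r for pk in phi) ** (1 / r))
    return lhs, rhs

for _ in range(5):
    lhs, rhs = block_holder_check()
    print(f"|w . Phi| = {lhs:.4f}  <=  {rhs:.4f}  ({lhs <= rhs})")
```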
