The Rate of Convergence of AdaBoost
Authors: Indraneel Mukherjee, Cynthia Rudin, Robert E. Schapire
Indraneel Mukherjee (imukherj@cs.princeton.edu)
Princeton University, Department of Computer Science, Princeton, NJ 08540 USA

Cynthia Rudin (rudin@mit.edu)
Massachusetts Institute of Technology, MIT Sloan School of Management, Cambridge, MA 02139 USA

Robert E. Schapire (schapire@cs.princeton.edu)
Princeton University, Department of Computer Science, Princeton, NJ 08540 USA

Abstract

The AdaBoost algorithm was designed to combine many "weak" hypotheses that perform slightly better than random guessing into a "strong" hypothesis that has very low error. We study the rate at which AdaBoost iteratively converges to the minimum of the "exponential loss." Unlike previous work, our proofs do not require a weak-learning assumption, nor do they require that minimizers of the exponential loss are finite. Our first result shows that the exponential loss of AdaBoost's computed parameter vector will be at most $\varepsilon$ more than that of any parameter vector of $\ell_1$-norm bounded by $B$ in a number of rounds that is at most a polynomial in $B$ and $1/\varepsilon$. We also provide lower bounds showing that a polynomial dependence on these parameters is necessary. Our second result is that within $C/\varepsilon$ iterations, AdaBoost achieves a value of the exponential loss that is at most $\varepsilon$ more than the best possible value, where $C$ depends on the data set. We show that this dependence of the rate on $\varepsilon$ is optimal up to constant factors, i.e., at least $\Omega(1/\varepsilon)$ rounds are necessary to achieve within $\varepsilon$ of the optimal exponential loss.

Keywords: AdaBoost, optimization, coordinate descent, convergence rate.

1. Introduction

The AdaBoost algorithm of Freund and Schapire (1997) was designed to combine many "weak" hypotheses that perform slightly better than random guessing into a "strong" hypothesis that has very low error.
Despite extensive theoretical and empirical study, basic properties of AdaBoost's convergence are not fully understood. In this work, we focus on one of those properties, namely, finding convergence rates that hold in the absence of any simplifying assumptions. Such assumptions, relied upon in much of the preceding work, make it easier to prove a fast convergence rate for AdaBoost, but often do not hold in the cases where AdaBoost is commonly applied.

AdaBoost can be viewed as a coordinate descent (or functional gradient descent) algorithm that iteratively minimizes an objective function $L : \mathbb{R}^N \to \mathbb{R}$ called the exponential loss (Breiman, 1999; Frean and Downs, 1998; Friedman et al., 2000; Friedman, 2001; Mason et al., 2000; Onoda et al., 1998; Rätsch et al., 2001; Schapire and Singer, 1999). Given $m$ labeled training examples $(x_1, y_1), \ldots, (x_m, y_m)$, where the $x_i$'s are in some domain $\mathcal{X}$ and $y_i \in \{-1, +1\}$, and a finite (but typically very large) space of weak hypotheses $\mathcal{H} = \{\hbar_1, \ldots, \hbar_N\}$, where each $\hbar_j : \mathcal{X} \to \{-1, +1\}$, the exponential loss is defined as
$$L(\lambda) \triangleq \frac{1}{m}\sum_{i=1}^m \exp\left(-\sum_{j=1}^N \lambda_j y_i \hbar_j(x_i)\right)$$
where $\lambda = \langle \lambda_1, \ldots, \lambda_N \rangle$ is a vector of weights or parameters. In each iteration, a coordinate descent algorithm moves some distance along some coordinate direction $\lambda_j$. For AdaBoost, the coordinate directions correspond to the individual weak hypotheses. Thus, on each round, AdaBoost chooses some weak hypothesis and step length, and adds these to the current weighted combination of weak hypotheses, which is equivalent to updating a single weight. The direction and step length are chosen so that the resulting vector $\lambda_t$ in iteration $t$ yields a lower value of the exponential loss than in the previous iteration, $L(\lambda_t) < L(\lambda_{t-1})$.
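As a minimal sketch (with a hypothetical toy dataset, not one from the paper), the exponential loss can be evaluated directly from the matrix of values $y_i\hbar_j(x_i)$:

```python
import numpy as np

def exponential_loss(M, lam):
    """L(lambda) = (1/m) * sum_i exp(-(M @ lam)_i), with M[i, j] = y_i * h_j(x_i)."""
    margins = M @ lam                  # unnormalized margin of each example
    return np.mean(np.exp(-margins))

# Hypothetical toy data: 3 examples, 2 weak hypotheses, entries in {-1, +1}.
M = np.array([[+1, -1],
              [+1, +1],
              [-1, +1]])
print(exponential_loss(M, np.zeros(2)))           # -> 1.0 (all-zero combination)
print(exponential_loss(M, np.array([0.5, 0.5])))  # < 1: a better combination
```

The all-zero combination always has loss exactly 1, which is why the proofs below can use $L(\lambda_0) = 1$ as a starting point.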
This repeats until it reaches a minimizer, if one exists. It was shown by Collins et al. (2002), and later by Zhang and Yu (2005), that AdaBoost asymptotically converges to the minimum possible exponential loss. That is,
$$\lim_{t \to \infty} L(\lambda_t) = \inf_{\lambda \in \mathbb{R}^N} L(\lambda).$$
However, that work did not address a convergence rate to the minimizer of the exponential loss.

Our work specifically addresses a recent conjecture of Schapire (2010) stating that there exists a positive constant $c$ and a polynomial $\mathrm{poly}()$ such that for all training sets and all finite sets of weak hypotheses, and for all $B > 0$,
$$L(\lambda_t) \le \min_{\lambda : \|\lambda\|_1 \le B} L(\lambda) + \frac{\mathrm{poly}(\log N, m, B)}{t^c}. \quad (1)$$
In other words, the exponential loss of AdaBoost will be at most $\varepsilon$ more than that of any other parameter vector $\lambda$ of $\ell_1$-norm bounded by $B$ in a number of rounds that is bounded by a polynomial in $\log N$, $m$, $B$ and $1/\varepsilon$. (We require $\log N$ rather than $N$ since the number of weak hypotheses will typically be extremely large.) Along with an upper bound that is polynomial in these parameters, we also provide lower bound constructions showing that some polynomial dependence on $B$ and $1/\varepsilon$ is necessary. Without any additional assumptions on the exponential loss $L$, and without altering AdaBoost's minimization algorithm for $L$, the best known convergence rate of AdaBoost prior to this work that we are aware of is that of Bickel et al. (2006), who prove a bound on the rate of the form $O(1/\sqrt{\log t})$.

We also provide a convergence rate of AdaBoost to the minimum value of the exponential loss. Namely, within $C/\varepsilon$ iterations, AdaBoost achieves a value of the exponential loss that is at most $\varepsilon$ more than the best possible value, where $C$ depends on the dataset.
This convergence rate is different from the one discussed above in that it has better dependence on $\varepsilon$ (in fact the dependence is optimal, as we show), and does not depend on the best solution within a ball of size $B$. However, this second convergence rate cannot be used to prove (1) since in certain worst case situations, we show the constant $C$ may be larger than $2^m$ (although usually it will be much smaller).

Within the proof of the second convergence rate, we provide a lemma (called the decomposition lemma) that shows that the training set can be split into two sets of examples: the "finite margin set" and the "zero loss set." Examples in the finite margin set always make a positive contribution to the exponential loss, and they never lie too far from the decision boundary. Examples in the zero loss set do not have these properties. If we consider the exponential loss where the sum is only over the finite margin set (rather than over all training examples), it is minimized by a finite $\lambda$. The fact that the training set can be decomposed into these two classes is the key step in proving the second convergence rate.

This problem of determining the rate of convergence is relevant in the proof of the consistency of AdaBoost given by Bartlett and Traskin (2007), where it has a direct impact on the rate at which AdaBoost converges to the Bayes optimal classifier (under suitable assumptions). It may also be relevant to practitioners who wish to have a guarantee on the exponential loss value at iteration $t$ (although, in general, minimization of the exponential loss need not be perfectly correlated with test accuracy).
There have been several works that make additional assumptions on the exponential loss in order to attain a better bound on the rate, but those assumptions are not true in general, and cases are known where each of these assumptions is violated. For instance, better bounds are proved by Rätsch et al. (2002) using results from Luo and Tseng (1992), but these appear to require that the exponential loss be minimized by a finite $\lambda$, and also depend on quantities that are not easily measured. There are many cases where $L$ does not have a finite minimizer; in fact, one such case is provided by Schapire (2010). Shalev-Shwartz and Singer (2008) have proven bounds for a variant of AdaBoost. Zhang and Yu (2005) also have given rates of convergence, but their technique requires a bound on the change in the size of $\lambda_t$ at each iteration that does not necessarily hold for AdaBoost. Many classic results are known on the convergence of iterative algorithms generally (see for instance Luenberger and Ye, 2008; Boyd and Vandenberghe, 2004); however, these typically start by assuming that the minimum is attained at some finite point in the (usually compact) space of interest, assumptions that do not generally hold in our setting.

When the weak learning assumption holds, there is a parameter $\gamma > 0$ that governs the improvement of the exponential loss at each iteration. Freund and Schapire (1997) and Schapire and Singer (1999) showed that the exponential loss is at most $e^{-2t\gamma^2}$ after $t$ rounds, so AdaBoost rapidly converges to the minimum possible loss under this assumption.

In Section 2 we summarize the coordinate descent view of AdaBoost. Section 3 contains the proof of the conjecture, with associated lower bounds proved in Section 3.3. Section 4 provides the $C/\varepsilon$ convergence rate. The proof of the decomposition lemma is given in Section 4.2.
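For comparison, the number of rounds needed under the weak learning assumption follows directly from the $e^{-2t\gamma^2}$ bound: the loss drops below any target $\varepsilon$ once

```latex
e^{-2t\gamma^2} \le \varepsilon
\quad\Longleftrightarrow\quad
t \ge \frac{\ln(1/\varepsilon)}{2\gamma^2},
```

so that case needs only $O(\ln(1/\varepsilon))$ rounds; the rates studied in this paper apply when no such fixed $\gamma$ exists.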
Given: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in \mathcal{X}$, $y_i \in \{-1, +1\}$;
a set $\mathcal{H} = \{\hbar_1, \ldots, \hbar_N\}$ of weak hypotheses $\hbar_j : \mathcal{X} \to \{-1, +1\}$.
Initialize: $D_1(i) = 1/m$ for $i = 1, \ldots, m$.
For $t = 1, \ldots, T$:
• Train the weak learner using distribution $D_t$; that is, find the weak hypothesis $h_t \in \mathcal{H}$ whose correlation $r_t \triangleq \mathbb{E}_{i \sim D_t}[y_i h_t(x_i)]$ has maximum magnitude $|r_t|$.
• Choose $\alpha_t = \frac{1}{2}\ln\{(1 + r_t)/(1 - r_t)\}$.
• Update, for $i = 1, \ldots, m$: $D_{t+1}(i) = D_t(i)\exp(-\alpha_t y_i h_t(x_i))/Z_t$, where $Z_t$ is a normalization factor (chosen so that $D_{t+1}$ will be a distribution).
Output the final hypothesis: $F(x) = \mathrm{sign}\left(\sum_{t=1}^T \alpha_t h_t(x)\right)$.

Figure 1: The boosting algorithm AdaBoost.

2. Coordinate Descent View of AdaBoost

From the examples $(x_1, y_1), \ldots, (x_m, y_m)$ and hypotheses $\mathcal{H} = \{\hbar_1, \ldots, \hbar_N\}$, AdaBoost iteratively computes the function $F : \mathcal{X} \to \mathbb{R}$, where $\mathrm{sign}(F(x))$ can be used as a classifier for a new instance $x$. The function $F$ is a linear combination of the hypotheses. At each iteration $t$, AdaBoost chooses one of the weak hypotheses $h_t$ from the set $\mathcal{H}$, and adjusts its coefficient by a specified value $\alpha_t$. Then $F$ is constructed after $T$ iterations as $F(x) = \sum_{t=1}^T \alpha_t h_t(x)$. Figure 1 shows the AdaBoost algorithm (Freund and Schapire, 1997). Since each $h_t$ is equal to $\hbar_{j_t}$ for some $j_t$, $F$ can also be written $F(x) = \sum_{j=1}^N \lambda_j \hbar_j(x)$ for a vector of values $\lambda = \langle \lambda_1, \ldots, \lambda_N \rangle$ (such vectors will sometimes also be referred to as combinations, since they represent combinations of weak hypotheses). In different notation, we can write AdaBoost as a coordinate descent algorithm on the vector $\lambda$.
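The loop of Figure 1 can be sketched in a few lines (a hypothetical implementation, not the authors' code, written directly on the matrix of values $M_{ij} = y_i\hbar_j(x_i)$ rather than on a separate weak learner; the toy matrix is made up):

```python
import numpy as np

def adaboost(M, T):
    """Run T rounds of the AdaBoost of Figure 1; return the combination lambda."""
    m, N = M.shape
    lam = np.zeros(N)
    D = np.full(m, 1.0 / m)                  # D_1(i) = 1/m
    for _ in range(T):
        r = D @ M                            # correlations E_{i~D_t}[y_i h_j(x_i)]
        j = int(np.argmax(np.abs(r)))        # weak hypothesis with max |r_t|
        alpha = 0.5 * np.log((1 + r[j]) / (1 - r[j]))
        lam[j] += alpha                      # update a single coordinate of lambda
        D = D * np.exp(-alpha * M[:, j])     # reweight the examples
        D = D / D.sum()                      # normalize by Z_t
    return lam

# Toy run: no single column classifies all 3 examples, so the loss stays
# positive but decreases every round.
M = np.array([[+1, -1], [+1, +1], [-1, +1]])
lam = adaboost(M, 10)
print(np.mean(np.exp(-(M @ lam))))           # exponential loss after 10 rounds
```

Each pass through the loop changes exactly one coordinate of `lam`, which is the coordinate-descent view developed next.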
We define the feature matrix $M$ elementwise by $M_{ij} = y_i\hbar_j(x_i)$, so that this matrix contains all of the inputs to AdaBoost (the training examples and hypotheses). Then the exponential loss can be written more compactly as
$$L(\lambda) = \frac{1}{m}\sum_{i=1}^m e^{-(M\lambda)_i},$$
where $(M\lambda)_i$, the $i$th coordinate of the vector $M\lambda$, is the (unnormalized) margin achieved by the vector $\lambda$ on training example $i$.

Coordinate descent algorithms choose, at each iteration, a coordinate along which the directional derivative is steepest, and choose a step that maximally decreases the objective along that coordinate. To perform coordinate descent on the exponential loss, we determine the coordinate $j_t$ at iteration $t$ as follows, where $e_j$ is the vector that is 1 in the $j$th position and 0 elsewhere:
$$j_t \in \operatorname*{argmax}_j \left.\left(-\frac{dL(\lambda_{t-1} + \alpha e_j)}{d\alpha}\right)\right|_{\alpha=0} = \operatorname*{argmax}_j \frac{1}{m}\sum_{i=1}^m e^{-(M\lambda_{t-1})_i} M_{ij}. \quad (2)$$
We can show that this is equivalent to the weak learning step of AdaBoost. Unraveling the recursion in Figure 1 for AdaBoost's weight vector $D_t$, we can see that $D_t(i)$ is proportional to $e^{-(M\lambda_{t-1})_i}$.

Lemma 2 If, for constants $c_1$ and $c_2$ with $c_2 > 1/2$, the edge satisfies $\delta_t \ge B^{-c_1}R_{t-1}^{c_2}$ in each round $t$, then AdaBoost achieves at most $L(\lambda^*) + \varepsilon$ loss after $2B^{2c_1}(\varepsilon\ln 2)^{1-2c_2}$ rounds.

Proof From the definition of $R_t$ and (4) we have
$$\Delta R_t = \ln L(\lambda_{t-1}) - \ln L(\lambda_t) \ge -\tfrac{1}{2}\ln(1 - \delta_t^2). \quad (5)$$
Combining the above with the inequality $e^x \ge 1 + x$, and the assumption on the edge,
$$\Delta R_t \ge -\tfrac{1}{2}\ln(1 - \delta_t^2) \ge \tfrac{1}{2}\delta_t^2 \ge \tfrac{1}{2}B^{-2c_1}R_{t-1}^{2c_2}.$$
Let $T = \lceil 2B^{2c_1}(\varepsilon\ln 2)^{1-2c_2} \rceil$ be the bound on the number of rounds in the lemma. If any of $R_0, \ldots, R_T$ is negative, then by monotonicity $R_T < 0$ and we are done. Otherwise, they are all non-negative. Then, applying Lemma 32 from the Appendix to the sequence $R_0, \ldots, R_T$
and using $c_2 > 1/2$, we get
$$R_T^{1-2c_2} \ge R_0^{1-2c_2} + c_2 B^{-2c_1} T > \tfrac{1}{2}B^{-2c_1} T \ge (\varepsilon\ln 2)^{1-2c_2} \implies R_T < \varepsilon\ln 2.$$
If either $\varepsilon$ or $L(\lambda^*)$ is greater than 1, then the lemma follows since $L(\lambda_T) \le L(\lambda_0) = 1 < L(\lambda^*) + \varepsilon$. Otherwise,
$$L(\lambda_T) < L(\lambda^*)e^{\varepsilon\ln 2} \le L(\lambda^*)(1 + \varepsilon) \le L(\lambda^*) + \varepsilon,$$
where the second inequality uses $e^x \le 1 + (1/\ln 2)x$ for $x \in [0, \ln 2]$.

We next show that large edges are achieved provided $S_t$ is small compared to $R_t$.

Lemma 3 In each round $t$, the edge satisfies $\delta_t \ge R_{t-1}/S_{t-1}$.

Proof For any combination $\lambda$, define $D_\lambda$ as the distribution on examples $\{1, \ldots, m\}$ that puts weight proportional to the loss, $D_\lambda(i) = e^{-(M\lambda)_i}/(mL(\lambda))$. Choose any $\lambda$ suffering at most the target loss, $L(\lambda) \le L(\lambda^*)$. By non-negativity of relative entropy we get
$$0 \le \mathrm{RE}\left(D_{\lambda_{t-1}} \,\|\, D_\lambda\right) = \sum_{i=1}^m D_{\lambda_{t-1}}(i)\ln\left(\frac{\frac{1}{m}e^{-(M\lambda_{t-1})_i}/L(\lambda_{t-1})}{\frac{1}{m}e^{-(M\lambda)_i}/L(\lambda)}\right) = -R_{t-1} + \sum_{i=1}^m D_{\lambda_{t-1}}(i)\left(M\lambda - M\lambda_{t-1}\right)_i. \quad (6)$$
Note that $D_{\lambda_{t-1}}$ is the distribution $D_t$ that AdaBoost creates in round $t$. The above summation can be rewritten as
$$\sum_{i=1}^m D_{\lambda_{t-1}}(i)\sum_{j=1}^N \left(\lambda_j - \lambda_j^{t-1}\right)M_{ij} = \sum_{j=1}^N \left(\lambda_j - \lambda_j^{t-1}\right)\sum_{i=1}^m D_t(i)M_{ij} \le \sum_{j=1}^N \left|\lambda_j - \lambda_j^{t-1}\right| \max_j \left|\sum_{i=1}^m D_t(i)M_{ij}\right| = \delta_t \|\lambda - \lambda_{t-1}\|_1. \quad (7)$$
Since the above holds for any $\lambda$ suffering at most the target loss, the last expression is at most $\delta_t S_{t-1}$. Combining this with (6) completes the proof.

To complete the proof of Theorem 1, we show $S_t$ is small compared to $R_t$ in rounds $t \le T_0$ (during which we have assumed $S_t$ and $R_t$ are all positive). In fact we prove:

Lemma 4 For any $t \le T_0$, $S_t \le B^3 R_t^{-2}$.

This, along with Lemmas 2 and 3, immediately proves Theorem 1.
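As a quick numeric sanity check of Lemma 3 on a hypothetical dataset (a sketch: here $R_{t-1} = \ln L(\lambda_{t-1}) - \ln L(\lambda^*)$, and since $S_{t-1} \le \|\lambda^* - \lambda_{t-1}\|_1$, the assertion below tests a weakened form of the lemma):

```python
import numpy as np

def L(M, lam):
    return np.mean(np.exp(-(M @ lam)))

M = np.array([[+1, -1], [+1, +1], [-1, +1]])   # hypothetical feature matrix
lam_star = np.array([0.5, 0.5])                # a reference solution lambda*
lam = np.zeros(2)

for t in range(5):
    D = np.exp(-(M @ lam))
    D = D / D.sum()
    r = D @ M
    j = int(np.argmax(np.abs(r)))
    delta = abs(r[j])                                   # edge delta_t
    R = np.log(L(M, lam)) - np.log(L(M, lam_star))      # suboptimality R_{t-1}
    S = np.linalg.norm(lam_star - lam, 1)               # upper bound on S_{t-1}
    assert delta >= R / S - 1e-12                       # Lemma 3, weakened form
    lam[j] += 0.5 * np.log((1 + r[j]) / (1 - r[j]))     # AdaBoost step
```

Once $L(\lambda_t)$ drops below $L(\lambda^*)$, the right-hand side turns negative and the inequality is trivially true, which matches the lemma's role: it only needs to bite while AdaBoost is still behind the reference solution.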
The bound on $S_t$ in Lemma 4 can be proven if we can first show $S_t$ grows slowly compared to the rate at which the suboptimality $R_t$ falls. Intuitively this holds since growth in $S_t$ is caused by a large step, which in turn will drive down the suboptimality. In fact we can prove the following.

Lemma 5 In any round $t \le T_0$, we have $2\Delta R_t / R_{t-1} \ge \Delta S_t / S_{t-1}$.

Proof Firstly, it follows from the definition of $S_t$ that $\Delta S_t \le \|\lambda_t - \lambda_{t-1}\|_1 = |\alpha_t|$. Next, using (5) and (3) we may write $\Delta R_t \ge \Upsilon(\delta_t)|\alpha_t|$, where the function $\Upsilon$ has been defined in (Rätsch and Warmuth, 2005) as
$$\Upsilon(x) = \frac{-\ln(1 - x^2)}{\ln\left(\frac{1+x}{1-x}\right)}.$$
It is known (Rätsch and Warmuth, 2005; Rudin et al., 2007) that $\Upsilon(x) \ge x/2$ for $x \in [0, 1]$. Combining, and using Lemma 3,
$$\Delta R_t \ge \delta_t \Delta S_t / 2 \ge R_{t-1}\left(\Delta S_t / 2S_{t-1}\right).$$
Rearranging completes the proof.

Using this we may prove Lemma 4.

Proof We first show $S_0 \le B^3 R_0^{-2}$. Note, $S_0 \le \|\lambda^* - \lambda_0\|_1 = B$, and by definition the quantity $R_0 = -\ln\left(\frac{1}{m}\sum_i e^{-(M\lambda^*)_i}\right)$. The quantity $(M\lambda^*)_i$ is the inner product of row $i$ of matrix $M$ with the vector $\lambda^*$. Since the entries of $M$ lie in $[-1, +1]$, this is at most $\|\lambda^*\|_1 = B$. Therefore $R_0 \le -\ln\left(\frac{1}{m}\sum_i e^{-B}\right) = B$, which is what we needed.

To complete the proof, we show that $R_t^2 S_t$ is non-increasing. It suffices to show for any $t$ the inequality $R_t^2 S_t \le R_{t-1}^2 S_{t-1}$. This holds by the following chain:
$$R_t^2 S_t = (R_{t-1} - \Delta R_t)^2(S_{t-1} + \Delta S_t) = R_{t-1}^2 S_{t-1}\left(1 - \frac{\Delta R_t}{R_{t-1}}\right)^2\left(1 + \frac{\Delta S_t}{S_{t-1}}\right) \le R_{t-1}^2 S_{t-1}\exp\left(-\frac{2\Delta R_t}{R_{t-1}} + \frac{\Delta S_t}{S_{t-1}}\right) \le R_{t-1}^2 S_{t-1},$$
where the first inequality follows from $e^x \ge 1 + x$, and the second one from Lemma 5. This completes the proof of Theorem 1.
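The bound $\Upsilon(x) \ge x/2$ used in the proof of Lemma 5 is easy to check numerically (a sketch; the grid endpoints avoid the singularities at $0$ and $1$):

```python
import numpy as np

def upsilon(x):
    """Upsilon(x) = -ln(1 - x^2) / ln((1 + x)/(1 - x)) (Ratsch and Warmuth, 2005)."""
    return -np.log1p(-x * x) / (np.log1p(x) - np.log1p(-x))

xs = np.linspace(1e-3, 1 - 1e-3, 2000)
assert np.all(upsilon(xs) >= xs / 2)   # the bound used in the proof of Lemma 5
print(upsilon(np.array([0.001, 0.5, 0.999])))   # approaches x/2 near 0
```

The inequality is tight as $x \to 0$, where $\Upsilon(x) = x/2 + O(x^3)$, which is why the factor of $2$ in Lemma 5 cannot be improved by this route.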
Although our bound provides a rate polynomial in $B$ and $\varepsilon^{-1}$, as desired by the conjecture in (Schapire, 2010), the exponents are rather large, and (we believe) not tight. One possible source of slack is the bound on $S_t$ in Lemma 4. Qualitatively, the distance $S_t$ to some solution having target loss should decrease with rounds, whereas Lemma 4 only says it does not increase too fast. Improving this will directly lead to a faster convergence rate. In particular, showing that $S_t$ never increases would imply a $B^2/\varepsilon$ rate of convergence. Whether or not the monotonicity of $S_t$ holds, we believe that the resulting rate bound is probably true, and state it as a conjecture.

Conjecture 6 For any $\lambda^*$ and $\varepsilon > 0$, AdaBoost converges to within $L(\lambda^*) + \varepsilon$ loss in $O(B^2/\varepsilon)$ rounds, where the order notation hides only absolute constants.

As evidence supporting the conjecture, we show in the next section how a minor modification to AdaBoost can achieve the above rate.

3.2 Faster rates for a variant

In this section we introduce a new algorithm, AdaBoost.S, which will enjoy the much faster rate of convergence mentioned in Conjecture 6. AdaBoost.S is the same as AdaBoost, except that at the end of each round, the current combination of weak hypotheses is scaled back, that is, multiplied by a scalar in $[0, 1]$ if doing so will reduce the exponential loss further. The code is largely the same as in Section 2, maintaining a combination $\lambda_{t-1}$ of weak hypotheses, and greedily choosing $\alpha_t$ and $\hbar_{j_t}$ on each round to form a new combination $\tilde{\lambda}_t = \lambda_{t-1} + \alpha_t e_{j_t}$. However, after creating the new combination $\tilde{\lambda}_t$, the result is multiplied by the value $s_t$ in $[0, 1]$ that causes the greatest decrease in the exponential loss: $s_t = \operatorname{argmin}_{s \in [0,1]} L(s\tilde{\lambda}_t)$, and $\lambda_t = s_t\tilde{\lambda}_t$.
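The scaling step can be sketched as follows (hypothetical code, not the authors'; since $L(s\tilde{\lambda}_t)$ is convex in $s$, a simple ternary search over $[0, 1]$ finds the minimizer):

```python
import numpy as np

def exponential_loss(M, lam):
    return np.mean(np.exp(-(M @ lam)))

def scale_back(M, lam_tilde, iters=200):
    """AdaBoost.S post-step: return s * lam_tilde with s in [0, 1] minimizing
    L(s * lam_tilde).  The loss is convex in s, so ternary search converges."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if exponential_loss(M, m1 * lam_tilde) < exponential_loss(M, m2 * lam_tilde):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi) * lam_tilde

# A made-up combination whose margins are (3, -1): scaling back strictly helps.
M = np.array([[+1, +1], [-1, +1]])
lam = scale_back(M, np.array([2.0, 1.0]))
print(lam / np.array([2.0, 1.0]))   # optimal s is ln(3)/4 in each coordinate
```

On combinations whose loss is already decreasing in $s$ at $s = 1$, the search returns $s_t = 1$ and AdaBoost.S coincides with AdaBoost.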
Since $L(s\tilde{\lambda}_t)$, as a function of $s$, is convex, its minimum on $[0, 1]$ can be found easily, for instance, using a simple binary search. The new distribution $D_{t+1}$ on the examples is constructed using $\lambda_t$ as before; the weight $D_{t+1}(i)$ on example $i$ is proportional to its exponential loss, $D_{t+1}(i) \propto e^{-(M\lambda_t)_i}$. With this modification we may prove the following:

Theorem 7 For any $\lambda^*$ and $\varepsilon > 0$, AdaBoost.S achieves at most $L(\lambda^*) + \varepsilon$ loss within $3\|\lambda^*\|_1^2/\varepsilon$ rounds.

The proof is similar to that in the previous section. Reusing the same notation, note that the proof of Lemma 2 continues to hold (with very minor modifications that are straightforward). Next we can exploit the changes in AdaBoost.S to show an improved version of Lemma 3. Intuitively, scaling back has the effect of preventing the weights on the weak hypotheses from becoming "too large," and we may show:

Lemma 8 In each round $t$, the edge satisfies $\delta_t \ge R_{t-1}/B$.

Proof We will reuse parts of the proof of Lemma 3. Setting $\lambda = \lambda^*$ in (6) we may write
$$R_{t-1} \le \sum_{i=1}^m D_{\lambda_{t-1}}(i)(M\lambda^*)_i + \sum_{i=1}^m -D_{\lambda_{t-1}}(i)(M\lambda_{t-1})_i.$$
The first summation can be upper bounded as in (7) by $\delta_t\|\lambda^*\|_1 = \delta_t B$. We will next show that the second summation is non-positive, which will complete the proof. The scaling step was added just so that this last fact would be true. If we define $G : [0, 1] \to \mathbb{R}$ by $G(s) = L(s\tilde{\lambda}_t) = \frac{1}{m}\sum_i e^{-s(M\tilde{\lambda}_t)_i}$, then observe that the scaled derivative $G'(s)/G(s)$ is exactly equal to the second summation. Since $G(s) \ge 0$, it suffices to show the derivative $G'(s) \le 0$ at the optimum value of $s$, denoted by $s^*$. Since $G$ is a strictly convex function ($\forall s : G''(s) > 0$), it is either strictly increasing or strictly decreasing throughout $[0, 1]$, or it has a local minimum.
In the case when it is strictly decreasing throughout, $G'(s) \le 0$ everywhere, whereas if $G$ has a local minimum, then $G'(s) = 0$ at $s^*$. We finish the proof by showing that $G$ cannot be strictly increasing throughout $[0, 1]$. If it were, we would have $L(\tilde{\lambda}_t) = G(1) > G(0) = 1$, an impossibility since the loss decreases through rounds.

Lemmas 2 and 8 together now imply Theorem 7, where we used that $2/\ln 2 < 3$. In experiments we ran, the scaling back never occurred. For such datasets, AdaBoost and AdaBoost.S are identical. We believe that even for contrived examples, the rescaling could happen only a few times, implying that both AdaBoost and AdaBoost.S would enjoy the convergence rates of Theorem 7. In the next section, we construct rate lower bound examples to show that this is nearly the best rate one can hope to show.

3.3 Lower Bounds

Here we show that the dependence of the rate in Theorem 1 on the norm $\|\lambda^*\|_1$ of the solution achieving target accuracy is necessary for a wide class of datasets. The arguments in this section are not tailored to AdaBoost, but hold more generally for any coordinate descent algorithm, and can be readily generalized to any loss function $L'$ of the form $L'(\lambda) = (1/m)\sum_i \phi\left((M\lambda)_i\right)$, where $\phi : \mathbb{R} \to \mathbb{R}$ is any non-increasing function. The first lemma connects the size of a reference solution to the required number of rounds of boosting, and shows that for a wide variety of datasets the convergence rate to a target loss can be lower bounded by the $\ell_1$-norm of the smallest solution achieving that loss.

Lemma 9 Suppose the feature matrix $M$ corresponding to a dataset has two rows with $\{-1, +1\}$ entries which are complements of each other, i.e., there are two examples on which any hypothesis gets one wrong and one correct prediction.
Then the number of rounds required to achieve a target loss $L^*$ is at least $\inf\{\|\lambda\|_1 : L(\lambda) \le L^*\}/(2\ln m)$.

    − + + + +
    + − − − −
    0 + − − −
    0 0 + − −
    0 0 0 + −
    0 0 0 0 +

Figure 2: The matrix used in Theorem 10 when m = 6.

Proof We first show that the two examples corresponding to the complementary rows in $M$ both satisfy a certain margin boundedness property. Since each hypothesis predicts oppositely on these, in any round $t$ their margins will be of equal magnitude and opposite sign. Unless both margins lie in $[-\ln m, \ln m]$, one of them will be smaller than $-\ln m$. But then the exponential loss $L(\lambda_t) = (1/m)\sum_j e^{-(M\lambda_t)_j}$ in that round will exceed 1, a contradiction since the losses are non-increasing through rounds, and the loss at the start was 1. Thus, assigning one of these examples the index $i$, we have that the absolute margin $|(M\lambda_t)_i|$ is bounded by $\ln m$ in any round $t$. Letting $M(i)$ denote the $i$th row of $M$, the step length $\alpha_t$ in round $t$ therefore satisfies
$$|\alpha_t| = |M_{ij_t}\alpha_t| = |\langle M(i), \alpha_t e_{j_t}\rangle| = \left|(M\lambda_t)_i - (M\lambda_{t-1})_i\right| \le \left|(M\lambda_t)_i\right| + \left|(M\lambda_{t-1})_i\right| \le 2\ln m,$$
and the statement of the lemma directly follows.

When the weak hypotheses are abstaining (Schapire and Singer, 1999), a hypothesis can make a definitive prediction that the label is $-1$ or $+1$, or it can "abstain" by predicting zero. No other levels of confidence are allowed, and the resulting feature matrix has entries in $\{-1, 0, +1\}$. The next theorem constructs a feature matrix satisfying the properties of Lemma 9, where additionally the smallest size of a solution achieving $L^* + \varepsilon$ loss is at least $\Omega(2^m)\ln(1/\varepsilon)$, for some fixed $L^*$ and every $\varepsilon > 0$.

Theorem 10 Consider the following matrix $M$ with $m$ rows (or examples) labeled $0, \ldots, m-1$ and $m-1$ columns labeled $1, \ldots, m-1$ (assume $m \ge 3$).
The square sub-matrix ignoring row zero is an upper triangular matrix, with $1$'s on the diagonal, $-1$'s above the diagonal, and $0$'s below the diagonal. Therefore row 1 is $(+1, -1, -1, \ldots, -1)$. Row 0 is defined to be the complement of row 1. Then, for any $\varepsilon > 0$, a loss of $2/m + \varepsilon$ is achievable on this dataset, but only with large norms:
$$\inf\{\|\lambda\|_1 : L(\lambda) \le 2/m + \varepsilon\} \ge (2^{m-2} - 1)\ln(1/(3\varepsilon)).$$
Therefore, by Lemma 9, the minimum number of rounds required for reaching loss at most $2/m + \varepsilon$ is at least
$$\frac{2^{m-2} - 1}{2\ln m}\ln(1/(3\varepsilon)).$$

A picture of the matrix constructed in the above theorem is shown in Figure 2. Theorem 10 shows that when $\varepsilon$ is a small constant (say $\varepsilon = 0.01$), and $\lambda^*$ is some vector with loss $L^* + \varepsilon/2$, AdaBoost takes at least $\Omega(2^m/\ln m)$ steps to get within $\varepsilon/2$ of the loss achieved by $\lambda^*$, that is, to within $L^* + \varepsilon$ loss. Since $m$ and $\varepsilon$ are independent quantities, this shows that a polynomial dependence on the norm of the reference solution is unavoidable, and this norm might be exponential in the number of training examples in the worst case.

Corollary 11 Consider feature matrices containing only $\{-1, 0, +1\}$ entries. If, for some constants $c$ and $\beta$, the bound in Theorem 1 can be replaced by $O\left(\|\lambda^*\|_1^c\,\varepsilon^{-\beta}\right)$ for all such matrices, then $c \ge 1$. Further, for such matrices, the bound $\mathrm{poly}(1/\varepsilon, \|\lambda^*\|_1)$ in Theorem 1 cannot be replaced by $\mathrm{poly}(1/\varepsilon, m, N)$.

We now prove Theorem 10.

Proof We first lower bound the norm of solutions achieving loss at most $2/m + \varepsilon$. Observe that since rows 0 and 1 are complementary, any solution's loss on just examples 0 and 1 will add up to at least $2/m$. Therefore, to get within $2/m + \varepsilon$, the margins on examples $2, \ldots, m-1$ should be at least $\ln((m-2)/(m\varepsilon)) \ge \ln(1/(3\varepsilon))$ (for $m \ge 3$).
Now, the feature matrix is designed so that the margins due to a combination $\lambda$ satisfy the following recursive relationships:
$$(M\lambda)_{m-1} = \lambda_{m-1}, \qquad (M\lambda)_i = \lambda_i - (\lambda_{i+1} + \cdots + \lambda_{m-1}), \quad \text{for } 1 \le i \le m-2.$$
Therefore, requiring the margin on example $m-1$ to be at least $\ln(1/(3\varepsilon))$ implies $\lambda_{m-1} \ge \ln(1/(3\varepsilon))$. Similarly, $\lambda_{m-2} \ge \ln(1/(3\varepsilon)) + \lambda_{m-1} \ge 2\ln(1/(3\varepsilon))$. Continuing this way,
$$\lambda_i \ge \ln\left(\frac{1}{3\varepsilon}\right) + \lambda_{i+1} + \cdots + \lambda_{m-1} \ge \ln\left(\frac{1}{3\varepsilon}\right)\left\{1 + 2^{(m-1)-(i+1)} + \cdots + 2^0\right\} = \ln\left(\frac{1}{3\varepsilon}\right)2^{m-1-i},$$
for $i = m-1, \ldots, 2$. Hence $\|\lambda\|_1 \ge \ln(1/(3\varepsilon))(1 + 2 + \cdots + 2^{m-3}) = (2^{m-2} - 1)\ln(1/(3\varepsilon))$.

We end by showing that a loss of at most $2/m + \varepsilon$ is achievable. The above argument implies that if $\lambda_i = 2^{m-1-i}$ for $i = 2, \ldots, m-1$, then examples $2, \ldots, m-1$ attain margin exactly 1. If we choose $\lambda_1 = \lambda_2 + \cdots + \lambda_{m-1} = 2^{m-3} + \cdots + 1 = 2^{m-2} - 1$, then the recursive relationship implies a zero margin on example 1 (and hence example 0). Therefore the combination $\ln(1/\varepsilon)\left(2^{m-2} - 1, 2^{m-3}, 2^{m-4}, \ldots, 1\right)$ achieves a loss of $(2 + (m-2)\varepsilon)/m \le 2/m + \varepsilon$, for any $\varepsilon > 0$.

We finally show that if the weak hypotheses are confidence-rated with arbitrary levels of confidence, so that the feature matrix is allowed to have non-integral entries in $[-1, +1]$, then the minimum norm of a solution achieving a fixed accuracy can be arbitrarily large. Our constructions will satisfy the requirements of Lemma 9, so that the norm lower bound translates into a rate lower bound.

Theorem 12 Let $\nu > 0$ be an arbitrary number, and let $M$ be the (possibly) non-integral matrix with 4 examples and 2 weak hypotheses shown in Figure 3. Then for any $\varepsilon > 0$, a loss of $1/2 + \varepsilon$ is achievable on this dataset, but only with large norms:
$$\inf\{\|\lambda\|_1 : L(\lambda) \le 1/2 + \varepsilon\} \ge 2\ln(1/(2\varepsilon))\nu^{-1}.$$
Therefore, by Lemma 9, the number of rounds required to achieve loss at most $1/2 + \varepsilon$ is at least $\ln(1/(2\varepsilon))\nu^{-1}/\ln m$.

    −1       +1
    +1       −1
    −1+ν     +1
    +1       −1+ν

Figure 3: A picture of the matrix used in Theorem 12.

Proof We first show a loss of $1/2 + \varepsilon$ is achievable. Observe that the vector $\lambda = (c, c)$, with $c = \nu^{-1}\ln(1/(2\varepsilon))$, achieves margins $0, 0, \ln(1/(2\varepsilon)), \ln(1/(2\varepsilon))$ on examples $1, 2, 3, 4$, respectively. Therefore $\lambda$ achieves loss $1/2 + \varepsilon$. We next show a lower bound on the norm of a solution achieving this loss. Observe that since the first two rows are complementary, the loss due to just the first two examples is at least $1/2$. Therefore, any solution $\lambda = (\lambda_1, \lambda_2)$ achieving at most $1/2 + \varepsilon$ loss overall must achieve a margin of at least $\ln(1/(2\varepsilon))$ on both the third and fourth examples. By inspecting the two columns, this implies
$$\lambda_1 - \lambda_2 + \lambda_2\nu \ge \ln(1/(2\varepsilon)), \qquad \lambda_2 - \lambda_1 + \lambda_1\nu \ge \ln(1/(2\varepsilon)).$$
Adding the two inequalities, we find
$$\nu(\lambda_1 + \lambda_2) \ge 2\ln(1/(2\varepsilon)) \implies \lambda_1 + \lambda_2 \ge 2\nu^{-1}\ln(1/(2\varepsilon)).$$
Since $\|\lambda\|_1 \ge \lambda_1 + \lambda_2$, the theorem follows.

Note that if $\nu = 0$, then the optimal solution is found in zero rounds of boosting and has optimal loss 1. However, even the tiniest perturbation $\nu > 0$ causes the optimal loss to fall to $1/2$, and causes the convergence to slow drastically. In fact, by Theorem 12, the number of rounds required to achieve any fixed loss below 1 grows as $\Omega(1/\nu)$, which is arbitrarily large when $\nu$ is infinitesimal. We may conclude that with non-integral feature matrices, the dependence of the rate on the norm of a reference solution is absolutely necessary.
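The achievability half of Theorem 12 is easy to check numerically (a sketch with hypothetical values of $\nu$ and $\varepsilon$):

```python
import numpy as np

nu, eps = 0.1, 0.01
M = np.array([[-1.0,      +1.0],
              [+1.0,      -1.0],
              [-1.0 + nu, +1.0],
              [+1.0,      -1.0 + nu]])     # the matrix of Figure 3

c = np.log(1 / (2 * eps)) / nu             # the large coefficient forced by nu
lam = np.array([c, c])
margins = M @ lam
loss = np.mean(np.exp(-margins))
print(margins[:2], loss)   # margins 0, 0 on the complementary pair; loss = 1/2 + eps
```

Getting below $1/2 + \varepsilon$ by any amount instead forces $\|\lambda\|_1 \ge 2\nu^{-1}\ln(1/(2\varepsilon))$, which blows up as $\nu \to 0$.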
Corollary 13 When using confidence-rated weak hypotheses with arbitrary confidence levels, the bound $\mathrm{poly}(1/\varepsilon, \|\lambda^*\|_1)$ in Theorem 1 cannot be replaced by any function of purely $m$, $N$ and $\varepsilon$ alone.

The construction in Figure 3 can be generalized to produce datasets with any number of examples that suffer the same poor rate of convergence as the one in Theorem 12. We discussed the smallest such construction, since we feel that it best highlights the drastic effect non-integrality can have on the rate.

In this section we saw how the norm of the reference solution is an important parameter for bounding the convergence rate. In the next section we investigate the optimal dependence of the rate on the parameter $\varepsilon$ and show that $\Omega(1/\varepsilon)$ rounds are necessary in the worst case.

4. Second convergence rate: Convergence to optimal loss

In the previous section, our rate bound depended on both the approximation parameter $\varepsilon$ and the size of the smallest solution achieving the target loss. For many datasets, the optimal target loss $\inf_\lambda L(\lambda)$ cannot be realized by any finite solution. In such cases, if we want to bound the number of rounds needed to achieve within $\varepsilon$ of the optimal loss, the only way to use Theorem 1 is to first decompose the accuracy parameter $\varepsilon$ into two parts, $\varepsilon = \varepsilon_1 + \varepsilon_2$, find some finite solution $\lambda^*$ achieving within $\varepsilon_1$ of the optimal loss, and then use the bound $\mathrm{poly}(1/\varepsilon_2, \|\lambda^*\|_1)$ to achieve at most $L(\lambda^*) + \varepsilon_2 \le \inf_\lambda L(\lambda) + \varepsilon$ loss. However, this introduces implicit dependence on $\varepsilon$ through $\|\lambda^*\|_1$, which may not be immediately clear. In this section, we show bounds of the form $C/\varepsilon$, where the constant $C$ depends only on the feature matrix $M$, and not on $\varepsilon$.
Additionally, we show that this dependence on ε is optimal in Lemma 31 of the Appendix, where Ω(1/ε) rounds are shown to be necessary for converging to within ε of the optimal loss on a certain dataset. Finally, we note that the lower bounds in the previous section indicate that C can be Ω(2^m) in the worst case for integer matrices (although it will typically be much smaller), and hence this bound, though stronger than that of Theorem 1 with respect to ε, cannot be used to prove the conjecture in (Schapire, 2010), since the constant is not polynomial in the number of examples m.

4.1 Upper Bound

The main result of this section is the following rate upper bound. A similar approach to solving this problem was taken independently by Telgarsky (2011).

Theorem 14 AdaBoost reaches within ε of the optimal loss in at most C/ε rounds, where C only depends on the feature matrix.

Our techniques build upon earlier work on the rate of convergence of AdaBoost, which has mainly considered two particular cases. In the first case, the weak learning assumption holds, that is, the edge in each round is at least some fixed constant. In this situation, Freund and Schapire (1997) and Schapire and Singer (1999) show that the optimal loss is zero, that no solution with finite size can achieve this loss, but AdaBoost achieves at most ε loss within O(ln(1/ε)) rounds. In the second case, some finite combination of the weak classifiers achieves the optimal loss, and Rätsch et al. (2002), using results from Luo and Tseng (1992), show that AdaBoost achieves within ε of the optimal loss again within O(ln(1/ε)) rounds. Here we consider the most general situation, where the weak learning assumption may fail to hold, and yet no finite solution may achieve the optimal loss. The dataset used in Lemma 31 and shown in Figure 4 exemplifies this situation.
Our main technical contribution shows that the examples in any dataset can be partitioned into a zero-loss set and a finite-margin set, such that a certain form of the weak learning assumption holds within the zero-loss set, while the optimal loss considering only the finite-margin set can be obtained by some finite solution. The two parts of the partition provide different ways of making progress in every round, and one of the two kinds of progress will always be sufficient for us to prove Theorem 14.

We next state our decomposition result, illustrate it with an example, and then state several lemmas quantifying the nature of the progress we can make in each round. Using these lemmas, we prove Theorem 14.

Lemma 15 (Decomposition Lemma) For any dataset, there exists a partition of the set of training examples X into a (possibly empty) zero-loss set Z and a (possibly empty) finite-margin set F = Z^c = X \ Z such that the following hold simultaneously:

1. For some positive constant γ > 0, there exists some vector η† with unit ℓ₁-norm ‖η†‖₁ = 1 that attains at least γ margin on each example in Z, and exactly zero margin on each example in F:

∀i ∈ Z: (Mη†)_i ≥ γ,  ∀i ∈ F: (Mη†)_i = 0.

2. The optimal loss considering only examples within F is achieved by some finite combination η*.

3. There is a constant μ_max < ∞ such that, for any combination η with bounded loss on the finite-margin set, Σ_{i∈F} e^{−(Mη)_i} ≤ m, the margin (Mη)_i for any example i in F lies in the bounded interval [−ln m, μ_max].

A proof is deferred to the next section.
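For small datasets the partition can be computed directly: an example belongs to the zero-loss set exactly when some combination whose margins are all non-negative gives that example strictly positive margin, since scaling such a combination to infinity drives its loss to zero while keeping every loss bounded. The following sketch (our illustration, not from the paper; it assumes numpy and scipy) searches for such a combination by linear programming:

```python
import numpy as np
from scipy.optimize import linprog

def zero_loss_set(M):
    """Compute the zero-loss set of a feature matrix M (rows = examples).

    Example i is placed in Z when some eta with M @ eta >= 0 (elementwise)
    achieves strictly positive margin on i.  We search for eta by LP,
    writing eta = u - v with u, v >= 0 and sum(u + v) <= 1 so the problem
    stays bounded.
    """
    m, N = M.shape
    Z = set()
    for i in range(m):
        c = np.concatenate([-M[i], M[i]])        # maximize (M eta)_i
        A_ub = np.vstack([np.hstack([-M, M]),    # enforce M eta >= 0
                          np.ones((1, 2 * N))])  # ||eta||_1 <= 1
        b_ub = np.concatenate([np.zeros(m), [1.0]])
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
        if res.success and -res.fun > 1e-9:      # positive margin achievable
            Z.add(i)
    return Z

# The three-example dataset used below in Figure 4 (rows a, b, c).
M = np.array([[ 1., -1.],
              [-1.,  1.],
              [ 1.,  1.]])
print(zero_loss_set(M))  # {2}: only example c is in the zero-loss set
```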
The decomposition lemma immediately implies that the vector η* + ∞·η†, which denotes η* + cη† in the limit c → ∞, is an optimal solution, achieving zero loss on the zero-loss set, but only finite margins (and hence positive losses) on the finite-margin set (thereby justifying the names).

Figure 4: A dataset requiring Ω(1/ε) rounds for convergence. The three examples a, b, c and the two hypotheses h₁, h₂ have entries a: (+, −); b: (−, +); c: (+, +).

Before proceeding, we give an example dataset and indicate the zero-loss set, finite-margin set, η* and η† to illustrate our definitions. Consider a dataset with three examples {a, b, c} and two hypotheses {h₁, h₂}, with the feature matrix M shown in Figure 4. Here + means correct (M_ij = +1) and − means wrong (M_ij = −1). The optimal solution is ∞·(h₁ + h₂) with a loss of 2/3. The finite-margin set is {a, b}, the zero-loss set is {c}, η† = (1/2, 1/2) and η* = (0, 0); for this dataset these are unique. This dataset also serves as a lower-bound example in Lemma 31, where we show that 2/(9ε) rounds are necessary for AdaBoost to achieve loss at most (2/3) + ε.

Before providing proofs, we introduce some notation. By ‖·‖ we will mean the ℓ₂-norm; every other norm will carry an appropriate subscript, such as ‖·‖₁, ‖·‖∞, etc. The set of all training examples will be denoted by X. By ℓ_λ(i) we mean the exp-loss e^{−(Mλ)_i} on example i. For any subset S ⊆ X of examples, ℓ_λ(S) = Σ_{i∈S} ℓ_λ(i) denotes the total exp-loss on the set S. Notice that L(λ) = (1/m) ℓ_λ(X), and that D_{t+1}(i) = ℓ_{λ_t}(i)/ℓ_{λ_t}(X), where λ_t is the combination found by AdaBoost at the end of round t. By δ_S(η; λ) we mean the edge obtained on the set S by the vector η, when the weights over the examples are given by ℓ_λ(·)/ℓ_λ(S):

δ_S(η; λ) = (1/ℓ_λ(S)) Σ_{i∈S} ℓ_λ(i)(Mη)_i.
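To see the Ω(1/ε) behaviour on the Figure 4 dataset concretely, one can run AdaBoost's coordinate descent directly. A sketch (ours, assuming numpy); the closed-form step size is an exact line search here because M has only ±1 entries:

```python
import numpy as np

def adaboost_rounds(M, target, max_rounds=100000):
    """Greedy coordinate descent on the exponential loss (AdaBoost with
    exact line search, valid here because M has only +/-1 entries).
    Returns the number of rounds until the normalized loss reaches target."""
    m, N = M.shape
    lam = np.zeros(N)
    for t in range(1, max_rounds + 1):
        w = np.exp(-M @ lam)
        D = w / w.sum()                 # AdaBoost's distribution D_t
        edges = D @ M                   # edge of each coordinate direction
        j = int(np.argmax(np.abs(edges)))
        delta = edges[j]
        lam[j] += 0.5 * np.log((1 + delta) / (1 - delta))
        if np.mean(np.exp(-M @ lam)) <= target:
            return t
    return max_rounds

# Figure 4 dataset: optimal loss is 2/3.
M = np.array([[ 1., -1.], [-1.,  1.], [ 1.,  1.]])
for eps in (0.1, 0.01, 0.001):
    print(eps, adaboost_rounds(M, 2/3 + eps))
```

In our runs the printed round counts grow roughly in proportion to 1/ε, consistent with the 2/(9ε) lower bound and the C/ε upper bound.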
In the rest of the section, by "loss" we mean the unnormalized loss ℓ_λ(X) = mL(λ), and we show that in C/ε rounds AdaBoost converges to within ε of the optimal unnormalized loss inf_λ ℓ_λ(X), henceforth denoted K. Note that this means AdaBoost takes C/ε rounds to converge to within ε/m of the optimal normalized loss, that is, to loss at most inf_λ L(λ) + ε/m. Replacing ε by mε, it takes C/(mε) steps to attain normalized loss at most inf_λ L(λ) + ε. Thus, whether we use the normalized or the unnormalized loss does not substantively affect the result in Theorem 14.

The progress due to the zero-loss set is now immediate from Item 1 of the decomposition lemma:

Lemma 16 In any round t, the maximum edge δ_t is at least γ ℓ_{λ_{t−1}}(Z)/ℓ_{λ_{t−1}}(X), where γ is as in Item 1 of the decomposition lemma.

Proof Recall that the distribution D_t created by AdaBoost in round t puts weight D_t(i) = ℓ_{λ_{t−1}}(i)/ℓ_{λ_{t−1}}(X) on each example i. From Item 1 we get

δ_X(η†; λ_{t−1}) = (1/ℓ_{λ_{t−1}}(X)) Σ_{i∈X} ℓ_{λ_{t−1}}(i)(Mη†)_i ≥ (1/ℓ_{λ_{t−1}}(X)) Σ_{i∈Z} γ ℓ_{λ_{t−1}}(i) = γ ℓ_{λ_{t−1}}(Z)/ℓ_{λ_{t−1}}(X).

Since (Mη†)_i = Σ_j η†_j (Me_j)_i, we may rewrite the edge δ_X(η†; λ_{t−1}) as follows:

δ_X(η†; λ_{t−1}) = (1/ℓ_{λ_{t−1}}(X)) Σ_{i∈X} ℓ_{λ_{t−1}}(i) Σ_j η†_j (Me_j)_i = Σ_j η†_j δ_X(e_j; λ_{t−1}) ≤ Σ_j |η†_j| |δ_X(e_j; λ_{t−1})|.

Since the ℓ₁-norm of η† is 1, the weights |η†_j| form some distribution p over the columns 1, ..., N. We may therefore conclude

γ ℓ_{λ_{t−1}}(Z)/ℓ_{λ_{t−1}}(X) ≤ δ_X(η†; λ_{t−1}) ≤ E_{j∼p} |δ_X(e_j; λ_{t−1})| ≤ max_j |δ_X(e_j; λ_{t−1})| ≤ δ_t.

If the set F were empty, then Lemma 16 implies that an edge of γ is available in each round.
This in fact means that the weak learning assumption holds, and using (4) we can show an O(ln(1/ε)γ⁻²) bound matching the rate bounds of Freund and Schapire (1997) and Schapire and Singer (1999). So henceforth, we assume that F is non-empty. Note that this implies that the optimal loss K is at least 1 (since any solution will get non-positive margin on some example in F), a fact we will use later in the proofs.

Lemma 16 says that the edge is large if the loss on the zero-loss set is large. On the other hand, when it is small, Lemmas 17 and 18 together show how AdaBoost can make good progress using the finite-margin set. Lemma 17 uses second order methods to show how progress is made in the case where there is a finite solution. Similar arguments, under additional assumptions, have earlier appeared in (Rätsch et al., 2002).

Lemma 17 Suppose λ is a combination such that m ≥ ℓ_λ(F) ≥ K. Then in some coordinate direction the edge is at least √(C₀(ℓ_λ(F) − K)/ℓ_λ(F)), where C₀ is a constant depending only on the feature matrix M.

Proof Let M_F ∈ R^{|F|×N} be the matrix M restricted to only the rows corresponding to the examples in F. Choose η such that λ + η = η* is an optimal solution over F. Without loss of generality assume that η lies in the orthogonal complement of the null-space {u : M_F u = 0} of M_F (since we can translate η* along the null space if necessary for this to hold). If η = 0, then ℓ_λ(F) = K and we are done. Otherwise ‖M_F η‖ ≥ λ_min ‖η‖, where λ²_min is the smallest positive eigenvalue of the symmetric matrix M_F^T M_F (which exists since M_F η ≠ 0). Now define f : [0, 1] → R as the loss along the (rescaled) segment [η*, λ]:

f(x) = ℓ_{(η* − xη)}(F) = Σ_{i∈F} ℓ_{η*}(i) e^{x(Mη)_i}.

This implies that f(0) = K and f(1) = ℓ_λ(F).
Notice that the first and second derivatives of f(x) are given by

f′(x) = Σ_{i∈F} (M_F η)_i ℓ_{(η*−xη)}(i),  f″(x) = Σ_{i∈F} (M_F η)²_i ℓ_{(η*−xη)}(i).

We next lower bound the possible values of the second derivative as follows:

f″(x) = Σ_{i′∈F} (M_F η)²_{i′} ℓ_{(η*−xη)}(i′) ≥ (Σ_{i′∈F} (M_F η)²_{i′}) min_{i∈F} ℓ_{(η*−xη)}(i) = ‖M_F η‖² min_{i∈F} ℓ_{(η*−xη)}(i).

Since both λ = η* − η and η* suffer total loss at most m, by convexity so does η* − xη for any x ∈ [0, 1]. Hence we may apply Item 3 of the decomposition lemma to the vector η* − xη, for any x ∈ [0, 1], to conclude that ℓ_{(η*−xη)}(i) = exp{−(M_F(η* − xη))_i} ≥ e^{−μ_max} on every example i. Therefore we have

f″(x) ≥ ‖M_F η‖² e^{−μ_max} ≥ λ²_min e^{−μ_max} ‖η‖²  (by choice of η).

A standard second-order result is (see e.g. Boyd and Vandenberghe, 2004, eqn. (9.9))

f′(1)² ≥ 2 inf_{x∈[0,1]} f″(x) · (f(1) − f(0)).

Collecting our results so far, we get

Σ_{i∈F} ℓ_λ(i)(Mη)_i = f′(1) ≥ ‖η‖ √(2λ²_min e^{−μ_max} (ℓ_λ(F) − K)).

Next let η̃ = η/‖η‖₁ be η rescaled to have unit ℓ₁-norm. Then we have

Σ_{i∈F} ℓ_λ(i)(Mη̃)_i = (1/‖η‖₁) Σ_i ℓ_λ(i)(Mη)_i ≥ (‖η‖/‖η‖₁) √(2λ²_min e^{−μ_max} (ℓ_λ(F) − K)).

Applying the Cauchy–Schwarz inequality, we may lower bound ‖η‖/‖η‖₁ by 1/√N (since η ∈ R^N). Along with the fact that ℓ_λ(F) ≤ m, we may write

(1/ℓ_λ(F)) Σ_{i∈F} ℓ_λ(i)(Mη̃)_i ≥ √(2λ²_min N⁻¹ m⁻¹ e^{−μ_max}) · √((ℓ_λ(F) − K)/ℓ_λ(F)).

If we define p to be a distribution on the columns {1, ..., N} of M_F which puts probability p(j) proportional to |η̃_j| on column j, then we have

(1/ℓ_λ(F)) Σ_{i∈F} ℓ_λ(i)(Mη̃)_i ≤ E_{j∼p} |(1/ℓ_λ(F)) Σ_{i∈F} ℓ_λ(i)(Me_j)_i| ≤ max_j |(1/ℓ_λ(F)) Σ_{i∈F} ℓ_λ(i)(Me_j)_i|.
Notice that the quantity inside the max is precisely (the magnitude of) the edge δ_F(e_j; λ) in direction j. Combining everything, the maximum possible edge is

max_j δ_F(e_j; λ) ≥ √(C₀(ℓ_λ(F) − K)/ℓ_λ(F)),

where we define C₀ = 2λ²_min N⁻¹ m⁻¹ e^{−μ_max}.

Lemma 18 Suppose, at some stage of boosting, the combination found by AdaBoost is λ, and the loss is K + θ. Let Δθ denote the drop in the suboptimality θ after one more round; i.e., the loss after one more round is K + θ − Δθ. Then there are constants C₁, C₂ depending only on the feature matrix (and not on θ), such that if ℓ_λ(Z) < C₁θ, then Δθ ≥ C₂θ.

Proof Let λ be the current solution found by boosting. Using Lemma 17, pick a direction j in which the edge δ_F(e_j; λ) restricted to the finite-margin set is at least √(2C₀(ℓ_λ(F) − K)/ℓ_λ(F)). We can bound the edge δ_X(e_j; λ) on the entire set of examples as follows:

δ_X(e_j; λ) = (1/ℓ_λ(X)) [ Σ_{i∈F} ℓ_λ(i)(Me_j)_i + Σ_{i∈Z} ℓ_λ(i)(Me_j)_i ]
  ≥ (1/ℓ_λ(X)) [ ℓ_λ(F) δ_F(e_j; λ) − Σ_{i∈Z} ℓ_λ(i) ]  (using the triangle inequality)
  ≥ (1/ℓ_λ(X)) [ √(2C₀(ℓ_λ(F) − K) ℓ_λ(F)) − ℓ_λ(Z) ].

Now, ℓ_λ(Z) < C₁θ, and ℓ_λ(F) − K = θ − ℓ_λ(Z) ≥ (1 − C₁)θ. Further, we will choose C₁ < 1, so that ℓ_λ(F) ≥ K ≥ 1. Hence, the previous inequality implies

δ_X(e_j; λ) ≥ (1/(K + θ)) [ √(2C₀(1 − C₁)θ) − C₁θ ].

Set C₁ = min{1/2, (1/4)√(C₀/(2m))}. Using θ ≤ K + θ = ℓ_λ(X) ≤ m, we can bound the square of the term in brackets on the previous line as

(√(2C₀(1 − C₁)θ) − C₁θ)² ≥ 2C₀(1 − C₁)θ − 2C₁θ √(2C₀(1 − C₁)θ)
  ≥ 2C₀(1 − 1/2)θ − 2(1/4)√(C₀/(2m)) θ √(2C₀(1 − 0)m) = C₀θ/2.
So, if δ is the maximum edge in any direction, then δ ≥ δ_X(e_j; λ) ≥ √(C₀θ/(2(K + θ)²)) ≥ √(C₀θ/(2m(K + θ))), where, for the last inequality, we again used K + θ ≤ m. Therefore the loss after one more step is at most

(K + θ)√(1 − δ²) ≤ (K + θ)(1 − δ²/2) ≤ K + θ − (C₀/(4m))θ.

Setting C₂ = C₀/(4m) completes the proof.

Proof of Theorem 14 At any stage of boosting, let λ be the current combination, and K + θ the current loss. We show that the new loss is at most K + θ − Δθ for Δθ ≥ C₃θ², for some constant C₃ depending only on the dataset (and not on θ). To see this, either ℓ_λ(Z) < C₁θ, in which case Lemma 18 applies and Δθ ≥ C₂θ ≥ (C₂/m)θ² (since θ = ℓ_λ(X) − K ≤ m). Or ℓ_λ(Z) ≥ C₁θ, in which case applying Lemma 16 yields δ ≥ γC₁θ/ℓ_λ(X) ≥ (γC₁/m)θ. By (4),

Δθ ≥ ℓ_λ(X)(1 − √(1 − δ²)) ≥ ℓ_λ(X)δ²/2 ≥ (K/2)(γC₁/m)²θ².

Using K ≥ 1 and choosing C₃ appropriately gives the required condition. If K + θ_t denotes the loss in round t, then the above claim implies θ_t − θ_{t+1} ≥ C₃θ²_t. Applying Lemma 32 to the sequence {θ_t}, we have 1/θ_T − 1/θ₀ ≥ C₃T for any T. Since 1/θ₀ ≥ 0, we have T ≤ 1/(C₃θ_T). Hence to achieve loss at most K + ε, C₃⁻¹/ε rounds suffice.

4.2 Proof of the decomposition lemma

Throughout this section we only consider (unless otherwise stated) admissible combinations λ of weak classifiers, which have loss ℓ_λ(X) bounded by m (since such combinations are the ones found by boosting). We prove Lemma 15 in three steps. We begin with a simple lemma that rigorously defines the zero-loss and finite-margin sets.

Lemma 19 For any sequence η₁, η₂, ..., of admissible combinations of weak classifiers, we can find a subsequence η_(1) = η_{t₁}, η_(2) = η_{t₂}, ...
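The final step above is just the elementary fact that quadratic progress forces the inverse suboptimality to grow linearly. A small sketch (ours, not from the paper) iterates the worst case of the guarantee and checks Lemma 32's conclusion:

```python
# Worst case of the progress guarantee: theta_{t+1} = theta_t - C3 * theta_t^2,
# with illustrative values C3 = 0.05 and theta_0 = 1 (so C3 * theta_0 < 1 and
# theta stays positive).
C3, theta = 0.05, 1.0
inv0 = 1.0 / theta
for T in range(1, 2001):
    theta -= C3 * theta * theta
    # Lemma 32's conclusion: 1/theta_T - 1/theta_0 >= C3 * T.
    assert 1.0 / theta - inv0 >= C3 * T
print(theta)  # after T rounds the suboptimality is at most 1 / (C3 * T)
```

So reaching suboptimality ε takes at most about 1/(C₃ε) rounds, which is the C/ε rate of Theorem 14.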
, whose losses converge to zero on all examples in some fixed (possibly empty) subset Z (the zero-loss set), and are bounded away from zero on its complement X \ Z (the finite-margin set):

∀x ∈ Z: lim_{t→∞} ℓ_{η_(t)}(x) = 0,  ∀x ∈ X \ Z: inf_t ℓ_{η_(t)}(x) > 0.  (8)

Proof We will build the zero-loss set and the final subsequence incrementally. Initially the set is empty. Pick the first example. If the infimal loss ever attained on the example in the sequence is bounded away from zero, then we do not add it to the set. Otherwise we add it, and consider only the subsequence whose t-th element attains loss less than 1/t on the example. Beginning with this subsequence, we now repeat with the other examples. The final sequence is the required subsequence, and the examples we have added form the zero-loss set.

We apply Lemma 19 to some admissible sequence converging to the optimal loss (for instance, the one found by AdaBoost). Let us call the resulting subsequence η*_(t), the obtained zero-loss set Z, and the finite-margin set F = X \ Z. The next lemma shows how to extract a single combination out of the sequence η*_(t) that satisfies the properties in Item 1 of the decomposition lemma.

Lemma 20 Suppose M is the feature matrix, Z is a subset of the examples, and η_(1), η_(2), ..., is a sequence of combinations of weak classifiers such that Z is its zero-loss set and X \ Z its finite-margin set, that is, (8) holds. Then there is a combination η† of weak classifiers that achieves positive margin on every example in Z, and zero margin on every example in its complement X \ Z, that is:

(Mη†)_i > 0 if i ∈ Z,  (Mη†)_i = 0 if i ∈ X \ Z.

Proof Since the η_(t) achieve arbitrarily large positive margins on Z, ‖η_(t)‖ will be unbounded, and it will be hard to extract a useful single solution out of them.
On the other hand, the rescaled combinations η_(t)/‖η_(t)‖ lie on a compact set, and therefore have a limit point, which might have useful properties. We formalize this next.

We prove the statement of the lemma by induction on the total number of training examples |X|. If X is empty, then the lemma holds vacuously for any η†. Assume the lemma holds inductively for all X of size less than m > 0, and consider X of size m. Since translating a vector along the null space of M, ker M = {x : Mx = 0}, has no effect on the margins produced by the vector, assume without loss of generality that the η_(t)'s are orthogonal to ker M. Also, since the margins produced on the zero-loss set are unbounded, so are the norms of the η_(t). Therefore assume (by picking a subsequence and relabeling if necessary) that ‖η_(t)‖ > t. Let η′ be a limit point of the sequence η_(t)/‖η_(t)‖, a unit vector that is also orthogonal to the null-space. Then, firstly, η′ achieves non-negative margin on every example; otherwise, by continuity, for some extremely large t the margin of η_(t)/‖η_(t)‖ on that example is also negative and bounded away from zero, and therefore η_(t)'s loss is more than m, a contradiction to admissibility. Secondly, the margin of η′ on each example in X \ Z is zero; otherwise, by continuity, for arbitrarily large t the margin of η_(t)/‖η_(t)‖ on an example in X \ Z is positive and bounded away from zero, and hence that example attains arbitrarily small loss in the sequence, a contradiction to (8). Finally, if η′ achieves zero margin everywhere in Z, then η′, being orthogonal to the null-space, must be 0, a contradiction since η′ is a unit vector. Therefore η′ must achieve positive margin on some non-empty subset S of Z, and zero margin on every other example. Next we use induction on the reduced set of examples X′ = X \ S.
Since S is non-empty, |X′| < m. Further, using the same sequence η_(t), the zero-loss and finite-margin sets, restricted to X′, are Z′ = Z \ S and (X \ Z) \ S = X \ Z (since S ⊆ Z) = X′ \ Z′. By the inductive hypothesis, there exists some η″ which achieves positive margins on Z′, and zero margins on X′ \ Z′ = X \ Z. Therefore, by setting η† = η′ + cη″ for a large enough c, we can achieve the desired properties.

Applying Lemma 20 to the sequence η*_(t) yields some combination η† which, after rescaling to unit ℓ₁-norm, has margin at least γ > 0 (for some γ) on Z and zero margin on its complement, proving Item 1 of the decomposition lemma. The next lemma proves Item 2.

Lemma 21 The optimal loss considering only examples within F is achieved by some finite combination η*.

Proof The existence of η† with properties as in Lemma 20 implies that the optimal loss is the same whether we consider all the examples or just the examples in F. Therefore it suffices to show the existence of a finite η* that achieves loss K on F, that is, ℓ_{η*}(F) = K.

Recall that M_F denotes the matrix M restricted to the rows corresponding to examples in F. Let ker M_F = {x : M_F x = 0} be the null-space of M_F. Let η_(t) be the projection of η*_(t) onto the orthogonal complement of ker M_F. Then the losses ℓ_{η_(t)}(F) = ℓ_{η*_(t)}(F) converge to the optimal loss K. If M_F is identically zero, then each η_(t) = 0, and then η* = 0 has loss K on F. Otherwise, let λ² be the smallest positive eigenvalue of M_F^T M_F; then ‖M_F η_(t)‖ ≥ λ‖η_(t)‖. By the definition of the finite-margin set, inf_t min_{i∈F} ℓ_{η_(t)}(i) = inf_t min_{i∈F} ℓ_{η*_(t)}(i) > 0. Therefore, the norms of the margin vectors ‖M_F η_(t)‖, and hence those of the η_(t), are bounded. Therefore the η_(t)'s have a (finite) limit point η* that must have loss K over F.
As a corollary, we prove Item 3.

Lemma 22 There is a constant μ_max < ∞ such that, for any combination η that achieves bounded loss on the finite-margin set, ℓ_η(F) ≤ m, the margin (Mη)_i for any example i in F lies in the bounded interval [−ln m, μ_max].

Proof Since the loss ℓ_η(F) is at most m, no margin may be less than −ln m. To prove a finite upper bound on the margins, we argue by contradiction. Suppose arbitrarily large margins are producible by bounded-loss vectors, that is, arbitrarily large elements are present in the set {(Mη)_i : ℓ_η(F) ≤ m, 1 ≤ i ≤ m}. Then for some fixed example x ∈ F there exists a sequence of combinations of weak classifiers whose t-th element achieves margin more than t on x but has loss at most m on F. Applying Lemma 19 we can find a subsequence λ_(t) whose tail achieves vanishingly small loss on some non-empty subset S of F containing x, and bounded margins on F \ S. Applying Lemma 20 to λ_(t) we get some combination λ† which has positive margins on S and zero margin on F \ S. Let η* be as in Lemma 21, a finite combination achieving the optimal loss on F. Then η* + ∞·λ† achieves the same loss on every example in F \ S as the optimal solution η*, but zero loss on the examples in S. This solution is strictly better than η* on F, a contradiction to the optimality of η*. Therefore our assumption is false, and some finite upper bound μ_max on the margins (Mη)_i of vectors satisfying ℓ_η(F) ≤ m exists.

4.3 Investigating the constants

In this section, we try to estimate the constant C in Theorem 14. We show that it can be arbitrarily large for adversarial feature matrices with real entries (corresponding to confidence-rated weak hypotheses), but has an upper bound doubly exponential in the number of examples when the feature matrix has {−1, 0, +1} entries only.
We also show that this doubly exponential bound cannot be improved without significantly changing the proof in the previous section. By inspecting the proofs, we can bound the constant in Theorem 14 as follows.

Corollary 23 The constant C in Theorem 14 that emerges from the proofs is

C = 32 m³ N e^{μ_max} / (γ² λ²_min),

where m is the number of examples, N is the number of hypotheses, γ and μ_max are as given by Items 1 and 3 of the decomposition lemma, and λ²_min is the smallest positive eigenvalue of M_F^T M_F (M_F is the feature matrix restricted to the rows belonging to the finite-margin set F).

Our bound on C will be obtained by in turn bounding the quantities λ⁻¹_min, γ⁻¹ and μ_max. These are strongly related to the singular values of the feature matrix M, and in general cannot be easily measured. In fact, when M has real entries, we have already seen in Section 3.3 that the rate can be arbitrarily large, implying these parameters can have very large values. Even when the matrix M has integer entries (that is, −1, 0, +1), the next lemma shows that these quantities can be exponential in the number of examples.

Lemma 24 There are examples of feature matrices with −1, 0, +1 entries and at most m rows or columns (where m > 10) for which the quantities γ⁻¹, λ⁻¹_min and μ_max are at least Ω(2^m/m).

Proof We first show the bounds for γ and λ_min. Let M be an m × m upper triangular matrix with +1 on the diagonal, and −1 above the diagonal. Let y = (2^{m−1}, 2^{m−2}, ..., 1)^T, and b = (1, 1, ..., 1)^T. Then My = b, although y has much bigger norm than b: ‖y‖ ≥ 2^{m−1}, while ‖b‖ = √m. Since M is invertible, by the definition of λ_min we have ‖My‖ ≥ λ_min‖y‖, so that λ⁻¹_min ≥ ‖y‖/‖My‖ ≥ 2^{m−1}/√m ≥ 2^m/m.
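This construction can be checked numerically; the sketch below (ours, assuming numpy, with m = 12) verifies My = b and the exponential blow-up of λ⁻¹_min:

```python
import numpy as np

m = 12
# Upper triangular matrix: +1 on the diagonal, -1 strictly above it.
M = np.eye(m) + np.triu(-np.ones((m, m)), k=1)

y = 2.0 ** np.arange(m - 1, -1, -1)   # y = (2^{m-1}, ..., 2, 1)
print(M @ y)                          # b = (1, 1, ..., 1)

# lambda_min^2 is the smallest positive eigenvalue of M^T M
# (eigvalsh returns eigenvalues in ascending order).
lam_min = np.sqrt(np.linalg.eigvalsh(M.T @ M)[0])
print(1.0 / lam_min, 2.0 ** m / m)    # 1/lambda_min exceeds 2^m / m
```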
Next, note that y produces all positive margins b, and hence the zero-loss set consists of all the examples. In particular, if η† is as in Item 1 of the decomposition lemma, then the vector γ⁻¹η† achieves margin more than 1 on each example: M(γ⁻¹η†) ≥ b. On the other hand, our matrix is very similar to the one in Lemma 10, and the same arguments as in the proof of that lemma can be used to show that if for some x we have Mx ≥ b, then x ≥ y. This implies that γ⁻¹‖η†‖₁ ≥ ‖y‖₁ = 2^m − 1. Since η† has unit ℓ₁-norm, the bound on γ⁻¹ follows too.

Next we provide an example showing that μ_max can be Ω(2^m/m). Consider an m × (m−1) matrix M. The bottom row of M is all +1. The upper (m−1) × (m−1) submatrix of M is a lower triangular matrix with −1 on the diagonal and +1 below the diagonal. Observe that if y^T = (2^{m−2}, 2^{m−3}, ..., 1, 1), then y^T M = 0. Therefore, for any vector x, the inner product of the margins Mx with y is zero: y^T Mx = 0. This implies that achieving positive margin on any example forces some other example to receive negative margin. By Item 1 of the decomposition lemma, the zero-loss set of this dataset is empty, and all the examples belong to the finite-margin set. Next, we choose a combination with loss at most m that nevertheless achieves Ω(2^m/m) positive margin on some example. Let x^T = (1, 2, 4, ..., 2^{m−2}). Then (Mx)^T = (−1, −1, ..., −1, 2^{m−1} − 1), so the margins using εx are (−ε, ..., −ε, ε(2^{m−1} − 1)), with total loss (m − 1)e^ε + e^{−ε(2^{m−1} − 1)}. Choose ε = 1/(2m) ≤ 1, so that the loss on the examples corresponding to the first m − 1 rows is at most e^ε ≤ 1 + 2ε = 1 + 1/m, where the first inequality holds since ε ∈ [0, 1].
For m > 10, the choice of ε guarantees 1/(2m) = ε ≥ (ln m)/(2^{m−1} − 1), so that the loss on the example corresponding to the bottommost row is e^{−ε(2^{m−1} − 1)} ≤ e^{−ln m} = 1/m. Therefore the net loss of εx is at most (m − 1)(1 + 1/m) + 1/m = m. On the other hand, the margin on the example corresponding to the last row is ε(2^{m−1} − 1) = (2^{m−1} − 1)/(2m) = Ω(2^m/m).

The above result implies that any bound on C derived from Corollary 23 will be at least 2^{Ω(2^m/m)} in the worst case. This does not imply that the best bound one can hope to prove is doubly exponential, only that our techniques in the previous section do not admit anything better. We next show that the bounds in Lemma 24 are nearly the worst possible.

Lemma 25 Suppose each entry of M is −1, 0 or +1. Then each of the quantities λ⁻¹_min, γ⁻¹ and μ_max is at most 2^{O(m ln m)}.

The proof of Lemma 25 is rather technical, and we defer it to the Appendix. Lemma 25 and Corollary 23 together imply a convergence rate of 2^{2^{O(m ln m)}}/ε to the optimal loss for integer matrices. This bound on C is exponentially worse than the Ω(2^m) lower bound on C we saw in Section 3.3, a price we pay for obtaining optimal dependence on ε. In the next section we will see how to obtain poly(2^{m ln m}, ε⁻¹) bounds, although with a worse dependence on ε. We end this section by showing, just for completeness, how a bound on the norm of η* as defined in Item 2 of the decomposition lemma follows as a quick corollary to Lemma 25.

Corollary 26 Suppose η* is as given by Item 2 of the decomposition lemma. When the feature matrix has only −1, 0, +1 entries, we may bound ‖η*‖₁ ≤ 2^{O(m ln m)}.

Proof Note that every entry of M_F η* lies in the range [−ln m, μ_max] with μ_max = 2^{O(m ln m)}, and hence ‖M_F η*‖ ≤ 2^{O(m ln m)}.
Next, we may choose η* orthogonal to the null space of M_F; then ‖η*‖ ≤ λ⁻¹_min ‖M_F η*‖ ≤ 2^{O(m ln m)}. Since ‖η*‖₁ ≤ √N ‖η*‖, and the number of possible columns N with {−1, 0, +1} entries is at most 3^m, the proof follows.

5. Improved Estimates

In this section we shed more light on the rate bounds by cross-application of techniques from Sections 3 and 4. We obtain both new upper bounds for convergence to the optimal loss, and lower bounds for convergence to an arbitrary target loss. We also indicate what we believe might be the optimal bounds for either situation.

We first show how the finite rate bound of Theorem 1, along with the decomposition lemma, yields a new rate of convergence to the optimal loss. Although the dependence on ε is worse than in Theorem 14, the dependence on m is nearly optimal. We will need the following key application of the decomposition lemma.

Lemma 27 When the feature matrix has −1, 0, +1 entries, for any ε > 0 there is some solution with ℓ₁-norm at most 2^{O(m ln m)} ln(1/ε) that achieves within ε of the optimal loss.

Proof Let η*, η†, γ be as given by the decomposition lemma. Let c = min_{i∈Z} (Mη*)_i be the minimum margin produced by η* on any example in the zero-loss set Z. Since η† has margin at least γ on Z and exactly zero margin on F, the vector

λ* = η* + γ⁻¹(ln(1/ε) − c)η†

achieves margin at least ln(1/ε) on every example in Z, and the optimal margins on the finite-margin set F. Hence L(λ*) ≤ inf_λ L(λ) + ε. Using |c| ≤ ‖Mη*‖ ≤ m‖η*‖, and the results in Corollary 26 and Lemma 25, we may conclude that the vector λ* has ℓ₁-norm at most 2^{O(m ln m)} ln(1/ε).

We may now invoke Theorem 1 to obtain a 2^{O(m ln m)} ln⁶(1/ε) ε⁻⁵ rate of convergence to the optimal solution.
Rate bounds with similar dependence on m and slightly better dependence on ε can be obtained by modifying the proof in Section 4 to use first order instead of second order techniques. In that way we may obtain a poly(λ⁻¹_min, γ⁻¹, μ_max)ε⁻³ = 2^{O(m ln m)}ε⁻³ rate bound. We omit the rather long but straightforward proof of this fact. Finally, note that if Conjecture 6 is true, then Lemma 27 implies a 2^{O(m ln m)} ln(1/ε)ε⁻¹ rate bound for converging to the optimal loss, which is nearly optimal in both m and ε. We state this as an independent conjecture.

Conjecture 28 For feature matrices with −1, 0, +1 entries, AdaBoost converges to within ε of the optimal loss within 2^{O(m ln m)} ε^{−(1+o(1))} rounds.

We next focus on lower bounds on the convergence rate to arbitrary target losses discussed in Section 3. We begin by showing that the rate dependence on the norm of the solution given in Lemma 9 holds for much more general datasets.

Lemma 29 Suppose a feature matrix has only ±1 entries, and the finite-margin set is non-empty. Then, for any coordinate descent procedure, the number of rounds required to achieve a target loss φ* is at least inf{‖λ‖₁ : L(λ) ≤ φ*}/(1 + ln m).

Proof It suffices to upper bound the step size |α_t| in any round t by 1 + ln m. Notice that when the feature matrix has ±1 entries, a step in a direction that does not end up increasing the loss has length at most (1/2) ln((1 + δ)/(1 − δ)), where δ is the edge in that direction. Therefore, if δ_t is the maximum edge achievable in any direction, we have

|α_t| ≤ (1/2) ln((1 + δ_t)/(1 − δ_t)).
Further, by (4), a large edge $\delta_t$ ensures that for some coordinate step, the new vector $\lambda_t$ will have much smaller loss than the vector $\lambda_{t-1}$ at the beginning of round $t$: $L(\lambda_t) \le L(\lambda_{t-1}) \sqrt{1 - \delta_t^2}$. On the other hand, before the step the loss is at most 1, $L(\lambda_{t-1}) \le 1$, and after the step the loss is at least $1/m$ (since the optimal loss on a dataset with non-empty finite-loss set is at least $1/m$): $L(\lambda_t) \ge 1/m$. Combining these inequalities we get
$$1/m \le L(\lambda_t) \le L(\lambda_{t-1}) \sqrt{1 - \delta_t^2} \le \sqrt{1 - \delta_t^2},$$
that is, $\sqrt{1 - \delta_t^2} \ge 1/m$. Now the step length can be bounded as
$$|\alpha_t| \le \frac{1}{2} \ln \frac{1+\delta_t}{1-\delta_t} = \ln(1+\delta_t) - \frac{1}{2} \ln(1 - \delta_t^2) \le \delta_t + \ln m \le 1 + \ln m.$$

We end by showing a new lower bound for the convergence rate to an arbitrary target loss studied in Section 3. Corollary 11 implies that the rate bound in Theorem 1 has to be at least polynomially large in the norm of the solution. We now show that a polynomial dependence on $\varepsilon^{-1}$ in the rate is unavoidable too. This shows that rates for competing with a finite solution are different from rates on a dataset where the optimum loss is achieved by a finite solution, since in the latter case we may achieve an $O(\ln(1/\varepsilon))$ rate.

Corollary 30 Consider any dataset (e.g., the one in Figure 4) for which $\Omega(1/\varepsilon)$ rounds are necessary to get within $\varepsilon$ of the optimal loss. If there are constants $c$ and $\beta$ such that for any $\lambda^*$ and $\varepsilon$, a loss of $L(\lambda^*) + \varepsilon$ can be achieved in at most $O(\|\lambda^*\|_1^c\, \varepsilon^{-\beta})$ rounds, then $\beta \ge 1$.

Proof The decomposition lemma implies that $\lambda^* = \eta^* + \ln(2/\varepsilon)\, \eta^\dagger$, with $\ell_1$-norm $O(\ln(1/\varepsilon))$, achieves loss at most $K + \varepsilon/2$ (recall $K$ is the optimal loss). Suppose, to the contrary, that the bound holds for some constants $c$ and $\beta < 1$.
Then a loss of $L(\lambda^*) + \varepsilon/2 \le K + \varepsilon$ can be achieved in $O(\varepsilon^{-\beta} \ln^c(1/\varepsilon)) = o(1/\varepsilon)$ rounds, contradicting the $\Omega(1/\varepsilon)$ lower bound.

6. Conclusion

In this paper we studied the convergence rate of AdaBoost with respect to the exponential loss. We showed upper and lower bounds for convergence rates both to an arbitrary target loss achieved by some finite combination of the weak hypotheses, and to the infimum loss, which may not be realizable. For the first kind of convergence, we showed that a strong relationship exists between the size of the minimum vector achieving a target loss and the number of rounds of coordinate descent required to achieve that loss. In particular, we showed that a polynomial dependence of the rate on the $\ell_1$-norm $B$ of the minimum-size solution is absolutely necessary, and that a $\mathrm{poly}(B, 1/\varepsilon)$ upper bound holds, where $\varepsilon$ is the accuracy parameter. The actual rate we derive has rather large exponents, and we discuss a minor variant of AdaBoost that achieves a much tighter and near-optimal rate. For the second kind of convergence, using entirely separate techniques, we derived a $C/\varepsilon$ upper bound and showed that this is tight up to constant factors. In the process, we proved a certain decomposition lemma that might be of independent interest. We also study the constants and show how they depend on certain intrinsic parameters related to the singular values of the feature matrix. We estimate the worst-case values of these parameters; for feature matrices with only $\{-1, 0, +1\}$ entries, this leads to a bound on the rate constant $C$ that is doubly exponential in the number of training examples. Since this is rather large, we also include bounds polynomial in both the number of training examples and the accuracy parameter $\varepsilon$, although the dependence on $\varepsilon$ in these bounds is non-optimal.
Finally, for each kind of convergence, we conjecture tighter bounds that are not known to hold presently. A table containing a summary of the results in this paper is included in Figure 5.

Acknowledgments

This research was funded by the National Science Foundation under grants IIS-1016029 and IIS-1053407. We thank Nikhil Srivastava for informing us of the matrix used in Theorem 10. We also thank Aditya Bhaskara and Matus Telgarsky for many helpful discussions.

    Convergence rate with respect to:  Reference solution (Section 3)    Optimal solution (Section 4)
    -------------------------------------------------------------------------------------------------
    Upper bounds:                      13 B^6 / ε^5                      poly(e^{μ_max}, λ_min^{-1}, γ^{-1}) / ε ≤ 2^{2^{O(m ln m)}} / ε
                                                                         poly(μ_max, λ_min^{-1}, γ^{-1}) / ε^3 ≤ 2^{O(m ln m)} / ε^3
    Lower bounds:                      (B/ε)^{1-ν} for any constant ν    max{ (2^m / ln m) ln(1/ε), 2/(9ε) }
      a) {0, ±1} entries:              (2^m / ln m) ln(1/ε)
      b) real entries:                 Can be arbitrarily large even when m, N, ε are held fixed
    Conjectured upper bounds:          O(B^2 / ε)                        2^{O(m ln m)} / ε^{1+o(1)}, if entries in {0, ±1}

Figure 5: Summary of our most important results and conjectures regarding the convergence rate of AdaBoost. Here $m$ refers to the number of training examples, and $\varepsilon$ is the accuracy parameter. The quantity $B$ is the $\ell_1$-norm of the reference solution used in Section 3. The parameters $\lambda_{\min}$, $\gamma$ and $\mu_{\max}$ depend on the dataset and are defined and studied in Section 4.

Appendix

Lemma 31 For any $\varepsilon < 1/3$, to get within $\varepsilon$ of the optimum loss on the dataset in Table 4, AdaBoost takes at least $2/(9\varepsilon)$ steps.

Proof Note that the optimal loss is $2/3$, and we are bounding the number of rounds necessary to get within $(2/3) + \varepsilon$ loss for $\varepsilon < 1/3$. We will compute the edge in each round analytically.
Let $w_a^t, w_b^t, w_c^t$ denote the normalized losses (adding up to 1), or weights, on examples $a, b, c$ at the beginning of round $t$; let $h_t$ be the weak hypothesis chosen in round $t$, and $\delta_t$ the edge in round $t$. The values of these parameters are shown below for the first 5 rounds, where we have assumed (without loss of generality) that the hypothesis picked in round 1 is ~b:

    Round    w_a^t   w_b^t   w_c^t   h_t   δ_t
    t = 1:   1/3     1/3     1/3     ~b    1/3
    t = 2:   1/2     1/4     1/4     ~a    1/2
    t = 3:   1/3     1/2     1/6     ~b    1/3
    t = 4:   1/2     3/8     1/8     ~a    1/4
    t = 5:   2/5     1/2     1/10    ~b    1/5

Based on the patterns above, we first claim that for rounds $t \ge 2$, the edge achieved is $1/t$. In fact we prove the stronger claims that for rounds $t \ge 2$, the following hold:

1. One of $w_a^t$ and $w_b^t$ is $1/2$.
2. $\delta_{t+1} = \delta_t / (1 + \delta_t)$.

Since $\delta_2 = 1/2$, the recurrence on $\delta_t$ would immediately imply $\delta_t = 1/t$ for $t \ge 2$. We prove the stronger claims by induction on the round $t$. The base case $t = 2$ is shown above and may be verified. Suppose the inductive assumption holds for $t$. Assume without loss of generality that $1/2 = w_a^t > w_b^t > w_c^t$; note this implies $w_b^t = 1 - (w_a^t + w_c^t) = 1/2 - w_c^t$. Further, in this round ~a gets picked, and has edge $\delta_t = w_a^t + w_c^t - w_b^t = 2 w_c^t$. Now, for any dataset, the weights of the examples labeled correctly and incorrectly in a round of AdaBoost are rescaled during the weight-update step in such a way that each group adds up to $1/2$ after the rescaling. Therefore,
$$w_b^{t+1} = 1/2, \qquad w_c^{t+1} = w_c^t \cdot \frac{1/2}{w_a^t + w_c^t} = w_c^t / (1 + 2 w_c^t).$$
Hence ~b gets picked in round $t+1$ and, as before, we get edge $\delta_{t+1} = 2 w_c^{t+1} = 2 w_c^t / (1 + 2 w_c^t) = \delta_t / (1 + \delta_t)$. The proof of our claims follows by induction. Next we find the loss after each iteration.
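The inductive claims, and the step-size bound of Lemma 29, can also be checked by simulation. The sketch below runs AdaBoost (greedy coordinate descent with the optimal step $(1/2)\ln((1+\delta)/(1-\delta))$) on a margin matrix chosen to be consistent with the weight table above; since Table 4 itself is not reproduced here, this matrix is a hypothetical reconstruction, as is the tie-breaking order that favors ~b in round 1:

```python
import numpy as np

# Margin matrix consistent with the weight table above (hypothetical
# reconstruction of Table 4): rows are examples a, b, c; columns are the
# hypotheses ~b and ~a (with ~b first so the round-1 tie goes to ~b).
M = np.array([[-1.0,  1.0],   # a: ~b wrong, ~a right
              [ 1.0, -1.0],   # b: ~b right, ~a wrong
              [ 1.0,  1.0]])  # c: both right
m = M.shape[0]

lam = np.zeros(2)
edges, steps, losses = [], [], []
for t in range(50):
    w = np.exp(-M @ lam)
    w /= w.sum()                       # normalized example weights
    e = w @ M                          # edge of each hypothesis
    j = int(np.argmax(np.abs(e)))      # greedy coordinate choice
    d = abs(e[j])
    alpha = np.sign(e[j]) * 0.5 * np.log((1 + d) / (1 - d))
    lam[j] += alpha
    edges.append(d)
    steps.append(abs(alpha))
    losses.append(np.exp(-M @ lam).mean())

# The edges follow the claimed pattern: delta_1 = 1/3 and delta_t = 1/t.
assert np.allclose(edges[:5], [1/3, 1/2, 1/3, 1/4, 1/5])
assert all(abs(edges[t] - 1/(t + 1)) < 1e-9 for t in range(1, 50))
# Step sizes respect Lemma 29's bound |alpha_t| <= 1 + ln(m).
assert all(s <= 1 + np.log(m) for s in steps)
# The loss after T rounds equals (2/3) * sqrt(1 + 1/T).
assert all(abs(losses[T - 1] - (2/3) * np.sqrt(1 + 1/T)) < 1e-9
           for T in range(1, 51))
```

The last assertion anticipates the closed form for the loss computed in the remainder of the proof.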
Using $\delta_1 = 1/3$ and $\delta_t = 1/t$ for $t \ge 2$, the loss after $T$ rounds can be written as
$$\prod_{t=1}^{T} \sqrt{1 - \delta_t^2} = \sqrt{1 - (1/3)^2}\, \prod_{t=2}^{T} \sqrt{1 - 1/t^2} = \frac{2\sqrt{2}}{3} \sqrt{\prod_{t=2}^{T} \left(\frac{t-1}{t}\right)\left(\frac{t+1}{t}\right)}.$$
The product can be rewritten as follows:
$$\prod_{t=2}^{T} \left(\frac{t-1}{t}\right)\left(\frac{t+1}{t}\right) = \left(\prod_{t=2}^{T} \frac{t-1}{t}\right) \left(\prod_{t=2}^{T} \frac{t+1}{t}\right) = \left(\prod_{t=2}^{T} \frac{t-1}{t}\right) \left(\prod_{t=3}^{T+1} \frac{t}{t-1}\right).$$
Notice that almost all the terms cancel, except for the first term of the first product and the last term of the second product. Therefore, the loss after $T$ rounds is
$$\frac{2\sqrt{2}}{3} \sqrt{\frac{1}{2} \cdot \frac{T+1}{T}} = \frac{2}{3} \sqrt{1 + \frac{1}{T}} \ge \frac{2}{3}\left(1 + \frac{1}{3T}\right) = \frac{2}{3} + \frac{2}{9T},$$
where the inequality holds for $T \ge 1$. Since the initial loss is $1 = (2/3) + 1/3$, for any $\varepsilon < 1/3$ the number of rounds needed to achieve loss $(2/3) + \varepsilon$ is at least $2/(9\varepsilon)$.

Lemma 32 Suppose $u_0, u_1, \ldots$ are non-negative numbers satisfying $u_t - u_{t+1} \ge c_0 u_t^{1+c_1}$ for some non-negative constants $c_0, c_1$. Then, for any $t$,
$$\frac{1}{u_t^{c_1}} - \frac{1}{u_0^{c_1}} \ge c_1 c_0 t.$$

Proof By induction on $t$. The base case is an identity. Assume the statement holds at iteration $t$. Then,
$$\frac{1}{u_{t+1}^{c_1}} - \frac{1}{u_0^{c_1}} = \left(\frac{1}{u_{t+1}^{c_1}} - \frac{1}{u_t^{c_1}}\right) + \left(\frac{1}{u_t^{c_1}} - \frac{1}{u_0^{c_1}}\right) \ge \frac{1}{u_{t+1}^{c_1}} - \frac{1}{u_t^{c_1}} + c_1 c_0 t,$$
by the inductive hypothesis. Thus it suffices to show $1/u_{t+1}^{c_1} - 1/u_t^{c_1} \ge c_1 c_0$. Multiplying both sides by $u_t^{c_1}$ and adding 1, this is equivalent to showing $(u_t/u_{t+1})^{c_1} \ge 1 + c_1 c_0 u_t^{c_1}$. We will in fact show the stronger inequality
$$\left(\frac{u_t}{u_{t+1}}\right)^{c_1} \ge (1 + c_0 u_t^{c_1})^{c_1}. \qquad (9)$$
Since $(1+a)^b \ge 1 + ba$ for $a, b$ non-negative, (9) will imply
$$\left(\frac{u_t}{u_{t+1}}\right)^{c_1} \ge (1 + c_0 u_t^{c_1})^{c_1} \ge 1 + c_1 c_0 u_t^{c_1},$$
which will complete our proof. To show (9), we first rearrange the condition on $u_t, u_{t+1}$ to obtain
$$u_{t+1} \le u_t (1 - c_0 u_t^{c_1}) \implies \frac{u_t}{u_{t+1}} \ge \frac{1}{1 - c_0 u_t^{c_1}}.$$
Applying the fact that $(1 + c_0 u_t^{c_1})(1 - c_0 u_t^{c_1}) \le 1$ to the previous equation, we get
$$\frac{u_t}{u_{t+1}} \ge 1 + c_0 u_t^{c_1}.$$
Since $c_1 \ge 0$, we may raise both sides of the above inequality to the power $c_1$ to show (9), finishing our proof.

Proof of Lemma 25

In this section we prove Lemma 25 by separately bounding the quantities $\lambda_{\min}^{-1}$, $\gamma^{-1}$ and $\mu_{\max}$, through a sequence of lemmas. We will use the next result repeatedly.

Lemma 33 If $A$ is an $n \times n$ invertible matrix with $-1, 0, +1$ entries, then $\min_{x : \|x\| = 1} \|Ax\|$ is at least $1/n! = 2^{-O(n \ln n)}$.

Proof It suffices to show that $\|A^{-1} x\| \le n!$ for any $x$ with unit norm. Now $A^{-1} = \mathrm{adj}(A)/\det(A)$, where $\mathrm{adj}(A)$ is the adjoint of $A$, whose $(i,j)$-th entry is the $(i,j)$-th cofactor of $A$ (given by $(-1)^{i+j}$ times the determinant of the $(n-1) \times (n-1)$ matrix obtained by removing the $i$-th row and $j$-th column of $A$), and $\det(A)$ is the determinant of $A$. The determinant of any $k \times k$ matrix $G$ can be written as $\sum_\sigma \mathrm{sgn}(\sigma) \prod_{i=1}^{k} G(i, \sigma(i))$, where $\sigma$ ranges over all the permutations of $1, \ldots, k$. Therefore each entry of $\mathrm{adj}(A)$ is at most $(n-1)!$ in magnitude, and $\det(A)$ is a non-zero integer. Therefore $\|A^{-1} x\| = \|\mathrm{adj}(A)\, x\| / |\det(A)| \le n!\,\|x\|$, and the proof is complete.

We first show that our bound holds for $\lambda_{\min}$.

Lemma 34 Suppose $M$ has $-1, 0, +1$ entries, and let $M_F, \lambda_{\min}$ be as in Corollary 23. Then $\lambda_{\min} \ge 1/m!$.

Proof Let $A$ denote the matrix $M_F$. It suffices to show that $A$ does not squeeze too much the norm of any vector orthogonal to its null space $\ker A \triangleq \{\eta : A\eta = 0\}$, i.e., that $\|A\lambda\| \ge (1/m!)\|\lambda\|$ for any $\lambda \in (\ker A)^\perp$. We first characterize $(\ker A)^\perp$ and then study how $A$ acts on this subspace. Let the rank of $A$ be $k \le m$ (notice that $A = M_F$ has $N$ columns and fewer than $m$ rows). Without loss of generality, assume the first $k$ columns of $A$ are independent.
Then every column of $A$ can be written as a linear combination of the first $k$ columns of $A$, and we have $A = A'[I \mid B]$ (that is, the matrix $A$ is the product of the matrices $A'$ and $[I \mid B]$), where $A'$ is the submatrix consisting of the first $k$ columns of $A$, $I$ is the $k \times k$ identity matrix, and $B$ is some $k \times (N-k)$ matrix of linear combinations (here $\mid$ denotes concatenation). The null space of $A$ consists of the $x$ such that $0 = Ax = A'[I \mid B]x = A'(x_k + Bx_{-k})$, where $x_k$ is the first $k$ coordinates of $x$, and $x_{-k}$ the remaining $N-k$ coordinates. Since the columns of $A'$ are independent, this happens if and only if $x_k = -Bx_{-k}$. Therefore
$$\ker A = \left\{ (-Bz, z) : z \in \mathbb{R}^{N-k} \right\}.$$
Since a vector $x$ lies in the orthogonal subspace of $\ker A$ if it is orthogonal to every vector in the latter, we have
$$(\ker A)^\perp = \left\{ (x_k, x_{-k}) : \langle x_k, Bz \rangle = \langle x_{-k}, z \rangle \ \ \forall z \in \mathbb{R}^{N-k} \right\}.$$
We next see how $A$ acts on this subspace. Recall $A = A'[I \mid B]$, where $A'$ has $k$ independent columns. By basic linear algebra, the row rank of $A'$ is also $k$; assume without loss of generality that the first $k$ rows of $A'$ are independent, and denote by $A_k$ the $k \times k$ submatrix of $A'$ formed by these $k$ rows. Then for any vector $x$,
$$\|Ax\| = \|A'[I \mid B]x\| = \|A'(x_k + Bx_{-k})\| \ge \|A_k (x_k + Bx_{-k})\| \ge \frac{1}{k!}\|x_k + Bx_{-k}\|,$$
where the last inequality follows from Lemma 33. To finish the proof, it suffices to show that $\|x_k + Bx_{-k}\| \ge \|x\|$ for $x \in (\ker A)^\perp$. Indeed, by expanding $\|x_k + Bx_{-k}\|^2$ as an inner product with itself, we have
$$\|x_k + Bx_{-k}\|^2 = \|x_k\|^2 + \|Bx_{-k}\|^2 + 2\langle x_k, Bx_{-k} \rangle \ge \|x_k\|^2 + 2\|x_{-k}\|^2 \ge \|x\|^2,$$
where the first inequality follows since $x \in (\ker A)^\perp$ implies $\langle x_k, Bx_{-k} \rangle = \langle x_{-k}, x_{-k} \rangle$.

To show the bounds on $\gamma^{-1}$ and $\mu_{\max}$, we will need an intermediate result.
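Lemma 33, and through it the $1/m!$ bound on $\lambda_{\min}$ just proved, is easy to sanity-check numerically: for an invertible $\{-1, 0, +1\}$ matrix, the smallest singular value equals $\min_{\|x\|=1}\|Ax\|$ and must stay above $1/n!$. A minimal sketch over random sign matrices (the test matrices are arbitrary inputs, not from the paper):

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n = 5
checked = 0
for _ in range(200):
    A = rng.integers(-1, 2, size=(n, n)).astype(float)  # entries in {-1, 0, +1}
    if abs(np.linalg.det(A)) < 0.5:  # determinant is an integer; 0 => singular
        continue
    smin = np.linalg.svd(A, compute_uv=False)[-1]  # min ||Ax|| over unit x
    assert smin >= 1.0 / math.factorial(n) - 1e-12  # Lemma 33's bound
    checked += 1
```

In practice the bound is very loose: random invertible sign matrices typically have smallest singular value far above $1/n!$; the factorial loss comes from the worst case of the adjoint-determinant argument.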
Lemma 35 Suppose $A$ is a matrix and $b$ a vector, both with $-1, 0, +1$ entries. If $Ax = b,\ x \ge 0$ is solvable, then there is a solution satisfying $\|x\| \le k \cdot k!$, where $k = \mathrm{rank}(A)$.

Proof Pick a solution $x$ with the maximum number of zeroes. Let $J$ be the set of coordinates on which $x_i$ is zero. We first claim that there is no other solution $x'$ which is also zero on the set $J$. Suppose there were such an $x'$. Note that any point $p$ on the infinite line joining $x, x'$ satisfies $Ap = b$ and $p_J = 0$ (that is, $p_{i'} = 0$ for $i' \in J$). If $i$ is any coordinate not in $J$ such that $x_i \ne x'_i$, then for some point $p^i$ along the line, we have $p^i_{J \cup \{i\}} = 0$. Choose $i$ so that $p^i$ is as close to $x$ as possible. Since $x \ge 0$, by continuity this would also imply that $p^i \ge 0$. But then $p^i$ is a solution with more zeroes than $x$, a contradiction. The claim implies that the reduced problem $A'\tilde{x} = b,\ \tilde{x} \ge 0$, obtained by substituting $x_J = 0$, has a unique solution. Let $k = \mathrm{rank}(A')$, let $A_k$ be a $k \times k$ submatrix of $A'$ with full rank, and let $b_k$ be the restriction of $b$ to the rows corresponding to those of $A_k$ (note that $A'$, and hence $A_k$, contain only $-1, 0, +1$ entries). Then $A_k \tilde{x} = b_k,\ \tilde{x} \ge 0$ is equivalent to the reduced problem. In particular, by uniqueness, solving $A_k \tilde{x} = b_k$ automatically ensures that the obtained $x = (\tilde{x}, 0_J)$ is a non-negative solution to the original problem, and satisfies $\|x\| = \|\tilde{x}\|$. But, by Lemma 33, $\|\tilde{x}\| \le k!\,\|A_k \tilde{x}\| = k!\,\|b_k\| \le k \cdot k!$.

The bound on $\gamma^{-1}$ follows easily.

Lemma 36 Let $\gamma, \eta^\dagger$ be as in Item 1 of Lemma 15. Then $\eta^\dagger$ can be chosen such that $\gamma \ge 1/(\sqrt{N}\, m \cdot m!) \ge 2^{-O(m \ln m)}$.

Proof We know that $M(\eta^\dagger/\gamma) = b$, where $b$ is zero on the set $F$ and at least 1 for every example in the zero-loss set $Z$ (as given by Item 1 of Lemma 15).
Since $M$ is closed under complementing columns, we may assume in addition that $\eta^\dagger \ge 0$. Introduce slack variables $z_i$ for $i \in Z$, and let $\tilde{M}$ be $M$ augmented with the columns $-e_i$ for $i \in Z$, where $e_i$ is the standard basis vector with 1 in the $i$-th coordinate and zero everywhere else. Then, by setting $z = M(\eta^\dagger/\gamma) - b$, we have a solution $(\eta^\dagger/\gamma, z)$ to the system $\tilde{M}x = b,\ x \ge 0$. Applying Lemma 35, we know there exists some solution $(y, z')$ with norm at most $m \cdot m!$ (here $z'$ corresponds to the slack variables). Observe that $y/\|y\|_1$ is a valid choice for $\eta^\dagger$, yielding a $\gamma$ of $1/\|y\|_1 \ge 1/(\sqrt{N}\, m \cdot m!)$.

To show the bound for $\mu_{\max}$ we will need a version of Lemma 35 with strict inequality.

Corollary 37 Suppose $A$ is a matrix and $b$ a vector, both with $-1, 0, +1$ entries. If $Ax = b,\ x > 0$ is solvable, then there is a solution satisfying $\|x\| \le 1 + k \cdot k!$, where $k = \mathrm{rank}(A)$.

Proof Using Lemma 35, pick a solution to $Ax = b,\ x \ge 0$ with norm at most $k \cdot k!$. If $x > 0$, then we are done. Otherwise, let $y > 0$ satisfy $Ay = b$, and consider the segment joining $x$ and $y$. Every point $p$ on the segment satisfies $Ap = b$. Further, any coordinate becomes zero at most once on the segment. Therefore there are points arbitrarily close to $x$ on the segment with positive coordinates that satisfy the equation, and these have norms approaching that of $x$.

We next characterize the feature matrix $M_F$ restricted to the finite-loss examples, which might be of independent interest.

Lemma 38 If $M_F$ is the feature matrix restricted to the finite-loss examples $F$ (as given by Item 2 of Lemma 15), then there exists a positive linear combination $y > 0$ such that $M_F^T y = 0$.

Proof Item 3 of the decomposition lemma states that whenever the loss $\ell_x(F)$ of a vector is bounded by $m$, then the largest margin $\max_{i \in F}(M_F x)_i$ is at most $\mu_{\max}$.
This implies that there is no vector $x$ such that $M_F x \ge 0$ and at least one of the margins $(M_F x)_i$ is positive; otherwise, an arbitrarily large multiple of $x$ would still have loss at most $m$, but margin exceeding the constant $\mu_{\max}$. In other words, $M_F x \ge 0$ implies $M_F x = 0$. In particular, the subspace of possible margin vectors $\{M_F x : x \in \mathbb{R}^N\}$ is disjoint from the convex set $\Delta_F$ of distributions over examples in $F$, which consists of points in $\mathbb{R}^{|F|}$ with all coordinates non-negative and at least one coordinate positive. By the Hahn–Banach separation theorem, there exists a hyperplane separating these two bodies, i.e., there is a $y \in \mathbb{R}^{|F|}$ such that for any $x \in \mathbb{R}^N$ and $p \in \Delta_F$, we have $\langle y, M_F x \rangle \le 0 < \langle y, p \rangle$. By choosing $p = e_i$ for the various $i \in F$, the second inequality yields $y > 0$. Since $M_F x = -M_F(-x)$, the first inequality implies that equality holds for all $x$, i.e., $y^T M_F = 0^T$.

We can finally upper-bound $\mu_{\max}$.

Lemma 39 Let $F, \mu_{\max}$ be as in Items 2 and 3 of the decomposition lemma. Then $\mu_{\max} \le \ln m \cdot |F|^{1.5} \cdot |F|! \le 2^{O(m \ln m)}$.

Proof Pick any example $i \in F$ and any combination $\lambda$ whose loss on $F$, $\sum_{i \in F} e^{-(M\lambda)_i}$, is at most $m$. Let $b$ be the $i$-th row of $M$, and let $A^T$ be the matrix $M_F$ without the $i$-th row. Then Lemma 38 says that $Ay = -b$ for some positive vector $y > 0$. This implies that the margin of $\lambda$ on example $i$ is $(M\lambda)_i = -y^T A^T \lambda$. Since the loss of $\lambda$ on $F$ is at most $m$, each margin on $F$ is at least $-\ln m$, and therefore $\max_i (-A^T \lambda)_i \le \ln m$. Hence the margin on example $i$ can be bounded as
$$(M\lambda)_i = \langle y, -A^T \lambda \rangle \le \ln m \, \|y\|_1.$$
Using Corollary 37, we can find $y$ with bounded norm, $\|y\|_1 \le \sqrt{|F|}\, \|y\| \le \sqrt{|F|}\,(1 + k \cdot k!)$, where $k = \mathrm{rank}(A) \le \mathrm{rank}(M_F) \le |F|$. The proof follows.
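The positivity guarantee of Lemma 38 can be illustrated on a tiny hypothetical finite-loss matrix (an arbitrary example, not from the paper): when $M_F x \ge 0$ forces $M_F x = 0$, a strictly positive left null vector exists.

```python
import numpy as np

# Hypothetical finite-loss matrix: two examples, two hypotheses, each
# hypothesis right on one example and wrong on the other.
MF = np.array([[ 1.0, -1.0],
               [-1.0,  1.0]])

# Brute-force check of the hypothesis of Lemma 38 on a grid:
# whenever M_F x >= 0 componentwise, in fact M_F x = 0.
for x1 in np.linspace(-2, 2, 41):
    for x2 in np.linspace(-2, 2, 41):
        v = MF @ np.array([x1, x2])
        if np.all(v >= -1e-12):
            assert np.allclose(v, 0.0, atol=1e-9)

# The conclusion of Lemma 38: a strictly positive y with M_F^T y = 0.
y = np.array([1.0, 1.0])
assert np.all(y > 0) and np.allclose(y @ MF, 0.0)
```

Here the separating-hyperplane argument is trivial, but the same two assertions are exactly what the lemma guarantees for any finite-loss block $M_F$ produced by the decomposition lemma.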