The Margitron: A Generalised Perceptron with Margin


Constantinos Panagiotakopoulos and Petroula Tsampouka
Physics Division, School of Technology, Aristotle University of Thessaloniki, Greece
costapan@eng.auth.gr, petroula@gen.auth.gr

Abstract. We identify the classical Perceptron algorithm with margin as a member of a broader family of large margin classifiers which we collectively call the Margitron. The Margitron, despite sharing the same update rule with the Perceptron, is shown in an incremental setting to converge in a finite number of updates to solutions possessing any desirable fraction of the maximum margin. Experiments comparing the Margitron with decomposition SVMs on tasks involving linear kernels and 2-norm soft margin are also reported.

1 Introduction

It is widely accepted that the larger the margin of the solution hyperplane, the greater is the generalisation ability of the learning machine [18, 14]. The simplest online learning algorithm for binary linear classification, the Perceptron [12, 11], does not aim at any margin. The problem, instead, of finding the optimal margin hyperplane lies at the core of Support Vector Machines (SVMs) [18, 1]. Their efficient implementation, however, is somewhat hindered by the fact that they require solving a quadratic programming problem.

The complications encountered in implementing SVMs have spurred interest in alternative large margin classifiers, many of which are based on the Perceptron algorithm. The oldest such algorithm, which appeared long before the advent of SVMs, is the standard Perceptron with margin [2], a straightforward extension of the Perceptron which, however, in an incremental setting is known to be able to guarantee achieving only up to 1/2 of the maximum margin that the dataset possesses [8, 10, 15]. Subsequently, various algorithms succeeded in achieving larger fractions of the maximum margin by employing modified perceptron-like update rules. Such algorithms include ROMMA [9], ALMA [3], CRAMMA [16] and MICRA [17]. A somewhat different approach from the hard margin one adopted by most of the algorithms above was also developed, which focuses on the minimisation of the 1-norm soft margin loss through stochastic gradient descent. There is a connection, however, between such algorithms and the Perceptron, since their unregularised form with constant learning rate is identical to the Perceptron with margin. Notable representatives of this approach are the pioneering NORMA [7] and the very recent Pegasos [13].

A question that arises naturally, and which we attempt to answer in the present work, is whether it is possible to achieve a guaranteed fraction of the maximum margin larger than 1/2 while retaining the original perceptron update rule. To this end we construct a whole new family of algorithms, at least one member of which has guaranteed convergence in a finite number of steps to a solution hyperplane possessing any desirable fraction of the unknown maximum margin. This family of algorithms, in which the classical Perceptron with margin is naturally embedded, will be termed the Margitron.
Hopefully, the algorithms belonging to the Margitron family, by virtue of being generalisations of the very successful Perceptron, will have a respectable performance in various classification tasks.

Section 2 contains some preliminaries and the description of the Margitron algorithm. Section 3 is devoted to a theoretical analysis. Section 4 contains our experimental results and Section 5 our conclusions.

2 The Margitron Algorithm

In what follows we assume that we are given a training set which either is linearly separable from the beginning or becomes separable by an appropriate feature mapping into a space of higher dimension [18, 1]. This higher dimensional feature space, in which the patterns are linearly separable, will be the considered space. By placing all patterns in the same position at a distance ρ in an additional dimension we construct an embedding of our data into the so-called augmented space [2]. The advantage of this embedding is that the linear hypothesis in the augmented space becomes homogeneous. Throughout our discussion a reflection with respect to the origin in the augmented space of the negatively labelled patterns is assumed, in order to allow for a uniform treatment of both categories of patterns. Also, R ≡ max_k ‖y_k‖, with y_k the kth augmented pattern. Obviously, R ≥ ρ.

The relation characterising optimally correct classification of the training patterns y_k by a weight vector u of unit norm in the augmented space is

$$u \cdot y_k \ge \gamma_d \equiv \max_{u':\|u'\|=1}\,\min_i\,\{u' \cdot y_i\} \quad \forall k. \qquad (1)$$

We shall refer to γ_d as the maximum directional margin. It coincides with the maximum margin in the augmented space with respect to hyperplanes passing through the origin if no reflection is assumed. The directional margin γ_d and the maximum geometric margin γ in the original (non-augmented) feature space satisfy the inequality 1 ≤ γ/γ_d ≤ R/ρ. As ρ → ∞, R/ρ → 1 and, from the above inequality, γ_d → γ [15].

In the Margitron algorithm the augmented weight vector a_t is initially set to zero, i.e. a_0 = 0, and is updated according to the classical perceptron rule

$$a_{t+1} = a_t + y_k \qquad (2)$$

each time a misclassification condition is satisfied by a training pattern y_k. For the misclassification condition we consider two options. The first is to replace the constant functional margin threshold b > 0 in the misclassification condition of the classical Perceptron with margin by a term proportional to a power of the number of steps (updates) t

$$a_t \cdot y_k \le b\,t^{1-\epsilon}, \quad \epsilon > 0. \qquad (3)$$

As a second option we employ a margin threshold proportional to a power of the length of the augmented weight vector, leading to a misclassification condition

$$a_t \cdot y_k \le b\,\|a_t\|^{1-\epsilon}, \quad \epsilon > 0. \qquad (4)$$

For t = 0 in both (3) and (4) the threshold is set to 0, resulting in the first pattern always being misclassified. The Margitron with misclassification condition given by (3) will be referred to as the t-margitron, whereas the version with condition given by (4) as the ℓ-margitron. Setting ε = 1 in both the t- and the ℓ-margitron we recover the Perceptron with margin. Notice that the introduction of a constant learning rate is pointless, since it amounts to a rescaling of b. The two algorithms are summarised in Fig. 1.

  t-margitron
    Input: a linearly separable augmented set S = (y_1, ..., y_k, ..., y_m) with reflection assumed
    Fix: ε, b
    Define: ε̄ = 1 − ε
    Initialise: t = 0, a_0 = 0, b_0 = 0
    repeat
      for k = 1 to m do
        p_tk = a_t · y_k
        if p_tk ≤ b_t then
          a_{t+1} = a_t + y_k
          t ← t + 1
          b_t = b t^ε̄
        end if
      end for
    until no update made within the for loop

  ℓ-margitron
    Input: a linearly separable augmented set S = (y_1, ..., y_k, ..., y_m) with reflection assumed
    Fix: ε, b
    Define: q_k = ‖y_k‖², ε̄ = (1 − ε)/2
    Initialise: t = 0, a_0 = 0, ℓ_0 = 0, b_0 = 0
    repeat
      for k = 1 to m do
        p_tk = a_t · y_k
        if p_tk ≤ b_t then
          a_{t+1} = a_t + y_k
          ℓ_{t+1} = ℓ_t + 2p_tk + q_k
          t ← t + 1
          b_t = b ℓ_t^ε̄
        end if
      end for
    until no update made within the for loop

Fig. 1. The algorithms t-margitron and ℓ-margitron.
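The pseudocode of Fig. 1 translates almost line by line into the following minimal NumPy sketch (our own illustration; the function name and interface are not from the paper). It assumes the patterns have already been augmented and reflected as described above, and it maintains ℓ_t = ‖a_t‖² incrementally exactly as in Fig. 1:

```python
import numpy as np

def margitron(Y, b, eps, variant="l", max_epochs=1000000):
    """Y: m x d array of augmented, reflected patterns.
    variant="t": threshold b * t^(1-eps)        (condition (3))
    variant="l": threshold b * ||a_t||^(1-eps)  (condition (4))
    Returns the augmented weight vector a and the number of updates t."""
    a = np.zeros(Y.shape[1])
    t = 0
    ell = 0.0          # ell = ||a_t||^2, maintained incrementally
    threshold = 0.0    # b_0 = 0: the first pattern is always misclassified
    for _ in range(max_epochs):
        updated = False
        for y in Y:
            p = a @ y
            if p <= threshold:                 # misclassification condition
                a += y                         # perceptron update (2)
                ell += 2.0 * p + y @ y         # ||a_{t+1}||^2 = ||a_t||^2 + 2 p_tk + q_k
                t += 1
                if variant == "t":
                    threshold = b * t ** (1.0 - eps)
                else:
                    threshold = b * ell ** (0.5 * (1.0 - eps))
                updated = True
        if not updated:                        # a full pass without a mistake
            return a, t
    raise RuntimeError("no convergence within max_epochs")
```

Setting eps = 1 in either variant recovers the classical Perceptron with margin b.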
Both (3) and (4) can be written for t > 0 in the form

$$u_t \cdot y_k \le C(t) \qquad (5)$$

(u_t ≡ a_t/‖a_t‖, C(t) > 0), involving the margin u_t · y_k in the augmented space of the pattern y_k with respect to the zero-threshold hyperplane normal to a_t (i.e. the directional margin of y_k), instead of its functional margin a_t · y_k. The function C(t) is given by C(t) = b t^{1−ε}‖a_t‖^{−1} for the t-margitron and by C(t) = b‖a_t‖^{−ε} for the ℓ-margitron. We expect that ε < 1 will result in an enhancement of the margin threshold C(t) relative to the case ε = 1 (Perceptron with margin) and that this enhancement will eventually lead to a slower average fall-off of C(t) as t progresses, instead of a genuine increase, which is desirable in order for the algorithm to converge. This expectation is further supported by the fact that, as we demonstrate below, C(t) ≤ c t^{−ε} with c > 0. Hopefully, such a slower decrease of the margin required by the misclassification condition will ensure convergence to solutions possessing margins which are larger fractions of γ_d.

Taking the inner product of (2) with the optimal direction u we obtain

$$a_{t+1}\cdot u - a_t\cdot u = y_k\cdot u \ge \gamma_d,$$

a repeated application of which gives [11]

$$\|a_t\| \ge a_t \cdot u \ge \gamma_d\, t. \qquad (6)$$

Using (6) we get C(t) ≤ c t^{−ε}, with c = bγ_d^{−1} and c = bγ_d^{−ε} for the t- and the ℓ-margitron, respectively.

3 Theoretical Analysis

Lemma 1. Let g(t) = t^ε − αt^{ε−1} − β with t ∈ [1, +∞), ε > 0, α ≥ 1 and β > 0. Then there is a single value t_b of t satisfying g(t_b) = 0, which is bounded as follows:

$$\alpha + \beta^{1/\epsilon} \le t_b \le \frac{\alpha}{\epsilon} + \beta^{1/\epsilon} \quad (\epsilon \le 1), \qquad \frac{\alpha}{\epsilon} + \beta^{1/\epsilon} < t_b < \alpha + \beta^{1/\epsilon} \quad (\epsilon > 1).$$

Proof. The function g(t), with g(1) < 0, is unbounded from above and is either strictly increasing (if ε ≤ 1) or has at most one local minimum (if ε > 1). Therefore there is a single root t_b of g(t). In addition, for g(t) ≠ 0, sign(t − t_b) = sign(g(t)). Let 0 < ε < 1. We have

$$g(\alpha + \beta^{1/\epsilon}) = \beta^{1/\epsilon}(\alpha + \beta^{1/\epsilon})^{\epsilon-1} - \beta < \beta^{1/\epsilon}\beta^{\frac{\epsilon-1}{\epsilon}} - \beta = 0,$$

implying that t_b > α + β^{1/ε}. Moreover,

$$g\!\left(\frac{\alpha}{\epsilon} + \beta^{1/\epsilon}\right) = \left(\frac{1-\epsilon}{\epsilon}\alpha + \beta^{1/\epsilon}\right)\left(\frac{\alpha}{\epsilon} + \beta^{1/\epsilon}\right)^{\epsilon-1} - \beta = \beta\left(1 + \frac{1-\epsilon}{\epsilon}\alpha\beta^{-1/\epsilon}\right)\left(1 + \frac{\alpha}{\epsilon}\beta^{-1/\epsilon}\right)^{\epsilon-1} - \beta$$
$$> \beta\left(1 + \frac{\alpha}{\epsilon}\beta^{-1/\epsilon}\right)^{1-\epsilon}\left(1 + \frac{\alpha}{\epsilon}\beta^{-1/\epsilon}\right)^{\epsilon-1} - \beta = 0,$$

implying that t_b < α/ε + β^{1/ε}. (Here we make use of 1 + qz > (1+z)^q for −1 < z ≠ 0 and 0 < q < 1.) If ε > 1, instead, both the above inequalities are reversed. (Here we make use of 1 − qz < (1+z)^{−q} for z, q > 0.) Finally, for ε = 1, obviously t_b = α + β.
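As a quick numeric illustration of Lemma 1 (our own check, not part of the paper), one can locate the root t_b of g by bisection and verify that it falls inside the claimed bracket for ε ≤ 1:

```python
def lemma1_check(eps, alpha, beta):
    # g(t) = t^eps - alpha * t^(eps-1) - beta, with g(1) < 0 for alpha >= 1, beta > 0
    g = lambda t: t**eps - alpha * t**(eps - 1) - beta
    lo, hi = 1.0, 2.0
    while g(hi) < 0:                 # g is eventually positive
        hi *= 2.0
    for _ in range(200):             # plain bisection
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
    t_b = 0.5 * (lo + hi)
    return alpha + beta**(1 / eps), t_b, alpha / eps + beta**(1 / eps)

print(lemma1_check(0.5, 3.0, 2.0))   # prints (7.0, t_b, 10.0) with 7.0 <= t_b <= 10.0
```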
Theorem 1. The t-margitron with 0 < ε ≤ 1 converges in

$$t_c \le \frac{1}{\epsilon}\frac{R^2}{\gamma_d^2} + \left(\frac{2}{2-\epsilon}\frac{b}{\gamma_d^2}\right)^{1/\epsilon} \qquad (7)$$

updates to a solution hyperplane possessing directional margin γ′_d which is a fraction f of the maximum directional margin γ_d obeying the inequality

$$f \equiv \frac{\gamma_d'}{\gamma_d} \ge \left(\frac{R^2}{b} + \frac{2}{2-\epsilon}\right)^{-1}. \qquad (8)$$

Moreover, an after-running estimate of γ′_d/γ_d is obtainable from

$$\frac{\gamma_d'}{\gamma_d} \ge f_{\mathrm{est}} \equiv \left(\frac{R^2}{b}\,t_c^{\epsilon-1} + \frac{2}{2-\epsilon}\right)^{-1}. \qquad (9)$$

Proof. From (2), and taking into account (3), we get

$$\|a_{t+1}\|^2 - \|a_t\|^2 = \|y_k\|^2 + 2\,y_k\cdot a_t \le R^2 + 2b\,t^{1-\epsilon},$$

a repeated application t times of which leads to

$$\|a_t\|^2 \le R^2 t + 2b\sum_{l=1}^{t-1}l^{1-\epsilon} \le R^2 t + 2b\int_0^t l^{1-\epsilon}\,dl = R^2 t + \frac{2}{2-\epsilon}\,b\,t^{2-\epsilon}. \qquad (10)$$

Combining (6) with (10) we obtain

$$\gamma_d\,t \le \|a_t\| \le R\sqrt{t + \frac{2}{2-\epsilon}\frac{b}{R^2}t^{2-\epsilon}}, \qquad (11)$$

from where

$$t^\epsilon \le \frac{R^2}{\gamma_d^2}t^{\epsilon-1} + \frac{2}{2-\epsilon}\frac{b}{\gamma_d^2} \qquad (12)$$

or, equivalently,

$$g(t) \equiv t^\epsilon - \frac{R^2}{\gamma_d^2}t^{\epsilon-1} - \frac{2}{2-\epsilon}\frac{b}{\gamma_d^2} \le 0. \qquad (13)$$

The value t_b of t for which the above relation holds as an equality provides an upper bound on the number of updates t_c required for convergence. According to Lemma 1 there is a single such value, which is bounded as stated there. This leads to the looser bound of (7). Combining (3) with (5) and using (10) we obtain

$$\frac{C(t)}{\gamma_d} = \frac{b\,t^{1-\epsilon}}{\gamma_d\|a_t\|} \ge \left(\frac{\gamma_d R}{b}\sqrt{t^{2\epsilon-1} + \frac{2}{2-\epsilon}\frac{b}{R^2}t^{\epsilon}}\right)^{-1}. \qquad (14)$$

Multiplying both sides of (12) with its r.h.s. we get

$$t^\epsilon\left(\frac{R^2}{\gamma_d^2}t^{\epsilon-1} + \frac{2}{2-\epsilon}\frac{b}{\gamma_d^2}\right) \le \left(\frac{R^2}{\gamma_d^2}t^{\epsilon-1} + \frac{2}{2-\epsilon}\frac{b}{\gamma_d^2}\right)^2$$

or

$$\frac{\gamma_d R}{b}\sqrt{t^{2\epsilon-1} + \frac{2}{2-\epsilon}\frac{b}{R^2}t^{\epsilon}} \le \frac{R^2}{b}t^{\epsilon-1} + \frac{2}{2-\epsilon}.$$

Using this last inequality, and taking into account that f = γ′_d/γ_d ≥ C(t_c)/γ_d, (14) leads to (9). Setting t_c = 1 in (9) we obtain the weaker bound of (8).

Remark 1. Noticing that the number of updates t_c required for convergence of the t-margitron satisfies (12), we get

$$\gamma_d \le R\sqrt{t_c^{-1} + \frac{2}{2-\epsilon}\frac{b}{R^2}t_c^{-\epsilon}},$$

from where an alternative after-running lower bound on γ′_d/γ_d is obtainable. This bound, however, does not have to be smaller than 1 − ε/2.
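For concreteness, the guarantees of Theorem 1 are easy to evaluate numerically; the following snippet (with made-up values of R, γ_d, b and ε) transcribes (7), (8) and (9) directly:

```python
def t_margitron_bounds(R, gamma_d, b, eps, t_c=None):
    updates = (R**2 / gamma_d**2) / eps + (2 / (2 - eps) * b / gamma_d**2) ** (1 / eps)   # (7)
    f_before = 1.0 / (R**2 / b + 2 / (2 - eps))                                           # (8)
    f_after = None if t_c is None else 1.0 / (R**2 / b * t_c**(eps - 1) + 2 / (2 - eps))  # (9)
    return updates, f_before, f_after

# Example: the before-running guarantee (8) approaches its asymptote 1 - eps/2 as b grows.
print(t_margitron_bounds(R=1.0, gamma_d=0.1, b=10.0, eps=0.5))
```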
Remark 2. The r.h.s. of (14) has in the interval [1, +∞) a single extremum, which is a maximum, at

$$t_\star = \left(|1-2\epsilon|\,\frac{2-\epsilon}{2\epsilon}\,\frac{R^2}{b}\right)^{\frac{1}{1-\epsilon}\operatorname{sign}(1-2\epsilon)}.$$

Therefore, it is legitimate in calculating a lower bound on C(t_c)/γ_d using (14) to replace t_c with t_b, provided t_c ≥ t_⋆. This leads to the bound, stronger than the one of (8),

$$f \ge \left(\frac{R^2}{b}\,t_b^{\epsilon-1} + \frac{2}{2-\epsilon}\right)^{-1}, \qquad (15)$$

which, however, is γ_d-dependent. The condition t_c ≥ t_⋆ is automatically satisfied for 1/2 ≤ ε ≤ 1. For 0 < ε < 1/2, instead, we may ensure that t_c ≥ t_⋆ if the r.h.s. of (14) is larger than or equal to 1 for t = 1 and, as a consequence, the normalised margin threshold C(t) is initially not lower than the maximum directional margin γ_d, i.e. C(1) ≥ γ_d. A condition sufficient for this to be the case is

$$\frac{b}{R^2} \ge \frac{\gamma_d}{R}\left(1 + \frac{2}{2-\epsilon}\frac{\gamma_d}{R}\right).$$

In this event the algorithm is forced to converge only after C(t) has fallen below γ_d, which cannot occur as long as t < t_⋆. If we choose

$$\frac{b}{R^2} = \left(1-\frac{\epsilon}{2}\right)^{1-\epsilon}\delta^{-\epsilon}\left(\frac{\gamma_d^2}{R^2}\right)^{1-\epsilon} \qquad (16)$$

and replace in (15) t_b with its lower bound

$$t_{\mathrm{lb}} \equiv \left(\frac{2}{2-\epsilon}\frac{b}{\gamma_d^2}\right)^{1/\epsilon} = \frac{2}{2-\epsilon}\,\delta^{-1}\frac{R^2}{\gamma_d^2},$$

which is lower than the lower bound inferred from Lemma 1, we can easily verify that f ≥ (δ + 2/(2−ε))^{−1}. If 0 < ε < 1/2 the parameter δ should satisfy the constraint

$$\delta \le \left(1-\frac{\epsilon}{2}\right)^{\frac{1-\epsilon}{\epsilon}}\left(\frac{\gamma_d}{R}\right)^{\frac{1}{\epsilon}-2}\left(1 + \frac{2}{2-\epsilon}\frac{\gamma_d}{R}\right)^{-\frac{1}{\epsilon}},$$

which for 0 < ε ≪ 1/2 and γ_d/R ≪ 1 suggests a rather slow convergence. Thus, it is not advisable in this case to employ values of b for which the constraint on δ is satisfied. The algorithm will still be able to achieve a large fraction of γ_d if it happens to converge in a sufficiently large number of updates t_c, as can be deduced from (9).

Lemma 2. For x, y > 0 and −1 < ε ≤ 1 it holds that

$$\frac{x^{1+\epsilon}}{1+\epsilon} - \frac{y^{1+\epsilon}}{1+\epsilon} \le \frac{x^2 - y^2}{2\,y^{1-\epsilon}}. \qquad (17)$$

Proof. For ε = 1, (17) obviously holds as an equality. For −1 < ε < 1, (17) is equivalent to

$$1 \le \frac{1+\epsilon}{2}\alpha^{1-\epsilon} + \frac{1-\epsilon}{2}\alpha^{-(1+\epsilon)},$$

with α = x/y. The r.h.s. of the above inequality is minimised for α = 1 and takes the value 1.

Lemma 3. For t ≥ 1 and 0 < ε ≤ 1 it holds that

$$\frac{t^\epsilon - 1}{\epsilon} \le t^\epsilon(\ln t)^{1-\epsilon} - [\epsilon], \qquad (18)$$

where [ε] denotes the integer part of ε.

Proof. For t = 1 or ε = 1, (18) obviously holds as an equality. Let t > 1 and 0 < ε < 1. Then, with x = t^ε, (18), as a strict inequality, is equivalent to f(ε) ≡ ε^ε x(ln x)^{1−ε} − x + 1 > 0. For x ≥ e^e we have df/dε < 0, from where f(ε) > lim_{ε→1} f(ε) = 1. For 1 < x < e^e, instead, f has only one local minimum, at ε = e^{−1} ln x, with value at that minimum given by h(x) = x^{(e−1)/e} ln x − x + 1. It can easily be shown that

$$\frac{dh}{dx} = (1-e^{-1})\,x^{-e^{-1}}\ln x + x^{-e^{-1}} - 1$$

has no local minima in the interval (1, e^e). Thus, dh/dx > min{lim_{x→1} dh/dx, lim_{x→e^e} dh/dx} = 0. Therefore, h(x) > lim_{x→1} h(x) = 0 and, consequently, f(ε) > 0.

Lemma 4. Let

$$g(t) = t^\epsilon - \alpha_1\left(\frac{\ln t}{t}\right)^{1-\epsilon} - \alpha_2\,t^{-1} - \beta,$$

with t ∈ [1, +∞), 0 < ε < 1, α₁, α₂, β > 0 and α ≡ α₁ + α₂ ≥ 2 + ε. Then g(t₀) > 0 with

$$t_0 = \left(\frac{\alpha}{\epsilon} + \beta^{1/\epsilon}\right)\left[\ln\left(\frac{\alpha}{\epsilon} + \beta^{1/\epsilon}\right)\right]^{1-\epsilon}.$$

Proof. Let λ = α/(α + εβ^{1/ε}) < 1 and x = ln(α/(λε)) ≥ ζ ≡ ln(1 + 2/ε) > 1, so that t₀ = (α/(λε)) x^{1−ε} ≥ 1 + 2/ε > e, ln t₀ = x + (1−ε) ln x > 1, β = α^ε(1−λ)^ε/(λε)^ε and α₂ t₀^{−1} < α₂ (ln t₀/t₀)^{1−ε}. Then

$$g(t_0) > t_0^\epsilon - \alpha\left(\frac{\ln t_0}{t_0}\right)^{1-\epsilon} - \beta = t_0^\epsilon\left[1 - \lambda\epsilon\left(1 + (1-\epsilon)\frac{\ln x}{x}\right)^{1-\epsilon} - \frac{(1-\lambda)^\epsilon}{x^{(1-\epsilon)\epsilon}}\right]$$
$$> 1 - \lambda\epsilon\left(1 + (1-\epsilon)e^{-1}\right)^{1-\epsilon} - (1-\lambda)^\epsilon\,\zeta^{(\epsilon-1)\epsilon}.$$

Here we made use of t₀^ε > 1, (ln x)/x ≤ e^{−1} and x^{−(1−ε)ε} ≤ ζ^{(ε−1)ε}. This last expression is minimised with respect to λ for

$$\lambda = \frac{1 + (1-\epsilon)e^{-1} - \zeta^{-\epsilon}}{1 + (1-\epsilon)e^{-1}},$$

which, substituted, leads to g(t₀) > (1 + (1−ε)e^{−1})^{−ε} f(ε) with

$$f(\epsilon) \equiv (1-\epsilon)(1-\zeta^{-\epsilon}) + \left(1+(1-\epsilon)e^{-1}\right)^\epsilon - \left(1+\epsilon(1-\epsilon)e^{-1}\right).$$

Employing the expansion ln z = Σ_{k=1}^∞ (2/(2k−1)) ((z−1)/(z+1))^{2k−1} for z > 0 [4] we obtain ζ > ((1+ε)/2)^{−1}, from where ζ^{−ε} < ((1+ε)/2)^ε = (1 − (1−ε)/2)^ε < 1 − (ε/2)(1−ε). Moreover, (1+(1−ε)e^{−1})^ε − (1+ε(1−ε)e^{−1}) > −(1/2)ε(1−ε)³e^{−2}, since (1+z)^q − (1+qz) > (1/2)q(q−1)z² for z > 0 and 0 < q < 1. Thus f(ε) > (1/2)ε(1−ε)²(1 − (1−ε)e^{−2}) > 0, leading to g(t₀) > 0.
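The elementary inequalities (17) and (18), on which the analysis of the ℓ-margitron below rests, are easy to check numerically on a grid (again our own sanity check, not part of the paper):

```python
import itertools, math

# Lemma 2, inequality (17): x, y > 0, -1 < eps <= 1
for x, y, eps in itertools.product([0.5, 1.0, 2.0, 5.0], [0.5, 1.0, 3.0], [-0.5, 0.1, 1.0]):
    lhs = (x**(1 + eps) - y**(1 + eps)) / (1 + eps)
    assert lhs <= (x**2 - y**2) / (2 * y**(1 - eps)) + 1e-12

# Lemma 3, inequality (18): t >= 1, 0 < eps <= 1, int(eps) = [eps], the integer part
for t, eps in itertools.product([1.0, 1.5, 3.0, 10.0, 100.0], [0.1, 0.5, 1.0]):
    lhs = (t**eps - 1) / eps
    assert lhs <= t**eps * math.log(t)**(1 - eps) - int(eps) + 1e-12
```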
Theorem 2. The ℓ-margitron with 0 < ε ≤ 1 converges in

$$t_c \le \left(\frac{A}{\epsilon} + B^{1/\epsilon}\right)\left[\ln\left(\frac{A}{\epsilon} + B^{1/\epsilon}\right)\right]^{1-\epsilon} \qquad (19)$$

updates, with A = (2 + ε − 2[ε]) R²/γ_d² and B = (1+ε) b/γ_d^{1+ε}, to a solution hyperplane possessing directional margin γ′_d which is a fraction f of the maximum directional margin γ_d obeying the inequality

$$f \equiv \frac{\gamma_d'}{\gamma_d} \ge \left(\frac{(1+\epsilon)^{2\epsilon-[\epsilon]}}{(2\epsilon)^{\epsilon}}\,\frac{R^{1+\epsilon}}{b} + 1 + \epsilon\right)^{-1}. \qquad (20)$$

Moreover, for 0 < ε < 1 an after-running estimate of γ′_d/γ_d is obtainable from

$$\frac{\gamma_d'}{\gamma_d} \ge f_{\mathrm{est}} \equiv \left[\frac{R^{1+\epsilon}}{b}\left(N^{1+\epsilon} + \frac{1+\epsilon}{2\epsilon}\left(\frac{R}{\gamma_d'}\right)^{1-\epsilon}\left(t_c^\epsilon - N^\epsilon + \epsilon N^{\epsilon-1}\right)\right)t_c^{-1} + 1 + \epsilon\right]^{-1}. \qquad (21)$$

Here the integer N > 0 satisfies any of the constraints

$$t_c \ge N \ge \frac{1+\epsilon}{2}\left(\frac{R}{\gamma_d'}\right)^{1-\epsilon}, \qquad t_c \ge N\left(\frac{1-\epsilon N^{-1}}{1-\epsilon}\right)^{1/\epsilon}. \qquad (22)$$

Obviously, the choice N = 1 is always acceptable. A near-optimal choice of N is N_opt = ⌊(1/2)(R/γ′_d)^{1−ε}⌋ + 1, provided it satisfies one of the above constraints.

Proof. From (2), and taking into account (4), we get

$$\|a_{t+1}\|^2 - \|a_t\|^2 = \|y_k\|^2 + 2\,y_k\cdot a_t \le R^2 + 2b\|a_t\|^{1-\epsilon}$$

or, assuming t ≥ 1,

$$\frac{\|a_{t+1}\|^2 - \|a_t\|^2}{2\|a_t\|^{1-\epsilon}} \le \frac{R^2}{2\|a_t\|^{1-\epsilon}} + b.$$

By using (6) in the r.h.s. of the above inequality and (17) in its l.h.s. we obtain

$$\frac{\|a_{t+1}\|^{1+\epsilon}}{1+\epsilon} - \frac{\|a_t\|^{1+\epsilon}}{1+\epsilon} \le \frac{1}{2}\frac{R^2}{\gamma_d^{1-\epsilon}}\,t^{\epsilon-1} + b,$$

a repeated application t − N times (t > N ≥ 1) of which gives

$$\frac{\|a_t\|^{1+\epsilon}}{1+\epsilon} - \frac{\|a_N\|^{1+\epsilon}}{1+\epsilon} \le \frac{1}{2}\frac{R^2}{\gamma_d^{1-\epsilon}}\sum_{l=N}^{t-1}l^{\epsilon-1} + b(t-N) \le \frac{1}{2}\frac{R^2}{\gamma_d^{1-\epsilon}}\left(N^{\epsilon-1} + \int_N^{t-1}l^{\epsilon-1}dl\right) + bt$$
$$= \frac{1}{2}\frac{R^2}{\gamma_d^{1-\epsilon}}\left(N^{\epsilon-1} + \frac{(t-1)^\epsilon - N^\epsilon}{\epsilon}\right) + bt \le \frac{1}{2}\frac{R^2}{\gamma_d^{1-\epsilon}}\left(N^{\epsilon-1} + \frac{t^\epsilon}{\epsilon} - [\epsilon] - \frac{N^\epsilon}{\epsilon}\right) + bt.$$

Thus, employing the obvious bound ‖a_N‖ ≤ RN, we are led to

$$\|a_t\| \le L_{Nt} \equiv R\left[N^{1+\epsilon} + \frac{1+\epsilon}{2\epsilon}\left(\frac{R}{\gamma_d}\right)^{1-\epsilon}\left(t^\epsilon - N^\epsilon + (\epsilon-[\epsilon])N^{\epsilon-1}\right) + (1+\epsilon)\frac{b}{R^{1+\epsilon}}\,t\right]^{\frac{1}{1+\epsilon}}, \qquad (23)$$

which, although derived for t > N, turns out to be satisfied even for t = N. Combining (6) with (23) we obtain

$$\gamma_d^{1+\epsilon}t^{1+\epsilon} \le \|a_t\|^{1+\epsilon} \le L_{Nt}^{1+\epsilon}, \qquad (24)$$

from where

$$t^\epsilon \le \left(\frac{L_{Nt}}{\gamma_d}\right)^{1+\epsilon}t^{-1} \qquad (25)$$

or, equivalently,

$$g_N(t) \le 0 \qquad (26)$$

with

$$g_N(t) \equiv t^\epsilon - \left(\frac{L_{Nt}}{\gamma_d}\right)^{1+\epsilon}t^{-1} = t^\epsilon - \left(\frac{R}{\gamma_d}\right)^{1+\epsilon}N^{1+\epsilon}t^{-1} - \frac{1+\epsilon}{2\epsilon}\frac{R^2}{\gamma_d^2}\left(t^\epsilon - N^\epsilon + (\epsilon-[\epsilon])N^{\epsilon-1}\right)t^{-1} - (1+\epsilon)\frac{b}{\gamma_d^{1+\epsilon}}. \qquad (27)$$

Let us consider the derivative of g_N(t), dg_N/dt = D_N(t) t^{−2}, where

$$D_N(t) = \epsilon\,t^{1+\epsilon} + \left(\frac{R}{\gamma_d}\right)^{1+\epsilon}N^{1+\epsilon} + \frac{1+\epsilon}{2\epsilon}\frac{R^2}{\gamma_d^2}\left((1-\epsilon)t^\epsilon - N^\epsilon + (\epsilon-[\epsilon])N^{\epsilon-1}\right).$$

D_N(t) is strictly increasing and therefore has at most one root t_{rN} (D_N(t_{rN}) = 0), where obviously g_N(t) acquires a minimum (since g_N(t) is unbounded from above) with g_N(t_{rN}) < 0 (since g_N(N) < 0). Thus g_N(t) starts from negative values at t = N and, with t increasing, either tends monotonically to infinity or decreases further until it acquires a minimum at t = t_{rN} and then increases monotonically towards infinity. In both cases there is a single value t_{bN} of t for which

$$g_N(t_{bN}) = 0 \qquad (28)$$

and, moreover, for g_N(t) ≠ 0,

$$\operatorname{sign}(t - t_{bN}) = \operatorname{sign}(g_N(t)). \qquad (29)$$

The unique value t_{bN} of t for which (24), (25) and (26) hold as equalities provides an upper bound on the number of updates t_c required for convergence.
Combining (4), (5), (23), (27) and (28) we get

$$f = \frac{\gamma_d'}{\gamma_d} \ge \frac{C(t_c)}{\gamma_d} = \frac{b}{\gamma_d\|a_{t_c}\|^{\epsilon}} \ge \frac{b}{\gamma_d L_{Nt_c}^{\epsilon}} \ge \frac{b}{\gamma_d L_{Nt_{bN}}^{\epsilon}} = \frac{b}{\gamma_d^{1+\epsilon}t_{bN}^{\epsilon}}$$
$$= \left[\frac{R^{1+\epsilon}}{b}\left(N^{1+\epsilon} + \frac{1+\epsilon}{2\epsilon}\left(\frac{R}{\gamma_d}\right)^{1-\epsilon}\left(t_{bN}^\epsilon - N^\epsilon + (\epsilon-[\epsilon])N^{\epsilon-1}\right)\right)t_{bN}^{-1} + 1 + \epsilon\right]^{-1}. \qquad (30)$$

For ε = 1 the above lower bound on f is optimised for N = 1, in which case it reduces to (20). For 0 < ε < 1 we may replace in the above lower bound on f first γ_d with γ′_d and subsequently, on the condition that one of the constraints (22) is satisfied, t_{bN} with t_c, since both replacements can be shown to loosen the bound. Thus we obtain f ≥ f_est, with f_est given by (21). An approximate maximisation of f_est with respect to N leads to the near-optimal value N_opt of Theorem 2.

Let us choose N = 1 in (27) and replace (R/γ_d)^{1+ε} with R²/γ_d², thereby lowering the value of g₁(t):

$$g_1(t) \ge t^\epsilon - \frac{1+\epsilon}{2}\frac{R^2}{\gamma_d^2}\left(\frac{t^\epsilon - 1}{\epsilon} + \frac{2}{1+\epsilon} + 1 - [\epsilon]\right)\frac{1}{t} - (1+\epsilon)\frac{b}{\gamma_d^{1+\epsilon}}. \qquad (31)$$

By employing (18) in the r.h.s. of (31) we obtain

$$g_1(t) \ge \bar g(t) \equiv t^\epsilon - \frac{R^2}{\gamma_d^2}\left(\frac{1+\epsilon}{2}\,t^\epsilon(\ln t)^{1-\epsilon} + \frac{3+\epsilon}{2}(1-[\epsilon])\right)\frac{1}{t} - (1+\epsilon)\frac{b}{\gamma_d^{1+\epsilon}}. \qquad (32)$$

For 0 < ε < 1, ḡ(t) becomes a function of the type considered in Lemma 4 with α = (2+ε) R²/γ_d² ≥ 2+ε. Obviously, t₀ of Lemma 4 satisfies g₁(t₀) ≥ ḡ(t₀) > 0 and, according to (29), is an upper bound on t_{b1}. Also, for ε = 1, ḡ(t) becomes a function of the type considered in Lemma 1 and t_b of Lemma 1 is an upper bound on t_{b1}. Actually, in this very special case t_{b1} coincides with t_b (since (32) holds as an equality), which, in turn, coincides with its upper and lower bound. This, given that t_c ≤ t_{b1}, completes the proof of (19).

Alternatively, using −1/ε + 2/(1+ε) ≤ 0, 1 − [ε] ≤ (1−[ε])t^ε and (1+ε)(1+ε−[ε]) = (1+ε)^{2−[ε]} in the r.h.s. of (31), we obtain

$$g_1(t) \ge \tilde g(t) \equiv t^\epsilon - \frac{(1+\epsilon)^{2-[\epsilon]}}{2\epsilon}\frac{R^2}{\gamma_d^2}\,t^{\epsilon-1} - (1+\epsilon)\frac{b}{\gamma_d^{1+\epsilon}}.$$

The function g̃(t) is of the type considered in Lemma 1 and its only root t̃_b, satisfying

$$\tilde g(\tilde t_b) = 0, \qquad (33)$$

is an upper bound on the number of updates looser than t_{b1}, i.e. t_{b1} ≤ t̃_b. Moreover, the upper bound on t̃_b from Lemma 1 is an alternative upper bound on t_c. Combining (30) for N = 1, the inequality t_{b1} ≤ t̃_b and (33), we obtain

$$f \ge \frac{b}{\gamma_d^{1+\epsilon}t_{b1}^{\epsilon}} \ge \frac{b}{\gamma_d^{1+\epsilon}\tilde t_b^{\epsilon}} = \left[\frac{R^{1+\epsilon}}{b}\,\frac{(1+\epsilon)^{2-[\epsilon]}}{2\epsilon}\left(\frac{R}{\gamma_d}\right)^{1-\epsilon}\tilde t_b^{\epsilon-1} + 1 + \epsilon\right]^{-1}. \qquad (34)$$

Additionally, t̃_{lb1} ≡ ((1+ε)^{2−[ε]}/(2ε)) R/γ_d is a lower bound on t̃_b, since it is lower than the lower bound inferred from Lemma 1. Replacing t̃_b with its lower bound t̃_{lb1} in the r.h.s. of (34), we get the weaker bound (20).
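The before-running guarantees of Theorem 2 can be evaluated in the same spirit as those of Theorem 1; the following snippet (illustrative values, written for 0 < ε < 1 so that [ε] = 0) transcribes (19) and (20):

```python
import math

def l_margitron_bounds(R, gamma_d, b, eps):
    A = (2 + eps) * R**2 / gamma_d**2      # A = (2 + eps - 2[eps]) R^2/gamma_d^2, [eps] = 0
    B = (1 + eps) * b / gamma_d**(1 + eps)
    base = A / eps + B**(1 / eps)
    updates = base * math.log(base)**(1 - eps)                                             # (19)
    f_before = 1.0 / ((1 + eps)**(2 * eps) / (2 * eps)**eps * R**(1 + eps) / b + 1 + eps)  # (20)
    return updates, f_before

print(l_margitron_bounds(R=1.0, gamma_d=0.1, b=1.0, eps=0.1))
```

At ε = 1 the fraction guarantee collapses to (R²/b + 2)⁻¹, in agreement with (8) for the Perceptron with margin.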
Remark 3. The lower bounds (30) and (34) on the fraction f involving the unknown maximum margin γ_d are of great theoretical importance, because they guarantee before running that the algorithm will achieve a margin which is a more substantial fraction of γ_d than the one inferred from (20). As a consequence, values of the parameter b smaller than the ones inferred from (20) suffice in order for the before-running lower bound on the fraction f to be close to its asymptotic value (1+ε)^{−1}. This is quantified in the following theorem.

Theorem 3. The ℓ-margitron with 0 < ε ≤ 1 and b (at least as large as the one) given by

$$\frac{b}{R^{1+\epsilon}} = \frac{(1+\epsilon)^{3\epsilon-1-[\epsilon]}}{(2\epsilon\delta)^{\epsilon}}\left(\frac{\gamma_d}{R}\right)^{1-\epsilon} \qquad (35)$$

(δ > 0) converges in a finite number of updates to a solution hyperplane possessing directional margin γ′_d which is a fraction f of the maximum directional margin γ_d obeying the inequality

$$f = \frac{\gamma_d'}{\gamma_d} \ge (\delta + 1 + \epsilon)^{-1}. \qquad (36)$$

Proof. Notice that

$$\tilde t_{\mathrm{lb2}} \equiv \left((1+\epsilon)\frac{b}{\gamma_d^{1+\epsilon}}\right)^{1/\epsilon} = \frac{(1+\epsilon)^{3-[\epsilon]}}{2\epsilon\delta}\frac{R^2}{\gamma_d^2}$$

is a lower bound on t̃_b of (33), since it is lower than the lower bound inferred from Lemma 1. Replacing t̃_b with its lower bound t̃_{lb2} in the r.h.s. of (34) completes the proof. (Larger b's may be regarded as corresponding to smaller δ's.)

Remark 4. For ε ≪ 1 a more accurate determination of b ensuring that (36) holds is obtained from b/R^{1+ε} = ω^ε (γ_d/R)^{1−ε} with

$$\omega = \frac{1}{\delta}(1-\epsilon)(1+e^{-1})(2+\epsilon)(1+\epsilon)^{\frac{\epsilon-1}{\epsilon}}\ln\!\left(\frac{1}{\delta}\,e^{\frac{1}{1-\epsilon}}(1-\epsilon)(1+e^{-1})(2+\epsilon)(1+\epsilon)^{\frac{\epsilon-1}{\epsilon}}\frac{R^2}{\gamma_d^2}\right)$$

and 0 < δ ≤ e^{−1}(1+e^{−1})(2+ε) R²/γ_d². For such a b, and taking into account the constraint on δ, it can be verified that t̄_lb ≡ ((1+ε) b/γ_d^{1+ε})^{1/ε} = (1+ε)^{1/ε} ω R²/γ_d² satisfies the inequality t̄_lb > e. Moreover, any possible root of ḡ(t) defined in (32) and the single root t_{b1} of g₁(t) are necessarily larger than t̄_lb. Therefore, since t̄_lb > e, and given that dḡ/dt > 0 (d(ln t/t)/dt < 0) for t > e, there is a single root t̄_b of ḡ(t) satisfying t̄_b ≥ t_{b1} > t̄_lb > e. Combining (30) for N = 1 with the last inequality and the relation ḡ(t̄_b) = 0 we get

$$f \ge \frac{b}{\gamma_d^{1+\epsilon}t_{b1}^{\epsilon}} \ge \frac{b}{\gamma_d^{1+\epsilon}\bar t_b^{\epsilon}} = \left[\frac{R^2}{b\gamma_d^{1-\epsilon}}\left(\frac{1+\epsilon}{2}\,\bar t_b^{\epsilon}(\ln\bar t_b)^{1-\epsilon} + \frac{3+\epsilon}{2}\right)\bar t_b^{-1} + 1 + \epsilon\right]^{-1}$$
$$> \left[(2+\epsilon)\frac{R^2}{b\gamma_d^{1-\epsilon}}\left(\frac{\ln\bar t_b}{\bar t_b}\right)^{1-\epsilon} + 1 + \epsilon\right]^{-1} > \left[(2+\epsilon)\frac{R^2}{b\gamma_d^{1-\epsilon}}\left(\frac{\ln\bar t_{\mathrm{lb}}}{\bar t_{\mathrm{lb}}}\right)^{1-\epsilon} + 1 + \epsilon\right]^{-1}.$$

Let x = (1/δ) e^{1/(1−ε)} (1−ε)(1+e^{−1})(2+ε)(1+ε)^{(ε−1)/ε} R²/γ_d². Then ω R²/γ_d² = e^{−1/(1−ε)} x ln x and

$$(2+\epsilon)\frac{R^2}{b\gamma_d^{1-\epsilon}}\left(\frac{\ln\bar t_{\mathrm{lb}}}{\bar t_{\mathrm{lb}}}\right)^{1-\epsilon} = (2+\epsilon)(1+\epsilon)^{\frac{\epsilon-1}{\epsilon}}\,\omega^{-1}\left(\frac{1}{\epsilon}\ln(1+\epsilon) + \ln\left(\omega\frac{R^2}{\gamma_d^2}\right)\right)^{1-\epsilon}$$
$$< (2+\epsilon)(1+\epsilon)^{\frac{\epsilon-1}{\epsilon}}\,\omega^{-1}\left(1 + (1-\epsilon)\ln\left(\omega\frac{R^2}{\gamma_d^2}\right)\right) = \frac{\delta}{1+e^{-1}}\,\frac{\ln(x\ln x)}{\ln x} \le \delta \qquad (37)$$

(ln ln x / ln x ≤ e^{−1}). Thus our choice of b ensures that f > (δ + 1 + ε)^{−1}. Substituting b into (19), we conclude that in the ℓ-margitron, as ε, δ → 0, the upper bound on the number of updates behaves as t_c ∼ (ε^{−1} + δ^{−1} ln δ^{−1}) ln(ε^{−1} + δ^{−1} ln δ^{−1}) R²/γ_d². For ε → 0 with δ fixed, instead, the bound behaves as ∼ ε^{−1} ln ε^{−1} R²/γ_d². For δ ≪ 1 and δ/ε < λ ≈ 1, however, a more accurate upper bound on t_c may be obtained by observing that

$$0 = \bar g(\bar t_b) > \bar t_b^{\epsilon} - (2+\epsilon)\frac{R^2}{\gamma_d^2}\left(\frac{\ln\bar t_b}{\bar t_b}\right)^{1-\epsilon} - \bar t_{\mathrm{lb}}^{\epsilon} > \bar t_b^{\epsilon} - (2+\epsilon)\frac{R^2}{\gamma_d^2}\left(\frac{\ln\bar t_{\mathrm{lb}}}{\bar t_{\mathrm{lb}}}\right)^{1-\epsilon} - \bar t_{\mathrm{lb}}^{\epsilon},$$

from where (using also (37))

$$\bar t_b < \bar t_{\mathrm{lb}}\left(1 + (2+\epsilon)\frac{R^2}{\gamma_d^2}\,\bar t_{\mathrm{lb}}^{-\epsilon}\left(\frac{\ln\bar t_{\mathrm{lb}}}{\bar t_{\mathrm{lb}}}\right)^{1-\epsilon}\right)^{1/\epsilon} = \bar t_{\mathrm{lb}}\left(1 + \frac{2+\epsilon}{1+\epsilon}\frac{R^2}{b\gamma_d^{1-\epsilon}}\left(\frac{\ln\bar t_{\mathrm{lb}}}{\bar t_{\mathrm{lb}}}\right)^{1-\epsilon}\right)^{1/\epsilon}$$
$$< \bar t_{\mathrm{lb}}\left(1 + \frac{\delta}{1+\epsilon}\right)^{1/\epsilon} < \bar t_{\mathrm{lb}}(1+\delta)^{1/\epsilon} < \bar t_{\mathrm{lb}}\,e^{\delta/\epsilon} = e^{\delta/\epsilon}(1+\epsilon)^{1/\epsilon}\,\omega\,\frac{R^2}{\gamma_d^2}.$$

Taking into account that t_c ≤ t_{b1} ≤ t̄_b, we conclude that as δ → 0 with δ/ε bounded from above (e.g. δ = ε → 0) the upper bound on t_c behaves as ∼ δ^{−1} ln δ^{−1} R²/γ_d².
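In practice, (35) is the recipe for choosing b: fix a target before-running accuracy δ and the guarantee (36) follows. A direct transcription for 0 < ε < 1 (so [ε] = 0), with illustrative values and assuming γ_d were known:

```python
def b_from_accuracy(R, gamma_d, eps, delta):
    # (35) with [eps] = 0; by (36) this guarantees f >= 1/(delta + 1 + eps)
    return (R**(1 + eps) * (1 + eps)**(3 * eps - 1) / (2 * eps * delta)**eps
            * (gamma_d / R)**(1 - eps))

print(b_from_accuracy(R=1.0, gamma_d=0.1, eps=0.1, delta=0.1))
```

Since γ_d is unknown in practice, Section 4 first brackets it in an interval and then eliminates it from (35) via the two-stage procedure described there.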
Theorem 4. There is a value of the parameter b for which the ℓ-margitron with ε ≪ 1 converges to a solution hyperplane with directional margin γ′_d ≥ (1−2ε)γ_d in fewer than ∼ ε^{−1} ln ε^{−1} R²/γ_d² updates.

Proof. Set δ = ε in Remark 4 and notice that f ≥ (1+2ε)^{−1} ≥ 1 − 2ε.

Theorem 5. Both the t- and the ℓ-margitron with 1 < ε < 2 converge in t_c updates, with t_c bounded from above by

$$\frac{R^2}{\gamma_d^2} + \left(\frac{2}{2-\epsilon}\frac{b}{\gamma_d^2}\right)^{1/\epsilon} \quad\text{and}\quad \frac{R^2}{\gamma_d^2} + \left(\frac{2}{2-\epsilon}\frac{b}{\gamma_d^{1+\epsilon}}\right)^{1/\epsilon},$$

respectively, to a solution hyperplane possessing directional margin γ′_d which in the limit b → ∞ satisfies the inequality γ′_d ≥ (1 − ε/2)γ_d.

Proof. For the t-margitron the analysis of Theorem 1 that led to (13) remains valid and the single root t_b of g(t) still provides an upper bound on t_c. The bound on t_c stated in Theorem 5 is the upper bound on t_b inferred from Lemma 1. The analysis that led to (9) also remains valid, but we are no longer allowed to replace t_c with its lower bound t_c = 1. Instead, we may replace t_c in (9) with its upper bound stated in Theorem 5. Then, as b → ∞, we get γ′_d ≥ (1 − ε/2)γ_d.

In the case of the ℓ-margitron, a_t · y_k for a misclassified pattern y_k may be bounded from above by employing (4) and (6) as a_t · y_k ≤ b‖a_t‖^{1−ε} ≤ b(γ_d t)^{1−ε}. Then the analysis of Theorem 1 that led to (13) remains valid with the replacement of b by bγ_d^{1−ε}. The bound on t_c stated in Theorem 5 is the upper bound on t_b inferred from Lemma 1. For the fraction f, instead, employing (4), (5), (11) and (13), with the last two relations taken at t = t_b as equalities, we have

$$f = \frac{\gamma_d'}{\gamma_d} \ge \frac{b}{\gamma_d\|a_{t_c}\|^{\epsilon}} \ge \frac{b}{\gamma_d^{1+\epsilon}t_b^{\epsilon}} = \left(\frac{R^2}{b\gamma_d^{1-\epsilon}}\,t_b^{\epsilon-1} + \frac{2}{2-\epsilon}\right)^{-1}.$$

Replacing t_b with its upper bound R²/γ_d² + ((2/(2−ε)) b/γ_d^{1+ε})^{1/ε} in the above relation leads to a weaker bound, from where we get lim_{b→∞} f ≥ 1 − ε/2.

4 Experiments

To reduce the computational cost we follow [17] and form a reduced "active set" of patterns, consisting of the ones found misclassified during each epoch, which are then cyclically presented to the Margitron algorithm for N_ep mini-epochs, unless no update occurs during a mini-epoch. Subsequently, a new full epoch involving all the patterns takes place, giving rise to a new active set. The algorithm terminates only if no mistake occurs during a full epoch. This procedure clearly amounts to a different way of sequentially presenting the patterns to the algorithm and does not affect the applicability of our theoretical analysis. A sketch of the scheme in code is given below.

We compare the t- and the ℓ-margitron with SVMs on the basis of their ability to achieve fast convergence to a certain approximation of the "optimal" hyperplane in the feature space where the patterns are linearly separable.
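The following self-contained sketch implements the active-set scheme described above for the ℓ-margitron (names and structure are ours; the update inside one_pass is the Fig. 1 rule):

```python
import numpy as np

def one_pass(a, state, Y, idx, b, eps):
    """One pass of the l-margitron update of Fig. 1 over the given indices.
    Returns the indices at which a mistake occurred."""
    mistakes = []
    for k in idx:
        p = a @ Y[k]
        if p <= state["thr"]:
            a += Y[k]
            state["ell"] += 2.0 * p + Y[k] @ Y[k]
            state["t"] += 1
            state["thr"] = b * state["ell"] ** (0.5 * (1.0 - eps))
            mistakes.append(k)
    return mistakes

def train_active_set(Y, b, eps, n_ep):
    a = np.zeros(Y.shape[1])
    state = {"t": 0, "ell": 0.0, "thr": 0.0}
    while True:
        active = one_pass(a, state, Y, range(len(Y)), b, eps)   # full epoch
        if not active:
            return a, state["t"]       # no mistake during a full epoch: terminate
        for _ in range(n_ep):          # at most n_ep mini-epochs on the active set
            if not one_pass(a, state, Y, active, b, eps):
                break
```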
For linearly separable data the feature space is the initial instance space, whereas for linearly inseparable data (which is the case here) a space extended by as many dimensions as the instances is considered, where each instance is placed at a distance ∆ from the origin in the corresponding dimension. The extension generates a margin of at least ∆/√n, with n being the number of patterns, and amounts to adding a term ∆² to the diagonal entries of the kernel (linear in our case). Moreover, its employment is justified by the well-known equivalence between the hard margin optimisation in the extended space and the soft margin optimisation in the initial instance space with objective function ‖w‖² + ∆^{−2} Σ_i ξ_i², involving the weight vector w and the 2-norm of the slacks ξ_i [1]. We emphasise that SVMs and the Margitron are required to solve identical hard margin problems.

In our experiments SVMs are represented by SVM^light [5], denoted here as SVM_l, a decomposition method algorithm which is many orders of magnitude faster than standard SVMs. For SVM_l we choose a memory parameter m = 400MB and a 1-norm soft margin parameter C = 10⁵ (approximating C = ∞), since we are dealing with a hard margin problem in the appropriate feature space. The choice of the accuracy ε depends on the case. For the remaining parameters default values are used. The experiments were conducted on a 1.8 GHz Intel Pentium M processor with 504 MB RAM running Windows XP. The codes, written in C++, were compiled with Microsoft's Visual C++ 5.0 compiler.

The datasets we used for training are the Adult (32561 instances, 123 binary attributes) and Web (49749 instances, 300 binary attributes) UCI datasets as compiled by Platt (see [5]), the test0 set from the Reuters RCV1 collection (199328 instances, 47236 attributes with average sparsity 0.16%), obtainable from http://www.jmlr.org/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm, and the multiclass Covertype (Cover) UCI dataset (581012 instances, 54 attributes). In the case of RCV1 we considered both the C11 and the CCAT binary text classification tasks, while in the case of the Covertype dataset we studied the binary classification problem of the first class versus all the others. The Covertype dataset was rescaled by multiplying all the attributes with 0.001.

In Table 1 we present the results (i.e. geometric margin γ′ achieved and CPU secs needed) of our first comparative study, involving the algorithms SVM_l, t-margitron and ℓ-margitron, together with the values of the parameters employed. A solution hyperplane in the extended space was first obtained using SVM_l and subsequently the Margitron was required to obtain a solution of comparable geometric margin.

Table 1. Results of a comparative study of SVM_l, t-margitron and ℓ-margitron.

| data set | ∆ | SVM_l (ε=0.01): 10³γ′ | Secs | ρ | N_ep | t-marg.: ε | 10⁸ b/R² | 10³γ′ | Secs | ℓ-marg.: ε | 10⁵ b/R^(1+ε) | 10³γ′ | Secs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Adult | 1 | 8.4899 | 1810.3 | 0.1 | 50 | 0.001 | 491.1 | 8.4917 | 72.2 | 0.0005 | 220.4 | 8.4903 | 68.2 |
| Web | 1 | 20.941 | 250.0 | 0.1 | 10 | 0.2 | 8400 | 20.944 | 17.6 | 0.2 | 1250 | 20.942 | 17.4 |
| C11 | 0.1 | 1.7818 | 6172.0 | 0.1 | 50 | 0.2 | 10000 | 1.7822 | 631.7 | 0.2 | 1600 | 1.7821 | 655.3 |
| CCAT | 0.1 | 0.9016 | 48235.0 | 0.1 | 50 | 0.1 | 514.2 | 0.9016 | 2324.8 | 0.1 | 285 | 0.9018 | 2369.7 |
| Cover | 10 | 15.774 | 47987.7 | 1 | 20 | 0.01 | 158.6 | 15.774 | 1866.1 | 0.005 | 121.7 | 15.776 | 1760.0 |
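The instance-space extension described above is simple to realise explicitly; for a linear kernel it is equivalent to adding ∆² on the kernel diagonal, which the small check below confirms (our own illustration):

```python
import numpy as np

def extend_instances(X, Delta):
    """Give each of the n instances its own extra dimension holding Delta."""
    return np.hstack([X, Delta * np.eye(X.shape[0])])   # n x (d + n)

X = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
Z = extend_instances(X, Delta=0.5)
# extended linear kernel = original kernel + Delta^2 on the diagonal
assert np.allclose(Z @ Z.T, X @ X.T + 0.25 * np.eye(3))
```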
The extended space parameter ∆ refers to both SVM_l and the Margitron, while the augmented space parameter ρ and the number of mini-epochs N_ep refer only to the Margitron. Also, for the Margitron, γ′ is the geometric margin in the original (non-augmented) feature space, with the augmentation providing for the bias. We see that the Margitron is at least 10-20 times faster than SVM_l on these rather large datasets. It is understood, of course, that some additional computer time was spent to locate the appropriate value of b.

Recently SVM^perf [6], a cutting-plane algorithm for training linear SVMs, was presented. We did make an attempt at including SVM^perf in our comparative study, but we found that it requires a much longer CPU time to converge compared to SVM_l, without even achieving as large values of the margin γ′. Table 2 contains our experimental results on the datasets Adult and Web (∆ = 1). Apparently, the "accuracy" ε of SVM^perf is not directly related to the fraction of the maximum margin achieved.

Table 2. Results of experiments with SVM^perf.

| data set | ε | C | 10³γ′ | Secs |
|---|---|---|---|---|
| Adult | 3×10⁻⁴ | 10⁸ | 5.9436 | 54450.3 |
| Web | 2×10⁻⁵ | 10⁸ | 20.891 | 7297.9 |

In Table 3 we present the directional margin γ′_d achieved by the t- and the ℓ-margitron, together with the after-running estimate f_est of the ratio γ′_d/γ_d and its asymptotic value for comparison. Let us accept that the geometric margin γ′ reported in Table 1 is larger than 99% of the maximum geometric margin γ, as the accuracy ε = 0.01 of SVM_l suggests. Then, taking into account that γ ≥ γ_d and that (γ′ − γ′_d)/γ′ < 0.02, we see that γ′_d/γ_d > 0.97. Thus we may conclude that the estimates of Table 3 are certainly impressive, given that they come from worst-case bounds which are not expected to be very tight and that they cannot, of course, exceed their asymptotic values.

Table 3. The directional margin γ′_d achieved by the t- and the ℓ-margitron together with the after-running estimate f_est of the ratio γ′_d/γ_d and its asymptotic value.

| data set | t-marg.: 10³γ′_d | f_est | 1−ε/2 | ℓ-marg.: 10³γ′_d | f_est | (1+ε)⁻¹ |
|---|---|---|---|---|---|---|
| Adult | 8.4917 | 0.9898 | 0.9995 | 8.4903 | 0.9420 | 0.9995 |
| Web | 20.574 | 0.8645 | 0.9000 | 20.573 | 0.7561 | 0.8333 |
| C11 | 1.7789 | 0.8923 | 0.9000 | 1.7787 | 0.8156 | 0.8333 |
| CCAT | 0.9016 | 0.9404 | 0.9500 | 0.9018 | 0.8752 | 0.9091 |
| Cover | 15.714 | 0.9873 | 0.9950 | 15.716 | 0.9703 | 0.9950 |

From (16) and (35) it becomes apparent that the minimal value of b guaranteeing the desired accuracy depends on the maximum directional margin γ_d. Moreover, this dependence becomes increasingly crucial with decreasing ε.
This last observation prompts us to proceed to a determination of the large margin solution in successive runnings, starting with the Margitron with ε = 1, which is the most insensitive to the value of γ_d, and gradually moving towards employing algorithms with smaller ε's, able to guarantee larger fractions of γ_d. Each running in this process will provide us with an interval in which the value of γ_d lies which, hopefully, will shrink as we move towards smaller ε's. This information will then allow us to fix the value of b to be used in the next running. The lower bound on γ_d will be the margin γ′_d achieved. The upper bound γ^up_d will be provided by exploiting the after-running estimate f_est of γ′_d/γ_d, which gives γ^up_d = γ′_d f_est^{−1}. Alternatively, we may employ the upper bound on the number of updates t_c required for convergence to obtain a value for γ^up_d. For ε = 1 this gives γ^up_d = R√((1 + 2b/R²) t_c^{−1}), which is usually lower than the upper bound (R²/b + 2)γ′_d on γ_d obtained from γ′_d/γ_d ≥ (R²/b + 2)^{−1}. This procedure may be followed using either the t- or the ℓ-margitron, but in the former case we may encounter difficulties for ε < 1/2, due to the lack of the strong before-running guarantees stemming from (15).

In Table 4 we present the results of a second comparative study, between the ℓ-margitron and SVM_l. For the ℓ-margitron we followed the procedure of successive runnings that we just described, involving only two stages, with ε values 1 and 0.1. The extended and augmented feature spaces were identical to the ones of Table 1 and a common value N_ep = 50 was chosen for all datasets. Also, in the first stage (ε = 1) we made the common choice b/R² = 5 and obtained γ^up_d from the relation γ^up_d = R√(11 t_c^{−1}). Then, in the second stage (ε = 0.1), we fixed b from (35) with δ = (γ_d/γ^up_d)^{(1−ε)/ε} ≤ 1 (which eliminates the dependence of b on γ_d), employing the γ^up_d obtained in the first stage. This way we shift the uncertainty in γ_d/γ^up_d to the before-running accuracy δ and rely on the after-running lower bound f_est on γ′_d/γ_d to assess the accuracy actually achieved. We see that f_est is well above 0.8 for all datasets. A comparison with SVM_l on solutions of comparable margin reveals that the ℓ-margitron remains considerably faster, even if the time spent to fix b is taken into account.

Table 4. A comparison between the ℓ-margitron (successive runnings) and SVM_l.

| data set | ℓ-marg. ε=1: 10³γ′_d | 10³γ^up_d | Secs | ε=0.1: 10⁵ b/R^(1+ε) | 10³γ′_d | f_est | 10³γ′ | Secs | SVM_l: ε | 10³γ′ | Secs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Adult | 6.8839 | 11.352 | 3.2 | 577 | 8.3274 | 0.838 | 8.3274 | 39.5 | 0.055 | 8.3257 | 1178.9 |
| Web | 19.202 | 29.840 | 5.0 | 551 | 20.677 | 0.864 | 21.053 | 44.0 | 0.0031 | 21.051 | 291.2 |
| C11 | 1.5435 | 2.5607 | 121.7 | 506 | 1.7765 | 0.863 | 1.7798 | 774.7 | 0.012 | 1.7798 | 5952.0 |
| CCAT | 0.7800 | 1.2701 | 366.7 | 270 | 0.8989 | 0.855 | 0.8989 | 1771.5 | 0.0165 | 0.8984 | 42548.8 |
| Cover | 10.644 | 19.566 | 334.5 | 301 | 14.674 | 0.816 | 14.735 | 527.4 | 0.085 | 14.718 | 29402.8 |
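The two-stage procedure of Table 4 can be sketched as follows (our own outline; run_l_margitron stands for a routine that trains the ℓ-margitron and returns the achieved directional margin γ′_d and the number of updates t_c):

```python
import math

def two_stage(Y, R, run_l_margitron):
    # Stage 1: eps = 1 (Perceptron with margin), common choice b = 5 R^2.
    gamma_prime, t_c = run_l_margitron(Y, b=5.0 * R**2, eps=1.0)
    gamma_up = R * math.sqrt(11.0 / t_c)   # gamma_d <= R sqrt((1 + 2b/R^2)/t_c)
    # Stage 2: eps = 0.1. Fixing b from (35) with delta = (gamma_d/gamma_up)^((1-eps)/eps)
    # cancels the unknown gamma_d, leaving gamma_up in its place:
    eps = 0.1
    b = (R**(1 + eps) * (1 + eps)**(3 * eps - 1) / (2 * eps)**eps
         * (gamma_up / R)**(1 - eps))
    return run_l_margitron(Y, b=b, eps=eps)
```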
5 Conclusions

We generalised the classical Perceptron algorithm with margin by constructing the Margitron, a family of incremental large margin classifiers all the members of which employ the original perceptron update. The Margitron consists of two classes: the t-margitron, with algorithms involving explicitly the number of updates, and the ℓ-margitron, whose members depend only on the length of the weight vector and as such lie closer in spirit to the Perceptron. We proved that, as the parameter ε decreases from 2 to 0, the corresponding algorithms in both classes converge in a finite number of updates to hyperplanes possessing a guaranteed fraction of the maximum margin, the largest possible value of which varies continuously in the interval (0, 1). The Perceptron with margin belongs to both classes and is associated with the middle point of the above intervals.

Finally, our experimental comparative study between algorithms from the Margitron family and SVM^light on tasks involving linear kernels and 2-norm soft margin revealed that the Margitron is a serious alternative to linear SVMs.

References

1. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, UK (2000)
2. Duda, R.O., Hart, P.E.: Pattern Classification and Scene Analysis. Wiley (1973)
3. Gentile, C.: A new approximate maximal margin classification algorithm. Journal of Machine Learning Research 2 (2001) 213–242
4. Gradshteyn, I.S., Ryzhik, I.M.: Tables of Integrals, Series and Products. Academic Press (2007)
5. Joachims, T.: Making large-scale SVM learning practical. In: Advances in Kernel Methods – Support Vector Learning. MIT Press (1999)
6. Joachims, T.: Training linear SVMs in linear time. KDD (2006) 217–226
7. Kivinen, J., Smola, A.J., Williamson, R.C.: Online learning with kernels. IEEE Transactions on Signal Processing 52 (2004) 2165–2176
8. Krauth, W., Mézard, M.: Learning algorithms with optimal stability in neural networks. Journal of Physics A 20 (1987) L745–L752
9. Li, Y., Long, P.: The relaxed online maximum margin algorithm. Machine Learning 46 (2002) 361–387
10. Li, Y., Zaragoza, H., Herbrich, R., Shawe-Taylor, J., Kandola, J.: The perceptron algorithm with uneven margins. ICML (2002) 379–386
11. Novikoff, A.B.J.: On convergence proofs on perceptrons. In: Proc. Symp. Math. Theory Automata, Vol. 12 (1962) 615–622
12. Rosenblatt, F.: The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65(6) (1958) 386–408
13. Shalev-Shwartz, S., Singer, Y., Srebro, N.: Pegasos: Primal estimated sub-gradient solver for SVM. ICML (2007) 807–814
14. Shawe-Taylor, J., Bartlett, P.L., Williamson, R.C., Anthony, M.: Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory 44(5) (1998) 1926–1940
15. Tsampouka, P., Shawe-Taylor, J.: Analysis of generic perceptron-like large margin classifiers. ECML (2005) 750–758
16. Tsampouka, P., Shawe-Taylor, J.: Constant rate approximate maximum margin algorithms. ECML (2006) 437–448
17. Tsampouka, P., Shawe-Taylor, J.: Approximate maximum margin algorithms with rules controlled by the number of mistakes. ICML (2007) 903–910
18. Vapnik, V.: Statistical Learning Theory. Wiley (1998)
