Sparse Online Learning via Truncated Gradient


Authors: John Langford, Lihong Li, Tong Zhang

John Langford (Yahoo Research, jl@yahoo-inc.com), Lihong Li (Rutgers Computer Science Department, lihong@cs.rutgers.edu), Tong Zhang (Rutgers Statistics Department, tongz@rci.rutgers.edu)

Abstract

We propose a general method called truncated gradient to induce sparsity in the weights of online learning algorithms with convex loss functions. This method has several essential properties:

1. The degree of sparsity is continuous: a parameter controls the rate of sparsification from no sparsification to total sparsification.

2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1-regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online learning guarantees.

3. The approach works well empirically. We apply the approach to several datasets and find that for datasets with large numbers of features, substantial sparsity is discoverable.

1 Introduction

We are concerned with machine learning over large datasets. As an example, the largest dataset we use here has over 10^7 sparse examples and 10^9 features, using about 10^11 bytes. In this setting, many common approaches fail, simply because they cannot load the dataset into memory or they are not sufficiently efficient. There are roughly two approaches which can work:

1. Parallelize a batch learning algorithm over many machines (e.g., [3]).

2. Stream the examples to an online learning algorithm (e.g., [9], [10], [2], and [6]).

This paper focuses on the second approach. Typical online learning algorithms have at least one weight for every feature, which is too much in some applications, for a couple of reasons:

1. Space constraints. If the state of the online learning algorithm overflows RAM, it cannot run efficiently.
A similar problem occurs if the state overflows the L2 cache.

2. Test-time constraints on computation. Substantially reducing the number of features can yield substantial improvements in the computational time required to evaluate a new sample.

This paper addresses the problem of inducing sparsity in learned weights while using an online learning algorithm. There are several ways to do this wrong for our problem. For example:

1. Simply adding L1 regularization to the gradient of an online weight update doesn't work, because gradients don't induce sparsity. The essential difficulty is that a gradient update has the form a + b, where a and b are two floats. Very few float pairs add to 0 (or any other default value), so there is little reason to expect a gradient update to accidentally produce sparsity.

2. Simply rounding weights to 0 is problematic, because a weight may be small due to being useless, or small because it has been updated only once (either at the beginning of training or because the set of features appearing is also sparse). Rounding techniques can also play havoc with standard online learning guarantees.

3. Black-box wrapper approaches which eliminate features and test the impact of the elimination are not efficient enough. These approaches typically run an algorithm many times, which is particularly undesirable with large datasets.

1.1 What Others Do

The Lasso algorithm [13] is commonly used to achieve L1 regularization for linear regression. This algorithm does not work automatically in an online fashion. There are two formulations of L1 regularization. Consider a loss function L(w, z_i) which is convex in w, where z_i = (x_i, y_i) is an input/output pair. One is the convex constraint formulation

\hat{w} = \arg\min_w \sum_{i=1}^n L(w, z_i) \quad \text{subject to} \quad \|w\|_1 \le s,  (1)

where s is a tunable parameter.
The other is soft regularization, where

\hat{w} = \arg\min_w \sum_{i=1}^n L(w, z_i) + \lambda \|w\|_1.  (2)

With appropriately chosen λ, the two formulations are equivalent. The convex constraint formulation has a simple online version using the projection idea in [15]. It requires projecting the weight vector w onto an L1 ball at every online step. This operation is difficult to implement efficiently for large-scale data where we have examples with sparse features and a large number of features. In such a situation, we require the number of operations per online step to be linear in the number of nonzero features, and independent of the total number of features. Our method, which works with the soft-regularization formulation (2), satisfies this requirement. Additional details can be found in Section 5.

In addition to the L1 regularization formulation (2), the family of online algorithms we consider in this paper also includes some nonconvex sparsification techniques. The Forgetron algorithm [4] is an online learning algorithm that manages memory use. It operates by decaying the weights on previous examples and then rounding these weights to zero when they become small. The Forgetron is stated for kernelized online algorithms, while we are concerned with the simple linear setting. When applied to a linear kernel, the Forgetron is not computationally or space competitive with approaches operating directly on feature weights.

1.2 What We Do

At a high level, the approach we take is weight decay to a default value. This simple approach enjoys a strong performance guarantee, as discussed in Section 3. For instance, the algorithm never performs much worse than a standard online learning algorithm, and the additional loss due to sparsification is controlled continuously with a single real-valued parameter.
The theory gives a family of algorithms with convex loss functions for inducing sparsity, one per online learning algorithm. We instantiate this for square loss and show how to deal with sparse examples efficiently in Section 4.

As mentioned in the introduction, we are mainly interested in sparse online methods for large-scale problems with sparse features. For such problems, our algorithm should satisfy the following requirements:

• The algorithm should be computationally efficient: the number of operations per online step should be linear in the number of nonzero features, and independent of the total number of features.

• The algorithm should be memory efficient: it needs to maintain a list of active features, and can insert (when the corresponding weight becomes nonzero) and delete (when the corresponding weight becomes zero) features dynamically.

The implementation details, showing that our methods satisfy the above requirements, are provided in Section 5.

Theoretical results stating how much sparsity is achieved using this method generally require additional assumptions which may or may not be met in practice. Consequently, we rely on experiments in Section 6 to show that our method achieves good sparsity in practice. We compare our approach to a few others, including L1 regularization on small data, as well as online rounding of coefficients to zero.

2 Online Learning with GD

In the setting of standard online learning, we are interested in sequential prediction problems where, repeatedly for i = 1, 2, . . .:

1. An unlabeled example x_i arrives.

2. We make a prediction based on the existing weights w_i ∈ R^d.

3. We observe y_i, let z_i = (x_i, y_i), and incur some known loss L(w_i, z_i) that is convex in the parameter w_i.

4. We update the weights according to some rule: w_{i+1} ← f(w_i).
We want to come up with an update rule f which allows us to bound the sum of losses

\sum_{i=1}^t L(w_i, z_i),

as well as achieving sparsity. For this purpose, we start with the standard stochastic gradient descent rule, which is of the form

f(w_i) = w_i - \eta \nabla_1 L(w_i, z_i),  (3)

where ∇_1 L(a, b) is a sub-gradient of L(a, b) with respect to the first variable a. The parameter η > 0 is often referred to as the learning rate. In our analysis, we only consider a constant learning rate with fixed η > 0, for simplicity.

In theory, it might be desirable to have a decaying learning rate η_i which becomes smaller as i increases, in order to get the so-called no-regret bound without knowing T in advance. However, if T is known in advance, one can select a constant η accordingly so that the regret vanishes as T → ∞. Since our focus is on sparsity, not on how to choose the learning rate, for clarity we use a constant learning rate in the analysis because it leads to simpler bounds.

The above method has been widely used in online learning, e.g., [10] and [2]. Moreover, it is argued to be efficient even for solving batch problems, where we repeatedly run the online algorithm over the training data multiple times. For example, the idea has been successfully applied to solve large-scale standard SVM formulations [11, 14]. In the scenario outlined in the introduction, online learning methods are more suitable than some traditional batch learning methods. However, a main drawback of (3) is that it does not achieve sparsity, which we address in this paper. Note that in the literature, this particular update rule is often referred to as gradient descent (GD) or stochastic gradient descent (SGD). There are other variants, such as exponentiated gradient descent (EG).
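For concreteness, the plain GD/SGD rule (3) for square loss can be sketched in a few lines of NumPy. This is our own minimal illustration on a made-up toy problem, not the paper's implementation; it also demonstrates the lack-of-sparsity problem discussed in the introduction, since the irrelevant coordinates end up small but not exactly zero.

```python
import numpy as np

def sgd_step(w, x, y, eta):
    """One stochastic gradient step (3) for square loss L(w, z) = (w.x - y)^2.

    The sub-gradient with respect to w is 2*(w.x - y)*x.  Nothing here ever
    sets a weight exactly to zero, which is why (3) alone yields no sparsity.
    """
    residual = float(np.dot(w, x)) - y
    return w - eta * 2.0 * residual * x

# A toy problem where only the first of three features matters.
rng = np.random.default_rng(0)
w = np.zeros(3)
for _ in range(200):
    x = rng.normal(size=3)
    y = 2.0 * x[0]                    # label depends only on feature 0
    w = sgd_step(w, x, y, eta=0.05)
print(w)  # w[0] approaches 2; w[1] and w[2] become small but not exactly 0
```

Running this, the irrelevant weights shrink toward zero yet remain nonzero floats, exactly the behavior the truncation operators below are designed to fix.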
Since our focus in this paper is sparsity, not GD versus EG, we shall only consider modifications of (3), for simplicity.

3 Sparse Online Learning

In this section, we examine several methods for achieving sparsity in online learning. The first idea is simple coefficient rounding, which is the most natural method. We will then consider its full online implementation, and another method which is the online counterpart of L1 regularization in batch learning. As we shall see, all these ideas are closely related.

3.1 Simple Coefficient Rounding

In order to achieve sparsity, the most natural method is to round small coefficients (those no larger than a threshold θ > 0) to zero after every K online steps. That is, if i/K is not an integer, we use the standard GD rule in (3); if i/K is an integer, we modify the rule as

f(w_i) = T_0(w_i - \eta \nabla_1 L(w_i, z_i), \theta),  (4)

where for a vector v = [v_1, . . . , v_d] ∈ R^d and a scalar θ ≥ 0, T_0(v, θ) = [T_0(v_1, θ), . . . , T_0(v_d, θ)], with

T_0(v_j, \theta) = \begin{cases} 0 & \text{if } |v_j| \le \theta \\ v_j & \text{otherwise.} \end{cases}

That is, we first perform a standard stochastic gradient descent step, and then round the updated coefficients toward zero. The effect is to remove the nonzero but small components of the weight vector.

In general, we should not take K = 1, especially when η is small, since each step modifies w_i by only a small amount. If a coefficient is zero, it remains small after one online update, and the rounding operation pulls it back to zero. Consequently, rounding can be done only after every K steps (with a reasonably large K); in this case, nonzero coefficients have sufficient time to go above the threshold θ. However, if K is too large, then in the training stage we will need to keep many more nonzero features in the intermediate steps before they are rounded to zero.
In the extreme case, we may simply round the coefficients at the end, which does not solve the storage problem in the training phase. The sensitivity in choosing an appropriate K is a main drawback of this method; another drawback is the lack of a theoretical guarantee for its online performance.

The above-mentioned issues motivate us to consider more principled sparse online learning methods. In Section 3.3, we derive an online version of rounding using an idea called truncated gradient, for which regret bounds hold.

3.2 A Sub-gradient Algorithm for L1 Regularization

In our experiments, we combine rounding-at-the-end-of-training with a simple online sub-gradient method for L1 regularization with a regularization parameter g > 0:

f(w_i) = w_i - \eta \nabla_1 L(w_i, z_i) - \eta g \,\mathrm{sgn}(w_i),  (5)

where for a vector v = [v_1, . . . , v_d], sgn(v) = [sgn(v_1), . . . , sgn(v_d)], with sgn(v_j) = 1 when v_j > 0, sgn(v_j) = -1 when v_j < 0, and sgn(v_j) = 0 when v_j = 0. In the experiments, the online method (5) plus rounding at the end is used as a simple baseline. One should note that this method does not produce sparse weights online. Therefore it does not handle large-scale problems for which we cannot keep all features in memory.

3.3 Truncated Gradient

In order to obtain an online version of the simple rounding rule in (4), we observe that direct rounding to zero is too aggressive. A less aggressive version is to shrink the coefficient toward zero by a smaller amount. We call this idea truncated gradient. The amount of shrinkage is measured by a gravity parameter g_i > 0:

f(w_i) = T_1(w_i - \eta \nabla_1 L(w_i, z_i), \eta g_i, \theta),  (6)

where for a vector v = [v_1, . . . , v_d] ∈ R^d and scalars α, θ ≥ 0, T_1(v, α, θ) = [T_1(v_1, α, θ), . . . , T_1(v_d, α, θ)], with

T_1(v_j, \alpha, \theta) = \begin{cases} \max(0, v_j - \alpha) & \text{if } v_j \in [0, \theta] \\ \min(0, v_j + \alpha) & \text{if } v_j \in [-\theta, 0] \\ v_j & \text{otherwise.} \end{cases}

Again, the truncation can be performed every K online steps. That is, if i/K is not an integer, we let g_i = 0; if i/K is an integer, we let g_i = Kg for a gravity parameter g > 0. This particular choice is equivalent to (4) when we set g such that ηKg ≥ θ. This requires a large g when η is small. In practice, one should set a small, fixed g, as implied by our regret bound developed later. In general, the larger the parameters g and θ are, the more sparsity is incurred. Due to the extra truncation T_1, this method can lead to sparse solutions, which is confirmed in our experiments described later. In those experiments, the degree of sparsity discovered varies with the problem.

A special case, which we use in the experiments, is to let g = θ in (6). In this case, we can use a single parameter g to control sparsity. Since ηKg ≪ θ when ηK is small, the truncation operation is less aggressive than the rounding in (4). At first sight, the procedure appears to be an ad hoc way to fix (4). However, we can establish a regret bound for this method, showing that it is theoretically sound.

Another important special case of (6) is setting θ = ∞. This leads to the following update rule for every K-th online step:

f(w_i) = T(w_i - \eta \nabla_1 L(w_i, z_i), g_i \eta),  (7)

where for a vector v = [v_1, . . . , v_d] ∈ R^d and a scalar α ≥ 0, T(v, α) = [T(v_1, α), . . . , T(v_d, α)], with

T(v_j, \alpha) = \begin{cases} \max(0, v_j - \alpha) & \text{if } v_j > 0 \\ \min(0, v_j + \alpha) & \text{otherwise.} \end{cases}

The method is a modification of the standard sub-gradient online method for L1 regularization in (5). The parameter g_i ≥ 0 controls the sparsity that can be achieved with the algorithm.
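The three shrinkage operators are simple coordinate-wise maps. The following NumPy sketch, our own illustration with hypothetical example inputs, implements T_0 from (4), T_1 from (6), and the θ = ∞ operator T from (7):

```python
import numpy as np

def T0(v, theta):
    """Rounding operator of (4): zero every coordinate with |v_j| <= theta."""
    return np.where(np.abs(v) <= theta, 0.0, v)

def T1(v, alpha, theta):
    """Truncation operator of (6): shrink coordinates within [-theta, theta]
    toward zero by alpha; leave larger coordinates untouched."""
    shrunk = np.sign(v) * np.maximum(np.abs(v) - alpha, 0.0)
    return np.where(np.abs(v) <= theta, shrunk, v)

def T(v, alpha):
    """theta = infinity special case (7): shrink every coordinate by alpha."""
    return np.sign(v) * np.maximum(np.abs(v) - alpha, 0.0)

v = np.array([0.05, -0.3, 1.5, -0.001])
print(T0(v, 0.1))        # the two small entries are zeroed outright
print(T1(v, 0.04, 0.1))  # small entries shrink by 0.04; -0.3 and 1.5 untouched
print(T(v, 0.04))        # every entry shrinks by 0.04 toward zero
```

Note how T_1 interpolates between the two extremes: with a large α it behaves like the rounding operator T_0, while with θ = ∞ it reduces to T.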
Note that when g_i = 0, the update rule is identical to the standard stochastic gradient descent rule. In general, we may perform a truncation every K steps. That is, if i/K is not an integer, we let g_i = 0; if i/K is an integer, we let g_i = Kg for a gravity parameter g > 0. The reason for doing so (instead of using a constant g) is that we can perform a more aggressive truncation with gravity parameter Kg after each K steps. This can potentially lead to better sparsity.

The procedure in (7) can be regarded as an online counterpart of L1 regularization in the sense that it approximately solves an L1 regularization problem in the limit of η → 0. Truncated gradient descent for L1 regularization is different from the naive application of the stochastic gradient descent rule (3) with an added L1 regularization term. As pointed out in the introduction, the latter fails because it rarely leads to sparsity. Our theory shows that even with sparsification, the prediction performance is still comparable to that of the standard online learning algorithm. In the following, we develop a general regret bound for this method, which also shows how the regret may depend on the sparsification parameter g.

3.4 Regret Analysis

Throughout the paper, we use ‖·‖_1 for the 1-norm and ‖·‖ for the 2-norm. For reference, we make the following assumption regarding the loss function:

Assumption 3.1 We assume that L(w, z) is convex in w, and there exist non-negative constants A and B such that (∇_1 L(w, z))^2 ≤ A L(w, z) + B for all w ∈ R^d and z ∈ R^{d+1}.

For linear prediction problems, we have a general loss function of the form L(w, z) = φ(w^T x, y). The following are some common loss functions φ(·, ·) with corresponding choices of parameters A and B (which are not unique), under the assumption that sup_x ‖x‖ ≤ C.
• Logistic: φ(p, y) = ln(1 + exp(-py)); A = 0 and B = C^2. This loss is for binary classification problems with y ∈ {±1}.

• SVM (hinge loss): φ(p, y) = max(0, 1 - py); A = 0 and B = C^2. This loss is for binary classification problems with y ∈ {±1}.

• Least squares (square loss): φ(p, y) = (p - y)^2; A = 4C^2 and B = 0. This loss is for regression problems.

Our main result is Theorem 3.1, which is parameterized by A and B. The proof is left to the appendix. Specializing it to particular losses yields several corollaries. The one applicable to the least squares loss will be given later in Corollary 4.1.

Theorem 3.1 (Sparse Online Regret) Consider the sparse online update rule (7) with w_1 = 0 and η > 0. If Assumption 3.1 holds, then for all w̄ ∈ R^d we have

\frac{1 - 0.5 A \eta}{T} \sum_{i=1}^T \left[ L(w_i, z_i) + \frac{g_i}{1 - 0.5 A \eta} \| w_{i+1} \cdot I(|w_{i+1}| \le \theta) \|_1 \right]
\le \frac{\eta}{2} B + \frac{\|\bar{w}\|^2}{2 \eta T} + \frac{1}{T} \sum_{i=1}^T \left[ L(\bar{w}, z_i) + g_i \| \bar{w} \cdot I(|w_{i+1}| \le \theta) \|_1 \right],

where for vectors v = [v_1, . . . , v_d] and v' = [v'_1, . . . , v'_d], we let

\| v \cdot I(|v'| \le \theta) \|_1 = \sum_{j=1}^d |v_j| \, I(|v'_j| \le \theta),

where I(·) is the set indicator function.

We state the theorem with a constant learning rate η. As mentioned earlier, it is possible to obtain a result with a variable learning rate where η = η_i decays as i increases. Although this may lead to a no-regret bound without knowing T in advance, it introduces extra complexity into the presentation of the main idea. Since our focus is on sparsity rather than on optimizing the learning rate, we do not include such a result, for clarity. If T is known in advance, then in the above bound one can simply take η = O(1/√T), and the regret is of order O(1/√T).

In the above theorem, the right-hand side involves a term g_i ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 that depends on w_{i+1}, which is not easily estimated.
To remove this dependency, a trivial upper bound of θ = ∞ can be used, leading to the L1 penalty g_i ‖w̄‖_1. In the general case of θ < ∞, we cannot remove the w_{i+1} dependency, because the effective regularization condition (as shown on the left-hand side) is the nonconvex penalty g_i ‖w · I(|w| ≤ θ)‖_1. Solving such a nonconvex formulation is hard both in the online and the batch settings. In general, we only know how to efficiently discover a local minimum, which is difficult to characterize. Without a good characterization of the local minimum, it is not possible for us to replace g_i ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 on the right-hand side by g_i ‖w̄ · I(|w̄| ≤ θ)‖_1, because such a formulation would have implied that we could efficiently solve a nonconvex problem with a simple online update rule. Still, when θ < ∞, one naturally expects that the right-hand-side penalty g_i ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 is much smaller than the corresponding L1 penalty g_i ‖w̄‖_1, especially when w̄ has many components that are close to 0. Therefore the situation with θ < ∞ can potentially yield better performance on some data. This is confirmed in our experiments.

Theorem 3.1 also implies a trade-off between sparsity and regret performance. We may simply consider the case where g_i = g is a constant. When g is small, we have less sparsity, but the regret term g ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 ≤ g ‖w̄‖_1 on the right-hand side is also small. When g is large, we are able to achieve more sparsity, but the regret g ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 on the right-hand side also becomes large. Such a trade-off (sparsity versus prediction accuracy) is empirically studied in Section 6. Our observation suggests that we can gain significant sparsity with only a small decrease of accuracy (that is, using a small g).
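Returning to Assumption 3.1, the (A, B) pairs listed for the three losses can be spot-checked numerically on random samples under the stated assumption ‖x‖ ≤ C. This is our own sanity-check sketch, not a proof:

```python
import numpy as np

rng = np.random.default_rng(1)
C = 1.0  # assumed bound on ||x||

def check(phi, dphi, A, B, draw_y, trials=1000):
    """Verify ||grad_w L||^2 <= A*L + B for L(w, z) = phi(w.x, y), where
    dphi is the (sub-)derivative of phi in its first argument."""
    for _ in range(trials):
        w = rng.normal(size=5)
        x = rng.normal(size=5)
        nrm = np.linalg.norm(x)
        if nrm > C:
            x = x * (C / nrm)          # enforce the assumption ||x|| <= C
        y = draw_y(rng)
        p = float(np.dot(w, x))
        grad_sq = (dphi(p, y) ** 2) * float(np.dot(x, x))
        assert grad_sq <= A * phi(p, y) + B + 1e-9
    return True

pm1 = lambda r: r.choice([-1.0, 1.0])
# Logistic: A = 0, B = C^2
check(lambda p, y: np.log1p(np.exp(-p * y)),
      lambda p, y: -y / (1.0 + np.exp(p * y)), 0.0, C**2, pm1)
# Hinge: A = 0, B = C^2 (sub-gradient is -y on the active branch, else 0)
check(lambda p, y: max(0.0, 1.0 - p * y),
      lambda p, y: -y if p * y < 1 else 0.0, 0.0, C**2, pm1)
# Square: A = 4C^2, B = 0
check(lambda p, y: (p - y) ** 2,
      lambda p, y: 2.0 * (p - y), 4.0 * C**2, 0.0, lambda r: r.normal())
print("Assumption 3.1 holds on all sampled points")
```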
Now consider the case θ = ∞ and g_i = g. When T → ∞, if we let η → 0 and ηT → ∞, then Theorem 3.1 implies that

\frac{1}{T} \sum_{i=1}^T \left[ L(w_i, z_i) + g \|w_i\|_1 \right] \le \inf_{\bar{w} \in R^d} \left[ \frac{1}{T} \sum_{i=1}^T L(\bar{w}, z_i) + 2 g \|\bar{w}\|_1 \right] + o(1).

In other words, if we let L'(w, z) = L(w, z) + g ‖w‖_1 be the L1-regularized loss, then the L1-regularized regret is small when η → 0 and T → ∞. This implies that our procedure can be regarded as the online counterpart of L1-regularization methods. In the stochastic setting where the examples are drawn i.i.d. from some underlying distribution, the sparse online gradient method proposed in this paper solves the L1 regularization problem.

3.5 Stochastic Setting

Stochastic-gradient-based online learning methods can be used to solve large-scale batch optimization problems, often quite successfully [11, 14]. In this setting, we can go through the training examples one by one in an online fashion, and repeat multiple times over the training data. In this section, we analyze the performance of such a procedure using Theorem 3.1.

To simplify the analysis, instead of assuming that we go through the data one by one, we assume that each additional data point is drawn from the training data randomly with equal probability. This corresponds to the standard stochastic optimization setting, in which observed samples are i.i.d. from some underlying distribution. The following result is a simple consequence of Theorem 3.1. For simplicity, we only consider the case with θ = ∞ and constant gravity g_i = g.

Theorem 3.2 Consider a set of training data z_i = (x_i, y_i) for i = 1, . . . , n, and let

R(w, g) = \frac{1}{n} \sum_{i=1}^n L(w, z_i) + g \|w\|_1

be the L1-regularized loss over the training data. Let ŵ_1 = w_1 = 0, and define recursively for t = 1, 2, . . .
w_{t+1} = T(w_t - \eta \nabla_1 L(w_t, z_{i_t}), g \eta), \qquad \hat{w}_{t+1} = \hat{w}_t + \frac{w_{t+1} - \hat{w}_t}{t + 1},

where each i_t is drawn from {1, . . . , n} uniformly at random. If Assumption 3.1 holds, then at any time T, the following inequalities are valid for all w̄ ∈ R^d:

E_{i_1,...,i_T} \left[ (1 - 0.5 A \eta) \, R\left( \hat{w}_T, \frac{g}{1 - 0.5 A \eta} \right) \right]
\le E_{i_1,...,i_T} \left[ \frac{1 - 0.5 A \eta}{T} \sum_{i=1}^T R\left( w_i, \frac{g}{1 - 0.5 A \eta} \right) \right]
\le \frac{\eta}{2} B + \frac{\|\bar{w}\|^2}{2 \eta T} + R(\bar{w}, g).

Proof. Note that the recursion for ŵ_t implies

\hat{w}_T = \frac{1}{T} \sum_{t=1}^T w_t

by telescoping the update rule. Because R(w, g) is convex in w, the first inequality follows directly from Jensen's inequality. In the following, we only need to prove the second inequality. Theorem 3.1 implies:

\frac{1 - 0.5 A \eta}{T} \sum_{t=1}^T \left[ L(w_t, z_{i_t}) + \frac{g}{1 - 0.5 A \eta} \|w_t\|_1 \right] \le g \|\bar{w}\|_1 + \frac{\eta}{2} B + \frac{\|\bar{w}\|^2}{2 \eta T} + \frac{1}{T} \sum_{t=1}^T L(\bar{w}, z_{i_t}).  (8)

Observe that

E_{i_t} \left[ L(w_t, z_{i_t}) + \frac{g}{1 - 0.5 A \eta} \|w_t\|_1 \right] = R\left( w_t, \frac{g}{1 - 0.5 A \eta} \right)

and

g \|\bar{w}\|_1 + E_{i_1,...,i_T} \left[ \frac{1}{T} \sum_{t=1}^T L(\bar{w}, z_{i_t}) \right] = R(\bar{w}, g).

The second inequality is obtained by taking the expectation with respect to i_1, . . . , i_T in (8). □

If we let η → 0 and ηT → ∞, the bound in Theorem 3.2 becomes

E[R(\hat{w}_T, g)] \le E\left[ \frac{1}{T} \sum_{t=1}^T R(w_t, g) \right] \le \inf_{\bar{w}} R(\bar{w}, g) + o(1).

That is, on average ŵ_T approximately solves the L1 regularization problem

\inf_w \left[ \frac{1}{n} \sum_{i=1}^n L(w, z_i) + g \|w\|_1 \right].

If we choose a random stopping time T, then the above inequalities say that, on average, R(w_T) also solves this L1 regularization problem approximately. Therefore in our experiments, we use the last solution w_T instead of the aggregated solution ŵ_T.
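The recursion defining ŵ_t in Theorem 3.2 is just an incremental running average. A quick numeric check (with arbitrary stand-in vectors for the iterates w_t) confirms that it telescopes to ŵ_T = (1/T) Σ_t w_t:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 50
w = [np.zeros(4)]                                # w_1 = 0, as in the theorem
w += [rng.normal(size=4) for _ in range(T - 1)]  # stand-ins for w_2, ..., w_T

w_hat = w[0]                                     # hat-w_1 = w_1
for t in range(1, T):
    # hat-w_{t+1} = hat-w_t + (w_{t+1} - hat-w_t) / (t + 1)
    w_hat = w_hat + (w[t] - w_hat) / (t + 1)

assert np.allclose(w_hat, np.mean(w, axis=0))    # equals the plain average
print("running average matches (1/T) * sum_t w_t")
```

This incremental form is what makes the averaged solution cheap to maintain online, although, as noted above, the experiments use the last iterate w_T rather than ŵ_T.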
Since 1-norm regularization is frequently used to achieve sparsity in the batch learning setting, the connection to 1-norm regularization can be regarded as an alternative justification for the sparse online algorithm developed in this paper.

4 Truncated Gradient Algorithm for Least Squares

The method in Section 3 can be directly applied to least squares regression. This leads to Algorithm 1, which implements sparsification for square loss according to equation (7). In the description, we use the superscripted symbol w^j to denote the j-th component of the vector w (in order to differentiate it from w_i, which we have used to denote the i-th weight vector). For clarity, we also drop the index i from w_i. Although we keep the choice of gravity parameters g_i open in the algorithm description, in practice we only consider the following choice:

g_i = Kg if i/K is an integer, and 0 otherwise.

This may give a more aggressive truncation (and thus more sparsity) after every K-th iteration. Since we do not have a theorem formalizing how much more sparsity one can gain from this idea, its effect will only be examined through experiments.

In many online learning situations (such as web applications), only a small subset of the features have nonzero values for any example x. It is thus desirable to deal with sparsity only in this small subset rather than in all features, while simultaneously inducing sparsity on all feature weights. Moreover, it is important to store only features with nonzero coefficients (if the number of features is so large that it cannot be stored in memory, this approach allows us to use a hash table to track only the nonzero coefficients).

Algorithm 1 Truncated Gradient
Inputs:
• threshold θ ≥ 0
• gravity sequence g_i ≥ 0
• learning rate η ∈ (0, 1)
• example oracle O
Initialize weights w^j ← 0 (j = 1, . . . , d).
For trial i = 1, 2, . . .:
1. Acquire an unlabeled example x = [x^1, x^2, . . . , x^d] from oracle O.
2. For all weights w^j (j = 1, . . . , d):
   (a) if w^j > 0 and w^j ≤ θ then w^j ← max{w^j - g_i η, 0}
   (b) else if w^j < 0 and w^j ≥ -θ then w^j ← min{w^j + g_i η, 0}
3. Compute the prediction: ŷ = Σ_j w^j x^j.
4. Acquire the label y from oracle O.
5. Update the weights for all features j: w^j ← w^j + 2η(y - ŷ) x^j.

We describe how this can be implemented efficiently in the next section. For reference, we present a specialization of Theorem 3.1 in the following corollary, which is directly applicable to Algorithm 1.

Corollary 4.1 (Sparse Online Square Loss Regret) If there exists C > 0 such that ‖x‖ ≤ C for all x, then for all w̄ ∈ R^d we have

\frac{1 - 2 C^2 \eta}{T} \sum_{i=1}^T \left[ (w_i^T x_i - y_i)^2 + \frac{g_i}{1 - 2 C^2 \eta} \| w_i \cdot I(|w_i| \le \theta) \|_1 \right]
\le \frac{\|\bar{w}\|^2}{2 \eta T} + \frac{1}{T} \sum_{i=1}^T \left[ (\bar{w}^T x_i - y_i)^2 + g_{i+1} \| \bar{w} \cdot I(|w_{i+1}| \le \theta) \|_1 \right],

where w_i = [w^1, . . . , w^d] ∈ R^d is the weight vector used for prediction at the i-th step of Algorithm 1, and (x_i, y_i) is the data point observed at the i-th step.

This corollary explicitly states that the average square loss incurred by the learner (the left term) is bounded by the average square loss of the best weight vector w̄, plus a term related to the size of w̄ which decays as 1/T, and an additive offset controlled by the sparsity threshold θ and the gravity parameter g_i.

5 Efficient Implementation

We altered a standard gradient-descent implementation (Vowpal Wabbit [7]) according to Algorithm 1. Vowpal Wabbit optimizes square loss on a linear representation w · x via gradient descent (3), with a couple of caveats:

1. The prediction is normalized by the square root of the number of nonzero entries in a sparse vector: w · x / ‖x‖_0^{0.5}.
This alteration is just a constant rescaling on dense vectors, which is effectively removable by an appropriate rescaling of the learning rate.

2. The prediction is clipped to the interval [0, 1], implying that the loss function is not square loss for unclipped predictions outside of this dynamic range. Instead, the update is a constant value, equivalent to the gradient of a linear loss function.

The learning rate in Vowpal Wabbit is controllable, supporting 1/i decay as well as a constant learning rate (and rates in between). The program operates in an entirely online fashion, so the memory footprint is essentially just the weight vector, even when the amount of data is very large.

As mentioned earlier, we would like the algorithm's computational complexity to depend linearly on the number of nonzero features of an example, rather than on the total number of features. The approach we took was to store a time-stamp τ_j for each feature j. The time-stamp was initialized to the index of the example where feature j was nonzero for the first time. During online learning, we simply went through all nonzero features j of example i, and could simulate the shrinkage of w^j after τ_j in a batch mode. These weights are then updated, and their time-stamps are reset to i. This lazy-update idea of delaying the shrinkage calculation until needed is the key to an efficient implementation of truncated gradient. Specifically, instead of using update rule (6) for weight w^j, we shrink the weights of all nonzero features j differently by the following:

f(w^j) = T_1\left( w^j + 2\eta(y - \hat{y}) x^j, \left\lfloor \frac{i - \tau_j}{K} \right\rfloor K \eta g, \theta \right),

and τ_j is updated by

\tau_j \leftarrow \tau_j + \left\lfloor \frac{i - \tau_j}{K} \right\rfloor K.

We note that this lazy-update trick of maintaining time-stamp information can be applied to the other two algorithms given in Section 3.
In the co ecien t rounding algorithm (4), for instance, for eac h nonzero feature j of example i , w e can rst p erform a regular gradien t descen t on the square loss, and then do the follo wi ng: if | w j | is b el o w the thresh o l d θ and i ≥ τ j + K , w e round w j to 0 and set τ j to i . This implemen tation sho ws that the truncated gradien t metho d satises the follo wing require- men ts needed for solving large scale problems with sparse features. • The algorithm is c ompu ta ti o n ally ecien t: the n um b er of op erations p er online step is linear in the n um b er of nonzero features, and indep end en t o f the total n um b er of features. • The algorithm is memory e ci e n t: it main tains a li st of activ e features, and a feature can b e inserted when observ ed, and deleted when the corresp ond ing w eigh t b ecomes zero. 11 If w e apply the online pro j ec ti o n id ea in [15] to solv e (1), then in the up date rule (7), one has to pic k the smallest g i ≥ 0 suc h that k w i +1 k 1 ≤ s . W e do not kno w an ecien t metho d to nd this sp ecic g i using op erations indep enden t o f the total n u m b er of features. A standard implemen tation relies on sorting all w eigh ts, whic h requires O ( d ln d ) op erations, where d is the total n u m b er of (nonzero) features. This complexit y is un ac cep ta b le for our purp ose. Ho w ev er, w e shall p oin t out that in an imp ortan t recen t w ork [5], th e authors prop osed an ecien t online ` 1 -pro jection metho d. The idea is to use a balanced tree to k eep trac k of w eigh ts, whi c h allo ws ecien t threshold nding and tree up dates in O ( k ln d ) op erations on a v erage (here, k denotes the n um b er of nonzero co ecien ts in the curren t traini ng example). Although the algorithm still has w eak dep endency on d , it is applicable to large s c ale practical applications. 
The theoretical analysis presented in this paper shows that we can obtain a meaningful regret bound by picking an arbitrary g_i. This is useful because the resulting method is much simpler to implement and is computationally more efficient per online step. Moreover, our method allows non-convex updates that are closely related to the simple coefficient-rounding idea. Due to the complexity of implementing the balanced-tree strategy in [5], we do not compare to it in this paper.

6 Empirical Results

We applied Vowpal Wabbit with the efficiently implemented sparsify option, as described in the previous section, to a selection of datasets, including eleven datasets from the UCI repository [1], the much larger dataset rcv1 [8], and a private large-scale dataset Big_Ads related to ad interest prediction. While the UCI datasets are useful for benchmark purposes, rcv1 and Big_Ads are more interesting since they embody real-world datasets with large numbers of features, many of which are less informative for making predictions than others. The datasets are summarized in Table 1.

The UCI datasets we used do not have many features, and it is expected that a large fraction of these features are useful for making predictions. For comparison purposes, as well as to better demonstrate the behavior of our algorithm, we also added 1000 random binary features to those datasets. Each such feature has value 1 with probability 0.05 and 0 otherwise.

6.1 Feature Sparsification of Truncated Gradient Descent

In the first set of experiments, we are interested in how much reduction in the number of features is possible without affecting learning performance significantly; specifically, we require that the accuracy be reduced by no more than 1% for classification tasks, and the total square loss be increased by no more than 1% for regression tasks.
As is common practice, we allowed the algorithm to run on the training data for multiple passes with a decaying learning rate. For each dataset, we performed 10-fold cross validation over the training set to identify the best set of parameters, including the learning rate η, the sparsification rate g, the number of passes over the training set, and the decay of the learning rate across these passes. This set of parameters was then used to train Vowpal Wabbit on the whole training set. Finally, the learned classifier/regressor is evaluated on the test set. We fixed K = 1 and θ = ∞ in these experiments, and will study the effects of K and θ in later subsections.

Table 1: Dataset Summary.

Dataset    #features  #train data  #test data  task
ad         1411       2455         824         classification
crx        47         526          164         classification
housing    14         381          125         regression
krvskp     74         2413         783         classification
magic04    11         14226        4794        classification
mushroom   117        6079         2045        classification
spambase   58         3445         1156        classification
wbc        10         520          179         classification
wdbc       31         421          148         classification
wpbc       33         153          45          classification
zoo        17         77           24          regression
rcv1       38853      781265      23149        classification
Big_Ads    3×10^9     26×10^6     2.7×10^6     classification

Figure 1 shows the fraction of features left after sparsification is applied to each dataset. For UCI datasets with randomly added features, Vowpal Wabbit is able to reduce the number of features by a fraction of more than 90%, except for the ad dataset, in which only a 71% reduction is observed. This less satisfying result might be improved by a more extensive parameter search in cross validation. However, if we can tolerate a 1.3% decrease in accuracy (instead of 1% as for the other datasets) during cross validation, Vowpal Wabbit is able to achieve a 91.4% reduction, indicating that a large reduction is still possible at the tiny additional cost of 0.3% accuracy loss.
With this slightly more aggressive sparsification, the test-set accuracy drops from 95.9% (when only 1% loss in accuracy is allowed in cross validation) to 95.4%, while the accuracy without sparsification is 96.5%.

Even for the original UCI datasets without artificially added features, Vowpal Wabbit manages to filter out some of the less useful features while maintaining the same level of performance. For example, for the ad dataset, a reduction of 83.4% is achieved. Compared to the results above, it seems the most effective feature reductions occur on datasets with a large number of less useful features, exactly where sparsification is needed.

For rcv1, more than 75% of the features are removed after the sparsification process, indicating the effectiveness of our algorithm in real-life problems. We were not able to try many parameters in cross validation because of the size of rcv1. It is expected that more reduction is possible when a more thorough parameter search is performed.

The previous results do not exercise the full power of the approach presented here because they are applied to datasets where standard Lasso regularization [13] is or may be computationally viable. We have also applied this approach to a large non-public dataset, Big_Ads, where the goal is predicting which of two ads was clicked on given context information (the content of the ads and query information). Here, accepting a 0.009 increase in classification error allows us to reduce the number of features from about 3×10^9 to about 24×10^6, a factor-of-125 decrease in the number of features.

For classification tasks, we also study how our sparsification solution affects AUC (Area Under the ROC Curve), which is a standard metric for classification.
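AUC itself can be computed without choosing a decision threshold, via the Mann–Whitney pair-counting formulation; the following small Python helper is our own illustration, not part of the experimental code:

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs that the scores rank
    correctly, counting ties as half -- equivalent to the area under
    the ROC curve.  labels are 1 (positive) or 0 (negative)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This O(|pos|·|neg|) form is fine for illustration; a rank-based computation is preferred on large datasets.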
Using the same sets of parameters from the 10-fold cross validation described above, we find that AUC is not affected significantly by sparsification and, in some cases, is actually slightly improved.¹ The reason may be that our sparsification method removes some of the features that could have confused Vowpal Wabbit. The ratios of the AUC with and without sparsification for all classification tasks are plotted in Figure 2. It is often the case that these ratios are above 98%.

¹ We use AUC here and in later subsections because it is insensitive to threshold, unlike accuracy.

Figure 1: A plot showing the fraction of features left after sparsification for each dataset. The first result is the fraction of features left when the performance is changed by at most 1% due to sparsification. The second result is the fraction left when 1000 random features are added to each example. For rcv1 and Big_Ads there is no second column, since the experiment is not useful.

6.2 The Effects of K

As we argued before, using a K value larger than 1 may be desired in truncated gradient and the rounding algorithms. This advantage is empirically demonstrated here. In particular, we try K = 1, K = 10, and K = 20 in both algorithms. As before, cross validation is used to select parameters in the rounding algorithm, including the learning rate η, the number of passes of data during training, and the learning-rate decay over training passes. Figures 3 and 4 give the AUC vs. number-of-features plots, where each data point is generated by running the respective algorithm using a different value of g (for truncated gradient) and θ (for the rounding algorithm). We used θ = ∞ in truncated gradient.
For truncated gradient, the performance with K = 10 or 20 is at least as good as that with K = 1, and for the spambase dataset further feature reduction is achieved at the same level of performance, reducing the number of features from 76 (when K = 1) to 25 (when K = 10 or 20) with an AUC of about 0.89. Such an effect is even more remarkable in the rounding algorithm. For instance, in the ad dataset the algorithm using K = 1 achieves an AUC of 0.94 with 322 features, while only 13 and 7 features are needed using K = 10 and K = 20, respectively.

6.3 The Effects of θ in Truncated Gradient

In this subsection, we empirically study the effect of θ in truncated gradient. The rounding algorithm is also included for comparison due to its similarity to truncated gradient when θ = g. As before, we used cross validation to choose parameters for each θ value tried, and focused on the AUC metric in the eight UCI classification tasks, excluding the degenerate wpbc. We fixed K = 10 in both algorithms.

Figure 2: A plot showing the ratio of the AUC when sparsification is used over the AUC when no sparsification is used. The same process as in Figure 1 is used to determine empirically good parameters. The first result is for the original dataset, while the second result is for the modified dataset where 1000 random features are added to each example.

Figure 5 gives the AUC vs. number-of-features plots, where each data point is generated by running the respective algorithm using a different value of g (for truncated gradient) and θ (for the rounding algorithm). A few observations are in order. First, the results verify the observation that the behavior of truncated gradient with θ = g is similar to the rounding algorithm.
Second, these results suggest that, in practice, it may be desirable to use θ = ∞ in truncated gradient because it avoids the local-minimum problem.

6.4 Comparison to Other Algorithms

The next set of experiments compares truncated gradient descent to other algorithms regarding their ability to trade off feature sparsification and performance. Again, we focus on the AUC metric in the UCI classification tasks except wpbc. The algorithms for comparison include:

• The truncated gradient algorithm: we fixed K = 10 and θ = ∞, used cross-validated parameters, and altered the gravity parameter g.

• The rounding algorithm described in Section 3.1: we fixed K = 10, used cross-validated parameters, and altered the rounding threshold θ.

• The subgradient algorithm described in Section 3.2: we fixed K = 10, used cross-validated parameters, and altered the regularization parameter g.

• The Lasso [13] for batch L1 regularization: we used a publicly available implementation [12].

Note that we do not attempt to compare these algorithms on rcv1 and Big_Ads simply because their sizes are too large for the Lasso and subgradient descent (cf. Section 5).

Figure 6 gives the results. First, it is observed that truncated gradient is consistently competitive with the other two online algorithms and significantly outperforms them in some problems. This suggests the effectiveness of truncated gradient. Second, it is interesting to observe that the qualitative behavior of truncated gradient is often similar to that of LASSO, especially when very sparse weight vectors are allowed (the left sides of the graphs). This is consistent with Theorem 3.2, which shows the relation between these two algorithms. However, LASSO usually has worse performance when the allowed number of nonzero weights is set too large (the right side of the graphs).
In this case, LASSO seems to overfit. In contrast, truncated gradient is more robust to overfitting. The robustness of online learning is often attributed to early stopping, which has been extensively discussed in the literature (e.g., in [14]).

Finally, it is worth emphasizing that the experiments in this subsection try to shed some light on the relative strengths of these algorithms in terms of feature sparsification. For large datasets such as Big_Ads, only truncated gradient, coefficient rounding, and the sub-gradient algorithms are applicable to large-scale problems with sparse features. As we have shown and argued, the rounding algorithm is quite ad hoc and may not work robustly in some problems, and the sub-gradient algorithm does not lead to sparsity in general during training.

7 Conclusion

This paper covers the first sparsification technique for large-scale online learning with strong theoretical guarantees. The algorithm, truncated gradient, is the natural extension of Lasso-style regression to the online-learning setting. Theorem 3.1 proves that the technique is sound: it never harms performance much compared to standard stochastic gradient descent in adversarial situations. Furthermore, we show that the asymptotic solution of one instance of the algorithm is essentially equivalent to Lasso regression, thus justifying the algorithm's ability to produce sparse weight vectors when the number of features is intractably large. The theory is verified experimentally in a number of problems. In some cases, especially for problems with many irrelevant features, this approach achieves a one- or two-order-of-magnitude reduction in the number of features.

References

[1] Arthur Asuncion and David J. Newman. UCI machine learning repository, 2007.
University of California, Irvine, School of Information and Computer Sciences, http://www.ics.uci.edu/~mlearn/MLRepository.html.

[2] Nicolò Cesa-Bianchi, Philip M. Long, and Manfred Warmuth. Worst-case quadratic loss bounds for prediction using linear functions and gradient descent. IEEE Transactions on Neural Networks, 7(3):604–619, 1996.

[3] Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Y. Ng, and Kunle Olukotun. Map-reduce for machine learning on multicore. In Advances in Neural Information Processing Systems 20 (NIPS-07), 2008.

[4] Ofer Dekel, Shai Shalev-Shwartz, and Yoram Singer. The Forgetron: A kernel-based perceptron on a fixed budget. In Advances in Neural Information Processing Systems 18 (NIPS-05), pages 259–266, 2006.

[5] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In ICML'08, 2008.

[6] Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.

[7] John Langford, Lihong Li, and Alexander L. Strehl. Vowpal Wabbit (fast online learning), 2007. http://hunch.net/~vw/.

[8] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.

[9] Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285–318, 1988.

[10] Nick Littlestone, Philip M. Long, and Manfred K. Warmuth. On-line learning of linear functions. Computational Complexity, 5(2):1–23, 1995.

[11] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM.
In Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML-07), 2007.

[12] Karl Sjöstrand. Matlab implementation of LASSO, LARS, the elastic net and SPCA, June 2005. Version 2.0, http://www2.imm.dtu.dk/pubdb/p.php?3897.

[13] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.

[14] Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML-04), pages 919–926, 2004.

[15] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning (ICML-03), pages 928–936, 2003.

A Proof of Theorem 3.1

The following lemma is the essential step in our analysis.

Lemma A.1 Consider update rule (6) applied to weight vector w on example z = (x, y) with gravity parameter g_i = g, resulting in a weight vector w'. If Assumption 3.1 holds, then for all w̄ ∈ R^d, we have

\[
(1 - 0.5A\eta)\,L(w, z) + g\|w' \cdot I(|w'| \le \theta)\|_1
\le L(\bar{w}, z) + g\|\bar{w} \cdot I(|w'| \le \theta)\|_1 + \frac{\eta}{2}B + \frac{\|\bar{w} - w\|^2 - \|\bar{w} - w'\|^2}{2\eta}.
\]

Proof. Consider any target vector w̄ ∈ R^d and let w̃ = w − η∇₁L(w, z). We have w' = T₁(w̃, gη, θ). Let

\[
u(\bar{w}, w') = g\|\bar{w} \cdot I(|w'| \le \theta)\|_1 - g\|w' \cdot I(|w'| \le \theta)\|_1.
\]
Then the update equation implies the following:

\[
\begin{aligned}
\|\bar{w} - w'\|^2
&\le \|\bar{w} - w'\|^2 + \|w' - \tilde{w}\|^2 \\
&= \|\bar{w} - \tilde{w}\|^2 - 2(\bar{w} - w')^T(w' - \tilde{w}) \\
&\le \|\bar{w} - \tilde{w}\|^2 + 2\eta\,u(\bar{w}, w') \\
&= \|\bar{w} - w\|^2 + \|w - \tilde{w}\|^2 + 2(\bar{w} - w)^T(w - \tilde{w}) + 2\eta\,u(\bar{w}, w') \\
&= \|\bar{w} - w\|^2 + \eta^2\|\nabla_1 L(w, z)\|^2 + 2\eta(\bar{w} - w)^T\nabla_1 L(w, z) + 2\eta\,u(\bar{w}, w') \\
&\le \|\bar{w} - w\|^2 + \eta^2\|\nabla_1 L(w, z)\|^2 + 2\eta\,(L(\bar{w}, z) - L(w, z)) + 2\eta\,u(\bar{w}, w') \\
&\le \|\bar{w} - w\|^2 + \eta^2(A\,L(w, z) + B) + 2\eta\,(L(\bar{w}, z) - L(w, z)) + 2\eta\,u(\bar{w}, w').
\end{aligned}
\]

Here, the first and second equalities follow from algebra, and the third from the definition of w̃. The first inequality follows because a square is always non-negative. The second inequality follows because w' = T₁(w̃, gη, θ), which implies (w' − w̃)ᵀw' = −gη‖w' · I(|w'| ≤ θ)‖₁ and |w'_j − w̃_j| ≤ gη I(|w'_j| ≤ θ); therefore

\[
\begin{aligned}
-(\bar{w} - w')^T(w' - \tilde{w})
&= -\bar{w}^T(w' - \tilde{w}) + w'^T(w' - \tilde{w}) \\
&\le \sum_{j=1}^d |\bar{w}_j|\,|w'_j - \tilde{w}_j| + (w' - \tilde{w})^T w' \\
&\le g\eta \sum_{j=1}^d |\bar{w}_j|\,I(|w'_j| \le \theta) + (w' - \tilde{w})^T w'
= \eta\,u(\bar{w}, w').
\end{aligned}
\]

The third inequality follows from the definition of the sub-gradient of a convex function, which implies (w̄ − w)ᵀ∇₁L(w, z) ≤ L(w̄, z) − L(w, z) for all w and w̄. The fourth inequality follows from Assumption 3.1. Rearranging the above inequality leads to the desired bound. □

Proof (of Theorem 3.1). Applying Lemma A.1 to the update on trial i, we have

\[
(1 - 0.5A\eta)\,L(w_i, z_i) + g_i\|w_{i+1} \cdot I(|w_{i+1}| \le \theta)\|_1
\le L(\bar{w}, z_i) + \frac{\|\bar{w} - w_i\|^2 - \|\bar{w} - w_{i+1}\|^2}{2\eta} + g_i\|\bar{w} \cdot I(|w_{i+1}| \le \theta)\|_1 + \frac{\eta}{2}B.
\]

Now summing over i = 1, 2, ..., T, we obtain

\[
\begin{aligned}
\sum_{i=1}^T &\left[(1 - 0.5A\eta)\,L(w_i, z_i) + g_i\|w_{i+1} \cdot I(|w_{i+1}| \le \theta)\|_1\right] \\
&\le \sum_{i=1}^T \left[\frac{\|\bar{w} - w_i\|^2 - \|\bar{w} - w_{i+1}\|^2}{2\eta} + L(\bar{w}, z_i) + g_i\|\bar{w} \cdot I(|w_{i+1}| \le \theta)\|_1 + \frac{\eta}{2}B\right] \\
&= \frac{\|\bar{w} - w_1\|^2 - \|\bar{w} - w_{T+1}\|^2}{2\eta} + \frac{\eta}{2}TB + \sum_{i=1}^T \left[L(\bar{w}, z_i) + g_i\|\bar{w} \cdot I(|w_{i+1}| \le \theta)\|_1\right] \\
&\le \frac{\|\bar{w}\|^2}{2\eta} + \frac{\eta}{2}TB + \sum_{i=1}^T \left[L(\bar{w}, z_i) + g_i\|\bar{w} \cdot I(|w_{i+1}| \le \theta)\|_1\right].
\end{aligned}
\]

The first equality follows from the telescoping sum, and the second inequality follows from the initial condition (all weights are zero) and dropping negative quantities. The theorem follows by dividing with respect to T and rearranging terms. □

Figure 3: Effect of K on AUC in truncated gradient. (Panels: ad, crx, krvskp, magic04, mushroom, spambase, wbc, wdbc; AUC vs. number of features for K = 1, 10, 20.)
Figure 4: Effect of K on AUC in the rounding algorithm. (Panels: ad, crx, krvskp, magic04, mushroom, spambase, wbc, wdbc; AUC vs. number of features for K = 1, 10, 20.)

Figure 5: Effect of θ on AUC in truncated gradient. (Panels: ad, crx, krvskp, magic04, mushroom, spambase, wbc, wdbc; AUC vs. number of features for the rounding algorithm, truncated gradient with θ = g, and truncated gradient with θ = ∞.)

Figure 6: Comparison of four algorithms. (Panels: ad, crx, krvskp, magic04, mushroom, spambase, wbc, wdbc; AUC vs. number of features for truncated gradient, rounding, sub-gradient, and Lasso.)
