Stability Conditions for Online Learnability


Authors: Stephane Ross, J. Andrew Bagnell

Abstract

Stability is a general notion that quantifies the sensitivity of a learning algorithm's output to small changes in the training dataset (e.g. deletion or replacement of a single training sample). Such conditions have recently been shown to be more powerful for characterizing learnability in the general learning setting under i.i.d. samples, where uniform convergence is not necessary for learnability, but where stability is both sufficient and necessary for learnability. We here show that similar stability conditions are also sufficient for online learnability, i.e. whether there exists a learning algorithm that, under any sequence of examples (potentially chosen adversarially), produces a sequence of hypotheses that has no regret in the limit with respect to the best hypothesis in hindsight. We introduce online stability, a stability condition related to uniform-leave-one-out stability in the batch setting, that is sufficient for online learnability. In particular, we show that popular classes of online learners, namely algorithms that fall in the category of Follow-the-(Regularized)-Leader, Mirror Descent, gradient-based methods, and randomized algorithms like Weighted Majority and Hedge, are guaranteed to have no regret if they have such an online stability property. We provide examples that suggest the existence of an algorithm with such a stability condition might in fact be necessary for online learnability. For the more restricted binary classification setting, we establish that such a stability condition is in fact both sufficient and necessary.
We also show that for a large class of online learnable problems in the general learning setting, namely those with a notion of sub-exponential covering, no-regret online algorithms that have such a stability condition exist.

1 Introduction

We consider the problem of online learning in a setting similar to the General Setting of Learning (Vapnik, 1995). In this setting, an online learning algorithm observes data points $z_1, z_2, \ldots, z_m \in \mathcal{Z}$ in sequence, potentially chosen adversarially, and upon seeing $z_1, z_2, \ldots, z_{i-1}$, the algorithm must pick a hypothesis $h_i \in \mathcal{H}$ that incurs loss on the next data point $z_i$. Given the known loss functional $f : \mathcal{H} \times \mathcal{Z} \to \mathbb{R}$, the regret $R_m$ of the sequence of hypotheses $h_{1:m}$ after observing $m$ data points is defined as:

$$R_m = \sum_{i=1}^{m} f(h_i, z_i) - \min_{h \in \mathcal{H}} \sum_{i=1}^{m} f(h, z_i) \quad (1)$$

The goal is to pick a sequence of hypotheses $h_{1:m}$ that has no regret, i.e. the average regret $\frac{R_m}{m} \to 0$ as the number of data points $m \to \infty$.

The setting we consider is general enough to subsume most, if not all, online learning problems. In fact, the space $\mathcal{Z}$ of possible "data points" could itself be a function space $\mathcal{H} \to \mathbb{R}$, such that $f(h, z) = z(h)$. Hence the typical online learning setting, where the adversary picks a loss function $\mathcal{H} \to \mathbb{R}$ at each time step, is always subsumed by our setting. The data points $z$ should more loosely be interpreted as the parameters that define the loss function at the current time step. For instance, in a supervised classification scenario, the space $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, for $\mathcal{X}$ the input features and $\mathcal{Y}$ the output class, and the classification loss is defined as $f(h, (x, y)) = I(h(x) \neq y)$ for $I$ the indicator function. We do not make any assumption about $f$, other than that the maximum instantaneous regret is bounded: $\sup_{z \in \mathcal{Z}, h, h' \in \mathcal{H}} |f(h, z) - f(h', z)| \leq B$.
This allows for a potentially unbounded loss $f$: e.g., consider $z \in \mathbb{R}$, $h \in [-k, k]$ and $f(h, z) = |h - z|$; then the immediate loss is unbounded but the instantaneous regret is bounded by $B = 2k$.

We are interested in characterizing sufficient conditions under which an online algorithm is guaranteed to pick a sequence of hypotheses that has no regret under any sequence of data points an adversary might pick. In the batch setting, when the data points are drawn i.i.d. from some unknown distribution $\mathcal{D}$, Shalev-Shwartz et al. (2010, 2009) have shown that stability is a key property for learnability. In particular, they show that a problem is learnable if and only if there exists a universally stable asymptotic empirical risk minimizer (AERM).

In this paper, we consider using batch algorithms in our online setting, where the hypothesis $h_i$ is the output of the batch learning algorithm on the first $i-1$ data points. Many online algorithms (such as Follow-the-(Regularized)-Leader, Mirror Descent, Weighted Majority, Hedge, etc.) can be interpreted in this way. For instance, Follow-the-Leader (FTL) algorithms can essentially be thought of as using a batch empirical risk minimizer (ERM) algorithm to select the hypothesis $h_i$ on the dataset $\{z_1, z_2, \ldots, z_{i-1}\}$, while Follow-the-Regularized-Leader (FTRL) algorithms essentially use a batch AERM algorithm (more precisely what we call a Regularized ERM (RERM)) to select the hypothesis $h_i$ on the dataset $\{z_1, z_2, \ldots, z_{i-1}\}$. Our main result shows that Uniform Leave-One-Out stability (Shalev-Shwartz et al., 2009), albeit stronger than the stability condition required in (Shalev-Shwartz et al., 2010, 2009), is in fact sufficient to guarantee no regret for RERM-type algorithms.
For asymmetric algorithms like gradient-based methods (which can also be seen as some form of RERM), a notion related to Uniform Leave-One-Out stability (and equivalent to it for symmetric algorithms), which we call online stability, is also sufficient to guarantee no regret. We also provide general results for the class of always-AERM algorithms (a slightly stronger notion than AERM but weaker than ERM and RERM). Unfortunately these results are weaker in that they require the algorithm to be stable or an always-AERM at a fast enough rate.

The stronger notion of stability we use to guarantee no regret seems to be necessary in the online setting. Intuitively, this is because the algorithm must be able to compete on any sequence of data points, potentially chosen adversarially, rather than on i.i.d. sampled data points. We also provide an example that illustrates this: namely, an AERM with a slightly weaker stability condition can learn the problem in the batch setting but cannot in the online setting, while there is an FTRL algorithm that can learn the problem in the online setting. Furthermore, it is known that batch learnability and online learnability are not equivalent, which naturally suggests that stronger notions of stability should be necessary for online learnability. We review a known problem of threshold learning over an interval that shows batch and online learnability are not equivalent. In the more restricted binary classification setting, we show that the existence of a (potentially randomized) uniform-LOO stable RERM is both sufficient and necessary for online learnability. We also show that for a large class of online learnable problems in the general learning setting, namely those with a notion of sub-exponential covering, uniform-LOO stable (potentially randomized) RERM algorithms exist.
We begin by introducing notation and definitions and reviewing stability notions that have been used in the batch setting. We then provide our main results, which show how some of these stability notions can be used to guarantee no regret in the online setting. We then go over examples that suggest such a strong stability notion might in fact be necessary in the online setting. We further show that in the restricted binary classification setting, such stability notions are in fact necessary. We also introduce a notion of covering that allows us to show that uniform-LOO stable RERM algorithms exist for a large class of online learnable problems in the general learning setting. We conclude with potential future directions and open questions.

2 Learnability and Stability in the Batch Setting

In the batch setting, a batch algorithm is given a set of $m$ i.i.d. samples $z_1, z_2, \ldots, z_m$ drawn from some unknown distribution $\mathcal{D}$, and given knowledge of the loss functional $f$, we seek to find a hypothesis $h \in \mathcal{H}$ that minimizes the population risk:

$$F(h) = \mathbb{E}_{z \sim \mathcal{D}}[f(h, z)] \quad (2)$$

Given a set of $m$ i.i.d. samples $S \sim \mathcal{D}^m$, the empirical risk of a hypothesis $h$ is defined as:

$$F_S(h) = \frac{1}{m} \sum_{i=1}^{m} f(h, z_i) \quad (3)$$

Most batch algorithms used in practice proceed by minimizing the empirical risk, at least asymptotically (when an additional regularizer is used).

Definition 1 An algorithm $A$ is an Empirical Risk Minimizer (ERM) if for any dataset $S$:

$$F_S(A(S)) = \min_{h \in \mathcal{H}} F_S(h) \quad (4)$$

Definition 2 (Shalev-Shwartz et al., 2010) An algorithm $A$ is an Asymptotic Empirical Risk Minimizer (AERM) under distribution $\mathcal{D}$ at rate $\epsilon_{\text{erm}}(m)$ if for all $m$:

$$\mathbb{E}_{S \sim \mathcal{D}^m}\left[ F_S(A(S)) - \min_{h \in \mathcal{H}} F_S(h) \right] \leq \epsilon_{\text{erm}}(m) \quad (5)$$

Whenever we mention a rate $\epsilon(m)$, we mean that $\{\epsilon(m)\}_{m=0}^{\infty}$ is a monotonically non-increasing sequence that is $o(1)$, i.e. $\epsilon(m) \to 0$ as $m \to \infty$.
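A minimal instantiation of Definition 1 for a finite hypothesis class can be sketched as follows (the class, loss, and data are hypothetical; Definition 1 itself places no structure on $\mathcal{H}$):

```python
# Sketch: an ERM (Definition 1) over a finite hypothesis class, returning a
# minimizer of the empirical risk F_S (Equation 3). Hypothetical example.
def erm(S, H, f):
    return min(H, key=lambda h: sum(f(h, z) for z in S) / len(S))

f = lambda h, z: (h - z) ** 2          # squared loss
H = [0.0, 0.25, 0.5, 0.75, 1.0]        # finite hypothesis class
S = [0.4, 0.6, 0.5]                    # observed samples
h_star = erm(S, H, f)
print(h_star)  # -> 0.5, the empirical risk minimizer on this grid
```

An AERM (Definition 2) would only be required to approach this minimum in expectation as $m$ grows, rather than attain it on every dataset.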
If $A$ is an AERM under any distribution $\mathcal{D}$, then we say $A$ is a universal AERM. A useful notion for our online setting will be that of an always AERM, which is satisfied by common online learners such as FTRL:

Definition 3 (Shalev-Shwartz et al., 2010) An algorithm $A$ is an Always Asymptotic Empirical Risk Minimizer (always AERM) at rate $\epsilon_{\text{erm}}(m)$ if for all $m$ and every dataset $S$ of $m$ data points:

$$F_S(A(S)) - \min_{h \in \mathcal{H}} F_S(h) \leq \epsilon_{\text{erm}}(m) \quad (6)$$

Learnability in the batch setting concerns the existence of algorithms that are universally consistent:

Definition 4 (Shalev-Shwartz et al., 2010) An algorithm $A$ is said to be universally consistent at rate $\epsilon_{\text{cons}}(m)$ if for all $m$ and every distribution $\mathcal{D}$:

$$\mathbb{E}_{S \sim \mathcal{D}^m}\left[ F(A(S)) - \min_{h \in \mathcal{H}} F(h) \right] \leq \epsilon_{\text{cons}}(m) \quad (7)$$

If such an algorithm $A$ exists, we say the problem is learnable. A well-known result in the supervised classification and regression setting (i.e. the loss $f(h, (x, y))$ is $I(h(x) \neq y)$ or $(h(x) - y)^2$) is that learnability is equivalent to uniform convergence of the empirical risk to the population risk over the class $\mathcal{H}$ (Blumer et al., 1989, Alon et al., 1997). This implies the problem is learnable using an ERM. Shalev-Shwartz et al. (2010, 2009) recently showed that the situation is much more complex in the General Learning Setting considered here. For instance, there are convex optimization problems where uniform convergence does not hold that are learnable via an AERM, but not learnable via any ERM (Shalev-Shwartz et al., 2010, 2009). In the General Learning Setting, stability turns out to be a more suitable notion for characterizing learnability than uniform convergence. Most stability notions studied in the literature fall into two categories: leave-one-out (LOO) stability and replace-one (RO) stability.
The former measures sensitivity of the algorithm to deletion of a single data point from the dataset, while the latter measures sensitivity of the algorithm to replacing one data point in the dataset with another. In general these two notions are incomparable and lead to significantly different results, as we shall see below. We now review the most commonly used stability notions and some of the important results from the literature.

2.1 Leave-One-Out Stability

Most notions of LOO stability are measured in terms of the change in the loss on a left-out sample when comparing the output hypothesis trained with and without that sample in the dataset. The four commonly used notions of LOO stability (from strongest to weakest) are defined below. We use $z_i$ to denote the $i$th data point in the dataset $S$ and $S^{\setminus i}$ to denote the dataset $S$ with $z_i$ removed.

Definition 5 (Shalev-Shwartz et al., 2009) An algorithm $A$ is uniform-LOO Stable at rate $\epsilon_{\text{loo-stable}}(m)$ if for all $m$, every dataset $S$ of size $m$, and every index $i \in \{1, 2, \ldots, m\}$:

$$|f(A(S^{\setminus i}), z_i) - f(A(S), z_i)| \leq \epsilon_{\text{loo-stable}}(m) \quad (8)$$

Definition 6 (Shalev-Shwartz et al., 2009) An algorithm $A$ is all-$i$-LOO Stable under distribution $\mathcal{D}$ at rate $\epsilon_{\text{loo-stable}}(m)$ if for all $m$ and every index $i \in \{1, 2, \ldots, m\}$:

$$\mathbb{E}_{S \sim \mathcal{D}^m}\left[ |f(A(S^{\setminus i}), z_i) - f(A(S), z_i)| \right] \leq \epsilon_{\text{loo-stable}}(m) \quad (9)$$

Definition 7 (Shalev-Shwartz et al., 2009) An algorithm $A$ is LOO Stable under distribution $\mathcal{D}$ at rate $\epsilon_{\text{loo-stable}}(m)$ if for all $m$:

$$\frac{1}{m} \sum_{i=1}^{m} \mathbb{E}_{S \sim \mathcal{D}^m}\left[ |f(A(S^{\setminus i}), z_i) - f(A(S), z_i)| \right] \leq \epsilon_{\text{loo-stable}}(m) \quad (10)$$

Definition 8 (Shalev-Shwartz et al., 2009) An algorithm $A$ is on-average-LOO Stable under distribution $\mathcal{D}$ at rate $\epsilon_{\text{loo-stable}}(m)$ if for all $m$:

$$\left| \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}_{S \sim \mathcal{D}^m}\left[ f(A(S^{\setminus i}), z_i) - f(A(S), z_i) \right] \right| \leq \epsilon_{\text{loo-stable}}(m) \quad (11)$$

Whenever one of these properties holds for all distributions $\mathcal{D}$, we shall say it holds universally (e.g. universal on-average-LOO stable). Each of these properties implies all the ones below it at the same rate (e.g. a uniform-LOO stable algorithm at rate $\epsilon_{\text{loo-stable}}(m)$ is also all-$i$-LOO stable, LOO stable, and on-average-LOO stable at rate $\epsilon_{\text{loo-stable}}(m)$) (Shalev-Shwartz et al., 2009). However, the implications do not hold in the opposite direction, and there are counterexamples for each implication in the opposite direction (Shalev-Shwartz et al., 2009). The only exception is that for symmetric algorithms $A$ (meaning the order of the data in the dataset does not matter), all-$i$-LOO stable and LOO stable are equivalent (Shalev-Shwartz et al., 2009). Some of these stability notions have also been studied by different authors under different names (Bousquet and Elisseeff, 2002, Kutin and Niyogi, 2002, Rakhlin et al., 2005, Mukherjee et al., 2006), sometimes with slight variations in the definitions. Another, even stronger notion of LOO stability, simply called uniform stability, was studied by Bousquet and Elisseeff (2002). It is similar to uniform-LOO stability except that the absolute difference in loss needs to be smaller than $\epsilon_{\text{loo-stable}}(m)$ at all $z \in \mathcal{Z}$ for any held-out $z_i$, instead of just at the held-out data point $z_i$.
However, it turns out we do not need a notion stronger than uniform-LOO stability to guarantee online learnability. Shalev-Shwartz et al. (2009) have shown the following two results for AERM and ERM in the General Learning Setting:

Theorem 9 (Shalev-Shwartz et al., 2009) A problem is learnable if and only if there exists a universal on-average-LOO stable AERM.

Theorem 10 (Shalev-Shwartz et al., 2009) A problem is learnable with an ERM if and only if there exists a universal LOO stable ERM.

A nice consequence of this result is that for batch learning in the General Learning Setting, it is sufficient to restrict our attention to AERMs that have such stability properties. We will see that the notion of LOO stability, especially uniform-LOO stability, is very natural for analyzing online algorithms, as the algorithm must output a sequence of hypotheses while the dataset grows one data point at a time. In the context of batch learning, RO stability is a more natural notion and leads to stronger results.

2.2 Replace-One Stability

Most notions of RO stability are measured in terms of the change in the loss at another sample point when comparing the output hypothesis trained with an initial dataset and that dataset with one data point replaced by another. We briefly mention two of the strongest RO stability notions, which turn out to be both sufficient and necessary for batch learnability. Another, weaker notion of RO stability has been studied in Shalev-Shwartz et al. (2010). For the definitions below, we denote by $S^{(i)}$ the dataset $S$ with the $i$th data point replaced by another data point $z'_i$.
Definition 11 (Shalev-Shwartz et al., 2010) An algorithm $A$ is strongly-uniform-RO Stable at rate $\epsilon_{\text{ro-stable}}(m)$ if for all $m$, every dataset $S$ of size $m$, and all data points $z'_i$ and $z'$:

$$|f(A(S^{(i)}), z') - f(A(S), z')| \leq \epsilon_{\text{ro-stable}}(m) \quad (12)$$

Definition 12 (Shalev-Shwartz et al., 2010) An algorithm $A$ is uniform-RO Stable at rate $\epsilon_{\text{ro-stable}}(m)$ if for all $m$, every dataset $S$ of size $m$, and all data points $\{z'_1, z'_2, \ldots, z'_m\}$ and $z'$:

$$\frac{1}{m} \sum_{i=1}^{m} |f(A(S^{(i)}), z') - f(A(S), z')| \leq \epsilon_{\text{ro-stable}}(m) \quad (13)$$

The definition of strongly-uniform-RO Stable is similar to the definition of uniform stability of Bousquet and Elisseeff (2002), except that we replace a data point instead of deleting one. RO stability allows us to show the following much stronger result than with LOO stability:

Theorem 13 (Shalev-Shwartz et al., 2010) A problem is learnable if and only if there exists a uniform-RO stable AERM.

In addition, if we allow for randomized algorithms, where the algorithm outputs a distribution $d$ over $\mathcal{H}$ such that the loss $f(d, z) = \mathbb{E}_{h \sim d}[f(h, z)]$, then an even stronger result can be shown:

Theorem 14 (Shalev-Shwartz et al., 2010) A problem is learnable if and only if there exists a strongly-uniform-RO stable always AERM (potentially randomized).

Note that if the problem is learnable and the loss $f$ is convex in $h$ for all $z$ and $\mathcal{H}$ is a convex set, then there must exist a deterministic algorithm that is a strongly-uniform-RO stable always AERM (namely the algorithm that returns $\mathbb{E}_{h \sim d}[h]$ for the distribution $d$ picked by the randomized algorithm).

3 Sufficient Stability Conditions in the Online Setting

We now move our attention to the problem of online learning, where the data points $z_1, z_2, \ldots, z_m$ are revealed to the algorithm in sequence and potentially chosen adversarially given knowledge of the algorithm $A$.
We consider using a batch algorithm in this online setting in the following way: let $S_i = \{z_1, z_2, \ldots, z_i\}$ denote the dataset of the first $i$ data points; at each time $i$, after observing $S_{i-1}$, the batch algorithm $A$ is used to pick the hypothesis $h_i = A(S_{i-1})$. As mentioned previously, online algorithms like Follow-the-(Regularized)-Leader can be thought of in this way. This can also be thought of as a batch-to-online reduction, similar to the approach of Kakade and Kalai (2006), where we reduce online learning to solving a sequence of batch learning problems. Unlike Kakade and Kalai (2006), we consider the general learning setting instead of the supervised classification setting and do not make the transductive assumption that we have access to future "unlabeled" data points. Hence our results can be interpreted as a set of general conditions under which batch algorithms can be used to obtain a no-regret algorithm for online learning. We now begin by introducing some definitions particular to the online setting:

Definition 15 An algorithm $A$ has no regret at rate $\epsilon_{\text{regret}}(m)$ if for all $m$ and any sequence $z_1, z_2, \ldots, z_m$, potentially chosen adversarially given knowledge of $A$, it holds that:

$$\frac{1}{m} \sum_{i=1}^{m} f(A(S_{i-1}), z_i) - \min_{h \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^{m} f(h, z_i) \leq \epsilon_{\text{regret}}(m) \quad (14)$$

If such an algorithm $A$ exists, we say the problem is online learnable.

It is well known that the FTL algorithm has no regret at rate $O(\frac{\log m}{m})$ for loss $f$ that is Lipschitz continuous and strongly convex in $h$ at all $z$ (Hazan et al., 2006, Kakade and Shalev-Shwartz, 2008). Additionally, if $f$ is Lipschitz continuous and convex in $h$ at all $z$, then the FTRL algorithm has no regret at rate $O(\frac{1}{\sqrt{m}})$ (Kakade and Shalev-Shwartz, 2008).
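The batch-to-online reduction above can be sketched concretely. The instance below is hypothetical (squared loss on $[0,1]$, for which the batch ERM is the running mean) and serves only as a sanity check of the strongly convex FTL rate; the arbitrary choice $h_1 = 0$ on the empty dataset is our assumption, not the paper's:

```python
# Sketch: Follow-the-Leader as a batch-to-online reduction. With the
# squared loss (strongly convex), h_i = A(S_{i-1}) is the mean of the
# first i-1 points, and the average regret shrinks roughly as O(log m / m).
import random

f = lambda h, z: (h - z) ** 2
random.seed(0)
data = [random.random() for _ in range(2000)]

regret, h = 0.0, 0.0                 # h_1 chosen arbitrarily (empty dataset)
for i, z in enumerate(data, start=1):
    regret += f(h, z)                # incur loss on z_i with h_i
    h = (h * (i - 1) + z) / i        # ERM on z_1..z_i is the running mean
mean = sum(data) / len(data)         # best fixed hypothesis in hindsight
regret -= sum(f(mean, z) for z in data)
print(regret / len(data))            # small average regret
```

Replacing the running mean with any batch algorithm $A$ gives the general reduction; the results of this section characterize when it has no regret.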
An important subclass of always AERM algorithms is what we define as a Regularized ERM (RERM):

Definition 16 An algorithm $A$ is a Regularized ERM if for all $m$ and any dataset $S$ of $m$ data points:

$$r_0(A(S)) + \sum_{i=1}^{m} \left[ f(A(S), z_i) + r_i(A(S)) \right] = \min_{h \in \mathcal{H}} \left\{ r_0(h) + \sum_{i=1}^{m} \left[ f(h, z_i) + r_i(h) \right] \right\} \quad (15)$$

where $\{r_i\}_{i=0}^{m}$ is a sequence of regularizer functionals ($r_i : \mathcal{H} \to \mathbb{R}$), which measure the complexity of a hypothesis $h$, and that satisfy $\sup_{h, h' \in \mathcal{H}} |r_i(h) - r_i(h')| \leq \rho_i$ for all $i$, where $\{\rho_i\}_{i=0}^{\infty}$ is a sequence that is $o(1)$.

It is easy to see that any RERM algorithm is an always AERM at rate $\frac{1}{m} \sum_{i=0}^{m} \rho_i$. Additionally, an ERM is a special case of a RERM where $r_i = 0$ for all $i$. This subclass is important for online learning, as FTRL can be thought of as using an underlying RERM to pick the sequence of hypotheses. Typically FTRL chooses $r_i = \lambda_i r$ for some regularizer $r$ and regularization constants $\lambda_i$ such that $\{\lambda_i\}_{i=0}^{\infty}$ is $o(1)$. Many Mirror Descent type algorithms such as gradient descent can also be interpreted as some form of RERM (see Section 4 and (McMahan, 2011)), but where $r_i$ may depend on previously seen data points. Additionally, Weighted Majority/Hedge type algorithms can be interpreted as randomized RERMs (see Section 5). Our strongest result for online learnability will be particular to the class of RERM.
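A simplified FTRL instance can illustrate Definition 16. Everything below is a hypothetical construction (not from the paper): linear losses $f(h, z) = hz$ on $\mathcal{H} = [-1, 1]$, a single strongly convex regularizer $r_0(h) = h^2 / (2\eta)$ and $r_i = 0$ for $i \geq 1$, for which the RERM has a closed form:

```python
# Sketch: FTRL as an underlying RERM (Definition 16), hypothetical instance.
# With f(h, z) = h * z on H = [-1, 1] and r_0(h) = h^2 / (2 * eta), the RERM
# minimizer of sum_j f(h, z_j) + r_0(h) is h = clip(-eta * sum(z_j)).
import math

def ftrl_hypothesis(S, eta):
    h = -eta * sum(S)                      # unconstrained RERM minimizer
    return max(-1.0, min(1.0, h))          # project back onto H = [-1, 1]

f = lambda h, z: h * z
data = [1, -1, 1, 1, -1, 1, 1, 1]          # adversarial "gradients" in {-1, 1}
m = len(data)
eta = 1 / math.sqrt(m)
regret = sum(f(ftrl_hypothesis(data[:i], eta), z)
             for i, z in enumerate(data))
regret -= min(sum(f(h, z) for z in data) for h in (-1.0, 1.0))
print(regret / m)  # average regret; O(1/sqrt(m)) for this choice of eta
```

For a linear loss the best fixed hypothesis lies at an endpoint of $[-1, 1]$, which is why the hindsight minimum is taken over $\{-1, 1\}$.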
A notion of stability related to uniform-LOO stability (but slightly weaker) that will be sufficient for our online setting is what we define as online stability:

Definition 17 An algorithm $A$ is Online Stable at rate $\epsilon_{\text{on-stable}}(m)$ if for all $m$ and every dataset $S$ of size $m$:

$$|f(A(S^{\setminus m}), z_m) - f(A(S), z_m)| \leq \epsilon_{\text{on-stable}}(m) \quad (16)$$

The difference between online stability and uniform-LOO stability is that online stability only requires a small change in loss on the last data point when it is held out, rather than on any data point in the dataset $S$. For symmetric algorithms (e.g. FTL/FTRL algorithms), online stability is equivalent to uniform-LOO stability; however, it is weaker than uniform-LOO stability for asymmetric algorithms, like the gradient-based methods analyzed in Section 4. It is also obvious that a uniform-LOO stable algorithm must be online stable at rate $\epsilon_{\text{on-stable}}(m) \leq \epsilon_{\text{loo-stable}}(m)$. We now present our main results for the classes of RERM and always AERM:

Theorem 18 If there exists an online stable RERM, then the problem is online learnable. In particular, it has no regret at rate:

$$\epsilon_{\text{regret}}(m) \leq \frac{1}{m} \sum_{i=1}^{m} \epsilon_{\text{on-stable}}(i) + \frac{2}{m} \sum_{i=0}^{m-1} \rho_i + \frac{\rho_m}{m} \quad (17)$$

This theorem implies that both FTL and FTRL algorithms are guaranteed to achieve no regret on any problem where they are online stable (or uniform-LOO stable, as these algorithms are symmetric). In fact, it is easy to show that in the case where $f$ is strongly convex in $h$, FTL is uniform-LOO stable at rate $O(\frac{1}{m})$ (see Lemma 26). Additionally, when $f$ is convex in $h$, it is easy to show that FTRL is uniform-LOO stable at rate $O(\frac{1}{\sqrt{m}})$ when choosing a strongly convex regularizer $r$ such that $r_m = \lambda_m r$ and $\lambda_m$ is $\Theta(1/\sqrt{m})$ (see Lemmas 27 and 28), while FTL is not uniform-LOO stable. It is well known that FTL is not a no-regret algorithm for general convex problems.
Hence, using only uniform-LOO stability, we can prove currently known results about FTL and FTRL. An interesting application of this result is in the context of apprenticeship/imitation learning, where it has been shown that such non-i.i.d. supervised learning problems can be reduced to online learning over mini-batches of data (Ross et al., 2011). In this reduction, a classification algorithm is used to pick the next "leader" (best classifier in hindsight) at each iteration of training, which is in turn used to collect more data (to add to the training dataset for the next iteration) from the expert we want to mimic. This result implies that online stability (or uniform-LOO stability) of the base classification algorithm in this reduction is sufficient to guarantee no regret, and hence that the reduction provides a good bound on performance.

Unfortunately, our current result for the class of always AERM is weaker:

Theorem 19 If there exists an always AERM such that either (1) or (2) holds:

1. It is always AERM at rate $o(\frac{1}{m})$ and online stable.
2. It is symmetric, uniform-LOO stable at rate $o(\frac{1}{m})$, and uniform-RO stable at rate $o(\frac{1}{m})$.

then the problem is online learnable. In particular, for each case it has no regret at rate:

1. $\epsilon_{\text{regret}}(m) \leq \frac{1}{m} \sum_{i=1}^{m} \epsilon_{\text{on-stable}}(i) + \frac{1}{m} \sum_{i=1}^{m} i\,\epsilon_{\text{erm}}(i)$
2. $\epsilon_{\text{regret}}(m) \leq \frac{1}{m} \sum_{i=1}^{m} \epsilon_{\text{loo-stable}}(i) + \epsilon_{\text{erm}}(m) + \frac{1}{m} \sum_{i=1}^{m-1} i\left[ \epsilon_{\text{loo-stable}}(i) + \epsilon_{\text{ro-stable}}(i) \right]$

We believe the required rates of $o(\frac{1}{m})$ might simply be an artifact of our particular proof technique, and that in general it might be true that any always AERM achieves no regret as long as it is online stable. We were not able to find a counterexample where this is not the case.

3.1 Detailed Analysis

We will use the notation $R_m(A)$ to denote the regret (as in Equation 1) of the sequence of hypotheses predicted by algorithm $A$.
We begin by showing the following lemma, which will allow us to relate the regret of any algorithm to its online stability and AERM properties.

Lemma 20 For any algorithm $A$:

$$R_m(A) = \sum_{i=1}^{m} \left[ f(A(S_{i-1}), z_i) - f(A(S_i), z_i) \right] + \sum_{i=1}^{m} f(A(S_m), z_i) - \min_{h \in \mathcal{H}} \sum_{i=1}^{m} f(h, z_i) + \sum_{i=1}^{m-1} \sum_{j=1}^{i} \left[ f(A(S_i), z_j) - f(A(S_{i+1}), z_j) \right] \quad (18)$$

Proof:

$$R_m(A) = \sum_{i=1}^{m} f(A(S_{i-1}), z_i) - \min_{h \in \mathcal{H}} \sum_{i=1}^{m} f(h, z_i) = \sum_{i=1}^{m} \left[ f(A(S_{i-1}), z_i) - f(A(S_m), z_i) \right] + \sum_{i=1}^{m} f(A(S_m), z_i) - \min_{h \in \mathcal{H}} \sum_{i=1}^{m} f(h, z_i)$$

For the term $-\sum_{i=1}^{m} f(A(S_m), z_i)$ hidden in the first summation, we can rewrite $\sum_{i=1}^{m} f(A(S_m), z_i)$ using the following manipulation:

$$\begin{aligned}
\sum_{i=1}^{m} f(A(S_m), z_i) &= \sum_{i=1}^{m-1} f(A(S_m), z_i) + f(A(S_m), z_m) \\
&= \sum_{i=1}^{m-1} f(A(S_{m-1}), z_i) + \sum_{j=1}^{m-1} \left[ f(A(S_m), z_j) - f(A(S_{m-1}), z_j) \right] + f(A(S_m), z_m) \\
&\;\;\vdots \\
&= \sum_{i=1}^{m} f(A(S_i), z_i) + \sum_{i=1}^{m-1} \sum_{j=1}^{i} \left[ f(A(S_{i+1}), z_j) - f(A(S_i), z_j) \right]
\end{aligned}$$

This proves the lemma.

From this lemma we can immediately see that for any online stable always AERM algorithm $A$ we obtain the following:

Corollary 21 For any online stable always AERM algorithm $A$:

$$R_m(A) \leq \sum_{i=1}^{m} \epsilon_{\text{on-stable}}(i) + m\,\epsilon_{\text{erm}}(m) + \sum_{i=1}^{m-1} \sum_{j=1}^{i} \left[ f(A(S_i), z_j) - f(A(S_{i+1}), z_j) \right] \quad (19)$$

Proof: By online stability we have for all $i$:

$$f(A(S_{i-1}), z_i) - f(A(S_i), z_i) \leq |f(A(S_{i-1}), z_i) - f(A(S_i), z_i)| = |f(A(S_i^{\setminus i}), z_i) - f(A(S_i), z_i)| \leq \epsilon_{\text{on-stable}}(i)$$

and since $A$ is an always AERM, it follows by definition that:

$$\sum_{i=1}^{m} f(A(S_m), z_i) - \min_{h \in \mathcal{H}} \sum_{i=1}^{m} f(h, z_i) \leq m\,\epsilon_{\text{erm}}(m)$$

We will now seek to upper bound the remaining double summation.
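Since Lemma 20 is an exact algebraic identity, it can be checked numerically on any example. The instance below (running-mean algorithm under squared loss, with an arbitrary hypothesis on the empty dataset) is hypothetical:

```python
# Sketch: numerically verifying the regret decomposition of Lemma 20
# (Equation 18) on a small hypothetical example.
f = lambda h, z: (h - z) ** 2
A = lambda S: sum(S) / len(S) if S else 0.0   # A(S_0) arbitrary on the empty set

data = [0.2, 0.9, 0.4, 0.7, 0.1]
m = len(data)
S = lambda i: data[:i]                        # S_i = {z_1, ..., z_i}
best = min(sum(f(h / 1000, z) for z in data) for h in range(1001))

lhs = sum(f(A(S(i - 1)), data[i - 1]) for i in range(1, m + 1)) - best
rhs = (sum(f(A(S(i - 1)), data[i - 1]) - f(A(S(i)), data[i - 1])
           for i in range(1, m + 1))
       + sum(f(A(S(m)), z) for z in data) - best
       + sum(f(A(S(i)), data[j]) - f(A(S(i + 1)), data[j])
             for i in range(1, m) for j in range(i)))
print(abs(lhs - rhs) < 1e-9)  # -> True: the two sides agree exactly
```

The three right-hand terms are exactly the quantities the subsequent results bound: online stability controls the first, the always-AERM rate the second, and Lemmas 22-25 the double summation.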
For an ERM, it can easily be seen that:

Lemma 22 For any ERM algorithm $A$:

$$\sum_{i=1}^{m-1} \sum_{j=1}^{i} \left[ f(A(S_i), z_j) - f(A(S_{i+1}), z_j) \right] \leq 0 \quad (20)$$

Proof: Follows immediately since $\sum_{j=1}^{i} f(A(S_i), z_j)$ is optimal; hence for any other hypothesis $h$, in particular $A(S_{i+1})$, we have $\sum_{j=1}^{i} f(A(S_i), z_j) \leq \sum_{j=1}^{i} f(A(S_{i+1}), z_j)$.

Since an ERM has $\epsilon_{\text{erm}}(m) = 0$ for all $m$, it can be seen directly that an ERM has no regret if it is online stable, as $\frac{R_m(A)}{m} \leq \frac{1}{m} \sum_{i=1}^{m} \epsilon_{\text{on-stable}}(i)$. For a general RERM this double summation can be bounded by:

Lemma 23 For any RERM algorithm $A$:

$$\sum_{i=1}^{m-1} \sum_{j=1}^{i} \left[ f(A(S_i), z_j) - f(A(S_{i+1}), z_j) \right] \leq \sum_{i=0}^{m-1} \rho_i \quad (21)$$

Proof: Adding and subtracting the regularization terms, we obtain:

$$\begin{aligned}
&\sum_{i=1}^{m-1} \sum_{j=1}^{i} \left[ f(A(S_i), z_j) - f(A(S_{i+1}), z_j) \right] \\
&= \sum_{i=1}^{m-1} \Big[ \sum_{j=1}^{i} f(A(S_i), z_j) + \sum_{j=0}^{i} r_j(A(S_i)) - \sum_{j=1}^{i} f(A(S_{i+1}), z_j) - \sum_{j=0}^{i} r_j(A(S_{i+1})) + \sum_{j=0}^{i} \left[ r_j(A(S_{i+1})) - r_j(A(S_i)) \right] \Big] \\
&\leq \sum_{i=1}^{m-1} \sum_{j=0}^{i} \left[ r_j(A(S_{i+1})) - r_j(A(S_i)) \right] = \sum_{i=0}^{m-1} \left[ r_i(A(S_m)) - r_i(A(S_i)) \right] \leq \sum_{i=0}^{m-1} \rho_i
\end{aligned}$$

where the first inequality uses that $A(S_i)$ minimizes $\sum_{j=1}^{i} f(h, z_j) + \sum_{j=0}^{i} r_j(h)$, and the middle equality telescopes the sum over $i$. Combining this result with Corollary 21 proves our main result in Theorem 18, using the fact that a RERM is an always AERM at rate $\frac{1}{m} \sum_{i=0}^{m} \rho_i$.

It is, however, harder to bound this double summation by a term that becomes negligible (when looking at the average regret) for a general always AERM. We can show the following:

Lemma 24 For any always AERM algorithm $A$:

$$\sum_{i=1}^{m-1} \sum_{j=1}^{i} \left[ f(A(S_i), z_j) - f(A(S_{i+1}), z_j) \right] \leq \sum_{i=1}^{m-1} i\,\epsilon_{\text{erm}}(i) \quad (22)$$

Proof:

$$\sum_{i=1}^{m-1} \sum_{j=1}^{i} \left[ f(A(S_i), z_j) - f(A(S_{i+1}), z_j) \right] \leq \sum_{i=1}^{m-1} \Big[ \sum_{j=1}^{i} f(A(S_i), z_j) - \min_{h \in \mathcal{H}} \sum_{j=1}^{i} f(h, z_j) \Big] \leq \sum_{i=1}^{m-1} i\,\epsilon_{\text{erm}}(i)$$

This proves case (1) of Theorem 19 when combined with Corollary 21.
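Lemma 22 can also be checked numerically on a hypothetical ERM (the mean under squared loss, an ERM over $\mathcal{H} = \mathbb{R}$):

```python
# Sketch: checking Lemma 22 on a hypothetical instance. For an ERM (here
# the mean under squared loss), the double summation in Equation (20) is
# non-positive, because A(S_i) is optimal on S_i.
f = lambda h, z: (h - z) ** 2
A = lambda S: sum(S) / len(S)      # ERM over H = R for the squared loss

data = [0.3, 0.8, 0.1, 0.9, 0.5, 0.4]
m = len(data)
total = sum(f(A(data[:i]), data[j]) - f(A(data[:i + 1]), data[j])
            for i in range(1, m) for j in range(i))
print(total <= 1e-12)  # -> True: each inner sum over j is <= 0
```

Each inner sum compares the loss of the leader $A(S_i)$, which is optimal on $S_i$, against the next leader $A(S_{i+1})$ evaluated on the same points, so no single term group can be positive.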
If we have a symmetric always AERM that is uniform-LOO stable and uniform-RO stable, then we can also show:

Lemma 25 For any symmetric always AERM algorithm $A$ that is both uniform-LOO stable and uniform-RO stable:

$$\sum_{i=1}^{m-1} \sum_{j=1}^{i} \left[ f(A(S_i), z_j) - f(A(S_{i+1}), z_j) \right] \leq \sum_{i=1}^{m-1} i\left[ \epsilon_{\text{loo-stable}}(i) + \epsilon_{\text{ro-stable}}(i) \right] \quad (23)$$

Proof:

$$\sum_{i=1}^{m-1} \sum_{j=1}^{i} \left[ f(A(S_i), z_j) - f(A(S_{i+1}), z_j) \right] = \sum_{i=1}^{m-1} \sum_{j=1}^{i} \left[ f(A(S_i), z_j) - f(A(S_{i+1}^{\setminus j}), z_j) + f(A(S_{i+1}^{\setminus j}), z_j) - f(A(S_{i+1}), z_j) \right]$$

For symmetric algorithms, the terms $\sum_{j=1}^{i} [f(A(S_i), z_j) - f(A(S_{i+1}^{\setminus j}), z_j)]$ are related to RO stability, as $S_{i+1}^{\setminus j}$ corresponds to $S_i^{(j)}$ where we replace $z_j$ by $z_{i+1}$. Hence, for symmetric algorithms, by definition of uniform-RO stability we have $\sum_{j=1}^{i} [f(A(S_i), z_j) - f(A(S_{i+1}^{\setminus j}), z_j)] \leq i\,\epsilon_{\text{ro-stable}}(i)$. Furthermore, by definition of uniform-LOO stability, the terms satisfy $\sum_{j=1}^{i} [f(A(S_{i+1}^{\setminus j}), z_j) - f(A(S_{i+1}), z_j)] \leq i\,\epsilon_{\text{loo-stable}}(i)$. This proves the lemma.

This lemma proves case (2) of Theorem 19 when combined with Corollary 21.

Now we show that strong convexity, either of $f$ or of the $r_i$ when $f$ is only convex, implies uniform-LOO stability:

Lemma 26 For any ERM $A$: if $\mathcal{H}$ is a convex set, and for some norm $\|\cdot\|$ on $\mathcal{H}$ we have that at all $z \in \mathcal{Z}$, $f(\cdot, z)$ is $L$-Lipschitz continuous in $\|\cdot\|$ and $\nu$-strongly convex in $\|\cdot\|$, then $A$ is uniform-LOO stable at rate $\epsilon_{\text{loo-stable}}(m) \leq \frac{2L^2}{m\nu}$.

Proof: By Lipschitz continuity we have $|f(A(S^{\setminus i}), z_i) - f(A(S), z_i)| \leq L \|A(S^{\setminus i}) - A(S)\|$.
We can use strong convexity to bound $\|A(S^{\setminus i}) - A(S)\|$. For all $\alpha \in (0, 1)$ we have:

$$\sum_{j=1}^{m} \left[ \alpha f(A(S^{\setminus i}), z_j) + (1 - \alpha) f(A(S), z_j) \right] \geq \sum_{j=1}^{m} f(\alpha A(S^{\setminus i}) + (1 - \alpha) A(S), z_j) + \frac{\alpha(1 - \alpha) m\nu}{2} \|A(S^{\setminus i}) - A(S)\|^2 \geq \sum_{j=1}^{m} f(A(S), z_j) + \frac{\alpha(1 - \alpha) m\nu}{2} \|A(S^{\setminus i}) - A(S)\|^2$$

where the last inequality follows from the fact that $A(S)$ is the ERM on $S$. So we obtain for all $\alpha \in (0, 1)$:

$$\|A(S^{\setminus i}) - A(S)\|^2 \leq \frac{2}{m\nu(1 - \alpha)} \sum_{j=1}^{m} \left[ f(A(S^{\setminus i}), z_j) - f(A(S), z_j) \right]$$

Since $A(S^{\setminus i})$ is the ERM on $S^{\setminus i}$, we have $\sum_{j \neq i} f(A(S), z_j) \geq \sum_{j \neq i} f(A(S^{\setminus i}), z_j)$, so:

$$\|A(S^{\setminus i}) - A(S)\|^2 \leq \frac{2}{m\nu(1 - \alpha)} \left[ f(A(S^{\setminus i}), z_i) - f(A(S), z_i) \right] \leq \frac{2L}{m\nu(1 - \alpha)} \|A(S^{\setminus i}) - A(S)\|$$

Hence we conclude $\|A(S^{\setminus i}) - A(S)\| \leq \frac{2L}{m\nu(1 - \alpha)}$. Since this holds for all $\alpha \in (0, 1)$, we conclude $\|A(S^{\setminus i}) - A(S)\| \leq \frac{2L}{m\nu}$. This proves the lemma.

Lemma 27 For any RERM $A$: if $\mathcal{H}$ is a convex set, and for some norm $\|\cdot\|$ on $\mathcal{H}$ we have that at all $z \in \mathcal{Z}$, $f(\cdot, z)$ is convex and $L$-Lipschitz continuous in $\|\cdot\|$, and for all $i$, $r_i$ is $L_R^i$-Lipschitz continuous in $\|\cdot\|$ and $\nu_i$-strongly convex in $\|\cdot\|$, then $A$ is uniform-LOO stable at rate $\epsilon_{\text{loo-stable}}(m) \leq \frac{2L[L + L_R^m]}{\sum_{i=0}^{m} \nu_i}$.

Proof: By Lipschitz continuity we have $|f(A(S^{\setminus i}), z_i) - f(A(S), z_i)| \leq L \|A(S^{\setminus i}) - A(S)\|$.
We can use strong convexity of the regularizers to bound $\|A(S^{\setminus i}) - A(S)\|$. For all $\alpha \in (0,1)$ we have:
$$\sum_{j=1}^{m} [\alpha f(A(S^{\setminus i}), z_j) + (1-\alpha) f(A(S), z_j)] + \sum_{j=0}^{m} [\alpha r_j(A(S^{\setminus i})) + (1-\alpha) r_j(A(S))] \ge \sum_{j=1}^{m} f(\alpha A(S^{\setminus i}) + (1-\alpha) A(S), z_j) + \sum_{j=0}^{m} r_j(\alpha A(S^{\setminus i}) + (1-\alpha) A(S)) + \frac{\alpha(1-\alpha) \sum_{j=0}^{m} \nu_j}{2} \|A(S^{\setminus i}) - A(S)\|^2 \ge \sum_{j=1}^{m} f(A(S), z_j) + \sum_{j=0}^{m} r_j(A(S)) + \frac{\alpha(1-\alpha) \sum_{j=0}^{m} \nu_j}{2} \|A(S^{\setminus i}) - A(S)\|^2$$
where the last inequality follows from the fact that $A(S)$ minimizes $\sum_{j=1}^{m} f(h, z_j) + \sum_{j=0}^{m} r_j(h)$. So we obtain, for all $\alpha \in (0,1)$: $\|A(S^{\setminus i}) - A(S)\|^2 \le \frac{2}{(1-\alpha)\sum_{j=0}^{m} \nu_j} \left[ \sum_{j=1}^{m} [f(A(S^{\setminus i}), z_j) - f(A(S), z_j)] + \sum_{j=0}^{m} [r_j(A(S^{\setminus i})) - r_j(A(S))] \right]$. Since $A(S^{\setminus i})$ minimizes $\sum_{j=1, j \ne i}^{m} f(h, z_j) + \sum_{j=0}^{m-1} r_j(h)$, we have $\sum_{j=1, j \ne i}^{m} f(A(S), z_j) + \sum_{j=0}^{m-1} r_j(A(S)) \ge \sum_{j=1, j \ne i}^{m} f(A(S^{\setminus i}), z_j) + \sum_{j=0}^{m-1} r_j(A(S^{\setminus i}))$, so:
$$\|A(S^{\setminus i}) - A(S)\|^2 \le \frac{2}{(1-\alpha)\sum_{j=0}^{m} \nu_j} [f(A(S^{\setminus i}), z_i) - f(A(S), z_i) + r_m(A(S^{\setminus i})) - r_m(A(S))] \le \frac{2[L + L_R^m]}{(1-\alpha)\sum_{j=0}^{m} \nu_j} \|A(S^{\setminus i}) - A(S)\|.$$
Hence we conclude $\|A(S^{\setminus i}) - A(S)\| \le \frac{2[L + L_R^m]}{(1-\alpha)\sum_{j=0}^{m} \nu_j}$. Since this holds for all $\alpha \in (0,1)$, we conclude $\|A(S^{\setminus i}) - A(S)\| \le \frac{2[L + L_R^m]}{\sum_{j=0}^{m} \nu_j}$. This proves the lemma.
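As a quick numerical illustration of these stability bounds, here is a minimal sanity check of the strongly convex case (Lemma 26). The setup is our own, not from the paper: the squared loss $f(h,z) = \frac12(h-z)^2$ on $\mathcal{H} = \mathcal{Z} = [0,1]$ is $1$-strongly convex and $1$-Lipschitz on this domain, and its ERM is the sample mean, so the lemma predicts $\epsilon_{\text{loo-stable}}(m) \le \frac{2L^2}{m\nu} = \frac{2}{m}$:

```python
import random

# Sanity check of Lemma 26 for an assumed toy setup: squared loss on [0, 1],
# L = nu = 1, so the predicted bound is 2 * L**2 / (m * nu) = 2 / m.
def erm(S):
    return sum(S) / len(S)  # the ERM of the squared loss is the sample mean

def loo_gap(S):
    """Max over i of |f(A(S\\i), z_i) - f(A(S), z_i)| for f(h,z) = 0.5*(h-z)**2."""
    h = erm(S)
    gaps = []
    for i, z in enumerate(S):
        h_out = erm(S[:i] + S[i + 1:])  # ERM with z_i held out
        gaps.append(abs(0.5 * (h_out - z) ** 2 - 0.5 * (h - z) ** 2))
    return max(gaps)

random.seed(0)
m = 200
S = [random.random() for _ in range(m)]
assert loo_gap(S) <= 2.0 / m  # epsilon_loo-stable(m) <= 2 L^2 / (m nu)
```

Removing one point moves the mean by at most $\frac{1}{m}$, so the observed gap is well within the bound.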
We also prove an alternate result for the case where the regularizers $r_i$ are strongly convex but not necessarily Lipschitz continuous:

Lemma 28 For any RERM $A$: if $\mathcal{H}$ is a convex set and for some norm $\|\cdot\|$ on $\mathcal{H}$ we have that, at all $z \in \mathcal{Z}$, $f(\cdot, z)$ is convex and $L$-Lipschitz continuous in $\|\cdot\|$, and for all $i \ge 0$, $r_i$ is $\nu_i$-strongly convex in $\|\cdot\|$ with $\sup_{h, h' \in \mathcal{H}} |r_i(h) - r_i(h')| \le \rho_i$, then $A$ is uniform-LOO stable at rate $\epsilon_{\text{loo-stable}}(m) \le \frac{2L^2}{\sum_{j=0}^{m} \nu_j} + L\sqrt{\frac{2\rho_m}{\sum_{j=0}^{m} \nu_j}}$.

Proof: Following a proof similar to the previous one, and using the fact that $r_m(A(S^{\setminus i})) - r_m(A(S)) \le \rho_m$, we obtain:
$$\|A(S^{\setminus i}) - A(S)\|^2 \le \frac{2L}{\sum_{j=0}^{m} \nu_j} \|A(S^{\setminus i}) - A(S)\| + \frac{2\rho_m}{\sum_{j=0}^{m} \nu_j}.$$
This is a quadratic inequality of the form $Ax^2 + Bx + C \le 0$. Since here $A = 1 > 0$, this implies $x$ is at most the largest root of $Ax^2 + Bx + C$, namely $x = \frac{-B + \sqrt{B^2 - 4AC}}{2A}$. Here $A = 1$, $B = -\frac{2L}{\sum_{j=0}^{m} \nu_j}$ and $C = -\frac{2\rho_m}{\sum_{j=0}^{m} \nu_j}$, so the largest root is $x = \frac{L}{\sum_{j=0}^{m} \nu_j}\left[1 + \sqrt{1 + \frac{2\rho_m \sum_{j=0}^{m} \nu_j}{L^2}}\right]$. We conclude $\|A(S^{\setminus i}) - A(S)\| \le \frac{L}{\sum_{j=0}^{m} \nu_j}\left[1 + \sqrt{1 + \frac{2\rho_m \sum_{j=0}^{m} \nu_j}{L^2}}\right]$. Since $\sqrt{1 + \frac{2\rho_m \sum_{j=0}^{m} \nu_j}{L^2}} \le 1 + \sqrt{\frac{2\rho_m \sum_{j=0}^{m} \nu_j}{L^2}}$, we obtain $\|A(S^{\setminus i}) - A(S)\| \le \frac{2L}{\sum_{j=0}^{m} \nu_j} + \sqrt{\frac{2\rho_m}{\sum_{j=0}^{m} \nu_j}}$. Combining this with $|f(A(S^{\setminus i}), z_i) - f(A(S), z_i)| \le L\,\|A(S^{\setminus i}) - A(S)\|$ proves the lemma.

4 Mirror Descent and Gradient-Based Methods

So far we have thought of using an underlying batch algorithm to pick the sequence of hypotheses. A popular class of online methods are gradient-based methods, such as gradient descent and Newton-type methods (Zinkevich, 2003; Agarwal et al., 2006).
Such approaches can all be interpreted as Mirror Descent methods, and it is known that Mirror Descent algorithms can be thought of as a form of FTRL (McMahan, 2011). The difference is that they follow the regularized leader on a linear/quadratic approximation to the loss function (a linear/quadratic lower bound in the convex/strongly convex case) at each data point $z$, and the regularizers $r_i$ may regularize about the previously chosen $h_i$ (after observing the first $i-1$ data points) rather than some fixed hypothesis over the iterations (such as $h_1$). These algorithms are typically not symmetric, as the approximation points of the loss function (and potentially the regularizers) depend on the order of the data points in the dataset. Nevertheless, we can still use our previous analysis to bound the regret for these methods in terms of online stability and AERM properties. We will refer to this broad class of methods as Regularized Surrogate Loss Minimizers (RSLM):

Definition 29 An algorithm $A$ is a Regularized Surrogate Loss Minimizer (RSLM) if for all $m$ and any dataset $S$ of $m$ data points:
$$r_0(A(S)) + \sum_{i=1}^{m} [\ell_i(A(S), z_i) + r_i(A(S))] = \min_{h \in \mathcal{H}} r_0(h) + \sum_{i=1}^{m} [\ell_i(h, z_i) + r_i(h)] \qquad (24)$$
for $\{\ell_i\}_{i=1}^{m}$ the surrogate loss functionals, chosen such that $f(A(S_{i-1}), z_i) - f(h, z_i) \le \ell_i(A(S_{i-1}), z_i) - \ell_i(h, z_i)$ for all $h$ (i.e. they upper bound the regret), and $\{r_i\}_{i=0}^{\infty}$ the regularizer functionals, such that $\sup_{h, h' \in \mathcal{H}} |r_i(h) - r_i(h')| \le \rho_i$ with $\{\rho_i\}_{i=0}^{\infty}$ $o(1)$. Note that a RERM is a special case of a RSLM where $\ell_i(h, z_i) = f(h, z_i)$.
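To make Definition 29 concrete, here is a minimal toy RSLM (our own construction, not from the paper): linearized surrogates $\ell_i(h, z_i) = g_i h$, with $g_i$ a gradient of $f(\cdot, z_i)$ at the previously chosen hypothesis (by convexity these satisfy the regret upper-bound condition), and quadratic regularizers $r_i(h) = \frac{\sigma_i}{2} h^2$. For simplicity we take $\mathcal{H} = \mathbb{R}$ and ignore the bounded-range condition on the $r_i$; the minimizer then has the closed form $h = -\sum_j g_j / \sum_j \sigma_j$, recovering a gradient-based FTRL update:

```python
import math

def f(h, z):          # convex instantaneous loss (our toy choice)
    return 0.5 * (h - z) ** 2

def grad_f(h, z):
    return h - z

def rslm_regret(zs):
    """Average regret of the linearized-surrogate RSLM on the sequence zs."""
    sum_g, sum_sigma = 0.0, 1.0   # sigma_0 = 1 for the round-0 regularizer
    h, total = 0.0, 0.0
    for t, z in enumerate(zs, start=1):
        total += f(h, z)              # loss of the hypothesis chosen before z_t
        sum_g += grad_f(h, z)         # gradient of the linearized surrogate
        sum_sigma += 1.0 / math.sqrt(t)   # sigma_t ~ 1 / sqrt(t)
        h = -sum_g / sum_sigma        # closed-form minimizer of eq. (24)
    h_star = sum(zs) / len(zs)        # best fixed hypothesis in hindsight
    return (total - sum(f(h_star, z) for z in zs)) / len(zs)
```

On a constant stream the average regret shrinks toward 0, consistent with the no-regret guarantee of Theorem 31 below for online stable RSLM.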
For the broader class of RSLM, the regret is bounded by:

Lemma 30 For any RSLM $A$:
$$R_m(A) \le \sum_{i=1}^{m} [\ell_i(A(S_{i-1}), z_i) - \ell_i(A(S_i), z_i)] + \sum_{i=1}^{m} \ell_i(A(S_m), z_i) - \min_{h \in \mathcal{H}} \sum_{i=1}^{m} \ell_i(h, z_i) + \sum_{i=1}^{m-1} \sum_{j=1}^{i} [\ell_j(A(S_i), z_j) - \ell_j(A(S_{i+1}), z_j)] \qquad (25)$$

Proof: By the properties of the functions $\ell_i$ we have that $R_m(A) \le \sum_{i=1}^{m} \ell_i(A(S_{i-1}), z_i) - \min_{h \in \mathcal{H}} \sum_{i=1}^{m} \ell_i(h, z_i)$. Using the same manipulations as in Lemma 20 proves the lemma.

A RSLM is a RERM in the losses $\{\ell_i\}_{i=1}^{m}$ instead of $f$. Hence it follows that if such a RSLM is online stable (in the losses $\{\ell_i\}_{i=1}^{m}$, i.e. $|\ell_m(A(S_{m-1}), z_m) - \ell_m(A(S_m), z_m)| \to 0$ as $m \to \infty$), it must have no regret:

Theorem 31 If there exists a RSLM that is online stable in the surrogate losses $\{\ell_i\}_{i=1}^{m}$, then the problem is online learnable. In particular, it has no regret at rate:
$$\epsilon_{\text{regret}}(m) \le \frac{1}{m} \sum_{i=1}^{m} \epsilon_{\text{on-stable}}(i) + \frac{2}{m} \sum_{i=0}^{m-1} \rho_i + \frac{\rho_m}{m} \qquad (26)$$

Proof: Follows from applying Corollary 21 and Lemma 23 (replacing $f$ by $\{\ell_i\}$) to the previous Lemma 30.

5 Weighted Majority, Hedge and Randomized Algorithms

We have so far restricted our attention to deterministic algorithms, which upon observing a dataset $S$ return a fixed hypothesis $h \in \mathcal{H}$. An important class of methods for online learning are randomized algorithms, such as Weighted Majority and its generalization Hedge, which instead return a distribution over hypotheses in $\mathcal{H}$ at each iteration. These randomized algorithms are important in online learning, as it is known that some problems are not online learnable with deterministic algorithms but are online learnable with randomized algorithms (assuming that, when choosing the data point $z$, the adversary can only be aware of the distribution over hypotheses and not of the particular hypothesis that will be sampled from it).
For instance, general problems with a finite set of hypotheses fall in this category. In this section we show that Weighted Majority, Hedge and similar variants can be interpreted as Randomized uniform-LOO stable RERM. We provide an analysis of the stability, AERM and no-regret rates of such algorithms based on the previous results derived in this paper. These results will be useful to determine the existence of (potentially randomized) uniform-LOO stable RERM for a large class of learning problems. Before we introduce this analysis, we first define formally what we mean by a Randomized RERM and how the notions of stability and no-regret extend to randomized algorithms.

5.1 Randomized Algorithms

Definition 32 Let $\Theta$ be a set such that for any $\theta \in \Theta$, $P_\theta$ is a probability distribution over the hypothesis class $\mathcal{H}$, and for any $h \in \mathcal{H}$ and $\epsilon > 0$ there exists a $\theta \in \Theta$ such that $\mathbb{E}_{h' \sim P_\theta}[f(h', z)] - f(h, z) \le \epsilon$ for all $z \in \mathcal{Z}$. Let $P_{\theta_S} = A(S)$ denote the distribution picked by algorithm $A$ on dataset $S$. An algorithm $A$ is a Randomized RERM if for all $m$ and any dataset $S$:
$$r_0(\theta_S) + \sum_{i=1}^{m} \left[ \mathbb{E}_{h \sim P_{\theta_S}}[f(h, z_i)] + r_i(\theta_S) \right] = \min_{\theta \in \Theta} r_0(\theta) + \sum_{i=1}^{m} \left[ \mathbb{E}_{h \sim P_\theta}[f(h, z_i)] + r_i(\theta) \right] \qquad (27)$$
for $r_i : \Theta \to \mathbb{R}$ the regularizer functionals, which measure the complexity of a chosen $\theta$, assumed to satisfy $\sup_{\theta, \theta' \in \Theta} |r_i(\theta) - r_i(\theta')| \le \rho_i$ with $\{\rho_m\}_{m=0}^{\infty}$ $o(1)$.

The set $\Theta$ might represent a set of parameters parametrizing a family of distributions (e.g. $\Theta$ a set of mean-variance tuples such that $P_\theta$ is Gaussian with those parameters), or in other cases be a set of distributions itself (e.g. when $\mathcal{H}$ is finite, $\Theta$ might be the set of all discrete distributions over $\mathcal{H}$), in which case $P_\theta = \theta$. The condition that there exists a $\theta \in \Theta$ such that $\mathbb{E}_{h' \sim P_\theta}[f(h', z)] - f(h, z) \le \epsilon$ for all $z \in \mathcal{Z}$ ensures the algorithm is an AERM, i.e.
that it can pick a $\theta$ with average expected loss no greater than that of the best fixed $h \in \mathcal{H}$ in the limit as $m \to \infty$. A deterministic RERM is a special case of a Randomized RERM where the set $\Theta = \mathcal{H}$ and $P_\theta$ is the distribution that puts probability 1 on the chosen hypothesis $\theta$. When using a randomized algorithm, the algorithm incurs loss on a hypothesis $h$ sampled from the chosen $P_\theta$, and we assume the adversary may only be aware of $P_\theta$ in advance (not the particular sampled $h$) when choosing $z$. The previous definitions of stability, AERM and no-regret extend to randomized algorithms by considering the loss $f(A(S), z) = \mathbb{E}_{h \sim A(S)}[f(h, z)]$. Thus a no-regret randomized algorithm is one whose expected average regret, under the sequence of chosen distributions, goes to 0 as $m$ goes to $\infty$. By our assumption that the instantaneous regret is bounded, this is also equivalent to saying that its average regret (under the sampled hypotheses) goes to 0 with probability 1 as $m$ goes to $\infty$ (e.g. using a Hoeffding bound). Additionally, a randomized online stable algorithm implies that the change in expected loss on the last data point, when it is held out, goes to 0 as $m$ goes to $\infty$: $|\mathbb{E}_{h \sim A(S^{\setminus m})}[f(h, z_m)] - \mathbb{E}_{h \sim A(S)}[f(h, z_m)]| \to 0$.

5.2 Hedge and Weighted Majority

An important randomized no-regret online learning algorithm when $\mathcal{H}$ is finite is the Hedge algorithm (Freund and Schapire, 1997). Hedge is a generalization to arbitrary losses of the Weighted Majority algorithm, which was introduced for the classification setting (Littlestone and Warmuth, 1994). Let $\theta^i$ denote the probability of hypothesis $h_i$; then at any iteration $t$, Hedge/Weighted Majority plays $\theta^i \propto \exp(-\eta \sum_{j=1}^{t-1} f(h_i, z_j))$ for some positive constant $\eta$.
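The update just described can be sketched in a few lines (a minimal illustration; the function names and the toy loss sequence are ours, not from the paper):

```python
import math

def hedge_weights(cum_losses, eta):
    """Exponential weights: theta_i proportional to exp(-eta * cumulative loss of expert i)."""
    # Subtract the min cumulative loss for numerical stability; proportionality is unchanged.
    m = min(cum_losses)
    w = [math.exp(-eta * (c - m)) for c in cum_losses]
    z = sum(w)
    return [x / z for x in w]

def run_hedge(loss_rows, eta):
    """Play Hedge on a sequence of loss vectors (one entry per expert).

    Returns the total expected loss incurred and the final distribution."""
    d = len(loss_rows[0])
    cum = [0.0] * d
    total = 0.0
    for losses in loss_rows:
        theta = hedge_weights(cum, eta)   # distribution chosen before seeing z_t
        total += sum(t * l for t, l in zip(theta, losses))  # expected loss this round
        cum = [c + l for c, l in zip(cum, losses)]
    return total, hedge_weights(cum, eta)
```

On a toy sequence where one expert is consistently better, the distribution quickly concentrates on the best expert in hindsight, which is the behavior the stability analysis below quantifies.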
When the number of rounds $m$ is known in advance, $\eta$ is typically chosen as $O\!\left(\frac{1}{B}\sqrt{\frac{\log(|\mathcal{H}|)}{m}}\right)$, for $B$ the maximum instantaneous regret. We will consider here a slight generalization of Hedge that can be applied when the number of rounds is not known in advance. In this case, at iteration $t$: $\theta^i \propto \exp(-\eta_t \sum_{j=1}^{t-1} f(h_i, z_j))$ for some sequence of positive constants $\{\eta_t\}_{t=0}^{\infty}$. We show here that Hedge (and Weighted Majority) is in fact a Randomized uniform-LOO stable RERM, where $\Theta$ is the set of all discrete distributions over the finite set of experts and the regularizer corresponds to a KL divergence between the chosen distribution and the uniform distribution over experts:

Theorem 33 For a finite set of $d$ experts with instantaneous regret bounded by $B$, the Hedge (and Weighted Majority) algorithm corresponds to the following Randomized uniform-LOO stable RERM. Let $\Theta$ be the set of distributions over the finite set of $d$ experts and $U$ denote the uniform distribution; then at each iteration $t$, Hedge (and Weighted Majority) picks the distribution $\theta^* \in \Theta$ that satisfies:
$$\theta^* = \arg\min_{\theta \in \Theta} \sum_{i=1}^{t-1} \mathbb{E}_{h \sim \theta}[f(h, z_i)] + \sum_{i=0}^{t-1} \lambda_i\, \mathrm{KL}(\theta \,\|\, U)$$
i.e. it uses $r_t = \lambda_t r$ for $r$ a KL regularizer with respect to the uniform distribution. Choosing the regularization constants $\lambda_t = B\sqrt{\frac{1}{8 \log(d) \max(1, t)}}$ for all $t \ge 0$ makes Hedge (and Weighted Majority) uniform-LOO stable at rate $\epsilon_{\text{loo-stable}}(m) \le B\sqrt{2\log(d)}\left[\frac{1}{2\sqrt{m}-1} + \frac{1}{2\sqrt{m+1}}\right]$, always AERM at rate $\epsilon_{\text{erm}}(m) \le B\sqrt{\frac{\log(d)}{2m}}\left(1 + \frac{1}{2\sqrt{m}}\right)$, and no-regret at rate $\epsilon_{\text{regret}}(m) \le B\sqrt{2\log(d)}\left[\frac{3}{\sqrt{m}} + \frac{\log(m)}{2m} + \frac{1 + 2\ln(2)}{2m}\right]$.

Proof: Consider the above Randomized RERM algorithm. Then we have $0 \le \lambda_i\, \mathrm{KL}(\theta \| U) \le \lambda_i \log(d)$ for all $i$ and $\theta \in \Theta$.
So $\Theta$ and $\{r_i\}_{i=0}^{\infty}$ are well defined according to our assumptions in the definition of a Randomized RERM as long as $\{\lambda_i\}_{i=0}^{\infty}$ is $o(1)$. Let $h_i$ denote the $i$th expert and $\theta^i$ the probability assigned to $h_i$ by a chosen $\theta \in \Theta$. At any iteration $t+1$, when the algorithm has observed $t$ data points so far, the Randomized RERM algorithm solves an optimization problem of the form:
$$\arg\min_{\theta \in \Theta} \sum_{j=1}^{t} \sum_{i=1}^{d} \theta^i f(h_i, z_j) + \sum_{j=0}^{t} \lambda_j \sum_{i=1}^{d} \theta^i \log(d\theta^i) \quad \text{s.t.} \quad 0 \le \theta^i \le 1, \quad \sum_{i=1}^{d} \theta^i = 1.$$
Using the Lagrangian, we can easily see that the optimal solution to this optimization problem is to choose $\theta^i \propto \exp\!\left(-\frac{1}{\sum_{j=0}^{t} \lambda_j} \sum_{j=1}^{t} f(h_i, z_j)\right)$ for all $i$. This is the same as Hedge with $\eta_t = \frac{1}{\sum_{j=0}^{t} \lambda_j}$. When Hedge plays for $m$ rounds and uses a fixed $\eta$, this can be achieved with a fixed regularizer $\lambda_0 = \frac{1}{\eta}$ and $\lambda_t = 0$ for all $t \ge 1$. So this establishes that Hedge is equivalent to the above RERM.

Now consider the case where the number of rounds $m$ is not known in advance, and we choose $\lambda_t = c\sqrt{\frac{1}{\max(t, 1)}}$ for all $t \ge 0$ and some constant $c$ in the above RERM. This choice leads to $2c\sqrt{t} - c \le \sum_{j=0}^{t} \lambda_j \le 2c\sqrt{t} + c$. Note also that because $\lambda_t \le \lambda_j$ for all $j \le t$, we have $(t+1)\lambda_t \le \sum_{j=0}^{t} \lambda_j$.

It is easy to see why the above RERM must be uniform-LOO stable. First, the expected loss of the randomized algorithm is linear in $\theta$ (and hence convex), while the KL regularizer is 1-strongly convex in $\theta$ under $\|\cdot\|_1$ and bounded by $\log(d)$ (so $r_m$ is $\lambda_m$-strongly convex and bounded by $\lambda_m \log(d)$). Additionally, the expected loss is $L$-Lipschitz continuous in $\|\cdot\|_1$ over $\theta$, for $L = \sup_{z \in \mathcal{Z}} \inf_{v \in \mathbb{R}} \sup_{h \in \mathcal{H}} |f(h, z) - v|$.
This is because for any $z$ and any $v \in \mathbb{R}$:
$$\left| \mathbb{E}_{h \sim P_\theta}[f(h, z)] - \mathbb{E}_{h \sim P_{\theta'}}[f(h, z)] \right| = \left| \sum_{i=1}^{d} (\theta^i - \theta'^i)(f(h_i, z) - v) \right| \le \sum_{i=1}^{d} |\theta^i - \theta'^i| |f(h_i, z) - v| \le \sup_{h \in \mathcal{H}} |f(h, z) - v| \, \|\theta - \theta'\|_1.$$
So we conclude that for all $z \in \mathcal{Z}$, $|\mathbb{E}_{h \sim P_\theta}[f(h, z)] - \mathbb{E}_{h \sim P_{\theta'}}[f(h, z)]| \le L \|\theta - \theta'\|_1$. If the loss $f$ has instantaneous regret bounded by $B$, then $L = \frac{B}{2}$. So by our previous result for RERM with convex loss and strongly convex regularizers, we obtain that the algorithm is uniform-LOO stable at rate $\epsilon_{\text{loo-stable}}(m) \le \frac{2L^2}{\sum_{i=0}^{m} \lambda_i} + L\sqrt{\frac{2\lambda_m \log(d)}{\sum_{i=0}^{m} \lambda_i}}$, so that the algorithm has no regret at rate:
$$\epsilon_{\text{regret}}(m) \le \frac{1}{m} \sum_{i=1}^{m} \left[ \frac{2L^2}{\sum_{j=0}^{i} \lambda_j} + L\sqrt{\frac{2\lambda_i \log(d)}{\sum_{j=0}^{i} \lambda_j}} \right] + \frac{2\log(d)}{m} \sum_{j=0}^{m-1} \lambda_j + \frac{\log(d)\lambda_m}{m}.$$
Setting $\lambda_i = c\sqrt{\frac{1}{\max(1, i)}}$ leads to:
$$\epsilon_{\text{regret}}(m) \le \frac{1}{m} \sum_{i=1}^{m} \left[ \frac{2L^2}{c(2\sqrt{i} - 1)} + L\sqrt{\frac{2\log(d)}{i+1}} \right] + \frac{2\log(d)}{m} c(2\sqrt{m} + 1) \le \frac{2L^2}{c} \frac{1}{m}\left(\sqrt{m} + \frac{1}{2}\log(m) + \log(2)\right) + L\sqrt{2\log(d)}\frac{2}{\sqrt{m}} + 4c\log(d)\frac{1}{\sqrt{m}} + \frac{2c\log(d)}{m} = \frac{1}{\sqrt{m}}\left[\frac{2L^2}{c} + 4c\log(d) + 2L\sqrt{2\log(d)}\right] + \frac{\log(m)}{m}\frac{L^2}{c} + \frac{1}{m}\left[\frac{2\ln(2)L^2}{c} + 2c\log(d)\right]$$
where the second inequality uses $\frac{1}{\sum_{j=0}^{i} \lambda_j} \le \frac{1}{2c\sqrt{i} - c}$, $\frac{\lambda_i}{\sum_{j=0}^{i} \lambda_j} \le \frac{1}{i+1}$ and $\sum_{j=0}^{m} \lambda_j \le 2c\sqrt{m} + c$; and the third uses $\sum_{i=1}^{m} \frac{1}{2\sqrt{i} - 1} \le \sqrt{m} + \frac{1}{2}\log(m) + \log(2)$ and $\sum_{i=1}^{m} \frac{1}{\sqrt{i+1}} \le 2\sqrt{m}$, which follow from upper bounding the summations by integrals. Setting $c = L\sqrt{\frac{1}{2\log(d)}}$ minimizes the factor multiplying the $\frac{1}{\sqrt{m}}$ term. This leads to $\epsilon_{\text{regret}}(m) \le L\sqrt{2\log(d)}\left[\frac{6}{\sqrt{m}} + \frac{\log(m)}{m} + \frac{1+2\ln(2)}{m}\right]$, $\epsilon_{\text{loo-stable}}(m) \le L\sqrt{2\log(d)}\left[\frac{2}{2\sqrt{m}-1} + \frac{1}{\sqrt{m+1}}\right]$ and $\epsilon_{\text{erm}}(m) \le L\sqrt{\frac{2\log(d)}{m}}\left(1 + \frac{1}{2\sqrt{m}}\right)$.
Plugging in $L = \frac{B}{2}$ proves the statements in the theorem.

This theorem establishes the following:

Corollary 34 Any learning problem with a finite hypothesis class (and bounded instantaneous regret) is online learnable with a (potentially randomized) uniform-LOO stable RERM.

In Section 6.1 we will also demonstrate that when $\mathcal{H}$ is infinite but can be "finitely approximated" well enough with respect to the loss $f$, the problem is also online learnable via a (potentially randomized) uniform-LOO stable RERM.

6 Is Uniform LOO Stability Necessary?

We now restrict our attention to symmetric algorithms, for which we have shown that uniform-LOO stability is sufficient for online learnability. We start by giving instructive examples illustrating that uniform-LOO stability might in fact be necessary to achieve no regret.

Example 6.1 There exists a problem that is learnable in the batch setting with an ERM that is universal all-i-LOO stable. However, that problem is not online learnable (by any deterministic algorithm), and there does not exist any (deterministic) algorithm that is both uniform LOO stable and always AERM. When allowing randomized algorithms (convexifying the problem), the problem is online learnable via a uniform LOO stable RERM, but there exist (randomized) universal all-i-LOO stable RERM that are not uniform-LOO stable and cannot achieve no regret.

Proof: This example was studied in both (Kutin and Niyogi, 2002) and (Shalev-Shwartz et al., 2009). Consider the hypothesis space $\mathcal{H} = \{0, 1\}$, the instance space $\mathcal{Z} = \{0, 1\}$ and the loss $f(h, z) = |h - z|$.
As was shown in (Shalev-Shwartz et al., 2009) for the batch setting, an ERM for this problem is universally consistent and universally all-i-LOO stable, because removing a data point $z$ from the dataset can change the hypothesis only if there is an equal number of 0's and 1's (plus or minus one), which occurs with probability $O(\frac{1}{\sqrt{m}})$. Shalev-Shwartz et al. (2009) also showed that the only uniform LOO stable algorithms on this problem must be constant (i.e. always return the same hypothesis $h$, regardless of the dataset), at least for large enough datasets, and hence cannot be AERM.

It is also easy to see that this problem is not online learnable with any deterministic algorithm $A$. Consider an adversary who has knowledge of $A$ and picks the data points $z_i = 1 - A(S_{i-1})$. Then algorithm $A$ incurs loss $\sum_{i=1}^{m} f(A(S_{i-1}), z_i) = m$, while there exists a hypothesis $h$ that achieves $\sum_{i=1}^{m} f(h, z_i) \le \frac{m}{2}$. Hence for any deterministic algorithm $A$, there exists a sequence of data points such that $\frac{R_m(A)}{m} \ge \frac{1}{2}$ for all $m$.

Now consider allowing randomized algorithms, where we choose a distribution over $\{0, 1\}$. Allowing randomized algorithms makes the problem linear (and hence convex) in the distribution (by linearity of expectation) and makes the hypothesis space (the space of distributions on $\mathcal{H}$) convex. Let $p$ denote the probability of hypothesis 1. Then the problem can be expressed with hypothesis space $p \in [0, 1]$ and loss $f(p, z) = (1-p)z + p(1-z)$. This problem is obviously online learnable with a randomized uniform-LOO stable RERM (i.e. Hedge) that is uniform-LOO stable at rate $O(\frac{1}{\sqrt{m}})$ and no-regret at rate $O(\frac{1}{\sqrt{m}})$, using our previous results. Even under this change, the previous ERM algorithm that is universally all-i-LOO stable would still choose the same hypothesis as before, i.e.
$p$ would always be 0 or 1, and it would not be uniform-LOO stable. That would also be the case even if we make it pick $p = \frac{1}{2}$, or some other intermediate value, when there is an equal number of 0's and 1's. If we make it pick such an intermediate value, it would still be universal all-i-LOO stable, as the hypothesis would still only change with small probability $O(\frac{1}{\sqrt{m}})$. However, such an algorithm cannot achieve no regret. Again, if we pick the sequence $z_i = \mathrm{round}(1 - A(S_{i-1}))$, then whenever $i$ is even, the ERM uses an odd number of data points, so it must pick either 0 or 1 and incurs loss 1. When $i$ is odd, there is an equal number of 0's and 1's in the dataset (by the fact that $A$ chooses the ERM at odd steps), and no matter what $p$ it picks, it incurs loss of at least $\frac{1}{2}$. Thus $\frac{R_m(A)}{m} \ge \frac{1}{4}$ for all $m$.

We can also consider the following randomized RERM algorithm, which uses only a convex regularizer: $A(S) = \arg\min_{p \in [0,1]} \sum_{i=1}^{m} f(p, z_i) + \sum_{i=0}^{m} \lambda_i |p - \frac{1}{2}|$. Let $\bar{z} = \frac{1}{m} \sum_{i=1}^{m} z_i$ and $\bar{\lambda} = \frac{1}{m} \sum_{i=0}^{m} \lambda_i$. Using the subgradient of this objective, we can easily show that $A(S)$ picks $\frac{1}{2}$ if $\bar{z} \in [\frac{1-\bar{\lambda}}{2}, \frac{1+\bar{\lambda}}{2}]$, and otherwise picks $p = 1$ if $\bar{z} > \frac{1+\bar{\lambda}}{2}$ and $p = 0$ if $\bar{z} < \frac{1-\bar{\lambda}}{2}$. This algorithm is not uniform-LOO stable: for any regularizer $\lambda_m$ and large enough $m$, we can pick a dataset $S_m$ such that $S_{m-1}$ has $\bar{z} \in [\frac{1-\bar{\lambda}}{2}, \frac{1+\bar{\lambda}}{2}]$ but $S_m$ has $\bar{z} \notin [\frac{1-\bar{\lambda}}{2}, \frac{1+\bar{\lambda}}{2}]$, so that $f(A(S_m), z_m) = 0$ but $f(A(S_{m-1}), z_m) = \frac{1}{2}$. Hence $\epsilon_{\text{loo-stable}}(m) \ge \frac{1}{2}$. However, it is universal all-i-LOO stable, as the hypothesis still only changes with small probability $O(\frac{1}{\sqrt{m}})$, as in the previous case (for the hypothesis to change upon removal of a sample, we need to draw $m$ samples whose proportion of 1's is $\frac{1-\bar{\lambda}}{2}$ or $\frac{1+\bar{\lambda}}{2}$, up to one sample). Furthermore, this algorithm does not achieve no regret.
Consider the sequence where, whenever $A(S_{i-1})$ picks $\frac{1}{2}$, we pick $z_i = 1$, and whenever $A(S_{i-1})$ picks 1, we pick $z_i = 0$. It is easy to see that, by the way this sequence is generated, the proportion of 1's $\bar{z}$ in $S_m$ will track the boundary $\frac{1+\bar{\lambda}}{2}$, where the algorithm switches between $p = \frac{1}{2}$ and $p = 1$, as $m$ increases. Since $\bar{\lambda} \to 0$ as $m \to \infty$, in the limit $\bar{z} \to \frac{1}{2}$. Since the sequence is such that every time we generate a 0 the algorithm incurs loss 1, and every time we generate a 1 it incurs loss $\frac{1}{2}$, its average loss converges to $\frac{3}{4}$; but there is a hypothesis achieving average loss $\frac{1}{2}$, so the average regret converges to $\frac{1}{4}$.

This problem is insightful in a number of ways. First, it shows that there are problems that are batch learnable but not online learnable, yet become online learnable when considering randomized algorithms. Additionally, it shows that a RERM that is universal all-i-LOO stable, the next weakest stability notion, is not guaranteed to achieve no regret. This shows we cannot guarantee no regret for any RERM using only universal all-i-LOO stability or any weaker notion of LOO stability. It also suggests that a notion of LOO stability at least stronger than all-i-LOO stability might be necessary to guarantee no regret.

Another point reinforcing the idea that uniform-LOO stability might be necessary is that online learnability is known not to be equivalent to batch learnability (as shown in the example below). Therefore, necessary stability conditions for online learnability should intuitively be stronger than those for batch learnability.

Example 6.2 (Example taken from Adam Kalai and Sham Kakade) There exists a problem that is learnable in the batch setting but not learnable in the online setting by any deterministic or randomized online algorithm.
Proof: Consider a threshold learning problem on the interval $[0, 1]$, where the true hypothesis $h^*$ is such that for some $x^* \in [0, 1]$, $h^*(x) = 2I(x \ge x^*) - 1$. Given an observation $z = (x, h^*(x))$, we define the loss incurred by a hypothesis $h \in \mathcal{H}$ as $L(h, (x, h^*(x))) = \frac{1 - h(x)h^*(x)}{2}$, for $\mathcal{H} = \{2I(x \ge t) - 1 \mid t \in [0, 1]\}$ the set of all threshold functions on $[0, 1]$. Since this is a binary classification problem and the VC dimension of threshold functions is finite (2), we conclude this problem is batch learnable. In fact, by existing results, it is batch learnable by an ERM that is all-i-LOO stable.

However, in the online setting, consider an adversary who picks the sequence of inputs by the following binary search: $x_1 = \frac{1}{2}$, $x_i = x_{i-1} - y_{i-1} 2^{-i}$, and $y_i = -h_i(x_i)$, so that the observation of the learner at iteration $i$ is $z_i = (x_i, y_i)$. This sequence is constructed so that the learner always incurs loss 1 at each iteration, while after any number of iterations $m$ the hypothesis $h = 2I(x \ge x_{m+1}) - 1$ achieves 0 loss on the entire sequence $z_1, z_2, \dots, z_m$. This implies the average regret of the algorithm is 1 for all $m$. Additionally, even if we allow randomized algorithms, such that the prediction at iteration $i$ is effectively a distribution over $\{-1, 1\}$ where $p_i$ denotes the probability $P(h_i(x_i) = 1)$ under the distribution over hypotheses chosen by the learner, the expected loss of the learner at iteration $i$ is $\frac{1 + y_i - 2p_i y_i}{2}$. Suppose the $x_i$ are chosen as before but $y_i = 1 - 2I(p_i \ge 0.5)$ (i.e. $y_i = -1$ when $p_i \ge 0.5$ and $y_i = 1$ otherwise). Then again at each iteration the learner must incur expected loss of at least $\frac{1}{2}$, but the hypothesis $h = 2I(x \ge x_{m+1}) - 1$ achieves loss 0 on the entire sequence $z_1, z_2, \dots, z_m$.
Hence the expected average regret is $\ge \frac{1}{2}$ for all $m$, so that with probability 1 the average regret of the randomized algorithm is $\ge \frac{1}{2}$ in the limit as $m$ goes to infinity. Hence we conclude that this problem is not online learnable.

6.1 Necessary Stability Conditions for Online Learnability in Particular Settings

6.1.1 Binary Classification Setting

We now show that if we restrict our attention to the binary classification setting ($f(h, (x, y)) = I(h(x) \ne y)$ for $y \in \{0, 1\}$), online learnability is equivalent to the existence of a (potentially randomized) uniform LOO stable RERM.

Our argument uses the notion of Littlestone dimension, which was shown to characterize online learnability in the binary classification setting. Ben-David et al. (2009) have shown that a classification problem is online learnable if and only if the hypothesis class has finite Littlestone dimension. By our current results, we know that if there exists a uniform LOO stable RERM, the classification problem must be online learnable and thus have finite Littlestone dimension. We here show that finite Littlestone dimension implies the existence of a (potentially randomized) uniform LOO stable RERM. To establish this, we use the fact that when $\mathcal{H}$ is infinite but has finite Littlestone dimension ($\mathrm{Ldim}(\mathcal{H}) < \infty$), Weighted Majority can be adapted into a no-regret algorithm by playing distributions over a fixed finite set of experts (of size $\le m^{\mathrm{Ldim}(\mathcal{H})}$ when playing for $m$ rounds) derived from $\mathcal{H}$ (Ben-David et al., 2009):

Theorem 35 For any binary classification problem with hypothesis space $\mathcal{H}$ of finite Littlestone dimension $\mathrm{Ldim}(\mathcal{H})$ and number of rounds $m$, there exists a Randomized uniform-LOO stable RERM algorithm. In particular, it has no regret at rate $\epsilon_{\text{regret}}(t) \le \sqrt{2\log(m)\mathrm{Ldim}(\mathcal{H})}\left[\frac{3}{\sqrt{t}} + \frac{\log(t)}{2t} + \frac{1+2\ln(2)}{2t}\right]$ for all $t \le m$.
Proof: The algorithm proceeds by constructing the same set of experts as in Ben-David et al. (2009) from $\mathcal{H}$, which has $\le m^{\mathrm{Ldim}(\mathcal{H})}$ experts for $m$ rounds. The previously mentioned Weighted Majority algorithm on this set achieves no regret at rate $\epsilon_{\text{regret}}(t) \le \sqrt{2\log(m)\mathrm{Ldim}(\mathcal{H})}\left[\frac{3}{\sqrt{t}} + \frac{\log(t)}{2t} + \frac{1+2\ln(2)}{2t}\right]$ for all $t \le m$ (since the maximum instantaneous regret is 1), and is a Randomized uniform-LOO stable RERM as shown in Theorem 33.

This result implies that finite Littlestone dimension is equivalent to the existence of a (potentially randomized) uniform LOO stable RERM, and therefore that online learnability in the binary classification setting is equivalent to the existence of a (potentially randomized) uniform LOO stable RERM:

Corollary 36 A binary classification problem is online learnable if and only if there exists a (potentially randomized) uniform-LOO stable RERM.

6.1.2 Problems with Sub-Exponential Covering

For any $\epsilon > 0$, let $\mathcal{C}_\epsilon = \{C \subseteq \mathcal{H} \mid \forall h' \in \mathcal{H}, \exists h \in C \text{ s.t. } \forall z \in \mathcal{Z}: |f(h, z) - f(h', z)| \le \epsilon\}$. $\mathcal{C}_\epsilon$ is the set of all subsets $C$ of $\mathcal{H}$ such that for any $h' \in \mathcal{H}$ we can find an $h \in C$ whose loss is within $\epsilon$ of the loss of $h'$ at all $z \in \mathcal{Z}$. We define the $\epsilon$-covering number of the tuple $(\mathcal{H}, \mathcal{Z}, f)$ as $N(\mathcal{H}, \mathcal{Z}, f, \epsilon) = \inf_{C \in \mathcal{C}_\epsilon} |C|$, i.e. the minimal number of hypotheses needed to cover the loss of any hypothesis in $\mathcal{H}$ within $\epsilon$. We will show that we can guarantee no regret with a Randomized uniform-LOO stable RERM algorithm (e.g. Hedge) as long as there exists a sequence $\{\epsilon_i\}_{i=0}^{\infty}$ that is $o(1)$ and such that, for any number of rounds $m$, $N(\mathcal{H}, \mathcal{Z}, f, \epsilon_m)$ is $o(\exp(m))$.

Theorem 37 Any learning problem (with instantaneous regret bounded by $B$) for which there exists a sequence $\{\epsilon_m\}_{m=0}^{\infty}$ that is $o(1)$ and such that $\{N(\mathcal{H}, \mathcal{Z}, f, \epsilon_m)\}_{m=0}^{\infty}$ is $o(\exp(m))$ is online learnable with a Randomized uniform-LOO stable RERM algorithm.
In particular, when playing for $m$ rounds, it has no regret at rate $\epsilon_{\text{regret}}(t) \le B\sqrt{2\log(N(\mathcal{H}, \mathcal{Z}, f, \epsilon_m))}\left[\frac{3}{\sqrt{t}} + \frac{\log(t)}{2t} + \frac{1+2\ln(2)}{2t}\right] + \epsilon_m$ for all $t \le m$.

Proof: Suppose we know we must do online learning for $m$ rounds. Then we can construct an $\epsilon_m$-cover $C$ of $(\mathcal{H}, \mathcal{Z}, f)$ such that $C \subseteq \mathcal{H}$ and $|C| = N(\mathcal{H}, \mathcal{Z}, f, \epsilon_m)$. From the previous theorem, we know that running Hedge on the set $C$ guarantees $\frac{1}{t} \sum_{i=1}^{t} \mathbb{E}_{h_i \sim P_{\theta_i}}[f(h_i, z_i)] - \inf_{h \in C} \frac{1}{t} \sum_{i=1}^{t} f(h, z_i) \le B\sqrt{2\log(N(\mathcal{H}, \mathcal{Z}, f, \epsilon_m))}\left[\frac{3}{\sqrt{t}} + \frac{\log(t)}{2t} + \frac{1+2\ln(2)}{2t}\right]$ for all $t \le m$. By definition of $C$, $\inf_{h \in C} \frac{1}{t} \sum_{i=1}^{t} f(h, z_i) \le \inf_{h \in \mathcal{H}} \frac{1}{t} \sum_{i=1}^{t} f(h, z_i) + \epsilon_m$ for all $t \le m$. So we conclude $\epsilon_{\text{regret}}(t) \le B\sqrt{2\log(N(\mathcal{H}, \mathcal{Z}, f, \epsilon_m))}\left[\frac{3}{\sqrt{t}} + \frac{\log(t)}{2t} + \frac{1+2\ln(2)}{2t}\right] + \epsilon_m$ for all $t \le m$.

This theorem applies to a large number of settings. For instance, suppose we have a problem where $f(\cdot, z)$ is $K$-Lipschitz continuous at all $z \in \mathcal{Z}$ with respect to some norm $\|\cdot\|$ on $\mathcal{H}$, and $\mathcal{H} \subseteq \mathbb{R}^d$ for some finite $d$ with bounded diameter $D$ under $\|\cdot\|$ (i.e. $\sup_{h, h' \in \mathcal{H}} \|h - h'\| \le D$). Then $N(\mathcal{H}, \mathcal{Z}, f, \epsilon)$ is $O(K(\frac{D}{\epsilon})^d)$ for all $\epsilon \ge 0$. Choosing $\epsilon_m = \frac{1}{m}$ implies we can achieve no regret at rate $\epsilon_{\text{regret}}(t) \le O\!\left(B\sqrt{\frac{\log(K) + d\log(mD)}{t}}\right)$ for all $t \le m$.

This notion also allows us to handle highly discontinuous loss functions. For instance, consider the case where $\mathcal{Z} = \mathcal{H} = \mathbb{R}$ and the loss $f(h, z) = 1 - I(h \in \mathbb{Q})I(z \in \mathbb{Q}) - I(h \notin \mathbb{Q})I(z \notin \mathbb{Q})$, i.e. the loss is 0 if $h$ and $z$ are both rational or both irrational, and 1 if one is rational and the other irrational. In this case, the set $C = \{1, \sqrt{2}\}$ is an $\epsilon$-cover of $(\mathcal{H}, \mathcal{Z}, f)$ for any $\epsilon > 0$, and thus we can achieve no regret at rate $O(\frac{1}{\sqrt{m}})$ by running Hedge on the set $C$.
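A small sketch of the covering construction in the Lipschitz example above (all concrete choices here are ours): for the 1-Lipschitz loss $f(h, z) = |h - z|$ on $\mathcal{H} = [0, 1]$, a grid of spacing $2\epsilon$ is an $\epsilon$-cover, and running exponential weights on the grid tracks the best hypothesis in hindsight up to $\epsilon$ plus the Hedge regret:

```python
import math

def make_cover(eps):
    """Grid cover of H = [0, 1]: spacing 2*eps covers every hypothesis within eps
    for any 1-Lipschitz loss in h."""
    n = int(math.ceil(1.0 / (2 * eps))) + 1
    return [min(1.0, 2 * eps * i) for i in range(n)]

def hedge_on_cover(stream, eps, eta):
    """Run exponential weights over the finite cover for f(h, z) = |h - z|;
    return the average expected loss over the stream."""
    experts = make_cover(eps)
    cum = [0.0] * len(experts)
    total = 0.0
    for z in stream:
        w = [math.exp(-eta * c) for c in cum]    # unnormalized Hedge weights
        s = sum(w)
        losses = [abs(h - z) for h in experts]
        total += sum(wi / s * li for wi, li in zip(w, losses))
        cum = [c + l for c, l in zip(cum, losses)]
    return total / len(stream)
```

On a stream concentrated at a single point, the best hypothesis in hindsight has average loss near 0, and the average loss of Hedge on the cover approaches it up to the $\epsilon_m$ and regret terms of Theorem 37.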
7 Conclusions and Open Questions

In this paper we have shown that popular online algorithms such as FTL, FTRL, Mirror Descent, gradient-based methods, and randomized algorithms like Weighted Majority and Hedge can all be analyzed purely in terms of stability properties of the underlying batch learning algorithm that picks the sequence of hypotheses (or distributions over hypotheses). In particular, we have introduced the notion of online stability, which is sufficient to guarantee online learnability in the general learning setting for the classes of RERM and RSLM algorithms. Our results allow us to relate a number of learnability results derived for the batch setting to the online setting.

There are a number of interesting open questions related to our work. First, it is still open whether, for the general class of always AERM algorithms (at $o(1)$ rate), online stability (at $o(1)$ rate) is sufficient to guarantee no regret, or whether a counter-example proves otherwise. The presented examples suggest that a problem is online learnable only if there exists a uniform-LOO stable or online stable (and always AERM) algorithm, or at least one with some form of LOO stability between online stability and all-i-LOO stability. This has been verified in the binary classification setting, where we have shown that online learnability is equivalent to the existence of a potentially randomized uniform-LOO stable RERM. While we have not been able to provide necessary conditions for online learnability in the general learning setting, we have shown that all problems with a sub-exponential covering are online learnable with a potentially randomized uniform-LOO stable RERM. An interesting open question is whether the notion of sub-exponential covering we have introduced turns out to be equivalent to online learnability in the general learning setting.
If this is the case, it would immediately establish that the existence of a (potentially randomized) uniform-LOO stable RERM is both sufficient and necessary for online learnability in the general learning setting.
