Journal of Machine Learning Research 1 (2000) 1-48. Submitted 4/00; Published 10/00.

Stability Bounds for Stationary ϕ-mixing and β-mixing Processes

Mehryar Mohri (MOHRI@CIMS.NYU.EDU)
Courant Institute of Mathematical Sciences and Google Research
251 Mercer Street, New York, NY 10012

Afshin Rostamizadeh (ROSTAMI@CS.NYU.EDU)
Department of Computer Science, Courant Institute of Mathematical Sciences
251 Mercer Street, New York, NY 10012

Editor: TBD

Abstract

Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, independently of any algorithm. In contrast, the notion of algorithmic stability can be used to derive tight generalization bounds that are tailored to specific learning algorithms by exploiting their particular properties. However, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed. In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence. This paper studies the scenario where the observations are drawn from a stationary ϕ-mixing or β-mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes that implies a dependence between observations weakening over time. We prove novel and distinct stability-based generalization bounds for stationary ϕ-mixing and β-mixing sequences. These bounds strictly generalize the bounds given in the i.i.d. case and apply to all stable learning algorithms, thereby extending the use of stability bounds to non-i.i.d. scenarios.
We also illustrate the application of our ϕ-mixing generalization bounds to general classes of learning algorithms, including Support Vector Regression, Kernel Ridge Regression, and Support Vector Machines, and many other kernel regularization-based and relative entropy-based regularization algorithms. These novel bounds can thus be viewed as the first theoretical basis for the use of these algorithms in non-i.i.d. scenarios.

Keywords: Mixing Distributions, Algorithmic Stability, Generalization Bounds, Machine Learning Theory

© 2000 Mehryar Mohri and Afshin Rostamizadeh.

1. Introduction

Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, such as the VC-dimension, covering numbers, or Rademacher complexity. These measures characterize a class of hypotheses, independently of any algorithm. In contrast, the notion of algorithmic stability can be used to derive bounds that are tailored to specific learning algorithms and exploit their particular properties. A learning algorithm is stable if the hypothesis it outputs varies in a limited way in response to small changes made to the training set. Algorithmic stability has been used effectively in the past to derive tight generalization bounds (Bousquet and Elisseeff, 2001, 2002).

But, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed (i.i.d.). In many machine learning applications, this assumption does not hold; in fact, the i.i.d. assumption is not tested or derived from any data analysis. The observations received by the learning algorithm often have some inherent temporal dependence. This is clear in system diagnosis or time series prediction problems.
Clearly, prices of different stocks on the same day, or of the same stock on different days, may be dependent. But a less apparent time dependency may affect data sampled in many other tasks as well.

This paper studies the scenario where the observations are drawn from a stationary ϕ-mixing or β-mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes that implies a dependence between observations weakening over time (Yu, 1994; Meir, 2000; Vidyasagar, 2003; Lozano et al., 2006). We prove novel and distinct stability-based generalization bounds for stationary ϕ-mixing and β-mixing sequences. These bounds strictly generalize the bounds given in the i.i.d. case and apply to all stable learning algorithms, thereby extending the usefulness of stability bounds to non-i.i.d. scenarios. Our proofs are based on the independent block technique described by Yu (1994) and attributed to Bernstein (1927), which is commonly used in such contexts. However, our analysis differs from previous uses of this technique in that the blocks of points considered are not of equal size.

For our analysis of stationary ϕ-mixing sequences, we make use of a generalized version of McDiarmid's inequality (Kontorovich and Ramanan, 2006) that holds for ϕ-mixing sequences. This leads to stability-based generalization bounds with the standard exponential form. Our generalization bounds for stationary β-mixing sequences cover a more general non-i.i.d. scenario and use the standard McDiarmid's inequality; however, unlike the ϕ-mixing case, the β-mixing bound presented here is not a purely exponential bound and contains an additive term depending on the mixing coefficient.
We also illustrate the application of our ϕ-mixing generalization bounds to general classes of learning algorithms, including Support Vector Regression (SVR) (Vapnik, 1998), Kernel Ridge Regression (Saunders et al., 1998), and Support Vector Machines (SVMs) (Cortes and Vapnik, 1995). Algorithms such as support vector regression (SVR) (Vapnik, 1998; Schölkopf and Smola, 2002) have been used in the context of time series prediction, in which the i.i.d. assumption does not hold, some with good experimental results (Müller et al., 1997; Mattera and Haykin, 1999). To our knowledge, the use of these algorithms in non-i.i.d. scenarios has not been previously supported by any theoretical analysis. The stability bounds we give for SVR, SVMs, and many other kernel regularization-based and relative entropy-based regularization algorithms can thus be viewed as the first theoretical basis for their use in such scenarios.

The following sections are organized as follows. In Section 2, we introduce the necessary definitions for the non-i.i.d. problems that we are considering and discuss the learning scenarios in that context. Section 3 gives our main generalization bounds for stationary ϕ-mixing sequences based on stability, as well as the illustration of their applications to general kernel regularization-based algorithms, including SVR, KRR, and SVMs, as well as to relative entropy-based regularization algorithms. Finally, Section 4 presents the first known stability bounds for the more general stationary β-mixing scenario.

2. Preliminaries

We first introduce some standard definitions for dependent observations in mixing theory (Doukhan, 1994) and then briefly discuss the learning scenarios in the non-i.i.d. case.

2.1 Non-i.i.d.
Definitions

Definition 1 A sequence of random variables Z = {Z_t}_{t=−∞}^{∞} is said to be stationary if for any t and non-negative integers m and k, the random vectors (Z_t, ..., Z_{t+m}) and (Z_{t+k}, ..., Z_{t+m+k}) have the same distribution.

Thus, the index t, or time, does not affect the distribution of a variable Z_t in a stationary sequence. This does not imply independence, however. In particular, for i < j < k, Pr[Z_j | Z_i] may not equal Pr[Z_k | Z_i]. The following is a standard definition giving a measure of the dependence of the random variables Z_t within a stationary sequence. There are several equivalent definitions of this quantity; we are adopting here that of (Yu, 1994).

Definition 2 Let Z = {Z_t}_{t=−∞}^{∞} be a stationary sequence of random variables. For any i, j ∈ Z ∪ {−∞, +∞}, let σ_i^j denote the σ-algebra generated by the random variables Z_k, i ≤ k ≤ j. Then, for any positive integer k, the β-mixing and ϕ-mixing coefficients of the stochastic process Z are defined as

    β(k) = sup_n E_{B ∈ σ_{−∞}^n} [ sup_{A ∈ σ_{n+k}^∞} | Pr[A | B] − Pr[A] | ],
    ϕ(k) = sup_{n, A ∈ σ_{n+k}^∞, B ∈ σ_{−∞}^n} | Pr[A | B] − Pr[A] |.    (1)

Z is said to be β-mixing (ϕ-mixing) if β(k) → 0 (resp. ϕ(k) → 0) as k → ∞. It is said to be algebraically β-mixing (algebraically ϕ-mixing) if there exist real numbers β_0 > 0 (resp. ϕ_0 > 0) and r > 0 such that β(k) ≤ β_0 / k^r (resp. ϕ(k) ≤ ϕ_0 / k^r) for all k, and exponentially mixing if there exist real numbers β_0 (resp. ϕ_0 > 0) and β_1 (resp. ϕ_1 > 0) such that β(k) ≤ β_0 exp(−β_1 k^r) (resp. ϕ(k) ≤ ϕ_0 exp(−ϕ_1 k^r)) for all k.

Both β(k) and ϕ(k) measure the dependence of an event on those that occurred more than k units of time in the past. β-mixing is a weaker assumption than ϕ-mixing and thus covers a more general non-i.i.d. scenario.
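To make Definition 2 concrete, the mixing coefficients can be computed in closed form for simple processes. For a stationary finite-state Markov chain, conditioning on the past σ-algebra σ_{−∞}^n reduces, by the Markov property, to conditioning on Z_n, and ϕ(k) reduces to the worst-case total-variation distance between the k-step law P^k(b, ·) and the stationary law π. The sketch below is our own illustration, not part of the paper; the two-state transition matrix P is a hypothetical choice.

```python
import numpy as np

# Hypothetical two-state Markov chain (our choice, not from the paper).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Stationary distribution pi solves pi P = pi; for this chain pi = (2/3, 1/3).
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi = pi / pi.sum()

def phi_coeff(k: int) -> float:
    """Uniform (phi-)mixing coefficient of the chain: the maximum over
    starting states b of the total-variation distance between the k-step
    law P^k(b, .) and the stationary law pi."""
    Pk = np.linalg.matrix_power(P, k)
    return max(0.5 * np.abs(Pk[b] - pi).sum() for b in range(P.shape[0]))

# This chain is exponentially mixing: phi(k) decays like |lambda_2|^k = 0.7^k.
values = [phi_coeff(k) for k in (1, 5, 20)]
```

An algebraically mixing process would instead satisfy only the slower decay ϕ(k) ≤ ϕ_0 k^{−r}; the chain above happens to mix exponentially, the stronger of the two regimes defined above.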
This paper gives stability-based generalization bounds both in the ϕ-mixing and the β-mixing case. The β-mixing bounds cover a more general case, of course; however, the ϕ-mixing bounds are simpler and admit the standard exponential form. The ϕ-mixing bounds are based on a concentration inequality that applies to ϕ-mixing processes only. Except for the use of this concentration bound, all of the intermediate proofs and results needed to derive a ϕ-mixing bound in Section 3 are given in the more general case of β-mixing sequences.

It has been argued by Vidyasagar (2003) that β-mixing is "just the right" assumption for the analysis of weakly-dependent sample points in machine learning, in particular because several PAC-learning results then carry over to the non-i.i.d. case. Our β-mixing generalization bounds further contribute to the analysis of this scenario.[1] We describe in several instances the application of our bounds in the case of algebraic mixing. Algebraic mixing is a standard assumption on mixing coefficients that has been adopted in previous studies of learning in the presence of dependent observations (Yu, 1994; Meir, 2000; Vidyasagar, 2003; Lozano et al., 2006).

Let us also point out that mixing assumptions can be checked in some cases, such as with Gaussian or Markov processes (Meir, 2000), and that mixing parameters can also be estimated in such cases.

Most previous studies use a technique originally introduced by Bernstein (1927) based on independent blocks of equal size (Yu, 1994; Meir, 2000; Lozano et al., 2006). This technique is particularly relevant when dealing with stationary β-mixing. We will need a related but somewhat different technique, since the blocks we consider may not have the same size. The following lemma is a special case of Corollary 2.7 from (Yu, 1994).
Lemma 3 (Yu (1994), Corollary 2.7) Let µ ≥ 1 and suppose that h is a measurable function, with absolute value bounded by M, on a product probability space (∏_{j=1}^{µ} Ω_j, ∏_{i=1}^{µ} σ_{r_i}^{s_i}), where r_i ≤ s_i ≤ r_{i+1} for all i. Let Q be a probability measure on the product space with marginal measures Q_i on (Ω_i, σ_{r_i}^{s_i}), and let Q^{i+1} be the marginal measure of Q on (∏_{j=1}^{i+1} Ω_j, ∏_{j=1}^{i+1} σ_{r_j}^{s_j}), i = 1, ..., µ − 1. Let β(Q) = sup_{1 ≤ i ≤ µ−1} β(k_i), where k_i = r_{i+1} − s_i, and let P = ∏_{i=1}^{µ} Q_i. Then,

    | E_Q[h] − E_P[h] | ≤ (µ − 1) M β(Q).    (2)

The lemma gives a measure of the difference between the distribution of µ blocks where the blocks are independent in one case and dependent in the other case. The distribution within each block is assumed to be the same in both cases. For a monotonically decreasing function β, we have β(Q) = β(k*), where k* = min_i(k_i) is the smallest gap between blocks.

2.2 Learning Scenarios

We consider the familiar supervised learning setting where the learning algorithm receives a sample of m labeled points S = (z_1, ..., z_m) = ((x_1, y_1), ..., (x_m, y_m)) ∈ (X × Y)^m, where X is the input space and Y the set of labels (Y = R in the regression case), both assumed to be measurable. For a fixed learning algorithm, we denote by h_S the hypothesis it returns when trained on the sample S.

The error of a hypothesis on a pair z ∈ X × Y is measured in terms of a cost function c: Y × Y → R_+. Thus, c(h(x), y) measures the error of a hypothesis h on a pair (x, y); c(h(x), y) = (h(x) − y)² in the standard regression cases. We will use the shorthand c(h, z) := c(h(x), y) for a hypothesis h and z = (x, y) ∈ X × Y and will assume that c is upper bounded by a constant M > 0.
[1] Some results have also been obtained in the more general context of α-mixing, but they seem to require the stronger condition of exponential mixing (Modha and Masry, 1998).

We denote by R̂(h) the empirical error of a hypothesis h for a training sample S = (z_1, ..., z_m):

    R̂(h) = (1/m) ∑_{i=1}^{m} c(h, z_i).    (3)

In the standard machine learning scenario, the sample pairs z_1, ..., z_m are assumed to be i.i.d., a restrictive assumption that does not always hold in practice. We will consider here the more general case of dependent samples drawn from a stationary mixing sequence Z over X × Y. As in the i.i.d. case, the objective of the learning algorithm is to select a hypothesis with small error over future samples. But, here, we must distinguish two versions of this problem.

In the most general version, future samples depend on the training sample S and thus the generalization error or true error of the hypothesis h_S trained on S must be measured by its expected error conditioned on the sample S:

    R(h_S) = E_z[c(h_S, z) | S].    (4)

This is the most realistic setting in this context, which matches time series prediction problems. A somewhat less realistic version is one where the samples are dependent, but the test points are assumed to be independent of the training sample S. The generalization error of the hypothesis h_S trained on S is then:

    R(h_S) = E_z[c(h_S, z) | S] = E_z[c(h_S, z)].    (5)

This setting seems less natural since, if samples are dependent, future test points must also depend on the training points, even if that dependence is relatively weak due to the time interval after which test points are drawn.
Nevertheless, it is this somewhat less realistic setting that has been studied by all previous machine learning studies that we are aware of (Yu, 1994; Meir, 2000; Vidyasagar, 2003; Lozano et al., 2006), even when examining specifically a time series prediction problem (Meir, 2000). Thus, the bounds derived in these studies cannot be directly applied to the more general setting. We will consider instead the most general setting, with the definition of the generalization error based on Eq. 4. Clearly, our analysis also applies to the less general setting just discussed as well.

Let us briefly discuss the more general scenario of non-stationary mixing sequences, that is, one where the distribution may change over time. Within that general case, the generalization error of a hypothesis h_S, defined straightforwardly by

    R(h_S, t) = E_{z_t ∼ σ_t^t}[c(h_S, z_t) | S],    (6)

would depend on the time t, and it may be the case that R(h_S, t) ≠ R(h_S, t′) for t ≠ t′, making the definition of the generalization error a more subtle issue. To remove the dependence on time, one could define a weaker notion of the generalization error based on an expected loss over all time:

    R(h_S) = E_t[R(h_S, t)].    (7)

It is not clear, however, whether this term could be easily computed and useful. A stronger condition would be to minimize the generalization error for any particular target time. Studies of this type have been conducted for smoothly changing distributions, such as in Zhou et al. (2008); however, to the best of our knowledge, the scenario of both non-identical and non-independent sequences has not yet been studied.

3. ϕ-Mixing Generalization Bounds and Applications

This section gives generalization bounds for β̂-stable[2] algorithms over a mixing stationary distribution.
The first two sections present our main proofs, which hold for β-mixing stationary distributions. In the third section, we will briefly discuss concentration inequalities that apply to ϕ-mixing processes only. Then, in the final section, we will present our main results.

The condition of β̂-stability is an algorithm-dependent property first introduced by Devroye and Wagner (1979) and Kearns and Ron (1997). It was later used successfully by Bousquet and Elisseeff (2001, 2002) to show algorithm-specific stability bounds for i.i.d. samples. Roughly speaking, a learning algorithm is said to be stable if small changes to the training set do not produce large deviations in its output. The following gives the precise technical definition.

Definition 4 A learning algorithm is said to be (uniformly) β̂-stable if the hypotheses it returns for any two training samples S and S′ that differ by a single point satisfy

    ∀z ∈ X × Y, |c(h_S, z) − c(h_{S′}, z)| ≤ β̂.    (8)

The use of stability in conjunction with McDiarmid's inequality will allow us to produce generalization bounds. McDiarmid's inequality is an exponential concentration bound of the type

    Pr[|Φ − E[Φ]| ≥ ε] ≤ exp(−ε² / (m l²)),

where the probability is over a sample of size m and l is the Lipschitz parameter of Φ (which is also a function of m). Unfortunately, this inequality cannot be easily applied when the sample points are not distributed in an i.i.d. fashion. We will use the results of Kontorovich and Ramanan (2006) to extend the use of McDiarmid's inequality to general mixing distributions (Theorem 9). To obtain a stability-based generalization bound, we will apply this theorem to Φ(S) = R(h_S) − R̂(h_S). To do so, we need to show, as with the standard McDiarmid's inequality, that Φ is a Lipschitz function and, to make it useful, bound E[Φ].
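As a toy numerical check of Definition 4 (our own illustration, not from the paper), consider the simple regularized algorithm returning the constant hypothesis h_S = argmin_h (1/m) ∑_i (h − y_i)² + λh², whose closed-form solution is h_S = ȳ/(1 + λ). Replacing a single training point moves h_S by at most 2 max|y| / (m(1 + λ)), so the algorithm is uniformly β̂-stable with β̂ = O(1/m). The sketch below probes sup_z |c(h_S, z) − c(h_{S′}, z)| on a grid:

```python
import numpy as np

rng = np.random.default_rng(0)

def train(y, lam=1.0):
    # Constant-hypothesis ridge: argmin_h (1/m) sum_i (h - y_i)^2 + lam * h^2,
    # whose closed-form minimizer is mean(y) / (1 + lam).
    return y.mean() / (1.0 + lam)

def cost(h, y):
    # Quadratic cost c(h, z) = (h(x) - y)^2; the hypothesis here is constant.
    return (h - y) ** 2

m, lam = 100, 1.0
y = rng.uniform(-1.0, 1.0, size=m)
y_prime = y.copy()
y_prime[0] = rng.uniform(-1.0, 1.0)   # S and S' differ in a single point

h_S, h_Sp = train(y, lam), train(y_prime, lam)

# beta_hat = sup_z |c(h_S, z) - c(h_S', z)|, probed on a grid of test labels.
grid = np.linspace(-1.0, 1.0, 201)
beta_hat = float(np.max(np.abs(cost(h_S, grid) - cost(h_Sp, grid))))

# Since |h_S - h_S'| <= 2/(m(1+lam)), |h| <= 1/(1+lam), and |y| <= 1, we get
# beta_hat <= |h_S - h_S'| * sup|h_S + h_S' - 2y| <= (2/(m(1+lam))) * 3.
```

Making β̂ explicit as a function of m is exactly what is needed later, when the bounds require β̂ = o(1/√m) to converge.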
The next two sections describe how we achieve both of these in this non-i.i.d. scenario.

Let us first take a brief look at the problem faced when attempting to give stability bounds for dependent sequences and give some idea of our solution for that problem. The stability proofs given by Bousquet and Elisseeff (2001) assume the i.i.d. property; thus, replacing an element in a sequence with another does not affect the expected value of a random variable defined over that sequence. In other words, the following equality holds:

    E_S[V(Z_1, ..., Z_i, ..., Z_m)] = E_{S,Z′}[V(Z_1, ..., Z′, ..., Z_m)],    (9)

for a random variable V that is a function of the sequence of random variables S = (Z_1, ..., Z_m). However, clearly, if the points in that sequence S are dependent, this equality may not hold anymore.

The main technique to cope with this problem is based on the so-called "independent block sequence" originally introduced by Bernstein (1927). This consists of eliminating from the original dependent sequence several blocks of contiguous points, leaving us with some remaining blocks of points. Instead of these dependent blocks, we then consider independent blocks of points, each with the same size and the same distribution (within each block) as the dependent ones. By Lemma 3, for a β-mixing distribution, the expected value of a random variable defined over the dependent blocks is close to the one based on these independent blocks. Working with these independent blocks brings us back to a situation similar to the i.i.d. case, with i.i.d. blocks replacing i.i.d. points.

[2] The standard variable used for the stability coefficient is β. To avoid confusion with the β-mixing coefficient, we will use β̂ instead.
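The construction can be made concrete with a small helper (our own illustrative sketch; the proofs in this paper use as few as three or four blocks of typically unequal size, whereas this helper produces the classical equal-size layout of Yu (1994) and Meir (2000)). Kept blocks separated by discarded gaps of `gap` points incur, by Lemma 3, a penalty of at most (µ − 1) M β(gap) when treated as independent:

```python
from typing import List, Tuple

def independent_blocks(m: int, block_size: int, gap: int) -> Tuple[List[range], List[range]]:
    """Split indices 0..m-1 into kept blocks of `block_size` contiguous points
    separated by discarded gaps of `gap` points.  For a beta-mixing sequence,
    treating the mu kept blocks as independent costs at most
    (mu - 1) * M * beta(gap) by Lemma 3, since consecutive kept blocks are
    `gap` time steps apart."""
    kept, dropped = [], []
    i = 0
    while i < m:
        kept.append(range(i, min(i + block_size, m)))
        i += block_size
        if i < m:
            dropped.append(range(i, min(i + gap, m)))
            i += gap
    return kept, dropped

kept, dropped = independent_blocks(m=20, block_size=4, gap=2)
# kept:    [0..3], [6..9], [12..15], [18..19]   (mu = 4 blocks)
# dropped: [4..5], [10..11], [16..17]           (points sacrificed for independence)
```

Larger gaps shrink the β(gap) penalty but sacrifice more points, which is precisely the trade-off the analysis must balance.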
Our use of this method somewhat differs from previous ones (see Yu, 1994; Meir, 2000), where many blocks of equal size are considered. We will be dealing with four blocks of typically unequal sizes. More specifically, note that for Equation 9 to hold, we only need the variable Z_i to be independent of the other points in the sequence. To achieve this, roughly speaking, we will "discard" some of the points in the sequence surrounding Z_i. This results in a sequence of three blocks of contiguous points. If our algorithm is stable and we do not discard too many points, the hypothesis returned should not be greatly affected by this operation. In the next step, we apply the independent block lemma, which then allows us to treat each of these blocks as independent, modulo the addition of a mixing term. In particular, Z_i becomes independent of all other points. Clearly, the number of points discarded is subject to a trade-off: removing too many points could excessively modify the hypothesis returned; removing too few would maintain the dependency between Z_i and the remaining points, thereby producing a larger penalty when applying Lemma 3. This trade-off is made explicit in the following section, where an optimal solution is sought.

3.1 Lipschitz Bound

As discussed in Section 2.2, in the most general scenario, test points depend on the training sample. We first present a lemma that relates the expected value of the generalization error in that scenario to the same expectation in the scenario where the test point is independent of the training sample. We denote by R(h_S) = E_z[c(h_S, z) | S] the expectation in the dependent case and by R̃(h_{S_b}) = E_{z̃}[c(h_{S_b}, z̃)] the expectation where the test points are assumed independent of the training, with S_b denoting a sequence similar to S but with the last b points removed.
Figure 1(a) illustrates that sequence. The block S_b is assumed to have exactly the same distribution as the corresponding block of the same size in S.

Figure 1: Illustration of the sequences derived from S that are considered in the proofs: (a) S_b, (b) S^i, (c) S_{i,b}, (d) S̃^i_{i,b}.

Lemma 5 Assume that the learning algorithm is β̂-stable and that the cost function c is bounded by M. Then, for any sample S of size m drawn from a β-mixing stationary distribution and for any b ∈ {0, ..., m}, the following holds:

    | E_S[R(h_S)] − E_S[R̃(h_{S_b})] | ≤ b β̂ + β(b) M.    (10)

Proof The β̂-stability of the learning algorithm implies that

    E_S[R(h_S)] = E_{S,z}[c(h_S, z)] ≤ E_{S,z}[c(h_{S_b}, z)] + b β̂.    (11)

The application of Lemma 3 yields

    E_S[R(h_S)] ≤ E_{S,z̃}[c(h_{S_b}, z̃)] + b β̂ + β(b) M = E_S[R̃(h_{S_b})] + b β̂ + β(b) M.    (12)

The other side of the inequality of the lemma can be shown following the same steps.

We can now prove a Lipschitz bound for the function Φ.

Lemma 6 Let S = (z_1, ..., z_i, ..., z_m) and S^i = (z_1, ..., z′_i, ..., z_m) be two sequences drawn from a β-mixing stationary process that differ only in point i ∈ [1, m], and let h_S and h_{S^i} be the hypotheses returned by a β̂-stable algorithm when trained on each of these samples. Then, for any i ∈ [1, m], the following inequality holds:

    |Φ(S) − Φ(S^i)| ≤ (b + 1) 2β̂ + 2β(b) M + M/m.    (13)

Proof To prove this inequality, we first bound the difference of the empirical errors as in (Bousquet and Elisseeff, 2002), then the difference of the true errors.
Bounding the difference of costs on agreeing points with β̂ and on the one point that disagrees with M yields

    | R̂(h_S) − R̂(h_{S^i}) | ≤ (1/m) ∑_{j ≠ i} | c(h_S, z_j) − c(h_{S^i}, z_j) | + (1/m) | c(h_S, z_i) − c(h_{S^i}, z′_i) |    (14)
                              ≤ β̂ + M/m.

Since both R(h_S) and R(h_{S^i}) are defined with respect to a (different) dependent point, we apply Lemma 5 to both generalization error terms and use β̂-stability. This then results in

    | R(h_S) − R(h_{S^i}) | ≤ | R̃(h_{S_b}) − R̃(h_{S^i_b}) | + 2b β̂ + 2β(b) M    (15)
                             = | E_{z̃}[c(h_{S_b}, z̃) − c(h_{S^i_b}, z̃)] | + 2b β̂ + 2β(b) M
                             ≤ β̂ + 2b β̂ + 2β(b) M.

The lemma's statement is obtained by combining inequalities 14 and 15.

3.2 Bound on Expectation

As mentioned earlier, to obtain an explicit bound after application of a generalized McDiarmid's inequality, we also need to bound E_S[Φ(S)]. This is done by analyzing independent blocks using Lemma 3.

Lemma 7 Let h_S be the hypothesis returned by a β̂-stable algorithm trained on a sample S drawn from a stationary β-mixing distribution. Then, for all b ∈ [1, m], the following inequality holds:

    E_S[|Φ(S)|] ≤ (6b + 1) β̂ + 3β(b) M.    (16)

Proof Let S_b be defined as in the proof of Lemma 5. To deal with independent block sequences defined with respect to the same hypothesis, we will consider the sequence S_{i,b} = S^i ∩ S_b, which is illustrated by Figure 1(c). This can result in as many as four blocks. As before, we will consider a sequence S̃_{i,b} with a similar set of blocks, each with the same distribution as the corresponding blocks in S_{i,b}, but such that the blocks are independent.
Since three blocks of at most b points are removed from each hypothesis, by the β̂-stability of the learning algorithm, the following holds:

    E_S[Φ(S)] = E_S[R̂(h_S) − R(h_S)] = E_{S,z}[ (1/m) ∑_{i=1}^{m} c(h_S, z_i) − c(h_S, z) ]    (17)
              ≤ E_{S_{i,b},z}[ (1/m) ∑_{i=1}^{m} c(h_{S_{i,b}}, z_i) − c(h_{S_{i,b}}, z) ] + 6b β̂.    (18)

The application of Lemma 3 to the difference of two cost functions, also bounded by M, as in the right-hand side, leads to

    E_S[Φ(S)] ≤ E_{S̃_{i,b},z̃}[ (1/m) ∑_{i=1}^{m} c(h_{S̃_{i,b}}, z̃_i) − c(h_{S̃_{i,b}}, z̃) ] + 6b β̂ + 3β(b) M.    (19)

Now, since the points z̃ and z̃_i are independent and since the distribution is stationary, they have the same distribution and we can replace z̃_i with z̃ in the empirical cost. Thus, we can write

    E_S[Φ(S)] ≤ E_{S̃_{i,b},z̃}[ (1/m) ∑_{i=1}^{m} c(h_{S̃^i_{i,b}}, z̃) − c(h_{S̃_{i,b}}, z̃) ] + 6b β̂ + 3β(b) M ≤ β̂ + 6b β̂ + 3β(b) M,

where S̃^i_{i,b} is the sequence derived from S̃_{i,b} by replacing z̃_i with z̃. The last inequality holds by the β̂-stability of the learning algorithm. The other side of the inequality in the statement of the lemma can be shown following the same steps.

3.3 ϕ-mixing Generalization Bounds

We are now prepared to make use of a concentration inequality to provide a generalization bound in the ϕ-mixing scenario. Several concentration inequalities have been shown in the ϕ-mixing case, e.g., Marton (1998); Samson (2000); Chazottes et al. (2007); Kontorovich and Ramanan (2006). We will use that of Kontorovich and Ramanan (2006), which is very similar to that of Chazottes et al. (2007) modulo the fact that the latter requires a finite sample space. These concentration inequalities are generalizations of the following inequality of McDiarmid (1989), commonly used in the i.i.d. setting.

Theorem 8 (McDiarmid (1989), 6.10) Let S = (Z_1, ..., Z_m) be a sequence of random variables, each taking values in the set Z, and let Φ: Z^m → R be a measurable function satisfying, for all i ∈ {1, ..., m} and all z_1, ..., z_i, z′_i ∈ Z,

    | E_S[Φ(S) | Z_1 = z_1, ..., Z_i = z_i] − E_S[Φ(S) | Z_1 = z_1, ..., Z_i = z′_i] | ≤ c_i,

for constants c_i. Then, for all ε > 0,

    Pr[ |Φ − E[Φ]| ≥ ε ] ≤ 2 exp( −2ε² / ∑_{i=1}^{m} c_i² ).

In the i.i.d. scenario, the requirement to produce the constants c_i simply translates into a Lipschitz condition on the function Φ. Theorem 5.1 of Kontorovich and Ramanan (2006) bounds precisely this quantity as follows:[3]

    c_i ≤ 1 + 2 ∑_{k=1}^{m−i} ϕ(k).    (20)

Given the bound in Equation 20, the concentration bound of McDiarmid can be restated as follows, making it easily accessible to ϕ-mixing distributions.

Theorem 9 (Kontorovich and Ramanan (2006)) Let Φ: Z^m → R be a measurable function. If Φ is l-Lipschitz with respect to the Hamming metric for some l > 0, then the following holds for all ε > 0:

    Pr_Z[ |Φ(Z) − E[Φ(Z)]| > ε ] ≤ 2 exp( −2ε² / (m l² ‖Δ_m‖²_∞) ),    (21)

where ‖Δ_m‖_∞ ≤ 1 + 2 ∑_{k=1}^{m} ϕ(k).

It should be pointed out that the statement of the theorem in this paper is improved by a factor of 4 in the exponent from the one stated in Kontorovich and Ramanan (2006), Theorem 1.1. This can be achieved straightforwardly by following the same steps as in the proof by Kontorovich and Ramanan (2006) and making use of the general form of McDiarmid's inequality (Theorem 8) as opposed to Azuma's inequality.

This section presents several theorems that constitute the main results of this paper. The following theorem is constructed from the bounds shown in the previous three sections.

Theorem 10 (General Non-i.i.d.
Stability Bound) Let h S denote the hypothe sis ret urned by a ˆ β - stable algorithm train ed on a sample S drawn fr om a ϕ -mixing stationary distrib ution and let c be a m easur able non-ne gative cost functi on upper bo unded by M > 0 , then f or a ny b ∈ [0 , m ] and any ǫ > 0 , the following gene ral izatio n bou nd holds Pr S h R ( h S ) − b R ( h S ) > ǫ + (6 b + 1) ˆ β + 6 M ϕ ( b ) i ≤ 2 exp − 2 ǫ 2 (1 + 2 P m i =1 ϕ ( i )) − 2 m (( b + 1)2 ˆ β + 2 M ϕ ( b ) + M /m ) 2 ! . 3. W e should note that original bound is expressed in terms of η -mi xing coefficients. T o si mplify presentation, we are adapting it to the case of stati onary ϕ -mixing sequences by using the follo wing straightforward ineq uality for a stationary process: 2 ϕ ( j − i ) ≥ η ij . Furthermore, the bound presented in K ontorovich an d Ramana n (200 6) hold s when the sample space is countable, it is ex tended to the continuous case in K ontoro vich (2007). 10 S TA B I L I T Y B O U N D S F O R N O N - I . I . D . P R O C E S S E S Pro of The theorem follo ws directly the applica tion of Lemma 6 and Lemma 7 to Theorem 9. The theorem giv es a general stability bound for ϕ -mixing stationary sequences. If we further as- sume that the sequenc e is algebraicall y ϕ -mixing , that is for all k , ϕ ( k ) = ϕ 0 k − r for some r > 1 , then we can solv e for the valu e of b to optimize the bound . Theor em 11 (Non-i.i.d. Stability Bound fo r Algebraica lly M ixing Sequences) Let h S denote the hypoth esis re turned by a ˆ β -sta ble algorithm train ed on a sample S dr awn fr om an algebr aically ϕ - mixing station ary distrib ution, ϕ ( k ) = ϕ 0 k − r with r > 1 and let c be a measu rab le non-ne gative cost function upper bound ed by M > 0 , then for any ǫ > 0 , the following gener alization boun d holds Pr S h R ( h S ) − b R ( h S ) > ǫ + ˆ β + ( r + 1)6 M ϕ ( b ) i ≤ 2 exp − 2 ǫ 2 (1 + 2 ϕ 0 r/ ( r − 1 )) − 2 m (2 ˆ β + ( r + 1 )2 M ϕ ( b ) + M /m ) 2 ! 
, where ϕ(b) = ϕ₀ (β̂/(rϕ₀M))^{r/(r+1)}.

Proof For an algebraically mixing sequence, the value of b minimizing the bound of Theorem 10 satisfies β̂b = rMϕ(b), which gives

\[
b = \left( \frac{\hat{\beta}}{r \varphi_0 M} \right)^{-1/(r+1)} \quad \text{and} \quad \varphi(b) = \varphi_0 \left( \frac{\hat{\beta}}{r \varphi_0 M} \right)^{r/(r+1)}.
\]

The following term can be bounded as

\[
1 + 2 \sum_{i=1}^m \varphi(i) = 1 + 2 \sum_{i=1}^m \varphi_0 i^{-r} \leq 1 + 2\varphi_0 \left( 1 + \int_1^m x^{-r}\, dx \right) = 1 + 2\varphi_0 \left( 1 + \frac{m^{1-r} - 1}{1 - r} \right).
\]

Using the assumption r > 1, we upper bound (m^{1−r} − 1)/(1 − r) = (1 − m^{1−r})/(r − 1) by 1/(r − 1) and find that

\[
1 + 2\varphi_0 \left( 1 + \frac{m^{1-r} - 1}{1 - r} \right) \leq 1 + 2\varphi_0 \left( 1 + \frac{1}{r - 1} \right) = 1 + 2\varphi_0 \frac{r}{r - 1}.
\]

Plugging this value and the minimizing value of b into the bound of Theorem 10 yields the statement of the theorem.

In the case of a zero mixing coefficient (ϕ = 0 and b = 0), the bounds of Theorem 10 coincide with the i.i.d. stability bound of Bousquet and Elisseeff (2002). In order for the right-hand side of these bounds to converge, we must have β̂ = o(1/√m) and ϕ(b) = o(1/√m). For several general classes of algorithms, β̂ ≤ O(1/m) (Bousquet and Elisseeff, 2002). In the case of algebraically mixing sequences with r > 1, as assumed in Theorem 11, β̂ ≤ O(1/m) implies ϕ(b) = ϕ₀(β̂/(rϕ₀M))^{r/(r+1)} < O(1/√m). The next section illustrates the application of Theorem 11 to several general classes of algorithms.

We now present the application of our stability bounds to several algorithms in the case of an algebraically mixing sequence. We make use of the stability analysis found in Bousquet and Elisseeff (2002), which allows us to apply our bounds to kernel regularized algorithms, k-local rules, and relative entropy regularization.

3.4 Applications

3.4.
1 Kernel Regularized Algorithms

Here we apply our bounds to a family of algorithms based on the minimization of a regularized objective function defined with the norm ‖·‖_K of a reproducing kernel Hilbert space, where K is a positive definite symmetric kernel:

\[
\operatorname*{argmin}_{h \in H}\; \frac{1}{m} \sum_{i=1}^m c(h, z_i) + \lambda \|h\|_K^2. \tag{22}
\]

The application of our bound is possible, under some general conditions, since kernel regularized algorithms are stable with β̂ ≤ O(1/m) (Bousquet and Elisseeff, 2002). Here we briefly reproduce the proof of this β̂-stability for the sake of completeness; first, we introduce some needed terminology.

We will assume that the cost function c is σ-admissible, that is, there exists σ ∈ R₊ such that for any two hypotheses h, h′ ∈ H and for all z = (x, y) ∈ X × Y,

\[
|c(h, z) - c(h', z)| \leq \sigma\, |h(x) - h'(x)|. \tag{23}
\]

This assumption holds for the quadratic cost and most other cost functions when the hypothesis set and the set of output labels are bounded by some M ∈ R₊: ∀h ∈ H, ∀x ∈ X, |h(x)| ≤ M and ∀y ∈ Y, |y| ≤ M. We will also assume that c is differentiable. This assumption is in fact not necessary, and all of our results hold without it, but it makes the presentation simpler.

We denote by B_F the Bregman divergence associated to a convex function F:

\[
B_F(f \,\|\, g) = F(f) - F(g) - \langle f - g, \nabla F(g) \rangle.
\]

In what follows, it will be helpful to define F as the objective function of a general regularization-based algorithm,

\[
F_S(h) = \widehat{R}_S(h) + \lambda N(h), \tag{24}
\]

where R̂_S is the empirical error as measured on the sample S, N: H → R₊ is a regularization function, and λ > 0 is the usual trade-off parameter. Finally, we shall use the shorthand Δh = h′ − h.
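The Bregman divergence just defined can be checked concretely in finite dimension. The sketch below (our own illustration, not from the paper; the helper names are ours) verifies the identity used in the next proof: for N(h) = ‖h‖², B_N(h′‖h) = ‖h′ − h‖², so the symmetrized sum equals 2‖Δh‖².

```python
# Finite-dimensional sanity check of the Bregman divergence identity:
# for N(h) = ||h||^2, B_N(h'||h) + B_N(h||h') = 2 * ||h' - h||^2.

import numpy as np

def bregman(F, grad_F, f, g):
    """B_F(f || g) = F(f) - F(g) - <f - g, grad F(g)>."""
    return F(f) - F(g) - np.dot(f - g, grad_F(g))

N = lambda h: float(np.dot(h, h))   # N(h) = ||h||^2
grad_N = lambda h: 2.0 * h          # gradient of the squared norm

rng = np.random.default_rng(1)
h, hp = rng.normal(size=5), rng.normal(size=5)
sym = bregman(N, grad_N, hp, h) + bregman(N, grad_N, h, hp)
assert abs(sym - 2 * N(hp - h)) < 1e-9
```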
Lemma 12 (Bousquet and Elisseeff (2002)) A kernel regularized learning algorithm (22), with bounded kernel K(x, x) ≤ κ² < ∞ and σ-admissible cost function, is β̂-stable with coefficient

\[
\hat{\beta} \leq \frac{\sigma^2 \kappa^2}{m \lambda}.
\]

Proof Let h and h′ be the minimizers of F_S and F_{S′} respectively, where S and S′ differ in the first coordinate (the choice of coordinate is without loss of generality). Then

\[
B_N(h' \,\|\, h) + B_N(h \,\|\, h') \leq \frac{2\sigma}{m\lambda} \sup_{x \in S} |\Delta h(x)|. \tag{25}
\]

To see this, we notice that since B_F = B_{\widehat{R}} + λ B_N, and since a Bregman divergence is non-negative,

\[
\lambda \big( B_N(h' \,\|\, h) + B_N(h \,\|\, h') \big) \leq B_{F_S}(h' \,\|\, h) + B_{F_{S'}}(h \,\|\, h').
\]

By the definition of h and h′ as the minimizers of F_S and F_{S′},

\[
B_{F_S}(h' \,\|\, h) + B_{F_{S'}}(h \,\|\, h') = \widehat{R}_S(h') - \widehat{R}_S(h) + \widehat{R}_{S'}(h) - \widehat{R}_{S'}(h').
\]

Finally, by the σ-admissibility of the cost function c and the definition of S and S′,

\[
\begin{aligned}
\lambda \big( B_N(h' \,\|\, h) + B_N(h \,\|\, h') \big)
&\leq \widehat{R}_S(h') - \widehat{R}_S(h) + \widehat{R}_{S'}(h) - \widehat{R}_{S'}(h') \\
&= \frac{1}{m} \big( c(h', z_1) - c(h, z_1) + c(h, z'_1) - c(h', z'_1) \big) \\
&\leq \frac{1}{m} \big( \sigma |\Delta h(x_1)| + \sigma |\Delta h(x'_1)| \big) \leq \frac{2\sigma}{m} \sup_{x \in S} |\Delta h(x)|,
\end{aligned}
\]

which establishes (25). Now, if we consider N(·) = ‖·‖²_K, we have B_N(h′‖h) = ‖h′ − h‖²_K; thus B_N(h′‖h) + B_N(h‖h′) = 2‖Δh‖²_K, and by (25) and the reproducing kernel property,

\[
2 \|\Delta h\|_K^2 \leq \frac{2\sigma}{m\lambda} \sup_{x \in S} |\Delta h(x)| \leq \frac{2\sigma}{m\lambda}\, \kappa \|\Delta h\|_K.
\]

Thus, ‖Δh‖_K ≤ σκ/(mλ). Using the σ-admissibility of c and the kernel reproducing property, we get, for all z ∈ X × Y,

\[
|c(h', z) - c(h, z)| \leq \sigma |\Delta h(x)| \leq \kappa \sigma \|\Delta h\|_K.
\]

Therefore, for all z ∈ X × Y,

\[
|c(h', z) - c(h, z)| \leq \frac{\sigma^2 \kappa^2}{m\lambda},
\]

which completes the proof.
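Lemma 12 can be probed empirically. The sketch below (our own construction with illustrative constants, not from the paper) trains closed-form kernel ridge regression on two samples differing in one point, using a Gaussian kernel so that K(x, x) = 1 and κ = 1, and checks that the loss on a fixed test point moves by at most σ²κ²/(mλ), with σ = 2B for the squared loss on outputs bounded by B.

```python
# Empirical sanity check of Lemma 12 for kernel ridge regression.

import numpy as np

def krr_predict(X, y, lam, x_test, gamma=1.0):
    """Closed-form KRR: h(x) = k(x)^T (K + m*lam*I)^{-1} y for the
    objective (1/m) * sum_i (h(x_i) - y_i)^2 + lam * ||h||_K^2."""
    m = len(X)
    K = np.exp(-gamma * (X[:, None] - X[None, :]) ** 2)
    alpha = np.linalg.solve(K + m * lam * np.eye(m), y)
    k_test = np.exp(-gamma * (x_test - X) ** 2)
    return float(k_test @ alpha)

rng = np.random.default_rng(0)
m, lam, B = 50, 0.1, 1.0
X = rng.uniform(-1, 1, m)
y = rng.uniform(0, B, m)                      # outputs bounded by B
Xp, yp = X.copy(), y.copy()
Xp[0], yp[0] = rng.uniform(-1, 1), rng.uniform(0, B)   # perturb one sample

x0, y0 = 0.3, 0.5
loss = (krr_predict(X, y, lam, x0) - y0) ** 2
loss_p = (krr_predict(Xp, yp, lam, x0) - y0) ** 2
sigma, kappa = 2 * B, 1.0                     # 2B-admissibility of squared loss
assert abs(loss - loss_p) <= sigma ** 2 * kappa ** 2 / (m * lam)
```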
Three specific instances of kernel regularized algorithms are: Support Vector Regression (SVR), for which the cost function is based on the ε-insensitive loss,

\[
c(h, z) = |h(x) - y|_\epsilon = \begin{cases} 0 & \text{if } |h(x) - y| \leq \epsilon, \\ |h(x) - y| - \epsilon & \text{otherwise;} \end{cases} \tag{26}
\]

Kernel Ridge Regression (Saunders et al., 1998), for which

\[
c(h, z) = (h(x) - y)^2; \tag{27}
\]

and finally Support Vector Machines (SVMs) with the hinge loss,

\[
c(h, z) = \begin{cases} 0 & \text{if } 1 - y h(x) \leq 0, \\ 1 - y h(x) & \text{if } 0 \leq y h(x) < 1, \\ 1 & \text{if } y h(x) < 0. \end{cases} \tag{28}
\]

We note that for kernel regularization algorithms, as pointed out in Bousquet and Elisseeff (2002, Lemma 23), a bound on the labels immediately implies a bound on the output of the hypothesis produced by Equation (22). We formally state this lemma below.

Lemma 13 Let h* be the solution to Equation (22), let c be a cost function, and let B(·) be a real-valued function such that

\[
\forall y \in \{ y \mid \exists x \in X, \exists h \in H, y = h(x) \},\ \forall y' \in Y, \quad c(y, y') \leq B(y).
\]

Then the output of h* is bounded as follows:

\[
\forall x \in X, \quad |h^*(x)| \leq \kappa \sqrt{\frac{B(0)}{\lambda}},
\]

where λ is the regularization parameter and κ² ≥ K(x, x) for all x ∈ X.

Proof Let F(h) = (1/m) Σ_{i=1}^m c(h, z_i) + λ‖h‖²_K and let 0 denote the zero hypothesis. Then, by definition of F and h*,

\[
\lambda \|h^*\|_K^2 \leq F(h^*) \leq F(\mathbf{0}) \leq B(0).
\]

Then, using the reproducing kernel property and the Cauchy-Schwarz inequality, we note that, for all x ∈ X,

\[
|h^*(x)| = \langle h^*, K(x, \cdot) \rangle \leq \|h^*\|_K \sqrt{K(x, x)} \leq \kappa \|h^*\|_K.
\]

Combining the two inequalities produces the result.

We note that in Bousquet and Elisseeff (2002) the following bound is also stated: c(h*(x), y′) ≤ B(κ√(B(0)/λ)). However, when this bound is later applied, the authors seem to use an incorrect upper bound function B(·), which we remedy in the following.
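The three cost functions of Equations 26-28 can be written out directly; the sketch below is a plain transcription (the function names are ours).

```python
# Direct transcriptions of the three cost functions.

def eps_insensitive(h_x, y, eps):
    """SVR, Equation 26: |h(x) - y|_eps."""
    return max(abs(h_x - y) - eps, 0.0)

def squared_loss(h_x, y):
    """Kernel Ridge Regression, Equation 27."""
    return (h_x - y) ** 2

def clipped_hinge(h_x, y):
    """SVM, Equation 28; clipped so the cost is bounded by M = 1."""
    if y * h_x >= 1:
        return 0.0
    if y * h_x < 0:
        return 1.0
    return 1.0 - y * h_x
```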
Corollary 14 Assume a bounded output Y = [0, B], for some B > 0, and assume that K(x, x) ≤ κ² for all x, for some κ > 0. Let h_S denote the hypothesis returned by the algorithm when trained on a sample S drawn from an algebraically ϕ-mixing stationary distribution. Let u = r/(r+1) ∈ [1/2, 1], M′ = 2(r+1)ϕ₀M/(rϕ₀M)^u, and ϕ′₀ = 1 + 2ϕ₀ r/(r−1). Then, with probability at least 1 − δ, the following generalization bounds hold for:

a. Support Vector Machines (SVM, with hinge loss):

\[
R(h_S) \leq \widehat{R}(h_S) + \frac{\kappa^2}{\lambda m} + \left( \frac{\kappa^2}{\lambda} \right)^u \frac{3 M'}{m^u} + \varphi'_0 \left( 1 + \frac{\kappa^2}{\lambda} + \left( \frac{\kappa^2}{\lambda} \right)^u M' m^{u-1} \right) \sqrt{\frac{2 \log(2/\delta)}{m}},
\]

where M = 1.

b. Support Vector Regression (SVR):

\[
R(h_S) \leq \widehat{R}(h_S) + \frac{\kappa^2}{\lambda m} + \left( \frac{\kappa^2}{\lambda} \right)^u \frac{3 M'}{m^u} + \varphi'_0 \left( M + \frac{\kappa^2}{\lambda} + \left( \frac{\kappa^2}{\lambda} \right)^u M' m^{u-1} \right) \sqrt{\frac{2 \log(2/\delta)}{m}},
\]

where M = κ√(B/λ) + B.

c. Kernel Ridge Regression (KRR):

\[
R(h_S) \leq \widehat{R}(h_S) + \frac{4\kappa^2 B^2}{\lambda m} + \left( \frac{4\kappa^2 B^2}{\lambda} \right)^u \frac{3 M'}{m^u} + \varphi'_0 \left( M + \frac{4\kappa^2 B^2}{\lambda} + \left( \frac{4\kappa^2 B^2}{\lambda} \right)^u M' m^{u-1} \right) \sqrt{\frac{2 \log(2/\delta)}{m}},
\]

where M = κ²B²/λ + B².

Proof For SVM, the hinge loss is 1-admissible, giving β̂ ≤ κ²/(λm), and the cost function is clearly bounded by M = 1. Similarly, SVR has a loss function that is 1-admissible; thus, applying Lemma 12 gives β̂ ≤ κ²/(λm). Using Lemma 13 with B(0) = B, we can bound the loss as follows: ∀x ∈ X, y ∈ Y, |h*(x) − y| ≤ κ√(B/λ) + B. Finally, for KRR, we have a loss function that is 2B-admissible, and again using Lemma 12, β̂ ≤ 4κ²B²/(λm). Applying Lemma 13 with B(0) = B² gives ∀x ∈ X, y ∈ Y, (h*(x) − y)² ≤ κ²B²/λ + B². Plugging these values into the bound of Theorem 11 and setting the right-hand side to δ yields the statement of the corollary.

3.4.
2 Relative Entropy Regularized Algorithms

In this section, we apply Theorem 11 to algorithms that produce a hypothesis h that is a convex combination of base hypotheses h_θ ∈ H parameterized by θ ∈ Θ. Thus, we wish to learn a weighting function g ∈ G: Θ → R that is a solution of the following optimization:

\[
\operatorname*{argmin}_{g \in G}\; \frac{1}{m} \sum_{i=1}^m c(g, z_i) + \lambda D(g \,\|\, g_0), \tag{29}
\]

where the cost function c: G × Z → R is defined in terms of a second, internal cost function c′: H × Z → R,

\[
c(g, z) = \int_\Theta c'(h_\theta, z)\, g(\theta)\, d\theta,
\]

and where D is the Kullback-Leibler divergence (or relative entropy) regularizer, with respect to some fixed distribution g₀:

\[
D(g \,\|\, g_0) = \int_\Theta g(\theta) \ln \frac{g(\theta)}{g_0(\theta)}\, d\theta.
\]

It has been shown (Bousquet and Elisseeff, 2002, Theorem 24) that an algorithm satisfying Equation 29 with a bounded internal loss c′(·) ≤ M is β̂-stable with coefficient β̂ ≤ M²/(λm). The application of our bounds results in the following corollary.

Corollary 15 Let h_S be the hypothesis produced by the optimization in (29), with internal cost function c′ bounded by M. Then, with probability at least 1 − δ,

\[
R(h_S) \leq \widehat{R}(h_S) + \frac{M^2}{\lambda m} + \frac{3 M'}{\lambda^u m^u} + \varphi'_0 \left( M + \frac{M^2}{\lambda} + \frac{M'}{\lambda^u} m^{u-1} \right) \sqrt{\frac{2 \log(2/\delta)}{m}},
\]

where u = r/(r+1) ∈ [1/2, 1], M′ = 2(r+1)ϕ₀ M^{u+1}/(rϕ₀)^u, and ϕ′₀ = 1 + 2ϕ₀ r/(r−1).

3.5 Discussion

The results presented here are, to the best of our knowledge, the first stability-based generalization bounds for the classes of algorithms just studied in a non-i.i.d. scenario. These bounds are non-trivial when the condition λ ≫ 1/m^{1/2−1/r} on the regularization parameter holds for all large values of m. This condition coincides with the i.i.d. condition, in the limit, as r tends to infinity.
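For a finite parameter set Θ, the two quantities appearing in Equation 29 reduce to sums. The sketch below is our own discretized illustration (helper names and values are ours, not from the paper): the mixture cost c(g, z) and the relative entropy D(g‖g₀) with respect to a fixed reference distribution g₀.

```python
# Discretized versions of the mixture cost and the KL regularizer of Eq. 29.

import math

def mixture_cost(g, base_costs):
    """c(g, z) = sum_theta c'(h_theta, z) * g(theta), for a finite Theta."""
    return sum(w * c for w, c in zip(g, base_costs))

def kl_divergence(g, g0):
    """D(g || g0) = sum_theta g(theta) * ln(g(theta) / g0(theta))."""
    return sum(p * math.log(p / q) for p, q in zip(g, g0) if p > 0)

g = [0.5, 0.25, 0.25]        # learned weighting over three base hypotheses
g0 = [1 / 3] * 3             # fixed reference distribution
assert kl_divergence(g, g0) > 0          # non-negative, zero iff g == g0
assert mixture_cost(g, [1.0, 0.0, 0.0]) == 0.5
```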
The next section gives stability-based generalization bounds that hold even in the scenario of β-mixing sequences.

4. β-Mixing Generalization Bounds

In this section, we prove a stability-based generalization bound that only requires the training sequence to be drawn from a stationary β-mixing distribution. The bound is thus more general and covers the ϕ-mixing case analyzed in the previous section. However, unlike the ϕ-mixing case, the β-mixing bound presented here is not a purely exponential bound: it contains an additive term that depends on the mixing coefficient.

As in the previous section, Φ(S) is defined by Φ(S) = R(h_S) − R̂(h_S). To simplify the presentation, here we define the generalization error of h_S by R(h_S) = E_z[c(h_S, z)]; thus, test samples are assumed independent of S. By Lemma 5, this can be assumed modulo the additional term bβ̂ + Mβ(b), for a cost function bounded by M.

Note that for any block of points Z = z₁ ... z_k drawn independently of S, the following equality holds:

\[
\operatorname{E}_Z\left[ \frac{1}{|Z|} \sum_{z \in Z} c(h_S, z) \right] = \frac{1}{k} \sum_{i=1}^k \operatorname{E}_Z[c(h_S, z_i)] = \frac{1}{k} \sum_{i=1}^k \operatorname{E}_{z_i}[c(h_S, z_i)] = \operatorname{E}_z[c(h_S, z)], \tag{30}
\]

since, by stationarity, E_{z_i}[c(h_S, z_i)] = E_{z_j}[c(h_S, z_j)] for all 1 ≤ i, j ≤ k. Thus, R(h_S) = E_Z[(1/|Z|) Σ_{z∈Z} c(h_S, z)] for any such block Z. For convenience, we extend the cost function c to blocks as follows:

\[
c(h, Z) = \frac{1}{|Z|} \sum_{z \in Z} c(h, z). \tag{31}
\]

With this notation, R(h_S) = E_Z[c(h_S, Z)] for any block drawn independently of S, regardless of the size of Z.

To derive a generalization bound for the β-mixing scenario, we will apply McDiarmid's inequality to Φ defined over a sequence of independent blocks.
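The block-averaged cost of Equation 31 is a plain average; a direct transcription (our own helper, with a toy one-dimensional cost for illustration):

```python
# Equation 31: the cost function extended to a block as an average.

def block_cost(cost, h, Z):
    """c(h, Z) = (1/|Z|) * sum_{z in Z} c(h, z)."""
    return sum(cost(h, z) for z in Z) / len(Z)

squared = lambda h, z: (h - z) ** 2       # toy 1-D cost, for illustration only
assert block_cost(squared, 0.0, [1.0, -1.0, 2.0]) == 2.0
```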
The independent blocks we consider are non-symmetric and thus more general than those considered by previous authors (Yu, 1994; Meir, 2000; Lozano et al., 2006). From a sample S made of a sequence of m points, we construct two sequences of blocks S_a and S_b, each containing µ blocks. Each block in S_a contains a points and each block in S_b contains b points. S_a and S_b form a partitioning of S; for any a, b ∈ [0, m] such that (a + b)µ = m, they are defined precisely as follows:

\[
\begin{aligned}
S_a &= \big( Z^{(a)}_1, \ldots, Z^{(a)}_\mu \big), \quad \text{with } Z^{(a)}_i = \big( z_{(i-1)(a+b)+1}, \ldots, z_{(i-1)(a+b)+a} \big), \\
S_b &= \big( Z^{(b)}_1, \ldots, Z^{(b)}_\mu \big), \quad \text{with } Z^{(b)}_i = \big( z_{(i-1)(a+b)+a+1}, \ldots, z_{(i-1)(a+b)+a+b} \big), \tag{32}
\end{aligned}
\]

for all i ∈ [1, µ]. We shall consider similarly defined sequences of i.i.d. blocks Z̃^(a)_i and Z̃^(b)_i, i ∈ [1, µ], such that the points within each block are drawn according to the same original β-mixing distribution, and shall denote by S̃_a the block sequence (Z̃^(a)_1, ..., Z̃^(a)_µ).

In preparation for the application of McDiarmid's inequality, we give a bound on the expectation of Φ(S̃_a). Since the expectation is taken over a sequence of i.i.d. blocks, this brings us to a situation similar to the i.i.d. scenario analyzed by Bousquet and Elisseeff (2002), with the exception that we are dealing with i.i.d. blocks instead of i.i.d. points.

Lemma 16 Let S̃_a be an independent block sequence as defined above. Then the following bound holds for the expectation of |Φ(S̃_a)|:

\[
\operatorname{E}_{\tilde{S}_a}\big[\, |\Phi(\tilde{S}_a)| \,\big] \leq a \hat{\beta}.
\]

Proof Since the blocks Z̃^(a) are independent, we can replace any one of them with any other block Z drawn from the same distribution. However, changing the training set also changes the hypothesis, in a limited way.
This is shown precisely below:

\[
\begin{aligned}
\operatorname{E}_{\tilde{S}_a}\big[ |\Phi(\tilde{S}_a)| \big]
&= \operatorname{E}_{\tilde{S}_a}\left[ \left| \frac{1}{\mu} \sum_{i=1}^\mu c(h_{\tilde{S}_a}, \tilde{Z}^{(a)}_i) - \operatorname{E}_Z[c(h_{\tilde{S}_a}, Z)] \right| \right] \\
&\leq \operatorname{E}_{\tilde{S}_a, Z}\left[ \left| \frac{1}{\mu} \sum_{i=1}^\mu c(h_{\tilde{S}_a}, \tilde{Z}^{(a)}_i) - c(h_{\tilde{S}_a}, Z) \right| \right] \\
&= \operatorname{E}_{\tilde{S}_a, Z}\left[ \left| \frac{1}{\mu} \sum_{i=1}^\mu c(h_{\tilde{S}^i_a}, Z) - c(h_{\tilde{S}_a}, Z) \right| \right],
\end{aligned}
\]

where S̃ⁱ_a corresponds to the block sequence S̃_a obtained by replacing the i-th block with Z. The inequality holds through the use of Jensen's inequality. The β̂-stability of the learning algorithm gives

\[
\operatorname{E}_{\tilde{S}_a, Z}\left[ \left| \frac{1}{\mu} \sum_{i=1}^\mu c(h_{\tilde{S}^i_a}, Z) - c(h_{\tilde{S}_a}, Z) \right| \right] \leq \operatorname{E}_{\tilde{S}_a, Z}\left[ \frac{1}{\mu} \sum_{i=1}^\mu a \hat{\beta} \right] \leq a \hat{\beta}.
\]

We now relate the non-i.i.d. event Pr[Φ(S) ≥ ε] to an independent-block-sequence event to which we can apply McDiarmid's inequality.

Lemma 17 Assume a β̂-stable algorithm. Then, for a sample S drawn from a stationary β-mixing distribution, the following bound holds:

\[
\Pr_S\big[ |\Phi(S)| \geq \epsilon \big] \leq \Pr_{\tilde{S}_a}\Big[ |\Phi(\tilde{S}_a)| - \operatorname{E}\big[ |\Phi(\tilde{S}_a)| \big] \geq \epsilon'_0 \Big] + (\mu - 1)\beta(b), \tag{33}
\]

where ε′₀ = ε − µbM/m − 2µbβ̂ − E_{S̃′_a}[|Φ(S̃′_a)|].

Proof The proof consists of first rewriting the event in terms of S_a and S_b and bounding the error on the points in S_b in a trivial manner. This can be afforded since b will eventually be chosen to be small. Since |E_{Z′}[c(h_S, Z′)] − c(h_S, z′)| ≤ M for any z′ ∈ S_b, we can write

\[
\begin{aligned}
\Pr_S\big[ |\Phi(S)| \geq \epsilon \big]
&= \Pr_S\big[ |R(h_S) - \widehat{R}(h_S)| \geq \epsilon \big] \\
&= \Pr_S\left[ \left| \frac{1}{m} \sum_{z \in S} \big( \operatorname{E}_Z[c(h_S, Z)] - c(h_S, z) \big) \right| \geq \epsilon \right] \\
&\leq \Pr_S\left[ \left| \frac{1}{m} \sum_{z \in S_a} \big( \operatorname{E}_Z[c(h_S, Z)] - c(h_S, z) \big) \right| + \left| \frac{1}{m} \sum_{z' \in S_b} \big( \operatorname{E}_{Z'}[c(h_S, Z')] - c(h_S, z') \big) \right| \geq \epsilon \right] \\
&\leq \Pr_S\left[ \left| \frac{1}{m} \sum_{z \in S_a} \big( \operatorname{E}_Z[c(h_S, Z)] - c(h_S, z) \big) \right| + \frac{\mu b M}{m} \geq \epsilon \right].
\end{aligned}
\]

By β̂-stability and µa/m ≤ 1, this last term can be bounded as follows:

\[
\Pr_S\left[ \left| \frac{1}{m} \sum_{z \in S_a} \big( \operatorname{E}_Z[c(h_S, Z)] - c(h_S, z) \big) \right| + \frac{\mu b M}{m} \geq \epsilon \right]
\leq \Pr_{S_a}\left[ \left| \frac{1}{\mu a} \sum_{z \in S_a} \big( \operatorname{E}_Z[c(h_{S_a}, Z)] - c(h_{S_a}, z) \big) \right| + \frac{\mu b M}{m} + 2 \mu b \hat{\beta} \geq \epsilon \right].
\]
The right-hand side can be rewritten in terms of Φ and bounded in terms of a β-mixing coefficient:

\[
\begin{aligned}
\Pr_{S_a}\left[ \left| \frac{1}{\mu a} \sum_{z \in S_a} \big( \operatorname{E}_Z[c(h_{S_a}, Z)] - c(h_{S_a}, z) \big) \right| + \frac{\mu b M}{m} + 2 \mu b \hat{\beta} \geq \epsilon \right]
&= \Pr_{S_a}\left[ |\Phi(S_a)| + \frac{\mu b M}{m} + 2 \mu b \hat{\beta} \geq \epsilon \right] \\
&\leq \Pr_{\tilde{S}_a}\left[ |\Phi(\tilde{S}_a)| + \frac{\mu b M}{m} + 2 \mu b \hat{\beta} \geq \epsilon \right] + (\mu - 1)\beta(b),
\end{aligned}
\]

by applying Lemma 3 to the indicator function of the event {|Φ(S_a)| + µbM/m + 2µbβ̂ ≥ ε}. Since E_{S̃′_a}[|Φ(S̃′_a)|] is a constant, the probability in this last term can be rewritten as

\[
\begin{aligned}
\Pr_{\tilde{S}_a}\left[ |\Phi(\tilde{S}_a)| + \frac{\mu b M}{m} + 2 \mu b \hat{\beta} \geq \epsilon \right]
&= \Pr_{\tilde{S}_a}\left[ |\Phi(\tilde{S}_a)| - \operatorname{E}_{\tilde{S}'_a}\big[ |\Phi(\tilde{S}'_a)| \big] + \frac{\mu b M}{m} + 2 \mu b \hat{\beta} \geq \epsilon - \operatorname{E}_{\tilde{S}'_a}\big[ |\Phi(\tilde{S}'_a)| \big] \right] \\
&= \Pr_{\tilde{S}_a}\Big[ |\Phi(\tilde{S}_a)| - \operatorname{E}_{\tilde{S}'_a}\big[ |\Phi(\tilde{S}'_a)| \big] \geq \epsilon'_0 \Big],
\end{aligned}
\]

which ends the proof of the lemma.

The last two lemmas will help us prove the main result of this section, formulated in the following theorem.

Theorem 18 Assume a β̂-stable algorithm and let ε′ denote ε − µbM/m − 2µbβ̂ − aβ̂, as in Lemma 17. Then, for any sample S of size m drawn according to a stationary β-mixing distribution, any choice of the parameters a, b, µ > 0 such that (a + b)µ = m, and any ε ≥ 0 such that ε′ ≥ 0, the following generalization bound holds:

\[
\Pr_S\Big[ |R(h_S) - \widehat{R}(h_S)| \geq \epsilon \Big] \leq \exp\left( \frac{-2 \epsilon'^2 m}{\big( 2 a \hat{\beta} m + (a+b) M \big)^2} \right) + (\mu - 1)\beta(b).
\]

Proof To prove the statement of the theorem, it suffices to bound the probability term appearing in the right-hand side of Equation 33, Pr_{S̃_a}[|Φ(S̃_a)| − E[|Φ(S̃_a)|] ≥ ε′₀], which is expressed only in terms of independent blocks. We can therefore apply McDiarmid's inequality by viewing the blocks as i.i.d. "points". To do so, we must bound the quantity |Φ(S̃_a)| − |Φ(S̃ⁱ_a)|, where the sequences S̃_a and S̃ⁱ_a differ in the i-th block.
We will bound separately the difference between the generalization errors and the difference between the empirical errors.⁴ The difference in empirical errors can be bounded as follows, using the bound on the cost function c:

\[
\big| \widehat{R}(h_{\tilde{S}_a}) - \widehat{R}(h_{\tilde{S}^i_a}) \big|
= \left| \frac{1}{\mu} \sum_{j \neq i} \big( c(h_{\tilde{S}_a}, Z_j) - c(h_{\tilde{S}^i_a}, Z_j) \big) + \frac{1}{\mu} \big( c(h_{\tilde{S}_a}, Z_i) - c(h_{\tilde{S}^i_a}, Z'_i) \big) \right|
\leq a \hat{\beta} + \frac{M}{\mu} = a \hat{\beta} + \frac{(a+b) M}{m}.
\]

The difference in generalization errors can be straightforwardly bounded using β̂-stability:

\[
\big| R(h_{\tilde{S}_a}) - R(h_{\tilde{S}^i_a}) \big|
= \big| \operatorname{E}_Z[c(h_{\tilde{S}_a}, Z)] - \operatorname{E}_Z[c(h_{\tilde{S}^i_a}, Z)] \big|
= \big| \operatorname{E}_Z[c(h_{\tilde{S}_a}, Z) - c(h_{\tilde{S}^i_a}, Z)] \big| \leq a \hat{\beta}.
\]

Using these bounds in conjunction with McDiarmid's inequality yields

\[
\Pr_{\tilde{S}_a}\Big[ |\Phi(\tilde{S}_a)| - \operatorname{E}_{\tilde{S}'_a}\big[ |\Phi(\tilde{S}'_a)| \big] \geq \epsilon'_0 \Big]
\leq \exp\left( \frac{-2 \epsilon_0'^2 m}{\big( 2 a \hat{\beta} m + (a+b) M \big)^2} \right)
\leq \exp\left( \frac{-2 \epsilon'^2 m}{\big( 2 a \hat{\beta} m + (a+b) M \big)^2} \right).
\]

Note that to show the second inequality we make use of Lemma 16 to establish the fact that

\[
\epsilon'_0 = \epsilon - \frac{\mu b M}{m} - 2 \mu b \hat{\beta} - \operatorname{E}_{\tilde{S}'_a}\big[ |\Phi(\tilde{S}'_a)| \big] \geq \epsilon - \frac{\mu b M}{m} - 2 \mu b \hat{\beta} - a \hat{\beta} = \epsilon'.
\]

Finally, we make use of Lemma 17 to complete the proof:

\[
\Pr_S\big[ |\Phi(S)| \geq \epsilon \big] \leq \Pr_{\tilde{S}_a}\Big[ |\Phi(\tilde{S}_a)| - \operatorname{E}\big[ |\Phi(\tilde{S}_a)| \big] \geq \epsilon'_0 \Big] + (\mu - 1)\beta(b)
\leq \exp\left( \frac{-2 \epsilon'^2 m}{\big( 2 a \hat{\beta} m + (a+b) M \big)^2} \right) + (\mu - 1)\beta(b).
\]

4. We drop the superscripts on Z^(a) since we will not be considering the sequence S_b in what follows.

In order to make use of the bounds, we must select the values of the parameters b and µ (a is then equal to m/µ − b). There is a trade-off between choosing a large value for b, to ensure that the mixing term decreases, and choosing a large value of µ, to minimize the remaining terms of the bound. The exact choice of parameters will depend on the type of mixing that is assumed (e.g., algebraic or exponential).
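The partitioning of S into µ alternating blocks of a and b points, as defined in Equation 32, can be sketched directly (our own implementation; the function name is ours):

```python
# The non-symmetric block construction of Equation 32: S is split into mu
# alternating blocks of a points (collected in S_a) and b points (collected
# in S_b), with (a + b) * mu == m.

def split_blocks(S, a, b):
    m = len(S)
    assert (a + b) > 0 and m % (a + b) == 0
    mu = m // (a + b)
    S_a = [S[i * (a + b): i * (a + b) + a] for i in range(mu)]
    S_b = [S[i * (a + b) + a: (i + 1) * (a + b)] for i in range(mu)]
    return S_a, S_b

S = list(range(12))
S_a, S_b = split_blocks(S, a=2, b=1)
# S_a == [[0, 1], [3, 4], [6, 7], [9, 10]]
# S_b == [[2], [5], [8], [11]]
```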
In order to choose optimal parameters, it will be useful to view the bound as it holds with high probability, as in the following corollary.

Corollary 19 Assume a β̂-stable algorithm and let δ′ denote δ − (µ − 1)β(b). Then, for any sample S of size m drawn according to a stationary β-mixing distribution, any choice of the parameters a, b, µ > 0 such that (a + b)µ = m, and any δ ≥ 0 such that δ′ ≥ 0, the following generalization bound holds with probability at least 1 − δ:

\[
|R(h_S) - \widehat{R}(h_S)| < \sqrt{\frac{\log(1/\delta')}{2m}} \left( 2 a \hat{\beta} m + \frac{M m}{\mu} \right) + \mu b \left( \frac{M}{m} + 2 \hat{\beta} \right) + a \hat{\beta}.
\]

In the case of a fast-mixing distribution, it is possible to select the values of the parameters to retrieve a bound as in the i.i.d. case, that is, |R(h_S) − R̂(h_S)| ∈ O(m^{−1/2} √(log(1/δ))). In particular, for β(b) ≡ 0, we can choose b = 0, a = 1, and µ = m to retrieve the i.i.d. bound of Bousquet and Elisseeff (2001). In the following, we examine slower-mixing algebraic β-mixing distributions, which are thus not close to the i.i.d. scenario. For algebraic mixing, the mixing parameter is defined by β(b) = b^{−r}. In that case, we wish to minimize the following function of µ and b:

\[
s(\mu, b) = \frac{\mu}{b^r} + \frac{m^{3/2} \hat{\beta}}{\mu} + \frac{m^{1/2}}{\mu} + \mu b \left( \frac{1}{m} + \hat{\beta} \right). \tag{34}
\]

The first term of the function captures the condition δ > (µ − 1)β(b) ≈ µ/b^r, and the remaining terms capture the shape of the bound of Corollary 19. Setting the derivative with respect to each variable µ and b to zero and solving for each parameter results in the following expressions:

\[
b = C_r\, \gamma^{-\frac{1}{r+1}}, \qquad \mu = \frac{m^{3/4}\, \gamma^{\frac{1}{2(r+1)}}}{\sqrt{C_r (1 + 1/r)}}, \tag{35}
\]

where γ = (m^{−1} + β̂) and C_r = r^{1/(r+1)} is a constant defined by the parameter r. Now, assuming β̂ ∈ O(m^{−α}) for some 0 < α ≤ 1, we analyze the convergence behavior of Corollary 19.
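The parameter choices of Equation 35 can be computed directly; the sketch below is our own, with illustrative constants (the function name is ours, not from the paper).

```python
# The minimizers of s(mu, b) from Equation 35, for an algebraically
# beta-mixing sequence with beta(b) = b**(-r).

def optimal_block_params(m, beta_hat, r):
    gamma = 1.0 / m + beta_hat
    C_r = r ** (1.0 / (r + 1))
    b = C_r * gamma ** (-1.0 / (r + 1))
    mu = m ** 0.75 * gamma ** (1.0 / (2 * (r + 1))) / (C_r * (1 + 1.0 / r)) ** 0.5
    return mu, b

m, r = 10 ** 6, 2.0
mu, b = optimal_block_params(m, beta_hat=1.0 / m, r=r)
# With beta_hat in O(1/m) (alpha = 1), b grows as m**(1/(r+1)) and mu as
# m**(3/4 - 1/(2*(r+1))), matching the asymptotics stated in Equation 36.
```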
First, we notice that b and µ have the following asymptotic behavior:

\[
b \in O\big( m^{\frac{\alpha}{r+1}} \big), \qquad \mu \in O\big( m^{\frac{3}{4} - \frac{\alpha}{2(r+1)}} \big). \tag{36}
\]

Next, we consider the condition δ′ > 0, which is equivalent to

\[
\delta > (\mu - 1)\beta(b) \in O\Big( m^{\frac{3}{4} - \alpha\left( 1 - \frac{1}{2(r+1)} \right)} \Big). \tag{37}
\]

In order for the right-hand side of this inequality to converge, it must be the case that α > (3r + 3)/(4r + 2). In particular, if α = 1, as we have shown is the case for several algorithms in Section 3.4, then it suffices that r > 1. Finally, in order to see how the bound itself converges, we study the asymptotic behavior of the terms of Equation 34 (without the first term, which corresponds to the quantity already analyzed in Equation 37):

\[
\underbrace{\frac{m^{3/2} \hat{\beta}}{\mu} + \mu b \hat{\beta}}_{(a)} + \underbrace{\frac{m^{1/2}}{\mu} + \frac{\mu b}{m}}_{(b)}
\in O\Big( \underbrace{m^{\frac{3}{4} - \alpha\left( 1 - \frac{1}{2(r+1)} \right)}}_{(a)} + \underbrace{m^{\frac{\alpha}{2(r+1)} - \frac{1}{4}}}_{(b)} \Big). \tag{38}
\]

This expression can be further simplified by noticing that (b) ≤ (a) for all 0 < α ≤ 1 (with equality at α = 1). Thus, both the bound and the condition on δ decrease asymptotically as the term in (a), resulting in the following corollary.

Corollary 20 Assume a β̂-stable algorithm with β̂ ∈ O(m^{−1}) and let δ′ = δ − m^{1/(2(r+1)) − 1/4}. Then, for any sample S of size m drawn according to a stationary algebraic β-mixing distribution, and any δ ≥ 0 such that δ′ ≥ 0, the following generalization bound holds with probability at least 1 − δ:

\[
|R(h_S) - \widehat{R}(h_S)| < O\Big( m^{\frac{1}{2(r+1)} - \frac{1}{4}} \sqrt{\log(1/\delta')} \Big). \tag{39}
\]

As in previous bounds, r > 1 is required for convergence. Furthermore, as expected, a larger mixing parameter r leads to a more favorable bound.

5. Conclusion

We presented stability bounds for both ϕ-mixing and β-mixing stationary sequences.
Our bounds apply to large classes of algorithms, including common algorithms such as SVR, KRR, and SVMs, and extend existing i.i.d. stability bounds to non-i.i.d. scenarios. Since they are algorithm-specific, these bounds can often be tighter than other generalization bounds based on general complexity measures for families of hypotheses. As in the i.i.d. case, weaker notions of stability might help further improve and refine these bounds.

Our bounds can be used to analyze the properties of stable algorithms when used in the non-i.i.d. settings studied. But, more importantly, they can serve as a tool for the design of novel and accurate learning algorithms. Of course, some mixing properties of the distributions need to be known to take advantage of the information supplied by our generalization bounds. In some problems, it is possible to estimate the shape of the mixing coefficients, which should help in devising such algorithms.

Acknowledgments

This work was partially funded by the New York State Office of Science Technology and Academic Research (NYSTAR) and a Google Research Award.

References

Sergei Natanovich Bernstein. Sur l'extension du théorème limite du calcul des probabilités aux sommes de quantités dépendantes. Mathematische Annalen, 97:1-59, 1927.

Olivier Bousquet and André Elisseeff. Algorithmic stability and generalization performance. In Advances in Neural Information Processing Systems (NIPS 2000), 2001.

Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499-526, 2002.

Jean-René Chazottes, Pierre Collet, Christof Külske, and Frank Redig. Concentration inequalities for random fields via coupling. Probability Theory and Related Fields, 137(1):201-225, 2007.

Corinna Cortes and Vladimir N. Vapnik. Support-Vector Networks.
Machine Learning, 20(3):273-297, 1995.

Luc Devroye and T. J. Wagner. Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25:601-604, 1979.

Paul Doukhan. Mixing: Properties and Examples. Springer-Verlag, 1994.

Michael Kearns and Dana Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. In Computational Learning Theory, pages 152-162, 1997.

Leo Kontorovich. Measure Concentration of Strongly Mixing Processes with Applications. PhD thesis, Carnegie Mellon University, 2007.

Leo Kontorovich and Kavita Ramanan. Concentration inequalities for dependent random variables via the martingale method, 2006.

Aurélie Lozano, Sanjeev Kulkarni, and Robert Schapire. Convergence and consistency of regularized boosting algorithms with stationary β-mixing observations. In NIPS, 2006.

Katalin Marton. Measure concentration for a class of random processes. Probability Theory and Related Fields, 110(3):427-439, 1998.

Davide Mattera and Simon Haykin. Support vector machines for dynamic reconstruction of a chaotic system. In Advances in Kernel Methods: Support Vector Learning, pages 211-241. MIT Press, Cambridge, MA, 1999.

Colin McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, pages 148-188. Cambridge University Press, 1989.

Ron Meir. Nonparametric time series prediction through adaptive model selection. Machine Learning, 39(1):5-34, April 2000.

Dharmendra Modha and Elias Masry. On the consistency in nonparametric estimation under mixing assumptions. IEEE Transactions on Information Theory, 44:117-133, 1998.

Klaus-Robert Müller, Alex Smola, Gunnar Rätsch, Bernhard Schölkopf, Jens Kohlmorgen, and Vladimir Vapnik.
Predicting time series with support vector machines. In Proceedings of the International Conference on Artificial Neural Networks (ICANN'97), Lecture Notes in Computer Science, pages 999-1004. Springer, 1997.

Paul-Marie Samson. Concentration of measure inequalities for Markov chains and Φ-mixing processes. Annals of Probability, 28(1):416-461, 2000.

Craig Saunders, Alexander Gammerman, and Volodya Vovk. Ridge regression learning algorithm in dual variables. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 515-521. Morgan Kaufmann Publishers Inc., 1998.

Bernhard Schölkopf and Alex Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.

Mathukumalli Vidyasagar. Learning and Generalization: with Applications to Neural Networks. Springer, 2003.

Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94-116, January 1994.

Shuheng Zhou, John Lafferty, and Larry Wasserman. Time varying undirected graphs. In Proceedings of the 21st Annual Conference on Learning Theory, 2008.