Rademacher complexity of stationary sequences

We show how to control the generalization error of time series models wherein past values of the outcome are used to predict future values. The results are based on a generalization of standard i.i.d. concentration inequalities to dependent data without the mixing assumptions common in the time series setting.

Authors: Daniel J. McDonald, Cosma Rohilla Shalizi

Rademacher Complexity of Stationary Sequences

Daniel J. McDonald, Department of Statistics, Indiana University, dajmcdon@indiana.edu
Cosma Rohilla Shalizi, Department of Statistics, Carnegie Mellon University, cshalizi@cmu.edu

Version: May 24, 2017

Abstract

We show how to control the generalization error of time series models wherein past values of the outcome are used to predict future values. The results are based on a generalization of standard i.i.d. concentration inequalities to dependent data without the mixing assumptions common in the time series setting. Our proof and the result are simpler than previous analyses with dependent data or stochastic adversaries, which use sequential Rademacher complexities rather than the expected Rademacher complexity for i.i.d. processes. We also derive empirical Rademacher results without mixing assumptions, resulting in fully calculable upper bounds.

1 Introduction

Statistical learning theory aims to bound the out-of-sample performance of prediction rules induced from finite data sets. The classical situation is where one wishes to predict one variable $Y \in \mathcal{Y}$ from another $X \in \mathcal{X}$, and has a training set of $n$ pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$, assumed to be drawn i.i.d. from a distribution $\nu$ that will also generate future instances. Provided with a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ and a class of prediction functions $\mathcal{G}$, where each $g \in \mathcal{G}$ is a map from $\mathcal{X}$ to $\mathcal{Y}$, the usual goal is to bound the supremum of the empirical process of the losses,
$$\sup_{g \in \mathcal{G}} \; \mathbb{E}_{\nu}[\ell(Y, g(X))] - \frac{1}{n}\sum_{i=1}^{n} \ell(Y_i, g(X_i)).$$
Such bounds involve some notion of the flexibility or complexity of the model space $\mathcal{G}$, and a particularly important one is the Rademacher complexity,
$$R_n(\mathcal{G}) = 2\,\mathbb{E}_{\xi,\nu}\left[\sup_{g \in \mathcal{G}} \frac{1}{n}\sum_{i=1}^{n} \xi_i\, \ell(Y_i, g(X_i))\right],$$
where the $\xi_i$ are a sequence of i.i.d. variables taking the values $+1$ and $-1$ with equal probability (see §2.3 for a fuller statement). While the Rademacher complexity was first used to bound generalization error for i.i.d. processes (e.g., Bartlett and Mendelson, 2002), it has been extended to situations where the $(X_i, Y_i)$ pairs are dependent but $(X_i, Y_i)$ becomes independent of $(X_j, Y_j)$ as $|i - j| \to \infty$ (Mohri and Rostamizadeh, 2009), and even to adversarial settings, where the data source actively tries to fool the learner (Rakhlin et al., 2010).

We build on the latter work to extend Rademacher complexity to the rather different problem of time-series forecasting. In that setting, we observe a single sequence of random variables $Y_1, Y_2, \ldots, Y_n$ (for short, $Y_1^n$), taking values in $\mathcal{Y}$, and wish to learn a function which extrapolates the sequence into the future, to forecast (say) the next value $Y_{n+1}$. (Going beyond "one-step-ahead" forecasting, to longer horizons or whole blocks, involves mostly notational changes, which we will not note explicitly.) Given a predictor $g : \mathcal{Y}^n \mapsto \mathcal{Y}$, a natural notion of generalization error for time series, the forecasting risk, is
$$R(g) \equiv \mathbb{E}\left[\ell(Y_{n+1}, g(Y_1^n)) \mid Y_1^n\right].$$
While a precise statement needs some care (§2.4), we will show that forecasting risk, like the generalization error of classification and regression problems, can be bounded via the Rademacher complexity, despite the rather different nature of the problem. In particular, we are able to use the standard Rademacher complexity. Our result is comparable to that in Rakhlin et al. (2015, §9), which gives a bound for time series prediction in Banach spaces.
Their result, however, is a consequence of results for the more difficult problem of prediction under stochastic adversaries. As such, our bound and the proof are simpler and tighter, though they apply to an easier (but still highly relevant) prediction task.

§2 gives background material essential for stating our results, on time series, model complexity, and forecasting risk. §3 derives risk bounds for time series, giving a novel proof that the standard Rademacher complexity characterizes the flexibility of $\mathcal{G}$, even under stationarity, with concentration inequalities for non-mixing dependent variables. §4 carefully compares our results to others in the literature, sketches applications and algorithms, and concludes.

2 Time Series, Complexity, and Concentration of Measure

We introduce some of the concepts needed for our results: stationarity and ergodicity are required to control generalization error (unless we aim to predict only a single new observation); Rademacher complexity measures the flexibility of the model space $\mathcal{G}$; forecasting risk measures the quality of a time-series prediction rule.

Notation. $\mathbf{Y} = \{Y_t\}_{t=-\infty}^{\infty}$ is a sequence of random variables, i.e., each $Y_t$ is a measurable mapping from some probability space $(\Omega, \mathcal{F}, \mathbb{P})$ into a measurable space $\mathcal{Y}$. We write $Y_i^j$ for the block $\{Y_t\}_{t=i,\ldots,j}$ from the random sequence; either limit may be infinity. The $\sigma$-field generated by the block $Y_i^j$ is $\mathcal{F}_i^j$. $\mathcal{L}(W)$ denotes the probability law of the random object $W$, and $\mathcal{L}(W \mid V)$ the conditional law of $W$ given $V$. Finally, if $W$ has distribution $\nu$ and $f$ is a measurable function, we define $\mathbb{E}_{\nu}[f(W)] = \mathbb{E}_W[f(W)] = \int f \, d\nu$. We will try to use whichever notation is clearest in context, sticking to $\mathbb{E}[f(W)]$ when that is unambiguous.

2.1 Stationarity and Ergodicity

We assume $\mathbf{Y}$ is (strictly or strongly) stationary.

Definition 1 (Stationarity). A random sequence $\mathbf{Y}$ is stationary when all its finite-dimensional distributions are time invariant: for all $t$ and all $i \ge 0$, $\mathcal{L}(Y_t^{t+i}) = \mathcal{L}(Y_0^i)$.

Stationarity does not require the random variables $Y_t$ to be independent across time, but does imply they all have the same distribution. The infinite-dimensional distribution of $\mathbf{Y}$, $\mathcal{L}(\mathbf{Y})$, is a probability measure on $\mathcal{Y}^{\infty}$. In this space, the time-evolution of the process is just the shift map $\tau$, which "moves the sequence a step to the right": $(\tau \mathbf{Y})_t = Y_{t+1}$.

Definition 2 (Ergodicity). A set $A \subset \mathcal{Y}^{\infty}$ is shift-invariant, $A \in \mathcal{I}$, when $\tau^{-1}A = A$. A probability measure $\mu$ on $\mathcal{Y}^{\infty}$ is ergodic when shift-invariant sets have either probability 0 or probability 1, i.e., $A \in \mathcal{I}$ only if $\mu(A) = 0$ or $\mu(A) = 1$.

Ergodicity is important for two reasons. The first is that it implies a law of large numbers for time series.

Proposition 3 (Individual Ergodic Theorem; Gray 2009). If $\mu$ is stationary and ergodic, and $f \in L_1$, then the time-average of $f$ converges to its expectation $\mu$-almost-surely. That is, the set of $y \in \mathcal{Y}^{\infty}$ such that $\frac{1}{n}\sum_{t=0}^{n-1} f(\tau^t y) \to \mathbb{E}_{\mu}[f(\mathbf{Y})]$ has $\mu$-measure 1.
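As a quick illustration of Proposition 3, here is a minimal simulation sketch (ours, not from the paper) using a two-state Markov chain that also appears later as an example in §3.1: $Y_1$ is 0 or 1 with equal probability, and $Y_{t+1} = Y_t$ with probability 0.9. Started from its uniform stationary distribution, the chain is stationary and ergodic, so time averages of $f(y) = y$ converge to $\mathbb{E}[Y] = 1/2$.

```python
import numpy as np

# A minimal sketch (not from the paper) illustrating Proposition 3.
# Two-state chain: Y_1 ~ Uniform{0,1}; Y_{t+1} = Y_t with prob. 0.9, else flipped.
# It is stationary (uniform start) and ergodic, so time averages of f(y) = y
# converge almost surely to E[Y] = 0.5.
rng = np.random.default_rng(0)

def sample_chain(n, p_stay=0.9):
    y = np.empty(n, dtype=int)
    y[0] = rng.integers(0, 2)
    for t in range(1, n):
        y[t] = y[t - 1] if rng.random() < p_stay else 1 - y[t - 1]
    return y

for n in (10**2, 10**4, 10**6):
    print(n, sample_chain(n).mean())   # approaches 0.5 as n grows
```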
The second reason is that every stationary process $\mathbf{Y}$ decomposes into a mixture of stationary and ergodic processes, and each realization of $\mathbf{Y}$ comes from just one of these ergodic components.

Proposition 4 (Ergodic Decomposition; Dynkin 1978; Gray 2009). If $\rho$ is a stationary but not ergodic distribution on $\mathcal{Y}^{\infty}$, then $\rho = \int \mu \, d\pi(\mu)$, where $\pi$ is a measure on the space of stationary and ergodic processes. Moreover, for any $f \in L_1$, $\frac{1}{n}\sum_{t=0}^{n-1} f(\tau^t y) \to \mathbb{E}_{\rho}[f(\mathbf{Y}) \mid \mathcal{I}]$ for $\rho$-almost-all trajectories $y$.

In words, to generate a trajectory from a stationary, non-ergodic process, first pick a stationary ergodic process (according to the distribution $\pi$), and then generate $\mathbf{Y}$ from that process. To sum up, then, if we assume that the data source is stationary, and that we only get to see a single trajectory from it, there is no loss of generality in also assuming that the source is ergodic, and so the strong law of large numbers, in the form of Prop. 3, applies. Non-ergodicity would only be relevant if we were to consider multiple independent trajectories from the same stationary process, which might sample different ergodic components (Wiener, 1956).

2.2 Empirical Processes

The standard device in learning theory for bounding the generalization error of a prediction function is to control the empirical process over a function space, i.e., the deviations of empirical means from their expectation values. We thus define some convenient, if abstract, notation here. Let $Z_1, \ldots, Z_n$ be a sequence of $\mathcal{Z}$-valued random variables (generally dependent), and $\mathcal{H}$ a class of real-valued functions on $\mathcal{Z}$. We define the empirical mean or sample mean as
$$\hat{h}_n \equiv \frac{1}{n}\sum_{t=1}^{n} h(Z_t)$$
and the expectation value as
$$\mathbb{E}[h] \equiv \mathbb{E}_{Z_1^n}\bigl[\hat{h}_n\bigr] = \frac{1}{n}\sum_{t=1}^{n} \mathbb{E}_{Z_t}[h(Z_t)].$$
If the $Z_t$ are i.i.d., then $\mathbb{E}[h] = \mathbb{E}_{Z_1}[h(Z_1)]$. The empirical process at $h$ is $\gamma_n(h) = \mathbb{E}[h] - \hat{h}_n$. (Some authors would include an overall scaling factor of $\sqrt{n}$.) We care particularly about the supremum of the empirical process:
$$\Gamma_n(\mathcal{H}) \equiv \sup_{h \in \mathcal{H}} \gamma_n(h).$$

2.3 Rademacher Complexity

The Rademacher complexity of a function class is, in essence, how well it can (seem to) match pure noise. The formal definition is (after Bartlett and Mendelson 2002):

Definition 5 (I.i.d. Rademacher Complexity). Let $Z_1^n$ be a $\mathcal{Y}$-valued i.i.d. sequence, and $\mathcal{H}$ a real-valued class of functions on $\mathcal{Y}$. The empirical Rademacher complexity of $\mathcal{H}$ on $Z_1^n$ is
$$\widehat{R}_n(\mathcal{H}) \equiv \mathbb{E}_{\xi}\left[\sup_{h \in \mathcal{H}} \frac{2}{n}\sum_{t=1}^{n} \xi_t\, h(Z_t)\right].$$
(Some definitions have an absolute value inside the supremum, after Bartlett and Mendelson (2002), but others avoid it, even the same authors in later work, e.g., Bartlett et al. 2005. As the eventual proof demonstrates, it isn't required, so we drop it.) The Rademacher complexity of $\mathcal{H}$ is the expectation of the empirical Rademacher complexity over $Z$:
$$R_n(\mathcal{H}) \equiv \mathbb{E}_Z\bigl[\widehat{R}_n(\mathcal{H})\bigr].$$

Rademacher complexity matters because it is closely related to the supremum of the empirical process over $\mathcal{H}$. Specifically, $\mathbb{E}_Z[\Gamma_n(\mathcal{H})] \le R_n(\mathcal{H})$. Its utility is that $\mathbb{E}_Z[\Gamma_n(\mathcal{H})]$ is almost never expressible, but one of $\widehat{R}_n(\mathcal{H})$ or $R_n(\mathcal{H})$ may be, thus allowing control of the generalization error with meaningful quantities. The main burden of our paper is to show that, if $\mathbf{Y}$ is stationary and ergodic rather than i.i.d., we have the same result, though with a more involved proof. We rehearse the (now standard) i.i.d. Rademacher generalization error bound and its proof using our notation in the Supplement because of its importance for our own development.
This definition of i.i.d. Rademacher complexity will, it turns out, work for stationary processes almost unchanged.

Definition 6 (Rademacher Complexity). Let $Y_1^n$ be a time series generated from $\mathbb{P}$. The empirical Rademacher complexity of the real-valued function class $\mathcal{H}$ on $Y_1^n$ is
$$\widehat{R}_n(\mathcal{H}) \equiv \mathbb{E}_{\xi}\left[\sup_{h \in \mathcal{H}} \frac{2}{n}\sum_{t=1}^{n} \xi_t\, h_t(Y_1^t)\right].$$
The Rademacher complexity is the expectation of the empirical Rademacher complexity:
$$R_n(\mathcal{H}) \equiv \mathbb{E}_{Y_1^n}\bigl[\widehat{R}_n(\mathcal{H})\bigr].$$

The term inside the supremum, $\frac{1}{n}\sum_{t=1}^{n} \xi_t h_t(Y_1^t)$, is the sample covariance between the noise $\xi$ and the values of a particular function sequence $h$. The Rademacher complexity takes the largest value of this sample covariance over all models in the class (mimicking empirical risk minimization), then averages over realizations of the noise.

Relative to the i.i.d. Rademacher complexity, we have indexed the predictor $h$ with a time-dependent subscript. For time series, the goal is to forecast $Y_{t+1}$ from the history $Y_1^t$. Since a function $\mathcal{Y}^t \mapsto \mathcal{Y}$ is not, technically, the same as a function $\mathcal{Y}^{t+1} \mapsto \mathcal{Y}$, one must, strictly speaking, use a different prediction function at each time step. A single predictive model $h$ is thus implemented as a whole series of functions $h_t : \mathcal{Y}^t \mapsto \mathcal{Y}$. With some abuse of notation, we will write $h$ for the name of this whole sequence of functions. (If we wanted to be purists, we could introduce a parameter space $\Theta$, not necessarily finite-dimensional, and consider the collection of prediction functions $g_t(Y_1^t; \theta)$ for all $t$.) We emphasize that the sequence $h_1, h_2, \ldots$ does not represent infinitely many individually-learnable functions, but rather stages of a single function sequence $h$.

Intuitively, Rademacher complexity shows how well our models could seem to fit outcomes which were really just noise, giving a baseline against which to assess over-fitting or failing to generalize. Since $\mathbf{Y}$ is stationary and ergodic, and $\xi$ is i.i.d. and independent of $\mathbf{Y}$, the joint process $(\mathbf{Y}, \xi)$ is also stationary and ergodic. Thus, by the ergodic tower property (van Handel, 2014), for a fixed function sequence $h$, the sample covariance tends to zero almost surely:
$$\frac{1}{n}\sum_{t=1}^{n} \xi_t\, h_t(Y_1^t) \to \mathbb{E}_{\mathbf{Y},\xi}[\xi\, h(\mathbf{Y})] = \mathbb{E}_{\xi}[\xi]\,\mathbb{E}_{\mathbf{Y}}[h(\mathbf{Y})] = 0.$$
The overall Rademacher complexity should also shrink, though more slowly, unless the model class is so flexible that it can fit absolutely anything, in which case we can infer nothing about how well it will predict in the future from the fact that it performed well in the past. Showing that this heuristic reasoning is valid, and that the Rademacher complexity of Definition 6 continues to control the empirical process when forecasting stationary time series, is the main aim of our paper.

We note that Kuznetsov and Mohri (2014, 2015) prove generalization error bounds for the forecasting risk under non-stationarity with and without mixing assumptions, but these results rely on the intricate sequential complexities introduced by Rakhlin et al. (2010), which replace the outer expectation over the observations $Y_1^n$ with a supremum over such observations. (§4.1 carefully compares these results and ours.)
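Before turning to forecast risk, here is a minimal sketch (ours, with hypothetical names, not from the paper) of the convention just described: a single predictive model is realized as a sequence of maps $h_t$, implemented in practice as one rule that accepts a history of any length.

```python
import numpy as np

# A minimal sketch (not from the paper) of the convention in Definition 6:
# one "model" g is really a sequence of maps g_t : Y^t -> Y, here realized by a
# single rule that accepts a history of any length.  Names are hypothetical.
class HistoryMeanForecaster:
    """g_t(y_1, ..., y_t) = mean of the last `memory` observations (or all of them)."""
    def __init__(self, memory=None):
        self.memory = memory  # None means use the entire history

    def predict(self, history):
        history = np.asarray(history, dtype=float)
        if self.memory is not None:
            history = history[-self.memory:]
        return history.mean() if history.size else 0.0

# Losses h_t(Z_t) = loss(Y_t, g_t(Y_1^{t-1})), with the convention Y_1^0 = empty.
def losses(y, model, loss=lambda a, b: abs(a - b)):
    return np.array([loss(y[t], model.predict(y[:t])) for t in range(len(y))])

y = np.array([0.2, 0.4, 0.1, 0.5])
print(losses(y, HistoryMeanForecaster(memory=2)))
```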
2.4 Forecast Risk

In classification or regression problems, we obtain data points $Z_t = (X_t, Y_t)$, and the goal is to predict one part of the data, $Y_t$, from the other, $X_t$. The risk of a prediction function $g : \mathcal{X} \mapsto \mathcal{Y}$ can be sensibly defined as an expectation over data points:
$$R(g) = \mathbb{E}_{X,Y}[\ell(Y, g(X))].$$
This risk is well-defined so long as the marginal distribution of the data is shift-invariant ($\mathcal{L}(Z_t) = \mathcal{L}(Z_1)$ for all $t$). For an i.i.d. data source, it is of course true that
$$\mathbb{E}_{X_{n+1}, Y_{n+1}}[\ell(Y_{n+1}, g(X_{n+1})) \mid X_1^n, Y_1^n] = \mathbb{E}_{X_{n+1}, Y_{n+1}}[\ell(Y_{n+1}, g(X_{n+1}))] = R(g),$$
so that averaging over the marginal distribution of the next data point indicates the expected loss of continuing to use the predictor $g$ on new data. This is no longer true for dependent data. However, for a stationary ergodic source, one has that (Shalizi and Kontorovich, 2013)
$$\lim_{m \to \infty} \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{X_{n+i}, Y_{n+i}}[\ell(Y_{n+i}, g(X_{n+i})) \mid X_1^n, Y_1^n] = \mathbb{E}_{X,Y}[\ell(Y, g(X))] = R(g),$$
so the expectation over new data would still be a good indicator of long-run performance.

All of this is subtly changed for time series, where the goal is to forecast $Y_{t+1}$ from the history $Y_1^t$. As discussed above, we abuse notation and denote a single predictive model $g$ even though it really represents a sequence of functions $g_t : \mathcal{Y}^t \mapsto \mathcal{Y}$.

Definition 7 (Forecast risk). Given a stationary and ergodic stochastic process $\mathbf{Y}$, and a loss function $\ell$, the finite-history risk of the predictive model $g$ is
$$R_n(g) \equiv \mathbb{E}_{Y_1^n}\left[\frac{1}{n}\sum_{t=1}^{n} \ell\bigl(Y_t, g_t(Y_1^{t-1})\bigr)\right],$$
and the forecast risk is $R(g) = \lim_{n\to\infty} R_n(g)$, when the limit exists.

For brevity, we introduce the notation $Z_t \equiv (Y_t, Y_1^{t-1})$ and $h_t(Z_t) \equiv \ell(Y_t, g_t(Y_1^{t-1}))$, defining $Y_1^0 \equiv \emptyset$. The forecast risk can thus also be written as $\lim_{n\to\infty} \mathbb{E}\bigl[n^{-1}\sum_{t=1}^{n} h_t(Z_t)\bigr]$. So $R(g)$, again, captures the long-run average cost of using the predictive model $g$. By contrast, $R_n(g)$ is the average risk of $g$ if used on an independent realization of $Y_1^n$.

Having an infinite-time limit in the definition of forecast risk is irksome. It can be evaded if the predictive model has only a finite memory length $d \ge 0$, so that nothing more than $d$ time steps old matters for predictions (formally, $g_t(Y_1^t) \in \sigma(Y_{t-d}^t)$ for all $t > d$). Then, by stationarity, we may simplify
$$R(g) = \mathbb{E}_{Y_1^{d+1}}\bigl[\ell(Y_{d+1}, g(Y_1^d))\bigr].$$
In fact, in the finite-memory-length case, as soon as $n > d$,
$$R(g) = \frac{1}{n-d}\sum_{t=d+1}^{n} \mathbb{E}_{Y_1^n}\bigl[\ell(Y_{t+1}, g(Y_{t-d}^{t-1}))\bigr] = R_n(g),$$
and it follows from the ergodic theorem that $\frac{1}{n-d}\sum_{t=d+1}^{n} \ell(Y_{t+1}, g(Y_{t-d}^{t-1})) \to R(g)$ almost surely.

Predictive models with infinite-range memories are, however, actually fairly common in forecasting practice, including not just hidden Markov models but also things as basic as moving-average models. We therefore posit that $R(g)$ exists for such models, denoting the gap between the forecast risk and the finite-history risk by $\Delta_n(g) \equiv R(g) - R_n(g)$. We also posit that the time-averaged loss converges to the forecast risk: $\frac{1}{n}\sum_{t=1}^{n} h_t(Z_t) \to R(g)$. (If the loss function is the negative log-likelihood, this posit is the generalized asymptotic equipartition property, or Shannon-McMillan-Breiman theorem; Algoet and Cover 1988; Gray 1990.)
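As an illustration of the finite-memory case (our own sketch, not from the paper), take the two-state chain from §2.1 with the memory-1 forecaster $g(Y_1^{t-1}) = Y_{t-1}$ and 0-1 loss. By stationarity its forecast risk is $R(g) = \mathbb{P}(Y_t \ne Y_{t-1}) = 0.1$, and the time-averaged loss converges to it, as the ergodic theorem predicts.

```python
import numpy as np

# A minimal sketch (not from the paper): a memory-1 forecaster g(Y_{t-1}) = Y_{t-1}
# on the two-state chain, under 0-1 loss.  Its forecast risk is
# R(g) = P(Y_t != Y_{t-1}) = 0.1, and the time-averaged loss converges to it.
rng = np.random.default_rng(1)

def sample_chain(n, p_stay=0.9):
    y = np.empty(n, dtype=int)
    y[0] = rng.integers(0, 2)
    for t in range(1, n):
        y[t] = y[t - 1] if rng.random() < p_stay else 1 - y[t - 1]
    return y

y = sample_chain(10**5)
avg_loss = np.mean(y[1:] != y[:-1])   # (1/(n-1)) * sum_t 1{Y_t != g(Y_{t-1})}
print(avg_loss)                        # close to R(g) = 0.1
```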
With finite amounts of data, we thus focus on control of $R_n(g)$. Whether $\Delta_n(g) \to 0$ is a property of both the function class and the dependence structure, and is outside our scope, though see van Handel (2014) for related discussion.

3 Risk Bounds

Generalization error bounds follow from deriving high-probability upper bounds on the quantity
$$\Gamma_n(\mathcal{H}) := \sup_{h \in \mathcal{H}} \; R_n(h) - \widehat{R}_n(h),$$
which is the worst-case difference between the true risk $R_n(h)$ and the empirical risk $\widehat{R}_n(h)$ over all functions in the class of losses $\mathcal{H} = \{h = \ell(\cdot, g(\cdot)) : g \in \mathcal{G}\}$ defined over a particular class of prediction functions $\mathcal{G}$. We first present our main result, which bounds $\mathbb{E}_Z[\Gamma_n(\mathcal{H})]$ with the Rademacher complexity, and discuss its proof. We then use our Rademacher bound to derive risk bounds for time-series forecasters which are fully calculable from data.

3.1 Stationary Rademacher Bounds

The symmetrization arguments used to prove Rademacher bounds for the i.i.d. case fail for time series prediction. However, as we now show, for stationary time series, bounds of the same form are still valid, albeit with a somewhat more involved proof. This is in contrast to the far more intricate constructions needed to establish bounds using generalized Rademacher complexities for online learning (Rakhlin et al., 2010, 2011) or for non-stationary processes (Kuznetsov and Mohri, 2015). (We give more detailed contrasts in §4.1.) Our first principal result is simply:

Theorem 8. For a time series prediction problem based on a sequence $Y_1^n$, $\mathbb{E}[\Gamma_n(\mathcal{H})] \le R_n(\mathcal{H})$.

[Figure 1 omitted: two binary trees, one for $Z(\xi)$ and one for $\tilde{Z}(\xi)$, with branches labelled by $\xi_t = \pm 1$.]
Figure 1: This figure displays the tree structures for $Z(\xi)$ and $\tilde{Z}(\xi)$ with $\xi_1 = 1$ (for example). The path along each tree is determined by the $\xi$ sequence, interleaving the "past" between paths. The version with $\xi_1 = -1$ would exchange $Z_1$ for $\tilde{Z}_1$ at the root of each tree.

We note here that unless $\sup_{h \in \mathcal{H}} \|h\|_{\infty} < \infty$, $R_n(\mathcal{H}) = \infty$ by its definition. Thus, this result, like all Rademacher results, is only useful with bounded predictors or losses. Of course, if $\sup_{h \in \mathcal{H}} \|h\|_{\infty} = \infty$, the theorem holds trivially.

The standard proof for i.i.d. classification or regression introduces a "ghost sample", an independent sample of size $n$ from the same distribution that produced the original data, before using a symmetrization argument. For forecasting, however, introducing an independent copy of the original time series will not produce the necessary symmetry. Rather, following an idea introduced by Rakhlin et al. (2010, 2011) for dealing with adversarial data, we work with a tangent sequence, where the surrogate value introduced at each time point is conditioned on the actual time series up to that point. That is, the tangent sequence $\tilde{\mathbf{Y}}$ is defined recursively: $\mathcal{L}(\tilde{Y}_1) = \mathcal{L}(Y_1)$, and $\mathcal{L}(\tilde{Y}_t \mid Y_1^{t-1}) = \mathcal{L}(Y_t \mid Y_1^{t-1})$. Furthermore, $\tilde{Y}_t$ is independent of all other $\tilde{Y}$'s and of all $Y$'s, conditional on $Y_1^{t-1}$. (In directed graphical model terms, $Y_1^{t-1}$ are the parents of $\tilde{Y}_t$, which has no children. See Figure 1.) The time series $\mathbf{Y}$ and the tangent sequence do not have the same joint distributions. (For example, let $Y_1$ be 0 or 1 with equal probability, and $Y_{t+1} = Y_t$ with probability 0.9 and $= 1 - Y_t$ otherwise; $\mathbf{Y}$ is a stationary and ergodic Markov chain. Because $\tilde{Y}_1 \perp\!\!\!\perp \tilde{Y}_2 \mid Y_1$, the probability that $\tilde{Y}_2 = \tilde{Y}_1$ is not 0.9 but 0.5.)
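The following minimal simulation (ours, not the authors' code) checks that parenthetical example numerically: for the two-state chain, $\mathbb{P}(Y_2 = Y_1) = 0.9$, while $\mathbb{P}(\tilde{Y}_2 = \tilde{Y}_1) = 0.5$, because $\tilde{Y}_1$ and $\tilde{Y}_2$ are drawn independently given $Y_1$.

```python
import numpy as np

# A minimal sketch (not from the paper) of the tangent sequence on the
# two-state chain: tilde{Y}_1 ~ L(Y_1) and, for t >= 2,
# tilde{Y}_t ~ L(Y_t | Y_1^{t-1}) is drawn afresh given the *observed* past,
# independently of everything else.  The joint law changes: P(Y_2 = Y_1) = 0.9
# while P(tilde{Y}_2 = tilde{Y}_1) = 0.5.
rng = np.random.default_rng(2)

def step(prev, p_stay=0.9):
    return prev if rng.random() < p_stay else 1 - prev

reps, agree_y, agree_tilde = 100_000, 0, 0
for _ in range(reps):
    y1 = rng.integers(0, 2)
    y2 = step(y1)                 # actual next observation
    ty1 = rng.integers(0, 2)      # tilde{Y}_1 ~ L(Y_1), independent of Y_1
    ty2 = step(y1)                # tilde{Y}_2 ~ L(Y_2 | Y_1), given the real Y_1
    agree_y += (y2 == y1)
    agree_tilde += (ty2 == ty1)

print(agree_y / reps, agree_tilde / reps)   # roughly 0.9 and 0.5
```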
Proof of Thm. 8. With both the original time series $\mathbf{Y}$ and the tangent sequence $\tilde{\mathbf{Y}}$ in hand, we construct $Z_t$ and $\tilde{Z}_t$ variables as follows: $Z_t \equiv (Y_t, Y_1^{t-1})$ and $\tilde{Z}_t \equiv (\tilde{Y}_t, Y_1^{t-1})$ (with the convention that $Z_1 = Y_1$, $\tilde{Z}_1 = \tilde{Y}_1$). Notice that $\tilde{Z}$ combines the original time series and its tangent sequence, but in such a way that $\mathcal{L}(Z_t) = \mathcal{L}(\tilde{Z}_t)$. Furthermore, since $\tilde{Y}_t \perp\!\!\!\perp Y_t^n \mid Y_1^{t-1}$, it follows that $\tilde{Z}_t \perp\!\!\!\perp Z_t^n \mid Z_1^{t-1}$. As $R_n(h) = \mathbb{E}_Z\bigl[\frac{1}{n}\sum_{t=1}^{n} h_t(Z_t)\bigr]$ for some $h \in \mathcal{H} = \ell \circ \mathcal{G}$, we may equally well write the risk in terms of the tangent sequence: $R_n(h) = \frac{1}{n}\sum_{t=1}^{n} \mathbb{E}_{\tilde{Z}_t}[h_t(\tilde{Z}_t)]$. Therefore,
$$\begin{aligned}
\mathbb{E}[\Gamma_n(\mathcal{H})] &= \mathbb{E}_{Z}\left[\sup_{h\in\mathcal{H}}\left(\mathbb{E}_{Z}\left[\frac{1}{n}\sum_{i=1}^{n} h(Z_i)\right] - \frac{1}{n}\sum_{i=1}^{n} h(Z_i)\right)\right] \\
&= \mathbb{E}_{Z}\left[\sup_{h\in\mathcal{H}}\left(\mathbb{E}_{\tilde Z}\left[\frac{1}{n}\sum_{i=1}^{n} h(\tilde Z_i)\right] - \frac{1}{n}\sum_{i=1}^{n} h(Z_i)\right)\right] \\
&\le \mathbb{E}_{Z,\tilde Z}\left[\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} h(\tilde Z_i) - h(Z_i)\right] \quad\text{(Jensen's inequality)} \\
&= \mathbb{E}_{Z_1 \tilde Z_1}\,\mathbb{E}_{Z_2\mid Z_1,\; \tilde Z_2\mid \tilde Z_1} \cdots \mathbb{E}_{Z_n\mid Z_{n-1},\ldots,Z_1,\; \tilde Z_n\mid \tilde Z_{n-1},\ldots,\tilde Z_1}\left[\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} h(\tilde Z_i) - h(Z_i)\right] \quad\text{(iterated expectation).} \qquad (1)
\end{aligned}$$

Now, due to dependence, Rademacher variables must be introduced carefully, as in the adversarial case. Rademacher variables create two tree structures, one associated to the $Z$ sequence and one associated to the $\tilde{Z}$ sequence (see Rakhlin et al., 2010, 2011 for a thorough treatment). We write these trees as $Z(\xi)$ and $\tilde{Z}(\xi)$, where $\xi$ is a particular sequence of Rademacher variables, e.g., $(1, -1, -1, 1, \ldots, 1)$, which creates a path along each tree. For example, consider $\xi = \mathbf{1}$ (all ones). Then $Z(\xi) = (Z_1, \ldots, Z_n)$ and $\tilde{Z}(\xi) = (\tilde{Z}_1, \ldots, \tilde{Z}_n)$, the "right" path of both tree structures. For $\xi = -\mathbf{1}$, $Z(\xi) = (\tilde{Z}_1, \ldots, \tilde{Z}_n)$ and $\tilde{Z}(\xi) = (Z_1, \ldots, Z_n)$, the "left" path of both tree structures. Changing $\xi_i$ from $+1$ to $-1$ exchanges $Z_i$ for $\tilde{Z}_i$ in both trees and chooses the left child of $Z_{i-1}$ and $\tilde{Z}_{i-1}$ rather than the right child. Figure 1 displays both trees.

In order to talk about the probability of $Z_i$ conditional on the "past" in the tree, we need to know the path taken so far. For this, we define a selector function
$$\chi(\xi) := \chi(\xi, \rho, \varrho) = \rho\, I(\xi = 1) + \varrho\, I(\xi = -1),$$
which returns its second argument when $\xi = 1$ and its third when $\xi = -1$. Distributions over trees then become the objects of interest. Contrary to the online-learning scenario, the dependence between future and past means the adversary is not free to change predictors and responses separately. Once a branch of the tree is chosen, the distribution of future data points is fixed, and depends only on the preceding sequence. Because of this, the joint distribution of any path along the tree is the same as any other path, i.e., for any two paths $\xi, \xi'$, $\mathcal{L}(Z(\xi)) = \mathcal{L}(Z(\xi'))$ and $\mathcal{L}(\tilde{Z}(\xi)) = \mathcal{L}(\tilde{Z}(\xi'))$.
Similarly, due to the construction of the tangent sequence, we have that $\mathcal{L}(Z(\xi)) = \mathcal{L}(\tilde{Z}(\xi))$. This equivalence between paths allows us to introduce Rademacher variables swapping $Z_i$ for $\tilde{Z}_i$, as well as the ability to combine terms below:
$$\begin{aligned}
(1) &= \mathbb{E}_{Z_1 \tilde Z_1}\,\mathbb{E}_{\xi_1}\,\mathbb{E}_{Z_2\mid\chi(\xi_1, Z_1, \tilde Z_1),\; \tilde Z_2\mid\chi(\xi_1, \tilde Z_1, Z_1)}\,\mathbb{E}_{\xi_2}\cdots \mathbb{E}_{Z_n\mid\chi(\xi_{n-1}),\ldots,\chi(\xi_1),\; \tilde Z_n\mid\chi(\xi_{n-1}),\ldots,\chi(\xi_1)}\,\mathbb{E}_{\xi_n}\left[\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} \xi_i\bigl(h(\tilde Z_i) - h(Z_i)\bigr)\right] \\
&= \mathbb{E}_{Z, \tilde Z, \xi}\left[\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} \xi_i\bigl(h(\tilde Z_i) - h(Z_i)\bigr)\right] \\
&\le \mathbb{E}_{Z, \xi}\left[\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} \xi_i h(Z_i)\right] + \mathbb{E}_{\tilde Z, \xi}\left[\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} \xi_i h(\tilde Z_i)\right] \\
&= 2\,\mathbb{E}_{Z, \xi}\left[\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} \xi_i h(Z_i)\right] = R_n(\mathcal{H}).
\end{aligned}$$

Rademacher complexity gets its utility from bounding the prediction risk of forecasters. For i.i.d. data, the main tools for proving risk bounds are the inequalities of Hoeffding (1963) and McDiarmid (1989). Extensions of learning theory to dependent data have relied on strong mixing properties to approximate weakly-dependent processes by i.i.d. ones, recovering i.i.d. results with a reduced effective sample size. We instead use generalizations of Hoeffding and McDiarmid to dependent sequences based on results of van de Geer (2002), which do not need mixing at all. Rather than deriving bounds under the condition $\sup_{h\in\mathcal{H}} \|h\|_{\infty} < \infty$, we use a weaker hypothesis on the tails of conditional distributions. We first state this more general result, giving the bounded case as a corollary. We discuss the concentration bound and its derivation in §3.3.

Theorem 9. Suppose that there exist constants $\tau$ and $c$ such that
$$\mathbb{E}\left[\psi\!\left(\frac{|\Gamma_n(\mathcal{H})|}{c}\right) \,\Big|\, \mathcal{F}_0^t\right] \le \tau \quad \forall t, \qquad (2)$$
where $\psi(x) = \exp(x^2) - 1$. Then, for $\epsilon > 0$ and $n$ large enough, for all $h \in \mathcal{H}$, with probability at least $1 - \delta$,
$$R_n(h) \le \widehat{R}_n(h) + R_n(\mathcal{H}) + 4c(\tau+1)\sqrt{\frac{2\log 1/\delta}{n}}.$$

The following corollary is immediate by noting that $\sup_{h\in\mathcal{H}} \|h\|_{\infty} \le M < \infty$ implies that $\mathbb{E}\bigl[\exp\bigl((|\Gamma_n|/M)^2\bigr)\bigr] \le e < 3$. We made no effort to optimize the constant before the confidence penalty.

Corollary 10. If $\sup_{h\in\mathcal{H}} \|h\|_{\infty} \le M$, then for all $h \in \mathcal{H}$, with probability at least $1 - \delta$,
$$R_n(h) \le \widehat{R}_n(h) + R_n(\mathcal{H}) + 12M\sqrt{\frac{2\log 1/\delta}{n}}.$$

3.2 Empirical Rademacher Bounds

Unfortunately, $R_n(\mathcal{H})$ may itself be hard or impossible to calculate for some classes $\mathcal{H}$. However, under our assumptions, we show that $R_n(\mathcal{H})$ is closely approximated by the empirical Rademacher complexity. That is, the same data can estimate both $\widehat{R}_n$ and $R_n(\mathcal{H})$.

Theorem 11 (Empirical Rademacher Complexity Bound). Assume eq. (2) holds. Then, for all $h \in \mathcal{H}$, with probability at least $1 - \delta$,
$$R_n(h) \le \widehat{R}_n(h) + \widehat{R}_n(\mathcal{H}) + 12c(\tau+1)\sqrt{\frac{2\log 2/\delta}{n}}.$$

To apply Thm. 11, we can estimate $\widehat{R}_n(\mathcal{H})$ by drawing $m$ independent Rademacher samples of size $n$, and use
$$\frac{2}{mn}\sum_{i=1}^{m} \sup_{h\in\mathcal{H}} \sum_{t=1}^{n} \xi_{ti}\, h_t(Y_1^t) \approx \widehat{R}_n(\mathcal{H}). \qquad (3)$$
The approximation is $O(1/m)$-accurate. Thus, given one sample of data, the entire risk bound is fully calculable. If $R_n(\mathcal{H})$ is known (the case for many common classes $\mathcal{H}$, see Section 4.2), we may apply Thm. 9. For any other class of predictors, we can estimate the complexity with (3) and apply Thm. 11.
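A minimal sketch (ours, not the authors' code) of the estimator in eq. (3) for a finite model class, where the supremum is just a maximum over candidates; the constant forecasters and the clipped absolute loss are illustrative assumptions.

```python
import numpy as np

# A minimal Monte Carlo sketch of eq. (3) for a finite model class:
# draw m independent Rademacher samples of size n and average
# (2/n) * sup_h sum_t xi_t * h_t(Y_1^t).  Names are hypothetical.
rng = np.random.default_rng(3)

def empirical_rademacher(loss_matrix, m=2000):
    """loss_matrix[j, t] = h_t(Y_1^t) for model j; returns the eq. (3) estimate."""
    k, n = loss_matrix.shape
    xi = rng.choice([-1.0, 1.0], size=(m, n))     # m Rademacher samples
    # For each xi sample, the supremum over the (finite) class is a max over rows.
    sups = (xi @ loss_matrix.T).max(axis=1)       # shape (m,)
    return 2.0 * sups.mean() / n

# Example: losses of three constant forecasters g_c(history) = c on data y,
# with absolute loss clipped to [0, 1] so the class is bounded.
y = rng.random(200)
constants = np.array([0.25, 0.5, 0.75])
loss_matrix = np.minimum(np.abs(y[None, :] - constants[:, None]), 1.0)
print(empirical_rademacher(loss_matrix))
```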
Finally, we present a corollary for the case that $\sup_{h\in\mathcal{H}} \|h\|_{\infty} < \infty$.

Corollary 12. If $\sup_{h\in\mathcal{H}} \|h\|_{\infty} \le M < \infty$, then for all $h \in \mathcal{H}$, with probability at least $1 - \delta$,
$$R_n(h) \le \widehat{R}_n(h) + \widehat{R}_n(\mathcal{H}) + 36M\sqrt{\frac{2\log 2/\delta}{n}}.$$

Both of these results can be seen as penalizing the empirical risk with a term that accounts for the complexity of $\mathcal{H}$, along with a second penalty for the amount of confidence we require.

3.3 Necessary Concentration Inequalities

For i.i.d. data, the main tools for developing risk bounds are the inequalities of Hoeffding (1963) and McDiarmid (1989). As discussed above, extensions of learning theory to dependent data have relied on strong mixing properties to approximate weakly-dependent processes by i.i.d. ones, and so recover the i.i.d. results with a reduced effective sample size. We will instead use a generalization applying to dependent sequences based on results due to van de Geer (2002), which do not require mixing at all.

We need some conditions on the tails of the random variables. Suppose that $X_t$ is a martingale, i.e., a real-valued $\mathcal{F}_0^t$-measurable random variable satisfying $\mathbb{E}[X_t \mid \mathcal{F}_0^{t-1}] = 0$, with the convention that $\mathcal{F}_1^0$ is trivial. For a constant $c$, define
$$B_n^2 = \sum_{t=1}^{n} c^2\left(1 + \mathbb{E}\left[\psi\!\left(\frac{|X_t|}{c}\right)\,\Big|\,\mathcal{F}_0^{t-1}\right]\right),$$
where $\psi(x) = \exp(x^2) - 1$. Essentially, controlling $B_n^2$ by bounding the expectation of $\psi(|X_t|/c)$ controls the tails of $X_t$. The function $\psi$ can be any non-decreasing, convex function satisfying $\psi(0) = 0$, but the use of $\psi(x) = \exp(x^2) - 1$ is most common. In general, $\inf\{c > 0 : \mathbb{E}[\psi(|X_t|/c)] \le 1\}$ is referred to as the Orlicz norm of $X_t$, denoted $\|X_t\|_{\psi}$. In the simplest case, if $c < \infty$ and $\mathbb{E}[X_t] = 0$, it holds that $\mathbb{P}(|X_t| > x) \le 2\exp(-x^2/c^2)$. This is the definition of sub-Gaussian tails: $X_t$ has tails which decrease at least as quickly as those of a standard Gaussian random variable. In particular, bounded random variables satisfy this condition. As our data come from a time-dependent process $\mathbf{Y}$, we require the conditional version of this idea.

Lemma 13 (van de Geer 2002, Theorem 2.2). Suppose $X_t$ is a martingale. Then, for all $\epsilon > 0$, $b > 0$, for $n$ large enough,
$$\mathbb{P}\left(\sum_{t=1}^{n} X_t \ge \epsilon \;\text{ and }\; B_n^2 \le b^2\right) \le \exp\{-\epsilon^2 / 8b^2\}.$$

Table 1: Comparison of existing risk bounds. We use the notation polylog(n) to mean $\log^k(n)$ for some $k > 0$.

| Assumptions | Reference | Complexity | Calculable | Best-case convergence rate |
| --- | --- | --- | --- | --- |
| I.i.d. | (many, e.g., Bartlett and Mendelson, 2002) | Rademacher | Yes | $O(\sqrt{1/n})$ |
| Stationary & mixing | (Mohri and Rostamizadeh, 2009) | Blocked Rademacher | If β-mixing coefs are known | $O(\sqrt{\mathrm{polylog}(n)/n})$ |
| Stationary, non-mixing | This paper | Rademacher | Yes | $O(\sqrt{1/n})$ |
| Non-stationary & mixing | (Kuznetsov and Mohri, 2014, 2017) | Blocked or Sequential Rademacher | If β-mixing coefs are known | $O(\sqrt{\mathrm{polylog}(n)/n})$ |
| Non-stationary, non-mixing | (Kuznetsov and Mohri, 2015) | Expected covering number | Depends on $\mathcal{H}$ | $O(\sqrt{\mathrm{polylog}(n)/n})$ |
| Adversarial | (Rakhlin et al., 2010, 2011, 2015) | Sequential Rademacher | Depends on $\mathcal{H}$ | $O(\sqrt{\mathrm{polylog}(n)/n})$ |

This result generalizes Hoeffding's inequality to the case of conditionally sub-Gaussian random variables from a dependent sequence. As long as the tails of the next observation are well controlled conditional on the past, we can still control the size of deviations from the mean with high probability.
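For concreteness, here is the short calculation behind the remark that bounded variables satisfy this condition (a sketch of ours, matching the constant quoted before Corollary 10): if $|X_t| \le M$ almost surely, take $c = M$.

```latex
% A short check (ours, following the remark before Corollary 10):
% if |X_t| <= M almost surely, take c = M in the Orlicz-type condition.
\[
  \mathbb{E}\!\left[\psi\!\left(\tfrac{|X_t|}{M}\right) \,\Big|\, \mathcal{F}_0^{t-1}\right]
  = \mathbb{E}\!\left[e^{X_t^2/M^2} - 1 \,\Big|\, \mathcal{F}_0^{t-1}\right]
  \le e - 1 \approx 1.718,
\]
% so the hypothesis of the next theorem holds with c = M and tau = e - 1 < 2,
% matching the claim that bounded variables are (conditionally) sub-Gaussian.
```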
We now present the following extension, analogous to McDiarmid's inequality, but for dependent sequences with sub-Gaussian tails (rather than bounded differences).

Theorem 14. Let $X_t$ be $\mathcal{F}_0^t$-measurable with
$$\mathbb{E}\left[\psi\!\left(\frac{|X_t|}{c}\right)\,\Big|\,\mathcal{F}_0^{t-1}\right] \le \tau, \qquad (4)$$
for some $\tau > 0$ and all $t > 0$. Then for all $\epsilon > 0$ and $n$ large enough,
$$\mathbb{P}(X_n - \mathbb{E}[X_n] > \epsilon) \le \exp\left\{-\frac{\epsilon^2}{32 n c^2 (\tau+1)^2}\right\}.$$

Thm. 14 can be generalized to allow both $c$ and $\tau$ to depend on $t$, with appropriate modifications as in Lem. 13. Typically, we would expect better control over the tails as we condition on more data, resulting in a decreasing sequence of $\tau$, though we will not pursue this generality further here. Because we were unable to find a comparable result in the literature, and this one may be useful in its own right, we have chosen to include it here. The proof is given in the Supplement.

4 Discussion

In this section, we give a careful explanation, situating our results in the context of existing bounds. We then provide a few simple (standard) examples of cases in which our bounds are calculable, as well as a generalized algorithm for classes which don't admit calculable expected Rademacher complexities. Finally, we conclude.

4.1 Relationship with Existing Work

As discussed in the introduction, existing work has developed risk bounds for dependent data under a number of assumptions which are more or less general than ours. In order to give context for our results, we compare the assumptions and benefits of each of these here. This comparison is summarized in Table 1.

The first risk bounds for time series are, like our result, based on standard Rademacher complexities. Mohri and Rostamizadeh (2009) assume that $\mathbf{Y}$ is a stationary β-mixing process. Like our results (Cor. 10 and Cor. 12), they are able to prove bounds based on both the expected and empirical Rademacher complexities. Their results, however, do not apply to the full time-series forecasting setting we present here: predictions in their setting may depend only on a fixed lag $d$ of previous observations. Furthermore, both the Rademacher complexity and the confidence penalty depend on blocks of data rather than individual data points. The number of blocks, $\mu$, then replaces $n$ in both terms, where $\mu$ depends on the unknown mixing coefficients. Thus, convergence rates are slightly slower (because the size of the blocks should increase with $n$, $\mu$ must be sublinear in $n$) and cannot be directly calculated without knowledge of the mixing coefficients. McDonald et al. (2011, 2015) give an estimator for the mixing coefficients with nearly parametric rates, though bounds which replace known coefficients with estimates have not been derived. Our results subsume the stationary and mixing results because our convergence rate is faster without assuming any type of asymptotic decay of dependence.

Alternatively, Rakhlin et al. (2010, 2011, 2015) develop truly ingenious techniques for an adversarial data generating process, a much more general condition wherein not only is the process potentially non-stationary and non-mixing, but subsequent data points may be chosen based on previous predictions to make the learner perform as poorly as possible.
These results rely instead on the sequential Rademacher complexity, defined in our notation as
$$R_n^{\mathrm{seq}}(\mathcal{H}) = \sup_{\mathbf{Z}}\,\mathbb{E}_{\xi}\left[\sup_{h\in\mathcal{H}} \frac{2}{n}\sum_{t=1}^{n} \xi_t\, h(Z_t(\xi))\right],$$
where the outer supremum is taken over all $\mathcal{Y}$-valued trees of depth $n$. Because their results are more general, one could simply apply them to our setting. However, $R_n^{\mathrm{seq}}(\mathcal{H})$ is more difficult to calculate than $R_n(\mathcal{H})$, is looser, and does not admit an empirical version (analogous to our Cor. 12) because it replaces the outer expectation over $Z$ with a supremum.

Finally, work on non-stationary, mixing processes (Kuznetsov and Mohri, 2014, 2017) and non-stationary, non-mixing processes (Kuznetsov and Mohri, 2015) has also appeared. In the mixing case, the complexity is either the blocked version as in Mohri and Rostamizadeh (2009), adjusted to handle non-stationarity, or the sequential complexity above with an additional discrepancy penalty which "measures" non-stationarity in view of $\mathcal{H}$. The discrepancy measure can be calculated from data, as can the blocked Rademacher complexity, though again, the mixing coefficients cannot. The non-stationary, non-mixing setting replaces Rademacher complexities with an expected sequential covering number. This results in bounds which are looser than ours by poly-logarithmic factors in $n$. If the covering number can be computed for the function class $\mathcal{H}$ of interest, then these results are wholly calculable, but if the class does not have a known covering number, there is no analogue to Cor. 12 which can be estimated from the given data.

Thus, the benefits of our work are that, if we are willing to assume stationarity, our results are tighter than previous results, easier to calculate based on known expected Rademacher formulas, and admit empirical Rademacher complexities which can always be calculated given sufficient computational resources. None of these benefits require untestable mixing assumptions or knowledge of the associated coefficients.

4.2 Examples and Algorithms

In some cases, the expected (or empirical) Rademacher complexity is easily calculated from data. In these cases, one can derive simple algorithms for time-series prediction. Our first two examples give, for clarity, complete risk bounds for algorithms which predict future observations based on $d$ previous observations. These follow from results of Bartlett and Mendelson (2002).

Consider first the case of a two-layer neural network which makes predictions based on $d$ previous values, and let $\mathcal{Y} = \mathbb{R}^p$. Suppose that the activation function $\sigma : \mathbb{R} \to [-1, 1]$ is 1-Lipschitz with $\sigma(0) = 0$. For $v_i \in \mathbb{R}^{pd}$, define
$$\mathcal{G}_N = \left\{ y \mapsto \sum_i w_i\, \sigma(v_i \cdot y) \;:\; \|w\|_1 \le 1,\; \|v_i\|_1 \le 1 \right\}.$$
Suppose further that $\ell$ is 1-Lipschitz. Then
$$\widehat{R}_n(\ell \circ \mathcal{G}_N) \le \frac{2c \log^{1/2}(pd)}{n} \max_{1 \le j, j' \le p} \sqrt{\sum_{i=1}^{n-d} (y_{ij} - y_{ij'})^2}$$
for some $c > 0$. Thus, $\widehat{R}_n(\ell \circ \mathcal{G}_N) = O_P(n^{-1/2})$ as usual. The Lipschitz conditions and norm constraints can easily be exchanged for other constants without altering the rate, and the number of layers is easily altered.

Consider now regularized kernel methods. Suppose $\ell$ is $M$-Lipschitz and consider the class
$$\mathcal{G}_K = \left\{ y \mapsto w \cdot \Phi(y) \;:\; \|w\|_{\Psi} \le B^2 \right\},$$
where $\Phi(y) : \mathcal{Y} \to \Psi$ is the feature map associated with the Hilbert space $\Psi$, $k$ is the corresponding kernel function, and $\|\cdot\|_{\Psi}$ denotes the norm in $\Psi$. Then, we have that
$$\widehat{R}_n(\ell \circ \mathcal{G}_K) \le \frac{4MB}{n}\sqrt{\sum_{i=1}^{n-d} k(y_i, y_i)} = O_P(n^{-1/2}).$$

Finally, using Cor. 12, we can derive a generic empirical risk minimization-type (ERM) algorithm for learning without any knowledge of complexity measurements. Algorithm 1 shows how to choose a predictor from among a collection of bounded function classes $\mathcal{H}_1, \ldots, \mathcal{H}_k$.

Algorithm 1: Generic ERM Algorithm
Input: data $Y_1^n$, models $\mathcal{H}_1, \ldots, \mathcal{H}_k$, integer $m$
for $i = 1$ to $k$ do
    Estimate a predictor $h_i \in \mathcal{H}_i$ as usual.
    Compute the training error $\widehat{R}_n(h_i)$.
    Compute $\widehat{R}_n(\mathcal{H}_i)$ using (3).
end for
Choose $i^* = \arg\min_i \widehat{R}_n(h_i) + \widehat{R}_n(\mathcal{H}_i)$.
Return $h_{i^*}$ and $\widehat{R}_n(h_{i^*}) + \widehat{R}_n(\mathcal{H}_{i^*})$, and calculate the complexity penalty to form the bound in Cor. 12.
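The sketch below (ours, not the authors' implementation) shows one way Algorithm 1 might be realized, assuming each bounded class is represented by a finite grid of candidate forecasters and using the Monte Carlo estimator (3) for the complexity penalty; all names and the clipped absolute loss are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def empirical_rademacher(loss_matrix, m=2000):
    """Eq. (3): average of (2/n) * sup_h sum_t xi_t h_t over m Rademacher draws."""
    k, n = loss_matrix.shape
    xi = rng.choice([-1.0, 1.0], size=(m, n))
    return 2.0 * (xi @ loss_matrix.T).max(axis=1).mean() / n

def loss_matrix(y, models, clip=1.0):
    """Row j holds h_t(Y_1^t): clipped absolute loss of candidate j at each time t."""
    return np.array([[min(abs(y[t] - g(y[:t])), clip) for t in range(len(y))]
                     for g in models])

# Hypothetical bounded classes, each approximated by a finite grid of candidates:
# H_1 = constant forecasters, H_2 = damped "repeat last value" forecasters.
classes = [
    [lambda hist, c=c: c for c in np.linspace(0.0, 1.0, 11)],
    [lambda hist, a=a: a * hist[-1] if len(hist) else 0.0
     for a in np.linspace(0.0, 1.0, 11)],
]

y = rng.random(300)   # stand-in data; in practice Y_1^n is the observed series
best = None
for H in classes:                              # "for i = 1 to k do"
    L = loss_matrix(y, H)
    train_err = L.mean(axis=1)                 # empirical risk of each candidate
    h_idx = int(train_err.argmin())            # ERM within the class
    penalized = train_err[h_idx] + empirical_rademacher(L)
    if best is None or penalized < best[0]:
        best = (penalized, H[h_idx])           # "choose i* = argmin ..."
print("penalized empirical risk of the chosen model:", best[0])
```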
4.3 Conclusion

In this paper, we have demonstrated how to control the generalization of time series prediction algorithms. These methods use some or all of the observed past to predict future values of the same series. In order to handle the complicated Rademacher complexity bound for the expectation, we have followed the approach used in the online learning case pioneered by Rakhlin et al. (2010, 2011), but we show that in our particular case, much of the structure needed to deal with the adversary is unnecessary. This results in clean risk bounds which have a form similar to the i.i.d. case. As these results take expectations over $Y_1^n$ rather than a supremum, empirical counterparts which are estimable can also be derived. Extending our results to local Rademacher complexities with faster convergence rates is left for future work.

References

Algoet, P. H., and Cover, T. M. (1988), "A sandwich proof of the Shannon-McMillan-Breiman theorem," Annals of Probability, 16, 899-909.

Bartlett, P. L., and Mendelson, S. (2002), "Rademacher and Gaussian complexities: Risk bounds and structural results," Journal of Machine Learning Research, 3, 463-482.

Bartlett, P. L., Bousquet, O., and Mendelson, S. (2005), "Local Rademacher complexities," The Annals of Statistics, 33(4), 1497-1537.

Dynkin, E. B. (1978), "Sufficient statistics and extreme points," Annals of Probability, 6, 705-730.

Gray, R. M. (1990), Entropy and Information Theory, Springer-Verlag, New York.

Gray, R. M. (2009), Probability, Random Processes, and Ergodic Properties, Springer-Verlag, New York, second edn.

Hoeffding, W. (1963), "Probability inequalities for sums of bounded random variables," Journal of the American Statistical Association, 58, 13-30.

Kuznetsov, V., and Mohri, M. (2014), "Generalization bounds for time series prediction with non-stationary processes," in International Conference on Algorithmic Learning Theory, pp. 260-274.

Kuznetsov, V., and Mohri, M. (2015), "Learning theory and algorithms for forecasting non-stationary time series," in Advances in Neural Information Processing Systems 28, eds. C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, pp. 541-549.

Kuznetsov, V., and Mohri, M. (2017), "Generalization bounds for non-stationary mixing processes," Machine Learning, 106(1), 93-117.

McDiarmid, C. (1989), "On the method of bounded differences," in Surveys in Combinatorics, ed. J. Siemons, pp. 148-188, Cambridge, England, Cambridge University Press.

McDonald, D. J., Shalizi, C. R., and Schervish, M. (2011), "Estimating beta-mixing coefficients," in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics [AISTATS 2011], eds. G. Gordon, D. Dunson, and M. Dudík, vol. 15 of Journal of Machine Learning Research: Workshops and Conference Proceedings, pp. 516-524.
McDonald, D. J., Shalizi, C. R., and Schervish, M. (2015), "Estimating beta-mixing coefficients via histograms," Electronic Journal of Statistics, 9, 2855-2883.

Mohri, M., and Rostamizadeh, A. (2009), "Rademacher complexity bounds for non-i.i.d. processes," in Advances in Neural Information Processing Systems 21 [NIPS 2008], eds. D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, pp. 1097-1104.

Rakhlin, A., Sridharan, K., and Tewari, A. (2010), "Online learning: Random averages, combinatorial parameters, and learnability," in Advances in Neural Information Processing Systems 23 [NIPS 2010], eds. J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, pp. 1984-1992, Cambridge, Massachusetts, MIT Press.

Rakhlin, A., Sridharan, K., and Tewari, A. (2011), "Online learning: Stochastic and constrained adversaries," in Advances in Neural Information Processing Systems 24 [NIPS 2011], eds. J. Shawe-Taylor, R. S. Zemel, P. Bartlett, F. C. N. Pereira, and K. Q. Weinberger, pp. 1764-1772.

Rakhlin, A., Sridharan, K., and Tewari, A. (2015), "Sequential complexities and uniform martingale laws of large numbers," Probability Theory and Related Fields, 161(1/2), 111-153.

Shalizi, C., and Kontorovich, A. (2013), "Predictive PAC learning and process decompositions," in Advances in Neural Information Processing Systems 26, eds. C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, pp. 1619-1627.

van de Geer, S. A. (2002), "On Hoeffding's inequality for dependent random variables," in Empirical Process Techniques for Dependent Data, eds. H. Dehling, T. Mikosch, and M. Sorensen, pp. 161-169, Birkhäuser, Boston.

van Handel, R. (2014), "Ergodicity, decisions, and partial information," in Séminaire de Probabilités XLVI, eds. C. Donati-Martin, A. Lejay, and A. Rouault, pp. 411-459, Springer.

Wiener, N. (1956), "Nonlinear prediction and dynamics," in Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, ed. J. Neyman, vol. 3, pp. 247-252, Berkeley, University of California Press.

A Additional Proofs

Proposition (Standard i.i.d. Rademacher bound). If $Z_1, \ldots, Z_n$ is an i.i.d. sample from some probability distribution $P$, then $\mathbb{E}_Z[\Gamma_n(\mathcal{H})] \le R_n(\mathcal{H})$.

Proof. The usual proof introduces a "ghost sample" $\tilde{Z}_1^n$, where the $\tilde{Z}_t$ have the same distribution as the $Z_t$, but are independent of the latter and of each other. Then expectations may as well be taken over the ghost sample as the real one: $\mathbb{E}[h] = \mathbb{E}_{\tilde{Z}_1}[h(\tilde{Z}_1)] = \frac{1}{n}\sum_{t=1}^{n} \mathbb{E}_{\tilde{Z}}[h(\tilde{Z}_t)]$. Hence (using the notation from §2.2)
$$\gamma_n(h) = \frac{1}{n}\sum_{t=1}^{n} \mathbb{E}_{\tilde Z}[h(\tilde Z_t)] - \frac{1}{n}\sum_{t=1}^{n} h(Z_t) = \frac{1}{n}\sum_{t=1}^{n} \mathbb{E}_{\tilde Z}\bigl[h(\tilde Z_t) - h(Z_t)\bigr],$$
$$\Gamma_n(\mathcal{H}) \le \mathbb{E}_{\tilde Z}\left[\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{t=1}^{n} h(\tilde Z_t) - h(Z_t)\right], \qquad (5)$$
and
$$\mathbb{E}_Z[\Gamma_n(\mathcal{H})] \le \mathbb{E}_{Z, \tilde Z}\left[\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{t=1}^{n} h(\tilde Z_t) - h(Z_t)\right]. \qquad (6)$$
Eq. 5 holds because the supremum of expectations is less than or equal to the expected supremum, and Eq. 6 just takes the expectation of both sides with respect to $Z$. Since $Z_t$ and $\tilde{Z}_t$ have the same marginal distribution and are independent, $\mathcal{L}\bigl(h(\tilde{Z}_t) - h(Z_t)\bigr) = \mathcal{L}\bigl(h(Z_t) - h(\tilde{Z}_t)\bigr)$, and the signs of summands in Eq. 6 can be flipped arbitrarily, according to the Rademacher variables, without effect:
$$\begin{aligned}
\mathbb{E}_Z[\Gamma_n(\mathcal{H})] &\le \mathbb{E}_{Z,\tilde Z,\xi}\left[\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{t=1}^{n} \xi_t\bigl(h(\tilde Z_t) - h(Z_t)\bigr)\right] \\
&\le \mathbb{E}_{Z,\xi}\left[\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{t=1}^{n} \xi_t h(Z_t)\right] + \mathbb{E}_{\tilde Z,\xi}\left[\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{t=1}^{n} \xi_t h(\tilde Z_t)\right] \\
&= 2\,\mathbb{E}_{Z,\xi}\left[\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{t=1}^{n} \xi_t h(Z_t)\right] = R_n(\mathcal{H}).
\end{aligned}$$

Proof of Thm. 9. This result follows immediately from Thm. 14 upon setting the right-hand side equal to $\delta$ and solving for $\epsilon$.

Proof of Thm. 11. Write $\mathbf{h} \in \mathbb{R}^n$ for the vector $(h_1(Z_1), \ldots, h_n(Z_n))$. Note that, as the range of $\ell$ is $\mathbb{R}^+$, $\mathbf{h}$ lies in the non-negative orthant of $\mathbb{R}^n$ ($\mathbf{h} \ge 0$). Now,
$$\begin{aligned}
n\Gamma_n &= \sup_{h\in\mathcal{H}}\left(\sum_{t=1}^{n} h_t - \mathbb{E}_Z\left[\sum_{t=1}^{n} h_t\right]\right) \\
&\ge \sup_{h\in\mathcal{H}}\sum_{t=1}^{n} h_t - \sup_{h\in\mathcal{H}}\mathbb{E}_Z\left[\sum_{t=1}^{n} h_t\right] \quad\text{(property of sup)} \\
&\ge \sup_{h\in\mathcal{H}}\sum_{t=1}^{n} h_t - \mathbb{E}_Z\left[\sup_{h\in\mathcal{H}}\sum_{t=1}^{n} h_t\right] \quad\text{(Jensen's ineq.)} \\
&= \sup_{\mathbf{h}} \mathbf{1}^{\top}\mathbf{h} - \mathbb{E}_Z\left[\sup_{\mathbf{h}} \mathbf{1}^{\top}\mathbf{h}\right] \\
&\ge \mathbb{E}_{\xi}\left[\sup_{\mathbf{h}} \xi^{\top}\mathbf{h}\right] - \mathbb{E}_Z\left[\sup_{\mathbf{h}} \mathbf{1}^{\top}\mathbf{h}\right] = \frac{n}{2}\widehat{R}_n(\mathcal{H}) - K,
\end{aligned}$$
where $K$ is a constant. Therefore, $\mathbb{E}_Z\bigl[\frac{n}{2}\widehat{R}_n(\mathcal{H}) - K\bigr] = \frac{n}{2}R_n(\mathcal{H}) - K$. Since $\psi$ is increasing in its argument and we assumed that $n\Gamma_n$ satisfied eq. (2) for constants $c$ and $\tau$, we can apply Thm. 14 with $Z_n = \widehat{R}_n(\mathcal{H}) - K$ and constants $c \to 2c/n$ and $\tau$ as before. Thus,
$$\mathbb{P}\bigl(\widehat{R}_n(\mathcal{H}) - R_n(\mathcal{H}) > \epsilon\bigr) \le \exp\left\{-\frac{n\epsilon^2}{128 c^2(\tau+1)^2}\right\}.$$
Setting the right-hand side equal to $\delta/2$ and combining with Thm. 9 applied with $\delta \to \delta/2$, via the union bound, gives the result.

Proof of Thm. 14. Write $X_n - \mathbb{E}[X_n] = \sum_{t=1}^{n} W_t$, where $W_t = \mathbb{E}[X_n \mid \mathcal{F}_0^t] - \mathbb{E}[X_n \mid \mathcal{F}_0^{t-1}]$ for all $t = 1, \ldots, n$. Then $W_t$ is $\mathcal{F}_0^t$-measurable, and $\mathbb{E}[W_t \mid \mathcal{F}_0^{t-1}] = 0$ for all $t$. Now, let $K > 0$ be chosen below. Then
$$\begin{aligned}
\mathbb{E}\bigl[\psi(|W_t|/K) \mid \mathcal{F}_0^{t-1}\bigr]
&= \mathbb{E}\left[\psi\!\left(\frac{\bigl|\mathbb{E}[X_n\mid\mathcal{F}_0^t] - \mathbb{E}[X_n\mid\mathcal{F}_0^{t-1}]\bigr|}{K}\right) \,\Big|\, \mathcal{F}_0^{t-1}\right] \\
&= \mathbb{E}\left[\exp\left\{\frac{1}{K^2}\bigl(\mathbb{E}[X_n\mid\mathcal{F}_0^t] - \mathbb{E}[X_n\mid\mathcal{F}_0^{t-1}]\bigr)^2\right\} - 1 \,\Big|\, \mathcal{F}_0^{t-1}\right] \\
&\le \mathbb{E}\left[\exp\left\{\frac{2}{K^2}\Bigl(\mathbb{E}[X_n\mid\mathcal{F}_0^t]^2 + \mathbb{E}[X_n\mid\mathcal{F}_0^{t-1}]^2\Bigr)\right\} - 1 \,\Big|\, \mathcal{F}_0^{t-1}\right] \\
&= \mathbb{E}\left[\exp\left\{\left(\frac{|\mathbb{E}[X_n\mid\mathcal{F}_0^t]|}{K/\sqrt{2}}\right)^2\right\}\exp\left\{\left(\frac{|\mathbb{E}[X_n\mid\mathcal{F}_0^{t-1}]|}{K/\sqrt{2}}\right)^2\right\} - 1 \,\Big|\, \mathcal{F}_0^{t-1}\right] \\
&= \exp\left\{\left(\frac{|\mathbb{E}[X_n\mid\mathcal{F}_0^{t-1}]|}{K/\sqrt{2}}\right)^2\right\}\mathbb{E}\left[\exp\left\{\left(\frac{|\mathbb{E}[X_n\mid\mathcal{F}_0^t]|}{K/\sqrt{2}}\right)^2\right\} \,\Big|\, \mathcal{F}_0^{t-1}\right] - 1 \\
&\le \mathbb{E}\left[\exp\left\{\left(\frac{|X_n|}{K/\sqrt{2}}\right)^2\right\} \,\Big|\, \mathcal{F}_0^{t-1}\right]\mathbb{E}\left[\mathbb{E}\left[\exp\left\{\left(\frac{|X_n|}{K/\sqrt{2}}\right)^2\right\} \,\Big|\, \mathcal{F}_0^{t}\right] \,\Big|\, \mathcal{F}_0^{t-1}\right] - 1 \\
&= \mathbb{E}\left[\exp\left\{\left(\frac{|X_n|}{K/\sqrt{2}}\right)^2\right\} \,\Big|\, \mathcal{F}_0^{t-1}\right]\mathbb{E}\left[\exp\left\{\left(\frac{|X_n|}{K/\sqrt{2}}\right)^2\right\} \,\Big|\, \mathcal{F}_0^{t-1}\right] - 1 \\
&= \mathbb{E}\left[\exp\left\{\left(\frac{|X_n|}{K/\sqrt{2}}\right)^2\right\} \,\Big|\, \mathcal{F}_0^{t-1}\right]^2 - 1 \;\le\; (\tau+1)^2 - 1
\end{aligned}$$
for $K = c\sqrt{2}$. Therefore, we have
$$B_n^2 = \sum_{t=1}^{n} 2c^2\left(1 + \mathbb{E}\bigl[\psi\bigl(|W_t|/(\sqrt{2}c)\bigr) \,\big|\, \mathcal{F}_0^{t-1}\bigr]\right) \le 2nc^2(\tau+1)^2,$$
and so
$$\mathbb{P}(X_n - \mathbb{E}[X_n] > \epsilon) = \mathbb{P}\left(\sum_{t=1}^{n} W_t > \epsilon\right) \le \exp\left\{-\frac{\epsilon^2}{32 n c^2(\tau+1)^2}\right\},$$
by Lem. 13.
