Rademacher complexity of stationary sequences

We show how to control the generalization error of time series models wherein past values of the outcome are used to predict future values. The results are based on a generalization of standard i.i.d. concentration inequalities to dependent data without the mixing assumptions common in the time series setting.

Authors: Daniel J. McDonald, Cosma Rohilla Shalizi

Rademacher Complexity of Stationary Sequences

Daniel J. McDonald, Department of Statistics, Indiana University, dajmcdon@indiana.edu
Cosma Rohilla Shalizi, Department of Statistics, Carnegie Mellon University, cshalizi@cmu.edu

Version: May 24, 2017

Abstract

We show how to control the generalization error of time series models wherein past values of the outcome are used to predict future values. The results are based on a generalization of standard i.i.d. concentration inequalities to dependent data without the mixing assumptions common in the time series setting. Our proof and the result are simpler than previous analyses with dependent data or stochastic adversaries, which use sequential Rademacher complexities rather than the expected Rademacher complexity for i.i.d. processes. We also derive empirical Rademacher results without mixing assumptions, resulting in fully calculable upper bounds.

1 Introduction

Statistical learning theory aims to bound the out-of-sample performance of prediction rules induced from finite data sets. The classical situation is where one wishes to predict one variable $Y \in \mathcal{Y}$ from another $X \in \mathcal{X}$, and has a training set of $n$ pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$, assumed to be drawn i.i.d. from a distribution $\nu$ that will also generate future instances. Provided with a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ and a class of prediction functions $\mathcal{G}$, where each $g \in \mathcal{G}$ is a map from $\mathcal{X}$ to $\mathcal{Y}$, the usual goal is to bound the supremum of the empirical process of the losses,
$$\sup_{g \in \mathcal{G}} \; \mathbb{E}_{\nu}[\ell(Y, g(X))] - \frac{1}{n}\sum_{i=1}^{n} \ell(Y_i, g(X_i)).$$
Such bounds involve some notion of the flexibility or complexity of the model space $\mathcal{G}$, and a particularly important one is the Rademacher complexity,
$$R_n(\mathcal{G}) = 2\,\mathbb{E}_{\xi,\nu}\left[\sup_{g \in \mathcal{G}} \frac{1}{n}\sum_{i=1}^{n} \xi_i\, \ell(Y_i, g(X_i))\right],$$
where the $\xi_i$ are a sequence of i.i.d. variables taking the values $+1$ and $-1$ with equal probability (see §2.3 for a fuller statement). While the Rademacher complexity was first used to bound generalization error for i.i.d. processes (e.g., Bartlett and Mendelson, 2002), it has been extended to situations where the $(X_i, Y_i)$ pairs are dependent but $(X_i, Y_i)$ becomes independent of $(X_j, Y_j)$ as $|i - j| \to \infty$ (Mohri and Rostamizadeh, 2009), and even to adversarial settings, where the data source actively tries to fool the learner (Rakhlin et al., 2010).

We build on the latter work to extend Rademacher complexity to the rather different problem of time-series forecasting. In that setting, we observe a single sequence of random variables $Y_1, Y_2, \ldots, Y_n$ (for short, $Y_1^n$), taking values in $\mathcal{Y}$, and wish to learn a function which extrapolates the sequence into the future, to forecast (say) the next value $Y_{n+1}$. (Going beyond "one-step-ahead" forecasting, to longer horizons or whole blocks, involves mostly notational changes, which we will not note explicitly.) Given a predictor $g : \mathcal{Y}^n \mapsto \mathcal{Y}$, a natural notion of generalization error for time series, the forecasting risk, is
$$R(g) \equiv \mathbb{E}\left[\ell(Y_{n+1}, g(Y_1^n)) \mid Y_1^n\right].$$
While a precise statement needs some care (§2.4), we will show that forecasting risk, like the generalization error of classification and regression problems, can be bounded via the Rademacher complexity, despite the rather different nature of the problem. In particular, we are able to use the standard Rademacher complexity. Our result is comparable to that in Rakhlin et al. (2015, §9), which gives a bound for time series prediction in Banach spaces.
Their result, however, is a consequence of results for the more difficult problem of prediction under stochastic adversaries. As such, our bound and the proof are simpler and tighter, though they apply to an easier (but still highly relevant) prediction task.

§2 gives background material essential for stating our results, on time series, model complexity, and forecasting risk. §3 derives risk bounds for time series, giving a novel proof that the standard Rademacher complexity characterizes the flexibility of $\mathcal{G}$, even under stationarity, with concentration inequalities for non-mixing dependent variables. §4 carefully compares our results to others in the literature, sketches applications and algorithms, and concludes.

2 Time Series, Complexity, and Concentration of Measure

We introduce some of the concepts needed for our results: stationarity and ergodicity are required to control generalization error (unless we aim to predict only a single new observation); Rademacher complexity measures the flexibility of the model space $\mathcal{G}$; forecasting risk measures the quality of a time-series prediction rule.

Notation. $\mathbf{Y} = \{Y_t\}_{t=-\infty}^{\infty}$ is a sequence of random variables, i.e., each $Y_t$ is a measurable mapping from some probability space $(\Omega, \mathcal{F}, \mathbb{P})$ into a measurable space $\mathcal{Y}$. We write $Y_i^j$ for the block $\{Y_t\}_{t=i,\ldots,j}$ from the random sequence; either limit may be infinity. The $\sigma$-field generated by the block $Y_i^j$ is $\mathcal{F}_i^j$. $\mathcal{L}(W)$ denotes the probability law of the random object $W$, and $\mathcal{L}(W \mid V)$ the conditional law of $W$ given $V$. Finally, if $W$ has distribution $\nu$ and $f$ is a measurable function, we define $\mathbb{E}_{\nu}[f(W)] = \mathbb{E}_W[f(W)] = \int f \, d\nu$. We will try to use whichever notation is clearest in context, sticking to $\mathbb{E}[f(W)]$ when that is unambiguous.

2.1 Stationarity and Ergodicity

We assume $\mathbf{Y}$ is (strictly or strongly) stationary.

Definition 1 (Stationarity). A random sequence $\mathbf{Y}$ is stationary when all its finite-dimensional distributions are time invariant: for all $t$ and all $i \ge 0$, $\mathcal{L}(Y_t^{t+i}) = \mathcal{L}(Y_0^i)$.

Stationarity does not require the random variables $Y_t$ to be independent across time, but does imply they all have the same distribution. The infinite-dimensional distribution of $\mathbf{Y}$, $\mathcal{L}(\mathbf{Y})$, is a probability measure on $\mathcal{Y}^{\infty}$. In this space, the time-evolution of the process is just the shift map $\tau$, which "moves the sequence a step to the right": $(\tau \mathbf{Y})_t = Y_{t+1}$.

Definition 2 (Ergodicity). A set $A \subset \mathcal{Y}^{\infty}$ is shift-invariant, $A \in \mathcal{I}$, when $\tau^{-1}A = A$. A probability measure $\mu$ on $\mathcal{Y}^{\infty}$ is ergodic when shift-invariant sets have either probability 0 or probability 1, i.e., $A \in \mathcal{I}$ only if $\mu(A) = 0$ or $\mu(A) = 1$.

Ergodicity is important for two reasons. The first is that it implies a law of large numbers for time series.

Proposition 3 (Individual Ergodic Theorem; Gray 2009). If $\mu$ is stationary and ergodic, and $f \in L_1$, then the time-average of $f$ converges to its expectation $\mu$-almost-surely. That is, the set of $y \in \mathcal{Y}^{\infty}$ such that $\frac{1}{n}\sum_{t=0}^{n-1} f(\tau^t y) \to \mathbb{E}_{\mu}[f(\mathbf{Y})]$ has $\mu$-measure 1.
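As a quick illustration of Proposition 3, here is a minimal simulation sketch (ours, not from the paper) using a two-state Markov chain that also appears later as an example in §3.1: $Y_1$ is 0 or 1 with equal probability, and $Y_{t+1} = Y_t$ with probability 0.9. Started from its uniform stationary distribution, the chain is stationary and ergodic, so time averages of $f(y) = y$ converge to $\mathbb{E}[Y] = 1/2$.

```python
import numpy as np

# A minimal sketch (not from the paper) illustrating Proposition 3.
# Two-state chain: Y_1 ~ Uniform{0,1}; Y_{t+1} = Y_t with prob. 0.9, else flipped.
# It is stationary (uniform start) and ergodic, so time averages of f(y) = y
# converge almost surely to E[Y] = 0.5.
rng = np.random.default_rng(0)

def sample_chain(n, p_stay=0.9):
    y = np.empty(n, dtype=int)
    y[0] = rng.integers(0, 2)
    for t in range(1, n):
        y[t] = y[t - 1] if rng.random() < p_stay else 1 - y[t - 1]
    return y

for n in (10**2, 10**4, 10**6):
    print(n, sample_chain(n).mean())   # approaches 0.5 as n grows
```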
The second reason is that every stationary process $\mathbf{Y}$ decomposes into a mixture of stationary and ergodic processes, and each realization of $\mathbf{Y}$ comes from just one of these ergodic components.

Proposition 4 (Ergodic Decomposition; Dynkin 1978; Gray 2009). If $\rho$ is a stationary but not ergodic distribution on $\mathcal{Y}^{\infty}$, then $\rho = \int \mu \, d\pi(\mu)$, where $\pi$ is a measure on the space of stationary and ergodic processes. Moreover, for any $f \in L_1$, $\frac{1}{n}\sum_{t=0}^{n-1} f(\tau^t y) \to \mathbb{E}_{\rho}[f(\mathbf{Y}) \mid \mathcal{I}]$ for $\rho$-almost-all trajectories $y$.

In words, to generate a trajectory from a stationary, non-ergodic process, first pick a stationary ergodic process (according to the distribution $\pi$), and then generate $\mathbf{Y}$ from that process. To sum up, then, if we assume that the data source is stationary, and that we only get to see a single trajectory from it, there is no loss of generality in also assuming that the source is ergodic, and so the strong law of large numbers, in the form of Prop. 3, applies. Non-ergodicity would only be relevant if we were to consider multiple independent trajectories from the same stationary process, which might sample different ergodic components (Wiener, 1956).

2.2 Empirical Processes

The standard device in learning theory for bounding the generalization error of a prediction function is to control the empirical process over a function space, i.e., the deviations of empirical means from their expectation values. We thus define some convenient, if abstract, notation here. Let $Z_1, \ldots, Z_n$ be a sequence of $\mathcal{Z}$-valued random variables (generally dependent), and $\mathcal{H}$ a class of real-valued functions on $\mathcal{Z}$. We define the empirical mean or sample mean as
$$\hat{h}_n \equiv \frac{1}{n}\sum_{t=1}^{n} h(Z_t)$$
and the expectation value as
$$\mathbb{E}[h] \equiv \mathbb{E}_{Z_1^n}\bigl[\hat{h}_n\bigr] = \frac{1}{n}\sum_{t=1}^{n} \mathbb{E}_{Z_t}[h(Z_t)].$$
If the $Z_t$ are i.i.d., then $\mathbb{E}[h] = \mathbb{E}_{Z_1}[h(Z_1)]$. The empirical process at $h$ is $\gamma_n(h) = \mathbb{E}[h] - \hat{h}_n$. (Some authors would include an overall scaling factor of $\sqrt{n}$.) We care particularly about the supremum of the empirical process:
$$\Gamma_n(\mathcal{H}) \equiv \sup_{h \in \mathcal{H}} \gamma_n(h).$$

2.3 Rademacher Complexity

The Rademacher complexity of a function class is, in essence, how well it can (seem to) match pure noise. The formal definition is (after Bartlett and Mendelson 2002):

Definition 5 (I.i.d. Rademacher Complexity). Let $Z_1^n$ be a $\mathcal{Y}$-valued i.i.d. sequence, and $\mathcal{H}$ a real-valued class of functions on $\mathcal{Y}$. The empirical Rademacher complexity of $\mathcal{H}$ on $Z_1^n$ is
$$\widehat{R}_n(\mathcal{H}) \equiv \mathbb{E}_{\xi}\left[\sup_{h \in \mathcal{H}} \frac{2}{n}\sum_{t=1}^{n} \xi_t\, h(Z_t)\right].$$
(Some definitions have an absolute value inside the supremum, after Bartlett and Mendelson (2002), but others avoid it, even the same authors in later work, e.g., Bartlett et al. 2005. As the eventual proof demonstrates, it isn't required, so we drop it.) The Rademacher complexity of $\mathcal{H}$ is the expectation of the empirical Rademacher complexity over $Z$:
$$R_n(\mathcal{H}) \equiv \mathbb{E}_Z\bigl[\widehat{R}_n(\mathcal{H})\bigr].$$

Rademacher complexity matters because it is closely related to the supremum of the empirical process over $\mathcal{H}$. Specifically, $\mathbb{E}_Z[\Gamma_n(\mathcal{H})] \le R_n(\mathcal{H})$. Its utility is that $\mathbb{E}_Z[\Gamma_n(\mathcal{H})]$ is almost never expressible, but one of $\widehat{R}_n(\mathcal{H})$ or $R_n(\mathcal{H})$ may be, thus allowing control of the generalization error with meaningful quantities. The main burden of our paper is to show that, if $\mathbf{Y}$ is stationary and ergodic rather than i.i.d., we have the same result, though with a more involved proof. We rehearse the (now standard) i.i.d. Rademacher generalization error bound and its proof using our notation in the Supplement because of its importance for our own development.
This definition of i.i.d. Rademacher complexity will, it turns out, work for stationary processes almost unchanged.

Definition 6 (Rademacher Complexity). Let $Y_1^n$ be a time series generated from $\mathbb{P}$. The empirical Rademacher complexity of the real-valued function class $\mathcal{H}$ on $Y_1^n$ is
$$\widehat{R}_n(\mathcal{H}) \equiv \mathbb{E}_{\xi}\left[\sup_{h \in \mathcal{H}} \frac{2}{n}\sum_{t=1}^{n} \xi_t\, h_t(Y_1^t)\right].$$
The Rademacher complexity is the expectation of the empirical Rademacher complexity:
$$R_n(\mathcal{H}) \equiv \mathbb{E}_{Y_1^n}\bigl[\widehat{R}_n(\mathcal{H})\bigr].$$

The term inside the supremum, $\frac{1}{n}\sum_{t=1}^{n} \xi_t h_t(Y_1^t)$, is the sample covariance between the noise $\xi$ and the values of a particular function sequence $h$. The Rademacher complexity takes the largest value of this sample covariance over all models in the class (mimicking empirical risk minimization), then averages over realizations of the noise.

Relative to the i.i.d. Rademacher complexity, we have indexed the predictor $h$ with a time-dependent subscript. For time series, the goal is to forecast $Y_{t+1}$ from the history $Y_1^t$. Since a function $\mathcal{Y}^t \mapsto \mathcal{Y}$ is not, technically, the same as a function $\mathcal{Y}^{t+1} \mapsto \mathcal{Y}$, one must, strictly speaking, use a different prediction function at each time step. A single predictive model $h$ is thus implemented as a whole series of functions $h_t : \mathcal{Y}^t \mapsto \mathcal{Y}$. With some abuse of notation, we will write $h$ for the name of this whole sequence of functions. (If we wanted to be purists, we could introduce a parameter space $\Theta$, not necessarily finite-dimensional, and consider the collection of prediction functions $g_t(Y_1^t; \theta)$ for all $t$.) We emphasize that the sequence $h_1, h_2, \ldots$ does not represent infinitely many individually-learnable functions, but rather stages of a single function sequence $h$.

Intuitively, Rademacher complexity shows how well our models could seem to fit outcomes which were really just noise, giving a baseline against which to assess over-fitting or failing to generalize. Since $\mathbf{Y}$ is stationary and ergodic, and $\xi$ is i.i.d. and independent of $\mathbf{Y}$, the joint process $(\mathbf{Y}, \xi)$ is also stationary and ergodic. Thus, by the ergodic tower property (van Handel, 2014), for a fixed function sequence $h$, the sample covariance tends to zero almost surely:
$$\frac{1}{n}\sum_{t=1}^{n} \xi_t\, h_t(Y_1^t) \to \mathbb{E}_{\mathbf{Y},\xi}[\xi\, h(\mathbf{Y})] = \mathbb{E}_{\xi}[\xi]\,\mathbb{E}_{\mathbf{Y}}[h(\mathbf{Y})] = 0.$$
The overall Rademacher complexity should also shrink, though more slowly, unless the model class is so flexible that it can fit absolutely anything, in which case we can infer nothing about how well it will predict in the future from the fact that it performed well in the past. Showing that this heuristic reasoning is valid, and that the Rademacher complexity of Definition 6 continues to control the empirical process when forecasting stationary time series, is the main aim of our paper.

We note that Kuznetsov and Mohri (2014, 2015) prove generalization error bounds for the forecasting risk under non-stationarity with and without mixing assumptions, but these results rely on the intricate sequential complexities introduced by Rakhlin et al. (2010), which replace the outer expectation over the observations $Y_1^n$ with a supremum over such observations. (§4.1 carefully compares these results and ours.)
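Before turning to forecast risk, here is a minimal sketch (ours, with hypothetical names, not from the paper) of the convention just described: a single predictive model is realized as a sequence of maps $h_t$, implemented in practice as one rule that accepts a history of any length.

```python
import numpy as np

# A minimal sketch (not from the paper) of the convention in Definition 6:
# one "model" g is really a sequence of maps g_t : Y^t -> Y, here realized by a
# single rule that accepts a history of any length.  Names are hypothetical.
class HistoryMeanForecaster:
    """g_t(y_1, ..., y_t) = mean of the last `memory` observations (or all of them)."""
    def __init__(self, memory=None):
        self.memory = memory  # None means use the entire history

    def predict(self, history):
        history = np.asarray(history, dtype=float)
        if self.memory is not None:
            history = history[-self.memory:]
        return history.mean() if history.size else 0.0

# Losses h_t(Z_t) = loss(Y_t, g_t(Y_1^{t-1})), with the convention Y_1^0 = empty.
def losses(y, model, loss=lambda a, b: abs(a - b)):
    return np.array([loss(y[t], model.predict(y[:t])) for t in range(len(y))])

y = np.array([0.2, 0.4, 0.1, 0.5])
print(losses(y, HistoryMeanForecaster(memory=2)))
```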
2.4 Forecast Risk

In classification or regression problems, we obtain data points $Z_t = (X_t, Y_t)$, and the goal is to predict one part of the data, $Y_t$, from the other, $X_t$. The risk of a prediction function $g : \mathcal{X} \mapsto \mathcal{Y}$ can be sensibly defined as an expectation over data points:
$$R(g) = \mathbb{E}_{X,Y}[\ell(Y, g(X))].$$
This risk is well-defined so long as the marginal distribution of the data is shift-invariant ($\mathcal{L}(Z_t) = \mathcal{L}(Z_1)$ for all $t$). For an i.i.d. data source, it is of course true that
$$\mathbb{E}_{X_{n+1}, Y_{n+1}}[\ell(Y_{n+1}, g(X_{n+1})) \mid X_1^n, Y_1^n] = \mathbb{E}_{X_{n+1}, Y_{n+1}}[\ell(Y_{n+1}, g(X_{n+1}))] = R(g),$$
so that averaging over the marginal distribution of the next data point indicates the expected loss of continuing to use the predictor $g$ on new data. This is no longer true for dependent data. However, for a stationary ergodic source, one has that (Shalizi and Kontorovich, 2013)
$$\lim_{m \to \infty} \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{X_{n+i}, Y_{n+i}}[\ell(Y_{n+i}, g(X_{n+i})) \mid X_1^n, Y_1^n] = \mathbb{E}_{X,Y}[\ell(Y, g(X))] = R(g),$$
so the expectation over new data would still be a good indicator of long-run performance.

All of this is subtly changed for time series, where the goal is to forecast $Y_{t+1}$ from the history $Y_1^t$. As discussed above, we abuse notation and denote a single predictive model $g$ even though it really represents a sequence of functions $g_t : \mathcal{Y}^t \mapsto \mathcal{Y}$.

Definition 7 (Forecast risk). Given a stationary and ergodic stochastic process $\mathbf{Y}$, and a loss function $\ell$, the finite-history risk of the predictive model $g$ is
$$R_n(g) \equiv \mathbb{E}_{Y_1^n}\left[\frac{1}{n}\sum_{t=1}^{n} \ell\bigl(Y_t, g_t(Y_1^{t-1})\bigr)\right],$$
and the forecast risk is $R(g) = \lim_{n\to\infty} R_n(g)$, when the limit exists.

For brevity, we introduce the notation $Z_t \equiv (Y_t, Y_1^{t-1})$ and $h_t(Z_t) \equiv \ell(Y_t, g_t(Y_1^{t-1}))$, defining $Y_1^0 \equiv \emptyset$. The forecast risk can thus also be written as $\lim_{n\to\infty} \mathbb{E}\bigl[n^{-1}\sum_{t=1}^{n} h_t(Z_t)\bigr]$. So $R(g)$, again, captures the long-run average cost of using the predictive model $g$. By contrast, $R_n(g)$ is the average risk of $g$ if used on an independent realization of $Y_1^n$.

Having an infinite-time limit in the definition of forecast risk is irksome. It can be evaded if the predictive model has only a finite memory length $d \ge 0$, so that nothing more than $d$ time steps old matters for predictions (formally, $g_t(Y_1^t) \in \sigma(Y_{t-d}^t)$ for all $t > d$). Then, by stationarity, we may simplify
$$R(g) = \mathbb{E}_{Y_1^{d+1}}\bigl[\ell(Y_{d+1}, g(Y_1^d))\bigr].$$
In fact, in the finite-memory-length case, as soon as $n > d$,
$$R(g) = \frac{1}{n-d}\sum_{t=d+1}^{n} \mathbb{E}_{Y_1^n}\bigl[\ell(Y_{t+1}, g(Y_{t-d}^{t-1}))\bigr] = R_n(g),$$
and it follows from the ergodic theorem that $\frac{1}{n-d}\sum_{t=d+1}^{n} \ell(Y_{t+1}, g(Y_{t-d}^{t-1})) \to R(g)$ almost surely.

Predictive models with infinite-range memories are, however, actually fairly common in forecasting practice, including not just hidden Markov models but also things as basic as moving-average models. We therefore posit that $R(g)$ exists for such models, denoting the gap between the forecast risk and the finite-history risk by $\Delta_n(g) \equiv R(g) - R_n(g)$. We also posit that the time-averaged loss converges to the forecast risk: $\frac{1}{n}\sum_{t=1}^{n} h_t(Z_t) \to R(g)$. (If the loss function is the negative log-likelihood, this posit is the generalized asymptotic equipartition property, or Shannon-McMillan-Breiman theorem; Algoet and Cover 1988; Gray 1990.)
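As an illustration of the finite-memory case (our own sketch, not from the paper), take the two-state chain from §2.1 with the memory-1 forecaster $g(Y_1^{t-1}) = Y_{t-1}$ and 0-1 loss. By stationarity its forecast risk is $R(g) = \mathbb{P}(Y_t \ne Y_{t-1}) = 0.1$, and the time-averaged loss converges to it, as the ergodic theorem predicts.

```python
import numpy as np

# A minimal sketch (not from the paper): a memory-1 forecaster g(Y_{t-1}) = Y_{t-1}
# on the two-state chain, under 0-1 loss.  Its forecast risk is
# R(g) = P(Y_t != Y_{t-1}) = 0.1, and the time-averaged loss converges to it.
rng = np.random.default_rng(1)

def sample_chain(n, p_stay=0.9):
    y = np.empty(n, dtype=int)
    y[0] = rng.integers(0, 2)
    for t in range(1, n):
        y[t] = y[t - 1] if rng.random() < p_stay else 1 - y[t - 1]
    return y

y = sample_chain(10**5)
avg_loss = np.mean(y[1:] != y[:-1])   # (1/(n-1)) * sum_t 1{Y_t != g(Y_{t-1})}
print(avg_loss)                        # close to R(g) = 0.1
```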
With finite amounts of data, we thus focus on control of $R_n(g)$. Whether $\Delta_n(g) \to 0$ is a property of both the function class and the dependence structure, and is outside our scope, though see van Handel (2014) for related discussion.

3 Risk Bounds

Generalization error bounds follow from deriving high-probability upper bounds on the quantity
$$\Gamma_n(\mathcal{H}) := \sup_{h \in \mathcal{H}} \; R_n(h) - \widehat{R}_n(h),$$
which is the worst-case difference between the true risk $R_n(h)$ and the empirical risk $\widehat{R}_n(h)$ over all functions in the class of losses $\mathcal{H} = \{h = \ell(\cdot, g(\cdot)) : g \in \mathcal{G}\}$ defined over a particular class of prediction functions $\mathcal{G}$. We first present our main result, which bounds $\mathbb{E}_Z[\Gamma_n(\mathcal{H})]$ with the Rademacher complexity, and discuss its proof. We then use our Rademacher bound to derive risk bounds for time-series forecasters which are fully calculable from data.

3.1 Stationary Rademacher Bounds

The symmetrization arguments used to prove Rademacher bounds for the i.i.d. case fail for time series prediction. However, as we now show, for stationary time series, bounds of the same form are still valid, albeit with a somewhat more involved proof. This is in contrast to the far more intricate constructions needed to establish bounds using generalized Rademacher complexities for online learning (Rakhlin et al., 2010, 2011) or for non-stationary processes (Kuznetsov and Mohri, 2015). (We give more detailed contrasts in §4.1.) Our first principal result is simply:

Theorem 8. For a time series prediction problem based on a sequence $Y_1^n$, $\mathbb{E}[\Gamma_n(\mathcal{H})] \le R_n(\mathcal{H})$.

[Figure 1 omitted: two binary trees, one for $Z(\xi)$ and one for $\tilde{Z}(\xi)$, with branches labelled by $\xi_t = \pm 1$.]
Figure 1: This figure displays the tree structures for $Z(\xi)$ and $\tilde{Z}(\xi)$ with $\xi_1 = 1$ (for example). The path along each tree is determined by the $\xi$ sequence, interleaving the "past" between paths. The version with $\xi_1 = -1$ would exchange $Z_1$ for $\tilde{Z}_1$ at the root of each tree.

We note here that unless $\sup_{h \in \mathcal{H}} \|h\|_{\infty} < \infty$, $R_n(\mathcal{H}) = \infty$ by its definition. Thus, this result, like all Rademacher results, is only useful with bounded predictors or losses. Of course, if $\sup_{h \in \mathcal{H}} \|h\|_{\infty} = \infty$, the theorem holds trivially.

The standard proof for i.i.d. classification or regression introduces a "ghost sample", an independent sample of size $n$ from the same distribution that produced the original data, before using a symmetrization argument. For forecasting, however, introducing an independent copy of the original time series will not produce the necessary symmetry. Rather, following an idea introduced by Rakhlin et al. (2010, 2011) for dealing with adversarial data, we work with a tangent sequence, where the surrogate value introduced at each time point is conditioned on the actual time series up to that point. That is, the tangent sequence $\tilde{\mathbf{Y}}$ is defined recursively: $\mathcal{L}(\tilde{Y}_1) = \mathcal{L}(Y_1)$, and $\mathcal{L}(\tilde{Y}_t \mid Y_1^{t-1}) = \mathcal{L}(Y_t \mid Y_1^{t-1})$. Furthermore, $\tilde{Y}_t$ is independent of all other $\tilde{Y}$'s and of all $Y$'s, conditional on $Y_1^{t-1}$. (In directed graphical model terms, $Y_1^{t-1}$ are the parents of $\tilde{Y}_t$, which has no children. See Figure 1.) The time series $\mathbf{Y}$ and the tangent sequence do not have the same joint distributions. (For example, let $Y_1$ be 0 or 1 with equal probability, and $Y_{t+1} = Y_t$ with probability 0.9 and $= 1 - Y_t$ otherwise; $\mathbf{Y}$ is a stationary and ergodic Markov chain. Because $\tilde{Y}_1 \perp\!\!\!\perp \tilde{Y}_2 \mid Y_1$, the probability that $\tilde{Y}_2 = \tilde{Y}_1$ is not 0.9 but 0.5.)
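The following minimal simulation (ours, not the authors' code) checks that parenthetical example numerically: for the two-state chain, $\mathbb{P}(Y_2 = Y_1) = 0.9$, while $\mathbb{P}(\tilde{Y}_2 = \tilde{Y}_1) = 0.5$, because $\tilde{Y}_1$ and $\tilde{Y}_2$ are drawn independently given $Y_1$.

```python
import numpy as np

# A minimal sketch (not from the paper) of the tangent sequence on the
# two-state chain: tilde{Y}_1 ~ L(Y_1) and, for t >= 2,
# tilde{Y}_t ~ L(Y_t | Y_1^{t-1}) is drawn afresh given the *observed* past,
# independently of everything else.  The joint law changes: P(Y_2 = Y_1) = 0.9
# while P(tilde{Y}_2 = tilde{Y}_1) = 0.5.
rng = np.random.default_rng(2)

def step(prev, p_stay=0.9):
    return prev if rng.random() < p_stay else 1 - prev

reps, agree_y, agree_tilde = 100_000, 0, 0
for _ in range(reps):
    y1 = rng.integers(0, 2)
    y2 = step(y1)                 # actual next observation
    ty1 = rng.integers(0, 2)      # tilde{Y}_1 ~ L(Y_1), independent of Y_1
    ty2 = step(y1)                # tilde{Y}_2 ~ L(Y_2 | Y_1), given the real Y_1
    agree_y += (y2 == y1)
    agree_tilde += (ty2 == ty1)

print(agree_y / reps, agree_tilde / reps)   # roughly 0.9 and 0.5
```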
Proof of Thm. 8. With both the original time series $\mathbf{Y}$ and the tangent sequence $\tilde{\mathbf{Y}}$ in hand, we construct $Z_t$ and $\tilde{Z}_t$ variables as follows: $Z_t \equiv (Y_t, Y_1^{t-1})$ and $\tilde{Z}_t \equiv (\tilde{Y}_t, Y_1^{t-1})$ (with the convention that $Z_1 = Y_1$, $\tilde{Z}_1 = \tilde{Y}_1$). Notice that $\tilde{Z}$ combines the original time series and its tangent sequence, but in such a way that $\mathcal{L}(Z_t) = \mathcal{L}(\tilde{Z}_t)$. Furthermore, since $\tilde{Y}_t \perp\!\!\!\perp Y_t^n \mid Y_1^{t-1}$, it follows that $\tilde{Z}_t \perp\!\!\!\perp Z_t^n \mid Z_1^{t-1}$. As $R_n(h) = \mathbb{E}_Z\bigl[\frac{1}{n}\sum_{t=1}^{n} h_t(Z_t)\bigr]$ for some $h \in \mathcal{H} = \ell \circ \mathcal{G}$, we may equally well write the risk in terms of the tangent sequence: $R_n(h) = \frac{1}{n}\sum_{t=1}^{n} \mathbb{E}_{\tilde{Z}_t}[h_t(\tilde{Z}_t)]$. Therefore,
$$\begin{aligned}
\mathbb{E}[\Gamma_n(\mathcal{H})] &= \mathbb{E}_{Z}\left[\sup_{h\in\mathcal{H}}\left(\mathbb{E}_{Z}\left[\frac{1}{n}\sum_{i=1}^{n} h(Z_i)\right] - \frac{1}{n}\sum_{i=1}^{n} h(Z_i)\right)\right] \\
&= \mathbb{E}_{Z}\left[\sup_{h\in\mathcal{H}}\left(\mathbb{E}_{\tilde Z}\left[\frac{1}{n}\sum_{i=1}^{n} h(\tilde Z_i)\right] - \frac{1}{n}\sum_{i=1}^{n} h(Z_i)\right)\right] \\
&\le \mathbb{E}_{Z,\tilde Z}\left[\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} h(\tilde Z_i) - h(Z_i)\right] \quad\text{(Jensen's inequality)} \\
&= \mathbb{E}_{Z_1 \tilde Z_1}\,\mathbb{E}_{Z_2\mid Z_1,\; \tilde Z_2\mid \tilde Z_1} \cdots \mathbb{E}_{Z_n\mid Z_{n-1},\ldots,Z_1,\; \tilde Z_n\mid \tilde Z_{n-1},\ldots,\tilde Z_1}\left[\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} h(\tilde Z_i) - h(Z_i)\right] \quad\text{(iterated expectation).} \qquad (1)
\end{aligned}$$

Now, due to dependence, Rademacher variables must be introduced carefully, as in the adversarial case. Rademacher variables create two tree structures, one associated to the $Z$ sequence and one associated to the $\tilde{Z}$ sequence (see Rakhlin et al., 2010, 2011 for a thorough treatment). We write these trees as $Z(\xi)$ and $\tilde{Z}(\xi)$, where $\xi$ is a particular sequence of Rademacher variables, e.g., $(1, -1, -1, 1, \ldots, 1)$, which creates a path along each tree. For example, consider $\xi = \mathbf{1}$ (all ones). Then $Z(\xi) = (Z_1, \ldots, Z_n)$ and $\tilde{Z}(\xi) = (\tilde{Z}_1, \ldots, \tilde{Z}_n)$, the "right" path of both tree structures. For $\xi = -\mathbf{1}$, $Z(\xi) = (\tilde{Z}_1, \ldots, \tilde{Z}_n)$ and $\tilde{Z}(\xi) = (Z_1, \ldots, Z_n)$, the "left" path of both tree structures. Changing $\xi_i$ from $+1$ to $-1$ exchanges $Z_i$ for $\tilde{Z}_i$ in both trees and chooses the left child of $Z_{i-1}$ and $\tilde{Z}_{i-1}$ rather than the right child. Figure 1 displays both trees.

In order to talk about the probability of $Z_i$ conditional on the "past" in the tree, we need to know the path taken so far. For this, we define a selector function
$$\chi(\xi) := \chi(\xi, \rho, \varrho) = \rho\, I(\xi = 1) + \varrho\, I(\xi = -1),$$
which returns its second argument when $\xi = 1$ and its third when $\xi = -1$. Distributions over trees then become the objects of interest. Contrary to the online-learning scenario, the dependence between future and past means the adversary is not free to change predictors and responses separately. Once a branch of the tree is chosen, the distribution of future data points is fixed, and depends only on the preceding sequence. Because of this, the joint distribution of any path along the tree is the same as any other path, i.e., for any two paths $\xi, \xi'$, $\mathcal{L}(Z(\xi)) = \mathcal{L}(Z(\xi'))$ and $\mathcal{L}(\tilde{Z}(\xi)) = \mathcal{L}(\tilde{Z}(\xi'))$.
Similarly, due to the construction of the tangent sequence, we have that $\mathcal{L}(Z(\xi)) = \mathcal{L}(\tilde{Z}(\xi))$. This equivalence between paths allows us to introduce Rademacher variables swapping $Z_i$ for $\tilde{Z}_i$, as well as the ability to combine terms below:
$$\begin{aligned}
(1) &= \mathbb{E}_{Z_1 \tilde Z_1}\,\mathbb{E}_{\xi_1}\,\mathbb{E}_{Z_2\mid\chi(\xi_1, Z_1, \tilde Z_1),\; \tilde Z_2\mid\chi(\xi_1, \tilde Z_1, Z_1)}\,\mathbb{E}_{\xi_2}\cdots \mathbb{E}_{Z_n\mid\chi(\xi_{n-1}),\ldots,\chi(\xi_1),\; \tilde Z_n\mid\chi(\xi_{n-1}),\ldots,\chi(\xi_1)}\,\mathbb{E}_{\xi_n}\left[\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} \xi_i\bigl(h(\tilde Z_i) - h(Z_i)\bigr)\right] \\
&= \mathbb{E}_{Z, \tilde Z, \xi}\left[\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} \xi_i\bigl(h(\tilde Z_i) - h(Z_i)\bigr)\right] \\
&\le \mathbb{E}_{Z, \xi}\left[\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} \xi_i h(Z_i)\right] + \mathbb{E}_{\tilde Z, \xi}\left[\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} \xi_i h(\tilde Z_i)\right] \\
&= 2\,\mathbb{E}_{Z, \xi}\left[\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} \xi_i h(Z_i)\right] = R_n(\mathcal{H}).
\end{aligned}$$

Rademacher complexity gets its utility from bounding the prediction risk of forecasters. For i.i.d. data, the main tools for proving risk bounds are the inequalities of Hoeffding (1963) and McDiarmid (1989). Extensions of learning theory to dependent data have relied on strong mixing properties to approximate weakly-dependent processes by i.i.d. ones, recovering i.i.d. results with a reduced effective sample size. We instead use generalizations of Hoeffding and McDiarmid to dependent sequences based on results of van de Geer (2002), which do not need mixing at all. Rather than deriving bounds under the condition $\sup_{h\in\mathcal{H}} \|h\|_{\infty} < \infty$, we use a weaker hypothesis on the tails of conditional distributions. We first state this more general result, giving the bounded case as a corollary. We discuss the concentration bound and its derivation in §3.3.

Theorem 9. Suppose that there exist constants $\tau$ and $c$ such that
$$\mathbb{E}\left[\psi\!\left(\frac{|\Gamma_n(\mathcal{H})|}{c}\right) \,\Big|\, \mathcal{F}_0^t\right] \le \tau \quad \forall t, \qquad (2)$$
where $\psi(x) = \exp(x^2) - 1$. Then, for $\epsilon > 0$ and $n$ large enough, for all $h \in \mathcal{H}$, with probability at least $1 - \delta$,
$$R_n(h) \le \widehat{R}_n(h) + R_n(\mathcal{H}) + 4c(\tau+1)\sqrt{\frac{2\log 1/\delta}{n}}.$$

The following corollary is immediate by noting that $\sup_{h\in\mathcal{H}} \|h\|_{\infty} \le M < \infty$ implies that $\mathbb{E}\bigl[\exp\bigl((|\Gamma_n|/M)^2\bigr)\bigr] \le e < 3$. We made no effort to optimize the constant before the confidence penalty.

Corollary 10. If $\sup_{h\in\mathcal{H}} \|h\|_{\infty} \le M$, then for all $h \in \mathcal{H}$, with probability at least $1 - \delta$,
$$R_n(h) \le \widehat{R}_n(h) + R_n(\mathcal{H}) + 12M\sqrt{\frac{2\log 1/\delta}{n}}.$$

3.2 Empirical Rademacher Bounds

Unfortunately, $R_n(\mathcal{H})$ may itself be hard or impossible to calculate for some classes $\mathcal{H}$. However, under our assumptions, we show that $R_n(\mathcal{H})$ is closely approximated by the empirical Rademacher complexity. That is, the same data can estimate both $\widehat{R}_n$ and $R_n(\mathcal{H})$.

Theorem 11 (Empirical Rademacher Complexity Bound). Assume eq. (2) holds. Then, for all $h \in \mathcal{H}$, with probability at least $1 - \delta$,
$$R_n(h) \le \widehat{R}_n(h) + \widehat{R}_n(\mathcal{H}) + 12c(\tau+1)\sqrt{\frac{2\log 2/\delta}{n}}.$$

To apply Thm. 11, we can estimate $\widehat{R}_n(\mathcal{H})$ by drawing $m$ independent Rademacher samples of size $n$, and use
$$\frac{2}{mn}\sum_{i=1}^{m} \sup_{h\in\mathcal{H}} \sum_{t=1}^{n} \xi_{ti}\, h_t(Y_1^t) \approx \widehat{R}_n(\mathcal{H}). \qquad (3)$$
The approximation is $O(1/m)$-accurate. Thus, given one sample of data, the entire risk bound is fully calculable. If $R_n(\mathcal{H})$ is known (the case for many common classes $\mathcal{H}$, see Section 4.2), we may apply Thm. 9. For any other class of predictors, we can estimate the complexity with (3) and apply Thm. 11.
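A minimal sketch (ours, not the authors' code) of the estimator in eq. (3) for a finite model class, where the supremum is just a maximum over candidates; the constant forecasters and the clipped absolute loss are illustrative assumptions.

```python
import numpy as np

# A minimal Monte Carlo sketch of eq. (3) for a finite model class:
# draw m independent Rademacher samples of size n and average
# (2/n) * sup_h sum_t xi_t * h_t(Y_1^t).  Names are hypothetical.
rng = np.random.default_rng(3)

def empirical_rademacher(loss_matrix, m=2000):
    """loss_matrix[j, t] = h_t(Y_1^t) for model j; returns the eq. (3) estimate."""
    k, n = loss_matrix.shape
    xi = rng.choice([-1.0, 1.0], size=(m, n))     # m Rademacher samples
    # For each xi sample, the supremum over the (finite) class is a max over rows.
    sups = (xi @ loss_matrix.T).max(axis=1)       # shape (m,)
    return 2.0 * sups.mean() / n

# Example: losses of three constant forecasters g_c(history) = c on data y,
# with absolute loss clipped to [0, 1] so the class is bounded.
y = rng.random(200)
constants = np.array([0.25, 0.5, 0.75])
loss_matrix = np.minimum(np.abs(y[None, :] - constants[:, None]), 1.0)
print(empirical_rademacher(loss_matrix))
```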
Finally, we present a corollary for the case that $\sup_{h\in\mathcal{H}} \|h\|_{\infty} < \infty$.

Corollary 12. If $\sup_{h\in\mathcal{H}} \|h\|_{\infty} \le M < \infty$, then for all $h \in \mathcal{H}$, with probability at least $1 - \delta$,
$$R_n(h) \le \widehat{R}_n(h) + \widehat{R}_n(\mathcal{H}) + 36M\sqrt{\frac{2\log 2/\delta}{n}}.$$

Both of these results can be seen as penalizing the empirical risk with a term that accounts for the complexity of $\mathcal{H}$, along with a second penalty for the amount of confidence we require.

3.3 Necessary Concentration Inequalities

For i.i.d. data, the main tools for developing risk bounds are the inequalities of Hoeffding (1963) and McDiarmid (1989). As discussed above, extensions of learning theory to dependent data have relied on strong mixing properties to approximate weakly-dependent processes by i.i.d. ones, and so recover the i.i.d. results with a reduced effective sample size. We will instead use a generalization applying to dependent sequences based on results due to van de Geer (2002), which do not require mixing at all.

We need some conditions on the tails of the random variables. Suppose that $X_t$ is a martingale, i.e., a real-valued $\mathcal{F}_0^t$-measurable random variable satisfying $\mathbb{E}[X_t \mid \mathcal{F}_0^{t-1}] = 0$, with the convention that $\mathcal{F}_1^0$ is trivial. For a constant $c$, define
$$B_n^2 = \sum_{t=1}^{n} c^2\left(1 + \mathbb{E}\left[\psi\!\left(\frac{|X_t|}{c}\right)\,\Big|\,\mathcal{F}_0^{t-1}\right]\right),$$
where $\psi(x) = \exp(x^2) - 1$. Essentially, controlling $B_n^2$ by bounding the expectation of $\psi(|X_t|/c)$ controls the tails of $X_t$. The function $\psi$ can be any non-decreasing, convex function satisfying $\psi(0) = 0$, but the use of $\psi(x) = \exp(x^2) - 1$ is most common. In general, $\inf\{c > 0 : \mathbb{E}[\psi(|X_t|/c)] \le 1\}$ is referred to as the Orlicz norm of $X_t$, denoted $\|X_t\|_{\psi}$. In the simplest case, if $c < \infty$ and $\mathbb{E}[X_t] = 0$, it holds that $\mathbb{P}(|X_t| > x) \le 2\exp(-x^2/c^2)$. This is the definition of sub-Gaussian tails: $X_t$ has tails which decrease at least as quickly as those of a standard Gaussian random variable. In particular, bounded random variables satisfy this condition. As our data come from a time-dependent process $\mathbf{Y}$, we require the conditional version of this idea.

Lemma 13 (van de Geer 2002, Theorem 2.2). Suppose $X_t$ is a martingale. Then, for all $\epsilon > 0$, $b > 0$, for $n$ large enough,
$$\mathbb{P}\left(\sum_{t=1}^{n} X_t \ge \epsilon \;\text{ and }\; B_n^2 \le b^2\right) \le \exp\{-\epsilon^2 / 8b^2\}.$$

Table 1: Comparison of existing risk bounds. We use the notation polylog(n) to mean $\log^k(n)$ for some $k > 0$.

| Assumptions | Reference | Complexity | Calculable | Best-case convergence rate |
| --- | --- | --- | --- | --- |
| I.i.d. | (many, e.g., Bartlett and Mendelson, 2002) | Rademacher | Yes | $O(\sqrt{1/n})$ |
| Stationary & mixing | (Mohri and Rostamizadeh, 2009) | Blocked Rademacher | If β-mixing coefs are known | $O(\sqrt{\mathrm{polylog}(n)/n})$ |
| Stationary, non-mixing | This paper | Rademacher | Yes | $O(\sqrt{1/n})$ |
| Non-stationary & mixing | (Kuznetsov and Mohri, 2014, 2017) | Blocked or Sequential Rademacher | If β-mixing coefs are known | $O(\sqrt{\mathrm{polylog}(n)/n})$ |
| Non-stationary, non-mixing | (Kuznetsov and Mohri, 2015) | Expected covering number | Depends on $\mathcal{H}$ | $O(\sqrt{\mathrm{polylog}(n)/n})$ |
| Adversarial | (Rakhlin et al., 2010, 2011, 2015) | Sequential Rademacher | Depends on $\mathcal{H}$ | $O(\sqrt{\mathrm{polylog}(n)/n})$ |

This result generalizes Hoeffding's inequality to the case of conditionally sub-Gaussian random variables from a dependent sequence. As long as the tails of the next observation are well controlled conditional on the past, we can still control the size of deviations from the mean with high probability.
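For concreteness, here is the short calculation behind the remark that bounded variables satisfy this condition (a sketch of ours, matching the constant quoted before Corollary 10): if $|X_t| \le M$ almost surely, take $c = M$.

```latex
% A short check (ours, following the remark before Corollary 10):
% if |X_t| <= M almost surely, take c = M in the Orlicz-type condition.
\[
  \mathbb{E}\!\left[\psi\!\left(\tfrac{|X_t|}{M}\right) \,\Big|\, \mathcal{F}_0^{t-1}\right]
  = \mathbb{E}\!\left[e^{X_t^2/M^2} - 1 \,\Big|\, \mathcal{F}_0^{t-1}\right]
  \le e - 1 \approx 1.718,
\]
% so the hypothesis of the next theorem holds with c = M and tau = e - 1 < 2,
% matching the claim that bounded variables are (conditionally) sub-Gaussian.
```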
We now present the following extension, analogous to McDiarmid's inequality, but for dependent sequences with sub-Gaussian tails (rather than bounded differences).

Theorem 14. Let $X_t$ be $\mathcal{F}_0^t$-measurable with
$$\mathbb{E}\left[\psi\!\left(\frac{|X_t|}{c}\right)\,\Big|\,\mathcal{F}_0^{t-1}\right] \le \tau, \qquad (4)$$
for some $\tau > 0$ and all $t > 0$. Then for all $\epsilon > 0$ and $n$ large enough,
$$\mathbb{P}(X_n - \mathbb{E}[X_n] > \epsilon) \le \exp\left\{-\frac{\epsilon^2}{32 n c^2 (\tau+1)^2}\right\}.$$

Thm. 14 can be generalized to allow both $c$ and $\tau$ to depend on $t$, with appropriate modifications as in Lem. 13. Typically, we would expect better control over the tails as we condition on more data, resulting in a decreasing sequence of $\tau$, though we will not pursue this generality further here. Because we were unable to find a comparable result in the literature, and this one may be useful in its own right, we have chosen to include it here. The proof is given in the Supplement.

4 Discussion

In this section, we give a careful explanation, situating our results in the context of existing bounds. We then provide a few simple (standard) examples of cases in which our bounds are calculable, as well as a generalized algorithm for classes which don't admit calculable expected Rademacher complexities. Finally, we conclude.

4.1 Relationship with Existing Work

As discussed in the introduction, existing work has developed risk bounds for dependent data under a number of assumptions which are more or less general than ours. In order to give context for our results, we compare the assumptions and benefits of each of these here. This comparison is summarized in Table 1.

The first risk bounds for time series are, like our result, based on standard Rademacher complexities. Mohri and Rostamizadeh (2009) assume that $\mathbf{Y}$ is a stationary β-mixing process. Like our results (Cor. 10 and Cor. 12), they are able to prove bounds based on both the expected and empirical Rademacher complexities. Their results, however, do not apply to the full time-series forecasting setting we present here: predictions in their setting may depend only on a fixed lag $d$ of previous observations. Furthermore, both the Rademacher complexity and the confidence penalty depend on blocks of data rather than individual data points. The number of blocks, $\mu$, then replaces $n$ in both terms, where $\mu$ depends on the unknown mixing coefficients. Thus, convergence rates are slightly slower (because the size of the blocks should increase with $n$, $\mu$ must be sublinear in $n$) and cannot be directly calculated without knowledge of the mixing coefficients. McDonald et al. (2011, 2015) give an estimator for the mixing coefficients with nearly parametric rates, though bounds which replace known coefficients with estimates have not been derived. Our results subsume the stationary and mixing results because our convergence rate is faster without assuming any type of asymptotic decay of dependence.

Alternatively, Rakhlin et al. (2010, 2011, 2015) develop truly ingenious techniques for an adversarial data generating process, a much more general condition wherein not only is the process potentially non-stationary and non-mixing, but subsequent data points may be chosen based on previous predictions to make the learner perform as poorly as possible.
These results rely instead on the sequential Rademacher complexity, defined in our notation as
$$R_n^{\mathrm{seq}}(\mathcal{H}) = \sup_{\mathbf{Z}}\,\mathbb{E}_{\xi}\left[\sup_{h\in\mathcal{H}} \frac{2}{n}\sum_{t=1}^{n} \xi_t\, h(Z_t(\xi))\right],$$
where the outer supremum is taken over all $\mathcal{Y}$-valued trees of depth $n$. Because their results are more general, one could simply apply them to our setting. However, $R_n^{\mathrm{seq}}(\mathcal{H})$ is more difficult to calculate than $R_n(\mathcal{H})$, is looser, and does not admit an empirical version (analogous to our Cor. 12) because it replaces the outer expectation over $Z$ with a supremum.

Finally, work on non-stationary, mixing processes (Kuznetsov and Mohri, 2014, 2017) and non-stationary, non-mixing processes (Kuznetsov and Mohri, 2015) has also appeared. In the mixing case, the complexity is either the blocked version as in Mohri and Rostamizadeh (2009), adjusted to handle non-stationarity, or the sequential complexity above with an additional discrepancy penalty which "measures" non-stationarity in view of $\mathcal{H}$. The discrepancy measure can be calculated from data, as can the blocked Rademacher complexity, though again, the mixing coefficients cannot. The non-stationary, non-mixing setting replaces Rademacher complexities with an expected sequential covering number. This results in bounds which are looser than ours by poly-logarithmic factors in $n$. If the covering number can be computed for the function class $\mathcal{H}$ of interest, then these results are wholly calculable, but if the class does not have a known covering number, there is no analogue to Cor. 12 which can be estimated from the given data.

Thus, the benefits of our work are that, if we are willing to assume stationarity, our results are tighter than previous results, easier to calculate based on known expected Rademacher formulas, and admit empirical Rademacher complexities which can always be calculated given sufficient computational resources. None of these benefits require untestable mixing assumptions or knowledge of the associated coefficients.

4.2 Examples and Algorithms

In some cases, the expected (or empirical) Rademacher complexity is easily calculated from data. In these cases, one can derive simple algorithms for time-series prediction. Our first two examples give, for clarity, complete risk bounds for algorithms which predict future observations based on $d$ previous observations. These follow from results of Bartlett and Mendelson (2002).

Consider first the case of a two-layer neural network which makes predictions based on $d$ previous values, and let $\mathcal{Y} = \mathbb{R}^p$. Suppose that the activation function $\sigma : \mathbb{R} \to [-1, 1]$ is 1-Lipschitz with $\sigma(0) = 0$. For $v_i \in \mathbb{R}^{pd}$, define
$$\mathcal{G}_N = \left\{ y \mapsto \sum_i w_i\, \sigma(v_i \cdot y) \;:\; \|w\|_1 \le 1,\; \|v_i\|_1 \le 1 \right\}.$$
Suppose further that $\ell$ is 1-Lipschitz. Then
$$\widehat{R}_n(\ell \circ \mathcal{G}_N) \le \frac{2c \log^{1/2}(pd)}{n} \max_{1 \le j, j' \le p} \sqrt{\sum_{i=1}^{n-d} (y_{ij} - y_{ij'})^2}$$
for some $c > 0$. Thus, $\widehat{R}_n(\ell \circ \mathcal{G}_N) = O_P(n^{-1/2})$ as usual. The Lipschitz conditions and norm constraints can easily be exchanged for other constants without altering the rate, and the number of layers is easily altered.

Consider now regularized kernel methods. Suppose $\ell$ is $M$-Lipschitz and consider the class
$$\mathcal{G}_K = \left\{ y \mapsto w \cdot \Phi(y) \;:\; \|w\|_{\Psi} \le B^2 \right\},$$
where $\Phi(y) : \mathcal{Y} \to \Psi$ is the feature map associated with the Hilbert space $\Psi$, $k$ is the corresponding kernel function, and $\|\cdot\|_{\Psi}$ denotes the norm in $\Psi$. Then, we have that
$$\widehat{R}_n(\ell \circ \mathcal{G}_K) \le \frac{4MB}{n}\sqrt{\sum_{i=1}^{n-d} k(y_i, y_i)} = O_P(n^{-1/2}).$$

Finally, using Cor. 12, we can derive a generic empirical risk minimization-type (ERM) algorithm for learning without any knowledge of complexity measurements. Algorithm 1 shows how to choose a predictor from among a collection of bounded function classes $\mathcal{H}_1, \ldots, \mathcal{H}_k$.

Algorithm 1: Generic ERM Algorithm
Input: data $Y_1^n$, models $\mathcal{H}_1, \ldots, \mathcal{H}_k$, integer $m$
for $i = 1$ to $k$ do
    Estimate a predictor $h_i \in \mathcal{H}_i$ as usual.
    Compute the training error $\widehat{R}_n(h_i)$.
    Compute $\widehat{R}_n(\mathcal{H}_i)$ using (3).
end for
Choose $i^* = \arg\min_i \widehat{R}_n(h_i) + \widehat{R}_n(\mathcal{H}_i)$.
Return $h_{i^*}$ and $\widehat{R}_n(h_{i^*}) + \widehat{R}_n(\mathcal{H}_{i^*})$, and calculate the complexity penalty to form the bound in Cor. 12.
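The sketch below (ours, not the authors' implementation) shows one way Algorithm 1 might be realized, assuming each bounded class is represented by a finite grid of candidate forecasters and using the Monte Carlo estimator (3) for the complexity penalty; all names and the clipped absolute loss are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def empirical_rademacher(loss_matrix, m=2000):
    """Eq. (3): average of (2/n) * sup_h sum_t xi_t h_t over m Rademacher draws."""
    k, n = loss_matrix.shape
    xi = rng.choice([-1.0, 1.0], size=(m, n))
    return 2.0 * (xi @ loss_matrix.T).max(axis=1).mean() / n

def loss_matrix(y, models, clip=1.0):
    """Row j holds h_t(Y_1^t): clipped absolute loss of candidate j at each time t."""
    return np.array([[min(abs(y[t] - g(y[:t])), clip) for t in range(len(y))]
                     for g in models])

# Hypothetical bounded classes, each approximated by a finite grid of candidates:
# H_1 = constant forecasters, H_2 = damped "repeat last value" forecasters.
classes = [
    [lambda hist, c=c: c for c in np.linspace(0.0, 1.0, 11)],
    [lambda hist, a=a: a * hist[-1] if len(hist) else 0.0
     for a in np.linspace(0.0, 1.0, 11)],
]

y = rng.random(300)   # stand-in data; in practice Y_1^n is the observed series
best = None
for H in classes:                              # "for i = 1 to k do"
    L = loss_matrix(y, H)
    train_err = L.mean(axis=1)                 # empirical risk of each candidate
    h_idx = int(train_err.argmin())            # ERM within the class
    penalized = train_err[h_idx] + empirical_rademacher(L)
    if best is None or penalized < best[0]:
        best = (penalized, H[h_idx])           # "choose i* = argmin ..."
print("penalized empirical risk of the chosen model:", best[0])
```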
4.3 Conclusion

In this paper, we have demonstrated how to control the generalization of time series prediction algorithms. These methods use some or all of the observed past to predict future values of the same series. In order to handle the complicated Rademacher complexity bound for the expectation, we have followed the approach used in the online learning case pioneered by Rakhlin et al. (2010, 2011), but we show that in our particular case, much of the structure needed to deal with the adversary is unnecessary. This results in clean risk bounds which have a form similar to the i.i.d. case. As these results take expectations over $Y_1^n$ rather than a supremum, empirical counterparts which are estimable can also be derived. Extending our results to local Rademacher complexities with faster convergence rates is left for future work.

References

Algoet, P. H., and Cover, T. M. (1988), "A sandwich proof of the Shannon-McMillan-Breiman theorem," Annals of Probability, 16, 899-909.

Bartlett, P. L., and Mendelson, S. (2002), "Rademacher and Gaussian complexities: Risk bounds and structural results," Journal of Machine Learning Research, 3, 463-482.

Bartlett, P. L., Bousquet, O., and Mendelson, S. (2005), "Local Rademacher complexities," The Annals of Statistics, 33(4), 1497-1537.

Dynkin, E. B. (1978), "Sufficient statistics and extreme points," Annals of Probability, 6, 705-730.

Gray, R. M. (1990), Entropy and Information Theory, Springer-Verlag, New York.

Gray, R. M. (2009), Probability, Random Processes, and Ergodic Properties, Springer-Verlag, New York, second edn.

Hoeffding, W. (1963), "Probability inequalities for sums of bounded random variables," Journal of the American Statistical Association, 58, 13-30.

Kuznetsov, V., and Mohri, M. (2014), "Generalization bounds for time series prediction with non-stationary processes," in International Conference on Algorithmic Learning Theory, pp. 260-274.

Kuznetsov, V., and Mohri, M. (2015), "Learning theory and algorithms for forecasting non-stationary time series," in Advances in Neural Information Processing Systems 28, eds. C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, pp. 541-549.

Kuznetsov, V., and Mohri, M. (2017), "Generalization bounds for non-stationary mixing processes," Machine Learning, 106(1), 93-117.

McDiarmid, C. (1989), "On the method of bounded differences," in Surveys in Combinatorics, ed. J. Siemons, pp. 148-188, Cambridge, England, Cambridge University Press.

McDonald, D. J., Shalizi, C. R., and Schervish, M. (2011), "Estimating beta-mixing coefficients," in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics [AISTATS 2011], eds. G. Gordon, D. Dunson, and M. Dudík, vol. 15 of Journal of Machine Learning Research: Workshops and Conference Proceedings, pp. 516-524.
McDonald, D. J., Shalizi, C. R., and Schervish, M. (2015), "Estimating beta-mixing coefficients via histograms," Electronic Journal of Statistics, 9, 2855-2883.

Mohri, M., and Rostamizadeh, A. (2009), "Rademacher complexity bounds for non-i.i.d. processes," in Advances in Neural Information Processing Systems 21 [NIPS 2008], eds. D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, pp. 1097-1104.

Rakhlin, A., Sridharan, K., and Tewari, A. (2010), "Online learning: Random averages, combinatorial parameters, and learnability," in Advances in Neural Information Processing Systems 23 [NIPS 2010], eds. J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, pp. 1984-1992, Cambridge, Massachusetts, MIT Press.

Rakhlin, A., Sridharan, K., and Tewari, A. (2011), "Online learning: Stochastic and constrained adversaries," in Advances in Neural Information Processing Systems 24 [NIPS 2011], eds. J. Shawe-Taylor, R. S. Zemel, P. Bartlett, F. C. N. Pereira, and K. Q. Weinberger, pp. 1764-1772.

Rakhlin, A., Sridharan, K., and Tewari, A. (2015), "Sequential complexities and uniform martingale laws of large numbers," Probability Theory and Related Fields, 161(1/2), 111-153.

Shalizi, C., and Kontorovich, A. (2013), "Predictive PAC learning and process decompositions," in Advances in Neural Information Processing Systems 26, eds. C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, pp. 1619-1627.

van de Geer, S. A. (2002), "On Hoeffding's inequality for dependent random variables," in Empirical Process Techniques for Dependent Data, eds. H. Dehling, T. Mikosch, and M. Sorensen, pp. 161-169, Birkhäuser, Boston.

van Handel, R. (2014), "Ergodicity, decisions, and partial information," in Séminaire de Probabilités XLVI, eds. C. Donati-Martin, A. Lejay, and A. Rouault, pp. 411-459, Springer.

Wiener, N. (1956), "Nonlinear prediction and dynamics," in Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, ed. J. Neyman, vol. 3, pp. 247-252, Berkeley, University of California Press.

A Additional Proofs

Proposition (Standard i.i.d. Rademacher bound). If $Z_1, \ldots, Z_n$ is an i.i.d. sample from some probability distribution $P$, then $\mathbb{E}_Z[\Gamma_n(\mathcal{H})] \le R_n(\mathcal{H})$.

Proof. The usual proof introduces a "ghost sample" $\tilde{Z}_1^n$, where the $\tilde{Z}_t$ have the same distribution as the $Z_t$, but are independent of the latter and of each other. Then expectations may as well be taken over the ghost sample as the real one: $\mathbb{E}[h] = \mathbb{E}_{\tilde{Z}_1}[h(\tilde{Z}_1)] = \frac{1}{n}\sum_{t=1}^{n} \mathbb{E}_{\tilde{Z}}[h(\tilde{Z}_t)]$. Hence (using the notation from §2.2)
$$\gamma_n(h) = \frac{1}{n}\sum_{t=1}^{n} \mathbb{E}_{\tilde Z}[h(\tilde Z_t)] - \frac{1}{n}\sum_{t=1}^{n} h(Z_t) = \frac{1}{n}\sum_{t=1}^{n} \mathbb{E}_{\tilde Z}\bigl[h(\tilde Z_t) - h(Z_t)\bigr],$$
$$\Gamma_n(\mathcal{H}) \le \mathbb{E}_{\tilde Z}\left[\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{t=1}^{n} h(\tilde Z_t) - h(Z_t)\right], \qquad (5)$$
and
$$\mathbb{E}_Z[\Gamma_n(\mathcal{H})] \le \mathbb{E}_{Z, \tilde Z}\left[\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{t=1}^{n} h(\tilde Z_t) - h(Z_t)\right]. \qquad (6)$$
Eq. 5 holds because the supremum of expectations is less than or equal to the expected supremum, and Eq. 6 just takes the expectation of both sides with respect to $Z$. Since $Z_t$ and $\tilde{Z}_t$ have the same marginal distribution and are independent, $\mathcal{L}\bigl(h(\tilde{Z}_t) - h(Z_t)\bigr) = \mathcal{L}\bigl(h(Z_t) - h(\tilde{Z}_t)\bigr)$, and the signs of summands in Eq. 6 can be flipped arbitrarily, according to the Rademacher variables, without effect:
$$\begin{aligned}
\mathbb{E}_Z[\Gamma_n(\mathcal{H})] &\le \mathbb{E}_{Z,\tilde Z,\xi}\left[\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{t=1}^{n} \xi_t\bigl(h(\tilde Z_t) - h(Z_t)\bigr)\right] \\
&\le \mathbb{E}_{Z,\xi}\left[\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{t=1}^{n} \xi_t h(Z_t)\right] + \mathbb{E}_{\tilde Z,\xi}\left[\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{t=1}^{n} \xi_t h(\tilde Z_t)\right] \\
&= 2\,\mathbb{E}_{Z,\xi}\left[\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{t=1}^{n} \xi_t h(Z_t)\right] = R_n(\mathcal{H}).
\end{aligned}$$

Proof of Thm. 9. This result follows immediately from Thm. 14 upon setting the right-hand side equal to $\delta$ and solving for $\epsilon$.

Proof of Thm. 11. Write $\mathbf{h} \in \mathbb{R}^n$ for the vector $(h_1(Z_1), \ldots, h_n(Z_n))$. Note that, as the range of $\ell$ is $\mathbb{R}^+$, $\mathbf{h}$ lies in the non-negative orthant of $\mathbb{R}^n$ ($\mathbf{h} \ge 0$). Now,
$$\begin{aligned}
n\Gamma_n &= \sup_{h\in\mathcal{H}}\left(\sum_{t=1}^{n} h_t - \mathbb{E}_Z\left[\sum_{t=1}^{n} h_t\right]\right) \\
&\ge \sup_{h\in\mathcal{H}}\sum_{t=1}^{n} h_t - \sup_{h\in\mathcal{H}}\mathbb{E}_Z\left[\sum_{t=1}^{n} h_t\right] \quad\text{(property of sup)} \\
&\ge \sup_{h\in\mathcal{H}}\sum_{t=1}^{n} h_t - \mathbb{E}_Z\left[\sup_{h\in\mathcal{H}}\sum_{t=1}^{n} h_t\right] \quad\text{(Jensen's ineq.)} \\
&= \sup_{\mathbf{h}} \mathbf{1}^{\top}\mathbf{h} - \mathbb{E}_Z\left[\sup_{\mathbf{h}} \mathbf{1}^{\top}\mathbf{h}\right] \\
&\ge \mathbb{E}_{\xi}\left[\sup_{\mathbf{h}} \xi^{\top}\mathbf{h}\right] - \mathbb{E}_Z\left[\sup_{\mathbf{h}} \mathbf{1}^{\top}\mathbf{h}\right] = \frac{n}{2}\widehat{R}_n(\mathcal{H}) - K,
\end{aligned}$$
where $K$ is a constant. Therefore, $\mathbb{E}_Z\bigl[\frac{n}{2}\widehat{R}_n(\mathcal{H}) - K\bigr] = \frac{n}{2}R_n(\mathcal{H}) - K$. Since $\psi$ is increasing in its argument and we assumed that $n\Gamma_n$ satisfied eq. (2) for constants $c$ and $\tau$, we can apply Thm. 14 with $Z_n = \widehat{R}_n(\mathcal{H}) - K$ and constants $c \to 2c/n$ and $\tau$ as before. Thus,
$$\mathbb{P}\bigl(\widehat{R}_n(\mathcal{H}) - R_n(\mathcal{H}) > \epsilon\bigr) \le \exp\left\{-\frac{n\epsilon^2}{128 c^2(\tau+1)^2}\right\}.$$
Setting the right-hand side equal to $\delta/2$ and combining with Thm. 9 applied with $\delta \to \delta/2$, via the union bound, gives the result.

Proof of Thm. 14. Write $X_n - \mathbb{E}[X_n] = \sum_{t=1}^{n} W_t$, where $W_t = \mathbb{E}[X_n \mid \mathcal{F}_0^t] - \mathbb{E}[X_n \mid \mathcal{F}_0^{t-1}]$ for all $t = 1, \ldots, n$. Then $W_t$ is $\mathcal{F}_0^t$-measurable, and $\mathbb{E}[W_t \mid \mathcal{F}_0^{t-1}] = 0$ for all $t$. Now, let $K > 0$ be chosen below. Then
$$\begin{aligned}
\mathbb{E}\bigl[\psi(|W_t|/K) \mid \mathcal{F}_0^{t-1}\bigr]
&= \mathbb{E}\left[\psi\!\left(\frac{\bigl|\mathbb{E}[X_n\mid\mathcal{F}_0^t] - \mathbb{E}[X_n\mid\mathcal{F}_0^{t-1}]\bigr|}{K}\right) \,\Big|\, \mathcal{F}_0^{t-1}\right] \\
&= \mathbb{E}\left[\exp\left\{\frac{1}{K^2}\bigl(\mathbb{E}[X_n\mid\mathcal{F}_0^t] - \mathbb{E}[X_n\mid\mathcal{F}_0^{t-1}]\bigr)^2\right\} - 1 \,\Big|\, \mathcal{F}_0^{t-1}\right] \\
&\le \mathbb{E}\left[\exp\left\{\frac{2}{K^2}\Bigl(\mathbb{E}[X_n\mid\mathcal{F}_0^t]^2 + \mathbb{E}[X_n\mid\mathcal{F}_0^{t-1}]^2\Bigr)\right\} - 1 \,\Big|\, \mathcal{F}_0^{t-1}\right] \\
&= \mathbb{E}\left[\exp\left\{\left(\frac{|\mathbb{E}[X_n\mid\mathcal{F}_0^t]|}{K/\sqrt{2}}\right)^2\right\}\exp\left\{\left(\frac{|\mathbb{E}[X_n\mid\mathcal{F}_0^{t-1}]|}{K/\sqrt{2}}\right)^2\right\} - 1 \,\Big|\, \mathcal{F}_0^{t-1}\right] \\
&= \exp\left\{\left(\frac{|\mathbb{E}[X_n\mid\mathcal{F}_0^{t-1}]|}{K/\sqrt{2}}\right)^2\right\}\mathbb{E}\left[\exp\left\{\left(\frac{|\mathbb{E}[X_n\mid\mathcal{F}_0^t]|}{K/\sqrt{2}}\right)^2\right\} \,\Big|\, \mathcal{F}_0^{t-1}\right] - 1 \\
&\le \mathbb{E}\left[\exp\left\{\left(\frac{|X_n|}{K/\sqrt{2}}\right)^2\right\} \,\Big|\, \mathcal{F}_0^{t-1}\right]\mathbb{E}\left[\mathbb{E}\left[\exp\left\{\left(\frac{|X_n|}{K/\sqrt{2}}\right)^2\right\} \,\Big|\, \mathcal{F}_0^{t}\right] \,\Big|\, \mathcal{F}_0^{t-1}\right] - 1 \\
&= \mathbb{E}\left[\exp\left\{\left(\frac{|X_n|}{K/\sqrt{2}}\right)^2\right\} \,\Big|\, \mathcal{F}_0^{t-1}\right]\mathbb{E}\left[\exp\left\{\left(\frac{|X_n|}{K/\sqrt{2}}\right)^2\right\} \,\Big|\, \mathcal{F}_0^{t-1}\right] - 1 \\
&= \mathbb{E}\left[\exp\left\{\left(\frac{|X_n|}{K/\sqrt{2}}\right)^2\right\} \,\Big|\, \mathcal{F}_0^{t-1}\right]^2 - 1 \;\le\; (\tau+1)^2 - 1
\end{aligned}$$
for $K = c\sqrt{2}$. Therefore, we have
$$B_n^2 = \sum_{t=1}^{n} 2c^2\left(1 + \mathbb{E}\bigl[\psi\bigl(|W_t|/(\sqrt{2}c)\bigr) \,\big|\, \mathcal{F}_0^{t-1}\bigr]\right) \le 2nc^2(\tau+1)^2,$$
and so
$$\mathbb{P}(X_n - \mathbb{E}[X_n] > \epsilon) = \mathbb{P}\left(\sum_{t=1}^{n} W_t > \epsilon\right) \le \exp\left\{-\frac{\epsilon^2}{32 n c^2(\tau+1)^2}\right\},$$
by Lem. 13.
