Generalization error bounds for stationary autoregressive models

Daniel J. McDonald, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213, danielmc@stat.cmu.edu
Cosma Rohilla Shalizi, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213, cshalizi@stat.cmu.edu
Mark Schervish, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213, mark@cmu.edu

Version: June 2, 2011

Abstract

We derive generalization error bounds for stationary univariate autoregressive (AR) models. We show that imposing stationarity is enough to control the Gaussian complexity without further regularization. This lets us use structural risk minimization for model selection. We demonstrate our methods by predicting interest rate movements.

1 Introduction

In standard machine learning situations, we observe one variable, X, and wish to predict another variable, Y, with an unknown joint distribution. Time series models are slightly different: we observe a sequence of observations X_1^n ≡ {X_t}_{t=1}^n from some process, and we wish to predict X_{n+h}, for some h ∈ N. Throughout what follows, X = {X_t}_{t=−∞}^∞ will be a sequence of random variables, i.e., each X_t is a measurable mapping from some probability space (Ω, F, P) into a measurable space 𝒳. A block of the random sequence will be written X_i^j ≡ {X_t}_{t=i}^j, where either limit may go to infinity. The goal in building a predictive model is to learn a function f̂ which maps the past into predictions for the future, evaluating the resulting forecasts through a loss function ℓ(X_{n+h}, f̂(X_1^n)) which gives the cost of errors. Ideally, we would use f*, the function which minimizes the risk

R(f) ≡ E[ℓ(X_{n+h}, f(X_1^n))]

over all f ∈ F, the class of prediction functions we can use.
Since the true joint distribution of the sequence is unknown, so is R(f), but it is often estimated with the error on a training sample of size n,

R̂_n(f) ≡ (1/n) Σ_{t=1}^n ℓ(X_{t+h}, f(X_1^t)),   (1)

with f̂ being the minimizer of R̂_n over F. This is "empirical risk minimization". While R̂_n(f̂) converges to R(f̂) for many algorithms, one can show that when f̂ minimizes (1), E[R̂_n(f̂)] ≤ R(f̂). This is because the choice of f̂ adapts to the training data, causing the training error to be an over-optimistic estimate of the true risk. Also, training error must shrink as model complexity grows. Thus, empirical risk minimization gives unsatisfying results: it will tend to overfit the data and give poor out-of-sample predictions. Statistics and machine learning propose two mitigation strategies. The first is to restrict the class F. The second, which we follow, is to change the optimization problem, penalizing model complexity.

Without the true distribution, the prediction risk or generalization error is inaccessible. Instead, the goal is finding bounds on the risk which hold with high probability ("probably approximately correct", or PAC, bounds). A typical result is a confidence bound on the risk which says that with probability at least 1 − η,

R(f̂) ≤ R̂_n(f̂) + δ(C(F), n, η),

where C(·) measures the complexity of the model class F, and δ(·) is a function of this complexity, the confidence level, and the number of observed data points. The statistics and machine learning literature contains many generalization error bounds for both classification and regression problems with IID data, but their extension to time series prediction is a fairly recent development; in 1997, Vidyasagar [21] named extending such results to time series as an important open problem.
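The optimism of training error is easy to see in a small simulation (ours, purely illustrative; all names are hypothetical): fitting autoregressions of growing order by ordinary least squares, the in-sample error keeps shrinking whether or not the extra lags carry any signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dependent series: a Gaussian random walk (any dependent series would do).
x = rng.standard_normal(500).cumsum()

def training_error(x, p):
    """In-sample mean squared error of an OLS-fitted AR(p) predictor.

    Note: the usable sample has n - p points, so the comparison across p
    is approximate, but the downward trend with p is the point here."""
    n = len(x)
    Y = x[p:]
    X = np.column_stack([x[p - l : n - l] for l in range(1, p + 1)])
    phi, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return float(np.mean((Y - X @ phi) ** 2))

errs = {p: training_error(x, p) for p in (1, 2, 5, 10, 20)}
print(errs)
```

With more lags the fit can only adapt more closely to the particular sample, which is exactly why training error alone cannot be used for model selection.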
Yu [22] sets forth many of the uniform ergodic theorems that are needed to derive generalization error bounds for stochastic processes. Meir [11] is one of the first papers to construct risk bounds for time series. His approach was to consider a stationary but infinite-memory process, and to decompose the training error of a predictor with finite memory, chosen through empirical risk minimization, into three parts:

R̂(f̂_{p,n,d}) = (R̂(f̂_{p,n,d}) − R̂(f*_{p,d})) + (R̂(f*_{p,d}) − R̂(f*_p)) + R̂(f*_p),

where f̂_{p,n,d} is an empirical estimate based on finite data of length n, finite memory of length p, and complexity indexed by d; f*_{p,d} is the oracle with finite memory and given complexity, and f*_p is the oracle with finite memory over all possible complexities. The three terms amount to an estimation error incurred from the use of limited and noisy data, an approximation error due to selecting a predictor from a class of limited complexity, and a loss from approximating an infinite-memory process with a finite-memory process.

More recently, others have provided PAC results for non-IID data. Steinwart and Christmann [20] prove an oracle inequality for generic regularized empirical risk minimization algorithms learning from α-mixing processes, a fairly general sort of weak serial dependence, getting learning rates for least-squares support vector machines (SVMs) close to the optimal IID rates. Mohri and Rostamizadeh [13] prove stability-based generalization bounds when the data are stationary and ϕ-mixing or β-mixing, strictly generalizing IID results and applying to all stable learning algorithms. (We define β-mixing below.) Karandikar and Vidyasagar [8] show that if an algorithm is "subadditive" and yields a predictor whose risk can be upper bounded when the data are IID, then the same algorithm yields predictors whose risk can be bounded if data are β-mixing.
They use this result to derive generalization error bounds in terms of the learning rates for IID data and the β-mixing coefficients.

All these generalization bounds for dependent data rely on notions of complexity which, while common in machine learning, are hard to apply to models and algorithms ubiquitous in the time series literature. SVMs, neural networks, and kernel methods have known complexities, so their risk can be bounded on dependent data as well. On the other hand, autoregressive moving average (ARMA) models, generalized autoregressive conditional heteroskedasticity (GARCH) models, and state-space models in general have unknown complexity and are therefore neglected theoretically. (This does not keep them from being used in applied statistics, or even in machine learning and robotics, e.g., [17, 15, 18, 2, 10].) Arbitrarily regularizing such models will not do, as often the only assumption applied researchers are willing to make is that the time series is stationary. We show that the assumption of stationarity regularizes autoregressive (AR) models implicitly, allowing for the application of risk bounds without the need for additional penalties. This result follows from work in the optimal control and systems design literatures, but the application is novel.

In § 2, we introduce concepts from time series and complexity theory necessary for our results. Section 3 uses these results to calculate explicit risk bounds for autoregressive models. Section 4 illustrates the applicability of our methods by forecasting interest rate movements. We discuss our results and articulate directions for future research in § 5.

2 Preliminaries

Before developing our results, we need to explain the idea of effective sample size for dependent data, and the closely related measure of serial dependence called β-mixing, as well as the Gaussian complexity technique for measuring model complexity.
2.1 Time series

Because time-series data are dependent, the number of data points n in a sample X_1^n exaggerates how much information the sample contains. Knowing the past allows forecasters to predict future data (at least to some degree), so actually observing those future data points gives less information about the underlying data-generating process than in the IID case. Thus, the sample size term in a probabilistic risk bound must be adjusted to reflect the dependence in the data source. This effective sample size may be much less than n. We investigate only stationary β-mixing input data. We first remind the reader of the notion of (strict or strong) stationarity.

Definition 2.1 (Stationarity). A sequence of random variables X is stationary when all its finite-dimensional distributions are invariant over time: for all t and all non-negative integers i and j, the random vectors X_t^{t+i} and X_{t+j}^{t+i+j} have the same distribution.

From among all the stationary processes, we restrict ourselves to ones where widely-separated observations are asymptotically independent. Stationarity does not imply that the random variables X_t are independent across time t, only that the distribution of X_t is constant in time. The next definition describes the nature of the serial dependence which we are willing to allow.

Definition 2.2 (β-Mixing). Let σ_i^j = σ(X_i^j) be the σ-field of events generated by the appropriate collection of random variables. Let P_t be the restriction of P to σ_{−∞}^t, P_{t+m} be the restriction of P to σ_{t+m}^∞, and P_{t⊗t+m} be the restriction of P to σ(X_{−∞}^t, X_{t+m}^∞). The coefficient of absolute regularity, or β-mixing coefficient, β(m), is given by

β(m) ≡ ||P_t × P_{t+m} − P_{t⊗t+m}||_TV,   (2)

where ||·||_TV is the total variation norm. A stochastic process is absolutely regular, or β-mixing, if β(m) → 0 as m → ∞.
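Definition 2.2 can be made concrete for a process where every distribution is computable. For a Gaussian AR(1), X_t = φ X_{t−1} + ε_t with ε_t ~ N(0,1), both the m-step conditional law and the stationary law are Gaussian, and the Markov-process form of β(m) used later in Section 4 reduces to an average total-variation distance between Gaussians. The sketch below (our own illustration; the unit innovation variance and function names are our choices) estimates it by Monte Carlo over the stationary distribution.

```python
import numpy as np

def tv_gaussians(mu1, s1, mu2, s2):
    """Total variation distance between N(mu1, s1^2) and N(mu2, s2^2),
    by numerical integration of 0.5 * integral |p - q|."""
    lo = min(mu1 - 8 * s1, mu2 - 8 * s2)
    hi = max(mu1 + 8 * s1, mu2 + 8 * s2)
    z, dz = np.linspace(lo, hi, 20001, retstep=True)
    p = np.exp(-0.5 * ((z - mu1) / s1) ** 2) / (s1 * np.sqrt(2 * np.pi))
    q = np.exp(-0.5 * ((z - mu2) / s2) ** 2) / (s2 * np.sqrt(2 * np.pi))
    return 0.5 * float(np.sum(np.abs(p - q)) * dz)

def beta_ar1(phi, m, n_mc=500, seed=0):
    """Monte Carlo estimate of beta(m) = E_{x ~ pi} ||P^m(x, .) - pi||_TV
    for X_t = phi * X_{t-1} + N(0, 1) innovations."""
    rng = np.random.default_rng(seed)
    s_pi = 1.0 / np.sqrt(1.0 - phi ** 2)                       # stationary sd
    s_m = np.sqrt((1.0 - phi ** (2 * m)) / (1.0 - phi ** 2))   # m-step conditional sd
    xs = rng.normal(0.0, s_pi, n_mc)
    return float(np.mean([tv_gaussians(phi ** m * x, s_m, 0.0, s_pi) for x in xs]))

b1, b20 = beta_ar1(0.9, 1), beta_ar1(0.9, 20)
print(b1, b20)  # beta(m) decays toward zero as m grows
```

The decay of β(m) is what licenses treating well-separated blocks of the series as nearly independent.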
This is only one of many equivalent characterizations of β-mixing (see Bradley [3] for others). This definition makes clear that a process is β-mixing if the joint probability of events which are widely separated in time increasingly approaches the product of the individual probabilities, i.e., that X is asymptotically independent. Typically, a supremum over t is taken in (2); however, this is unnecessary for stationary processes, i.e., β(m) as defined above is independent of t.

2.2 Gaussian complexity

Statistical learning theory provides several ways of measuring the complexity of a class of predictive models. The results we are using here rely on Gaussian complexity (see, e.g., Bartlett and Mendelson [1]), which can be thought of as measuring how well the model can (seem to) fit white noise.

Definition 2.3 (Gaussian Complexity). Let X_1^n be a (not necessarily IID) sample drawn according to ν. The empirical Gaussian complexity is

Ĝ_n(F) ≡ 2 E_Z [ sup_{f∈F} (1/n) Σ_{i=1}^n Z_i f(X_1^i) | X_1^n ],

where the Z_i are a sequence of random variables, independent of each other and everything else, and drawn from a standard Gaussian distribution. The Gaussian complexity is

G_n(F) ≡ E_ν [ Ĝ_n(F) ],

where the expectation is over sample paths D_n generated by ν.

The term inside the supremum, (1/n) Σ_{i=1}^n Z_i f(X_1^i), is the sample covariance between the noise Z and the predictions of a particular model f. The Gaussian complexity takes the largest value of this sample covariance over all models in the class (mimicking empirical risk minimization), then averages over realizations of the noise. Intuitively, Gaussian complexity measures how well our models could seem to fit outcomes which were really just noise, giving a baseline against which to assess the risk of over-fitting or failing to generalize.
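For any finite (or finely discretized) model class, Definition 2.3 can be approximated by straightforward Monte Carlo. The sketch below (ours, for illustration) does this for stationary AR(1) predictors f_φ(X_1^i) = φ X_{i−1} with |φ| < 1; the grid over φ and the placeholder data series are assumptions of the example, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_gaussian_complexity(series, phis, n_mc=2000, rng=rng):
    """Monte Carlo estimate of the empirical Gaussian complexity of the
    class {f_phi(X_1^i) = phi * X_{i-1}}, conditioning on the observed data."""
    n = len(series) - 1
    preds = np.outer(phis, series[:-1])   # one row per model: its predictions f(X_1^i)
    total = 0.0
    for _ in range(n_mc):
        z = rng.standard_normal(n)        # the Gaussian noise Z_i
        total += np.max(preds @ z) / n    # sup over the class of (1/n) sum_i Z_i f(X_1^i)
    return 2.0 * total / n_mc

x = rng.standard_normal(401)              # placeholder data; any series works here
phis = np.linspace(-0.99, 0.99, 199)      # discretized stability domain B_1
g_hat = empirical_gaussian_complexity(x, phis)
print(g_hat)
```

The estimate shrinks as the series lengthens, which is the behavior discussed next.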
As the sample size n grows, for any given f the sample covariance (1/n) Σ_{i=1}^n Z_i f(X_1^i) → 0 by the ergodic theorem; the overall Gaussian complexity should also shrink, though more slowly, unless the model class is so flexible that it can fit absolutely anything, in which case one can conclude nothing about how well it will predict in the future from the fact that it performed well in the past.

2.3 Error bounds for β-mixing data

Mohri and Rostamizadeh [12] present Gaussian¹ complexity-based error bounds for stationary β-mixing sequences, a generalization of similar bounds presented earlier for the IID case. The results are data-dependent and measure the complexity of a class of hypotheses based on the training sample.

Theorem 2.4. Let F be a space of candidate predictors and let H be the space of induced losses: H = {h = ℓ(·, f(·)) : f ∈ F} for some loss function 0 ≤ ℓ(·, ·) ≤ M. Then for any sample X_1^n drawn from a stationary β-mixing distribution, and for any µ, m > 0 with 2µm = n and η > 4(µ−1)β(m), where β(m) is the mixing coefficient, with probability at least 1 − η,

R(f̂) ≤ R̂_n(f̂) + (π/2)^{1/2} Ĝ_µ(H) + 3M √( ln(4/η′) / (2µ) ),

and

R(f̂) ≤ R̂_n(f̂) + (π/2)^{1/2} G_µ(H) + M √( ln(2/η′) / (2µ) ),

where η′ = η − 4(µ−1)β(m) in the first case and η′ = η − 2(µ−1)β(m) in the second.

The generalization error bounds in Theorem 2.4 have a straightforward interpretation. The risk of a chosen model is controlled, with high probability, by three terms. The first term, the training error, describes how well the model performs in-sample. More complicated models can more closely fit any data set, so increased complexity leads to smaller training error. This is penalized by the second term, the Gaussian complexity.
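The trade-off among the terms is simple to compute. Here is a small helper of ours (not from the paper) evaluating the confidence term of the first bound, including the η′ correction for dependence; the numerical inputs in the example are illustrative, with β(m) a made-up value.

```python
import math

def confidence_term(M, mu, eta, beta_m):
    """Third term of the first (empirical-complexity) bound in Theorem 2.4:
    3 * M * sqrt(ln(4/eta') / (2*mu)), where eta' = eta - 4*(mu - 1)*beta(m)."""
    eta_prime = eta - 4.0 * (mu - 1) * beta_m
    if eta_prime <= 0:
        raise ValueError("need eta > 4*(mu - 1)*beta(m) for the bound to hold")
    return 3.0 * M * math.sqrt(math.log(4.0 / eta_prime) / (2.0 * mu))

# mu = 867 and the loss cap M = 0.05 echo the Section 4 application;
# beta_m = 1e-6 is an assumed, illustrative mixing coefficient.
c = confidence_term(M=0.05, mu=867, eta=0.05, beta_m=1e-6)
print(c)
```

Note how strong dependence (large β(m)) eats into the confidence budget η: past the threshold η ≤ 4(µ−1)β(m), no bound of this form is available.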
The first bound uses the empirical Gaussian complexity, which is calculated from the data X_1^n, while the second uses the expected Gaussian complexity, and is therefore tighter. The third term is the confidence term and is a function only of the confidence level η and µ, the effective number of data points on which the model was based. While the model was actually trained on n data points, because of dependence, this number must be reduced. This reduction is accomplished by taking µ widely spaced blocks of points. Under the asymptotic independence quantified by β, this spacing lets us treat these blocks as independent.

¹ In fact, they present the bounds in terms of the Rademacher complexity, a closely related idea. However, using Gaussian complexity instead requires no modifications to their results while simplifying the proofs contained here. The constant (π/2)^{1/2} in Theorem 2.4 is given in Ledoux and Talagrand [9].

3 Results

Autoregressive models are used frequently in economics, finance, and other disciplines. Their main utility lies in their straightforward parametric form, as well as their interpretability: predictions for the future are linear combinations of some fixed number of previous observations. See Shumway and Stoffer [19] for a standard introduction. Suppose that X is a real-valued random sequence, evolving as

X_t = Σ_{i=1}^p φ_i X_{t−i} + ε_t,

where ε_t has mean zero and finite variance, ε_j ⊥⊥ ε_i for all i ≠ j, and ε_i ⊥⊥ X_j for all i > j. This is the traditional specification of an autoregressive order p, or AR(p), model. Having observed data {X_t}_{t=1}^n, and supposing p to be known, fitting the model amounts to estimating the coefficients {φ_i}_{i=1}^p. The most natural way to do this is to use ordinary least squares (OLS). Let

φ = (φ_1, φ_2, …, φ_p)′,    Y = (X_{p+1}, X_{p+2}, …, X_{n−1}, X_n)′,

and let X be the (n−p) × p design matrix of lagged observations:

X = [ X_p      X_{p−1}  ···  X_1
      X_{p+1}  X_p      ···  X_2
      ⋮        ⋮        ⋱    ⋮
      X_{n−2}  X_{n−3}  ···  X_{n−p−1}
      X_{n−1}  X_{n−2}  ···  X_{n−p}  ].

Generalization error bounds for these processes follow from an ability to characterize their Gaussian complexity. The theorem below uses stationarity to bound the risk of AR models. The remainder of this section provides the components necessary to prove the results.

Theorem 3.1. Let D_n be a sample of length n from a stationary β-mixing distribution. For any µ, m > 0 with 2µm = n and η > 4(µ−1)β(m), then under squared error loss truncated at M, the prediction error of an AR(p) (p > 1) model can be bounded with probability at least 1 − η using

R(f̂) ≤ R̂_n(f̂) + (4/µ) √(π M log(p+1)) ( max_{1≤j,j′≤p+1} Σ_{i∈I} ⟨X_i, φ^j − φ^{j′}⟩² )^{1/2} + 3M √( ln(4/η′) / (2µ) ),

or

R(f̂) ≤ R̂_n(f̂) + (4/µ) √(π M log(p+1)) E[ ( max_{1≤j,j′≤p+1} Σ_{i∈I} ⟨X_i, φ^j − φ^{j′}⟩² )^{1/2} ] + M √( ln(2/η′) / (2µ) ),

where I = {i : i = ⌊a/2⌋ + 2ak, 0 ≤ k ≤ µ}, φ^j is the j-th vertex of the stability domain, and X_i is the i-th row of the design matrix X.

For p = 1 slight adjustments are required. We state this result as a corollary.

Corollary 3.2. Under the same conditions as above, the prediction error of an AR(1) model can be bounded with probability at least 1 − η using

R(f̂) ≤ R̂_n(f̂) + (4/µ) √(M/2) ( Σ_{i∈I} X_i² )^{1/2} + 3M √( ln(4/η′) / (2µ) ),

or

R(f̂) ≤ R̂_n(f̂) + (4/µ) √(M/2) E[ ( Σ_{i∈I} X_i² )^{1/2} ] + M √( ln(2/η′) / (2µ) ).

3.1 Proof components

To prove Theorem 3.1 it is necessary to control the size of the model class by using the stationarity assumption.

3.1.1 Stationarity controls the hypothesis space

Define, as an estimator of φ,

φ̂ ≡ argmin_φ ||Y − Xφ||_2²,   (3)

where ||·||_2 is the Euclidean norm.² Equation 3 has the usual closed-form OLS solution:

φ̂ = (X′X)^{−1} X′Y.   (4)

Despite the simplicity of Eq. 4, modellers often require that the estimated autoregressive process be stationary.
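A minimal sketch of the estimator in Eq. 4, together with a numerical check of the stationarity requirement via the root condition on the AR characteristic polynomial described next (helper names and the simulated series are our own, not the paper's):

```python
import numpy as np

def fit_ar_ols(x, p):
    """Eq. 4: phi_hat = (X'X)^{-1} X'Y, with the lagged design matrix of Section 3."""
    n = len(x)
    Y = x[p:]
    # Column l holds the lag-l values: row i is (X_{p+i-1}, ..., X_i).
    X = np.column_stack([x[p - l : n - l] for l in range(1, p + 1)])
    phi, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return phi

def is_stationary(phi):
    """True if all roots of Q_p(z) = z^p - phi_1 z^{p-1} - ... - phi_p
    lie strictly inside the unit circle."""
    roots = np.roots(np.concatenate(([1.0], -np.asarray(phi, dtype=float))))
    return bool(np.all(np.abs(roots) < 1.0))

# Simulate a stationary AR(2) and recover its coefficients.
rng = np.random.default_rng(1)
true_phi = np.array([0.5, -0.2])
x = np.zeros(5000)
for t in range(2, len(x)):
    x[t] = true_phi[0] * x[t - 1] + true_phi[1] * x[t - 2] + rng.standard_normal()
phi_hat = fit_ar_ols(x, 2)
print(phi_hat, is_stationary(phi_hat))
```

Nothing in the OLS step itself enforces the stationarity constraint, which is exactly the gap the next subsection addresses.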
This can be checked algebraically: the complex roots of the polynomial

Q_p(z) = z^p − φ_1 z^{p−1} − φ_2 z^{p−2} − ··· − φ_p

must lie strictly inside the unit circle. Eq. 3 is thus not quite right for estimating a stationary autoregressive model, as it does not incorporate this constraint. Constraining the roots of Q_p(z) constrains the coefficients φ. The set of φ where the process is stationary is the stability domain, B_p. Clearly, B_1 is just |φ_1| < 1. Fam and Meditch [6] give a recursive method for determining B_p for general p. In particular, they show that the convex hull of the space of stationary solutions is a convex polyhedron with vertices at the extremes of B_p. This convex hull essentially determines the complexity of stationary AR models.

3.1.2 Gaussian complexity of AR models

Returning to the AR(p) model, it is necessary to find the Gaussian complexity of the function class

F_p = { φ : x_t = Σ_{i=1}^p φ_i x_{t−i} and x_t is stationary }.

Theorem 3.3. For the AR(p) model with p > 1, the empirical Gaussian complexity satisfies

Ĝ_k(F) ≤ (2√2/n) (log(p+1))^{1/2} max_{1≤j,j′≤p+1} ( Σ_{i=1}^k ⟨X_i, φ^j − φ^{j′}⟩² )^{1/2},

where φ^j is the j-th vertex of the stability domain and X_i is the i-th row of the design matrix X.

The proof relies on the following version of Slepian's Lemma (see, for example, Ledoux and Talagrand [9] or Bartlett and Mendelson [1]).

Lemma 3.4 (Slepian). Let V_1, …, V_k be random variables such that for all 1 ≤ j ≤ k, V_j = Σ_{i=1}^n a_{ij} g_i, where g_1, …, g_n are IID standard normal random variables. Then

E[max_j V_j] ≤ √2 (log k)^{1/2} max_{j,j′} √( E[(V_j − V_j′)²] ).

Proof of Theorem 3.3.
Ĝ_n(F) = E[ sup_{f∈F} (2/n) Σ_{i=1}^n g_i f(x_i) ]
       = E[ sup_{φ∈B_p} (2/n) Σ_{i=1}^n g_i ⟨X_i, φ⟩ ]
       = E[ sup_{φ∈B_p} ⟨ (2/n) Σ_{i=1}^n g_i X_i, φ ⟩ ]
       = E[ sup_{φ∈conv(B_p)} ⟨ (2/n) Σ_{i=1}^n g_i X_i, φ ⟩ ],

where the last equality follows from Theorem 12 in [1]. By standard results from convex optimization, this supremum is attained at one of the vertices of conv(B_p). Therefore,

Ĝ_n(F) = E[ max_j (2/n) Σ_{i=1}^n g_i ⟨X_i, φ^j⟩ ],

where φ^j is the j-th vertex of conv(B_p). Let V_j = Σ_{i=1}^n g_i ⟨X_i, φ^j⟩. Then by Lemma 3.4,

Ĝ_n(F) ≤ (2√2/n) (log(p+1))^{1/2} max_{j,j′} √( E[(V_j − V_j′)²] )
       = (2√2/n) (log(p+1))^{1/2} max_{j,j′} √( E[ ( Σ_{i=1}^n g_i ⟨X_i, φ^j − φ^{j′}⟩ )² ] )
       = (2√2/n) (log(p+1))^{1/2} max_{1≤j,j′≤p+1} √( Σ_{i=1}^n ⟨X_i, φ^j − φ^{j′}⟩² ),

where X_i is the i-th row of the design matrix. When p = 1, as in Corollary 3.2, we can calculate the complexity directly.

The proof's last line shows that we are essentially interested in the diameter of the stability domain B_p projected onto the column space of X, which gives a tighter bound than that from the general results on linear prediction in, e.g., Kakade et al. [7]. Since we care about the complexity of the model class F viewed through the loss function ℓ, we must also account for this additional complexity. For c-Lipschitz loss functions, this just means multiplying G_n(F) by 2c.

² There are other ways to estimate AR models, but they typically amount to very similar optimization problems.

4 Application

We illustrate our results by predicting interest rate changes; specifically, the 10-year Treasury Constant Maturity Rate series from the Federal Reserve Bank of St. Louis' FRED database³, recorded daily from January 2, 1962 to August 31, 2010.

[Figure 1: Growth rate (day/day) of the 10-year Treasury bond, 1962–2010.]
Transforming the series into daily natural-log growth rates leaves n = 12150 observations (Figure 1). The changing variance apparent in the figure is why interest rates are typically forecast with GARCH(1,1) models. For this illustration, however, we will use an AR(p) model, picking the memory order p by the risk bound. Figure 2 shows the training error

R̂_n(f̂) = (1/(n−p)) Σ_{t=p+1}^n (X̂_t − X_t)²,

where X_t is the t-th data point and X̂_t is the model's prediction.

³ Available at http://research.stlouisfed.org/fred2/series/DGS10?cid=115.

[Figure 2: Training error (top panel) and AIC (bottom panel) against model order.]

R̂_n shrinks as the order of the model (p) grows, as it must, since ordinary least squares minimizes R̂_n for a given p. Also shown is the gap between the AIC for different p and the lowest attainable value; this would select an AR(36) model. A better strategy uses the probabilistic risk bound derived above. The goal of model selection is to pick, with high probability, the model with the smallest risk; this is Vapnik's structural risk minimization principle. Here, it is clear that AIC is dramatically overfitting. The optimal model using the risk bound is an AR(1).

[Figure 3: Generalization error bound for different model orders.]

Figure 3 plots the risk bound against p with the loss function truncated at 0.05. (No daily interest rate change has ever had loss larger than 0.034, and results are fairly insensitive to the level of the loss cap.) This bound says that with 95% probability, regardless of the true data generating process, the AR(1) model will make mistakes with squared error no larger than 0.0079. If we had instead predicted with zero, this loss would have occurred three times.
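The AIC comparison can be reproduced schematically. A common Gaussian form of AIC for an OLS-fitted AR(p) is n log σ̂² + 2(p+1); this is a generic textbook variant, chosen by us for illustration, and not necessarily the exact formula behind Figure 2.

```python
import numpy as np

def aic_ar(x, p):
    """Gaussian AIC for an OLS-fitted AR(p): n_eff * log(sigma2_hat) + 2*(p + 1)."""
    n = len(x)
    Y = x[p:]
    X = np.column_stack([x[p - l : n - l] for l in range(1, p + 1)])
    phi, *_ = np.linalg.lstsq(X, Y, rcond=None)
    sigma2 = np.mean((Y - X @ phi) ** 2)
    return float(len(Y) * np.log(sigma2) + 2 * (p + 1))

# Select the order minimizing AIC over a candidate range (toy AR(1) data).
rng = np.random.default_rng(3)
x = np.zeros(3000)
for t in range(1, len(x)):
    x[t] = 0.6 * x[t - 1] + rng.standard_normal()
orders = range(1, 11)
p_aic = min(orders, key=lambda p: aic_ar(x, p))
print(p_aic)
```

Structural risk minimization replaces the AIC value in this loop with the full probabilistic bound (training error plus complexity plus confidence term), which is what selects the AR(1) in Figure 3.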
One issue with Theorem 2.4 is that it requires knowledge of the β-mixing coefficients β(m). Of course, the dependence structure of these data is unknown, so we calculated the coefficients under generous assumptions on the data generating process. In a homogeneous Markov process, the β-mixing coefficients work out to

β(m) = ∫ π(dx) ||P^m(x, ·) − π||_TV,

where P^m(x, ·) is the m-step transition operator and π is the stationary distribution [14, 4]. Since AR models are Markovian, we estimated an AR(q) model with Gaussian errors for q large and calculated the mixing coefficients using the stationary and transition distributions. To create the bound, we used m = 7 and µ = 867. We address non-parametric estimation of β-mixing coefficients elsewhere [Anon.].

5 Discussion

We have constructed a finite-sample predictive risk bound for autoregressive models, using the stationarity assumption to constrain OLS estimation. Interestingly, stationarity, a common assumption among applied researchers, constrains the model space enough to yield bounds without further regularization. Moreover, this is the first predictive risk bound we know of for any of the standard models of time series analysis.

Traditionally, time series analysts have selected models by blending empirical risk minimization, more-or-less quantitative inspection of the residuals (e.g., the Box-Ljung test; see [19]), and AIC. In many applications, however, what really matters is prediction, and none of these techniques, including AIC, controls generalization error, especially under mis-specification. (Cross-validation is a partial exception, but it is tricky for time series; see [16] and references therein.) Our bound controls prediction risk directly.
Admittedly, our bound covers only univariate autoregressive models, the plainest of a large family of traditional time series models, but we believe a similar result will cover the more elaborate members of the family such as vector autoregressive, autoregressive moving average, or autoregressive conditionally heteroskedastic models. While the characterization of the stationary domain from [6] on which we relied breaks down for such models, they are all variants of the linear state space model [5], whose parameters are restricted under stationarity, and so we hope to obtain a general risk bound, possibly with stronger variants for particular specifications.

References

[1] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

[2] B.C. Becker, H. Tummala, and C.N. Riviere. Autoregressive modeling of physiological tremor under microsurgical conditions. In Engineering in Medicine and Biology Society, 2008. EMBS 2008. 30th Annual International Conference of the IEEE, pages 1948–1951. IEEE, 2008.

[3] Richard C. Bradley. Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys, 2:107–144, 2005.

[4] Y.A. Davydov. Mixing conditions for Markov chains. Theory of Probability and its Applications, 18(2):312–328, 1973.

[5] J. Durbin and S.J. Koopman. Time Series Analysis by State Space Methods. Oxford University Press, Oxford, 2001.

[6] Adly T. Fam and James S. Meditch. A canonical parameter space for linear systems design. IEEE Transactions on Automatic Control, 23(3):454–458, 1978.

[7] Sham M. Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. Technical report, NIPS, 2008. URL http://ttic.uchicago.edu/~karthik/rad-paper.pdf.

[8] R. L. Karandikar and M.
Vidyasagar. Probably approximately correct learning with beta-mixing input sequences. Submitted for publication, 2009.

[9] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. A Series of Modern Surveys in Mathematics. Springer-Verlag, Berlin, 1991. ISBN 3540520139.

[10] J. Li and A.W. Moore. Forecasting web page views: Methods and observations. Journal of Machine Learning Research, 9:2217–2250, 2008.

[11] Ron Meir. Nonparametric time series prediction through adaptive model selection. Machine Learning, 39(1):5–34, 2000. URL http://www.ee.technion.ac.il/~rmeir/Publications/MeirTimeSeries00.pdf.

[12] Mehryar Mohri and Afshin Rostamizadeh. Rademacher complexity bounds for non-i.i.d. processes. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, volume 21, pages 1097–1104, 2009.

[13] Mehryar Mohri and Afshin Rostamizadeh. Stability bounds for stationary ϕ-mixing and β-mixing processes. Journal of Machine Learning Research, 11:789–814, February 2010.

[14] A. Mokkadem. Mixing properties of ARMA processes. Stochastic Processes and their Applications, 29(2):309–315, 1988.

[15] R.K. Olsson and L.K. Hansen. Linear state-space models for blind source separation. Journal of Machine Learning Research, 7:2585–2602, 2006. ISSN 1532-4435.

[16] J. Racine. Consistent cross-validatory model-selection for dependent data: hv-block cross-validation. Journal of Econometrics, 99(1):39–61, 2000.

[17] J. Ruiz-del-Solar and P. Vallejos. Motion detection and tracking for an AIBO robot using motion compensation and Kalman filtering. In Lecture Notes in Computer Science 3276 (RoboCup 2004), pages 619–627. Springer-Verlag, 2005.

[18] M. Sak, D.L. Dowe, and S. Ray. Minimum message length moving average time series data mining.
In Computational Intelligence Methods and Applications, 2005 ICSC Congress on, page 6. IEEE, 2006. ISBN 1424400201.

[19] R.H. Shumway and D.S. Stoffer. Time Series Analysis and Its Applications. Springer Series in Statistics. Springer-Verlag, New York, 2000.

[20] Ingo Steinwart and Andreas Christmann. Fast learning from non-i.i.d. observations. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1768–1776. MIT Press, 2009. URL http://books.nips.cc/papers/files/nips22/NIPS2009_1061.pdf.

[21] M. Vidyasagar. A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems. Springer-Verlag, Berlin, 1997.

[22] Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94–116, 1994.