Finite Sample Analysis Of Dynamic Regression Parameter Learning
Authors: Mark Kozdoba, Edward Moroshko, Shie Mannor, Koby Crammer (Technion, Israel Institute of Technology)
Emails: markk@ef.technion.ac.il, edward.moroshko@gmail.com, shie@ee.technion.ac.il, koby@ee.technion.ac.il

36th Conference on Neural Information Processing Systems (NeurIPS 2022).
Abstract

We consider the dynamic linear regression problem, where the predictor vector may vary with time. This problem can be modeled as a linear dynamical system with a non-constant observation operator, where the parameters that need to be learned are the variances of both the process noise and the observation noise. While variance estimation for dynamic regression is a natural problem with a variety of applications, existing approaches to this problem either lack guarantees altogether, or only have asymptotic guarantees without explicit rates. In particular, the existing literature does not provide any clues to the following fundamental question: in terms of data characteristics, what does the convergence rate depend on? In this paper we study the global system operator, the operator that maps the noise vectors to the output. We obtain estimates on its spectrum, and as a result derive the first known variance estimators with finite sample complexity guarantees. The proposed bounds depend on the shape of a certain spectrum related to the system operator, and thus provide the first known explicit geometric parameter of the data that can be used to bound estimation errors. In addition, the results hold for arbitrary sub-Gaussian distributions of the noise terms. We evaluate the approach on synthetic and real-world benchmarks.

1 Introduction

A dynamic linear regression (West and Harrison, 1997, Chapter 3), or non-stationary regression, is a situation where we are given a sequence of scalar observations $\{Y_t\}_{t \le T} \subset \mathbb{R}$ and observation vectors $\{u_t\}_{t \le T} \subset \mathbb{R}^n$ such that $Y_t = \langle X_t, u_t \rangle + z_t$, where $X_t \in \mathbb{R}^n$ is a regressor vector and $z_t$ is a random noise term. In contrast to standard linear regression, the vector $X_t$ may change with time. One common objective for this problem is, at time $T$, to estimate the trajectory of $X_t$ for $t \le T$, given the observation vectors and observations $\{u_t\}_{t \le T}, \{Y_t\}_{t \le T}$, and possibly to forecast $Y_{T+1}$ if $u_{T+1}$ is also known. In this paper we model the problem as follows:

$$X_{t+1} = X_t + h_t, \quad (1)$$
$$Y_t = \langle X_t, u_t \rangle + z_t, \quad (2)$$

where $\langle \cdot, \cdot \rangle$ is the standard inner product on $\mathbb{R}^n$; the observation noise terms $z_t$ are zero-mean sub-Gaussian random variables with variance $\eta^2$; and the process noise variables $h_t$ take values in $\mathbb{R}^n$, such that the coordinates of $h_t$ are zero-mean sub-Gaussian, independent, and have variance $\sigma^2$. All $h_t$ and $z_t$ variables are assumed to be mutually independent. The vectors $u_t$ are an arbitrary sequence in $\mathbb{R}^n$, and the observed, known quantities at time $T$ are $\{Y_t\}_{t \le T}$ and $\{u_t\}_{t \le T}$. The system (1)-(2) is a special case of a Linear Dynamical System (LDS). As is well known, when the parameters $\sigma, \eta$ are given, the mean-squared loss optimal forecast for $Y_{T+1}$ and estimate for $X_T$ are obtained by the Kalman filter (Anderson and Moore, 1979; Hamilton, 1994; Chui and Chen, 2017). In this paper we are concerned with estimators for $\sigma, \eta$, and finite sample complexity guarantees for these estimators.
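To fix ideas, the data model (1)-(2) is easy to simulate. The following minimal numpy sketch generates a trajectory, using Gaussian noises as one admissible sub-Gaussian choice; the dimensions, parameter values, and function name are ours and purely illustrative, not the authors' code.

```python
import numpy as np

def simulate_dynamic_regression(T, n, sigma, eta, rng):
    """Sample {Y_t}, {u_t} from X_{t+1} = X_t + h_t, Y_t = <X_t, u_t> + z_t."""
    U = rng.normal(size=(T, n))                  # arbitrary observation vectors u_t
    H = rng.normal(scale=sigma, size=(T, n))     # process noise h_t, coordinate variance sigma^2
    X = np.cumsum(H, axis=0)                     # random-walk regressor X_t = h_1 + ... + h_t
    z = rng.normal(scale=eta, size=T)            # observation noise z_t, variance eta^2
    Y = np.einsum("ti,ti->t", U, X) + z          # Y_t = <X_t, u_t> + z_t
    return Y, U

rng = np.random.default_rng(0)
Y, U = simulate_dynamic_regression(T=500, n=5, sigma=np.sqrt(0.5), eta=np.sqrt(2.0), rng=rng)
```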
Let us first make a few remarks about the particular system (1)-(2). First, as a natural model of time-varying regression, this system is useful in a considerable variety of applications. We refer to West and Harrison (1997), Chapter 3, for numerous examples. In addition, an application to electricity consumption time series as a function of temperature is provided in the experiments section of this paper. Second, one may regard the problem of estimating $\sigma, \eta$ in (1)-(2) as a pure case of finding the optimal learning rate for $X_t$. Indeed, the Kalman filter equations for (1)-(2) are given by (3)-(4) below, where (3) describes the filtered covariance update and (4) the filtered state update. Here $\bar{x}_t$ is the estimated state given the observations $Y_1, \ldots, Y_t$; see West and Harrison (1997).

$$C_{t+1} = \frac{\eta^2}{\langle (C_t + \sigma^2 I) u_{t+1}, u_{t+1} \rangle + \eta^2} \left( C_t + \sigma^2 I \right), \quad (3)$$
$$\bar{x}_{t+1} = \bar{x}_t + \frac{C_{t+1}}{\eta^2} u_{t+1} \cdot \left( Y_{t+1} - \langle \bar{x}_t, u_{t+1} \rangle \right). \quad (4)$$

In particular, following (4), the role of $\sigma$ and $\eta$ may be interpreted as regulating how much the estimate $\bar{x}_{t+1}$ is influenced, via the operator $C_{t+1}/\eta^2$, by the most recent observation and input $Y_{t+1}, u_{t+1}$. Roughly speaking, higher values of $\sigma$ or lower values of $\eta$ imply that past observations are given less weight, and result in an overfit of the forecast to the most recent observation. On the other hand, very low $\sigma$ or high $\eta$ makes the problem similar to standard linear regression, where all observations are given equal weight, and results in a lag of the forecast. See Figure 3 in Supplementary Material Section A for an illustration. Finally, it is worth mentioning that the system (1)-(2) is closely related to the study of online gradient (OG) methods (Zinkevich, 2003; Hazan, 2016). In this field, assuming a quadratic cost, one considers the update

$$\bar{x}_{t+1} = \bar{x}_t + \alpha \cdot u_{t+1} \cdot \left( Y_{t+1} - \langle \bar{x}_t, u_{t+1} \rangle \right), \quad (5)$$

where $\alpha$ is the learning rate, and studies the performance guarantees of the forecaster $\langle \bar{x}_t, u_{t+1} \rangle$. Compared to (4), the update (5) is simpler, and uses a scalar rate $\alpha$ instead of the input-dependent operator rate $C_{t+1}/\eta^2$ of the Kalman filter. However, due to the similarity, every domain of applicability of the OG methods is also a natural candidate for the model (1)-(2), and vice versa. As an illustration, we compare the OG to Kalman filter based methods with learned $\sigma, \eta$ in the experiments section.
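For concreteness, the two adaptive forecasters just discussed can be transcribed in a few lines of numpy. The sketch below follows the updates (3)-(5) as displayed above; the variable names are ours, and the snippet is meant only to make the contrast between the operator-valued rate $C_{t+1}/\eta^2$ and the scalar rate $\alpha$ explicit.

```python
import numpy as np

def kalman_step(xbar, C, u, y, sigma2, eta2):
    """One filtered update of eqs. (3)-(4) for the system (1)-(2)."""
    P = C + sigma2 * np.eye(len(xbar))           # C_t + sigma^2 I
    C_next = eta2 * P / (u @ P @ u + eta2)       # eq. (3)
    gain = (C_next / eta2) @ u                   # input-dependent operator rate
    xbar_next = xbar + gain * (y - xbar @ u)     # eq. (4)
    return xbar_next, C_next

def og_step(xbar, u, y, alpha):
    """Online gradient update, eq. (5), with a scalar learning rate alpha."""
    return xbar + alpha * u * (y - xbar @ u)
```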
In this paper we introduce a new estimation algorithm for $\sigma, \eta$, termed STVE (Spectrum Thresholding Variance Estimator), and prove finite sample complexity bounds for it. In particular, our bounds are an explicit function of the parameters $T$ and $\{u_t\}_{t=1}^T$ for any finite $T$, and indicate that the estimation error decays roughly as $T^{-1/2}$ with high probability. To the best of our knowledge, these are the first bounds of this kind. As we discuss in detail in Section 2, most existing estimation methods for LDSs, such as subspace identification (van Overschee and de Moor, 1996; Qin, 2006) or improper learning (Anava et al., 2013; Hazan et al., 2017; Kozdoba et al., 2019), do not apply to the system (1)-(2), due to non-stationarity. On the other hand, the methods that do apply to (1)-(2) either lack guarantees, or have only an asymptotic analysis which, in addition, relies strongly on Gaussianity of the noises. Moreover, our approach differs significantly from the existing methods.

We show that the structure of equations (1)-(2) is closely related to, and inherits several important properties from, the classical discrete Laplacian operator on the line, leading to new arguments that have not been explored in the literature. In particular, we use this connection to show that an appropriate inversion of the system produces estimators that are concentrated enough so that $\sigma$ and $\eta$ may be recovered. The heart of the paper is the new definition of the estimators, which explicitly exploits the shape of a certain data-dependent operator, and the subsequent concentration analysis. In particular, this approach yields the first known geometric parameters of the data that can be used to bound convergence rates.

The rest of the paper is organized as follows: related work is discussed in Section 2, and Section 3 contains the necessary definitions. In Section 4 we describe in general lines the methods and the main results of this paper. The technical estimates on certain operator spectra, which are critical to the analysis and may be of independent interest, are stated in Section 5. In Section 6 we present experimental results on synthetic and real data. Due to space constraints, while we outline the main arguments in the text, the full proofs are deferred to the Supplementary Material.

2 Literature

We refer to Chui and Chen (2017); Hamilton (1994); Anderson and Moore (1979); Shumway and Stoffer (2011) for a general background on LDSs, the Kalman filter, and maximum likelihood estimation. Existing approaches to the variance estimation problem may be divided into three categories: (i) General methods for parameter identification in LDSs, either via maximum likelihood estimation (MLE) (Hamilton, 1994), or via subspace identification (van Overschee and de Moor, 1996; Qin, 2006). In particular, finite sample bounds for system identification were given in (Campi and Weyer, 2005; Vidyasagar and Karandikar, 2006) and in the recent work of Tsiamis and Pappas (2019). (ii) Methods designed specifically to learn the noise parameters of the system, developed primarily in the control theory community, in particular via the innovation auto-correlation function, such as the classical Mehra (1970); Belanger (1974), or the more recent Wang et al. (2017); Dunik et al. (2018). (iii) Improper learning methods, such as Anava et al. (2013); Hazan et al. (2017); Kozdoba et al. (2019). In these approaches one does not learn the LDS directly, but instead learns a model from a certain auxiliary class, and shows that this auxiliary model produces forecasts that are as good as the forecasts of an LDS with "optimal" parameters. Despite the apparent simplicity of the system (1)-(2), most of the above methods do not apply to this system. This is due to the fact that most of the methods are designed for time-invariant, asymptotically stationary systems, where the observation operator ($u_t$ in our notation) is constant and the Kalman gain (or, equivalently, $C_t u_t$ in eq. (3)) converges with $t$. In particular, this limitation exists in all the system identification results cited above, and is essential to the approaches taken there. However, if the observation vector sequence $u_t$ changes with time (a necessary property for the dynamic regression problem), the system is no longer asymptotically stationary. In particular, for this reason, neither the subspace identification methods nor any of the improper learning approaches above apply to system (1)-(2).
Among the methods that do apply to (1)-(2) are general MLE estimation and some of the auto-correlation methods (Belanger, 1974; Dunik et al., 2018). On one hand, both types of approaches may be applicable to systems a priori more general than (1)-(2). On the other hand, the situation with consistency guarantees for these methods, i.e. the guarantee that one recovers the true parameters given enough observations, is somewhat complicated. Due to the non-convexity of the likelihood function, the MLE method is not guaranteed to find the true maximum, and as a result the whole method has no guarantees. The results in Belanger (1974); Dunik et al. (2018) do have asymptotic consistency guarantees. However, these rely on explicit and implicit assumptions about the system, the sequence $u_t$ in our case, which cannot be easily verified. In particular, Belanger (1974); Dunik et al. (2018) assume uniform observability of the system, which we do not assume, and in addition rely on certain implicit assumptions about the invertibility and condition number of matrices related to the sequence $u_t$. Moreover, even if one assumes that the assumptions hold, the results are purely asymptotic, and for any finite $T$ do not provide a bound on the expected estimation error as a function of $T$ and $\{u_t\}_{t=1}^T$. In addition, as mentioned earlier, MLE methods by definition must assume that the noises are Gaussian (or belong to some other predetermined parametric family), and autocorrelation based methods also strongly use the Gaussianity assumption. Our approach, on the other hand, requires only sub-Gaussian noises with independent coordinates. We note that there are straightforward extensions of our methods to certain cases with dependencies. Indeed, the operator analysis part of this paper does not depend on the distribution of the noises. Therefore, to achieve such an extension, one would only need to correspondingly extend the main probabilistic tool, the Hanson-Wright inequality (Hanson and Wright, 1971; Rudelson et al., 2013; see also Section 4 and Supplementary Material Section E). One such extension, for vectors with the convex concentration property, was recently obtained in Adamczak (2015).

3 Notation

We refer to Bhatia (1997) and Vershynin (2018) as general references for the notation introduced below, for operators and sub-Gaussian variables, respectively. Let $A : \mathbb{R}^n \to \mathbb{R}^m$ be an operator with a singular value decomposition $A = U \cdot \mathrm{Diag}(\lambda_1, \ldots, \lambda_s) \cdot W$, where $s \le \min\{m, n\}$ and $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_s > 0$. Note that singular values are strictly positive by definition (that is, vectors corresponding to the kernel of $A$ do not participate in the decomposition $A = U \cdot \mathrm{Diag}(\lambda_1, \ldots, \lambda_s) \cdot W$). The Hilbert-Schmidt (Frobenius) norm is defined as $\|A\|_{HS} = \sqrt{\sum_{i=1}^s \lambda_i^2}$. The nuclear and operator norms are given by $\|A\|_{nuc} = \sum_{i=1}^s \lambda_i$ and $\|A\|_{op} = \lambda_1$, respectively. A centered ($\mathbb{E}X = 0$) scalar random variable $X$ is sub-Gaussian with constant $\kappa$, denoted $X \sim SG(\kappa)$, if for all $t > 0$ it satisfies $P(|X| > t) \le 2\exp(-t^2/\kappa^2)$. A random vector $X = (X_1, \ldots, X_m)$ is $\kappa$ sub-Gaussian, denoted $X \sim SG_m(\kappa)$, if for every $v \in \mathbb{R}^m$ with $|v| = 1$ the random variable $\langle v, X \rangle$ is $\kappa$ sub-Gaussian. A random vector $X$ is $\sigma$-isotropic if for every $v \in \mathbb{R}^m$ with $|v| = 1$, $\mathbb{E}\langle v, X \rangle^2 = \sigma^2$.
Finally, a random vector $X = (X_1, \ldots, X_m)$ is $\sigma$-isotropically $\kappa$ sub-Gaussian with independent components, denoted $X \sim ISG_m(\sigma, \kappa)$, if the $X_i$ are independent and, for all $i \le m$, $\mathbb{E}X_i = 0$, $\mathbb{E}X_i^2 = \sigma^2$ and $X_i \sim SG(\kappa)$. Clearly, if $X \sim ISG_m(\sigma, \kappa)$ then $X$ is $\sigma$-isotropic. Recall also that $X \sim ISG_m(\sigma, \kappa)$ implies $X \sim SG_m(\kappa)$ (Vershynin, 2018). The noise variables we discuss in this paper are $ISG(\sigma, \kappa)$. Throughout the paper, absolute constants are denoted by $c, c', c'', \ldots$; their values may change from line to line.

4 Overview of the Approach

We begin by rewriting (1)-(2) in vector form. To this end, we first encode sequences of $T$ vectors in $\mathbb{R}^n$, $\{a_t\}_{t \le T} \subset \mathbb{R}^n$, as a vector $a \in \mathbb{R}^{Tn}$, constructed by concatenation of the $a_t$'s. Next, we define the summation operator $S_0 : \mathbb{R}^T \to \mathbb{R}^T$, which acts on any vector $(h_1, h_2, \ldots, h_T) \in \mathbb{R}^T$ by

$$S_0(h_1, h_2, \ldots, h_T) = \Big( h_1,\; h_1 + h_2,\; \ldots,\; \sum_{i \le T-1} h_i,\; \sum_{i \le T} h_i \Big). \quad (6)$$

Note that $S_0$ is an invertible operator. Next, we similarly define the summation operator $S : \mathbb{R}^{Tn} \to \mathbb{R}^{Tn}$, an $n$-dimensional extension of $S_0$, which sums $n$-dimensional vectors. Formally, for $(h_l)_{l=1}^{Tn} \in \mathbb{R}^{Tn}$, and for $1 \le j \le n$, $1 \le t \le T$, $(Sh)_{(t-1) \cdot n + j} = \sum_{i \le t} h_{(i-1) \cdot n + j}$. Observe that if the sequence of process noise terms $h_1, \ldots, h_T \in \mathbb{R}^n$ is viewed as a vector $h \in \mathbb{R}^{Tn}$, then by definition $Sh$ is the $\mathbb{R}^{Tn}$ encoding of the sequence $X_t$. Next, given a sequence of observation vectors $u_1, \ldots, u_T \in \mathbb{R}^n$, we define the observation operator $O_u : \mathbb{R}^{Tn} \to \mathbb{R}^T$ by $(O_u x)_t = \langle u_t, (x_{(t-1) \cdot n + 1}, \ldots, x_{(t-1) \cdot n + n}) \rangle$. In words, coordinate $t$ of $O_u x$ is the inner product between $u_t$ and the $t$-th part of the vector $x \in \mathbb{R}^{Tn}$. Define also $Y = (Y_1, \ldots, Y_T) \in \mathbb{R}^T$ to be the concatenation of $Y_1, \ldots, Y_T$. With this notation, one may equivalently rewrite the system (1)-(2) as follows:

$$Y = O_u S h + z, \quad (7)$$

where $h$ and $z$ are independent zero-mean random vectors in $\mathbb{R}^{Tn}$ and $\mathbb{R}^T$ respectively, with independent sub-Gaussian coordinates. The variance of each coordinate of $h$ is $\sigma^2$, and each coordinate of $z$ has variance $\eta^2$. Up to now, we have reformulated our data model as a single vector equation. Note that in that equation, the observations $Y$ and both operators $O_u$ and $S$ are known to us. Our problem may now be reformulated as follows: given $Y \in \mathbb{R}^T$, assuming $Y$ was generated by (7), provide estimates of $\sigma, \eta$.

As a motivation, we first consider taking the expectation of the squared norm of eq. (7). For any operator $A : \mathbb{R}^m \to \mathbb{R}^m$ and zero-mean vector $h$ with independent coordinates and coordinate variance $\sigma^2$, we have $\mathbb{E}|Ah|^2 = \|A\|_{HS}^2 \sigma^2$, where $\|A\|_{HS}$ is the Hilbert-Schmidt (or Frobenius) norm of $A$. Taking the norm and expectation of (7), and dividing by $T^2$, we thus obtain

$$\frac{\mathbb{E}|Y|^2}{T^2} = \frac{\|O_u S\|_{HS}^2}{T^2} \sigma^2 + \frac{T}{T^2} \eta^2. \quad (8)$$

Next, note that $\|O_u S\|_{HS}^2$ is known, and an elementary computation shows that $\|O_u S\|_{HS}^2 / T^2$ is of constant order (as a function of $T$; see (25)), while the coefficient of $\eta^2$ is $1/T$. Thus, if the quantity $|Y|^2/T^2$ were close enough to its expectation with high probability, we could take this quantity as a (slightly biased) estimator of $\sigma^2$. However, as will become apparent later, the deviations of $|Y|^2/T^2$ around the expectation are also of constant order, and thus $|Y|^2/T^2$ cannot be used as an estimator.
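For moderate $T$, the operators above are small enough to build explicitly as matrices. The sketch below (our illustration, not the authors' code) constructs $S$ and $O_u$ and checks the identity $\|O_u S\|_{HS}^2 = \sum_{t \le T} t |u_t|^2$ used above (and proved as eq. (38) in the Supplementary Material).

```python
import numpy as np

def build_operators(U):
    """Build S : R^{Tn} -> R^{Tn} (blockwise prefix sums) and O_u : R^{Tn} -> R^T."""
    T, n = U.shape
    S0 = np.tril(np.ones((T, T)))              # scalar summation operator S_0
    S = np.kron(S0, np.eye(n))                 # n-dimensional extension S
    Ou = np.zeros((T, T * n))
    for t in range(T):
        Ou[t, t * n:(t + 1) * n] = U[t]        # (O_u x)_t = <u_t, t-th block of x>
    return S, Ou

rng = np.random.default_rng(0)
T, n = 50, 3
U = rng.normal(size=(T, n))
S, Ou = build_operators(U)
hs2 = np.sum((Ou @ S) ** 2)                    # ||O_u S||_HS^2
assert np.isclose(hs2, sum((t + 1) * (U[t] @ U[t]) for t in range(T)))  # eq. (38), 1-based t
```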
The reason for these high deviations of $|Y|^2/T^2$ is that the spectrum of $O_u S$ is extremely peaked. The highest squared singular value of $O_u S$ is of order $T^2$, the same order as the sum of all of them, $\|O_u S\|_{HS}^2$. Contrast this with the case of the identity operator $Id : \mathbb{R}^{Tn} \to \mathbb{R}^{Tn}$: we have $\mathbb{E}|Id(h)|^2 = \mathbb{E}|h|^2 = Tn\sigma^2$, and one can also easily show that, for instance, $\mathrm{Var}|Id(h)|^2 = Tn\sigma^2$, and thus the deviations are of order $\sqrt{Tn}\,\sigma$, a smaller order than $\mathbb{E}|Id(h)|^2$. While for the identity operator the computation is elementary, for a general operator $A$ the situation is significantly more involved, and the bounds on the deviations of $|Y|^2$ will be obtained from the Hanson-Wright inequality (Hanson and Wright, 1971; see also Rudelson et al. (2013)), combined with standard norm deviation bounds for isotropic sub-Gaussian vectors.

With these observations in mind, we proceed to flatten the spectrum of $O_u S$ by taking the pseudo-inverse. Let $R : \mathbb{R}^T \to \mathbb{R}^{Tn}$ be the pseudo-inverse, or Moore-Penrose inverse, of $O_u S$. Specifically, let

$$O_u S = U \circ \mathrm{Diag}(\gamma_1, \ldots, \gamma_T) \circ W \quad (9)$$

be the singular value decomposition of $O_u S$, where $\gamma_1 \ge \gamma_2 \ge \ldots \ge \gamma_T$ are the singular values. For the rest of the paper we assume that all the observation vectors $u_t$ are non-zero. This assumption is made solely for notational simplicity and may easily be avoided, as discussed later in this section. Under this assumption, since $S$ is invertible and $O_u$ has rank $T$, we have $\gamma_t > 0$ for all $t \le T$. For $i \le T$, denote $\chi_i = \gamma_{T+1-i}^{-1}$. Then the $\chi_i$ are the singular values of $R$, arranged in non-increasing order, and we have by definition

$$R = W^* \circ \mathrm{Diag}(\chi_T, \chi_{T-1}, \ldots, \chi_2, \chi_1) \circ U^*, \quad (10)$$

where $W^*, U^*$ denote the transposes of the matrices $W, U$ defined in (9). Similarly to eq. (8), we apply $R$ to (7), and since $\|R O_u S\|_{HS}^2 = T$, by taking the expectation of the squared norm we obtain

$$\frac{|RY|^2}{T} = \sigma^2 + \frac{\|R\|_{HS}^2}{T} \eta^2 + \left( \frac{|RY|^2}{T} - \frac{\mathbb{E}|RY|^2}{T} \right). \quad (11)$$

In this equation, the deviation term $\frac{|RY|^2}{T} - \frac{\mathbb{E}|RY|^2}{T}$ is of order $O(1/\sqrt{T})$ with high probability (Theorem 1). Moreover, the coefficient of $\sigma^2$ is $1$, and the coefficient of $\eta^2$, which is $\|R\|_{HS}^2/T$, is of order at least $\Omega(1/\log^2 T)$ (Theorem 3; see Section 5 for additional details), a much larger order than $1/\sqrt{T}$. Since $|RY|^2$ and $\|R\|_{HS}^2$ are known, it follows that we have obtained one equation satisfied by $\sigma^2$ and $\eta^2$ up to an error of $1/\sqrt{T}$, where both coefficients are of order larger than the error.

Next, we would like to obtain another linear relation between $\sigma^2, \eta^2$. To this end, choose some $p = \alpha T$, where $0 < \alpha < 1$ is of constant order. The possible choices of $p$ are discussed later in this section. We define an operator $R_0 : \mathbb{R}^T \to \mathbb{R}^{Tn}$ to be a version of $R$ truncated to the first $p$ singular values. If (10) is the SVD of $R$, then

$$R_0 = W^* \circ \mathrm{Diag}(0, 0, \ldots, 0, \chi_p, \chi_{p-1}, \ldots, \chi_1) \circ U^*.$$

Algorithm 1: Spectrum Thresholding Variance Estimator (STVE)
1: Input: observations $Y_t$, observation vectors $u_t$, with $t \le T$, and $p = \alpha T$.
2: Compute the SVD of $O_u S$: $O_u S = U \circ \mathrm{Diag}(\gamma_1, \ldots, \gamma_T) \circ W$, where $\gamma_1 \ge \gamma_2 \ge \ldots \ge \gamma_T > 0$. Denote $\chi_i = \gamma_{T+1-i}^{-1}$ for $1 \le i \le T$.
3: Construct the operators $R = W^* \circ \mathrm{Diag}(\chi_T, \ldots, \chi_1) \circ U^*$ and $R_0 = W^* \circ \mathrm{Diag}(0, 0, \ldots, 0, \chi_p, \ldots, \chi_1) \circ U^*$.
4: Produce the estimates:

$$\widehat{\eta}^2 = \left( \frac{|R_0 Y|^2}{p} - \frac{|RY|^2}{T} \right) \Big/ \left( \frac{\|R_0\|_{HS}^2}{p} - \frac{\|R\|_{HS}^2}{T} \right), \qquad \widehat{\sigma}^2 = \frac{|RY|^2}{T} - \frac{\|R\|_{HS}^2}{T} \widehat{\eta}^2.$$
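Algorithm 1 is straightforward to implement once the SVD of $O_u S$ is available, since all the norms in step 4 can be read off the singular values. The following sketch is ours, not the authors' code; it reuses `build_operators` from the earlier example, and works with the coordinates of $RY$ in the basis $W^*$, whose Euclidean norm equals $|RY|$.

```python
import numpy as np

def stve_core(A, Y, p):
    """Solve eqs. (11)-(12) for (sigma^2, eta^2), given A = O_u S and observations Y."""
    T = A.shape[0]
    Uo, gamma, _ = np.linalg.svd(A, full_matrices=False)  # gamma_1 >= ... >= gamma_T > 0
    chi2 = 1.0 / gamma[::-1] ** 2              # chi_i^2 = gamma_{T+1-i}^{-2}, non-increasing
    coords = (Uo.T @ Y) / gamma                # coordinates of R Y; same Euclidean norm
    RY2 = np.sum(coords ** 2)                  # |R Y|^2
    R0Y2 = np.sum(coords[T - p:] ** 2)         # |R_0 Y|^2: keep the p largest chi_i
    a = np.sum(chi2) / T                       # ||R||_HS^2 / T
    b = np.sum(chi2[:p]) / p                   # ||R_0||_HS^2 / p
    eta2 = (R0Y2 / p - RY2 / T) / (b - a)      # step 4, first estimate
    sigma2 = RY2 / T - a * eta2                # step 4, second estimate
    return sigma2, eta2

def stve(Y, U, alpha=0.25):
    """STVE (Algorithm 1) with p = alpha * T, for fully observed data."""
    S, Ou = build_operators(U)
    return stve_core(Ou @ S, Y, max(1, int(alpha * len(Y))))
```

On data generated as in the simulation sketch of Section 1, `stve(Y, U)` should recover $\sigma^2, \eta^2$ up to an error that shrinks with $T$, in line with Corollary 2 below.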
Similarly to the case of $R$, we have

$$\frac{|R_0 Y|^2}{p} = \sigma^2 + \frac{\|R_0\|_{HS}^2}{p} \eta^2 + \left( \frac{|R_0 Y|^2}{p} - \frac{\mathbb{E}|R_0 Y|^2}{p} \right). \quad (12)$$

The deviations in (12) are also described by Theorem 1. Note also that since $\|R_0\|_{HS}^2$ is the sum of the $p$ largest squared singular values of $R$, by definition it follows that $\|R_0\|_{HS}^2/p \ge \|R\|_{HS}^2/T$. Now, given two equations in two unknowns, we can solve the system to obtain the estimates $\widehat{\sigma}^2$ and $\widehat{\eta}^2$. The full procedure is summarized in Algorithm 1, and the bounds implied by Theorem 1 on the estimators $\widehat{\sigma}^2$ and $\widehat{\eta}^2$ are given in Corollary 2. We first state these results, and then discuss in detail the various parameters appearing in the bounds.

Theorem 1. Consider a random vector $Y \in \mathbb{R}^T$ of the form $Y = O_u S h + z$, where $h \sim ISG_{Tn}(\sigma, \kappa)$ and $z \sim ISG_T(\eta, \kappa)$. Set $|u_{min}| = \min_t |u_t|$. Then for any $0 < \delta < 1$,

$$P\left( \left| \frac{|RY|^2}{T} - \left( \sigma^2 + \frac{\|R\|_{HS}^2}{T} \eta^2 \right) \right| \ge \frac{cB}{\sqrt{T}} \right) \le 4\delta, \quad (13)$$
$$P\left( \left| \frac{|R_0 Y|^2}{p} - \left( \sigma^2 + \frac{\|R_0\|_{HS}^2}{p} \eta^2 \right) \right| \ge \frac{cB}{\sqrt{p}} \right) \le 4\delta, \quad (14)$$

where $B$ is given by

$$B = \left(1 + \kappa^2\right)\left(1 + |u_{min}|^{-2}\right) \log \frac{1}{\delta}. \quad (15)$$

The bounds on the estimators of Algorithm 1 are given in the following corollary. As discussed below, in addition to Theorem 1, the key to the derivation of this corollary is the estimates of the spectrum of $R$ given in Theorem 3, Section 5.

Corollary 2. Let $\widehat{\sigma}^2, \widehat{\eta}^2$ be the estimators of $\sigma^2, \eta^2$ obtained from Algorithm 1 with $p \ge \frac{1}{4}T$. Set $|u_{max}| = \max_t |u_t|$. Then for any $0 < \delta < 1$, with probability at least $1 - 8\delta$,

$$\left| \widehat{\sigma}^2 - \sigma^2 \right| \le \frac{cB}{\sqrt{T}} \left( 1 - \frac{p \|R\|_{HS}^2}{T \|R_0\|_{HS}^2} \right)^{-1}, \quad (16)$$
$$\left| \widehat{\eta}^2 - \eta^2 \right| \le \frac{cB}{\sqrt{T}} \left( 1 - \frac{p \|R\|_{HS}^2}{T \|R_0\|_{HS}^2} \right)^{-1} n^2 |u_{max}|^2 \log^2 T, \quad (17)$$

with $B$ given by (15).

[Figure 1: Spectra of R. (a) The spectrum ($\chi_i^2$) of $R$ for the electricity data (blue); $\|R_0(p)\|_{HS}^2/p$ as a function of $p$ (orange); the mean $\|R\|_{HS}^2/T$ (green). (b) A similar figure for synthetic data, $T = 500$.]

We first discuss the assumption $|u_{min}| > 0$. This assumption is made solely for notational convenience, as detailed below. To begin, note that some form of lower bound on the norms of the observation vectors $u_t$ must appear in the bounds. This is simply because if one had $u_t = 0$ for all $t$, then clearly no estimate of $\sigma$ would have been possible. On the other hand, our use of the smallest value $|u_{min}|$ may seem restrictive at first. We note, however, that instead of considering the observation operator $O_u : \mathbb{R}^{Tn} \to \mathbb{R}^T$, one may consider the operator $O_{\bar{u}} : \mathbb{R}^{\bar{T}n} \to \mathbb{R}^{\bar{T}}$ for any subsequence $\{\bar{u}_{\bar{t}}\}_{\bar{t}=1}^{\bar{T}}$. The observation vector $Y$ would be correspondingly restricted to the subsequence of indices. This allows us to treat missing values and to exclude any outlier $u_t$ with small norm. All the arguments in Theorems 1 and 3 hold for this modified $O_{\bar{u}}$ without change. The only price paid is that $T$ is replaced by $\bar{T}$ in the bounds. Moreover, we note that typically we have $|u_t| \ge 1$ by construction; see for instance Section 6.2. Additional discussion of missing values may be found in Supplementary Material Section H.

Next, up to this point, we have obtained two equations, (11)-(12), in two unknowns, $\sigma^2, \eta^2$.
Note that in order to be able to obtain $\eta^2$ from these equations, at least one of the coefficients of $\eta^2$, either $\|R\|_{HS}^2/T$ or $\|R_0\|_{HS}^2/p$, must be of larger order than $1/\sqrt{T}$, the order of the deviations. Providing lower bounds on these quantities is one of the main technical contributions of this work. This analysis uses the connection between the operator $S$ and the Laplacian on the line, and resolves the issue of translating spectrum estimates for the Laplacian into spectral estimates for $R$. We note that there are no standard tools to study the spectrum of $R$, and our approach proceeds indirectly via an analysis of the nuclear norm of $O_u S$. These results are stated in Theorem 3. In particular, we show that $\|R\|_{HS}^2/T$ is $\Omega(1/\log^2 T)$, which is the source of the log factor in (17).

Finally, in order to solve the equations (11)-(12), not only must the equations have large enough coefficients, but the equations must be different. This is reflected by the term $\left(1 - \frac{p\|R\|_{HS}^2}{T\|R_0\|_{HS}^2}\right)^{-1}$ in (16), (17). Equivalently, while $\frac{\|R_0\|_{HS}^2/p}{\|R\|_{HS}^2/T} \ge 1$ by definition, we would like to have

$$\frac{\|R_0\|_{HS}^2/p}{\|R\|_{HS}^2/T} \ge 1 + const \quad (18)$$

for the bounds (16), (17) to be stable. Note that since both $\|R_0\|_{HS}^2$ and $\|R\|_{HS}^2$ are computed in Algorithm 1, the condition (18) can simply be verified before the estimators $\widehat{\sigma}^2, \widehat{\eta}^2$ are returned. It is worth emphasizing that for simple choices of $p$, say $p = \frac{1}{4}T$, the condition (18) does hold in practice. Note that, for any $p$, we can have $\frac{\|R_0\|_{HS}^2/p}{\|R\|_{HS}^2/T} = 1$ only if the spectrum of $R$ is constant. Thus (18) amounts to stating that the spectrum of $R$ exhibits some decay. As we show in the experiments below, the squared spectrum of $R$, for $u_t$ derived from daily temperature features or for random Gaussian $u_t$, indeed decays; see Section 6, Figures 1a and 1b. In particular, in both cases (18) holds with $const > 1$. Additional bounds on the quantity $\left(1 - \frac{p\|R\|_{HS}^2}{T\|R_0\|_{HS}^2}\right)^{-1}$ under various assumptions on the sequence $u_t$ are given in Section G of the Supplementary Material.

5 Properties of $O_u S$ and $R$

As discussed in Section 4 (see the discussion following eq. (11)), one of the crucial points enabling Algorithm 1 and its analysis is the fact that the quantity $\|R\|_{HS}^2/T$ is bounded below by an expression that is of much higher order than the noise magnitude $1/\sqrt{T}$. In this section we provide the formal statement of this and other associated results, and discuss the related arguments. First, we obtain the following bound on the spectrum of $O_u S$ (Lemma 4, Supplementary Material Section B). Recall that the nuclear norm was defined in Section 3, and that for a sequence $u_t$ we set $|u_{max}| = \max_t |u_t|$ and $|u_{min}| = \min_t |u_t|$. Then:

$$\|O_u S\|_{nuc} = \sum_{t \le T} \lambda_t(O_u S) \le 4n |u_{max}| T \log T. \quad (19)$$

The proof of this bound exploits the connection between $S$ and the Laplacian on the line. In particular, we use the fact that the eigenvalues of the Laplacian are known precisely, satisfying $\lambda_l = 2\sin\left(\frac{\pi(T-l)}{2T}\right)$. Next, we state the lower (and upper) bounds for $R$.

Theorem 3. Let $R : \mathbb{R}^T \to \mathbb{R}^{Tn}$ be the pseudo-inverse of $O_u S$. Then

$$\frac{c_1}{n |u_{max}| \log T} \le \|R\|_{op} \le 2 |u_{min}|^{-1}, \quad (20)$$
$$\frac{c_1}{n^2 |u_{max}|^2 \log^2 T}\, T \le \|R\|_{HS}^2 \le 4 |u_{min}|^{-2}\, T, \quad (21)$$
$$\frac{c_1}{n^4 |u_{max}|^4 \log^4 T}\, T \le \|R^* R\|_{HS}^2 \le 16 |u_{min}|^{-4}\, T. \quad (22)$$
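Both sides of (19) and (21) are directly computable for a given input sequence, so the statements can be sanity-checked numerically. A small sketch of such a check (ours, reusing `build_operators` from Section 4's example; the printed bounds hold up to the absolute constants):

```python
import numpy as np

# Empirical sanity check of (19) and (21) for random Gaussian inputs.
rng = np.random.default_rng(1)
T, n = 400, 3
U = rng.normal(size=(T, n))
S, Ou = build_operators(U)
gamma = np.linalg.svd(Ou @ S, compute_uv=False)
u_max = np.linalg.norm(U, axis=1).max()
u_min = np.linalg.norm(U, axis=1).min()
print("nuclear norm:", gamma.sum(), "<= 4 n |u_max| T log T =", 4 * n * u_max * T * np.log(T))
R_hs2 = np.sum(1.0 / gamma ** 2)               # ||R||_HS^2 = sum_i gamma_i^{-2}
print("||R||_HS^2 =", R_hs2, "<= 4 |u_min|^{-2} T =", 4 * T / u_min ** 2)
```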
Due to the complicated structure of $R$ as a pseudo-inverse of a composition of operators, there is no direct way to control the individual eigenvalues of $R$. Thus the main technical issue resolved in Theorem 3 is nevertheless obtaining lower bounds on $\|R\|_{HS}^2$. Our approach is rather indirect, and we obtain these bounds from the nuclear norm bound (19) via a Markov-type inequality on the eigenvalues.

6 Experiments

6.1 Synthetic Data

In this section the performance of STVE is evaluated on synthetic data. The data was generated by the LDS (1)-(2), using Gaussian noises with $\sigma^2 = 0.5$, $\eta^2 = 2$. The input dimension was $n = 5$, and the input sequence $u_t$ was sampled from the Gaussian $N(0, I_n)$. We ran the STVE algorithm for different values of $T$, where for each $T$ we sampled the data 150 times. Figure 2a shows the average (over 150 runs) estimation error for both the process and observation noise variances for various values of $T$. As expected from the bounds in Corollary 2, it may be observed in Figure 2a that the estimation errors decay roughly at the rate $const/\sqrt{T}$. A typical spectrum of $R$ is shown in Figure 1b. For larger $T$, the spectra exhibit a similar decay.

6.2 Temperatures and Electricity Consumption

In this section we examine the relation between daily temperatures and electricity consumption in the data from Hong et al. (2014) (see also Hong (2016)). The following forecasting methods are compared: a stationary regression, an online gradient, and a Kalman filter for dynamic regression, with parameters learned via MLE or STVE. We find that the Kalman filter methods provide the best performance, with no significant difference between the STVE and MLE derived systems.

[Figure 2: Evaluation. (a) STVE variance estimation errors vs. $T$, and the $T^{-1/2}$ decay. (b) Load (y-axis) against temperature (x-axis), both axes normalized: the full data (blue points), a regression learned on the first half (orange), and a regression learned on the second half (green). (c) Smoothed prediction errors: stationary regression trained on the first half (blue), MLE (orange), STVE (green), OG (red).]

The data consists of the total daily electricity consumption (load) $y_t$, and the average daily temperature $v_t$, for a certain region, for the period January 2004 to June 2008. Full details on the preprocessing of the data, as well as additional details on the experiments, are given in Supplementary Material Section I. Here we note that the data contains missing load values for 9 non-consecutive weeks (out of about 234 weeks total). All methods discussed here, including STVE, can naturally incorporate missing values, as discussed in Supplementary Material Section H. An elementary inspection of the data reveals that the load may be reasonably approximated as a quadratic function of the temperature, $y_t = x_{t,1} \cdot 1 + x_{t,2} \cdot v_t + x_{t,3} \cdot v_t^2$, where $u_t = (1, v_t, v_t^2)$ is the observation vector (features) and $x_t = (x_{t,1}, x_{t,2}, x_{t,3})$ is the possibly time-varying regression vector.
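As an illustration of this feature construction (our sketch; the function and variable names are not from the paper), the observation vectors can be formed from the daily temperatures as follows. Note that the constant first coordinate already guarantees $|u_{min}| \ge 1$, as used in the discussion after Corollary 2.

```python
import numpy as np

def temperature_features(v, v_train_mean, v_train_std):
    """Observation vectors u_t = (1, v_t, v_t^2), temperatures normalized on the train set."""
    v = (v - v_train_mean) / v_train_std
    return np.column_stack([np.ones_like(v), v, v ** 2])

# e.g., with daily temperatures v and loads y:  sigma2, eta2 = stve(y, temperature_features(...))
```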
This is shown in Figure 2b, where we fit a stationary (time-invariant) regression of the above form using either only the first or only the second half of the data. We note that these regressions differ: the regression vector changes with time. It is therefore of interest to track it via online regression. We use the first half of the data (the train set) to learn the parameters $\sigma, \eta$ of the online regression (1)-(2) via MLE optimization and using STVE. We also use the train set to find the optimal learning rate $\alpha$ for the OG forecaster described by the update equation (5). This learning rate is chosen as the rate that yields the smallest least squares forecast error on the train set. In addition, we learn a time-independent, stationary regression on the first half of the data. We then employ the learned parameters to make predictions of the load given the temperature, by all four methods. The predictions for the system (1)-(2) are made with a Kalman filter (at time $t$, we use the filtered state estimate $\tilde{x}_t$, which depends only on $y_1, \ldots, y_t$ and $u_1, \ldots, u_t$, and make the prediction $\tilde{y}_{t+1} = \langle \tilde{x}_t, u_{t+1} \rangle$). Daily squared prediction errors (that is, $(y_t - \tilde{y}_t)^2$) are shown in Figure 2c (smoothed with a moving average of 50 days). We see that the adaptive models (MLE, STVE, OG) outperform the stationary regression already on the train set (the first half of the data), and that the difference in performance becomes dramatic on the second half (the test set). It is also interesting to note that the performance of the Kalman filter based methods (MLE, STVE) is practically identical, but both are somewhat better than the simpler OG approach. We also note that by construction we have $|u_{min}| \ge 1$ in this experiment, due to the constant coordinate $1$, and also $|u_{max}| \le 5$, due to the normalization. Since these operations are typical for any regression problem, we conclude that the direct influence of $|u_{min}|$ and $|u_{max}|$ on the bounds in Theorem 1 and Corollary 2 will not usually be significant.

7 Conclusion and Future Work

In this work we introduced the STVE algorithm for estimating the variance parameters of LDSs of type (1)-(2), and obtained the first sample complexity guarantees for such estimators. We have also shown how the shape of the spectrum of $R$ can be exploited to obtain the estimators and the related bounds, thus providing the first explicit geometric parameter of the data that affects the bounds. As discussed in Section 1 and demonstrated in Section 6, the system (1)-(2) is of independent interest in applications. However, we also believe that the analysis presented here is an important first step towards a finite time, data-dependent, quantitative understanding of general LDSs, and perhaps even non-linear dynamical systems.

Acknowledgments and Disclosure of Funding

This research was supported by the ISRAEL SCIENCE FOUNDATION (grant No. 2199/20).

References

Adamczak, R. (2015). A note on the Hanson-Wright inequality for random vectors with dependencies. Electronic Communications in Probability, 20.
Anava, O., Hazan, E., Mannor, S., and Shamir, O. (2013). Online learning for time series prediction. In COLT 2013 - The 26th Annual Conference on Learning Theory, June 12-14, 2013, Princeton University, NJ, USA.
Anderson, B. and Moore, J. (1979). Optimal Filtering. Prentice Hall.
Belanger, P. R. (1974). Estimation of noise covariance matrices for a linear time-varying stochastic process. Automatica, 10(3).
Bhatia, R. (1997). Matrix Analysis. Graduate Texts in Mathematics. Springer New York.
Campi, M. C. and Weyer, E. (2005). Guaranteed non-asymptotic confidence regions in system identification. Automatica, 41(10):1751–1764.
Chui, C. and Chen, G. (2017). Kalman Filtering: with Real-Time Applications. Springer International Publishing.
Dunik, J., Kost, O., and Straka, O. (2018). Design of measurement difference autocovariance method for estimation of process and measurement noise covariances. Automatica, 90.
Gohberg, I. and Krein, M. (1969). Introduction to the Theory of Linear Nonselfadjoint Operators. Translations of Mathematical Monographs. American Mathematical Society.
Hamilton, J. (1994). Time Series Analysis. Princeton University Press.
Hanson, D. L. and Wright, F. T. (1971). A bound on tail probabilities for quadratic forms in independent random variables. Ann. Math. Statist., 42.
Hazan, E. (2016). Introduction to online convex optimization. Foundations and Trends in Optimization.
Hazan, E., Singh, K., and Zhang, C. (2017). Online learning of linear dynamical systems. In Advances in Neural Information Processing Systems, pages 6686–6696.
Hong, T. (2016). Tao Hong's blog. http://blog.drhongtao.com/2016/07/gefcom2012-load-forecasting-data.html. Accessed: 1/8/2019.
Hong, T., Pinson, P., and Fan, S. (2014). Global energy forecasting competition 2012. International Journal of Forecasting, 30:357–363.
Kozdoba, M., Marecek, J., Tchrakian, T. T., and Mannor, S. (2019). On-line learning of linear dynamical systems: Exponential forgetting in Kalman filters. AAAI.
Mehra, R. (1970). On the identification of variances and adaptive Kalman filtering. IEEE Transactions on Automatic Control, 15(2).
Mitchell, A. and Griffiths, D. (1980). The Finite Difference Method in Partial Differential Equations. Wiley-Interscience Publication. Wiley.
Petris, G. (2010). An R package for dynamic linear models. Journal of Statistical Software.
Qin, S. J. (2006). An overview of subspace identification. Computers & Chemical Engineering, 30(10-12):1502–1513.
Rudelson, M., Vershynin, R., et al. (2013). Hanson-Wright inequality and sub-Gaussian concentration. Electronic Communications in Probability, 18.
Shumway, R. and Stoffer, D. (2011). Time Series Analysis and Its Applications (3rd ed.).
Tsiamis, A. and Pappas, G. J. (2019). Finite sample analysis of stochastic system identification. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 3648–3654. IEEE.
van Overschee, P. and de Moor, L. (1996). Subspace Identification for Linear Systems: Theory, Implementation, Applications. Kluwer Academic Publishers.
Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
Vidyasagar, M. and Karandikar, R. L. (2006). A learning theory approach to system identification and stochastic adaptive control. Probabilistic and Randomized Methods for Design under Uncertainty, pages 265–302.
Wang, H., Deng, Z., Feng, B., Ma, H., and Xia, Y. (2017). An adaptive Kalman filter estimating process noise covariance. Neurocomputing, 223.
West, M. and Harrison, J. (1997). Bayesian Forecasting and Dynamic Models (2nd ed.). Springer-Verlag.
Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. ICML.

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes]
(c) Did you discuss any potential negative societal impacts of your work? [N/A] The main focus of this work is a theoretical analysis.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes]
(b) Did you include complete proofs of all theoretical results? [Yes] The Supplementary Material contains all the proofs. Proof outlines are given in the main body of the paper.
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] The data is publicly available, see the references in Section 6. The short code is not provided at the moment, but can be fully derived from Algorithm 1.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Sections 6 and I.
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes]
(b) Did you mention the license of the assets? [N/A] The link to the data author's documentation is provided.
(c) Did you include any new assets either in the supplemental material or as a URL? [No]
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]

[Figure 3: Lag and overfitting to the most recent observation, for various variance values. See the discussion in Sections 1 and A.]

Finite Sample Analysis Of Dynamic Regression Parameter Learning - Supplementary Material

A Outline

This Supplementary Material is organized as follows: the proofs of the results of Section 5, including Theorem 3, as well as the proofs of Theorem 1 and Corollary 2, are given in Sections C to F. In Section G we state and prove additional bounds on the quantity $\left(1 - \frac{p\|R\|_{HS}^2}{T\|R_0\|_{HS}^2}\right)^{-1}$ that appears in Corollary 2. Section H contains the discussion of missing values in STVE. Additional details on the electricity consumption experiment are given in Section I. Finally, we describe Figure 3.
The data (observations, red in Figure 3) was generated from the system (1)-(2) with a one-dimensional state ($n = 1$), and we set $u_t = 1$. The states produced by the Kalman filter with the ground truth values of $\sigma, \eta$ are shown in green, while the states obtained from other choices of the parameters are shown in black and blue.

B Bounds on $O_u S$, Lemma 4

In this section we prove the following bounds on the spectrum of $O_u S$. See Section 5 for a discussion of these bounds.

Lemma 4. The singular values of $O_u S$ satisfy the following:

$$\lambda_1(O_u S) \le |u_{max}| \cdot T \quad \text{and} \quad \lambda_T(O_u S) \ge \frac{1}{2} |u_{min}|, \quad (23)$$
$$\|O_u S\|_{nuc} = \sum_{t \le T} \lambda_t(O_u S) \le 4n |u_{max}| T \log T, \quad (24)$$
$$\frac{1}{4} |u_{min}|^2 T^2 \le \|O_u S\|_{HS}^2 = \sum_{t \le T} t |u_t|^2 \le \frac{1}{2} |u_{max}|^2 T^2. \quad (25)$$

The proof of this lemma uses the following auxiliary result. Let $D_T : \mathbb{R}^T \to \mathbb{R}^{T-1}$ be the difference operator, $(D_T x)_t = x_{t+1} - x_t$ for $t \le T - 1$. In the field of finite difference methods, the operator $D_T D_T^*$ is known as the Laplacian on the line, or as the discrete derivative with Dirichlet boundary conditions, and is well studied. The eigenvalues of $D_T D_T^*$ may be derived by a direct computation, and correspond to the roots of the Chebyshev polynomial of the second kind of order $T$. In particular, the following holds:

Lemma 5. The operator $D_T$ has a kernel of dimension 1 and singular values

$$\lambda_l(D_T) = 2\sin\left(\frac{\pi(T-l)}{2T}\right), \quad l = 1, \ldots, T-1.$$

We refer to Mitchell and Griffiths (1980) for the proof of Lemma 5. Next, in Lemma 6, we show that the inverse of the operator $S_0$, defined in (6), is a one-dimensional perturbation of $D_T$, which implies bounds on the singular values of $S_0^{-1}$.

Lemma 6. The singular values of $S_{0,T}^{-1}$ satisfy

$$2\sin\left(\frac{\pi(T-t)}{2(T+1)}\right) \le \lambda_t(S_{0,T}^{-1}) \le 2\sin\left(\frac{\pi(T+1-t)}{2(T+1)}\right) \quad (26)$$

for $1 \le t \le T-1$, and

$$\frac{1}{T} \le \lambda_T(S_{0,T}^{-1}) \le 2\sin\left(\frac{\pi}{2(T+1)}\right). \quad (27)$$

The proof of this is given in the next section. With the estimates of Lemma 6, the bounds in Lemma 4 follow.

Proof of Lemma 4.

Proof. By Lemma 6,

$$\|S\|_{op} \le T \quad \text{and} \quad |S_{0,T}\, x| \ge \frac{1}{2}|x| \text{ for all } x \in \mathbb{R}^T. \quad (28)$$

Note also that $S$ is by definition a collection of $n$ independent copies of $S_{0,T}$, and therefore the spectrum of $S$ is that of $S_{0,T}$, but with each singular value taken with multiplicity $n$. In particular, it follows that (28) holds also for $S$ itself. Since clearly $\|O_u\|_{op} \le |u_{max}|$, the upper bound on $\|O_u S\|_{op}$ in (23) follows from (28). For the lower bound, denote by $V_0$ the orthogonal complement of the kernel of $O_u$, $V_0 = (\mathrm{Ker}(O_u))^\perp$. Denote by $P_{V_0} : \mathbb{R}^{Tn} \to V_0$ the orthogonal projection onto $V_0$. We have in particular that $O_u S = O_u P_{V_0} S$. Next, the operator $S$ maps the unit ball $B_{Tn}$ of $\mathbb{R}^{Tn}$ onto an ellipsoid $E_T$, and by (28) we have $\frac{1}{2} B_{Tn} \subset E_T$. It therefore follows that

$$\frac{1}{2} B_{V_0} \subset (P_{V_0} S)(B_{Tn}), \quad (29)$$

where $B_{V_0}$ is the unit ball of $V_0$. It remains to observe that for every $x \in V_0$ we have

$$|O_u x| \ge |u_{min}|\, |x|. \quad (30)$$

Combining (29) and (30), we obtain the lower bound in (23). To derive (24), recall that the nuclear norm is sub-multiplicative with respect to the operator norm:

$$\|O_u S\|_{nuc} \le \|O_u\|_{op} \cdot \|S\|_{nuc} \le |u_{max}| \cdot \|S\|_{nuc}. \quad (31)$$

This follows, for instance, from the characterization of the nuclear norm as the trace dual of the operator norm (Bhatia, 1997, Propositions IV.2.11, IV.2.12).
Next, since the spectrum of $S$ is the spectrum of $S_0$ taken with multiplicity $n$, we have

$$\|S\|_{nuc} = n \|S_0\|_{nuc}, \quad (32)$$

and it remains to bound the nuclear norm of $S_0$. Using the inequality

$$\sin\left(\frac{\pi}{2}\alpha\right) \ge \alpha \quad \text{for all } \alpha \in [0, 1], \quad (33)$$

and Lemma 6, we have

$$\|S_0\|_{nuc} \le T + \frac{1}{2}\sum_{t \le T-1} \left( \sin \frac{\pi(T-t)}{2(T+1)} \right)^{-1} \le 2T \sum_{t \le T} \frac{1}{t+1} \le 4T \log T. \quad (34)$$

Combining (31), (32) and (34), the inequality (24) follows. It remains to estimate the Hilbert-Schmidt norm of $O_u S$, which can be done by a direct computation. Recall that for any operator $A : \mathbb{R}^m \to \mathbb{R}^m$ the Hilbert-Schmidt norm satisfies

$$\|A\|_{HS}^2 = \mathrm{tr}\, A^* A = \sum_{i \le m} \langle A^* A \varphi_i, \varphi_i \rangle = \sum_{i \le m} \langle A \varphi_i, A \varphi_i \rangle = \sum_{i \le m} |A \varphi_i|^2, \quad (35)$$

for any orthonormal basis $\varphi_i$ of $\mathbb{R}^m$. Let $e_{ti}$ be the standard basis vector in $\mathbb{R}^{Tn}$ corresponding to coordinate $i \le n$ at time $t \le T$, and let $e_{t'}$, $t' \le T$, denote the standard basis of $\mathbb{R}^T$. Then

$$O_u S\, e_{ti} = \sum_{t'=t}^T u_{t'i}\, e_{t'}, \quad (36)$$

where $u_{t'i}$ is the $i$-th coordinate of $u_{t'}$. It follows that

$$|O_u S\, e_{ti}|^2 = \sum_{t'=t}^T u_{t'i}^2 \quad (37)$$

and hence

$$\|O_u S\|_{HS}^2 = \sum_{t \le T, i \le n} |O_u S\, e_{ti}|^2 = \sum_{t \le T} \sum_{t'=t}^T |u_{t'}|^2 = \sum_{t \le T} t |u_t|^2. \quad (38)$$

The bounds (25) follow directly from (38).

C Proof of Lemma 6

Proof. Recall that the operator $S_{0,T}^{-1}$ is given by $S_{0,T}^{-1} x = (x_1, x_2 - x_1, \ldots, x_T - x_{T-1})$, and the operator $D = D_T : \mathbb{R}^T \to \mathbb{R}^{T-1}$ is given by $Dx = (x_2 - x_1, \ldots, x_T - x_{T-1})$. Let $V = \mathrm{span}\{e_2, \ldots, e_T\}$ be the subspace spanned by all but the first coordinate, and let $P_V : \mathbb{R}^T \to \mathbb{R}^T$ be the projection onto $V$, i.e. the restriction to the second through $T$-th coordinates. Observe that the action of $D P_V$ is equivalent to that of $S_{0,T-1}^{-1}$. Therefore the singular values of $S_{0,T-1}^{-1}$ are identical to those of $D P_V$. To obtain bounds on the singular values of $D P_V$, note that $(D P_V)^* D P_V = P_V D^* D P_V$; that is, $P_V D^* D P_V$ is a compression of $D^* D$. Thus, by Cauchy's Interlacing Theorem (Bhatia, 1997, Corollary III.1.5),

$$\lambda_t(D^* D) \ge \lambda_t(P_V D^* D P_V) \ge \lambda_{t+1}(D^* D) \quad (39)$$

for all $t \le T - 1$. In conjunction with Lemma 5, this provides us with estimates for all but the smallest singular value of $S_{0,T-1}^{-1}$ (since $\lambda_T(D^* D) = 0$). We therefore estimate the smallest singular value, $\lambda_{T-1}(S_{0,T-1}^{-1}) = \lambda_T(D P_V)$, directly, by bounding the norm of $S_{0,T-1}$. Indeed, for any $T$, by the Cauchy-Schwarz inequality,

$$|S_{0,T}\, x|^2 = \sum_{t \le T} \Big( \sum_{i \le t} x_i \Big)^2 \le \sum_{t \le T} \Big( \sum_{i \le t} x_i^2 \Big) \Big( \sum_{i \le t} 1 \Big) \quad (40)$$
$$= \sum_{t \le T} t \sum_{i \le t} x_i^2 \le \sum_{t \le T} t |x|^2 \le |x|^2 T^2. \quad (41)$$

Thus we have $\|S_{0,T}\|_{op} \le T$, which concludes the proof of the Lemma.

D Proof of Theorem 3

Proof. The bound on $\|R\|_{op}$ follows directly from the lower bound on the singular values of $O_u S$ in (23). Since $R$ is of rank $T$, the upper bounds on $\|R\|_{HS}$ follow directly from the $\|R\|_{op}$ bound. The lower bounds on $\|R\|_{HS}^2$ and $\|R^* R\|_{HS}^2$ follow from the upper bounds on the nuclear norm in Lemma 4. Note that this argument would not have worked if we only had upper bounds on the Hilbert-Schmidt norm, rather than the nuclear norm, in Lemma 4. Denote $\gamma_i = \lambda_i(O_u S)$. From (24) in Lemma 4, the number of $\gamma_i$ that are larger than $4n|u_{max}| \log T$ satisfies

$$\#\{\gamma_i \mid \gamma_i \ge 4n|u_{max}| \log T\} \le \frac{T}{2}. \quad (42)$$

Since there are a total of $T$ singular values $\gamma_i$ overall, we can equivalently rewrite (42) as

$$\#\Big\{\gamma_i \mid \gamma_i^{-1} \ge \frac{1}{4n|u_{max}| \log T}\Big\} \ge \frac{T}{2}. \quad (43)$$

This immediately implies the lower bounds on $\|R\|_{HS}^2$ and $\|R^* R\|_{HS}^2$.
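The sine formula of Lemma 5, which drives the estimates in Sections B-D, is also easy to verify numerically. A quick check (ours, not from the paper):

```python
import numpy as np

# Check Lemma 5: singular values of D_T, where (D_T x)_t = x_{t+1} - x_t.
T = 200
D = np.zeros((T - 1, T))
idx = np.arange(T - 1)
D[idx, idx] = -1.0
D[idx, idx + 1] = 1.0
sv = np.linalg.svd(D, compute_uv=False)                        # T-1 nonzero singular values
formula = 2 * np.sin(np.pi * (T - np.arange(1, T)) / (2 * T))  # l = 1, ..., T-1
assert np.allclose(np.sort(sv), np.sort(formula))
```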
E Proof of Theorem 1

The two main probabilistic tools that we use are the Hanson-Wright inequality (Hanson and Wright, 1971), and a classical norm deviation inequality for sub-Gaussian vectors, as follows.

Theorem 7 (Hanson-Wright Inequality). Let $X = (X_1, \ldots, X_m) \in \mathbb{R}^m$ be a random vector such that the components $X_i$ are independent and $X_i \sim SG(\kappa)$ for all $i \le m$. Let $A$ be an $m \times m$ matrix. Then, for every $t \ge 0$,

$$P\left( |\langle AX, X \rangle - \mathbb{E}\langle AX, X \rangle| > t \right) \le 2\exp\left\{ -c \min\left( \frac{t^2}{\kappa^4 \|A\|_{HS}^2}, \frac{t}{\kappa^2 \|A\|_{op}} \right) \right\}. \quad (44)$$

In particular, recall that we are interested in the concentration of $|RY|^2$. We may write:

$$|RY|^2 = |R O_u S h + Rz|^2 = |R O_u S h|^2 + |Rz|^2 + 2\langle h, (R O_u S)^* Rz \rangle. \quad (45)$$

The deviations of the first two terms may be bounded via Theorem 7. For the third term, we use the following:

Lemma 8. For any $X \sim SG_m(\kappa)$ we have:

$$P\left( |X| > 4\kappa\sqrt{m} + t \right) \le \exp\left( -\frac{ct^2}{\kappa^2} \right). \quad (46)$$

Note that in Lemma 8 we do not require the coordinates of $X$ to be independent. This will be important in what follows. Lemma 8 is standard and can be proved via covering number estimates of the Euclidean ball; see for instance Vershynin (2018), Section 4.4. In addition, the following observation is used throughout the text:

Lemma 9. Let $A : \mathbb{R}^n \to \mathbb{R}^m$ be an operator, and let $h = (h_1, \ldots, h_n)$ have independent zero-mean coordinates with $\mathbb{E}h_i^2 = \sigma^2$. Denote by $\{\lambda_i\}_{i=1}^k$, $k \le \min(m, n)$, the singular values of $A$. Then

$$\mathbb{E}|Ah|^2 = \sigma^2 \|A\|_{HS}^2 = \sigma^2 \sum_{i=1}^k \lambda_i^2. \quad (47)$$

The elementary proof is omitted. We now prove Theorem 1.

Proof. Let $0 < \delta < 1$ be given. We first bound the deviations from the expectation for the first term in (45), $|R O_u S h|^2$. We apply the Hanson-Wright inequality with $X = h$ and $A = (R O_u S)^* R O_u S$. By definition, $A$ has a single non-zero eigenvalue $1$ with multiplicity $T$. Thus clearly $\|A\|_{HS}^2 = T$ and $\|A\|_{op} = 1$. For an appropriate constant $c' > 0$, set $t = c' \kappa^2 \sqrt{T \log \frac{1}{\delta}}$. Then,

$$P\left( \left| |R O_u S h|^2 - T\sigma^2 \right| \ge c'' \kappa^2 \sqrt{T \log \frac{1}{\delta}} \right) \le \delta. \quad (48)$$

The deviation of the second term in (45) is similarly bounded using $A = R^* R$. Recall that by Theorem 3 we have $\|R\|_{op} \le 2|u_{min}|^{-1}$, and note that $\|R^* R\|_{HS}^2 \le T \|R\|_{op}^4 \le cT |u_{min}|^{-4}$. Set $t = c' \kappa^2 |u_{min}|^{-2} \sqrt{T \log \frac{1}{\delta}}$. With this choice it follows that both terms in the minimum in (44) are larger than $c \log \frac{1}{\delta}$, and we have

$$P\left( \left| |Rz|^2 - \eta^2 \|R\|_{HS}^2 \right| \ge c' \kappa^2 |u_{min}|^{-2} \sqrt{T \log \frac{1}{\delta}} \right) \le \delta. \quad (49)$$

Finally, we bound the third term in (45). Denote by $D$ the event

$$D = \left\{ |(R O_u S)^* Rz| > c |u_{min}|^{-1} \left( \kappa\sqrt{T} + \kappa\sqrt{\log \frac{1}{\delta}} \right) \right\}, \quad (50)$$

and by $E$ the event

$$E = \left\{ |\langle h, (R O_u S)^* Rz \rangle| > c' |u_{min}|^{-1} \left( \kappa\sqrt{T} + \kappa\sqrt{\log \frac{1}{\delta}} \right) \kappa\sqrt{\log \frac{1}{\delta}} \right\}. \quad (51)$$

By Lemma 8 applied to $z$, and using the fact that $\|(R O_u S)^* R\|_{op} \le 2|u_{min}|^{-1}$,

$$P(D) \le P\left( |z| > c\kappa\sqrt{T} + c\kappa\sqrt{\log \frac{1}{\delta}} \right) \le \delta. \quad (52)$$

Next, using the independence of $h, z$ and $h \sim SG_{Tn}(\kappa)$,

$$P(E \mid D^c) \le \delta, \quad (53)$$

where $D^c$ is the complement of $D$. Therefore, combining (52) and (53),

$$P(E) = P(D)P(E \mid D) + P(D^c)P(E \mid D^c) \le 2\delta. \quad (54)$$

Combining (48), (49) and (54), we obtain via the union bound:

$$P\left( \left| \frac{|RY|^2}{T} - \left( \sigma^2 + \frac{\|R\|_{HS}^2}{T}\eta^2 \right) \right| \ge \frac{c\left(1 + \kappa^2\right)\left(1 + |u_{min}|^{-2}\right) \log \frac{1}{\delta}}{\sqrt{T}} \right) \le 4\delta. \quad (55)$$

Similarly, we obtain a bound for the equation involving $R_0$:

$$P\left( \left| \frac{|R_0 Y|^2}{p} - \left( \sigma^2 + \frac{\|R_0\|_{HS}^2}{p}\eta^2 \right) \right| \ge \frac{c\left(1 + \kappa^2\right)\left(1 + |u_{min}|^{-2}\right) \log \frac{1}{\delta}}{\sqrt{p}} \right) \le 4\delta. \quad (56)$$

The only difference in the derivation of (56) compared to (55) is in the application of Lemma 8.
In the latter case, to replace $\sqrt{T}$ with $\sqrt{p}$ in (52), we apply Lemma 8 with $X = P_V z$ rather than with $X = z$, where $P_V$ is the projection onto the range of $R_0$, a $p$-dimensional space. Note that $P_V z$ does not necessarily have a structure of $p$ independent coordinates, but it is sub-Gaussian and isotropic. Therefore Lemma 8 still applies.

F Proof of Corollary 2

We now turn to prove Corollary 2.

Proof. Denote

$$E(a) = \frac{\left(1 + \kappa^2\right)\left(1 + |u_{min}|^{-2}\right) \log \frac{1}{\delta}}{a}. \quad (57)$$

Using (56) and (55) we may write

$$\frac{|RY|^2}{T} + e_1 = \sigma^2 + \frac{\|R\|_{HS}^2}{T}\eta^2, \quad (58)$$
$$\frac{|R_0 Y|^2}{p} + e_2 = \sigma^2 + \frac{\|R_0\|_{HS}^2}{p}\eta^2, \quad (59)$$

where $e_1, e_2$ are error terms such that $|e_1| \le E(\sqrt{T})$ and $|e_2| \le E(\sqrt{p})$ hold with probability at least $1 - 8\delta$. It follows that

$$\eta^2 = \left( \frac{|R_0 Y|^2}{p} - \frac{|RY|^2}{T} \right) \left( \frac{\|R_0\|_{HS}^2}{p} - \frac{\|R\|_{HS}^2}{T} \right)^{-1} + (e_2 - e_1) \left( \frac{\|R_0\|_{HS}^2}{p} - \frac{\|R\|_{HS}^2}{T} \right)^{-1} \quad (60)$$

and

$$\frac{p\|R\|_{HS}^2}{T\|R_0\|_{HS}^2} \cdot \frac{|R_0 Y|^2}{p} + \frac{p\|R\|_{HS}^2}{T\|R_0\|_{HS}^2}\, e_2 = \frac{p\|R\|_{HS}^2}{T\|R_0\|_{HS}^2}\, \sigma^2 + \frac{\|R\|_{HS}^2}{T}\eta^2. \quad (61)$$

Thus

$$\sigma^2 = \left( 1 - \frac{p\|R\|_{HS}^2}{T\|R_0\|_{HS}^2} \right)^{-1} \left( \frac{|RY|^2}{T} - \frac{|R_0 Y|^2}{T} \cdot \frac{\|R\|_{HS}^2}{\|R_0\|_{HS}^2} \right) + \left( 1 - \frac{p\|R\|_{HS}^2}{T\|R_0\|_{HS}^2} \right)^{-1} \left( e_1 - \frac{p\|R\|_{HS}^2}{T\|R_0\|_{HS}^2}\, e_2 \right). \quad (62)$$

It remains to observe that

$$\left( \frac{\|R_0\|_{HS}^2}{p} - \frac{\|R\|_{HS}^2}{T} \right)^{-1} = \left( 1 - \frac{p\|R\|_{HS}^2}{T\|R_0\|_{HS}^2} \right)^{-1} \frac{p}{\|R_0\|_{HS}^2} \quad (63)$$
$$\le c \left( 1 - \frac{p\|R\|_{HS}^2}{T\|R_0\|_{HS}^2} \right)^{-1} n^2 |u_{max}|^2 \log^2 T, \quad (64)$$

where the inequality follows from eq. (43) in the proof of Theorem 3.

G Thresholding Gap Analysis

The main result of this section is the following proposition.

Proposition 10. Given the sequence $\{u_t\}_{t=1}^T \subset \mathbb{R}^n$, define the scalar sequence $\tilde{u}_t = \sum_{i \le n} u_{ti}$ and set

$$|\tilde{u}_{min}| = \min_t |\tilde{u}_t| \quad \text{and} \quad |\tilde{u}_{max}| = \max_t |\tilde{u}_t|. \quad (65)$$

Then for $p = \frac{1}{4}T$,

$$\frac{\|R_0\|_{HS}^2}{p} - \frac{\|R\|_{HS}^2}{T} \ge c\, \frac{|\tilde{u}_{min}|}{n |u_{max}|} \cdot \frac{\|R\|_{HS}^2}{T}. \quad (66)$$

The general idea behind the proof of Proposition 10 is to show that for $n = 1$, the ratio $\left(1 - \frac{p\|R\|_{HS}^2}{T\|R_0\|_{HS}^2}\right)^{-1}$ can be controlled. This is done in Lemmas 11 and 12 below. In particular, Lemma 11 is a general statement about integrals of monotone real functions under certain order constraints, and Lemma 12 provides a relation of the spectrum of $R$ to that of $S_0^{-1}$. It is then shown that for arbitrary $n$, $O_u S$ contains a certain copy of an $n = 1$ dimensional operator with parameters $\tilde{u}_t$, which implies the bounds. For the case $n = 1$, stated in Lemma 12, the argument consists of showing that the spectrum of $R$ is upper and lower bounded by appropriately decaying functions, and therefore cannot be "too constant". We first obtain general estimates for the integrals of such upper and lower bounded functions in the following lemma.

Lemma 11. Let $f : [0,1] \to \mathbb{R}$ be a monotone non-increasing function such that for all $x \in [0,1]$,

$$(1-x)^2 \le f(x) \le M(1-x)^2, \quad (67)$$

for some $M \ge 1$. Set $t_0 = \frac{1}{4}$. Then

$$r(f) := \frac{\frac{1}{t_0}\int_0^{t_0} f(x)\,dx}{\int_0^1 f(x)\,dx} \ge 1 + \frac{c}{\sqrt{M}}. \quad (68)$$

Proof. Denote

$$I(a, b, f) = \int_a^b f(x)\,dx. \quad (69)$$

Write

$$r(f) = \frac{t_0^{-1} I(0, t_0, f)}{I(0, t_0, f) + I(t_0, 1, f)} = \frac{t_0^{-1}}{1 + \frac{I(t_0, 1, f)}{I(0, t_0, f)}} \quad (70)$$

and set $v := f(t_0)$. Then, among all $f$ that satisfy (67) and $f(t_0) = v$, $r(f)$ is minimized on $f$ with maximal $I(t_0, 1, f)$ and minimal $I(0, t_0, f)$.
Due to the form of the constraint (67) and monotonicity, this minimizer is given by

$$\tilde{f}_v(x) = \begin{cases} (1-x)^2 & x \in [0, t_v^-] \\ v & x \in [t_v^-, t_v^+] \\ M(1-x)^2 & x \in [t_v^+, 1] \end{cases} \quad (71)$$

where $t_v^- := \max\{0, 1 - \sqrt{v}\}$ and $t_v^+ := 1 - \sqrt{v/M}$. Our problem is therefore now reduced from a minimization of $r(f)$ over the function space to a problem of minimizing $r(v) := r(\tilde{f}_v)$ over a single scalar variable $v$. To this end, first note that we can assume w.l.o.g. that $M \ge (1-t_0)^{-2}$. Indeed, for larger $M$, the lower bound on $r(f)$ can only become smaller, since $r(f)$ would be minimized over a larger set. By construction, the value $v$ must satisfy $(1-t_0)^2 \le v \le M(1-t_0)^2$, and for $M \ge (1-t_0)^{-2}$, the value $v = 1$ satisfies these inequalities. Next, by a direct computation (taking the derivative in $v$), one verifies that for any $M \ge (1-t_0)^{-2}$, $r(v)$ as a function of $v$ is minimized at $v = 1$. It remains to observe that $r(1) = \left(1 - \frac{2}{3\sqrt{M}}\right)^{-1} \ge 1 + \frac{c}{\sqrt{M}}$, which yields the statement of the Lemma.

We can now treat the $n = 1$ case.

Lemma 12. Consider the case $n = 1$, and $p = \frac{1}{4}T$. Then

$$\frac{\|R_0\|_{HS}^2}{p} - \frac{\|R\|_{HS}^2}{T} \ge c\, \frac{|u_{min}|}{|u_{max}|} \cdot \frac{\|R\|_{HS}^2}{T}. \quad (72)$$

Proof. For any two operators $A, B$ we have

$$\lambda_i(AB) \le \lambda_i(A)\|B\|_{op} \quad \text{and} \quad \lambda_i(BA) \le \lambda_i(A)\|B\|_{op}; \quad (73)$$

see Bhatia (1997); Gohberg and Krein (1969). Note that in the case $n = 1$ and $|u_{min}| > 0$, $O_u S$ is invertible. Thus the singular values of $O_u S_0$ satisfy

$$|u_{min}| \lambda_i(S_0) \le \lambda_i(O_u S_0) \le |u_{max}| \lambda_i(S_0), \quad (74)$$

where the first inequality follows by applying (73) to $(O_u S_0)^{-1}$, and the second by considering $O_u S_0$ itself. Equivalently, for $i = 1, \ldots, T$,

$$|u_{max}|^{-1} \lambda_{T+1-i}(S_0^{-1}) \le \lambda_{T+1-i}(R) \le |u_{min}|^{-1} \lambda_{T+1-i}(S_0^{-1}), \quad (75)$$

and, using Lemma 6,

$$c\, |u_{max}|^{-1}\, \frac{i}{T+1} \le \lambda_{T+1-i}(R) \le \pi\, |u_{min}|^{-1}\, \frac{i}{T+1}. \quad (76)$$

By changing the index and taking squares, we have

$$c\, |u_{max}|^{-2} \left( 1 - \frac{i}{T+1} \right)^2 \le \lambda_i^2(R) \le c'\, |u_{min}|^{-2} \left( 1 - \frac{i}{T+1} \right)^2. \quad (77)$$

Now, using an elementary discretization argument, for $p = \frac{1}{4}T$, by Lemma 11 it follows that

$$\frac{\|R_0\|_{HS}^2/p}{\|R\|_{HS}^2/T} \ge 1 + c\, \frac{|u_{min}|}{|u_{max}|}, \quad (78)$$

which implies (72).

Finally, we prove Proposition 10.

Proof. As discussed earlier, the key to a statement such as (66) is to show that the spectrum of $R$ is non-constant, and in particular has enough small singular values. This is equivalent to providing appropriate lower bounds on the spectrum of $O_u S$. Here we derive such bounds by comparison with an $n = 1$ case, as follows. Let $V \subset \mathbb{R}^{Tn}$ be the $T$-dimensional subspace spanned by vectors for which, for every time $t$, all coordinates at time $t$ are identical. Formally, for $x \in V$, for every $t$ we require that $x_{(t-1)n+i} = x_{(t-1)n+j}$ for all $i, j \le n$. Observe that the restriction of $O_u S$ to $V$, the operator $O_u S P_V$, acts equivalently to the $n = 1$ case operator defined by $O_{\tilde{u}} S_0$. In particular, similarly to the argument in Lemma 12, it follows that

$$|\tilde{u}_{min}| \lambda_i(S_0) \le \lambda_i(O_{\tilde{u}} S_0). \quad (79)$$

Moreover, note that $(O_u S P_V)^* O_u S P_V = P_V S^* O_u^* O_u S P_V$ and $(O_u S)^* O_u S$ are non-negative operators, and $(O_u S)^* O_u S \ge (O_u S P_V)^* O_u S P_V$ (that is, $(O_u S)^* O_u S - (O_u S P_V)^* O_u S P_V$ is non-negative). It follows that for all $i \le T$,

$$\lambda_i(O_u S) \ge \lambda_i(O_u S P_V). \quad (80)$$

The upper bounds on the spectrum $\lambda_i(O_u S)$ may again be obtained via (73).
H Missing Values in STVE

We now discuss the treatment of missing values in STVE. Recall that our starting point is the vector form of the system, (7), which we rewrite here:
$$Y = O_uSh + z, \qquad (82)$$
where $h \in \mathbb{R}^{Tn}$ and $z \in \mathbb{R}^T$ are sub-Gaussian vectors and $Y = (y_1,\ldots,y_T) \in \mathbb{R}^T$ is the observation vector. Suppose that $M$ out of the $T$ observation values are missing, at times $t_1,\ldots,t_M$. Set $T' = T - M$ and define a projection operator $P_A : \mathbb{R}^T \to \mathbb{R}^T$ as the operator that omits the coordinates $t_1,\ldots,t_M$. Formally, let $e(t)$, $t \le T$, be the standard basis vector in $\mathbb{R}^T$, with 1 at coordinate $t$ and zeros elsewhere. Then
$$P_A(e(t)) = \begin{cases} 0 & t \in \{t_1,\ldots,t_M\}, \\ e(t) & \text{otherwise.} \end{cases} \qquad (83)$$
We can then rewrite (82) as
$$P_AY = P_AO_uSh + P_Az. \qquad (84)$$
Note that the vector $P_AY$ contains only the available, non-missing values of $y_t$. Similarly to the case with no missing values, we define $R$ as the Moore-Penrose inverse of $P_AO_uS$. Then we have
$$RP_AY = RP_AO_uSh + RP_Az. \qquad (85)$$
Since the Moore-Penrose inverse $R$ acts only on the range of $P_AO_uS$, which is contained in the range of $P_A$, we have $RP_A = R$, and therefore
$$RY = RO_uSh + Rz, \qquad (86)$$
similarly to the case with no missing values. The only difference is that $R$ will now have $T'$ rather than $T$ non-zero singular values; $R_0$ is defined similarly. The analysis of the spectrum of $R$ concerns only the non-zero singular values of $R$, and holds with no change other than replacing $T$ by $T'$. Consequently, the whole approach works identically with $T$ replaced by $T'$ everywhere, and in the bounds of Theorems 1 and 3 in particular.
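In code, this modification amounts to zeroing the missing rows before taking the pseudoinverse. A minimal sketch, reusing the assumed cumulative-sum form of $S$ for $n=1$ from the previous snippet (the function and variable names are ours):

```python
import numpy as np

def masked_R(OuS, missing_times):
    """Moore-Penrose inverse of P_A O_u S, where P_A zeroes the rows of the
    missing observations, as in (83)-(84). A sketch, not the authors' code."""
    T = OuS.shape[0]
    P_A = np.eye(T)
    P_A[missing_times, missing_times] = 0.0   # omit the missing coordinates
    return np.linalg.pinv(P_A @ OuS)

# Example: a 1-dimensional system with M = 3 missing observations.
T = 100
u = np.random.default_rng(1).uniform(0.5, 2.0, size=T)
OuS = np.diag(u) @ np.tril(np.ones((T, T)))
R = masked_R(OuS, missing_times=[10, 40, 41])
nonzero = np.sum(np.linalg.svd(R, compute_uv=False) > 1e-10)
print(nonzero)   # T' = T - M = 97 non-zero singular values, as claimed
```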
[Figure 4: Supplementary experiment figures. (a) Load (Zone 0) over time, with 6-month (orange) and one-year (green) moving-average smoothing. (b) Total average prediction error for each of the 20 zones, for the Stationary Regression, MLE Kalman, STVE Kalman, and OG methods.]

I Electricity Consumption and Temperatures Data

The raw data in Hong et al. (2014) contains electricity consumption levels for 20 nearby areas, also referred to as zones. Each zone has slightly different consumption characteristics. All figures in Section 6.2 refer to Zone 1; however, the results are similar for all zones. In particular, while Figure 2c shows the (smoothed) prediction errors of each method over time, the total error on the test set, normalized by the number of days, for each method and each zone, is shown in Figure 4b. As with Zone 1, the adaptive methods perform better for all zones. The STVE-based estimator is better than MLE in half the cases, but the differences between STVE and MLE performance are negligible compared to the errors of the stationary regression or OG.

In the rest of this section we discuss the preprocessing of the data. The raw data contains hourly consumption levels for the 20 zones, and hourly temperatures from 11 weather stations at unspecified locations in the region. Here we are only interested in the total daily consumption, i.e. the total consumption over all hours of the day, for each zone. For each station we also consider the average daily temperature. Moreover, since the average daily temperatures at different stations are strongly correlated ($\rho \ge 0.97$ for all pairs of stations), we only consider a single daily number, $v_t$: the average temperature across all hours of the day and all stations.

Of the roughly 234 weeks in the data, 9 non-consecutive weeks have missing load (consumption) data: 4 in the first half of the period (the train set), and 5 in the second half. For the purposes of training the stationary regression, we excluded the data points with missing values from the train set. The STVE, MLE and OG methods were trained with missing values. The predictions of all methods were evaluated only at the points where the actual values were known. The temperatures and all the loads are normalized to have zero mean and standard deviation 1 on the first half of the data (the train set).

The MLE estimates of the parameters $\sigma^2, \eta^2$ were obtained using the DLM package of the R environment (Petris, 2010). The package uses an L-BFGS-B algorithm with numerically approximated derivatives as the underlying optimization procedure.

One immediate reason for the dependence of the load on temperature to change with time (as shown in Figure 2b for Zone 1) is that the load simply grows with time (assuming the temperatures do not exhibit a trend of the same magnitude). Indeed, it is easy to verify that there is an upward trend in the overall load, as shown in Figure 4a. However, the upward trend is non-uniform in temperature, since the parabolas in Figure 2b are not shifts of each other by a constant.

Finally, in Figure 4b the average prediction errors are shown for each method and for each zone in the dataset. Zone 9 is known to be an industrial zone (Hong, 2016), where the load does not significantly depend on the temperature, hence the higher error rates for all the methods. For the other zones the situation is similar to Zone 1, with STVE and MLE performing comparably, and better than either OG or the stationary regression.
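The preprocessing steps described above (daily totals per zone, a single station-averaged temperature, and train-half normalization) are straightforward to reproduce. A minimal pandas sketch follows; the column names (`zone_*` for hourly loads, `station_*` for hourly temperatures, a `date` column) are hypothetical placeholders, not the schema of the released data, and the pipeline is our reconstruction of the text, not the authors' code.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Daily aggregation and train-half normalization, as in Appendix I.
    Column names are hypothetical; a sketch, not the authors' pipeline."""
    load_cols = [c for c in df.columns if c.startswith("zone_")]
    temp_cols = [c for c in df.columns if c.startswith("station_")]

    daily = df.groupby("date").agg(
        {**{c: "sum" for c in load_cols},     # total daily consumption per zone
         **{c: "mean" for c in temp_cols}}    # average daily temperature per station
    )
    # Stations are strongly correlated (rho >= 0.97), so keep one number v_t.
    daily["v"] = daily[temp_cols].mean(axis=1)
    daily = daily.drop(columns=temp_cols)

    # Zero mean and unit standard deviation on the first half (train set).
    train = daily.iloc[: len(daily) // 2]
    return (daily - train.mean()) / train.std()
```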