Random design analysis of ridge regression


Authors: Daniel Hsu, Sham M. Kakade, Tong Zhang

Abstract. This work gives a simultaneous analysis of both the ordinary least squares estimator and the ridge regression estimator in the random design setting under mild assumptions on the covariate/response distributions. In particular, the analysis provides sharp results on the "out-of-sample" prediction error, as opposed to the "in-sample" (fixed design) error. The analysis also reveals the effect of errors in the estimated covariance structure, as well as the effect of modeling errors, neither of which effects are present in the fixed design setting. The proofs of the main results are based on a simple decomposition lemma combined with concentration inequalities for random vectors and matrices.

1. Introduction

In the random design setting for linear regression, we are provided with samples of covariates and responses, $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, which are sampled independently from a population, where the $x_i$ are random vectors and the $y_i$ are random variables. Typically, these pairs are hypothesized to have the linear relationship
$$y_i = \langle \beta, x_i \rangle + \epsilon_i$$
for some linear function $\beta$ (though this hypothesis need not be true). Here, the $\epsilon_i$ are error terms, typically assumed to be normally distributed as $N(0, \sigma^2)$. The goal of estimation in this setting is to find coefficients $\hat\beta$ based on these $(x_i, y_i)$ pairs such that the expected prediction error on a new draw $(x, y)$ from the population, measured as $E[(\langle \hat\beta, x \rangle - y)^2]$, is as small as possible. This goal can also be interpreted as estimating $\beta$ with accuracy measured under a particular norm.

The random design setting stands in contrast to the fixed design setting, where the covariates $x_1, x_2, \ldots, x_n$ are fixed (i.e., deterministic), and only the responses $y_1, y_2, \ldots, y_n$ are treated as random. Thus, the covariance structure of the design points is completely known and need not be estimated, which simplifies the analysis of standard estimators. However, the fixed design setting does not directly address out-of-sample prediction, which is of primary concern in many applications; for instance, in prediction problems, the estimator $\hat\beta$ is computed from an initial sample from the population, and the end-goal is to use $\hat\beta$ as a predictor of $y$ given $x$, where $(x, y)$ is a new draw from the population. A fixed design analysis only assesses the accuracy of $\hat\beta$ on data already seen, while a random design analysis is concerned with the predictive performance on unseen data.

This work gives a detailed analysis of both the ordinary least squares and ridge estimators [9] in the random design setting that quantifies the essential differences between random and fixed design. In particular, the analysis reveals, through a simple decomposition:
- the effect of errors in the estimated covariance structure;
- the effect of approximating the true regression function by a linear function, in the case the model is misspecified;
- the effect of errors due to noise in the response.
Neither of the first two effects is present in the fixed design analysis of ridge regression, and the random design analysis shows that the effect of errors in the estimated covariance structure is minimal: essentially a second-order effect as soon as the sample size is large enough. The analysis also isolates the effect of approximation error in the main terms of the estimation error bound, so that the bound reduces to one that scales only with the noise variance when the approximation error vanishes.

2010 Mathematics Subject Classification. Primary 62J07; Secondary 62J05.
Key words and phrases. Linear regression, ordinary least squares, ridge regression, randomized approximation.

Another important feature of the analysis that distinguishes it from previous work is that it applies to the ridge estimator with an arbitrary setting of $\lambda \ge 0$. The estimation error is given in terms of the spectrum of the second moment of $x$ and the particular choice of $\lambda$; the dimension of the covariate space does not enter explicitly except when $\lambda = 0$. When $\lambda = 0$, we immediately obtain an analysis of ordinary least squares; we are not aware of any other random design analysis of the ridge estimator with this characteristic. More generally, the convergence rate can be optimized by appropriately setting $\lambda$ based on assumptions about the spectrum. Finally, while our analysis is based on an operator-theoretic approach similar to that of [19] and [4], it relies on probabilistic tail inequalities in a modular way that gives explicit dependencies without additional boundedness assumptions other than those assumed by the probabilistic bounds.

Outline. Section 2 discusses the model, preliminaries, and related work. Section 3 presents the main results on the excess mean squared error of the ordinary least squares and ridge estimators under random design and discusses the relationship to the standard fixed design analysis. Section 4 discusses an application to accelerating least squares computations on large data sets. The proofs of the main results are given in Section 5.

2. Preliminaries

2.1. Notation. Unless otherwise specified, all vectors in this work are assumed to live in a finite-dimensional inner product space with inner product $\langle \cdot, \cdot \rangle$. The restriction to finite dimensions is due to the probabilistic bounds used in the proofs; the main results of this work can be extended to (possibly infinite-dimensional) separable Hilbert spaces under mild assumptions by using suitable infinite-dimensional generalizations of these probabilistic bounds. We denote the dimensionality of this space by $d$, but stress that our results will not explicitly depend on $d$ except when considering the special case of $\lambda = 0$.

Let $\|\cdot\|_M$, for a self-adjoint positive definite linear operator $M \succ 0$, denote the vector norm given by $\|v\|_M := \sqrt{\langle v, M v \rangle}$. When $M$ is omitted, it is taken to be the identity $I$, so $\|v\| = \sqrt{\langle v, v \rangle}$. Let $u \otimes u$ denote the outer product of a vector $u$, which acts as the rank-one linear operator $v \mapsto (u \otimes u)v = \langle v, u \rangle u$. For a linear operator $M$, let $\|M\|$ denote its spectral (operator) norm, i.e., $\|M\| = \sup_{v \neq 0} \|Mv\| / \|v\|$, and let $\|M\|_F$ denote its Frobenius norm, i.e., $\|M\|_F = \sqrt{\operatorname{tr}(M^* M)}$. If $M$ is self-adjoint, then $\|M\|_F = \sqrt{\operatorname{tr}(M^2)}$.
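A minimal sketch of this notation in numpy follows; the variable names are ours, not the paper's, and the assertions simply check the identities stated above on a random example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
M = A @ A.T + np.eye(d)               # self-adjoint, positive definite M

v = rng.standard_normal(d)
norm_M = np.sqrt(v @ M @ v)           # ||v||_M = sqrt(<v, M v>)

u = rng.standard_normal(d)
outer = np.outer(u, u)                # u (x) u, a rank-one operator
assert np.allclose(outer @ v, (v @ u) * u)   # (u (x) u) v = <v, u> u

spec = np.linalg.norm(M, 2)           # spectral norm ||M||
frob = np.linalg.norm(M, 'fro')       # Frobenius norm ||M||_F
assert np.isclose(frob, np.sqrt(np.trace(M @ M)))  # since M is self-adjoint
```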
Let $\lambda_{\max}[M]$ and $\lambda_{\min}[M]$, respectively, denote the largest and smallest eigenvalues of a self-adjoint linear operator $M$.

2.2. Linear regression. Let $x$ be a random vector, and let $y$ be a random variable. Throughout, it is assumed that $x$ and $y$ have finite second moments ($E[\|x\|^2] < \infty$ and $E[y^2] < \infty$). Let $\{v_j\}$ be the eigenvectors of

(1)  $\Sigma := E[x \otimes x]$,

so that they form an orthonormal basis. The corresponding eigenvalues are
$$\lambda_j := \langle v_j, \Sigma v_j \rangle = E[\langle v_j, x \rangle^2].$$
We assume without loss of generality that all eigenvalues $\lambda_j$ are strictly positive, since otherwise we may restrict attention to a subspace on which the assumption holds.

Let $\beta$ achieve the minimum mean squared error over all linear functions, i.e., $E[(\langle \beta, x \rangle - y)^2] = \min_w E[(\langle w, x \rangle - y)^2]$, so that

(2)  $\beta := \sum_j \beta_j v_j$ where $\beta_j := \dfrac{E[\langle v_j, x \rangle y]}{E[\langle v_j, x \rangle^2]}$.

We also have that the excess mean squared error of $w$ over the minimum is
$$E[(\langle w, x \rangle - y)^2] - E[(\langle \beta, x \rangle - y)^2] = \|w - \beta\|_\Sigma^2$$
(see Proposition 5).

2.3. The ridge and ordinary least squares estimators. Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be independent copies of $(x, y)$, and let $\hat{E}$ denote the empirical expectation with respect to these $n$ copies, i.e.,

(3)  $\hat{E}[f] := \dfrac{1}{n} \sum_{i=1}^n f(x_i, y_i)$, $\qquad \hat\Sigma := \hat{E}[x \otimes x] = \dfrac{1}{n} \sum_{i=1}^n x_i \otimes x_i$.

Let $\hat\beta_\lambda$ denote the ridge estimator with parameter $\lambda \ge 0$, defined as the minimizer of the $\lambda$-regularized empirical mean squared error, i.e.,

(4)  $\hat\beta_\lambda := \arg\min_w \big\{ \hat{E}[(\langle w, x \rangle - y)^2] + \lambda \|w\|^2 \big\}$.

The special case with $\lambda = 0$ is the ordinary least squares estimator, which minimizes the empirical mean squared error. These estimators are uniquely defined if and only if $\hat\Sigma + \lambda I \succ 0$ (a sufficient condition is $\lambda > 0$), in which case
$$\hat\beta_\lambda = (\hat\Sigma + \lambda I)^{-1} \hat{E}[xy].$$

2.4. Data model. We now specify the conditions on the random pair $(x, y)$ under which the analysis applies.

2.4.1. Covariate model. We first define the following effective dimensions of the covariate $x$, based on the second moment operator $\Sigma$ and the regularization level $\lambda$:

(5)  $d_{p,\lambda} := \sum_j \left( \dfrac{\lambda_j}{\lambda_j + \lambda} \right)^p$, $\qquad p \in \{1, 2\}$.

It will become apparent in the analysis that these dimensions govern the sample size needed to ensure that $\Sigma$ is estimated with sufficient accuracy. For technical reasons, we also use the quantity

(6)  $\tilde{d}_{1,\lambda} := \max\{d_{1,\lambda}, 1\}$

merely to simplify certain probability tail inequalities in the main result in the peculiar case that $\lambda \to \infty$ (upon which $d_{1,\lambda} \to 0$). We remark that $d_{2,\lambda}$ arises naturally in the standard fixed design analysis of ridge regression (see Proposition 1), and that $d_{1,\lambda}$ was also used by [23] and [4] in their random design analyses of (kernel) ridge regression. It is easy to see that $d_{2,\lambda} \le d_{1,\lambda}$, and that $d_{p,\lambda}$ is at most the dimension $d$ of the inner product space (with equality iff $\lambda = 0$). The estimator (4) and these effective dimensions are illustrated in the sketch below.
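The following is a minimal numpy sketch (ours, not the paper's) of the ridge estimator (4) in its closed form and of the effective dimensions $d_{1,\lambda}, d_{2,\lambda}$ from (5); the decaying spectrum used for the demonstration is an arbitrary choice.

```python
import numpy as np

def ridge_estimator(X, y, lam):
    """beta_hat_lambda = (Sigma_hat + lam I)^{-1} E_hat[x y], as in (4)."""
    n, d = X.shape
    Sigma_hat = X.T @ X / n            # empirical second moment, as in (3)
    Exy = X.T @ y / n
    return np.linalg.solve(Sigma_hat + lam * np.eye(d), Exy)

def effective_dims(eigvals, lam):
    """d_{p,lambda} = sum_j (lambda_j / (lambda_j + lam))^p for p = 1, 2."""
    r = eigvals / (eigvals + lam)
    return r.sum(), (r ** 2).sum()

rng = np.random.default_rng(0)
n, d = 200, 10
eigvals = 2.0 ** -np.arange(d)                    # a decaying spectrum
X = rng.standard_normal((n, d)) * np.sqrt(eigvals)  # Sigma ~ diag(eigvals)
y = X @ np.ones(d) + 0.1 * rng.standard_normal(n)
beta_hat = ridge_estimator(X, y, lam=0.01)
d1, d2 = effective_dims(eigvals, lam=0.01)        # d2 <= d1 <= d
```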
Our main condition requires that the squared length of $(\Sigma + \lambda I)^{-1/2} x$ is never more than a constant factor greater than its expectation (hence the name bounded statistical leverage). The linear mapping $x \mapsto (\Sigma + \lambda I)^{-1/2} x$ is sometimes called whitening when $\lambda = 0$. The reason for considering $\lambda > 0$, in which case we call the mapping $\lambda$-whitening, is that the expectation $E[\|(\Sigma + \lambda I)^{-1/2} x\|^2]$ may only be small for sufficiently large $\lambda$, as
$$E[\|(\Sigma + \lambda I)^{-1/2} x\|^2] = \operatorname{tr}\big((\Sigma + \lambda I)^{-1/2} \Sigma (\Sigma + \lambda I)^{-1/2}\big) = \sum_j \frac{\lambda_j}{\lambda_j + \lambda} = d_{1,\lambda}.$$

Condition 1 (Bounded statistical leverage at $\lambda$). There exists finite $\rho_\lambda \ge 1$ such that, almost surely,
$$\frac{\|(\Sigma + \lambda I)^{-1/2} x\|}{\sqrt{E[\|(\Sigma + \lambda I)^{-1/2} x\|^2]}} = \frac{\|(\Sigma + \lambda I)^{-1/2} x\|}{\sqrt{d_{1,\lambda}}} \le \rho_\lambda.$$

The hard "almost sure" bound in Condition 1 may be relaxed to moment conditions simply by using different probability tail inequalities in the analysis. We do not consider this relaxation for the sake of simplicity. We also remark that it is possible to replace Condition 1 with a subgaussian condition (specifically, a requirement that every projection of $(\Sigma + \lambda I)^{-1/2} x$ be subgaussian), which can lead to a sharper deviation bound in certain cases.

Remark 1 (Ordinary least squares). If $\lambda = 0$, then Condition 1 reduces to the requirement that there exists a finite $\rho_0 \ge 1$ such that, almost surely,
$$\frac{\|\Sigma^{-1/2} x\|}{\sqrt{E[\|\Sigma^{-1/2} x\|^2]}} = \frac{\|\Sigma^{-1/2} x\|}{\sqrt{d}} \le \rho_0.$$

Remark 2 (Bounded covariates). If $\|x\| \le r$ almost surely, then
$$\frac{\|(\Sigma + \lambda I)^{-1/2} x\|}{\sqrt{d_{1,\lambda}}} \le \frac{r}{\sqrt{(\inf\{\lambda_j\} + \lambda)\, d_{1,\lambda}}},$$
in which case Condition 1 holds with $\rho_\lambda$ satisfying $\rho_\lambda \le r / \sqrt{\lambda d_{1,\lambda}}$. An empirical illustration of the leverage ratio appears below.
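The following sketch (ours; the helper name is hypothetical) estimates the leverage ratio in Condition 1 on a sample, replacing $\Sigma$ and $d_{1,\lambda}$ by their sample versions. Note that Gaussian covariates satisfy only a subgaussian analogue of Condition 1, not the hard almost-sure bound, so the printed value is an empirical quantity, not a certified $\rho_\lambda$.

```python
import numpy as np

def leverage_ratio(X, lam):
    """max_i ||(Sigma + lam I)^{-1/2} x_i|| / sqrt(d_{1,lambda}),
    with Sigma and d_{1,lambda} replaced by sample estimates."""
    n, d = X.shape
    Sigma = X.T @ X / n
    evals, V = np.linalg.eigh(Sigma + lam * np.eye(d))
    W = X @ V / np.sqrt(evals)       # rows are (Sigma + lam I)^{-1/2} x_i
    sq = (W ** 2).sum(axis=1)
    d1 = sq.mean()                   # estimates E||(Sigma+lam I)^{-1/2} x||^2
    return np.sqrt(sq.max() / d1)

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 8))   # well-behaved (subgaussian) covariates
print(leverage_ratio(X, lam=0.1))    # a modest ratio
```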
2.4.2. Response model. The response model considered in this work is a relaxation of the typical Gaussian model; the model specifically allows for approximation error and general subgaussian noise. Define the random variables

(7)  $\operatorname{noise}(x) := y - E[y \mid x]$ and $\operatorname{approx}(x) := E[y \mid x] - \langle \beta, x \rangle$,

where $\operatorname{noise}(x)$ corresponds to the response noise, and $\operatorname{approx}(x)$ corresponds to the approximation error of $\beta$. This gives the following modeling equation:
$$y = \langle \beta, x \rangle + \operatorname{approx}(x) + \operatorname{noise}(x).$$
Conditioned on $x$, $\operatorname{noise}(x)$ is random, while $\operatorname{approx}(x)$ is deterministic. The noise is assumed to satisfy the following subgaussian moment condition:

Condition 2 (Subgaussian noise). There exists finite $\sigma \ge 0$ such that, almost surely,
$$E[\exp(\eta \operatorname{noise}(x)) \mid x] \le \exp(\eta^2 \sigma^2 / 2) \quad \forall \eta \in \mathbb{R}.$$

Condition 2 is satisfied, for instance, if $\operatorname{noise}(x)$ is normally distributed with mean zero and variance $\sigma^2$.

For the next condition, define $\beta_\lambda$ to be the minimizer of the regularized mean squared error, i.e.,

(8)  $\beta_\lambda := \arg\min_w \big\{ E[(\langle w, x \rangle - y)^2] + \lambda \|w\|^2 \big\} = (\Sigma + \lambda I)^{-1} E[xy]$,

and also define

(9)  $\operatorname{approx}_\lambda(x) := E[y \mid x] - \langle \beta_\lambda, x \rangle$.

The final condition requires a bound on the size of $\operatorname{approx}_\lambda(x)$.

Condition 3 (Bounded approximation error at $\lambda$). There exists finite $b_\lambda \ge 0$ such that, almost surely,
$$\frac{\|(\Sigma + \lambda I)^{-1/2} x \operatorname{approx}_\lambda(x)\|}{\sqrt{E[\|(\Sigma + \lambda I)^{-1/2} x\|^2]}} = \frac{\|(\Sigma + \lambda I)^{-1/2} x \operatorname{approx}_\lambda(x)\|}{\sqrt{d_{1,\lambda}}} \le b_\lambda.$$

The hard "almost sure" bound in Condition 3 can easily be relaxed to moment conditions, but we do not consider this here for the sake of simplicity. We also remark that $b_\lambda$ only appears in lower-order terms in the main bounds.

Remark 3 (Ordinary least squares). If $\lambda = 0$ and the dimension of the covariate space is $d$, then Condition 3 reduces to the requirement that there exists a finite $b_0 \ge 0$ such that, almost surely,
$$\frac{\|\Sigma^{-1/2} x \operatorname{approx}(x)\|}{\sqrt{E[\|\Sigma^{-1/2} x\|^2]}} = \frac{\|\Sigma^{-1/2} x \operatorname{approx}(x)\|}{\sqrt{d}} \le b_0.$$

Remark 4 (Bounded approximation error). If $|\operatorname{approx}(x)| \le a$ almost surely and Condition 1 (with parameter $\rho_\lambda$) holds, then
$$\frac{\|(\Sigma + \lambda I)^{-1/2} x \operatorname{approx}_\lambda(x)\|}{\sqrt{d_{1,\lambda}}} \le \rho_\lambda |\operatorname{approx}_\lambda(x)| \le \rho_\lambda (a + |\langle \beta - \beta_\lambda, x \rangle|) \le \rho_\lambda \big(a + \|\beta - \beta_\lambda\|_{\Sigma + \lambda I} \|x\|_{(\Sigma + \lambda I)^{-1}}\big) \le \rho_\lambda \big(a + \rho_\lambda \sqrt{d_{1,\lambda}}\, \|\beta - \beta_\lambda\|_{\Sigma + \lambda I}\big),$$
where the first and last inequalities use Condition 1, the second inequality uses the definition of $\operatorname{approx}_\lambda(x)$ in (9) and the triangle inequality, and the third inequality follows from Cauchy-Schwarz. The quantity $\|\beta - \beta_\lambda\|_{\Sigma + \lambda I}$ can be bounded by $\sqrt{\lambda} \|\beta\|$ using the arguments in the proof of Proposition 7. In this case, Condition 3 is satisfied with
$$b_\lambda \le \rho_\lambda \big(a + \rho_\lambda \sqrt{\lambda d_{1,\lambda}}\, \|\beta\|\big).$$
If in addition $\|x\| \le r$ almost surely, then Condition 1 and Condition 3 are satisfied with $\rho_\lambda \le r / \sqrt{\lambda d_{1,\lambda}}$ and $b_\lambda \le \rho_\lambda (a + r\|\beta\|)$, as per Remark 2.

2.5. Related work. The ridge and ordinary least squares estimators are classically studied in the fixed design setting: the covariates $x_1, x_2, \ldots, x_n$ are fixed vectors in $\mathbb{R}^d$, and the responses $y_1, y_2, \ldots, y_n$ are independent random variables, each with mean $E[y_i] = \langle \beta, x_i \rangle$ and variance $\operatorname{var}(y_i) \le \sigma^2$ [16]. The analysis reviewed in Section 3.1 reveals that the expected prediction error $E[\|\hat\beta_\lambda - \beta\|_\Sigma^2]$ is controlled by the sum of a bias term, which is zero when $\lambda = 0$, and a variance term, which is bounded by $\sigma^2 d_{2,\lambda}/n$. As discussed in the introduction, our random design analysis of the ridge estimator reveals the essential differences between fixed and random design by comparing with this classical analysis.

Many classical analyses of the ridge and ordinary least squares estimators in the random design setting (e.g., in the context of nonparametric estimators) do not actually show nonasymptotic $O(d/n)$ convergence of the mean squared error to that of the best linear predictor, where $d$ is the dimension of the covariate space. Rather, the error relative to the Bayes error is bounded by some multiple $c > 1$ of the error of the optimal linear predictor relative to the Bayes error, plus an $O(d/n)$ term [8]:
$$E[(\langle \hat\beta, x \rangle - E[y \mid x])^2] \le c \cdot E[(\langle \beta, x \rangle - E[y \mid x])^2] + O(d/n).$$
Such bounds are appropriate in nonparametric settings where the error of the optimal linear predictor also approaches the Bayes error at an $O(d/n)$ rate. Beyond these classical results, analyses of ordinary least squares often come with nonstandard restrictions on applicability or additional dependencies on the spectrum of the second moment operator (see the recent work of [2] for a comprehensive survey of these results); for instance, a result of [5] gives a bound on the excess mean squared error of the form
$$\|\hat\beta - \beta\|_\Sigma^2 \le O\left( \frac{d + \log\big(\det(\hat\Sigma) / \det(\Sigma)\big)}{n} \right),$$
but the bound is only shown to hold when every linear predictor with low empirical mean squared error satisfies certain boundedness conditions.

This work provides ridge regression bounds explicitly in terms of the vector $\beta$ (as a sequence) and in terms of the eigenspectrum of the second moment operator $\Sigma$. While the essential setting we study is not new, previous analyses make unnecessarily strong boundedness assumptions or fail to give a bound in the case $\lambda = 0$. Here we review the analyses of [23], [19], [4], and [20].
[23] assumes $\|x\| \le b_x$ and $|\langle \beta, x \rangle - y| \le b_{\mathrm{approx}}$ almost surely, and gives the bound
$$\|\hat\beta_\lambda - \beta\|_\Sigma^2 \le \lambda \|\hat\beta_\lambda - \beta\|^2 + c \cdot \frac{d_{1,\lambda} \cdot (b_{\mathrm{approx}} + b_x \|\hat\beta_\lambda - \beta\|)^2}{n}$$
for some $c > 0$, where $d_{1,\lambda}$ is the effective dimension at scale $\lambda$ as defined in (5). The quantity $\|\hat\beta_\lambda - \beta\|$ is then bounded by assuming $\|\beta\| < \infty$. Thus, the dominant terms of the final bound have explicit dependences on $b_{\mathrm{approx}}$ and $b_x$.

[19] assume that $|y| \le b_y$ and $\|x\| \le b_x$ almost surely, and prove the bound
$$\|\hat\beta_\lambda - \beta_\lambda\|_\Sigma^2 \le c' \cdot \frac{b_x^2 b_y^2}{n \lambda^2}$$
for some $c' > 0$ (and note that the bound becomes trivial when $\lambda = 0$); this is then used to bound $\|\hat\beta_\lambda - \beta\|_\Sigma^2$ under explicit assumptions on $\beta$.

[4] assume $\|x\| \le b_x$ almost surely, and prove the bound (in their Theorem 4)
$$\|\hat\beta_\lambda - \beta\|_\Sigma^2 \le c'' \cdot \left( \|\beta_\lambda - \beta\|_\Sigma^2 + \frac{b_x \|\beta_\lambda - \beta\|_\Sigma^2}{n\lambda} + \frac{\sigma^2 d_{1,\lambda}}{n} + o(1/n) \right).$$
Here, we also note that, if one desires the bound to hold with probability $\ge 1 - e^{-t}$ for some $t > 0$, then the leading factor $c'' > 1$ depends quadratically on $t$.

Finally, [20] explicitly require $|y| \le b_y$, and their main bound on $\|\hat\beta_\lambda - \beta\|_\Sigma^2$ (specialized for the ridge estimator) depends on $b_y$ in a dominant term. Moreover, this main bound contains $c''' \cdot (\lambda \|\beta_\lambda\|^2 + \|\beta_\lambda - \beta\|_\Sigma^2)$ as a dominant term for some $c''' > 1$, and it is only given under explicit decay conditions on the eigenspectrum (their Equation 6). The bound is also trivial when $\lambda = 0$.

Our result for ridge regression is given explicitly in terms of $\|\beta_\lambda - \beta\|_\Sigma^2$ (and therefore explicitly in terms of $\beta$ as a sequence, the eigenspectrum of $\Sigma$, and $\lambda$); this quantity vanishes when $\lambda = 0$ and can be small even when $\|\beta\|$ itself is large. We note that $\|\beta_\lambda - \beta\|_\Sigma^2$ is precisely the bias term from the classical fixed design analysis of ridge regression, and therefore is natural to expect in a random design analysis.

Recently, [3] derived sharp risk bounds for the ordinary least squares and ridge estimators (in addition to specially developed PAC-Bayesian estimators) in a random design setting under very mild moment assumptions using PAC-Bayesian techniques. Their nonasymptotic bound for ordinary least squares holds with probability at least $1 - e^{-t}$, but only for $t \le \ln n$; this is essentially due to their weak moment assumptions. By relying on stronger moment assumptions, we allow the probability tail parameter $t$ to be as large as $\Omega(n/d)$. Our analysis is also arguably more transparent and yields more reasonable quantitative bounds. The analysis of [3] for the ridge estimator is established only in an asymptotic sense and is therefore not directly comparable to the results provided here.

Finally, although the focus of our present work is on understanding the ordinary least squares and ridge estimators, it should also be mentioned that a number of other estimators have been considered in the literature with nonasymptotic prediction error bounds [14, 3, 13]. Indeed, the works of [3] and [13] propose estimators that require considerably weaker moment conditions on $x$ and $y$ to obtain optimal rates.

3. Random design regression

This section presents the main results of the paper on the excess mean squared error of the ridge estimator under random design (and its specialization to the ordinary least squares estimator). First, we review the standard fixed design analysis.
3.1. Review of fixed design analysis. It is informative to first review the fixed design analysis of the ridge estimator. Recall that, in this setting, the design points $x_1, x_2, \ldots, x_n$ are fixed (deterministic) vectors, and the responses $y_1, y_2, \ldots, y_n$ are independent random variables. Therefore, we define $\Sigma := \hat\Sigma = n^{-1} \sum_{i=1}^n x_i \otimes x_i$ (which is nonrandom), and assume it has eigenvectors $\{v_j\}$ and corresponding eigenvalues $\lambda_j := \langle v_j, \Sigma v_j \rangle$. As in the random design setting, the linear function $\beta := \sum_j \beta_j v_j$, where $\beta_j := (n\lambda_j)^{-1} \sum_{i=1}^n \langle v_j, x_i \rangle E[y_i]$, minimizes the expected mean squared error, i.e.,
$$\beta := \arg\min_w \frac{1}{n} \sum_{i=1}^n E[(\langle w, x_i \rangle - y_i)^2].$$
Similar to the random design setup, define $\operatorname{noise}(x_i) := y_i - E[y_i]$ and $\operatorname{approx}(x_i) := E[y_i] - \langle \beta, x_i \rangle$ for $i = 1, 2, \ldots, n$, so the following modeling equation holds:
$$y_i = \langle \beta, x_i \rangle + \operatorname{approx}(x_i) + \operatorname{noise}(x_i) \quad \text{for } i = 1, 2, \ldots, n.$$
Because $\Sigma = \hat\Sigma$, the ridge estimator $\hat\beta_\lambda$ in the fixed design setting is an unbiased estimator of the minimizer of the regularized mean squared error, i.e.,
$$E[\hat\beta_\lambda] = (\Sigma + \lambda I)^{-1} \left( \frac{1}{n} \sum_{i=1}^n x_i E[y_i] \right) = \arg\min_w \left\{ \frac{1}{n} \sum_{i=1}^n E[(\langle w, x_i \rangle - y_i)^2] + \lambda \|w\|^2 \right\}.$$
This unbiasedness implies that the expected mean squared error of $\hat\beta_\lambda$ has the bias-variance decomposition

(10)  $E[\|\hat\beta_\lambda - \beta\|_\Sigma^2] = \|E[\hat\beta_\lambda] - \beta\|_\Sigma^2 + E[\|\hat\beta_\lambda - E[\hat\beta_\lambda]\|_\Sigma^2]$.

The following bound on the expected excess mean squared error easily follows from this decomposition and the definition of $\beta$ (see, e.g., Proposition 7).

Proposition 1 (Ridge regression: fixed design). Fix $\lambda \ge 0$, and assume $\Sigma + \lambda I$ is invertible. If there exists $\sigma \ge 0$ such that $\operatorname{var}(y_i) \le \sigma^2$ for all $i = 1, 2, \ldots, n$, then
$$E[\|\hat\beta_\lambda - \beta\|_\Sigma^2] \le \sum_j \frac{\lambda_j \beta_j^2}{(\lambda_j/\lambda + 1)^2} + \frac{\sigma^2}{n} \sum_j \left( \frac{\lambda_j}{\lambda_j + \lambda} \right)^2,$$
with equality iff $\operatorname{var}(y_i) = \sigma^2$ for all $i = 1, 2, \ldots, n$.

Remark 5 (Effect of approximation error in fixed design). Observe that $\operatorname{approx}(x_i)$ has no effect on the expected excess mean squared error.

Remark 6 (Effective dimension). The second sum in the bound is equal to $d_{2,\lambda}$, a notion of effective dimension at regularization level $\lambda$.

Remark 7 (Ordinary least squares in fixed design). Setting $\lambda = 0$ gives the following bound for the ordinary least squares estimator $\hat\beta_0$:
$$E[\|\hat\beta_0 - \beta\|_\Sigma^2] \le \frac{\sigma^2 d}{n},$$
where, as before, equality holds iff $\operatorname{var}(y_i) = \sigma^2$ for all $i = 1, 2, \ldots, n$.

3.2. Ordinary least squares. Our analysis of the ordinary least squares estimator (under random design) is based on a simple decomposition of the excess mean squared error, similar to the one from the fixed design analysis. To state the decomposition, first let $\bar\beta_0$ denote the conditional expectation of the least squares estimator $\hat\beta_0$ conditioned on $x_1, x_2, \ldots, x_n$, i.e.,
$$\bar\beta_0 := E[\hat\beta_0 \mid x_1, x_2, \ldots, x_n] = \hat\Sigma^{-1} \hat{E}[x E[y \mid x]].$$
Also, define the bias and variance as
$$\varepsilon_{\mathrm{bs}} := \|\bar\beta_0 - \beta\|_\Sigma^2, \qquad \varepsilon_{\mathrm{vr}} := \|\hat\beta_0 - \bar\beta_0\|_\Sigma^2.$$

Proposition 2 (Random design decomposition). We have
$$\|\hat\beta_0 - \beta\|_\Sigma^2 \le \varepsilon_{\mathrm{bs}} + 2\sqrt{\varepsilon_{\mathrm{bs}} \varepsilon_{\mathrm{vr}}} + \varepsilon_{\mathrm{vr}} \le 2(\varepsilon_{\mathrm{bs}} + \varepsilon_{\mathrm{vr}}).$$

Proof. The claim follows from the triangle inequality and the fact that $(a + b)^2 \le 2(a^2 + b^2)$. □

Remark 8. Note that, in general, $E[\hat\beta_0] \neq \beta$ (unlike in the fixed design setting, where $E[\hat\beta_0] = \beta$). Hence, our decomposition differs from that in the fixed design analysis (see (10)). The decomposition is illustrated numerically in the sketch below.
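The following simulation sketch (our own; Gaussian covariates with $\Sigma = I$, a small nonlinearity standing in for $\operatorname{approx}(x)$) computes $\varepsilon_{\mathrm{bs}}$, $\varepsilon_{\mathrm{vr}}$, and the excess error; the final inequality holds for any reference vector by the triangle inequality, so the assertion is guaranteed even though the generating coefficients used here differ slightly from the true best linear predictor.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 5, 500, 0.5
beta = np.ones(d)
X = rng.standard_normal((n, d))                 # Sigma = I by construction
y_mean = X @ beta + 0.3 * np.sin(X[:, 0])       # E[y|x], with approx error
y = y_mean + sigma * rng.standard_normal(n)

Sigma_hat = X.T @ X / n
beta_hat = np.linalg.solve(Sigma_hat, X.T @ y / n)       # OLS estimator
beta_bar = np.linalg.solve(Sigma_hat, X.T @ y_mean / n)  # E[beta_hat | x's]

Sigma = np.eye(d)
eps_bs = (beta_bar - beta) @ Sigma @ (beta_bar - beta)
eps_vr = (beta_hat - beta_bar) @ Sigma @ (beta_hat - beta_bar)
excess = (beta_hat - beta) @ Sigma @ (beta_hat - beta)
assert excess <= 2 * (eps_bs + eps_vr) + 1e-12   # Proposition 2
```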
Our first main result characterizes the excess loss of the ordinary least squares estimator.

Theorem 1 (Ordinary least squares regression). Pick any $t > \max\{0, 2.6 - \log d\}$. Assume Condition 1 (with parameter $\rho_0$), Condition 2 (with $\sigma$), and Condition 3 (with $b_0$) hold, and that
$$n \ge 6 \rho_0^2 d (\log d + t).$$
With probability at least $1 - 3e^{-t}$, the following holds:

(1) Relative spectral norm error in $\hat\Sigma$: $\hat\Sigma$ is invertible, and $\|\Sigma^{1/2} \hat\Sigma^{-1} \Sigma^{1/2}\| \le (1 - \delta_s)^{-1}$, where $\Sigma$ is defined in (1), $\hat\Sigma$ is defined in (3), and
$$\delta_s := \sqrt{\frac{4 \rho_0^2 d (\log d + t)}{n}} + \frac{2 \rho_0^2 d (\log d + t)}{3n}$$
(note that the lower bound on $n$ ensures $\delta_s \le 0.93 < 1$).

(2) Effect of bias due to random design:
$$\varepsilon_{\mathrm{bs}} \le \frac{2}{(1 - \delta_s)^2} \left( \frac{E[\|\Sigma^{-1/2} x \operatorname{approx}(x)\|^2]}{n} (1 + \sqrt{8t})^2 + \frac{16 b_0^2 d t^2}{9 n^2} \right) \le \frac{2}{(1 - \delta_s)^2} \left( \frac{\rho_0^2 d\, E[\operatorname{approx}(x)^2]}{n} (1 + \sqrt{8t})^2 + \frac{16 b_0^2 d t^2}{9 n^2} \right),$$
where $\operatorname{approx}(x)$ is defined in (7).

(3) Effect of noise:
$$\varepsilon_{\mathrm{vr}} \le \frac{1}{1 - \delta_s} \cdot \frac{\sigma^2 (d + 2\sqrt{dt} + 2t)}{n}.$$

Remark 9 (Simplified form). Suppressing the terms that are $o(1/n)$, the overall bound from Theorem 1 is
$$\|\hat\beta_0 - \beta\|_\Sigma^2 \le \frac{2 E[\|\Sigma^{-1/2} x \operatorname{approx}(x)\|^2]}{n} (1 + \sqrt{8t})^2 + \frac{\sigma^2 (d + 2\sqrt{dt} + 2t)}{n} + o(1/n)$$
(so $b_0$ appears only in the $o(1/n)$ terms). If the linear model is correct (i.e., $E[y \mid x] = \langle \beta, x \rangle$ almost surely), then

(11)  $\|\hat\beta_0 - \beta\|_\Sigma^2 \le \dfrac{\sigma^2 (d + 2\sqrt{dt} + 2t)}{n} + o(1/n)$.

One can show that the constants in the first-order term in (11) are the same as those that one would obtain for a fixed design tail bound.

Remark 10 (Tightness of the bound). Since $\|\bar\beta_0 - \beta\|_\Sigma^2 = \|(\Sigma^{1/2} \hat\Sigma^{-1} \Sigma^{1/2})\, \hat{E}[\Sigma^{-1/2} x \operatorname{approx}(x)]\|^2$ and $\|\Sigma^{1/2} \hat\Sigma^{-1} \Sigma^{1/2} - I\| \to 0$ as $n \to \infty$ (Lemma 2), $\|\bar\beta_0 - \beta\|_\Sigma^2$ is within constant factors of $\|\hat{E}[\Sigma^{-1/2} x \operatorname{approx}(x)]\|^2$ for sufficiently large $n$. Moreover,
$$E[\|\hat{E}[\Sigma^{-1/2} x \operatorname{approx}(x)]\|^2] = \frac{E[\|\Sigma^{-1/2} x \operatorname{approx}(x)\|^2]}{n},$$
which is the main term that appears in the bound for $\varepsilon_{\mathrm{bs}}$. Similarly, $\|\hat\beta_0 - \bar\beta_0\|_\Sigma^2$ is within constant factors of $\|\hat\beta_0 - \bar\beta_0\|_{\hat\Sigma}^2$ for sufficiently large $n$, and
$$E[\|\hat\beta_0 - \bar\beta_0\|_{\hat\Sigma}^2] \le \frac{\sigma^2 d}{n},$$
with equality iff $\operatorname{var}(y) = \sigma^2$ (this comes from the fixed design risk bound in Remark 7). Therefore, in this case where $\operatorname{var}(y) = \sigma^2$, we conclude that the bound in Theorem 1 is tight up to constant factors and lower-order terms. A small simulation illustrating the rate in (11) is sketched below.
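The following Monte Carlo sketch (ours; well-specified Gaussian design with $\Sigma = I$) illustrates the first-order $\sigma^2 d / n$ scaling in (11): the average excess error should track $\sigma^2 d / n$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, trials = 10, 1.0, 200
beta = rng.standard_normal(d)
for n in (100, 400, 1600):
    errs = []
    for _ in range(trials):
        X = rng.standard_normal((n, d))               # Sigma = I
        y = X @ beta + sigma * rng.standard_normal(n)
        bh, *_ = np.linalg.lstsq(X, y, rcond=None)    # OLS
        errs.append(np.sum((bh - beta) ** 2))         # ||bh - beta||_Sigma^2
    print(n, np.mean(errs), sigma ** 2 * d / n)       # the two should match
```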
3.3. Random design ridge regression. The analysis of the ridge estimator under random design is again based on a simple decomposition of the excess mean squared error. Here, let $\bar\beta_\lambda$ denote the conditional expectation of $\hat\beta_\lambda$ given $x_1, x_2, \ldots, x_n$, i.e.,

(12)  $\bar\beta_\lambda := E[\hat\beta_\lambda \mid x_1, x_2, \ldots, x_n] = (\hat\Sigma + \lambda I)^{-1} \hat{E}[x E[y \mid x]]$.

Define the bias from regularization, the bias from the random design, and the variance as
$$\varepsilon_{\mathrm{rg}} := \|\beta_\lambda - \beta\|_\Sigma^2, \qquad \varepsilon_{\mathrm{bs}} := \|\bar\beta_\lambda - \beta_\lambda\|_\Sigma^2, \qquad \varepsilon_{\mathrm{vr}} := \|\hat\beta_\lambda - \bar\beta_\lambda\|_\Sigma^2,$$
where $\beta_\lambda$ is the minimizer of the regularized mean squared error (see (8)).

Proposition 3 (General random design decomposition).
$$\|\hat\beta_\lambda - \beta\|_\Sigma^2 \le \varepsilon_{\mathrm{rg}} + \varepsilon_{\mathrm{bs}} + \varepsilon_{\mathrm{vr}} + 2\big(\sqrt{\varepsilon_{\mathrm{rg}} \varepsilon_{\mathrm{bs}}} + \sqrt{\varepsilon_{\mathrm{rg}} \varepsilon_{\mathrm{vr}}} + \sqrt{\varepsilon_{\mathrm{bs}} \varepsilon_{\mathrm{vr}}}\big) \le 3(\varepsilon_{\mathrm{rg}} + \varepsilon_{\mathrm{bs}} + \varepsilon_{\mathrm{vr}}).$$

Proof. The claim follows from the triangle inequality and the fact that $(a + b + c)^2 \le 3(a^2 + b^2 + c^2)$. □

Remark 11. Again, note that $E[\hat\beta_\lambda] \neq \beta_\lambda$ in general, so the bias-variance decomposition in (10) from the fixed design analysis is not directly applicable in the random design setting.

The following theorem is the main result of the paper:

Theorem 2 (Ridge regression). Fix some $\lambda \ge 0$, and pick any $t > \max\{0, 2.6 - \log \tilde{d}_{1,\lambda}\}$. Assume Condition 1 (with parameter $\rho_\lambda$), Condition 2 (with parameter $\sigma$), and Condition 3 (with parameter $b_\lambda$) hold, and that
$$n \ge 6 \rho_\lambda^2 d_{1,\lambda} (\log \tilde{d}_{1,\lambda} + t),$$
where $d_{p,\lambda}$ for $p \in \{1, 2\}$ is defined in (5), and $\tilde{d}_{1,\lambda}$ is defined in (6). With probability at least $1 - 4e^{-t}$, the following holds:

(1) Relative spectral norm error in $\hat\Sigma + \lambda I$: $\hat\Sigma + \lambda I$ is invertible, and $\|(\Sigma + \lambda I)^{1/2} (\hat\Sigma + \lambda I)^{-1} (\Sigma + \lambda I)^{1/2}\| \le (1 - \delta_s)^{-1}$, where $\Sigma$ is defined in (1), $\hat\Sigma$ is defined in (3), and
$$\delta_s := \sqrt{\frac{4 \rho_\lambda^2 d_{1,\lambda} (\log \tilde{d}_{1,\lambda} + t)}{n}} + \frac{2 \rho_\lambda^2 d_{1,\lambda} (\log \tilde{d}_{1,\lambda} + t)}{3n}$$
(note that the lower bound on $n$ ensures $\delta_s \le 0.93 < 1$).

(2) Frobenius norm error in $\hat\Sigma$:
$$\|(\Sigma + \lambda I)^{-1/2} (\hat\Sigma - \Sigma) (\Sigma + \lambda I)^{-1/2}\|_F \le \sqrt{d_{1,\lambda}}\, \delta_f,$$
where
$$\delta_f := \sqrt{\frac{\rho_\lambda^2 d_{1,\lambda} - d_{2,\lambda}/d_{1,\lambda}}{n}}\, (1 + \sqrt{8t}) + \frac{4 \sqrt{\rho_\lambda^4 d_{1,\lambda} + d_{2,\lambda}/d_{1,\lambda}}\; t}{3n}.$$

(3) Effect of regularization:
$$\varepsilon_{\mathrm{rg}} \le \sum_j \frac{\lambda_j \beta_j^2}{(\lambda_j/\lambda + 1)^2}.$$
If $\lambda = 0$, then $\varepsilon_{\mathrm{rg}} = 0$.

(4) Effect of bias due to random design:
$$\varepsilon_{\mathrm{bs}} \le \frac{2}{(1 - \delta_s)^2} \left( \frac{E[\|(\Sigma + \lambda I)^{-1/2} (x \operatorname{approx}_\lambda(x) - \lambda \beta_\lambda)\|^2]}{n} (1 + \sqrt{8t})^2 + \frac{16 \big(b_\lambda \sqrt{d_{1,\lambda}} + \sqrt{\varepsilon_{\mathrm{rg}}}\big)^2 t^2}{9 n^2} \right) \le \frac{4}{(1 - \delta_s)^2} \left( \frac{\rho_\lambda^2 d_{1,\lambda} E[\operatorname{approx}_\lambda(x)^2] + \varepsilon_{\mathrm{rg}}}{n} (1 + \sqrt{8t})^2 + \frac{\big(b_\lambda \sqrt{d_{1,\lambda}} + \sqrt{\varepsilon_{\mathrm{rg}}}\big)^2 t^2}{n^2} \right),$$
where $\operatorname{approx}_\lambda(x)$ is defined in (9). If $\lambda = 0$, then $\operatorname{approx}_\lambda(x) = \operatorname{approx}(x)$ as defined in (7).

(5) Effect of noise:
$$\varepsilon_{\mathrm{vr}} \le \frac{\sigma^2 \big(d_{2,\lambda} + \sqrt{d_{1,\lambda} d_{2,\lambda}}\, \delta_f\big)}{n (1 - \delta_s)^2} + \frac{2 \sigma^2 \sqrt{\big(d_{2,\lambda} + \sqrt{d_{1,\lambda} d_{2,\lambda}}\, \delta_f\big)\, t}}{n (1 - \delta_s)^{3/2}} + \frac{2 \sigma^2 t}{n (1 - \delta_s)}.$$

We now discuss various aspects of Theorem 2.

Remark 12 (Simplified form). Ignoring the terms that are $o(1/n)$ and treating $t$ as a constant, the overall bound from Theorem 2 is
$$\|\hat\beta_\lambda - \beta\|_\Sigma^2 \le \|\beta_\lambda - \beta\|_\Sigma^2 + O\left( \frac{E[\|(\Sigma + \lambda I)^{-1/2} (x \operatorname{approx}_\lambda(x) - \lambda \beta_\lambda)\|^2] + \sigma^2 d_{2,\lambda}}{n} \right) \le \|\beta_\lambda - \beta\|_\Sigma^2 + O\left( \frac{\rho_\lambda^2 d_{1,\lambda} E[\operatorname{approx}_\lambda(x)^2] + \|\beta_\lambda - \beta\|_\Sigma^2 + \sigma^2 d_{2,\lambda}}{n} \right) \le \|\beta_\lambda - \beta\|_\Sigma^2 + O\left( \frac{\rho_\lambda^2 d_{1,\lambda} E[\operatorname{approx}(x)^2] + (\rho_\lambda^2 d_{1,\lambda} + 1) \|\beta_\lambda - \beta\|_\Sigma^2 + \sigma^2 d_{2,\lambda}}{n} \right),$$
where the last inequality follows from the fact that $\sqrt{E[\operatorname{approx}_\lambda(x)^2]} \le \sqrt{E[\operatorname{approx}(x)^2]} + \|\beta_\lambda - \beta\|_\Sigma$. The resulting bias-variance trade-off is sketched numerically after Remark 13 below.

Remark 13 (Effect of errors in $\hat\Sigma$). The accuracy of $\hat\Sigma$ has a relatively mild effect on the bound: it appears essentially through multiplicative factors $(1 - \delta_s)^{-1} = 1 + O(\delta_s)$ and $1 + \delta_f$, where both $\delta_s$ and $\delta_f$ decrease with $n$ (as $n^{-1/2}$), and therefore it only contributes to lower-order terms overall.
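The following sketch (ours; constants set to 1 for display, with an arbitrary polynomially decaying spectrum) evaluates the two leading terms of the simplified bound from Remark 12 in the well-specified case: the regularization bias $\varepsilon_{\mathrm{rg}}$ from Proposition 7 and the variance term $\sigma^2 d_{2,\lambda}/n$.

```python
import numpy as np

def ridge_bound_terms(eigvals, beta_coefs, sigma2, n, lam):
    """eps_rg = sum_j lam_j beta_j^2 / (lam_j/lam + 1)^2  (Proposition 7);
    var = sigma^2 d_{2,lambda} / n."""
    eps_rg = np.sum(eigvals * beta_coefs ** 2 / (eigvals / lam + 1.0) ** 2)
    d2 = np.sum((eigvals / (eigvals + lam)) ** 2)
    return eps_rg, sigma2 * d2 / n

eigvals = 1.0 / np.arange(1, 101) ** 2      # decaying spectrum (illustrative)
beta_coefs = np.ones(100)
for lam in (1e-4, 1e-3, 1e-2, 1e-1):
    b, v = ridge_bound_terms(eigvals, beta_coefs, 1.0, n=1000, lam=lam)
    print(f"lam={lam:g}  bias={b:.4f}  var={v:.4f}  total={b + v:.4f}")
```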
Remark 14 (Effect of approximation error). The effect of approximation error is isolated in the term $\|\bar\beta_\lambda - \beta_\lambda\|_\Sigma^2$. The bound on $\varepsilon_{\mathrm{bs}}$ scales with the fourth-moment quantity $E[\|(\Sigma + \lambda I)^{-1/2} (x \operatorname{approx}_\lambda(x) - \lambda \beta_\lambda)\|^2]$; when using the looser bound $O(\rho_\lambda^2 d_{1,\lambda} E[\operatorname{approx}(x)^2] + (\rho_\lambda^2 d_{1,\lambda} + 1) \|\beta_\lambda - \beta\|_\Sigma^2)$, the overall simplified bound from Remark 12 can be viewed as
$$E[(\langle \hat\beta_\lambda, x \rangle - E[y \mid x])^2 \mid \hat\beta_\lambda] \le E[(\langle \beta, x \rangle - E[y \mid x])^2] \left( 1 + \frac{c_1 \rho_\lambda^2 d_{1,\lambda}}{n} \right) + E[\langle \beta_\lambda - \beta, x \rangle^2] \left( 1 + \frac{c_2 (\rho_\lambda^2 d_{1,\lambda} + 1)}{n} \right) + \text{terms due to stochastic noise}$$
for some positive constants $c_1$ and $c_2$. Therefore, the (bound on the) mean squared error of $\hat\beta_\lambda$ is the sum of two contributions (up to lower-order terms): the first is a scaling of the approximation errors $E[(\langle \beta, x \rangle - E[y \mid x])^2] + E[\langle \beta_\lambda - \beta, x \rangle^2]$, where the scaling $1 + O((\rho_\lambda^2 d_{1,\lambda} + 1)/n)$ tends to one as $n \to \infty$; and the second is the stochastic noise contribution. The approximation error contribution is unique to random design, while the stochastic noise appears in both random and fixed design.

Remark 15 (Bounded covariates). Suppose $\operatorname{approx}(x) = 0$ and that there exists $r > 0$ such that $\|x\| \le r$ almost surely. This is the setting of a well-specified model with bounded covariates; the minimax risk over the class of models $\beta$ with $\|\beta\| \le B$ for some $B > 0$ is at least $\Omega(\sqrt{\sigma^2 r^2 B^2 / n})$ [17]. In this case, using the inequalities $\|\beta_\lambda - \beta\|_\Sigma^2 \le \lambda \|\beta\|^2 / 2$ and $d_{2,\lambda} \le \operatorname{tr}(\Sigma) / (2\lambda)$, the simplified bound from Remark 12 reduces to
$$\|\hat\beta_\lambda - \beta\|_\Sigma^2 \le \left( 1 + O\left( \frac{1 + r^2/\lambda}{n} \right) \right) \cdot \frac{\lambda \|\beta\|^2}{2} + \frac{\sigma^2}{n} \cdot \frac{\operatorname{tr}(\Sigma)}{2\lambda}.$$
Choosing $\lambda > 0$ to minimize the bound and using the fact that $\operatorname{tr}(\Sigma) \le r^2$ gives
$$\|\hat\beta_\lambda - \beta\|_\Sigma^2 \le \sqrt{\frac{\sigma^2 r^2 B^2}{n}} \cdot \big( 1 + O(1/n) \big) + O\left( \frac{r^2 B^2}{n} \right),$$
which matches the lower bound up to constant factors and lower-order terms.

Remark 16 (Application to smoothing splines). The applications of ridge regression considered by [23] can also be analyzed using Theorem 2 (although technically our result is only proved in the finite-dimensional setting). We specifically consider the problem of approximating a periodic function with smoothing splines, which are functions $f \colon \mathbb{R} \to \mathbb{R}$ whose $s$-th derivatives $f^{(s)}$, for some $s > 1/2$, satisfy
$$\int \big( f^{(s)}(t) \big)^2\, dt < \infty.$$
The one-dimensional covariate $t \in \mathbb{R}$ can be mapped to the infinite-dimensional representation $x := \phi(t) \in \mathbb{R}^\infty$, where
$$x_{2k} := \frac{\sin(kt)}{(k+1)^s} \quad \text{and} \quad x_{2k+1} := \frac{\cos(kt)}{(k+1)^s}, \qquad k \in \{0, 1, 2, \ldots\}.$$
Assume that the regression function is $E[y \mid x] = \langle \beta, x \rangle$, so $\operatorname{approx}(x) = 0$ almost surely. Observe that $\|x\|^2 \le \frac{2s}{2s-1}$, so Condition 1 is satisfied with
$$\rho_\lambda := \left( \frac{2s}{2s-1} \right)^{1/2} \frac{1}{\sqrt{\lambda d_{1,\lambda}}}$$
as per Remark 2. Therefore, the simplified bound from Remark 12 becomes, in this case,
$$\|\hat\beta_\lambda - \beta\|_\Sigma^2 \le \|\beta_\lambda - \beta\|_\Sigma^2 + C \cdot \left( \frac{2s}{2s-1} \cdot \frac{\|\beta_\lambda - \beta\|_\Sigma^2}{\lambda n} + \frac{\|\beta_\lambda - \beta\|_\Sigma^2 + \sigma^2 d_{2,\lambda}}{n} \right) \le \frac{\lambda \|\beta\|^2}{2} + C \cdot \frac{\sigma^2 d_{2,\lambda}}{n} + C \cdot \left( \frac{2s}{2s-1} + \frac{\lambda}{2} \right) \cdot \frac{\|\beta\|^2}{n}$$
for some constant $C > 0$, where we have used the inequality $\|\beta_\lambda - \beta\|_\Sigma^2 \le \lambda \|\beta\|^2 / 2$. [23] shows that
$$d_{1,\lambda} \le \inf_{k \ge 1} \left\{ 2k + \frac{2}{\lambda (2s-1) k^{2s-1}} \right\}.$$
Since $d_{2,\lambda} \le d_{1,\lambda}$, it follows that setting $\lambda := k^{-2s}$, where $k = \lfloor ((2s-1) n / (2s))^{1/(2s+1)} \rfloor$, gives the bound
$$\|\hat\beta_\lambda - \beta\|_\Sigma^2 \le \left( \frac{\|\beta\|^2}{2} + 2C\sigma^2 \right) \cdot \left( \frac{(2s-1)\, n}{2s} \right)^{-\frac{2s}{2s+1}} + \text{lower-order terms},$$
which has the optimal data-dependent rate of $n^{-\frac{2s}{2s+1}}$ [22]. This choice of $\lambda$ is simple arithmetic; a short sketch follows.
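A minimal sketch (ours) of the data-dependent choice of $\lambda$ from Remark 16, together with the resulting $n^{-2s/(2s+1)}$ rate; no fitting is performed, only the arithmetic of the recipe.

```python
import numpy as np

def spline_lambda(n, s):
    """lam = k^{-2s} with k = floor(((2s-1) n / (2s))^{1/(2s+1)})."""
    k = int(((2 * s - 1) * n / (2 * s)) ** (1.0 / (2 * s + 1)))
    return max(k, 1) ** (-2 * s)

s = 2
for n in (100, 1000, 10000):
    lam = spline_lambda(n, s)
    rate = ((2 * s - 1) * n / (2 * s)) ** (-2 * s / (2 * s + 1))
    print(n, lam, rate)     # rate decays like n^{-2s/(2s+1)}
```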
Remark 17 (Comparison with fixed design). As already discussed, the ridge estimator behaves similarly under fixed and random designs, with the main differences being the lack of errors in $\hat\Sigma$ under fixed design, and the influence of approximation error under random design. These are revealed through the quantities $\rho_\lambda$ and $d_{1,\lambda}$ (and $b_\lambda$ in lower-order terms), which are needed to apply the probability tail inequalities. Therefore, the scaling of $\rho_\lambda^2 d_{1,\lambda}$ with $\lambda$ crucially controls the effect of random design compared with fixed design.

4. Application to accelerating least squares computations

Our results for the ordinary least squares estimator can be used to analyze a randomized approximation scheme for overcomplete least squares problems [7, 18]. The goal of these randomized methods is to approximately solve the least squares problem
$$\min_{w \in \mathbb{R}^d} \frac{1}{m} \|Aw - b\|^2$$
for some large, full-rank design matrix $A \in \mathbb{R}^{m \times d}$ ($m \gg d$) and vector $b \in \mathbb{R}^m$. Note that using a standard method to exactly solve the least squares problem requires $\Omega(md^2)$ operations, which can be prohibitive for large-scale problems. However, when an approximate solution is satisfactory, significant computational savings can be achieved through the use of randomization.

4.1. A randomized approximation scheme for least squares. The approximation scheme is as follows:
(1) The columns of $A$ and the vector $b$ are first subjected to a randomly chosen rotation matrix (i.e., an orthogonal transformation) $\Theta \in \mathbb{R}^{m \times m}$. The distribution over rotation matrices that may be used is discussed below.
(2) A sample of $n$ rows of $[\Theta A, \Theta b] \in \mathbb{R}^{m \times (d+1)}$ is then selected uniformly at random with replacement; let $\{[x_i^\top, y_i] : i = 1, 2, \ldots, n\}$ (where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$) be the $n$ selected rows of $[\Theta A, \Theta b]$.
(3) Finally, the least squares problem
$$\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n (\langle w, x_i \rangle - y_i)^2$$
is solved by computing the ordinary least squares estimator $\hat\beta_0$ on the sample $\{(x_i, y_i) : i = 1, 2, \ldots, n\}$.

The motivation for the random rotation $\Theta$ is captured in Lemma 1, which shows that, if $\Theta$ is chosen randomly from certain distributions over rotation matrices, then applying $\Theta$ to $A$ and $b$ creates an equivalent least squares problem for which the statistical leverage parameter (the quantity $\rho_0$ in Condition 1) is small. Consequently, the new least squares problem can be approximately solved with a small random sample, as per Theorems 1 and 2. Without the random rotation, the statistical leverage parameter could be so large that a small random sample of the rows would likely miss a row crucial for obtaining an accurate approximation. The role of statistical leverage in this setting was also pointed out by [6], although Lemma 1 makes the connection more direct. We note that Lemma 1 and the analysis below can be generalized to the case where $\Theta$ is only approximately orthogonal; for most standard distributions over rotation matrices, the additional error terms that arise do not affect the overall analysis. A sketch of the full scheme in code is given below.
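The following sketch (ours; all function names are our own) implements the three steps above using a Haar-distributed rotation obtained via QR, in the style of Example 1. Forming the dense $m \times m$ rotation costs $\Omega(m^2 d)$ and is for illustration only; the fast Hadamard-based construction of Example 2 (Section 4.3) would be used in practice.

```python
import numpy as np

def randomized_least_squares(A, b, n_sub, rng):
    m, d = A.shape
    # Step 1: random rotation Theta (orthogonal m x m, Haar via QR).
    Q, R = np.linalg.qr(rng.standard_normal((m, m)))
    Q = Q * np.sign(np.diag(R))            # sign fix -> Haar distributed
    TA, Tb = Q @ A, Q @ b
    # Step 2: sample n_sub rows uniformly at random with replacement.
    idx = rng.integers(0, m, size=n_sub)
    Xs, ys = TA[idx], Tb[idx]
    # Step 3: ordinary least squares on the subsample.
    beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return beta

rng = np.random.default_rng(0)
m, d = 2000, 20
A = rng.standard_normal((m, d))
b = A @ np.ones(d) + rng.standard_normal(m)
beta_approx = randomized_least_squares(A, b, n_sub=400, rng=rng)
```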
The running time of the approximation scheme is given by (i) the time required to apply the $m \times m$ random rotation operator $\Theta$ to the original $m \times (d+1)$ matrix $[A, b]$ and randomly sample $n$ rows, plus (ii) the time to solve the least squares problem on the smaller design matrix of size $n \times d$. For (i), naïvely applying an arbitrary $m \times m$ rotation matrix requires $\Omega(m^2 d)$ operations; however, there are (distributions over) rotation matrices for which this running time can be reduced to $O(md \log m)$ (see Example 2 in Section 4.3 below), which is a considerable speed-up when $m$ is large. In fact, because only $n$ out of $m$ rows are to be retained anyway, this computation can be reduced to $O(md \log n)$ [1]. For (ii), standard methods can produce the ordinary least squares estimator or the ridge regression estimator with $O(nd^2)$ operations. Therefore, we are interested in the sample size $n$ that suffices to yield an accurate approximation.

4.2. Analysis of the approximation scheme. Our approach to analyzing the above approximation scheme is to treat it as a random design regression problem. We apply Theorem 1 in this setting to give error bounds for the solution produced by the approximation scheme.

Let $(x, y) \in \mathbb{R}^d \times \mathbb{R}$ be a random pair distributed uniformly over the rows of $[\Theta A, \Theta b]$, where we assume that $\Theta$ is randomly chosen from a suitable distribution over rotation matrices, such as those described in Example 1 and Example 2. Lemma 1 (below) implies that there exists a constant $c_0 > 0$ such that Condition 1 is satisfied with
$$\rho_0^2 \le c_0 \cdot \left( 1 + \frac{\log m + \tau}{d} \right)$$
with probability at least $1 - e^{-\tau}$ over the choice of the random rotation matrix $\Theta$. Henceforth, we condition on the event that this holds.

Let $\beta \in \mathbb{R}^d$ be the solution to the original least squares problem (i.e., $\beta := \arg\min_w \|Aw - b\|^2 / m$), and let $\hat\beta_0 \in \mathbb{R}^d$ be the ordinary least squares estimator computed on the random sample of the rows of $[\Theta A, \Theta b]$. Note that, for any $w \in \mathbb{R}^d$,
$$E[(\langle w, x \rangle - y)^2] = \frac{1}{m} \|\Theta A w - \Theta b\|^2 = \frac{1}{m} \|Aw - b\|^2.$$
Moreover, we may assume for simplicity that $y - \langle \beta, x \rangle = \operatorname{approx}(x)$ (i.e., there is no stochastic noise), so $E[\operatorname{approx}(x)^2] = E[(\langle \beta, x \rangle - y)^2] = \|A\beta - b\|^2 / m$. By Theorem 1, if at least
$$n \ge 6 \big( d + c_0 (\log m + \tau) \big) (\log d + t)$$
rows of $[\Theta A, \Theta b]$ are sampled, then the ordinary least squares estimator $\hat\beta_0$ satisfies the following approximation error guarantee (with probability at least $1 - 3e^{-t}$ over the random sample of rows):
$$\frac{1}{m} \|A \hat\beta_0 - b\|^2 \le \frac{1}{m} \|A\beta - b\|^2 \cdot \left( 1 + \frac{c_1 (d + \log m + \tau)\, t}{n} \right) + o(1/n)$$
for some constant $c_1 > 0$. We note that the $o(1/n)$ terms can be removed if one only requires constant probability of success (i.e., $\tau$ and $t$ treated as constants), as is considered by [7]. In this case, we achieve an error bound of
$$\frac{1}{m} \|A \hat\beta_0 - b\|^2 \le \frac{1}{m} \|A\beta - b\|^2 \cdot (1 + \epsilon)$$
for $\epsilon > 0$, provided that the number of rows sampled is
$$n \ge c_2 (d + \log m) \left( \frac{1}{\epsilon} + \log d \right)$$
for some constant $c_2 > 0$.
4.3. Random rotations and bounding statistical leverage. The following lemma gives a simple condition on the distribution of the random orthogonal matrix $\Theta \in \mathbb{R}^{m \times m}$ used to preprocess a data matrix $A$, so that Condition 1 is applicable to a random vector $x$ drawn uniformly from the rows of $\Theta A$. Its proof is a straightforward application of Lemma 8.

Lemma 1. Fix any $\tau > 0$ and $\lambda \ge 0$. Suppose $\Theta \in \mathbb{R}^{m \times m}$ is a random orthogonal matrix and $\kappa > 0$ is a constant such that

(13)  $E\big[ \exp\big( \alpha^\top (\sqrt{m}\, \Theta^\top e_i) \big) \big] \le \exp(\kappa \|\alpha\|^2 / 2), \quad \forall \alpha \in \mathbb{R}^m,\ \forall i = 1, 2, \ldots, m$,

where $e_i$ is the $i$-th coordinate vector in $\mathbb{R}^m$. Let $A \in \mathbb{R}^{m \times d}$ be any matrix of rank $d$, and let $\Sigma := (1/m)(\Theta A)^\top (\Theta A) = (1/m) A^\top A$. There exists
$$\rho_\lambda^2 \le \kappa \left( 1 + 2\sqrt{\frac{\log m + \tau}{d_{1,\lambda}}} + \frac{2(\log m + \tau)}{d_{1,\lambda}} \right)$$
such that
$$\Pr\left[ \max_{i=1,2,\ldots,m} \|(\Sigma + \lambda I)^{-1/2} (\Theta A)^\top e_i\|^2 > \rho_\lambda^2 d_{1,\lambda} \right] \le e^{-\tau},$$
where $d_{1,\lambda} := \sum_{j=1}^d \frac{\lambda_j}{\lambda_j + \lambda}$ and $\{\lambda_1, \lambda_2, \ldots, \lambda_d\}$ are the eigenvalues of $\Sigma$.

Proof. Let $z_i := \sqrt{m}\, \Theta^\top e_i$ for each $i = 1, 2, \ldots, m$. Let $U \in \mathbb{R}^{m \times d}$ be a matrix of left orthonormal singular vectors of $(1/\sqrt{m}) A$, and let $D_\lambda := \operatorname{diag}\big( \frac{\lambda_1}{\lambda_1 + \lambda}, \frac{\lambda_2}{\lambda_2 + \lambda}, \ldots, \frac{\lambda_d}{\lambda_d + \lambda} \big)$. Note that $D_\lambda = I$ if $\lambda = 0$. We have
$$\|(\Sigma + \lambda I)^{-1/2} (\Theta A)^\top e_i\| = \|\sqrt{m}\, D_\lambda^{1/2} U^\top \Theta^\top e_i\| = \|D_\lambda^{1/2} U^\top z_i\|.$$
Since $\operatorname{tr}(U D_\lambda U^\top) = d_{1,\lambda}$, $\operatorname{tr}(U D_\lambda^2 U^\top) \le d_{1,\lambda}$, and $\lambda_{\max}[U D_\lambda U^\top] \le 1$, Lemma 8 implies
$$\Pr\left[ \|D_\lambda^{1/2} U^\top z_i\|^2 > \kappa \left( d_{1,\lambda} + 2\sqrt{d_{1,\lambda} (\log m + \tau)} + 2(\log m + \tau) \right) \right] \le e^{-\tau}/m.$$
Therefore, by a union bound,
$$\Pr\left[ \max_{i=1,2,\ldots,m} \|(\Sigma + \lambda I)^{-1/2} (\Theta A)^\top e_i\|^2 > \kappa \left( d_{1,\lambda} + 2\sqrt{d_{1,\lambda} (\log m + \tau)} + 2(\log m + \tau) \right) \right] \le e^{-\tau}. \;\square$$

Below, we give two simple examples under which the condition (13) in Lemma 1 holds.

Example 1. Let $\Theta$ be distributed uniformly over all $m \times m$ orthogonal matrices. Fix any $i = 1, 2, \ldots, m$. The random vector $v := \Theta^\top e_i$ is distributed uniformly on the unit sphere $S^{m-1}$. Let $l$ be a $\chi$ random variable with $m$ degrees of freedom, independent of $v$, so $z := l v$ follows an isotropic multivariate Gaussian distribution. By Jensen's inequality and the fact that $E[\exp(q^\top z)] \le \exp(\|q\|^2 / 2)$ for any vector $q \in \mathbb{R}^m$,
$$E\big[ \exp\big( \alpha^\top (\sqrt{m}\, \Theta^\top e_i) \big) \big] = E\big[ \exp\big( \alpha^\top (\sqrt{m}\, v) \big) \big] = E\left[ E\left[ \exp\left( \frac{\sqrt{m}}{E[l]}\, \alpha^\top (E[l]\, v) \right) \,\middle|\, v \right] \right] \le E\left[ \exp\left( \frac{\sqrt{m}}{E[l]}\, \alpha^\top (l v) \right) \right] = E\left[ \exp\left( \frac{\sqrt{m}}{E[l]}\, \alpha^\top z \right) \right] \le \exp\left( \frac{\|\alpha\|^2 m}{2 E[l]^2} \right) \le \exp\left( \frac{\|\alpha\|^2}{2} \left( 1 - \frac{1}{4m} - \frac{1}{360 m^3} \right)^{-2} \right),$$
where the last inequality is due to the following lower estimate for $\chi$ random variables:
$$E[l] \ge \sqrt{m} \left( 1 - \frac{1}{4m} - \frac{1}{360 m^3} \right).$$
Therefore, the condition (13) is satisfied with $\kappa = 1 + O(1/m)$.

Example 2. Let $m$ be a power of two, and let $\Theta := H \operatorname{diag}(s) / \sqrt{m}$, where $H \in \{\pm 1\}^{m \times m}$ is the $m \times m$ Hadamard matrix, and $s := (s_1, s_2, \ldots, s_m) \in \{\pm 1\}^m$ is a vector of $m$ Rademacher variables (i.e., $s_1, s_2, \ldots, s_m$ are i.i.d. with $\Pr[s_1 = 1] = \Pr[s_1 = -1] = 1/2$). It is easy to check that $\Theta$ is an orthogonal matrix. The random rotation $\Theta$ is a key component of the fast Johnson-Lindenstrauss transform of [1], also used by [7]. It is especially important for the present application because it can be applied to vectors with $O(m \log m)$ operations, which is significantly faster than the $\Omega(m^2)$ running time of naïve matrix-vector multiplication. For each $i = 1, 2, \ldots, m$, the distribution of $\sqrt{m}\, \Theta^\top e_i$ is the same as that of $s$, and therefore
$$E\big[ \exp\big( \alpha^\top (\sqrt{m}\, \Theta^\top e_i) \big) \big] = E[\exp(\alpha^\top s)] \le \exp(\|\alpha\|^2 / 2),$$
where the last step follows by Hoeffding's inequality. Therefore, the condition (13) is satisfied with $\kappa = 1$. A sketch of this transform is given below.
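The following sketch (our own implementation, not the paper's) applies the Example 2 rotation $\Theta = H \operatorname{diag}(s)/\sqrt{m}$ to a vector via the $O(m \log m)$ fast Walsh-Hadamard transform, and checks that it preserves norms.

```python
import numpy as np

def fwht(v):
    """Fast Walsh-Hadamard transform; len(v) must be a power of two."""
    v = v.copy()
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            a = v[i:i + h].copy()
            b = v[i + h:i + 2 * h].copy()
            v[i:i + h], v[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return v

rng = np.random.default_rng(0)
m = 8
s = rng.choice([-1.0, 1.0], size=m)     # Rademacher signs
x = rng.standard_normal(m)
y = fwht(s * x) / np.sqrt(m)            # Theta x with Theta = H diag(s)/sqrt(m)
assert np.isclose(np.linalg.norm(y), np.linalg.norm(x))  # orthogonality
```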
5. Proofs of Theorem 1 and Theorem 2

The proof of Theorem 2 uses the decomposition of $\|\hat\beta_\lambda - \beta\|_\Sigma^2$ in Proposition 3, and then bounds each term using the lemmas proved in this section. The proof of Theorem 1 omits one term from the decomposition in Proposition 3, due to the fact that $\beta = \beta_\lambda$ when $\lambda = 0$; and it uses a slightly simpler argument to handle the effect of noise (Lemma 6 rather than Lemma 7), which reduces the number of lower-order terms. Other than these differences, the proof is the same as that of Theorem 2 in the special case of $\lambda = 0$.

Define

(14)  $\Sigma_\lambda := \Sigma + \lambda I$,

(15)  $\hat\Sigma_\lambda := \hat\Sigma + \lambda I$, and

(16)  $\Delta_\lambda := \Sigma_\lambda^{-1/2} (\hat\Sigma - \Sigma) \Sigma_\lambda^{-1/2} = \Sigma_\lambda^{-1/2} (\hat\Sigma_\lambda - \Sigma_\lambda) \Sigma_\lambda^{-1/2}$.

Recall the basic decomposition from Proposition 3:
$$\|\hat\beta_\lambda - \beta\|_\Sigma^2 \le \big( \|\beta_\lambda - \beta\|_\Sigma + \|\bar\beta_\lambda - \beta_\lambda\|_\Sigma + \|\hat\beta_\lambda - \bar\beta_\lambda\|_\Sigma \big)^2.$$
Section 5.1 first establishes basic properties of $\beta$ and $\beta_\lambda$, which are then used to bound $\|\beta_\lambda - \beta\|_\Sigma^2$; this part is exactly the same as the standard fixed design analysis of ridge regression. Section 5.2 employs probability tail inequalities for the spectral and Frobenius norms of random matrices to bound the matrix errors in estimating $\Sigma$ with $\hat\Sigma$. Finally, Section 5.3 and Section 5.4 bound the contributions of approximation error (in $\|\bar\beta_\lambda - \beta_\lambda\|_\Sigma^2$) and noise (in $\|\hat\beta_\lambda - \bar\beta_\lambda\|_\Sigma^2$), respectively, using probability tail inequalities for random vectors as well as the matrix error bounds for $\hat\Sigma$.

5.1. Basic properties of $\beta$ and $\beta_\lambda$, and the effect of regularization. The following propositions are well known in the study of inverse problems:

Proposition 4 (Normal equations). $E[\langle w, x \rangle y] = E[\langle w, x \rangle \langle \beta, x \rangle]$ for any $w$.

Proof. It suffices to prove the claim for $w = v_j$. Since $E[\langle v_j, x \rangle \langle v_{j'}, x \rangle] = 0$ for $j' \neq j$, it follows that $E[\langle v_j, x \rangle \langle \beta, x \rangle] = \sum_{j'} \beta_{j'} E[\langle v_j, x \rangle \langle v_{j'}, x \rangle] = \beta_j E[\langle v_j, x \rangle^2] = E[\langle v_j, x \rangle y]$, where the last equality follows from the definition of $\beta$ in (2). □

Proposition 5 (Excess mean squared error). $E[(\langle w, x \rangle - y)^2] - E[(\langle \beta, x \rangle - y)^2] = E[\langle w - \beta, x \rangle^2]$ for any $w$.

Proof. Directly expanding the squares in the expectations reveals that
$$E[(\langle w, x \rangle - y)^2] - E[(\langle \beta, x \rangle - y)^2] = E[\langle w, x \rangle^2] - 2 E[\langle w, x \rangle y] + 2 E[\langle \beta, x \rangle y] - E[\langle \beta, x \rangle^2] = E[\langle w, x \rangle^2] - 2 E[\langle w, x \rangle \langle \beta, x \rangle] + 2 E[\langle \beta, x \rangle \langle \beta, x \rangle] - E[\langle \beta, x \rangle^2] = E[\langle w, x \rangle^2 - 2 \langle w, x \rangle \langle \beta, x \rangle + \langle \beta, x \rangle^2] = E[\langle w - \beta, x \rangle^2],$$
where the second equality follows from Proposition 4. □

Proposition 6 (Shrinkage). For any $j$,
$$\langle v_j, \beta_\lambda \rangle = \frac{\lambda_j}{\lambda_j + \lambda}\, \beta_j.$$

Proof. Since $(\Sigma + \lambda I)^{-1} = \sum_j (\lambda_j + \lambda)^{-1} v_j \otimes v_j$,
$$\langle v_j, \beta_\lambda \rangle = \langle v_j, (\Sigma + \lambda I)^{-1} E[xy] \rangle = \frac{1}{\lambda_j + \lambda} E[\langle v_j, x \rangle y] = \frac{\lambda_j}{\lambda_j + \lambda} \cdot \frac{E[\langle v_j, x \rangle y]}{E[\langle v_j, x \rangle^2]} = \frac{\lambda_j}{\lambda_j + \lambda}\, \beta_j. \;\square$$

Proposition 7 (Effect of regularization).
$$\|\beta - \beta_\lambda\|_\Sigma^2 = \sum_j \frac{\lambda_j \beta_j^2}{(\lambda_j/\lambda + 1)^2}.$$

Proof. By Proposition 6,
$$\langle v_j, \beta - \beta_\lambda \rangle = \beta_j - \frac{\lambda_j}{\lambda_j + \lambda} \beta_j = \frac{\lambda}{\lambda_j + \lambda} \beta_j.$$
Therefore,
$$\|\beta - \beta_\lambda\|_\Sigma^2 = \sum_j \lambda_j \left( \frac{\lambda}{\lambda_j + \lambda} \beta_j \right)^2 = \sum_j \frac{\lambda_j \beta_j^2}{(\lambda_j/\lambda + 1)^2}. \;\square$$

These shrinkage identities are easy to verify numerically; see the sketch below.
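A quick numeric check (ours) of Proposition 6 on a random positive definite $\Sigma$: the coordinates of $\beta_\lambda$ in the eigenbasis are the shrunk coordinates of $\beta$. Here $E[xy] = \Sigma\beta$ follows from the normal equations (Proposition 4).

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam = 6, 0.5
eigvals = rng.uniform(0.1, 2.0, size=d)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # eigenvectors v_j
Sigma = Q @ np.diag(eigvals) @ Q.T
beta = rng.standard_normal(d)
Exy = Sigma @ beta                                 # E[x y] = Sigma beta
beta_lam = np.linalg.solve(Sigma + lam * np.eye(d), Exy)
lhs = Q.T @ beta_lam                               # <v_j, beta_lambda>
rhs = eigvals / (eigvals + lam) * (Q.T @ beta)     # shrinkage, Proposition 6
assert np.allclose(lhs, rhs)
```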
5.2. Effect of errors in $\hat\Sigma$.

Lemma 2 (Spectral norm error in $\hat\Sigma$). Assume Condition 1 (with parameter $\rho_\lambda$) holds. Pick $t > \max\{0, 2.6 - \log \tilde{d}_{1,\lambda}\}$. With probability at least $1 - e^{-t}$,
$$\|\Delta_\lambda\| \le \sqrt{\frac{4 \rho_\lambda^2 d_{1,\lambda} (\log \tilde{d}_{1,\lambda} + t)}{n}} + \frac{2 \rho_\lambda^2 d_{1,\lambda} (\log \tilde{d}_{1,\lambda} + t)}{3n},$$
where $\Delta_\lambda$ is defined in (16).

Proof. The claim is a consequence of the tail inequality from Lemma 10. First, define $\tilde{x} := \Sigma_\lambda^{-1/2} x$ and $\tilde\Sigma := \Sigma_\lambda^{-1/2} \Sigma \Sigma_\lambda^{-1/2}$ (where $\Sigma_\lambda$ is defined in (14)), and let
$$Z := \tilde{x} \otimes \tilde{x} - \tilde\Sigma = \Sigma_\lambda^{-1/2} (x \otimes x - \Sigma) \Sigma_\lambda^{-1/2},$$
so $\Delta_\lambda = \hat{E}[Z]$. Observe that $E[Z] = 0$ and
$$\|Z\| = \max\{\lambda_{\max}[Z], \lambda_{\max}[-Z]\} \le \max\{\|\tilde{x}\|^2, 1\} \le \rho_\lambda^2 d_{1,\lambda},$$
where the second inequality follows from Condition 1. Moreover,
$$E[Z^2] = E[(\tilde{x} \otimes \tilde{x})^2] - \tilde\Sigma^2 = E[\|\tilde{x}\|^2 (\tilde{x} \otimes \tilde{x})] - \tilde\Sigma^2,$$
so
$$\lambda_{\max}[E[Z^2]] \le \lambda_{\max}[E[(\tilde{x} \otimes \tilde{x})^2]] \le \rho_\lambda^2 d_{1,\lambda} \lambda_{\max}[\tilde\Sigma] \le \rho_\lambda^2 d_{1,\lambda}, \qquad \operatorname{tr}(E[Z^2]) \le \operatorname{tr}(E[\|\tilde{x}\|^2 (\tilde{x} \otimes \tilde{x})]) \le \rho_\lambda^2 d_{1,\lambda} \operatorname{tr}(\tilde\Sigma) = \rho_\lambda^2 d_{1,\lambda}^2.$$
The claim now follows from Lemma 10 (recall that $\tilde{d}_{1,\lambda} = \max\{1, d_{1,\lambda}\}$). □

Lemma 3 (Relative spectral norm error in $\hat\Sigma_\lambda$). If $\|\Delta_\lambda\| < 1$, where $\Delta_\lambda$ is defined in (16), then
$$\|\Sigma_\lambda^{1/2} \hat\Sigma_\lambda^{-1} \Sigma_\lambda^{1/2}\| \le \frac{1}{1 - \|\Delta_\lambda\|},$$
where $\Sigma_\lambda$ is defined in (14) and $\hat\Sigma_\lambda$ is defined in (15).

Proof. Observe that
$$\Sigma_\lambda^{-1/2} \hat\Sigma_\lambda \Sigma_\lambda^{-1/2} = \Sigma_\lambda^{-1/2} (\Sigma_\lambda + \hat\Sigma_\lambda - \Sigma_\lambda) \Sigma_\lambda^{-1/2} = I + \Sigma_\lambda^{-1/2} (\hat\Sigma_\lambda - \Sigma_\lambda) \Sigma_\lambda^{-1/2} = I + \Delta_\lambda,$$
and that $\lambda_{\min}[I + \Delta_\lambda] \ge 1 - \|\Delta_\lambda\| > 0$ by the assumption $\|\Delta_\lambda\| < 1$ and Weyl's theorem [10]. Therefore,
$$\|\Sigma_\lambda^{1/2} \hat\Sigma_\lambda^{-1} \Sigma_\lambda^{1/2}\| = \lambda_{\max}\big[ (\Sigma_\lambda^{-1/2} \hat\Sigma_\lambda \Sigma_\lambda^{-1/2})^{-1} \big] = \lambda_{\max}[(I + \Delta_\lambda)^{-1}] = \frac{1}{\lambda_{\min}[I + \Delta_\lambda]} \le \frac{1}{1 - \|\Delta_\lambda\|}. \;\square$$

Lemma 4 (Frobenius norm error in $\hat\Sigma$). Assume Condition 1 (with parameter $\rho_\lambda$) holds. Pick any $t > 0$. With probability at least $1 - e^{-t}$,
$$\|\Delta_\lambda\|_F \le \sqrt{\frac{E[\|\Sigma_\lambda^{-1/2} x\|^4] - d_{2,\lambda}}{n}}\, (1 + \sqrt{8t}) + \frac{4 \sqrt{\rho_\lambda^4 d_{1,\lambda}^2 + d_{2,\lambda}}\; t}{3n} \le \sqrt{\frac{\rho_\lambda^2 d_{1,\lambda}^2 - d_{2,\lambda}}{n}}\, (1 + \sqrt{8t}) + \frac{4 \sqrt{\rho_\lambda^4 d_{1,\lambda}^2 + d_{2,\lambda}}\; t}{3n},$$
where $\Delta_\lambda$ is defined in (16).

Proof. The claim is a consequence of the tail inequality in Lemma 9. As in the proof of Lemma 2, define $\tilde{x} := \Sigma_\lambda^{-1/2} x$ and $\tilde\Sigma := \Sigma_\lambda^{-1/2} \Sigma \Sigma_\lambda^{-1/2}$, and let $Z := \tilde{x} \otimes \tilde{x} - \tilde\Sigma$, so $\Delta_\lambda = \hat{E}[Z]$. Now endow the space of self-adjoint linear operators with the inner product given by $\langle A, B \rangle_F := \operatorname{tr}(AB)$, and note that this inner product induces the Frobenius norm $\|M\|_F = \sqrt{\langle M, M \rangle_F}$. Observe that $E[Z] = 0$ and
$$\|Z\|_F^2 = \langle \tilde{x} \otimes \tilde{x} - \tilde\Sigma,\, \tilde{x} \otimes \tilde{x} - \tilde\Sigma \rangle_F = \langle \tilde{x} \otimes \tilde{x}, \tilde{x} \otimes \tilde{x} \rangle_F - 2 \langle \tilde{x} \otimes \tilde{x}, \tilde\Sigma \rangle_F + \langle \tilde\Sigma, \tilde\Sigma \rangle_F = \|\tilde{x}\|^4 - 2 \|\tilde{x}\|_{\tilde\Sigma}^2 + \operatorname{tr}(\tilde\Sigma^2) = \|\tilde{x}\|^4 - 2 \|\tilde{x}\|_{\tilde\Sigma}^2 + d_{2,\lambda} \le \rho_\lambda^4 d_{1,\lambda}^2 + d_{2,\lambda},$$
where the inequality follows from Condition 1. Moreover,
$$E[\|Z\|_F^2] = E[\langle \tilde{x} \otimes \tilde{x}, \tilde{x} \otimes \tilde{x} \rangle_F] - \langle \tilde\Sigma, \tilde\Sigma \rangle_F = E[\|\tilde{x}\|^4] - d_{2,\lambda} \le \rho_\lambda^2 d_{1,\lambda} E[\|\tilde{x}\|^2] - d_{2,\lambda} = \rho_\lambda^2 d_{1,\lambda}^2 - d_{2,\lambda},$$
where the inequality again uses Condition 1. The claim now follows from Lemma 9. □

The $n^{-1/2}$ decay of $\|\Delta_\lambda\|$ in Lemma 2 is illustrated in the sketch below.
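A Monte Carlo sketch (ours; $\Sigma = I$ and $\lambda = 0$, so $\Delta_0 = \hat\Sigma - I$) of the $n^{-1/2}$ decay in Lemma 2: the average spectral norm error shrinks at the same rate as $\sqrt{d/n}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
for n in (200, 800, 3200):
    norms = []
    for _ in range(100):
        X = rng.standard_normal((n, d))
        Delta = X.T @ X / n - np.eye(d)          # Delta_0 = Sigma_hat - I
        norms.append(np.linalg.norm(Delta, 2))   # spectral norm
    print(n, np.mean(norms), np.sqrt(d / n))     # both shrink like n^{-1/2}
```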
5.3. Effect of approximation error.

Lemma 5 (Effect of approximation error). Assume Condition 1 (with parameter $\rho_\lambda$) and Condition 3 (with parameter $b_\lambda$) hold. Pick any $t > 0$. If $\|\Delta_\lambda\| < 1$, where $\Delta_\lambda$ is defined in (16), then
$$\|\bar\beta_\lambda - \beta_\lambda\|_\Sigma \le \frac{1}{1 - \|\Delta_\lambda\|}\, \|\hat{E}[x \operatorname{approx}_\lambda(x) - \lambda \beta_\lambda]\|_{\Sigma_\lambda^{-1}},$$
where $\bar\beta_\lambda$ is defined in (12), $\beta_\lambda$ is defined in (8), $\operatorname{approx}_\lambda(x)$ is defined in (9), and $\Sigma_\lambda$ is defined in (14). Moreover, with probability at least $1 - e^{-t}$,
$$\|\hat{E}[x \operatorname{approx}_\lambda(x) - \lambda \beta_\lambda]\|_{\Sigma_\lambda^{-1}} \le \sqrt{\frac{E[\|\Sigma_\lambda^{-1/2} (x \operatorname{approx}_\lambda(x) - \lambda \beta_\lambda)\|^2]}{n}}\, (1 + \sqrt{8t}) + \frac{4 \big( b_\lambda \sqrt{d_{1,\lambda}} + \|\beta - \beta_\lambda\|_\Sigma \big)\, t}{3n} \le \sqrt{\frac{2 \big( \rho_\lambda^2 d_{1,\lambda} E[\operatorname{approx}_\lambda(x)^2] + \|\beta - \beta_\lambda\|_\Sigma^2 \big)}{n}}\, (1 + \sqrt{8t}) + \frac{4 \big( b_\lambda \sqrt{d_{1,\lambda}} + \|\beta - \beta_\lambda\|_\Sigma \big)\, t}{3n}.$$

Proof. By the definitions of $\bar\beta_\lambda$ and $\beta_\lambda$,
$$\bar\beta_\lambda - \beta_\lambda = \hat\Sigma_\lambda^{-1} \big( \hat{E}[x E[y \mid x]] - \hat\Sigma_\lambda \beta_\lambda \big) = \Sigma_\lambda^{-1/2} (\Sigma_\lambda^{1/2} \hat\Sigma_\lambda^{-1} \Sigma_\lambda^{1/2}) \Sigma_\lambda^{-1/2} \big( \hat{E}[x (\operatorname{approx}(x) + \langle \beta, x \rangle)] - \hat\Sigma \beta_\lambda - \lambda \beta_\lambda \big) = \Sigma_\lambda^{-1/2} (\Sigma_\lambda^{1/2} \hat\Sigma_\lambda^{-1} \Sigma_\lambda^{1/2}) \Sigma_\lambda^{-1/2} \big( \hat{E}[x (\operatorname{approx}(x) + \langle \beta, x \rangle - \langle \beta_\lambda, x \rangle)] - \lambda \beta_\lambda \big) = \Sigma_\lambda^{-1/2} (\Sigma_\lambda^{1/2} \hat\Sigma_\lambda^{-1} \Sigma_\lambda^{1/2}) \Sigma_\lambda^{-1/2} \big( \hat{E}[x \operatorname{approx}_\lambda(x) - \lambda \beta_\lambda] \big).$$
Therefore, using the submultiplicative property of the spectral norm,
$$\|\bar\beta_\lambda - \beta_\lambda\|_\Sigma \le \|\Sigma^{1/2} \Sigma_\lambda^{-1/2}\|\, \|\Sigma_\lambda^{1/2} \hat\Sigma_\lambda^{-1} \Sigma_\lambda^{1/2}\|\, \|\hat{E}[x \operatorname{approx}_\lambda(x) - \lambda \beta_\lambda]\|_{\Sigma_\lambda^{-1}} \le \frac{1}{1 - \|\Delta_\lambda\|}\, \|\hat{E}[x \operatorname{approx}_\lambda(x) - \lambda \beta_\lambda]\|_{\Sigma_\lambda^{-1}},$$
where the second inequality follows from Lemma 3 and because
$$\|\Sigma^{1/2} \Sigma_\lambda^{-1/2}\|^2 = \lambda_{\max}[\Sigma_\lambda^{-1/2} \Sigma \Sigma_\lambda^{-1/2}] = \max_j \frac{\lambda_j}{\lambda_j + \lambda} \le 1.$$

The second part of the claim is a consequence of the tail inequality in Lemma 9. Observe that $E[x \operatorname{approx}(x)] = E[x (E[y \mid x] - \langle \beta, x \rangle)] = 0$ by Proposition 4, and that $E[x \langle \beta - \beta_\lambda, x \rangle] - \lambda \beta_\lambda = \Sigma \beta - (\Sigma + \lambda I) \beta_\lambda = 0$. Therefore,
$$E[\Sigma_\lambda^{-1/2} (x \operatorname{approx}_\lambda(x) - \lambda \beta_\lambda)] = \Sigma_\lambda^{-1/2} E[x (\operatorname{approx}(x) + \langle \beta - \beta_\lambda, x \rangle) - \lambda \beta_\lambda] = 0.$$
Moreover, by Proposition 6 and Proposition 7,

(17)  $\|\lambda \Sigma_\lambda^{-1/2} \beta_\lambda\|^2 = \sum_j \dfrac{\lambda^2}{\lambda_j + \lambda} \langle v_j, \beta_\lambda \rangle^2 = \sum_j \dfrac{\lambda^2}{\lambda_j + \lambda} \left( \dfrac{\lambda_j}{\lambda_j + \lambda} \beta_j \right)^2 \le \sum_j \dfrac{\lambda^2}{\lambda_j + \lambda} \cdot \dfrac{\lambda_j}{\lambda_j + \lambda}\, \beta_j^2 = \sum_j \dfrac{\lambda_j \beta_j^2}{(\lambda_j/\lambda + 1)^2} = \|\beta - \beta_\lambda\|_\Sigma^2$.

Combining the inequality from (17) with Condition 3 and the triangle inequality, it follows that
$$\|\Sigma_\lambda^{-1/2} (x \operatorname{approx}_\lambda(x) - \lambda \beta_\lambda)\| \le \|\Sigma_\lambda^{-1/2} x \operatorname{approx}_\lambda(x)\| + \|\lambda \Sigma_\lambda^{-1/2} \beta_\lambda\| \le b_\lambda \sqrt{d_{1,\lambda}} + \|\beta - \beta_\lambda\|_\Sigma.$$
Finally, by the triangle inequality, the fact that $(a + b)^2 \le 2(a^2 + b^2)$, the inequality from (17), and Condition 1,
$$E[\|\Sigma_\lambda^{-1/2} (x \operatorname{approx}_\lambda(x) - \lambda \beta_\lambda)\|^2] \le 2 \big( E[\|\Sigma_\lambda^{-1/2} x \operatorname{approx}_\lambda(x)\|^2] + \|\beta_\lambda - \beta\|_\Sigma^2 \big) \le 2 \big( \rho_\lambda^2 d_{1,\lambda} E[\operatorname{approx}_\lambda(x)^2] + \|\beta_\lambda - \beta\|_\Sigma^2 \big).$$
The claim now follows from Lemma 9. □

5.4. Effect of noise.

Lemma 6 (Effect of noise, $\lambda = 0$). Assume $\lambda = 0$. Assume Condition 2 (with parameter $\sigma$) holds. Pick any $t > 0$. With probability at least $1 - e^{-t}$, either $\|\Delta_0\| \ge 1$, or $\|\Delta_0\| < 1$ and
$$\|\bar\beta_0 - \hat\beta_0\|_\Sigma^2 \le \frac{1}{1 - \|\Delta_0\|} \cdot \frac{\sigma^2 (d + 2\sqrt{dt} + 2t)}{n},$$
where $\Delta_0$ is defined in (16).

Proof. Observe that
$$\|\bar\beta_0 - \hat\beta_0\|_\Sigma^2 \le \|\Sigma^{1/2} \hat\Sigma^{-1/2}\|^2\, \|\bar\beta_0 - \hat\beta_0\|_{\hat\Sigma}^2 = \|\Sigma^{1/2} \hat\Sigma^{-1} \Sigma^{1/2}\|\, \|\bar\beta_0 - \hat\beta_0\|_{\hat\Sigma}^2;$$
and if $\|\Delta_0\| < 1$, then $\|\Sigma^{1/2} \hat\Sigma^{-1} \Sigma^{1/2}\| \le 1/(1 - \|\Delta_0\|)$ by Lemma 3. Let $\xi := (\operatorname{noise}(x_1), \operatorname{noise}(x_2), \ldots, \operatorname{noise}(x_n))$ be the random vector whose $i$-th component is $\operatorname{noise}(x_i) = y_i - E[y_i \mid x_i]$. By the definitions of $\hat\beta_0$ and $\bar\beta_0$,
$$\|\hat\beta_0 - \bar\beta_0\|_{\hat\Sigma}^2 = \|\hat\Sigma^{-1/2} \hat{E}[x (y - E[y \mid x])]\|^2 = \xi^\top \hat{K} \xi,$$
where $\hat{K} \in \mathbb{R}^{n \times n}$ is the symmetric matrix whose $(i, j)$-th entry is $\hat{K}_{i,j} := n^{-2} \langle \hat\Sigma^{-1/2} x_i, \hat\Sigma^{-1/2} x_j \rangle$. Note that the nonzero eigenvalues of $\hat{K}$ are the same as those of
$$\frac{1}{n} \hat{E}\big[ (\hat\Sigma^{-1/2} x) \otimes (\hat\Sigma^{-1/2} x) \big] = \frac{1}{n} \hat\Sigma^{-1/2} \hat\Sigma \hat\Sigma^{-1/2} = \frac{1}{n} I.$$
By Lemma 8, with probability at least $1 - e^{-t}$ (conditioned on $x_1, x_2, \ldots, x_n$),
$$\xi^\top \hat{K} \xi \le \sigma^2 \big( \operatorname{tr}(\hat{K}) + 2\sqrt{\operatorname{tr}(\hat{K}^2)\, t} + 2 \lambda_{\max}(\hat{K})\, t \big) = \frac{\sigma^2 (d + 2\sqrt{dt} + 2t)}{n}.$$
The claim follows. □
Lemma 7 (Effect of noise, $\lambda \ge 0$). Assume Condition 2 (with parameter $\sigma$) holds. Pick any $t > 0$. Let $K$ be the $n \times n$ symmetric matrix whose $(i, j)$-th entry is
$$K_{i,j} := \frac{1}{n^2} \langle \Sigma^{1/2} \hat\Sigma_\lambda^{-1} x_i,\, \Sigma^{1/2} \hat\Sigma_\lambda^{-1} x_j \rangle,$$
where $\hat\Sigma_\lambda$ is defined in (15). With probability at least $1 - e^{-t}$,
$$\|\bar\beta_\lambda - \hat\beta_\lambda\|_\Sigma^2 \le \sigma^2 \big( \operatorname{tr}(K) + 2\sqrt{\operatorname{tr}(K) \lambda_{\max}(K)\, t} + 2 \lambda_{\max}(K)\, t \big).$$
Moreover, if $\|\Delta_\lambda\| < 1$, where $\Delta_\lambda$ is defined in (16), then
$$\lambda_{\max}(K) \le \frac{1}{n (1 - \|\Delta_\lambda\|)} \quad \text{and} \quad \operatorname{tr}(K) \le \frac{d_{2,\lambda} + \sqrt{d_{2,\lambda}}\, \|\Delta_\lambda\|_F}{n (1 - \|\Delta_\lambda\|)^2}.$$

Proof. Let $\xi := (\operatorname{noise}(x_1), \operatorname{noise}(x_2), \ldots, \operatorname{noise}(x_n))$ be the random vector whose $i$-th component is $\operatorname{noise}(x_i) = y_i - E[y_i \mid x_i]$. By the definitions of $\hat\beta_\lambda$, $\bar\beta_\lambda$, and $K$,
$$\|\hat\beta_\lambda - \bar\beta_\lambda\|_\Sigma^2 = \|\hat\Sigma_\lambda^{-1} \hat{E}[x (y - E[y \mid x])]\|_\Sigma^2 = \xi^\top K \xi.$$
By Lemma 8, with probability at least $1 - e^{-t}$ (conditioned on $x_1, x_2, \ldots, x_n$),
$$\xi^\top K \xi \le \sigma^2 \big( \operatorname{tr}(K) + 2\sqrt{\operatorname{tr}(K^2)\, t} + 2 \lambda_{\max}(K)\, t \big) \le \sigma^2 \big( \operatorname{tr}(K) + 2\sqrt{\operatorname{tr}(K) \lambda_{\max}(K)\, t} + 2 \lambda_{\max}(K)\, t \big),$$
where the second inequality follows from von Neumann's theorem [10]. Note that the nonzero eigenvalues of $K$ are the same as those of
$$\frac{1}{n} \hat{E}\big[ (\Sigma^{1/2} \hat\Sigma_\lambda^{-1} x) \otimes (\Sigma^{1/2} \hat\Sigma_\lambda^{-1} x) \big] = \frac{1}{n} \Sigma^{1/2} \hat\Sigma_\lambda^{-1} \hat\Sigma \hat\Sigma_\lambda^{-1} \Sigma^{1/2}.$$
To bound $\lambda_{\max}(K)$, observe that, by the submultiplicative property of the spectral norm and Lemma 3,
$$n \lambda_{\max}(K) = \|\Sigma^{1/2} \hat\Sigma_\lambda^{-1} \hat\Sigma^{1/2}\|^2 \le \|\Sigma^{1/2} \Sigma_\lambda^{-1/2}\|^2\, \|\Sigma_\lambda^{1/2} \hat\Sigma_\lambda^{-1/2}\|^2\, \|\hat\Sigma_\lambda^{-1/2} \hat\Sigma^{1/2}\|^2 \le \|\Sigma_\lambda^{1/2} \hat\Sigma_\lambda^{-1/2}\|^2 = \|\Sigma_\lambda^{1/2} \hat\Sigma_\lambda^{-1} \Sigma_\lambda^{1/2}\| \le \frac{1}{1 - \|\Delta_\lambda\|}.$$
To bound $\operatorname{tr}(K)$, first define the $\lambda$-whitened versions of $\Sigma$, $\hat\Sigma$, and $\hat\Sigma_\lambda$ as
$$\Sigma_w := \Sigma_\lambda^{-1/2} \Sigma \Sigma_\lambda^{-1/2}, \qquad \hat\Sigma_w := \Sigma_\lambda^{-1/2} \hat\Sigma \Sigma_\lambda^{-1/2}, \qquad \hat\Sigma_{\lambda,w} := \Sigma_\lambda^{-1/2} \hat\Sigma_\lambda \Sigma_\lambda^{-1/2}.$$
Using these definitions with the cyclic property of the trace,
$$n \operatorname{tr}(K) = \operatorname{tr}(\Sigma^{1/2} \hat\Sigma_\lambda^{-1} \hat\Sigma \hat\Sigma_\lambda^{-1} \Sigma^{1/2}) = \operatorname{tr}(\hat\Sigma_\lambda^{-1} \hat\Sigma \hat\Sigma_\lambda^{-1} \Sigma) = \operatorname{tr}(\hat\Sigma_{\lambda,w}^{-1} \hat\Sigma_w \hat\Sigma_{\lambda,w}^{-1} \Sigma_w).$$
Let $\{\lambda_j[M]\}$ denote the eigenvalues of a linear operator $M$. By von Neumann's theorem [10],
$$\operatorname{tr}(\hat\Sigma_{\lambda,w}^{-1} \hat\Sigma_w \hat\Sigma_{\lambda,w}^{-1} \Sigma_w) \le \sum_j \lambda_j[\hat\Sigma_{\lambda,w}^{-1} \hat\Sigma_w \hat\Sigma_{\lambda,w}^{-1}]\, \lambda_j[\Sigma_w],$$
and by Ostrowski's theorem [10],
$$\lambda_j[\hat\Sigma_{\lambda,w}^{-1} \hat\Sigma_w \hat\Sigma_{\lambda,w}^{-1}] \le \lambda_{\max}[\hat\Sigma_{\lambda,w}^{-2}]\, \lambda_j[\hat\Sigma_w].$$
Therefore,
$$\operatorname{tr}(\hat\Sigma_{\lambda,w}^{-1} \hat\Sigma_w \hat\Sigma_{\lambda,w}^{-1} \Sigma_w) \le \lambda_{\max}[\hat\Sigma_{\lambda,w}^{-2}] \sum_j \lambda_j[\hat\Sigma_w]\, \lambda_j[\Sigma_w] \le \frac{1}{(1 - \|\Delta_\lambda\|)^2} \sum_j \lambda_j[\hat\Sigma_w]\, \lambda_j[\Sigma_w] = \frac{1}{(1 - \|\Delta_\lambda\|)^2} \sum_j \Big( \lambda_j[\Sigma_w]^2 + \big( \lambda_j[\hat\Sigma_w] - \lambda_j[\Sigma_w] \big) \lambda_j[\Sigma_w] \Big) \le \frac{1}{(1 - \|\Delta_\lambda\|)^2} \left( \sum_j \lambda_j[\Sigma_w]^2 + \sqrt{\sum_j \big( \lambda_j[\hat\Sigma_w] - \lambda_j[\Sigma_w] \big)^2}\, \sqrt{\sum_j \lambda_j[\Sigma_w]^2} \right) = \frac{1}{(1 - \|\Delta_\lambda\|)^2} \left( d_{2,\lambda} + \sqrt{\sum_j \big( \lambda_j[\hat\Sigma_w] - \lambda_j[\Sigma_w] \big)^2}\, \sqrt{d_{2,\lambda}} \right) \le \frac{1}{(1 - \|\Delta_\lambda\|)^2} \Big( d_{2,\lambda} + \|\hat\Sigma_w - \Sigma_w\|_F \sqrt{d_{2,\lambda}} \Big) = \frac{1}{(1 - \|\Delta_\lambda\|)^2} \Big( d_{2,\lambda} + \|\Delta_\lambda\|_F \sqrt{d_{2,\lambda}} \Big),$$
where the second inequality follows from Lemma 3, the third inequality follows from Cauchy-Schwarz, and the fourth inequality follows from Mirsky's theorem [21]. □

Acknowledgements. The authors thank Dean Foster, David McAllester, and Robert Stine for many insightful discussions.
Acknowledgements. The authors thank Dean Foster, David McAllester, and Robert Stine for many insightful discussions.

References

[1] N. Ailon and B. Chazelle. Approximate nearest neighbors and the fast Johnson–Lindenstrauss transform. SIAM J. Comput., 39(1):302–322, 2009.
[2] J.-Y. Audibert and O. Catoni. Linear regression through PAC-Bayesian truncation, 2010. arXiv:1010.0072.
[3] J.-Y. Audibert and O. Catoni. Robust linear least squares regression. The Annals of Statistics, 39(5):2766–2794, 2011.
[4] A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
[5] O. Catoni. Statistical Learning Theory and Stochastic Optimization, Lectures on Probability and Statistics, École d'Été de Probabilités de Saint-Flour XXXI – 2001, volume 1851 of Lecture Notes in Mathematics. Springer, 2004.
[6] P. Drineas and M. W. Mahoney. Effective resistances, statistical leverage, and applications to linear equation solving, 2010.
[7] P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlós. Faster least squares approximation. Numerische Mathematik, 117(2):219–249, 2010.
[8] L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer, 2004.
[9] A. E. Hoerl. Application of ridge analysis to regression problems. Chemical Engineering Progress, 58:54–59, 1962.
[10] R. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
[11] D. Hsu, S. M. Kakade, and T. Zhang. A tail inequality for quadratic forms of subgaussian random vectors, 2011.
[12] D. Hsu, S. M. Kakade, and T. Zhang. Tail inequalities for sums of random matrices that depend on the intrinsic dimension. Electronic Communications in Probability, 17(14):1–13, 2012.
[13] D. Hsu and S. Sabato. Loss minimization and parameter estimation with heavy tails, 2013.
[14] V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656, 2006.
[15] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338, 2000.
[16] E. L. Lehmann and G. Casella. Theory of Point Estimation. Springer, second edition, 1998.
[17] M. Nussbaum. Minimax risk: Pinsker bound. In S. Kotz, editor, Encyclopedia of Statistical Sciences, Update Volume 3, pages 451–460. Wiley, New York, 1999.
[18] V. Rokhlin and M. Tygert. A fast randomized algorithm for overdetermined linear least-squares regression. Proc. Natl. Acad. Sci. USA, 105(36):13212–13217, 2008.
[19] S. Smale and D.-X. Zhou. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26:153–172, 2007.
[20] I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In Proceedings of the 22nd Annual Conference on Learning Theory, pages 79–93, 2009.
[21] G. W. Stewart and J.-G. Sun. Matrix Perturbation Theory. Academic Press, 1990.
[22] C. J. Stone. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, 10:1040–1053, 1982.
[23] T. Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17:2077–2098, 2005.

Appendix A. Probability tail inequalities

The following probability tail inequalities are used in our analysis. These specific inequalities were chosen in order to satisfy the general conditions set up in Section 2.4; however, our analysis can specialize or generalize with the availability of other tail inequalities of these sorts. The first tail inequality is for positive semidefinite quadratic forms of a subgaussian random vector. It generalizes a standard tail inequality for Gaussian random vectors based on linear combinations of $\chi^2$ random variables [15].
Lemma 8 (Quadratic forms of a subgaussian random vector; [11]). Let $\xi$ be a random vector taking values in $\mathbb R^n$ such that for some $c \ge 0$,
\[
\mathbb E[\exp(\langle u, \xi\rangle)] \le \exp(c\|u\|^2/2) \quad \text{for all } u \in \mathbb R^n.
\]
For all symmetric positive semidefinite matrices $K \succeq 0$ and all $t > 0$,
\[
\Pr\Bigl[\xi^\top K\xi > c\bigl(\mathrm{tr}(K) + 2\sqrt{\mathrm{tr}(K^2)\,t} + 2\|K\|t\bigr)\Bigr] \le e^{-t}.
\]

The next lemma is a tail inequality for sums of bounded random vectors; it is a standard application of Bernstein's inequality.

Lemma 9 (Vector Bernstein bound; see, e.g., [11]). Let $x_1, x_2, \ldots, x_n$ be independent random vectors such that
\[
\sum_{i=1}^n \mathbb E[\|x_i\|^2] \le v \quad\text{and}\quad \|x_i\| \le r \text{ for all } i = 1, 2, \ldots, n,
\]
almost surely. Let $s := x_1 + x_2 + \cdots + x_n$. For all $t > 0$,
\[
\Pr\bigl[\|s\| > \sqrt{v}\,(1 + \sqrt{8t}) + (4/3)\,rt\bigr] \le e^{-t}.
\]

The last tail inequality concerns the spectral accuracy of an empirical second moment matrix.

Lemma 10 (Matrix Bernstein bound; [12]). Let $X$ be a random matrix, and let $r > 0$, $v > 0$, and $k > 0$ be such that, almost surely,
\[
\mathbb E[X] = 0, \qquad \lambda_{\max}[X] \le r, \qquad \lambda_{\max}[\mathbb E[X^2]] \le v, \qquad \mathrm{tr}(\mathbb E[X^2]) \le vk.
\]
If $X_1, X_2, \ldots, X_n$ are independent copies of $X$, then for any $t > 0$,
\[
\Pr\Biggl[\lambda_{\max}\Biggl[\frac{1}{n}\sum_{i=1}^n X_i\Biggr] > \sqrt{\frac{2vt}{n}} + \frac{rt}{3n}\Biggr] \le kt\,(e^t - t - 1)^{-1}.
\]
If $t \ge 2.6$, then $t(e^t - t - 1)^{-1} \le e^{-t/2}$.
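Lemma 8, which drives the noise bounds in Lemmas 6 and 7, is also simple to illustrate by simulation. In the sketch below (not from the paper; the test matrix and all parameters are hypothetical), $\xi$ is standard Gaussian, so the moment generating function condition holds with $c = 1$, and we estimate the tail probability for an arbitrary positive semidefinite matrix.

```python
# Monte Carlo illustration of Lemma 8 with a Gaussian vector (so c = 1) and
# an arbitrary PSD test matrix K. A rough sketch; all parameters hypothetical.
import numpy as np

rng = np.random.default_rng(2)
m, t, trials = 20, 2.0, 100_000
G = rng.standard_normal((m, m))
K = G @ G.T / m                                  # a positive semidefinite matrix

threshold = (np.trace(K)
             + 2*np.sqrt(np.trace(K @ K) * t)
             + 2*np.linalg.norm(K, 2) * t)       # spectral norm ||K||

xi = rng.standard_normal((trials, m))            # standard Gaussian vectors
quad = np.einsum('ij,jk,ik->i', xi, K, xi)       # xi' K xi for each trial
rate = np.mean(quad > threshold)
print(f"failure rate {rate:.5f} <= e^(-t) = {np.exp(-t):.5f}")
```

The observed rate typically falls well below $e^{-t}$, reflecting the slack in the generic subgaussian bound relative to the exact Gaussian tail.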
(D. Hsu) Department of Computer Science, Columbia University, 450 Computer Science Building, 1214 Amsterdam Avenue, Mailcode 0401, New York, NY 10027-7003
E-mail address: djhsu@cs.columbia.edu

(S. M. Kakade) Microsoft Research, One Memorial Drive, Cambridge, MA 02142
E-mail address: skakade@microsoft.com

(T. Zhang) Department of Statistics, Rutgers University, 501 Hill Center, 110 Frelinghuysen Road, Piscataway, NJ 08854
E-mail address: tzhang@stat.rutgers.edu