A tail inequality for quadratic forms of subgaussian random vectors

We prove an exponential probability tail inequality for positive semidefinite quadratic forms in a subgaussian random vector. The bound is analogous to one that holds when the vector has independent Gaussian entries.

Authors: **Daniel Hsu¹, Sham M. Kakade¹,², and Tong Zhang³**

¹ Microsoft Research New England
² Department of Statistics, Wharton School, University of Pennsylvania
³ Department of Statistics, Rutgers University
E-mail: dahsu@microsoft.com, skakade@wharton.upenn.edu, tzhang@stat.rutgers.edu

November 27, 2024

1 Introduction

Suppose that $x = (x_1, \dots, x_n)$ is a random vector, and let $A \in \mathbb{R}^{m \times n}$ be a fixed matrix. A natural quantity that arises in many settings is the quadratic form $\|Ax\|^2 = x^\top (A^\top A) x$. Throughout, $\|v\|$ denotes the Euclidean norm of a vector $v$, and $\|M\|$ denotes the spectral (operator) norm of a matrix $M$. We are interested in how close $\|Ax\|^2$ is to its expectation.

Consider the special case where $x_1, \dots, x_n$ are independent standard Gaussian random variables. The following proposition provides an (upper) tail bound for $\|Ax\|^2$.

**Proposition 1.** Let $A \in \mathbb{R}^{m \times n}$ be a matrix, and let $\Sigma := A^\top A$. Let $x = (x_1, \dots, x_n)$ be an isotropic multivariate Gaussian random vector with mean zero. For all $t > 0$,
$$\Pr\left[\|Ax\|^2 > \operatorname{tr}(\Sigma) + 2\sqrt{\operatorname{tr}(\Sigma^2)\, t} + 2\|\Sigma\|\, t\right] \le e^{-t}.$$

The proof, given in Appendix A.2, is straightforward given the rotational invariance of the multivariate Gaussian distribution, together with a tail bound for linear combinations of $\chi^2$ random variables due to Laurent and Massart (2000). We note that a slightly weaker form of Proposition 1 can be proved directly using Gaussian concentration (Pisier, 1989).

In this note, we consider the case where $x = (x_1, \dots, x_n)$ is a subgaussian random vector. By this, we mean that there exists a $\sigma \ge 0$ such that for all $\alpha \in \mathbb{R}^n$,
$$\mathbb{E}\left[\exp\big(\alpha^\top x\big)\right] \le \exp\left(\|\alpha\|^2 \sigma^2 / 2\right).$$
We provide a sharp upper tail bound for this case, analogous to the one that holds in the Gaussian case (indeed, the same as Proposition 1 when $\sigma = 1$).
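As a quick numerical illustration of Proposition 1, the following sketch (NumPy assumed; the matrix $A$, the value of $t$, and the sample size are arbitrary choices, not taken from the paper) estimates the exceedance probability by Monte Carlo and compares it with the guaranteed level $e^{-t}$.

```python
# Monte Carlo sanity check of Proposition 1 (illustrative; NumPy assumed).
import numpy as np

rng = np.random.default_rng(0)
m, n, t = 5, 8, 3.0
A = rng.standard_normal((m, n))          # arbitrary fixed matrix
Sigma = A.T @ A

# Tail threshold from Proposition 1: tr(Sigma) + 2*sqrt(tr(Sigma^2)*t) + 2*||Sigma||*t.
threshold = (np.trace(Sigma)
             + 2.0 * np.sqrt(np.trace(Sigma @ Sigma) * t)
             + 2.0 * np.linalg.norm(Sigma, 2) * t)

# Empirical exceedance probability for isotropic Gaussian x; should be at most exp(-t).
x = rng.standard_normal((200_000, n))
q = np.sum((x @ A.T) ** 2, axis=1)       # ||A x||^2 for each sample
print("empirical tail:", np.mean(q > threshold), " guarantee:", np.exp(-t))
```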
Tail inequalities for sums of random vectors

One motivation for our main result comes from the following observations about sums of random vectors. Let $a_1, \dots, a_n$ be vectors in a Euclidean space, and let $A = [a_1 \mid \cdots \mid a_n]$ be the matrix with $a_i$ as its $i$th column. Consider the squared norm of the random sum
$$\|Ax\|^2 = \Big\| \sum_{i=1}^n a_i x_i \Big\|^2 \tag{1}$$
where $x := (x_1, \dots, x_n)$ is a martingale difference sequence with $\mathbb{E}[x_i \mid x_1, \dots, x_{i-1}] = 0$ and $\mathbb{E}[x_i^2 \mid x_1, \dots, x_{i-1}] = \sigma^2$. Under mild boundedness assumptions on the $x_i$, the probability that the squared norm in (1) is much larger than its expectation
$$\mathbb{E}\big[\|Ax\|^2\big] = \sigma^2 \sum_{i=1}^n \|a_i\|^2 = \sigma^2 \operatorname{tr}(A^\top A)$$
falls off exponentially fast. This can be shown, for instance, using the following lemma by taking $u_i = a_i x_i$ (the proof is standard, but we give it for completeness in Appendix A.1).

**Proposition 2.** Let $u_1, \dots, u_n$ be a martingale difference vector sequence (i.e., $\mathbb{E}[u_i \mid u_1, \dots, u_{i-1}] = 0$ for all $i = 1, \dots, n$) such that
$$\sum_{i=1}^n \mathbb{E}\big[\|u_i\|^2 \mid u_1, \dots, u_{i-1}\big] \le v \quad \text{and} \quad \|u_i\| \le b$$
for all $i = 1, \dots, n$, almost surely. For all $t > 0$,
$$\Pr\left[\Big\|\sum_{i=1}^n u_i\Big\| > \sqrt{v} + \sqrt{8vt} + (4/3)\,bt\right] \le e^{-t}.$$

After squaring the quantities in the stated probabilistic event, Proposition 2 gives the bound
$$\|Ax\|^2 \le \sigma^2 \cdot \operatorname{tr}(A^\top A) + \sigma^2 \cdot O\!\left( \operatorname{tr}(A^\top A)\,(\sqrt{t} + t) + \sqrt{\operatorname{tr}(A^\top A)}\, \max_i \|a_i\|\,(t + t^{3/2}) + \max_i \|a_i\|^2\, t^2 \right)$$
with probability at least $1 - e^{-t}$ when the $x_i$ are almost surely bounded by $1$ (or any constant). Unfortunately, this bound obtained from Proposition 2 can be suboptimal when the $x_i$ are subgaussian. For instance, if the $x_i$ are Rademacher random variables, so $\Pr[x_i = +1] = \Pr[x_i = -1] = 1/2$, then it is known that
$$\|Ax\|^2 \le \operatorname{tr}(A^\top A) + O\!\left( \sqrt{\operatorname{tr}\big((A^\top A)^2\big)\, t} + \|A\|^2\, t \right) \tag{2}$$
with probability at least $1 - e^{-t}$. A similar result holds for any subgaussian distribution on the $x_i$ (Hanson and Wright, 1971). This is an improvement over the previous bound because the deviation terms (i.e., those involving $t$) can be significantly smaller, especially for large $t$. In this work, we give a simple proof of (2) with explicit constants that match the analogous bound when the $x_i$ are independent standard Gaussian random variables.

2 Positive semidefinite quadratic forms

Our main theorem, given below, is a generalization of (2).

**Theorem 1.** Let $A \in \mathbb{R}^{m \times n}$ be a matrix, and let $\Sigma := A^\top A$. Suppose that $x = (x_1, \dots, x_n)$ is a random vector such that, for some $\mu \in \mathbb{R}^n$ and $\sigma \ge 0$,
$$\mathbb{E}\left[\exp\big(\alpha^\top (x - \mu)\big)\right] \le \exp\left(\|\alpha\|^2 \sigma^2 / 2\right) \tag{3}$$
for all $\alpha \in \mathbb{R}^n$. For all $t > 0$,
$$\Pr\left[ \|Ax\|^2 > \sigma^2\left( \operatorname{tr}(\Sigma) + 2\sqrt{\operatorname{tr}(\Sigma^2)\,t} + 2\|\Sigma\|\, t \right) + \|A\mu\|^2 \left( 1 + 4\Big(\tfrac{\|\Sigma\|^2}{\operatorname{tr}(\Sigma^2)}\,t\Big)^{1/2} + 4\,\tfrac{\|\Sigma\|^2}{\operatorname{tr}(\Sigma^2)}\,t \right)^{1/2} \right] \le e^{-t}.$$

**Remark 1.** Note that when $\mu = 0$ and $\sigma = 1$ we have
$$\Pr\left[\|Ax\|^2 > \operatorname{tr}(\Sigma) + 2\sqrt{\operatorname{tr}(\Sigma^2)\,t} + 2\|\Sigma\|\, t\right] \le e^{-t},$$
which is the same as Proposition 1.

**Remark 2.** Our proof actually establishes the following upper bounds on the moment generating function of $\|Ax\|^2$ for $0 \le \eta < 1/(2\sigma^2\|\Sigma\|)$:
$$\mathbb{E}\left[\exp\big(\eta\|Ax\|^2\big)\right] \le \mathbb{E}\left[\exp\big(\sigma^2\|A^\top z\|^2\eta + \mu^\top A^\top z\,\sqrt{2\eta}\big)\right] \le \exp\left( \sigma^2\operatorname{tr}(\Sigma)\,\eta + \frac{\sigma^4\operatorname{tr}(\Sigma^2)\,\eta^2 + \|A\mu\|^2\eta}{1 - 2\sigma^2\|\Sigma\|\eta} \right),$$
where $z$ is a vector of $m$ independent standard Gaussian random variables.
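The following sketch (NumPy assumed; the matrix and parameters are arbitrary illustrative choices, not from the paper) evaluates the $\mu = 0$, $\sigma = 1$ case of the bound in Theorem 1 for a Rademacher vector $x$, which satisfies condition (3) with $\sigma = 1$, and compares the empirical tail probability with $e^{-t}$.

```python
# Empirical check of the mu = 0, sigma = 1 case of Theorem 1 for Rademacher x.
# Illustrative sketch only; NumPy assumed.
import numpy as np

rng = np.random.default_rng(1)
m, n, t = 4, 10, 3.0
A = rng.standard_normal((m, n))
Sigma = A.T @ A

# Bound: tr(Sigma) + 2*sqrt(tr(Sigma^2)*t) + 2*||Sigma||*t  (sigma = 1).
bound = (np.trace(Sigma)
         + 2.0 * np.sqrt(np.trace(Sigma @ Sigma) * t)
         + 2.0 * np.linalg.norm(Sigma, 2) * t)

# Rademacher entries satisfy the subgaussian condition (3) with mu = 0, sigma = 1.
x = rng.choice([-1.0, 1.0], size=(200_000, n))
q = np.sum((x @ A.T) ** 2, axis=1)            # ||A x||^2 per sample
print("empirical tail:", np.mean(q > bound), " guarantee:", np.exp(-t))
```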
*Proof of Theorem 1.* Let $z$ be a vector of $m$ independent standard Gaussian random variables (sampled independently of $x$). For any $\alpha \in \mathbb{R}^m$, $\mathbb{E}[\exp(z^\top\alpha)] = \exp(\|\alpha\|^2/2)$. Thus, for any $\lambda \in \mathbb{R}$ and $\varepsilon \ge 0$,
$$\mathbb{E}\left[\exp\big(\lambda z^\top Ax\big)\right] \ge \mathbb{E}\left[\exp\big(\lambda z^\top Ax\big) \,\Big|\, \|Ax\|^2 > \varepsilon\right] \cdot \Pr\left[\|Ax\|^2 > \varepsilon\right] \ge \exp\left(\frac{\lambda^2\varepsilon}{2}\right) \cdot \Pr\left[\|Ax\|^2 > \varepsilon\right]. \tag{4}$$
Moreover,
$$\mathbb{E}\left[\exp\big(\lambda z^\top Ax\big)\right] = \mathbb{E}\left[ \mathbb{E}\left[\exp\big(\lambda z^\top A(x-\mu)\big) \,\Big|\, z\right] \exp\big(\lambda z^\top A\mu\big) \right] \le \mathbb{E}\left[ \exp\left( \frac{\lambda^2\sigma^2}{2}\|A^\top z\|^2 + \lambda\mu^\top A^\top z \right) \right]. \tag{5}$$
Let $USV^\top$ be a singular value decomposition of $A$, where $U$ and $V$ are, respectively, matrices of orthonormal left and right singular vectors, and $S = \operatorname{diag}(\sqrt{\rho_1}, \dots, \sqrt{\rho_m})$ is the diagonal matrix of corresponding singular values. Note that
$$\|\rho\|_1 = \sum_{i=1}^m \rho_i = \operatorname{tr}(\Sigma), \qquad \|\rho\|_2^2 = \sum_{i=1}^m \rho_i^2 = \operatorname{tr}(\Sigma^2), \qquad \|\rho\|_\infty = \max_i \rho_i = \|\Sigma\|.$$
By rotational invariance, $y := U^\top z$ is an isotropic multivariate Gaussian random vector with mean zero. Therefore
$$\|A^\top z\|^2 = z^\top U S^2 U^\top z = \rho_1 y_1^2 + \cdots + \rho_m y_m^2 \qquad \text{and} \qquad \mu^\top A^\top z = \nu^\top y = \nu_1 y_1 + \cdots + \nu_m y_m,$$
where $\nu := S V^\top \mu$ (note that $\|\nu\|^2 = \|S V^\top \mu\|^2 = \|A\mu\|^2$). Let $\gamma := \lambda^2\sigma^2/2$. By Lemma 1,
$$\mathbb{E}\left[ \exp\left( \gamma \sum_{i=1}^m \rho_i y_i^2 + \frac{\sqrt{2\gamma}}{\sigma} \sum_{i=1}^m \nu_i y_i \right) \right] \le \exp\left( \|\rho\|_1\gamma + \frac{\|\rho\|_2^2\gamma^2 + \|\nu\|^2\gamma/\sigma^2}{1 - 2\|\rho\|_\infty\gamma} \right) \tag{6}$$
for $0 \le \gamma < 1/(2\|\rho\|_\infty)$. Combining (4), (5), and (6) gives
$$\Pr\left[\|Ax\|^2 > \varepsilon\right] \le \exp\left( -\frac{\varepsilon\gamma}{\sigma^2} + \|\rho\|_1\gamma + \frac{\|\rho\|_2^2\gamma^2 + \|\nu\|^2\gamma/\sigma^2}{1 - 2\|\rho\|_\infty\gamma} \right)$$
for $0 \le \gamma < 1/(2\|\rho\|_\infty)$ and $\varepsilon \ge 0$. Choosing
$$\varepsilon := \sigma^2(\|\rho\|_1 + \tau) + \|\nu\|^2\sqrt{1 + \frac{2\|\rho\|_\infty\tau}{\|\rho\|_2^2}} \qquad \text{and} \qquad \gamma := \frac{1}{2\|\rho\|_\infty}\left( 1 - \sqrt{\frac{\|\rho\|_2^2}{\|\rho\|_2^2 + 2\|\rho\|_\infty\tau}} \right),$$
we have
$$\Pr\left[ \|Ax\|^2 > \sigma^2(\|\rho\|_1 + \tau) + \|\nu\|^2\sqrt{1 + \frac{2\|\rho\|_\infty\tau}{\|\rho\|_2^2}} \right] \le \exp\left( -\frac{\|\rho\|_2^2}{2\|\rho\|_\infty^2}\left( 1 + \frac{\|\rho\|_\infty\tau}{\|\rho\|_2^2} - \sqrt{1 + \frac{2\|\rho\|_\infty\tau}{\|\rho\|_2^2}} \right) \right) = \exp\left( -\frac{\|\rho\|_2^2}{2\|\rho\|_\infty^2}\, h_1\!\left( \frac{\|\rho\|_\infty\tau}{\|\rho\|_2^2} \right) \right),$$
where $h_1(a) := 1 + a - \sqrt{1+2a}$, which has the inverse function $h_1^{-1}(b) = \sqrt{2b} + b$. The result follows by setting $\tau := 2\sqrt{\|\rho\|_2^2\, t} + 2\|\rho\|_\infty t = 2\sqrt{\operatorname{tr}(\Sigma^2)\,t} + 2\|\Sigma\|\, t$. $\square$

The following lemma is a standard estimate of the logarithmic moment generating function of a quadratic form in standard Gaussian random variables, proved much along the lines of the estimate due to Laurent and Massart (2000).

**Lemma 1.** Let $z$ be a vector of $m$ independent standard Gaussian random variables. Fix any non-negative vector $\alpha \in \mathbb{R}_+^m$ and any vector $\beta \in \mathbb{R}^m$. If $0 \le \lambda < 1/(2\|\alpha\|_\infty)$, then
$$\log \mathbb{E}\left[ \exp\left( \lambda \sum_{i=1}^m \alpha_i z_i^2 + \sum_{i=1}^m \beta_i z_i \right) \right] \le \|\alpha\|_1\lambda + \frac{\|\alpha\|_2^2\lambda^2 + \|\beta\|_2^2/2}{1 - 2\|\alpha\|_\infty\lambda}.$$

*Proof.* Fix $\lambda$ such that $0 \le \lambda < 1/(2\|\alpha\|_\infty)$, and let $\eta_i := 1/\sqrt{1 - 2\alpha_i\lambda} > 0$ for $i = 1, \dots, m$. We have
$$\mathbb{E}\left[\exp\big(\lambda\alpha_i z_i^2 + \beta_i z_i\big)\right] = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-z_i^2/2}\, \exp\big(\lambda\alpha_i z_i^2 + \beta_i z_i\big)\, dz_i = \eta_i \exp\left( \frac{\beta_i^2\eta_i^2}{2} \right) \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\eta_i^2}} \exp\left( -\frac{1}{2\eta_i^2}\big(z_i - \beta_i\eta_i^2\big)^2 \right) dz_i,$$
and the remaining integral is that of a Gaussian density, hence equal to $1$; so
$$\log \mathbb{E}\left[ \exp\left( \lambda \sum_{i=1}^m \alpha_i z_i^2 + \sum_{i=1}^m \beta_i z_i \right) \right] = \frac{1}{2} \sum_{i=1}^m \beta_i^2\eta_i^2 + \frac{1}{2} \sum_{i=1}^m \log \eta_i^2.$$
The right-hand side can be bounded using the inequalities
$$\frac{1}{2} \sum_{i=1}^m \log\eta_i^2 = -\frac{1}{2} \sum_{i=1}^m \log(1 - 2\alpha_i\lambda) = \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^{\infty} \frac{(2\alpha_i\lambda)^j}{j} \le \|\alpha\|_1\lambda + \frac{\|\alpha\|_2^2\lambda^2}{1 - 2\|\alpha\|_\infty\lambda}$$
and
$$\frac{1}{2} \sum_{i=1}^m \beta_i^2\eta_i^2 \le \frac{\|\beta\|_2^2/2}{1 - 2\|\alpha\|_\infty\lambda}. \qquad \square$$

Example: fixed-design regression with subgaussian noise

We give a simple application of Theorem 1 to fixed-design linear regression with the ordinary least squares estimator. Let $x_1, \dots, x_n$ be fixed design vectors in $\mathbb{R}^d$. Let the responses $y_1, \dots, y_n$ be random variables for which there exists $\sigma > 0$ such that
$$\mathbb{E}\left[ \exp\left( \sum_{i=1}^n \alpha_i\big(y_i - \mathbb{E}[y_i]\big) \right) \right] \le \exp\left( \sigma^2 \sum_{i=1}^n \alpha_i^2 / 2 \right)$$
for any $\alpha_1, \dots, \alpha_n \in \mathbb{R}$. This condition is satisfied, for instance, if $y_i = \mathbb{E}[y_i] + \varepsilon_i$ for independent subgaussian zero-mean noise variables $\varepsilon_1, \dots, \varepsilon_n$. Let $\Sigma := \sum_{i=1}^n x_i x_i^\top / n$, which we assume is invertible without loss of generality. Let
$$\beta := \Sigma^{-1}\left( \frac{1}{n} \sum_{i=1}^n x_i\,\mathbb{E}[y_i] \right)$$
be the coefficient vector of minimum expected squared error. The ordinary least squares estimator is given by
$$\hat\beta := \Sigma^{-1}\left( \frac{1}{n} \sum_{i=1}^n x_i y_i \right).$$
The excess loss $R(\hat\beta)$ of $\hat\beta$ is the difference between the expected squared error of $\hat\beta$ and that of $\beta$:
$$R(\hat\beta) := \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n \big(x_i^\top\hat\beta - y_i\big)^2 \right] - \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n \big(x_i^\top\beta - y_i\big)^2 \right].$$
It is easy to see that
$$R(\hat\beta) = \big\|\Sigma^{1/2}(\hat\beta - \beta)\big\|^2 = \bigg\| \sum_{i=1}^n \Big(\frac{\Sigma^{-1/2}x_i}{n}\Big)\big(y_i - \mathbb{E}[y_i]\big) \bigg\|^2.$$
By Theorem 1,
$$\Pr\left[ R(\hat\beta) > \frac{\sigma^2\big(d + 2\sqrt{dt} + 2t\big)}{n} \right] \le e^{-t}.$$
Note that if $\mathbb{E}\big[(y_i - \mathbb{E}[y_i])^2\big] = \sigma^2$ for each $i$, then $\mathbb{E}[R(\hat\beta)] = \sigma^2 d / n$; so the tail inequality above is essentially tight when the $y_i$ are independent Gaussian random variables.
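The sketch below (NumPy assumed; the design, noise level, and sample sizes are illustrative choices, not from the paper) simulates this fixed-design setting with Gaussian noise, which satisfies the subgaussian condition above, and compares the empirical tail of $R(\hat\beta)$ with the level $e^{-t}$ guaranteed by the bound.

```python
# Simulation of the fixed-design OLS example (illustrative sketch; NumPy assumed).
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma, t = 500, 5, 0.7, 3.0
X = rng.standard_normal((n, d))               # fixed design, rows x_i^T
Sigma = X.T @ X / n
beta = rng.standard_normal(d)                 # coefficient vector of minimum expected error
Ey = X @ beta                                 # E[y_i] = x_i^T beta

# Bound from the example: sigma^2 * (d + 2*sqrt(d*t) + 2*t) / n.
bound = sigma**2 * (d + 2.0 * np.sqrt(d * t) + 2.0 * t) / n

excess = []
for _ in range(20_000):
    y = Ey + sigma * rng.standard_normal(n)   # Gaussian noise is sigma-subgaussian
    beta_hat = np.linalg.solve(Sigma, X.T @ y / n)
    diff = beta_hat - beta
    excess.append(diff @ Sigma @ diff)        # R(beta_hat) = ||Sigma^{1/2}(beta_hat - beta)||^2
print("empirical tail:", np.mean(np.array(excess) > bound), " guarantee:", np.exp(-t))
```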
References

D. L. Hanson and F. T. Wright. A bound on tail probabilities for quadratic forms in independent random variables. The Annals of Mathematical Statistics, 42(3):1079–1083, 1971.

B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338, 2000.

G. Pisier. The Volume of Convex Bodies and Banach Space Geometry. Cambridge University Press, 1989.

A Standard tail inequalities

A.1 Martingale tail inequalities

The following is a standard form of Bernstein's inequality stated for martingale difference sequences.

**Lemma 2** (Bernstein's inequality for martingales). Let $d_1, \dots, d_n$ be a martingale difference sequence with respect to random variables $x_1, \dots, x_n$ (i.e., $\mathbb{E}[d_i \mid x_1, \dots, x_{i-1}] = 0$ for all $i = 1, \dots, n$) such that $|d_i| \le b$ and $\sum_{i=1}^n \mathbb{E}[d_i^2 \mid x_1, \dots, x_{i-1}] \le v$. For all $t > 0$,
$$\Pr\left[ \sum_{i=1}^n d_i > \sqrt{2vt} + (2/3)\,bt \right] \le e^{-t}.$$

The proof of Proposition 2, which is entirely standard, follows immediately from the next two lemmas together with Jensen's inequality.

**Lemma 3.** Let $u_1, \dots, u_n$ be random vectors such that
$$\sum_{i=1}^n \mathbb{E}\big[\|u_i\|^2 \mid u_1, \dots, u_{i-1}\big] \le v \quad \text{and} \quad \|u_i\| \le b$$
for all $i = 1, \dots, n$, almost surely. For all $t > 0$,
$$\Pr\left[ \Big\|\sum_{i=1}^n u_i\Big\| - \mathbb{E}\left[\Big\|\sum_{i=1}^n u_i\Big\|\right] > \sqrt{8vt} + (4/3)\,bt \right] \le e^{-t}.$$

*Proof.* Let $s_n := u_1 + \cdots + u_n$. Define the Doob martingale
$$d_i := \mathbb{E}\big[\|s_n\| \mid u_1, \dots, u_i\big] - \mathbb{E}\big[\|s_n\| \mid u_1, \dots, u_{i-1}\big]$$
for $i = 1, \dots, n$, so $d_1 + \cdots + d_n = \|s_n\| - \mathbb{E}[\|s_n\|]$. First, clearly, $\mathbb{E}[d_i \mid u_1, \dots, u_{i-1}] = 0$. Next, the triangle inequality implies
$$d_i = \mathbb{E}\big[\|(s_n - u_i) + u_i\| \mid u_1, \dots, u_i\big] - \mathbb{E}\big[\|(s_n - u_i) + u_i\| \mid u_1, \dots, u_{i-1}\big] \le \mathbb{E}\big[\|s_n - u_i\| + \|u_i\| \mid u_1, \dots, u_i\big] - \mathbb{E}\big[\|s_n - u_i\| - \|u_i\| \mid u_1, \dots, u_{i-1}\big] = \|u_i\| + \mathbb{E}\big[\|u_i\| \mid u_1, \dots, u_{i-1}\big],$$
and similarly,
$$d_i \ge -\|u_i\| - \mathbb{E}\big[\|u_i\| \mid u_1, \dots, u_{i-1}\big].$$
Therefore $|d_i| \le \|u_i\| + \mathbb{E}[\|u_i\| \mid u_1, \dots, u_{i-1}] \le 2b$ almost surely. Moreover,
$$\mathbb{E}\big[d_i^2 \mid u_1, \dots, u_{i-1}\big] \le \mathbb{E}\Big[ \|u_i\|^2 + 2\,\|u_i\|\cdot\mathbb{E}\big[\|u_i\| \mid u_1, \dots, u_{i-1}\big] + \mathbb{E}\big[\|u_i\| \mid u_1, \dots, u_{i-1}\big]^2 \,\Big|\, u_1, \dots, u_{i-1} \Big] = \mathbb{E}\big[\|u_i\|^2 \mid u_1, \dots, u_{i-1}\big] + 3\,\mathbb{E}\big[\|u_i\| \mid u_1, \dots, u_{i-1}\big]^2 \le 4\,\mathbb{E}\big[\|u_i\|^2 \mid u_1, \dots, u_{i-1}\big],$$
so $\sum_{i=1}^n \mathbb{E}[d_i^2 \mid u_1, \dots, u_{i-1}] \le 4v$ almost surely. The claim now follows from Bernstein's inequality (Lemma 2). $\square$

**Lemma 4.** If $u_1, \dots, u_n$ is a martingale difference vector sequence (i.e., $\mathbb{E}[u_i \mid u_1, \dots, u_{i-1}] = 0$ for all $i = 1, \dots, n$), then
$$\mathbb{E}\left[ \Big\|\sum_{i=1}^n u_i\Big\|^2 \right] = \sum_{i=1}^n \mathbb{E}\big[\|u_i\|^2\big].$$

*Proof.* Let $s_i := u_1 + \cdots + u_i$ for $i = 1, \dots, n$; we have
$$\mathbb{E}\big[\|s_n\|^2\big] = \mathbb{E}\Big[ \mathbb{E}\big[\|u_n + s_{n-1}\|^2 \mid u_1, \dots, u_{n-1}\big] \Big] = \mathbb{E}\Big[ \mathbb{E}\big[\|u_n\|^2 + 2u_n^\top s_{n-1} + \|s_{n-1}\|^2 \mid u_1, \dots, u_{n-1}\big] \Big] = \mathbb{E}\big[\|u_n\|^2\big] + \mathbb{E}\big[\|s_{n-1}\|^2\big],$$
so the claim follows by induction. $\square$
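As a small numerical illustration of Lemma 4 (not part of the original appendix; NumPy assumed, all parameters arbitrary), the sketch below builds a martingale difference vector sequence whose conditional variance depends on the past and checks that the two sides of the identity agree.

```python
# Numerical illustration of Lemma 4 (martingale Pythagoras); NumPy assumed.
import numpy as np

rng = np.random.default_rng(3)
n, dim, reps = 6, 3, 50_000
a = rng.standard_normal((n, dim))             # fixed directions a_1, ..., a_n

lhs, rhs = 0.0, np.zeros(n)
for _ in range(reps):
    s, prev = np.zeros(dim), 0.0
    for i in range(n):
        # u_i = a_i * x_i with E[x_i | past] = 0 but a past-dependent variance,
        # so (u_i) is a martingale difference sequence that is not i.i.d.
        x_i = rng.choice([-1.0, 1.0]) * (1.0 + 0.5 * np.sin(prev))
        u_i = a[i] * x_i
        s += u_i
        rhs[i] += u_i @ u_i
        prev += x_i
    lhs += s @ s
print(lhs / reps, "vs", np.sum(rhs) / reps)   # E||sum u_i||^2 vs sum E||u_i||^2
```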
A.2 Gaussian quadratic forms and $\chi^2$ tail inequalities

It is well known that if $z \sim N(0,1)$ is a standard Gaussian random variable, then $z^2$ follows a $\chi^2$ distribution with one degree of freedom. The following inequality, due to Laurent and Massart (2000), gives a bound on linear combinations of $\chi^2$ random variables.

**Lemma 5** ($\chi^2$ tail inequality; Laurent and Massart, 2000). Let $q_1, \dots, q_n$ be independent $\chi^2$ random variables, each with one degree of freedom. For any vector $\gamma = (\gamma_1, \dots, \gamma_n) \in \mathbb{R}_+^n$ with non-negative entries, and any $t > 0$,
$$\Pr\left[ \sum_{i=1}^n \gamma_i q_i > \|\gamma\|_1 + 2\sqrt{\|\gamma\|_2^2\, t} + 2\|\gamma\|_\infty t \right] \le e^{-t}.$$

*Proof of Proposition 1.* Let $V\Lambda V^\top$ be an eigendecomposition of $A^\top A$, where $V$ is a matrix of orthonormal eigenvectors, and $\Lambda := \operatorname{diag}(\rho_1, \dots, \rho_n)$ is the diagonal matrix of corresponding eigenvalues $\rho_1, \dots, \rho_n$. By the rotational invariance of the distribution, $z := V^\top x$ is an isotropic multivariate Gaussian random vector with mean zero. Thus,
$$\|Ax\|^2 = z^\top\Lambda z = \rho_1 z_1^2 + \cdots + \rho_n z_n^2,$$
and the $z_i^2$ are independent $\chi^2$ random variables, each with one degree of freedom. The claim now follows from a tail bound for $\chi^2$ random variables (Lemma 5, due to Laurent and Massart, 2000). $\square$
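The following sketch (NumPy assumed; the weight vector and $t$ are arbitrary illustrative choices) checks the Laurent and Massart bound of Lemma 5 empirically for one particular $\gamma$.

```python
# Numerical check of Lemma 5 for one weight vector (illustrative; NumPy assumed).
import numpy as np

rng = np.random.default_rng(4)
gamma = np.array([3.0, 1.0, 1.0, 0.5, 0.25])  # non-negative weights
t = 2.5

# Threshold: ||gamma||_1 + 2*sqrt(||gamma||_2^2 * t) + 2*||gamma||_inf * t.
threshold = (gamma.sum()
             + 2.0 * np.sqrt(np.sum(gamma**2) * t)
             + 2.0 * gamma.max() * t)

q = rng.chisquare(df=1, size=(500_000, gamma.size))   # chi^2_1 samples
tail = np.mean(q @ gamma > threshold)
print("empirical tail:", tail, " guarantee:", np.exp(-t))
```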
