A Risk Comparison of Ordinary Least Squares vs Ridge Regression
Journal of Machine Learning Research 14 (2013) xx-xx. Submitted 05/12; Published 05/13.

Paramveer S. Dhillon (dhillon@cis.upenn.edu), Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA
Dean P. Foster (foster@wharton.upenn.edu), Department of Statistics, Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA
Sham M. Kakade (skakade@microsoft.com), Microsoft Research, One Memorial Drive, Cambridge, MA 02142, USA
Lyle H. Ungar (ungar@cis.upenn.edu), Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA

Editor: Gabor Lugosi

Abstract

We compare the risk of ridge regression to a simple variant of ordinary least squares, in which one simply projects the data onto a finite dimensional subspace (as specified by a principal component analysis) and then performs an ordinary (un-regularized) least squares regression in this subspace. This note shows that the risk of this ordinary least squares method (PCA-OLS) is within a constant factor (namely 4) of the risk of ridge regression (RR).

Keywords: risk inflation, ridge regression, PCA

1. Introduction

Consider the fixed design setting where we have a set of n vectors X = {X_i}, and let X denote the matrix whose i-th row is X_i. The observed label vector is Y ∈ R^n. Suppose that

    Y = Xβ + ε,

where ε is independent noise in each coordinate, with the variance of ε_i being σ². The objective is to learn E[Y] = Xβ. The expected loss of an estimator β is

    L(β) = (1/n) E_Y[ ||Y − Xβ||² ].

© 2013 Paramveer S. Dhillon, Dean P. Foster, Sham M. Kakade and Lyle H. Ungar.

Let β̂ be an estimator of β (constructed from a sample Y).
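To make the fixed-design setup concrete, here is a minimal Monte Carlo sketch (the sizes n, p and the noise level σ are illustrative assumptions, not values from the paper). In this setting, the excess loss of the OLS fit, L(β̂) − L(β) = (1/n)||X(β̂ − β)||², averages to σ²p/n over repeated noise draws:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative fixed design (sizes are assumptions, not from the paper).
n, p, sigma = 200, 5, 1.0
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)

def excess_loss(beta_hat):
    """L(beta_hat) - L(beta) = (1/n) ||X (beta_hat - beta)||^2 in fixed design."""
    d = X @ (beta_hat - beta)
    return d @ d / n

# Average the excess loss of OLS over fresh noise draws; its expectation
# here is sigma^2 * p / n.
trials, total = 2000, 0.0
for _ in range(trials):
    Y = X @ beta + sigma * rng.standard_normal(n)
    beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
    total += excess_loss(beta_hat)

print(total / trials)  # close to sigma^2 * p / n = 0.025
```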
Denoting Σ := (1/n) XᵀX, the risk (i.e., expected excess loss) is

    Risk(β̂) := E[L(β̂) − L(β)] = E ||β̂ − β||²_Σ,

where ||x||²_Σ = xᵀΣx and the expectation is with respect to the randomness in Y. We show that a simple variant of ordinary (un-regularized) least squares always compares favorably to ridge regression, as measured by the risk. This observation is based on the following bias-variance decomposition:

    Risk(β̂) = E ||β̂ − β̄||²_Σ  (variance)  +  ||β̄ − β||²_Σ  (prediction bias),    (1)

where β̄ = E[β̂].

1.1 The Risk of Ridge Regression (RR)

Ridge regression, or Tikhonov regularization (Tikhonov, 1963), penalizes the ℓ₂ norm of a parameter vector β and "shrinks" it towards zero, penalizing large values more. The estimator is

    β̂_λ = argmin_β { ||Y − Xβ||² + λ||β||² }.

The closed-form estimate is then

    β̂_λ = (Σ + λI)⁻¹ (1/n) XᵀY.

Note that β̂₀ = β̂_{λ=0} = argmin_β ||Y − Xβ||² is the ordinary least squares estimator. Without loss of generality, rotate X such that

    Σ = diag(λ₁, λ₂, …, λ_p),

where the λ_i's are ordered in decreasing order. To see the nature of this shrinkage, observe that

    [β̂_λ]_j = (λ_j / (λ_j + λ)) [β̂₀]_j,

where β̂₀ is the ordinary least squares estimator. Using the bias-variance decomposition (Equation 1), we have:

Lemma 1

    Risk(β̂_λ) = (σ²/n) Σ_j (λ_j/(λ_j+λ))² + Σ_j β_j² λ_j / (1 + λ_j/λ)².

The proof is straightforward and is provided in the appendix.

2. Ordinary Least Squares with PCA (PCA-OLS)

Now let us construct a simple estimator based on λ. Note that the rotated coordinate system in which Σ = diag(λ₁, λ₂, …, λ_p) corresponds to the PCA coordinate system.
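As a quick illustration of Lemma 1, the closed-form risk can be evaluated directly once the eigenvalues λ_j, the coefficients β_j, the noise level σ², and the sample size n are fixed. The values below are hypothetical, chosen only to show the shape of the formula:

```python
import numpy as np

# Hypothetical diagonal (PCA-coordinate) problem; all values are illustrative.
eigs = np.array([4.0, 2.0, 1.0, 0.5])   # lambda_1 >= ... >= lambda_p
beta = np.array([1.0, -0.5, 0.3, 0.2])
sigma2, n = 1.0, 50

def ridge_risk(lam):
    """Risk of the ridge estimator per Lemma 1 (diagonal Sigma)."""
    variance = (sigma2 / n) * np.sum((eigs / (eigs + lam)) ** 2)
    # The bias term vanishes as lam -> 0, so handle lam = 0 separately.
    bias = np.sum(beta ** 2 * eigs / (1.0 + eigs / lam) ** 2) if lam > 0 else 0.0
    return variance + bias

print(ridge_risk(0.0))  # OLS: variance sigma^2 * p / n = 0.08, zero bias
print(ridge_risk(1.0))
```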
Consider the following ordinary least squares estimator on the "top" PCA subspace; it uses the least squares estimate on coordinate j if λ_j ≥ λ and 0 otherwise:

    [β̂_{PCA,λ}]_j = [β̂₀]_j if λ_j ≥ λ, and 0 otherwise.

The following claim shows that this estimator compares favorably to the ridge estimator for every λ, no matter how λ is chosen, e.g., using cross validation or any other strategy. Our main theorem (Theorem 2) bounds the risk ratio (risk inflation)¹ of the PCA-OLS and RR estimators.

Theorem 2 (Bounded Risk Inflation) For all λ ≥ 0, we have

    0 ≤ Risk(β̂_{PCA,λ}) / Risk(β̂_λ) ≤ 4,

and the left-hand inequality is tight.

Proof  Using the bias-variance decomposition of the risk, we can write

    Risk(β̂_{PCA,λ}) = (σ²/n) Σ_j 1{λ_j ≥ λ} + Σ_{j: λ_j < λ} λ_j β_j².

The first term represents the variance and the second the bias. The ridge regression risk is given by Lemma 1. We now show that the j-th term in the expression for the PCA risk is within a factor of 4 of the j-th term of the ridge regression risk.

First, consider the case λ_j ≥ λ. The ratio of the j-th terms is

    (σ²/n) / [ (σ²/n)(λ_j/(λ_j+λ))² + β_j² λ_j/(1+λ_j/λ)² ]
      ≤ (σ²/n) / [ (σ²/n)(λ_j/(λ_j+λ))² ]
      = (1 + λ/λ_j)² ≤ 4.

Similarly, if λ_j < λ, the ratio of the j-th terms is

    λ_j β_j² / [ (σ²/n)(λ_j/(λ_j+λ))² + β_j² λ_j/(1+λ_j/λ)² ]
      ≤ λ_j β_j² / [ λ_j β_j²/(1+λ_j/λ)² ]
      = (1 + λ_j/λ)² ≤ 4.

Since each term is within a factor of 4, the proof is complete. ∎

It is worth noting that the converse is not true: the ridge regression (RR) estimator can be arbitrarily worse than the PCA-OLS estimator. An example showing that the left-hand inequality is tight is given in the appendix.

1. Risk inflation has also been used as a criterion for evaluating feature selection procedures (Foster and George, 1994).
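Theorem 2 is easy to check numerically for any particular diagonal problem. The sketch below (with made-up eigenvalues and coefficients, not values from the paper) evaluates both closed-form risks and confirms that the ratio stays in [0, 4] across a range of λ:

```python
import numpy as np

# Illustrative diagonal (PCA-coordinate) problem; all values are assumptions.
eigs = np.array([4.0, 2.0, 1.0, 0.5])
beta = np.array([1.0, -0.5, 0.3, 0.2])
sigma2, n = 1.0, 50

def ridge_risk(lam):
    """Closed-form ridge risk from Lemma 1."""
    variance = (sigma2 / n) * np.sum((eigs / (eigs + lam)) ** 2)
    bias = np.sum(beta ** 2 * eigs / (1.0 + eigs / lam) ** 2) if lam > 0 else 0.0
    return variance + bias

def pca_ols_risk(lam):
    """PCA-OLS risk: keep OLS on coordinates with lambda_j >= lam, drop the rest."""
    keep = eigs >= lam
    return (sigma2 / n) * np.sum(keep) + np.sum((eigs * beta ** 2)[~keep])

# Theorem 2: the ratio lies in [0, 4] for every lambda.
for lam in [0.1, 0.7, 1.5, 3.0, 10.0]:
    ratio = pca_ols_risk(lam) / ridge_risk(lam)
    assert 0.0 <= ratio <= 4.0
    print(f"lambda={lam}: ratio={ratio:.3f}")
```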
3. Experiments

First, we generated synthetic data with p = 100 and varying values of n ∈ {20, 50, 80, 110}. The data was generated in a fixed design setting as Y = Xβ + ε, where ε_i ∼ N(0, 1) for all i = 1, …, n. Furthermore, X_{n×p} ∼ MVN(0, I), where MVN(μ, Σ) is the multivariate normal distribution with mean vector μ and variance-covariance matrix Σ, and β_j ∼ N(0, 1) for all j = 1, …, p. The results are shown in Figure 1. As can be seen, the risk ratio of PCA-OLS to ridge regression (RR) is never worse than 4, and it is often better than 1, as dictated by Theorem 2.

Next, we chose two real world datasets, namely USPS (n=1500, p=241) and BCI (n=400, p=117).² Since we do not know the true model for these datasets, we used all n observations to fit an OLS regression and used it as an estimate of the true parameter β. This is a reasonable approximation to the true parameter, as we estimate the ridge regression (RR) and PCA-OLS models on a small subset of these observations. Next, we chose random subsets of the observations, namely 0.2×p, 0.5×p and 0.8×p, to fit the ridge regression (RR) and PCA-OLS models. The results are shown in Figure 2. As can be seen, the risk ratio of PCA-OLS to ridge regression (RR) is again within a factor of 4, and PCA-OLS is often better, i.e., the ratio is < 1.

4. Conclusion

We showed that the risk inflation of a particular ordinary least squares estimator (on the "top" PCA subspace) is within a factor of 4 of the ridge estimator. It turns out the converse is not true: this PCA estimator may be arbitrarily better than the ridge one.

Appendix A. Proof of Lemma 1

Proof  We analyze the bias-variance decomposition in Equation 1.
For the variance,

    E_Y ||β̂_λ − β̄_λ||²_Σ = Σ_j λ_j E_Y([β̂_λ]_j − [β̄_λ]_j)²
      = Σ_j [λ_j/(λ_j+λ)²] (1/n²) E[ (Σ_{i=1}^n (Y_i − E[Y_i])[X_i]_j) (Σ_{i'=1}^n (Y_{i'} − E[Y_{i'}])[X_{i'}]_j) ]
      = Σ_j [λ_j/(λ_j+λ)²] (1/n²) Σ_{i=1}^n Var(Y_i) [X_i]_j²
      = Σ_j [λ_j/(λ_j+λ)²] (σ²/n²) Σ_{i=1}^n [X_i]_j²
      = (σ²/n) Σ_j λ_j²/(λ_j+λ)²,

where the last step uses Σ_{i=1}^n [X_i]_j² = n λ_j.

2. The details about the datasets can be found here: http://olivier.chapelle.cc/ssl-book/benchmarks.html.

Figure 1: Plots showing the risk ratio as a function of λ, the regularization parameter, and n (panels n=0.2p, n=0.5p, n=0.8p, n=1.1p), for the synthetic dataset. p=100 in all the cases. The error bars correspond to one standard deviation over 100 such random trials.

Figure 2: Plots showing the risk ratio as a function of λ, the regularization parameter, and n, for two real world datasets (BCI and USPS, top to bottom).

Similarly, for the bias,

    ||β̄_λ − β||²_Σ = Σ_j λ_j ([β̄_λ]_j − β_j)²
      = Σ_j β_j² λ_j (λ_j/(λ_j+λ) − 1)²
      = Σ_j β_j² λ_j / (1 + λ_j/λ)²,

which completes the proof. ∎
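Lemma 1 can also be sanity-checked by simulation. The sketch below builds an illustrative fixed design with Σ = (1/n)XᵀX = diag(λ_j) (via a QR factorization; all numeric values are assumptions, not the paper's experiments) and compares the Monte Carlo risk of the ridge estimator against the closed form, which should agree closely:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative values (assumptions, not from the paper's experiments).
n, sigma, lam = 100, 1.0, 0.5
eigs = np.array([3.0, 1.0, 0.25])
beta = np.array([1.0, 0.5, -0.5])

# Build a fixed design with Sigma = (1/n) X^T X = diag(eigs):
# orthonormal columns from QR, scaled so column j has squared norm n * eigs[j].
Q, _ = np.linalg.qr(rng.standard_normal((n, len(eigs))))
X = Q * np.sqrt(n * eigs)

def lemma1_risk():
    """Closed-form ridge risk from Lemma 1."""
    return (sigma**2 / n) * np.sum((eigs / (eigs + lam)) ** 2) \
         + np.sum(beta**2 * eigs / (1.0 + eigs / lam) ** 2)

# Monte Carlo estimate of E || beta_hat - beta ||_Sigma^2.
trials, total = 3000, 0.0
for _ in range(trials):
    Y = X @ beta + sigma * rng.standard_normal(n)
    # (Sigma + lam I)^{-1} (1/n) X^T Y, with diagonal Sigma.
    beta_hat = (X.T @ Y / n) / (eigs + lam)
    total += np.sum(eigs * (beta_hat - beta) ** 2)

print(total / trials, lemma1_risk())  # the two should agree closely
```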
Appendix B. An Example Showing Tightness of the Left-Hand Inequality

The risk of RR can be arbitrarily worse than that of the PCA-OLS estimator. Consider the standard OLS setting described in Section 1, in which X is an n×p matrix and Y is an n×1 vector. Let X = diag(√(1+α), 1, …, 1), so that Σ = XᵀX = diag(1+α, 1, …, 1) for some α > 0, and choose β = [2+α, 0, …, 0]. For convenience, also choose σ² = n. Then, using Lemma 1, the risk of the RR estimator is

    Risk(β̂_λ) = ((1+α)/(1+α+λ))²  [term I]
               + (p−1)/(1+λ)²  [term II]
               + (2+α)²(1+α)/(1+(1+α)/λ)²  [term III].

Consider two cases:

• Case 1: λ < (p−1)^{1/3} − 1. Then II > (p−1)^{1/3}.
• Case 2: λ > 1. Then 1 + (1+α)/λ < 2 + α, hence III > 1 + α.

Combining these two cases, we get, for all λ,

    Risk(β̂_λ) > min((p−1)^{1/3}, 1+α).

If we choose p such that p − 1 = (1+α)³, then Risk(β̂_λ) > 1 + α. The PCA-OLS risk (from Theorem 2) is

    Risk(β̂_{PCA,λ}) = Σ_j 1{λ_j ≥ λ} + Σ_{j: λ_j < λ} λ_j β_j².

For λ ∈ (1, 1+α), the first term contributes 1 to the risk and everything else is 0, so the risk of PCA-OLS is 1 and the risk ratio satisfies

    Risk(β̂_{PCA,λ}) / Risk(β̂_λ) ≤ 1/(1+α).

For large α, the risk ratio ≈ 0.

References

D. P. Foster and E. I. George. The risk inflation criterion for multiple regression. The Annals of Statistics, pages 1947–1975, 1994.

A. N. Tikhonov. Solution of incorrectly formulated problems and the regularization method. Soviet Math Dokl 4, pages 501–504, 1963.