A note on adjusting $R^2$ for using with cross-validation

We show how to adjust the coefficient of determination ($R^2$) when used for measuring predictive accuracy via leave-one-out cross-validation.

Authors: Indrė Žliobaitė, Nikolaj Tatti

Indrė Žliobaitė: Dept. of Geosciences and Geography, University of Helsinki, Finland, and Helsinki Institute for Information Technology HIIT, Aalto University, Finland (indre.zliobaite@helsinki.fi)

Nikolaj Tatti: Helsinki Institute for Information Technology HIIT, Aalto University, Finland

August 27, 2018

1 Background

The coefficient of determination, denoted as $R^2$, is commonly used in evaluating the performance of predictive models, particularly in life sciences. It indicates what proportion of variance in the target variable is explained by model predictions. $R^2$ can be seen as a normalized version of the mean squared error. The normalization is such that $R^2 = 0$ is equivalent to the performance of a naive baseline that always predicts a constant value, equal to the mean of the target variable. $R^2 < 0$ means that the performance is worse than the naive baseline. $R^2 = 1$ is the ideal prediction.

Given a data set of $n$ points, $R^2$ is computed as

$$R^2 = 1 - \frac{\sum^n (y_i - \hat{y}_i)^2}{\sum^n (y_i - \bar{y})^2}, \qquad (1)$$

where $\hat{y}_i$ is the prediction for $y_i$, and $\bar{y}$ is the average value of $y_i$. Traditionally, $R^2$ is computed over all data points used for model fitting.

The naive baseline is a prediction strategy which does not use any model, but simply always predicts a constant value equal to the mean of the target variable, that is, $\hat{y}_i = \bar{y}$. It follows from Eq. (1) that for the naive predictor $R^2 = 0$.

Cross-validation is a standard procedure commonly used in machine learning for assessing out-of-sample performance of a predictive model [1].
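Eq. (1) and the baseline property can be checked with a short numeric sketch (our own illustration, not code from the paper; the data values are arbitrary). Scoring the in-sample mean predictor with Eq. (1) gives exactly $R^2 = 0$, and a perfect predictor gives $R^2 = 1$:

```python
def r2(y, y_hat):
    """Standard R^2 of predictions y_hat against targets y, as in Eq. (1)."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

y = [2.0, 4.0, 5.0, 4.0, 5.0]  # arbitrary example data
y_bar = sum(y) / len(y)

# The naive baseline predicts the full-sample mean for every point,
# so the numerator and denominator of Eq. (1) coincide and R^2 = 0.
print(r2(y, [y_bar] * len(y)))  # 0.0

# Perfect prediction gives the ideal score R^2 = 1.
print(r2(y, list(y)))  # 1.0
```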
The idea is to partition the data into $k$ chunks at random, leave one chunk out from model calibration, use that chunk for testing model performance, and continue the same procedure with all the chunks. Leave-one-out cross-validation (LOOCV) is used when the sample size is particularly small; then the test set consists of one data point at a time.

When cross-validation is used, the naive baseline that always predicts a constant value, namely the average of the outputs in the training set, gives $R^2 < 0$ if computed according to Eq. (1). This happens due to an improper normalization: the denominator in Eq. (1) uses $\bar{y}$, and $\bar{y}$ is computed over the whole dataset, not just the training data.

2 Cross-validated $R^2$

To correct this, we define

$$R^2_{cv} = 1 - \frac{\sum^n (y_i - \hat{y}_i)^2}{\sum^n (y_i - \bar{y}_i)^2},$$

where $\bar{y}_i$ is the average of the outputs without $y_i$,

$$\bar{y}_i = \frac{1}{n-1} \sum_{j=1, j \neq i}^{n} y_j.$$

That is, $\bar{y}_i$ is the naive predictor based on the training data alone. We show that the adjusted $R^2_{cv}$ for leave-one-out cross-validation can be expressed as

$$R^2_{cv} = \frac{R^2 - R^2_{naive}}{1 - R^2_{naive}}, \qquad (2)$$

where $R^2$ is measured in the standard way as in Eq. (1), and $R^2_{naive}$ is the result of the naive constant predictor, equal to

$$R^2_{naive} = 1 - \frac{n^2}{(n-1)^2}, \qquad (3)$$

where $n$ is the number of data points.

[Figure 1: The standard $R^2$ score for the naive constant predictor, as a function of the number of data points $n$.]

Figure 1 plots the standard $R^2$ score for the naive predictor, as per Eq. (3). The remaining part of the paper describes the mathematical proof for this adjustment. We will show that $R^2_{naive}$ does not depend on the variance of the target variable $y$; it depends only on the size of the dataset $n$.

3 How this works

Let us define $R^2_{naive}$ as the $R^2$ score of the naive predictor based on training data,

$$R^2_{naive} = 1 - \frac{\sum (y_i - \bar{y}_i)^2}{\sum (y_i - \bar{y})^2}.$$
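Eq. (3) can be verified numerically (a sketch of our own, not code from the paper; the two data sets below are arbitrary): scoring the leave-one-out constant baseline with the standard Eq. (1) yields $1 - n^2/(n-1)^2$ regardless of the data values or their variance.

```python
def r2_standard(y, y_hat):
    """Standard R^2 as in Eq. (1), normalized by the full-sample mean."""
    y_bar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

def loo_means(y):
    """Leave-one-out means: the naive LOOCV prediction for each point."""
    n, s = len(y), sum(y)
    return [(s - yi) / (n - 1) for yi in y]

# Two unrelated data sets with different variances give the same score,
# depending only on n, matching Eq. (3).
for y in ([1.0, 2.0, 6.0, 3.0], [10.0, -4.0, 0.5, 7.0, 2.0]):
    n = len(y)
    observed = r2_standard(y, loo_means(y))
    predicted = 1.0 - n ** 2 / (n - 1) ** 2  # Eq. (3)
    print(observed, predicted)  # the two values agree
```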
Proposition 1. Let $R^2$ be the $R^2$ score of the predictor. The adjusted $R^2$ is equal to

$$R^2_{cv} = \frac{R^2 - R^2_{naive}}{1 - R^2_{naive}}, \qquad (4)$$

where the leave-one-out cross-validated $R^2_{naive}$ for the constant prediction is

$$R^2_{naive} = 1 - \left( \frac{n}{n-1} \right)^2,$$

where $n$ is the number of data points.

Proof. Let us write

$$A = \sum (y_i - \hat{y}_i)^2, \quad B = \sum (y_i - \bar{y})^2 \quad \text{and} \quad C = \sum (y_i - \bar{y}_i)^2.$$

Note that $R^2 = 1 - A/B$ and $R^2_{cv} = 1 - A/C$. Our first step is to show that $C = \alpha B$, where $\alpha = n^2 / (n-1)^2$.

Note that $A$, $B$ and $C$ do not change if we translate $\{y_i\}$ by a constant; we can therefore assume that $n\bar{y} = \sum_{i=1}^n y_i = 0$. This immediately implies

$$\bar{y}_i = \frac{1}{n-1} \sum_{j=1, j \neq i}^{n} y_j = \frac{n\bar{y} - y_i}{n-1} = \frac{-y_i}{n-1}.$$

The $i$th error term of $C$ is

$$(y_i - \bar{y}_i)^2 = \left( y_i + \frac{y_i}{n-1} \right)^2 = \left( \frac{y_i n}{n-1} \right)^2 = \alpha y_i^2.$$

This leads to $C = \alpha \sum_{i=1}^n y_i^2 = \alpha B$. Finally,

$$\frac{R^2 - R^2_{naive}}{1 - R^2_{naive}} = \frac{R^2 - 1 + \alpha}{\alpha} = \frac{1 - A/B - 1 + \alpha}{\alpha} = 1 - \frac{A}{\alpha B} = 1 - \frac{A}{C} = R^2_{cv},$$

which concludes the proof.

References

[1] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.
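Proposition 1 can also be checked numerically (an illustrative sketch of our own; the targets and predictions below are arbitrary, not from any fitted model): the adjustment of Eq. (2) applied to the standard $R^2$ coincides with $R^2_{cv}$ computed directly from its definition.

```python
y     = [3.0, -1.0, 4.0, 1.5, 5.0, 2.0]   # arbitrary targets
y_hat = [2.5,  0.0, 3.0, 2.0, 4.0, 1.0]   # arbitrary LOOCV predictions
n = len(y)
s = sum(y)
y_bar = s / n
loo_bar = [(s - yi) / (n - 1) for yi in y]  # leave-one-out training means

# The three sums of squares from the proof.
A = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
B = sum((yi - y_bar) ** 2 for yi in y)
C = sum((yi - yb) ** 2 for yi, yb in zip(y, loo_bar))

r2       = 1.0 - A / B                          # Eq. (1)
r2_naive = 1.0 - n ** 2 / (n - 1) ** 2          # Eq. (3)
r2_cv    = 1.0 - A / C                          # definition of R^2_cv
adjusted = (r2 - r2_naive) / (1.0 - r2_naive)   # Eq. (2)

print(abs(r2_cv - adjusted) < 1e-9)  # True
```

The check also exercises the key identity from the proof, $C = \alpha B$ with $\alpha = n^2/(n-1)^2$, which holds for any data set without centering.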
