A note on basis dimension selection in generalized additive modelling

A note on basis dimensi on selection in genera lized additi ve modelling Natalya Pya and Simon N W ood ∗ Nazarbaye v University and Uni versity of Br istol February 23, 2016 Abstract T wo new approa ches for checking th e dimension o f th e basis f unctions wh en using pe nalized regression smoothers a re p resented. The ﬁrst approa ch is a test f or adeq uacy of the basis d imension based on an estimate of the residual v ar iance calculated by differencing residuals that are neighb ours accordin g to the smoo th covariates. The seco nd app roach is based o n e stimated d egrees of freedo m for a smooth of the mo del residua ls with respect to the model covariates. In co mparison with b asis dimension selection alg orithms based on smoothness selection cr iterion (GCV , AIC, REML) the above procedures are compu tationally efﬁcient enough for routine use as part of model checking. 1 Introd uction Penalized re gression smoothing splines ha ve been use d exten si vely for repres entati on of smooth mod el terms in generaliz ed ad ditiv e modelli ng [O’Sulliv an, 1986 , Eilers and Marx, 1996, W ood, 2006]. In com- pariso n w ith fu ll rank smoo thing splines [W ahba, 1990, Gu, 2002 ], with a free para m eter for each data point, penalized regressi on sp lines ar e computationa lly at tracti ve as the basis dimension, k , is set t o much less than the number of data p oints. Provide d tha t k is la r ge enough to a void o versmooth ing/un derﬁtting, its e xact v alue is rather unimport ant, since ov erﬁtting and the ef fectiv e degrees of freedo m of the s mooth are then con trolled by the weight g i ven to the smoothi ng pe nalty during estimatio n, which is determin ed by the smoo thing pa rameter , λ , rat her than k . Ho wev er it is necessary to check that k really is large enoug h, and here useful readily applied ch ecking metho ds are lacking . The choice of k has not bee n w idely discussed in the liter ature. Ruppe rt [20 02 ] pro posed two algo- rithms bas ed on minimiz ing G CV o ver a set of spec iﬁed basis dimen sions. H e demo nstrate d empirica lly that there is a minimal threshold for th e number of knots ne eded for a good ﬁt, larger n umbers of knots ha ve only slight inﬂuence on the ﬁt. Theoretic al justiﬁcation s are pr esente d in Li and Ruppert [2008] who also sho w that the threshol d depends on the degre e of the splin e order . Kau ermann an d Opsomer [2011] intr oduce d a lik elihood based criterion for selec ting k in the cas e of the mixed-mod el speci ﬁca- tion of the penalized spl ine [Ruppert et al., 2003, W and an d Ormerod, 200 8]. In this speciﬁcat ion the smoothin g parameter λ , is represented as the rat io of the residual varian ce and the va riance of the spline coef ﬁcients which all o w s estimating λ simultaneo usly w ith the coef ﬁcients by li keli hood max imizatio n. This is equi vale nt to t he restricted maximum like lihoo d criteri on, REML [W ood, 20 06]. k is treated as an ∗ simon.wood@ba th.edu 1 additi onal discrete parame ter of the log likelih ood in Kauerman n and Opsomer [2011]. T wo nested op- timizatio ns are proposed for parameters es timation : for ea ch va lue for k log lik elihood is re-maximized with respe ct to m odel paramete rs. Searchin g for the ‘optimal’ k is clearly a comp utatio nally expe nsi ve op tion, and ar guably of limited practic al uti lity gi ven th e empirica l and the oretic al evide nce that all th at really matters is that k is not to o small. This note therefo re propose s two computatio nally ef ﬁcient approa ches for checking the adequa cy of the ba sis dimension from a ﬁtted model, and comp ares their practica l efﬁcac y to a fuller se arch for an optimal k , in a small simulatio n stu dy . 2 Methods f or checking the basis dimension W e consider a genera lized ad diti ve model g ( µ i ) = A θ + p X j =1 f j ( x j i ) , Y i ∼ EF( µ i , φ ) , i = 1 , . . . , n, (1) where Y i are independe nt univ ariate res ponse v ariables from an expone ntial f amily dis trib ution with mean µ i and sca le parameter φ , g is a kn own smooth monotoni c link function, A is a model matrix, θ is a vec tor of unkno wn parameters, f j is an unkno wn smooth function of predictor var iable x j that may be vec tor v alued. For representing the smooth function s f j ( x j ) v arious penalized regress ion smoother s are a vail able, such as cubi c re gression spline s and P-splines for smooths of a single co vari ate, or thin pla te re gression spline s and tensor pro duct smooth s for repr esenti ng smo oths of s e veral cov ariates. T he ide a is to specify a basis for each function and choose an appr opriate set of basis functions, B j t , so that the j th smooth functi on ca n be represe nted as f j ( x j ) = k j X t =1 B j t ( x j ) β j t , where β j t are c oef ﬁcients to be estimated, and k j is a numb er of basis functio ns. After selecting smoot h- ing penalties for each smooth function, the penal ized log likelihoo d functio n is maximized to get esti- mates of β j t gi ven smoothing parameter s. The smoothing paramet ers, λ j , contr olling the strength of penali zation can be estimated b y optimizi ng criteria such as G CV , AIC, or REML. Although the smoot hness of each smooth term in (1) is controlle d by the smooth ing parameter , λ j , the basis di mension k j used for s mooth rep resent ation needs to b e suf ﬁciently lar ge to a voi d ov ersmoothing . T o ensure that k j is adequ ate the follo wing ap proach es will be consider ed 1. A hypot hesis testing approac h based on comparison of the resid ual v ariance estimate from the whole model ﬁt, with th e resi dual va riance estimat ed by dif ferencin g residuals that are ‘n eigh- bours’ accord ing to th e cov ariates of the smooth term under considera tion. 2. A proced ure based on re-smoothing the model resid uals with respec t to the cov ariates of the smooth of inter est, using a higher basi s di mension , in order to search fo r missed pattern in the residu als. 3. Optimizing the GCV or REML criterio n with respect to the basis dimensions, as w ell as the smoothin g parameters. The ﬁrst two proce dures do not seem to ha ve appeared in the literatu re, while t he third is essentially what is propose d in Ruppert [2002 ] and Kauer mann and Opsomer [2011]. Option 1 is simple and efﬁcient enoug h for routine automatic us e. 2 2.1 Hypothesis testing appr o ach T o test whet her k j is suf ﬁciently lar ge for rep resent ing f j ( x j ) the fo llo wing test is pro posed . Let r i denote dev iance residuals for the m odel that inc ludes f j ( x j ) , in the Gaussian case simply r i = y i − ˆ µ i . First conside r a smooth of a univ ariate x j . Let ˜ r denote the v ector o f residuals r i , ord ered accord ing to the v alues of the obser ved cov ariates, x j i . D eﬁne ∆ i = ˜ r i +1 − ˜ r i for i = 1 , . . . , n − 1 . Then φ ∆ = 1 2 n − 2 P n − 1 i =1 ∆ 2 i is an estimate of the scale pa rameter , a nd if ˆ φ is the cor respon ding estimate from the model ﬁt, we can deﬁne a statisti c κ = φ ∆ / ˆ φ which should be cl ose to one if the basis dimension is adequate, bu t sho uld be less than one if k j is too small so that there is residu al pattern left in the residua ls. In the c ase o f multi varia te x j , th en th e M n earest n eighb ours of e ach o bserva tion can be fo und based on th e pro ximity of x j vec tors. This has relati vely modest total cost O ( nM log( n )) , see Press et al. (2007 ). Let m ij be the index of the j th neare st neighbo ur of obse rv ation i (exclu ding self). Then we deﬁne ∆ ij = r i − r m ij for i = 1 , . . . n , j = 1 , . . . , M , and φ ∆ = 1 2 M n P M j =1 P n i =1 ∆ 2 ij . The ex pressi on for κ is as before. The dis trib ution of κ under the nu ll hypothesis that k j is ad equate (so that there is no pattern in the residu als orde red wit h respec t to the c ov ariate), can be s imulated by repeatedl y rando mly re-shu fﬂing th e origin al v ector of residu als, and re- computi ng κ for each replicate. A p-v alue can then be obtained. If it is too small, we shou ld co nside r increasing k j . 2.2 Residual re -smoothing A secon d simple method for checkin g k j is as follo ws. 1. Fit the model with your choice of k j and ext ract the de viance res iduals , r i , i = 1 , . . . , n . 2. Create a smoot h, f ∗ j ( x j ) , equi v alent to f j ( x j ) , b ut with basis di mension 2 k j , then estimate the model r i = f ∗ j ( x j i ) + e i , 3. If the estimate o f f ∗ j indica tes that ther e is pattern in the resi duals (e.g. if the effec tive degrees of freedo m of ˆ f ∗ is more than the minimum possi ble for such a smoot h) then k j may not be lar ge enoug h, and an in crease of k j should be consid ered. It is al so wort h consi dering changes in the smo othing select ion criterion after ﬁtting with increased k j . If the valu e of the criterion is increased or the decre ase is not m ore than 2% of the pre vious valu e, then it is sugges ted to stop doublin g k j since it is not exp ected to impro ve the criterion [Ruppert, 2002]. 2.3 GCV/REML based appr oach The ﬁ nal approach is the simplest, b ut also the lea st computatio nally ef ﬁcient. The idea is simply to search o ver a spec iﬁ ed set k j v alues for the optimize r of the criter ion used for smoothi ng parameter esti- mation. T he full model has to be re-ﬁtted for each set of trial values of k j , so this is quite costly compu- tation ally . Suc h alg orithms were p roposed ﬁrst by Rupp ert an d Carro ll [2000] an d Rupp ert [2002] f or P- spline s with the basi s dimension selec tion by minimizing the GC V criterion, w hile K auermann and Opsomer [2011] sugge sts a s imilar like lihood based method . 3 0.0 0.2 0.4 0.6 0.8 1.0 −0.5 0.0 0.5 1.0 1.5 (a) x y 0.0 0.2 0.4 0.6 0.8 1.0 −0.5 0.5 1.5 2.5 (b) x y 0.0 0.2 0.4 0.6 0.8 1.0 −1.5 −0.5 0.5 (c) x y Figure 1: Shapes of the uni v ariate test func tions used for simulation study . 3 Illustrative simulated examples The approaches were compar ed on ﬁve simulated ex amples of Gaussian data: three uni vari ate single smooth models follo wing Rupper t [2002], a biv ariate single smooth model and an add iti ve model with two smoo th terms. y i = f t ( x i ) + e i , t = 1 , 2 , 3 , y i = f ( x 1 i , x 2 i ) + e i , y i = f 1 ( x 1 i ) + f 2 ( x 2 i ) + e i , e i ∼ N (0 , σ 2 ) . For all examples data were simulated ind epend ently from y i ∼ N ( µ i , 0 . 2 2 ) . For uni varia te cases x i were equal ly sp aced on [0 , 1] . Three test funct ions f t ( x i ) with shap es shown in Fig. 1 were applied. For sample sizes n = 100 and 2 00 , 30 0 replicate s were simulate d. Each single smooth regress ion m odel was ﬁ t usin g the R pack age mgc v [W ood, 2006]. T hin plate regressio n splines with the second order deri vati ve in p enalty were used fo r representatio n of all smooths [W ood, 200 3]. The basis dimension was selecte d by the approaches presented in section 2. F or methods 1 and 2 the basis dimen sion for a term was doubled if the check suggested th at the current dimension wa s inadeq uate. T wo v ariants of method 3 were tried , one using GCV and the ot her REM L. T he tested v alues of k were 10, 20, 40, 80. The performanc e of the all methods w as ev aluate d by the mean sum of squared dif ferences be tween the ﬁtted, ˆ f ( x ) , and true val ues of f t ( x ) , MSE = n − 1 n X i =1 n ˆ f ( x i ) − f t ( x i ) o 2 . The results are summariz ed in Fig. 2. Fig. 3 sho ws bar plots of k sele cted by the four algorithms. The simulatio n results sh o w that the perfor mance of al l fou r alg orithms is q uite similar e xcepting REML when ﬁ tting the s ine f unctio n with n = 100 . In t hat c ase REML tended to select a lar ger basis dimen sion than other methods which resulted in lar ger MSE . The basis dimensions required for ﬁtting the f 1 are around 10 – 20 which is typical for man y monoton e functions used in practic e. It is clear th at k = 10 is too small for the ‘bump’ fu nction f 2 and sine wa ves with six c ycles f 3 . A ll alg orithms selected at lea st 4 pv sm gcv reml pv sm gcv reml pv sm gcv reml 0.000 0.005 0.010 0.015 0.020 MSE f 1 f 2 f 3 pv sm gcv reml pv sm gcv reml pv sm gcv reml 0.000 0.004 0.008 MSE f 1 f 2 f 3 Figure 2: MS E comparison between h ypothe sis testing approach ( pv ), resi duals smoothin g ( sm ), GCV search ( gcv ), an d REML searc h ( reml ) method s for th ree sing le uni variat e smooth term model s. T he upper panel sho ws the re sults for n = 100 , the lower for n = 200 . 10 20 40 80 pv sm gcv reml frequency 0 50 100 150 200 250 300 f 1 10 20 40 80 0 50 100 150 200 250 300 f 2 10 20 40 80 0 50 100 150 200 250 300 f 3 10 20 40 80 frequency 0 50 100 150 200 250 300 f 1 10 20 40 80 0 50 100 150 200 250 300 f 2 10 20 40 80 0 50 100 150 200 250 300 f 3 Figure 3: Bar plots of k selected by the hyp othesis testing app roach ( pv ), residual s smoothing ( sm ), GCV search ( gcv ), and REML search ( re ml ) methods for three single uni v ariate smooth term models. Three upper pane ls show the results for n = 100 , the lower p anels for n = 200 . 5 x1 x2 f pv sm gcv reml pv sm gcv reml 0.002 0.004 0.006 0.008 MSE n = 400 n = 900 15 30 60 120 pv sm gcv reml frequency 0 50 100 150 200 n = 400 15 30 60 120 frequency 0 50 100 150 200 n = 900 Figure 4: Plots for biv ariate example. Upper left: perspect i ve plot of the biv ariate test functi on; upper right: MSE comparison between four approa ches; lower left: bar plots o f k selected fo r n = 400 ; lower right: bar plots of k selected for n = 900 . 20 basis functions to gi ve satisfac tory ﬁts for those much ‘wigg lier’ functions. It can be noticed that the hypot hesis testing approach stops earlier and chooses smaller b asis dimensio ns compared to the other methods . The shape of the bi va riate te st fu nction use d in the nex t e xample is s ho wn in Fig. 4. 20 a nd 30 v alues of each o f the two co variat es, x 1 i and x 2 i , were equi distan t in [ − 1 , 3] and [0 , 1] , g i ving samples of sizes n = 400 and 900 . 200 repli cate data sets were simulated for each sample size. T he trial value s of the basis dimension were 15, 30, 60, and 120. While sett ing k = 15 is t oo low for all the approach es, k = 30 appear ed to be en ough with only GCV choo sing occasion ally larger k for n = 40 0 and more often for n = 900 (see Fig. 4). The last e xample is an additi ve model with two smo oth terms, y i = f 1 ( x 1 i ) + f 2 ( x 2 i ) + e i , where f 1 is the ﬁrst functi on used in th e uni variate case and f 2 is a sine funct ion with a single cycle . x 1 i and x 2 i were i.i.d. U (0 , 1) . Sample size s n = 200 and 400 were considered. The simulation result s of this study base d on 200 repl icates are sho wn in Fig. 5. The beha viour of all algorithms is similar to th at of the univ ariate examp le. Ho w e ver , the main beneﬁt of th e hypothes is testing and re-smoothing residu als approa ches is in their computa tional efﬁcien cy . The G CV/REML algorithms require the w hole model to be re-ﬁtted at ev ery combi nation of ( k 1 , k 2 ) values ov er the speciﬁed grid. But the hy pothes is testing approa ch needs only calculating the estimate of the residual var iance and its p-valu e. And the secon d approa ch requires re-s moothin g of a singl e term rather th an the whole model. R e-ﬁtting the whole model with an in crease d value of the suspect ed k j occurs only in case of a lo w p-v alue or high edf for the particu lar smooth te rm. The computation al adv antage gro ws with the number of smooth terms in the GAM. 6 pv sm gcv reml pv sm gcv reml 0.002 0.004 0.006 0.008 MSE n = 200 n = 400 10 20 40 80 pv sm gcv reml frequency 0 50 100 150 200 k 1 , n = 200 10 20 40 80 frequency 0 50 100 150 200 k 2 , n = 200 10 20 40 80 frequency 0 50 100 150 200 k 1 , n = 400 10 20 40 80 frequency 0 50 100 150 200 k 2 , n = 400 Figure 5: Plots for the ad ditiv e model. Upper left: boxplots of th e MSE of four ap proach es for n = 200 and 400 ; upper m iddle and right : bar plots of k 1 and k 2 selecte d for two smooth terms, f 1 and f 2 , for n = 200 ; lower p anels: bar plot s of k i for two smoot hs for n = 400 . 4 Conclusions In pract ice the simple and ef ﬁcient basis dimension check propos ed in Section 2.1 appear s to perform as well as more expen si ve approa ches requi ring m ultipl e m odel ﬁts. Giv en its computa tional ef ﬁciency it seems sensible to inco rporate this check as a standard part of model check ing for pena lized re gressi on spline based generalize d additi ve models. Of course as with any model checking, the methods hav e to be us ed with care. Mean v ariance relations hip problems and residu al au tocorr elation problems unrelated to k j can ob viously cause any of the methods consider ed he re to suggest increasin g k j , when the real proble m li es else where. This work was s upported by EPSRC gra nts EP/K005251/1 and EP/1000917/1 Refer ences P .H. Eilers and B.D. Marx. Flexib le smoothing w ith B-splines and penalties. S tatistical Scie nce , 11: 89–12 1, 1996. C. Gu. Smooth ing Spline ANO V A Models . Ne w Y ork: Springer , 2002. G. Kauermann and J.D. Opso mer . Data-dri ven select ion of th e spline dimen sion in pen alized spline reg ression . Biometrika , 98(1):22 5–230, 2011. Y . Li and D. Ruppert. On the asympto tics of penal ized sp lines. Biometrika , 95(2) :415–436, 2008. F . O’Sulli van. A stati stical perspecti ve o n ill-po sed in verse proble ms. Stat istical Science , 1:50 2–518, 1986. 7 D. Ruppert. Selecting the number of kno ts for penali zed spli nes. Jour nal of Compu tation al and Grap h- ical Statis tics , 1 1(4):7 35–75 7, 2002. D. Ruppert and R.J. C arroll. Spatially-ad aptiv e penal ties for spli ne ﬁttin g. Au str alian and N ew Zeal and J ournal of St atistic s , 42:205 –223, 20 00. D. Ruppert, M.P . W an d, a nd R.J. Carroll. Semiparamet ric Re gr ession . New Y ork: Cambrid ge Uni versi ty Press, 2003 . G. W ahba. Spline models for observ ational data . Philadel phia: SIAM, 1990 . M.P . W and and J.T . Ormerod. On semipa rametric regressio n with o’ su lli van penalized splines. Austr alian and New Zeal and Jou rnal of Statist ics , 50 (2), 2008 . S.N. W ood. Thin plate re gression splines. J ournal of the R oyal Stat istica l So ciety: Series B , 65(1): 95–11 4, 2003. S.N. W ood. Gener alized Additive Models. An Intr oduc tion with R . Chapman & Hal l, 20 06. 8

A note on basis dimension selection in generalized additive modelling

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment