Comment: Boosting Algorithms: Regularization, Prediction and Model Fitting

Comment on ``Boosting Algorithms: Regularization, Prediction and Model Fitting'' [arXiv:0804.2752]

Authors: Trevor Hastie

Statistical Science 2007, Vol. 22, No. 4, 513–515
DOI: 10.1214/07-STS242A; Main article DOI: 10.1214/07-STS242
© Institute of Mathematical Statistics, 2007

Trevor Hastie

(Trevor Hastie is Professor, Department of Statistics, Stanford University, Stanford, California 94305, USA; e-mail: hastie@stanford.edu.)

We congratulate the authors (hereafter BH) for an interesting take on the boosting technology, and for developing a modular computational environment in R for exploring their models. Their use of low-degree-of-freedom smoothing splines as a base learner provides an interesting approach to adaptive additive modeling. The notion of "Twin Boosting" is interesting as well; besides the adaptive lasso, we have seen the idea applied more directly for the lasso and Dantzig selector (James, Radchenko and Lv (2007)). In this discussion we elaborate on the connections between L2-boosting of a linear model and infinitesimal forward stagewise linear regression. We then take the authors to task on their definition of degrees of freedom.

1. L2-BOOST AND INFINITESIMAL FORWARD STAGEWISE LINEAR REGRESSION

Motivated by a version of L2-boosting in Chapter 10 of Hastie, Tibshirani and Friedman (2001), Efron, Hastie, Johnstone and Tibshirani (2004) proposed the LARS algorithm. The intent was to:

• develop a limiting version of L2-boost in which the step-length ν went to zero;
• show that this limiting version gave paths identical to the lasso, as was hinted in that chapter.

The result was three very similar varieties of the LARS algorithm, namely lasso, LAR and infinitesimal forward stagewise (iFSLR) (package lars for R,
available from CRAN). iFSLR is indeed the limit of L2-boost as ν ↓ 0, with piecewise-linear coefficient profiles, but it is not always the same as the lasso.

On a slight technical note, the version of L2-boost proposed in BH is slightly different from that in Hastie, Tibshirani and Friedman (2001). Compare

[BH]   β̂^[m] = β̂^[m−1] + ν · β̂^(Ŝ_m),        (1)
[HTF]  β̂^[m] = β̂^[m−1] + ν · sign[β̂^(Ŝ_m)].   (2)

Despite the difference, they both have the same limit, which is computed exactly for squared-error loss by the type="forward.stagewise" option in the package lars. As ν gets very small, initially the same coefficient tends to get continuously updated by infinitesimal amounts (hence linearly). Eventually a second variable ties with the first for coefficient updates, which they share in a balanced way while remaining tied. Then a third joins in, and so on. Using simple least-squares computations, the LARS algorithm computes the entire iFSLR path with the same cost as a single multiple-least-squares fit. Note that in this limiting case we can no longer index the sequence by step-number m as in (1) or (2), but must resort to some other measure, such as the L1-arc-length of the coefficient profile (Hastie, Taylor, Tibshirani and Walther (2007)).

Lasso and iFSLR are not always the same. In high-dimensional problems with correlated predictors, lasso profiles become wiggly quickly, whereas iFSLR profiles tend to be much smoother and monotone (Hastie et al., 2007). Efron et al.
(2004) establish sufficient positive cone conditions on the model matrix X which effectively limit the amount of correlation between the variables and guarantee that lasso and iFSLR are the same; in particular, if the lasso profiles are monotone, all three algorithms are identical.

2. DEGREES OF FREEDOM

The authors propose a simple formula for the degrees of freedom of an L2-boosted model. They construct the hat matrix B_m that computes the fit at iteration m, and then use df(m) = trace(B_m). They are in effect treating the model at stage m as if it were computed by a predetermined sequence of linear updates. If this were the case, their formula would be spot on, by the accepted definitions of effective degrees of freedom for linear operators (Hastie et al., 2001; Efron et al., 2004). They acknowledge that this is an approximation (since the sequence was not predetermined, but rather adaptively chosen), but do not elaborate. In fact this approximation can be very badly off. Figure 1 shows the true degrees of freedom df_T(k) plotted against df(k) for two examples. We see that df(k) always underestimates df_T(k). We now discuss the details of these examples, and the basis for these claims.

Fig. 1. The effective degrees of freedom for L2-boost computed using the trace formula (vertical axis) vs. the exact degrees of freedom. The left plot is for the prostate cancer data example; the right plot is for a simulated univariate smoothing problem. In both cases df(m) underestimates the true degrees of freedom quite dramatically.

The left example is the prostate data (Hastie et al., 2001, Figure 10.12) and has 67 observations and 9 predictors (including intercept).
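The finite-ν iteration behind these fits, update (2) above, repeatedly nudges the coefficient of the predictor most correlated with the current residual. The following toy sketch in Python (not the lars implementation, which computes the exact ν ↓ 0 limit; the step length and function name here are invented for illustration) shows the mechanics:

```python
import numpy as np

def forward_stagewise(X, y, nu=0.01, n_steps=500):
    """Toy sketch of the HTF-style L2-boost update (2): at each
    step, find the predictor most correlated with the current
    residual and move its coefficient by nu * sign of that
    correlation.  As nu -> 0 this approaches the iFSLR path
    discussed in Section 1."""
    n, p = X.shape
    beta = np.zeros(p)
    path = [beta.copy()]
    r = y.astype(float).copy()        # residual starts at y
    for _ in range(n_steps):
        corr = X.T @ r                    # correlations with residual
        j = int(np.argmax(np.abs(corr)))  # selected predictor S_m
        step = nu * np.sign(corr[j])
        beta[j] += step                   # update (2)
        r -= step * X[:, j]               # refresh residual
        path.append(beta.copy())
    return np.array(path)
```

With a small ν, the coefficients of tied predictors advance together in tiny increments, tracing out piecewise-linear profiles of the kind shown in Figure 2.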
The right example fits a univariate piecewise-constant spline model of the form f(x) = Σ_{j=1}^{50} β_j h_j(x), where the h_j(x) = I(x ≥ c_j) are a sequence of Haar basis functions with predefined knots c_j at the unique values of the inputs x_i. There are 50 observations and 50 predictors. In both problems we fit the limiting L2-boost model iFSLR, using the lars/forward.stagewise procedure. Figure 2 shows the coefficient profiles.

Fig. 2. Coefficient profiles for the iFSLR algorithm for the two examples. Both profiles are monotone, and are identical to the lasso profiles on these examples. In this case the df increment by 1 exactly at every vertical break-point line.

In this case, using the results in Efron et al. (2004), it can be deduced that the equivalent limiting version of the hat matrix (5.6) of BH simplifies to a similar but more compact expression:

B_k = I − (I − γ_k H_k)(I − γ_{k−1} H_{k−1}) · · · (I − γ_1 H_1).   (3)

Here k indexes the step number in the lars algorithm, where the steps delineate the breakpoints in the piecewise-linear path. H_j is the hat matrix corresponding to the variables involved in the jth portion of the piecewise-linear path, and γ_j is the relative distance in arc-length traveled along this piece until the next variable joins the active set (relative to the arc-length of the step that went all the way to the least squares fit). Using the BH definition, we would compute df(k) = trace(B_k) (vertical axis in Figure 1).

These two examples were chosen carefully, for they both satisfy the positive cone condition mentioned above. In particular, the iFSLR path is the lasso path in both cases, and the active set grows by one at each step. More importantly, it is under these conditions that Efron et al.
(2004) established that df_T(k) = k + 1 exactly (horizontal axis in Figure 1). The +1 takes care of the intercept.

Consider the first step. The dominant variable enters the model, and gets its coefficient incremented until we reach the point where the next competitor is about to enter. At this point the df is exactly 2, while the formula gives df(1) = trace(B_1) = 1.48 for the first example in Figure 1; this is off by 25%. The exact df satisfies our intuition as well. If the first variable is far more significant than the rest, we will almost fit it entirely (γ_1 ≈ 1) before the next one enters, and at that point the model has 2 df. There is virtually no price for searching, because searching was not really needed. On the other hand, if many variables are competing for the first slot, then shortly after the chosen one enters, another might appear, long before the first is fit completely (γ_1 ≪ 1). Here the model also has 2 df, despite the fact that the first variable has hardly progressed at all. This is the price paid for selection. Even when the positive cone conditions are not satisfied, it can be shown that the size of the active set is an unbiased estimate of the true df (Zou, Hastie and Tibshirani (2007)).

It is possible that the authors can devise a correction for their df(k) formula, based on the insights learned here. In some cases it may be possible to calibrate the formula to match the size of the active set. Failing that, one can use bootstrap methods to estimate df. But if the main purpose of estimating df is model selection, K-fold cross-validation is a useful alternative.

ACKNOWLEDGMENTS

This research was supported by NSF Grant DMS-05-05676 and NIH Grant 2R01 CA 72028-07.

REFERENCES

Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression (with discussion). Ann. Statist. 32 407–499.
Hastie, T., Taylor, J., Tibshirani, R. and Walther, G. (2007). Forward stagewise regression and the monotone lasso. Electron. J. Statist. 1 1–29.
Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.
James, G., Radchenko, P. and Lv, J. (2007). The DASSO algorithm for fitting the Dantzig selector and the lasso. Technical report, Marshall School of Business, Univ. Southern California.
Zou, H., Hastie, T. and Tibshirani, R. (2007). On the "degrees of freedom" of the lasso. Ann. Statist. 35 2173–2192.
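As a small numerical footnote to Section 2, display (3) is easy to evaluate directly. The Python sketch below builds B_k from a supplied sequence of active sets and relative step lengths γ_j and returns the trace-based df; the function name, matrix X and γ values are invented inputs, not taken from the paper's examples (in practice the active sets and γ_j come from a lars/forward.stagewise run):

```python
import numpy as np

def df_trace(X, active_sets, gammas):
    """Evaluate display (3):
        B_k = I - (I - g_k H_k) ... (I - g_1 H_1),
    where H_j is the hat matrix of the variables active on the
    j-th piecewise-linear segment and g_j its relative step
    length, and return the BH-style df = trace(B_k)."""
    n = X.shape[0]
    prod = np.eye(n)
    for A, g in zip(active_sets, gammas):
        XA = X[:, A]
        H = XA @ np.linalg.pinv(XA)   # projection onto active columns
        prod = (np.eye(n) - g * H) @ prod
    return float(np.trace(np.eye(n) - prod))
```

After one step with a single active variable, trace(B_1) = γ_1 · 1, so whenever γ_1 < 1 the trace formula credits the first selected variable with less than one full degree of freedom, in line with the underestimation shown in Figure 1.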
