Recursive Bias Estimation and $L_2$ Boosting


Authors: Pierre-André Cornillon, Nicolas Hengartner, Eric Matzner-Løber

Submitted to the Annals of Statistics

RECURSIVE BIAS ESTIMATION AND $L_2$ BOOSTING

By Pierre-André Cornillon, Nicolas Hengartner and Eric Matzner-Løber
Montpellier SupAgro, University Rennes 2 and Los Alamos National Laboratory

This paper presents a general iterative bias correction procedure for regression smoothers. This bias reduction schema is shown to correspond operationally to the $L_2$ Boosting algorithm and provides a new statistical interpretation for $L_2$ Boosting. We analyze the behavior of the Boosting algorithm applied to common smoothers $S$, which we show depends on the spectrum of $I - S$. We present examples of common smoothers for which Boosting generates a divergent sequence. The statistical interpretation suggests combining the algorithm with an appropriate stopping rule for the iterative procedure. Finally, we illustrate the practical finite sample performance of the iterative smoother via a simulation study.

AMS 2000 subject classifications: 62G08.
Keywords and phrases: nonparametric regression, smoother, kernel, nearest neighbor, smoothing splines, stopping rules.

1. Introduction. Regression is a fundamental data analysis tool for uncovering functional relationships between pairs of observations $(X_i, Y_i)$, $i = 1, \ldots, n$. The traditional approach specifies a parametric family of regression functions to describe the conditional expectation of the dependent variable $Y$ given the independent variables $X \in \mathbb{R}^p$, and estimates the free parameters by minimizing the squared error between the predicted values and the data.

An alternative approach is to assume that the regression function varies smoothly in the independent variable $x$ and to estimate locally the conditional expectation of $Y$ given $X$. This results in nonparametric regression estimators (e.g. Fan and Gijbels [13], Hastie and Tibshirani [19], Simonoff [34]). The vector of predicted values $\hat Y_i$ at the observed covariates $X_i$ from a nonparametric regression is called a regression smoother, or simply a smoother, because the predicted values $\hat Y_i$ are less variable than the original observations $Y_i$.

Over the past thirty years, numerous smoothers have been proposed: running-mean smoother, running-line smoother, bin smoother, kernel-based smoother (Nadaraya [29], Watson [38]), spline regression smoother, smoothing splines smoother (Wahba [37], Whittaker [39]), locally weighted running-line smoother (Cleveland [6]), just to mention a few. We refer to Buja et al. [5], Eubank [12], Fan and Gijbels [13], Hastie and Tibshirani [19] for more in-depth treatments of regression smoothers.

An important property of smoothers is that they do not require a rigid (parametric) specification of the regression function. That is, we model the pairs $(X_i, Y_i)$ as

(1.1)  $Y_i = m(X_i) + \varepsilon_i, \qquad i = 1, \ldots, n,$

where $m(\cdot)$ is an unknown smooth function. The disturbances $\varepsilon_i$ are independent mean zero and variance $\sigma^2$ random variables that are independent of the covariates $X_i$, $i = 1, \ldots, n$.

To help our discussion on smoothers, we rewrite Equation (1.1) compactly in vector form by setting $Y = (Y_1, \ldots, Y_n)^t$, $m = (m(X_1), \ldots, m(X_n))^t$ and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^t$, to get

(1.2)  $Y = m + \varepsilon.$
Finally, we write $\hat m = \hat Y = (\hat Y_1, \ldots, \hat Y_n)^t$ for the vector of fitted values from the regression smoother at the observations.

Operationally, linear smoothers can be written as $\hat m = S_\lambda Y$, where $S_\lambda$ is an $n \times n$ smoothing matrix. While in general the smoothing matrix will not be a projection, it is usually a contraction (Buja et al. [5]), that is, $\|S_\lambda Y\| \le \|Y\|$.

Smoothing matrices $S_\lambda$ typically depend on a tuning parameter, denoted by $\lambda$, that governs the tradeoff between the smoothness of the estimate and the goodness-of-fit of the smoother to the data. We parameterize the smoothing matrix such that large values of $\lambda$ produce very smooth curves while small $\lambda$ produce a more wiggly curve that wants to interpolate the data. The parameter $\lambda$ is the bandwidth for kernel smoothers, the span size for running-mean and bin smoothers, and the penalty factor for spline smoothers.

Much has been written on how to select an appropriate smoothing parameter; see for example Simonoff [34]. Ideally, we want to choose the smoothing parameter $\lambda$ to minimize the expected squared prediction error. But without explicit knowledge of the underlying regression function, the prediction error cannot be computed. Instead, one minimizes estimates of the prediction error using Stein's Unbiased Risk Estimate or Cross-Validation (Li [26]).

This paper takes a different approach. Instead of selecting the tuning parameter $\lambda$, we fix it at some reasonably large value that ensures that the resulting smoother oversmooths the data; that is, the resulting smoother will have a relatively small variance but a substantial bias. Observe that the conditional expectation of $-R = -(Y - \hat Y)$ given $X$ is the bias of the smoother. This provides us with the opportunity of estimating the bias by smoothing the residuals $R$, thereby enabling us to bias correct the initial smoother by subtracting from it the estimated bias. The idea of estimating the bias from residuals to correct a pilot estimator of a regression function goes back to the concept of twicing introduced by Tukey [35] to estimate bias from model misspecification in multivariate regression. Obviously, one can iteratively repeat the bias correction step until the increase in variance from the bias correction outweighs the magnitude of the reduction in bias, leading to an iterative bias correction.

Another iterative function estimation method, seemingly unrelated to bias reduction, is Boosting. Boosting was introduced as a machine learning algorithm for combining multiple weak learners by averaging their weighted predictions (Freund [15], Schapire [31]). The good performance of the Boosting algorithm on a variety of datasets stimulated statisticians to understand it from a statistical point of view. In his seminal paper, Breiman [2] shows how Boosting can be interpreted as a gradient descent method. This view of Boosting was reinforced by Friedman [16]. Adaboost, a popular variant of the Boosting algorithm, can be understood as a method for fitting an additive model (Friedman et al. [17]), and recently Efron et al. [11] made a connection between $L_2$ Boosting and the Lasso for linear models.
But connections between iterative bias reduction and Boosting can be made. In the context of nonparametric density estimation, Di Marzio and Taylor [8] have shown that one iteration of the Boosting algorithm reduces the bias of the initial estimator in a manner similar to the multiplicative bias reduction methods (Hengartner and Matzner-Løber [20], Hjort and Glad [22], Jones et al. [25]). In the follow-up paper (Di Marzio and Taylor [9]), they extend their results to the nonparametric regression setting and show that one step of the Boosting algorithm applied to an oversmoother effects a bias reduction. As expected, the decrease in the bias comes at the cost of an increase in the variance of the corrected smoother.

In Section 2, we show that in the context of regression, such iterative bias reduction schemes, obtained by correcting an estimator with smoothers of the residuals, correspond operationally to the $L_2$ Boosting algorithm. This provides a novel statistical interpretation of $L_2$ Boosting. This new interpretation helps explain why, as the number of iterations increases, the estimator eventually deteriorates: by iteratively reducing the bias, one eventually adds more variability than one removes bias.

In Section 3, we discuss the behavior of the $L_2$ Boosting of many commonly used smoothers: smoothing splines, Nadaraya-Watson kernel and $K$-nearest neighbor smoothers. Unlike the good behavior of the $L_2$ boosted smoothing splines discussed in Buhlmann and Yu [4], we show that Boosting $K$-nearest neighbor smoothers and kernel smoothers that are not positive definite produces a sequence of smoothers that behave erratically after a small number of iterations, and eventually diverge. The reason for the failure of the $L_2$ Boosting algorithm, when applied to these smoothers, is that the bias is overestimated. As a result, the Boosting algorithm over-corrects the bias and produces a divergent smoother sequence. Section 4 discusses modifications of the original smoother that ensure good behavior of the sequence of boosted smoothers.

To control both the over-fitting and over-correction problems, one needs to stop the $L_2$ Boosting algorithm in a timely manner. Our interpretation of $L_2$ Boosting as an iterative bias correction scheme leads us to propose in Section 5 several data driven stopping rules: the Akaike Information Criterion (AIC), a modified AIC, Generalized Cross Validation (GCV), one- and $L$-fold Cross Validation, and prediction error estimation using data splitting. Using either the asymptotic results of Li [27] or the finite sample oracle inequality of Hengartner et al. [21], we see that the stopped boosted smoother has desirable statistical properties. We use either of these theorems to conclude that the desirable properties of the boosted smoother do not depend on the initial pilot smoother, provided that the pilot oversmooths the data. This conclusion is reaffirmed by the simulation study we present in Section 6. To implement these data driven stopping rules, we need to calculate predictions of the smoother at any desired value of the covariates, and not only at the observations.
We show in Section 5 how to extend linear smoothers to give predictions at any desired point. The simulations in Section 6 show that combining a GCV based stopping rule with the $L_2$ Boosting algorithm works well: it stops early when the Boosting algorithm misbehaves, and otherwise takes advantage of the bias reduction. Our simulation compares optimum smoothers and optimum iterative bias corrected smoothers (using generalized cross validation) for general smoothers without knowledge of the underlying regression function. We conclude that the optimal iterative bias corrected smoother outperforms the optimal smoother. Finally, the proofs are gathered in the Appendix.

2. Recursive bias estimation. In this section, we define a class of iteratively bias corrected linear smoothers and highlight some of their properties.

2.1. Bias Corrected Linear Smoothers. For ease of exposition, we shall consider the univariate nonparametric regression model in vector form (1.2) from Section 1,

$Y = m + \varepsilon,$

where the errors $\varepsilon$ are independent, have mean zero and constant variance $\sigma^2$, and are independent of the covariates $X = (X_1, \ldots, X_n)$, $X_j \in \mathbb{R}$. Extensions to multivariate smoothers are straightforward and we refer to Buja et al. [5] for example. Linear smoothers can be written as

(2.1)  $\hat m_1 = S Y,$

where $S$ is an $n \times n$ smoothing matrix. Typical smoothing matrices are contractions, so that $\|SY\| \le \|Y\|$, and as a result the associated smoother $SY$ is called a shrinkage smoother (see for example Buja et al. [5]).

Let $I$ be the $n \times n$ identity matrix. The linear smoother (2.1) has bias

(2.2)  $B(\hat m_1) = E[\hat m_1 \mid X] - m = (S - I)m$

and variance $V(\hat m_1 \mid X) = S S' \sigma^2$, respectively.

A natural question is "how can one estimate the bias?" To answer this question, observe that the residuals $R_1 = Y - \hat m_1 = (I - S)Y$ have expected value $E[R_1 \mid X] = m - E[\hat m_1 \mid X] = (I - S)m = -B(\hat m_1)$. This suggests estimating the bias by smoothing the negative residuals

(2.3)  $\hat b_1 := -S R_1 = -S(I - S)Y.$

This bias estimator is zero whenever the smoothing matrix $S$ is a projection, as is the case for linear regression, bin smoothers and regression splines. However, since most common smoothers are not projections, we have an opportunity to extract further signal from the residuals and possibly improve upon the initial estimator.

Note that a smoothing matrix other than $S$ can be used to estimate the bias in (2.3), but as we shall see, in many examples using $S$ works very well and leads to an attractive interpretation of Equation (2.3). Indeed, since the matrices $S$ and $I - S$ commute, we can express the estimated bias as

$\hat b_1 = -S(I - S)Y = -(I - S)SY = (S - I)\hat m_1.$

We recognize the latter as the right-hand side of (2.2) with the smoother $\hat m_1$ replacing the unknown vector $m$. This says that $\hat b_1$ is a plug-in estimate for the bias $B(\hat m_1)$. Subtracting the estimated bias from the initial smoother $\hat m_1$ produces the twicing estimator

$\hat m_2 = \hat m_1 - \hat b_1 = (S + S(I - S))Y = (I - (I - S)^2)Y.$
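As an illustration of this twicing step (a minimal sketch, not code from the paper), the following NumPy fragment builds a deliberately oversmoothing Nadaraya-Watson smoother with a Gaussian kernel, estimates its bias by smoothing the residuals, and subtracts it. The sample size, bandwidth and regression function mirror the running example below; the noise level and all names are illustrative.

```python
import numpy as np

def gaussian_kernel_smoother(x, h):
    """Nadaraya-Watson smoothing matrix S for a Gaussian kernel with bandwidth h."""
    d = (x[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * d ** 2)
    return K / K.sum(axis=1, keepdims=True)    # rows sum to one

rng = np.random.default_rng(0)
n = 50
x = np.sort(rng.uniform(0.0, 1.0, n))
m_true = np.sin(5 * np.pi * x)                 # true regression function of the example
y = m_true + 0.2 * rng.standard_normal(n)

S = gaussian_kernel_smoother(x, h=0.2)         # oversmoothing pilot
m1 = S @ y                                     # pilot smoother      m_1 = S Y
b1 = -S @ (y - m1)                             # estimated bias      b_1 = -S (I - S) Y
m2 = m1 - b1                                   # twicing estimator   m_2 = (I - (I - S)^2) Y
```

Repeating the same two steps on $\hat m_2$ instead of $\hat m_1$ gives the recursion (2.4) below.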
Since the twiced smoother $\hat m_2$ is also a linear smoother, one can repeat the above discussion with $\hat m_2$ replacing $\hat m_1$, producing a thriced linear smoother. We can iterate the bias correction step to recursively define a family of bias corrected smoothers. Starting with $\hat m_1 = SY$, construct recursively for $k = 2, 3, \ldots$ the sequences of residuals, estimated biases and bias corrected smoothers

(2.4)
$R_{k-1} = (I - S)^{k-1} Y,$
$\hat b_k = -S R_{k-1} = -(I - S)^{k-1} S Y,$
$\hat m_k = \hat m_{k-1} - \hat b_k = \hat m_{k-1} + S R_{k-1}.$

We show in the next theorem that the iteratively bias corrected smoother $\hat m_k$ defined by Equation (2.4) has a nice representation in terms of the original smoothing matrix $S$.

Theorem 2.1. The $k$th iterated bias corrected linear smoother $\hat m_k$ (2.4) can be explicitly written as

(2.5)  $\hat m_k = S\left[ I + (I - S) + (I - S)^2 + \cdots + (I - S)^{k-1} \right] Y = \left[ I - (I - S)^k \right] Y = S_k Y.$

Example with a Gaussian kernel smoother. Throughout the next two sections, we shall use the following example to illustrate the behavior of the Boosting algorithm applied to various common smoothers. Take the design points to be 50 independently drawn points from a uniform distribution on the unit interval $[0, 1]$. The true regression function is $m(x) = \sin(5\pi x)$, the solid line in Figure 1, and the disturbances are mean zero Gaussians with a variance producing a signal to noise ratio of five. In Figure 1, the initial smoother is a kernel smoother with a Gaussian kernel and a bandwidth equal to 0.2. This pilot smoother heavily oversmooths the data: Figure 1 shows that the pilot smoother (plain line) is nearly constant. The iterative bias corrected estimators are plotted in Figure 1 for values of $k$, the number of iterations, in $\{1, 10, 50, 100, 500, 10^3, 10^5, 10^6\}$.

[Fig 1. True function $m_1$ (fat plain line) and different estimators varying with the number of iterations $k$.]

Figure 1 shows how each bias correction iteration changes the smoother, starting from a nearly constant smoother and slowly deforming (going down into the valleys and up into the peaks) with increasing number of iterations $k = 10$, $k = 50$ and $k = 100$. After 500 iterations, the iterative smoother is very close to the true function. However, when the number of iterations is very large (here $k = 10^5$ and $10^6$) the iterative smoother deteriorates.

Lemma 2.2. The squared bias and variance of the $k$th iterated bias corrected linear smoother $\hat m_k$ (2.4) are

$B^2(\hat m_k) = m^t \left[ (I - S)^k \right]^t (I - S)^k m,$
$V(\hat m_k) = \sigma^2 \left( I - (I - S)^k \right) \left( I - (I - S)^k \right)^t.$

Remark: Symmetric smoothing matrices $S$ can be decomposed as $S = P_S \Lambda_S P_S^t$, with orthonormal matrix $P_S = [u_1, u_2, \cdots, u_n]$ and diagonal matrix $\Lambda_S$, so that

(2.6)  $\hat m_k = P_S \,\mathrm{diag}\!\left(1 - (1 - \Lambda_S)^k\right) P_S^t Y = \sum_j \left(1 - (1 - \lambda_j)^k\right) u_j u_j^t Y.$

Applying Lemma 2.2, we get

$B^2(\hat m_k) = m^t P_S (I - \Lambda_S)^{2k} P_S^t m,$
$V(\hat m_k) = \sigma^2 P_S \left( I - (I - \Lambda_S)^k \right)^2 P_S^t.$

Hence if the magnitudes of the eigenvalues of $I - S$ are bounded by one, each iteration of the bias correction will decrease the bias and increase the variance.
This monotonicity (decreasing bias, increasing variance) in the number of iterations $k$ allows us to consider data driven selection of the number of bias correction steps that achieves the best compromise between bias and variance of the smoother.

The preceding remark suggests that the behavior of the iterative bias corrected smoother $\hat m_k$ is tied to the spectrum of $I - S$, and not of $S$. The next theorem collects the various convergence results for iterated bias corrected linear smoothers.

Theorem 2.3. Suppose that the singular values $\lambda_j = \lambda_j(I - S)$ of $I - S$ satisfy

(2.7)  $-1 < \lambda_j < 1 \quad \text{for } j = 1, \ldots, n.$

Then we have that

$\|\hat b_k\| < \|\hat b_{k-1}\|$ and $\lim_{k\to\infty} \hat b_k = 0,$
$\|R_k\| < \|R_{k-1}\|$ and $\lim_{k\to\infty} R_k = 0,$
$\lim_{k\to\infty} \hat m_k = Y$ and $\lim_{k\to\infty} E\left[ \|\hat m_k - m\|^2 \right] = n\sigma^2.$

Conversely, if $I - S$ has a singular value $|\lambda_j| > 1$, then

$\lim_{k\to\infty} \|\hat b_k\| = \lim_{k\to\infty} \|R_k\| = \lim_{k\to\infty} \|\hat m_k\| = \infty.$

Remark 1: This theorem shows that iterating the boosting algorithm to reach the limit of the sequence of boosted smoothers, $Y_\infty$, is not desirable. However, since each iteration decreases the bias and increases the variance, a suitably stopped Boosting estimator is likely to improve upon the initial smoother.

Remark 2: When $|\lambda_j(I - S)| > 1$, the iterative bias correction fails. The reason is that $\hat b_k$ overestimates the true bias $b_k$, and hence Boosting repeatedly overcorrects the bias of the smoothers, which results in a divergent sequence of smoothers. Divergence of the sequence of boosted smoothers can be detected numerically, making it possible to avoid this bad behavior by combining the iterative bias correction procedure with a suitable stopping rule.

Remark 3: The assumption that for all $j$ the singular values satisfy $-1 < \lambda_j(I - S) < 1$ implies that $I - S$ is a contraction, so that $\|(I - S)Y\| < \|Y\|$. This condition does not imply that the smoother $S$ itself is a shrinkage smoother as defined by Buja et al. [5]. Conversely, not all shrinkage estimators satisfy condition (2.7) of the theorem. In Section 3, we will give examples of common shrinkage smoothers for which $|\lambda_j(I - S)| > 1$, and show numerically that for these shrinkage smoothers the iterative bias correction scheme will fail.

2.2. $L_2$ Boosting for regression. Boosting is one of the most successful and practical methods that arose fifteen years ago from the machine learning community (Freund [15], Schapire [31]). In light of Friedman [16], the Boosting algorithm has been interpreted as a functional gradient descent technique. Let us summarize the $L_2$Boost algorithm described in Buhlmann and Yu [4].

Step 0: Set $k = 1$. Given the data $\{(X_i, Y_i), i = 1, \ldots, n\}$, calculate a pilot regression smoother $\hat F_1(x) = h(x; \hat\theta_{X,Y})$ by least squares fitting of the parameter, that is,

$\hat\theta_{X,Y} = \mathrm{argmin}_\theta \sum_{i=1}^n \left( Y_i - h(X_i, \theta) \right)^2.$

Step 1: With the current smoother $\hat F_k$, compute the residuals $U_i = Y_i - \hat F_k(X_i)$ and fit the real-valued learner to the current residuals by least squares. The fit is denoted by $\hat f_{k+1}(\cdot)$. Update

(2.8)  $\hat F_{k+1}(\cdot) = \hat F_k(\cdot) + \hat f_{k+1}(\cdot).$

Step 2: Increase the iteration index $k$ by one and repeat Step 1.
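The following sketch (ours, not the authors') spells out Steps 0 to 2 for a generic learner: the running estimate is repeatedly augmented by the learner fitted to the current residuals, exactly as in update (2.8). The helper `kernel_learner` and its bandwidth are illustrative; any least squares learner could be plugged in.

```python
import numpy as np

def l2_boost(x, y, fit_learner, n_iter):
    """Generic L2Boost: start from the pilot fit (Step 0) and repeatedly add the
    learner fitted to the current residuals (Steps 1-2, update (2.8))."""
    F = fit_learner(x, y)(x)                # Step 0: pilot fit F_1 at the design points
    path = [F]
    for _ in range(n_iter):
        U = y - F                           # Step 1: current residuals
        F = F + fit_learner(x, U)(x)        # update (2.8)
        path.append(F)
    return path

def kernel_learner(x_train, y_train, h=0.2):
    """Weak learner: Gaussian-kernel Nadaraya-Watson smoother, returned as a predictor."""
    def predict(x_new):
        w = np.exp(-0.5 * ((x_new[:, None] - x_train[None, :]) / h) ** 2)
        return (w / w.sum(axis=1, keepdims=True)) @ y_train
    return predict

# e.g. path = l2_boost(x, y, kernel_learner, n_iter=100)
```

When the learner is a fixed linear smoother with matrix $S$, this recursion reproduces the matrix form $\hat F_k = (I - (I - S)^k) Y$ of Lemma 2.4 below.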
Lemma 2.4 (Buhlmann and Yu, 2003). The $k$th Boosting iterate of a linear smoother with smoothing matrix $S$ is

$\hat F_k = \left( I - (I - S)^k \right) Y = B_k Y.$

Viewing Boosting as a greedy gradient descent method, the update formula (2.8) is often modified to include a convergence factor $\mu_k$, as in Friedman [16], to become

$\hat F_{k+1}(\cdot) = \hat F_k(\cdot) + \hat\mu_{k+1} \hat f_{k+1}(\cdot),$

where $\hat\mu_{k+1}$ is the best step toward the best direction $\hat f_{k+1}(\cdot)$. This general formulation allows a great deal of flexibility, both in selecting the type of smoother used in each iteration of the Boosting algorithm and in the selection of the convergence factor. For example, we may start with a running mean pilot smoother, use a smoothing spline to estimate the bias in the first Boosting iteration and a nearest neighbor smoother to estimate the bias in the second iteration. However, in practice one typically uses the same smoother for all iterations and fixes the convergence factor $\mu_k \equiv \mu \in (0, 1)$. That is, the sequence of smoothers resulting from the Boosting algorithm is given by

(2.9)  $\hat F_k = \left( I - (I - \mu S)^k \right) Y = B_k Y.$

We shall discuss in detail in Section 4 the impact of this convergence factor and other modifications of the Boosting algorithm that ensure good behavior of the sequence of boosted smoothers.

3. Boosting classical smoothers. This section is devoted to understanding the behavior of the iterative Boosting schema applied to classical smoothers, which in light of Theorem 2.3 depends on the magnitude of the singular values of the matrix $I - S$.

We start our discussion by noting that Boosting a projection type smoother is of no interest because the residuals $(I - S)Y$ are orthogonal to the smoother $SY$. It follows that the smoothed residuals $S(I - S)Y = 0$, and as a result $\hat m_k = \hat m_1$ for all $k$. Hence Boosting a bin smoother or a regression spline smoother leaves the initial smoother unchanged.

Consider the $K$-nearest neighbor smoother. Its associated smoothing matrix is $S_{ij} = 1/K$ if $X_j$ belongs to the $K$ nearest neighbors of $X_i$, and $S_{ij} = 0$ otherwise. Note that this smoothing matrix is not symmetric. While this smoother enjoys many desirable properties, it is not well suited for Boosting because the matrix $I - S$ has singular values larger than one.

Theorem 3.1. In the fixed design or in the uniform design, as soon as the number of neighbors $K$ is bigger than one and smaller than $n$, at least one singular value of $I - S$ is bigger than 1.

The proof of the theorem is found in the appendix. A consequence of Theorem 3.1 and Theorem 2.3 is that the Boosting algorithm applied to a $K$-nearest neighbor smoother produces a divergent sequence of smoothers, and hence should not be used in practice.

Example continued with the $K$-nearest neighbor smoother. We confirm this behavior numerically (see the sketch and Figure 2 below). Using the same data as before, we apply the Boosting algorithm starting with a pilot $K$-nearest neighbor smoother with $K = 10$. The pilot estimator is plotted in a plain line, and the various boosted smoothers with $k$, the number of iterations, valued in $\{2, \cdots, 5\}$, in dotted lines.
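A quick numerical check of Theorem 3.1 (our sketch, not the authors' code): build the $K$-nn smoothing matrix on a uniform random design and inspect the largest singular value of $I - S$. The neighborhood here includes the point itself; this convention and all constants are illustrative.

```python
import numpy as np

def knn_smoothing_matrix(x, K):
    """K-nearest-neighbor smoothing matrix: S[i, j] = 1/K if X_j is one of the
    K points closest to X_i (the point itself included), and 0 otherwise."""
    n = len(x)
    S = np.zeros((n, n))
    for i in range(n):
        neighbors = np.argsort(np.abs(x - x[i]))[:K]
        S[i, neighbors] = 1.0 / K
    return S

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 50))
S = knn_smoothing_matrix(x, K=10)
largest_sv = np.linalg.svd(np.eye(len(x)) - S, compute_uv=False).max()
print(largest_sv)   # typically exceeds 1, so the boosted K-nn sequence diverges
```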
[Fig 2. True function $m_1$ (fat plain line) and different estimators varying with the number of iterations $k$.]

For $k = 1$, the pilot smoother is nearly constant (since we take $K = 10$ neighbors) and very quickly the iterative bias corrected estimator explodes. Qualitatively, the smoothers are getting higher at the peaks and lower in the valleys, which is consistent with an overcorrection of the bias. Contrast this behavior with the one shown in Figure 1.

Kernel type smoother. For the Nadaraya-Watson kernel type estimator, the smoothing matrix $S$ has entries $S_{ij} = K_h(X_i - X_j) / \sum_k K_h(X_i - X_k)$, where $K(\cdot)$ is a symmetric function (e.g., uniform, Epanechnikov, Gaussian), $h$ denotes the bandwidth and $K_h(\cdot)$ is the scaled kernel $K_h(t) = h^{-1} K(t/h)$. The matrix $S$ is not symmetric but can be written as $S = DK$, where $K$ is symmetric with general element $[K_h(X_i - X_j)]$ and $D$ is diagonal with elements $1 / \sum_j K_h(X_i - X_j)$. Algebraic manipulations allow us to rewrite the iterated estimator as

$\hat m_k = \left[ I - (I - S)^k \right] Y = \left[ I - \left( D^{1/2} D^{-1/2} - D^{1/2} D^{1/2} K D^{1/2} D^{-1/2} \right)^k \right] Y = \left[ I - D^{1/2}\left( I - D^{1/2} K D^{1/2} \right)^k D^{-1/2} \right] Y = D^{1/2}\left[ I - (I - A)^k \right] D^{-1/2} Y.$

Since the matrix $A = D^{1/2} K D^{1/2}$ is symmetric, we apply the classical decomposition $A = P_A \Lambda_A P_A^t$, with $P_A$ orthonormal and $\Lambda_A$ diagonal, to get a closed form expression for the boosted smoother

$\hat m_k = D^{1/2} P_A \left[ I - (I - \Lambda_A)^k \right] P_A^t D^{-1/2} Y.$

The eigendecomposition of $A = D^{1/2} K D^{1/2}$ can be used to describe the behavior of the sequence of iterative estimators. In particular, any eigenvalue of $A = D^{1/2} K D^{1/2}$ that is negative or greater than 2 will lead to an unstable procedure. If the kernel $K(\cdot)$ is a positive definite symmetric probability density function, then the spectrum of the Nadaraya-Watson kernel smoother lies between zero and one.

Theorem 3.2. If the inverse Fourier-Stieltjes transform of a kernel $K(\cdot)$ is a real positive finite measure, then the spectrum of the Nadaraya-Watson kernel smoother lies between zero and one. Conversely, suppose that $X_1, \ldots, X_n$ are an independent $n$-sample from a density $f$ (with respect to Lebesgue measure) that is bounded away from zero on a compact set strictly included in the support of $f$. If the inverse Fourier-Stieltjes transform of a kernel $K(\cdot)$ is not a positive finite measure, then with probability approaching one as the sample size $n$ grows to infinity, the maximum of the spectrum of $I - S$ is larger than one.

Remark 1: Since $\mathrm{spec}(A)$ is the same as $\mathrm{spec}(S)$ and $S$ is row stochastic, we conclude that $\mathrm{spec}(A) \le 1$. So we are only concerned with the presence of negative eigenvalues in the spectrum of $A$.

Remark 2: Di Marzio and Taylor [10] proved the first part of the theorem. Our proof of the converse shows that for large enough sample sizes, most configurations from a random design lead to a smoothing matrix $S$ with negative singular values.
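The dichotomy in Theorem 3.2 is easy to check numerically. The sketch below (ours, with illustrative bandwidths) computes the spectrum of $A = D^{1/2} K D^{1/2}$, which coincides with the spectrum of $S = DK$, for a Gaussian and an Epanechnikov kernel.

```python
import numpy as np

def nadaraya_watson_spectrum(x, h, kernel):
    """Eigenvalues of A = D^{1/2} K D^{1/2}; they equal the eigenvalues of S = D K.
    The 1/h factor of K_h cancels after row normalization and is omitted."""
    d = (x[:, None] - x[None, :]) / h
    if kernel == "gaussian":
        K = np.exp(-0.5 * d ** 2)
    elif kernel == "epanechnikov":
        K = np.maximum(0.75 * (1.0 - d ** 2), 0.0)
    else:
        raise ValueError("unknown kernel")
    d_half = np.diag(1.0 / np.sqrt(K.sum(axis=1)))
    return np.linalg.eigvalsh(d_half @ K @ d_half)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 50))
print(nadaraya_watson_spectrum(x, 0.20, "gaussian").min())       # >= 0: boosting behaves
print(nadaraya_watson_spectrum(x, 0.15, "epanechnikov").min())   # typically < 0: boosting diverges
```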
Remark 3: The assumption that the inverse Fourier-Stieltjes transform of a kernel $K(\cdot)$ is a real positive finite measure is equivalent to the kernel $K(\cdot)$ being a positive definite function, that is, for any finite set of points $x_1, \ldots, x_m$, the matrix

$\begin{pmatrix}
K(0) & K(x_1 - x_2) & K(x_1 - x_3) & \ldots & K(x_1 - x_m) \\
K(x_2 - x_1) & K(0) & K(x_2 - x_3) & \ldots & K(x_2 - x_m) \\
\vdots & & & & \vdots \\
K(x_m - x_1) & K(x_m - x_2) & K(x_m - x_3) & \ldots & K(0)
\end{pmatrix}$

is positive definite. We refer to Schwartz [32] for a detailed study of positive definite functions. The Gaussian and triangular kernels are positive definite kernels (they are the Fourier transforms of finite positive measures (Feller [14])), and in light of Theorem 3.2 Boosting Nadaraya-Watson kernel smoothers with these kernels produces a sequence of well behaved smoothers. However, the uniform and the Epanechnikov kernels are not positive definite. Theorem 3.2 states that for large samples the spectrum of $I - S$ is larger than one, and as a result the sequence of boosted smoothers diverges. Proposition 3.3 below strengthens this result by stating that the largest singular value of $I - S$ is always larger than one.

Proposition 3.3. Let $S$ be the smoothing matrix of a Nadaraya-Watson regression smoother based on either the uniform or the Epanechnikov kernel. Then the largest singular value of $I - S$ is larger than one.

Example continued with the Epanechnikov kernel smoother. In the next figure, the pilot smoother is a kernel smoother with an Epanechnikov kernel and a bandwidth equal to 0.15. The pilot smoother is the plain line, and the subsequent iterations with $k$, the number of iterations, valued in $\{1, 2, 5, 10, 20, 50, 100\}$, are the dotted lines.

[Fig 3. True function $m_1$ (fat plain line) and different estimators varying with the number of iterations $k$.]

For $k = 1$, the pilot smoother oversmooths the true regression since the bandwidth covers almost one third of the data, and very quickly the iterative estimator explodes. Contrast this behavior with the one shown by the Gaussian kernel smoother in Figure 1.

Finally, let us now consider the smoothing splines smoother. The smoothing matrix $S$ is symmetric and therefore admits an eigendecomposition. Denote by $\{u_1, u_2, \cdots, u_n\}$ an orthonormal basis of eigenvectors of $S$ associated to the eigenvalues $1 \ge \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$ (Utreras [36]). Denote by $P_S = [u_1, u_2, \cdots, u_n]$ the orthonormal matrix of column eigenvectors and write $S = P_S \,\mathrm{diag}(\lambda_S)\, P_S^t$, that is, $S = \sum_j \lambda_j u_j u_j^t$. The iterated bias reduction estimator is given by (2.6). Since all the eigenvalues are between 0 and 1, if $k$ is large the iterative procedure kills the eigenvalues less than 1 and puts the others to 1.

Example continued with the smoothing splines smoother. In the next figure, the pilot smoother is a smoothing spline with $\lambda$ equal to 0.2. The different estimators are plotted in Figure 4, with the pilot estimator in a plain line and the boosted smoothers with the number of iterations $k$ in $\{10, 50, 100, 500, 10^3, 10^5, 10^6\}$ in dotted lines.
[Fig 4. True function $m_1$ (fat plain line) and different estimators varying with the number of iterations $k$.]

The pilot estimator is more variable than the pilot estimator of Figure 1, and consequently both the convergence and the deterioration arise faster.

4. Smoother engineering. Practical implementations of the Boosting algorithm include a user selected convergence factor $\mu \in (0, 1)$ that appears in the definition of the boosted smoother

(4.1)  $\hat m_k = \left( I - (I - \mu S)^k \right) Y = B_k Y.$

In this section, we show that when $\mu < 1$, one effectively operates a partial bias correction. This partial bias correction does not, however, resolve the problems associated with Boosting a nearest neighbor or Nadaraya-Watson kernel smoother with compact kernel that we exhibited in the previous section. To resolve these problems, we propose to suitably modify the boosted smoother. We call such targeted changes smoother engineering.

The following iterative partial bias reduction scheme is equivalent to the Boosting algorithm defined by Equation (4.1): given a smoother $\hat m_k = B_k Y$ at the $k$th iteration of the Boosting algorithm, calculate the residuals $R_k$ and the estimated bias $\hat b_k$,

$R_k = Y - \hat m_k = (I - B_k) Y,$
$\hat b_k = S R_k = S (I - B_k) Y.$

Next, given $0 < \mu < 1$, consider the partially bias corrected smoother

(4.2)  $\hat m_{k+1} = \hat m_k + \mu \hat b_k.$

Algebraic manipulation of the smoothing matrix on the right-hand side of (4.2) yields $B_k + \mu S(I - B_k) = I - (I - \mu S)^{k+1}$, from which we conclude that $\hat m_{k+1}$ satisfies (4.1) and therefore is the $(k+1)$th iteration of the Boosting algorithm. It is interesting to rewrite (4.2) as

$\hat m_{k+1} = (1 - \mu)\, \hat m_k + \mu \left[ \hat m_k + \hat b_k \right],$

which shows that the boosted smoother $\hat m_{k+1}$ is a convex combination of the smoother $\hat m_k$ at iteration $k$ and the fully bias corrected smoother $\hat m_k + \hat b_k$. As a result, we understand how the introduction of a convergence factor produces a "weaker learner" than the one obtained for $\mu = 1$.

In analogy to Theorem 2.3, the behavior of the sequence of smoothers depends on the spectrum of $I - \mu S$. Specifically, if $\max_j |\lambda_j(I - \mu S)| \le 1$, then $\lim_{k\to\infty} \hat m_k = Y$, and conversely, if $\max_j |\lambda_j(I - \mu S)| > 1$, then $\lim_{k\to\infty} \|\hat m_k\| = \infty$. Inspection of the proofs of Theorems 3.1 and 3.2 reveals that the spectrum of $I - \mu S$ for both the nearest neighbor smoother and the Nadaraya-Watson kernel smoother has singular values of magnitude larger than one. Hence the introduction of the convergence factor does not help resolve the difficulties arising when Boosting these smoothers.

To resolve the potential convergence issues, one needs to suitably modify the underlying smoother to ensure that the magnitudes of the singular values of $I - \mu S$ are bounded by one. A practical solution is to replace the smoothing matrix $S$ by $S^\star = S S^t$. If $S$ is a contraction, it follows that the eigenvalues of $I - S^\star$ are nonnegative and bounded by one. Hence the Boosting algorithm with this smoother will produce a well behaved sequence of smoothers with $\lim_{k\to\infty} \hat m_k = Y$.
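A small sketch of this smoother engineering step (ours, with an illustrative bandwidth and noise level): symmetrize an Epanechnikov Nadaraya-Watson smoother as $S^\star = SS^t$ and check the spectrum of $I - S^\star$ before boosting it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.sort(rng.uniform(0.0, 1.0, n))
y = np.sin(5 * np.pi * x) + 0.2 * rng.standard_normal(n)

# Epanechnikov Nadaraya-Watson smoother: not positive definite, so boosting S diverges
d = (x[:, None] - x[None, :]) / 0.15
K = np.maximum(0.75 * (1.0 - d ** 2), 0.0)
S = K / K.sum(axis=1, keepdims=True)

# Smoother engineering: replace S by S* = S S^t.  When S is a contraction
# (spectral norm <= 1), the eigenvalues of I - S* all lie in [0, 1].
S_star = S @ S.T
print(np.linalg.norm(S, 2))                              # contraction check
print(np.linalg.eigvalsh(np.eye(n) - S_star).min())      # >= 0 whenever the check above is <= 1

m_hat = S_star @ y
for _ in range(50):                                      # iterative bias correction with S*
    m_hat = m_hat + S_star @ (y - m_hat)
```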
While substituting the smoother $S^\star$ for $S$ can produce better boosted smoothers in cases where Boosting failed, our numerical experiments have shown that the performance of Boosting $S^\star$ is not as good as Boosting $S$ when the pilot estimator enjoys good properties, as is the case for smoothing splines and the Nadaraya-Watson kernel smoother with Gaussian kernel.

5. Stopping rules. Theorem 2.3 in Section 2 states that the limit of the sequence of boosted smoothers is either the raw data $Y$ or has norm $\|Y_\infty\| = \infty$. It follows that iterating the Boosting algorithm until convergence is not desirable. However, since each iteration of the Boosting algorithm reduces the bias and increases the variance, often a few iterations of the Boosting algorithm will produce a better smoother than the pilot smoother. This brings up the important question of how to decide when to stop the iterative bias correction process.

Viewing the latter question as a model selection problem suggests stopping rules based on Mallows' $C_p$ (Mallows [28]), the Akaike Information Criterion, AIC (Akaike [1]), the Bayesian Information Criterion, BIC (Schwarz [33]), and Generalized Cross Validation (Craven and Wahba [7]). Each of these selectors estimates the optimum number of iterations $k$ of the Boosting algorithm by minimizing estimates of the expected squared prediction error of the smoothers over some pre-specified set $\mathcal{K} = \{1, 2, \ldots, M\}$. Three of the six criteria we study numerically in Section 6 use plug-in estimates for the squared bias and variance of the expected prediction mean square error. Specifically, consider

(5.1)  $\hat k_{AIC} = \mathrm{argmin}_{k \in \mathcal{K}} \left\{ \log\hat\sigma^2 + \frac{2\,\mathrm{trace}(S_k)}{n} \right\},$

(5.2)  $\hat k_{GCV} = \mathrm{argmin}_{k \in \mathcal{K}} \left\{ \log\hat\sigma^2 - 2\log\left(1 - \frac{\mathrm{trace}(S_k)}{n}\right) \right\},$

(5.3)  $\hat k_{AICc} = \mathrm{argmin}_{k \in \mathcal{K}} \left\{ \log\hat\sigma^2 + 1 + \frac{2\,(\mathrm{trace}(S_k) + 1)}{n - \mathrm{trace}(S_k) - 2} \right\}.$

In nonparametric smoothing, the AIC criterion (5.1) has a noticeable tendency to select more iterations than needed, leading to a final smoother $\hat m_{\hat k_{AIC}}$ that typically undersmooths the data. As a remedy, Hurvich et al. [24] introduced a corrected version of the AIC, (5.3), under the simplifying assumption that the nonparametric smoother $\hat m$ is unbiased, which rarely holds in practice and which is particularly untrue in our context.

The other three criteria considered in our simulation study in Section 6 are Cross-Validation, $L$-fold cross-validation and data splitting, all of which estimate empirically the expected prediction mean square error by splitting the data into learning and testing sets. Implementation of these criteria requires one to evaluate the smoother at locations outside of the design. To this end, write the $k$th iterated smoother as a $k$ times bias corrected smoother

$\hat m_k = \hat m_0 + \hat b_1 + \cdots + \hat b_k = S\left[ I + (I - S) + (I - S)^2 + \cdots + (I - S)^{k-1} \right] Y,$

which we rewrite as $\hat m_k = S \hat\beta_k$, where

$\hat\beta_k = \left[ I + (I - S) + (I - S)^2 + \cdots + (I - S)^{k-1} \right] Y = Y + R_1 + R_2 + \cdots + R_k$

is a vector of size $n$. Given the vector $S(x)$ of size $n$ whose entries are the weights for predicting $m(x)$, we calculate

(5.4)  $\hat m_k(x) = S(x)^t \hat\beta_k.$
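A sketch of formula (5.4) for a Gaussian-kernel smoother (our illustration; the bandwidth and names are arbitrary): the coefficient vector $\hat\beta_k$ is accumulated during boosting and then reused to predict at any new point through the weight vector $S(x)$.

```python
import numpy as np

def boosted_kernel_predictor(x_train, y_train, h, n_iter):
    """Accumulate beta_k while boosting a Gaussian-kernel smoother, then predict
    at arbitrary points via (5.4): m_k(x) = S(x)^t beta_k."""
    d = (x_train[:, None] - x_train[None, :]) / h
    K = np.exp(-0.5 * d ** 2)
    S = K / K.sum(axis=1, keepdims=True)            # in-sample smoothing matrix

    beta = y_train.copy()                           # beta_1 = Y
    for _ in range(n_iter - 1):
        residual = y_train - S @ beta               # R = Y - m_k  with  m_k = S beta_k
        beta = beta + residual                      # beta is updated by adding the residuals

    def predict(x_new):
        w = np.exp(-0.5 * ((x_new[:, None] - x_train[None, :]) / h) ** 2)
        w = w / w.sum(axis=1, keepdims=True)        # out-of-sample weights S(x)
        return w @ beta
    return predict

# e.g. predict = boosted_kernel_predictor(x, y, h=0.2, n_iter=500)
#      y_grid  = predict(np.linspace(0.0, 1.0, 100))
```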
This formulation is computationally advantageous because the vector of weights $S(x)$ only needs to be computed once, while each Boosting iteration updates the parameter vector $\hat\beta_k$ by adding the residuals $R_k = Y - \hat m_k$ of the fit to the previous value of the parameter, i.e., $\hat\beta_k = \hat\beta_{k-1} + R_k$.

The vector $S(x)$ is readily computed for many of the smoothers used in practice. For kernel smoothers, the $i$th entry of the vector $S(x)$ is

$S_i(x) = \frac{K_h(x - X_i)}{\sum_j K_h(x - X_j)}.$

For smoothing splines, let $N(x)$ denote the vector of basis functions evaluated at $x$. One can show that $\hat m_k(x) = N(x) M \hat\beta_k$, where $M$ is the $n \times n$ matrix given by $M = (N^t N + \lambda\Omega)^{-1} N^t$. Finally, for the $K$-nn smoother, the entries of the vector $S(x)$ are

$S_i(x) = \begin{cases} 1/K & \text{if } X_i \text{ is a } K\text{-nn of } x, \\ 0 & \text{otherwise.} \end{cases}$

We note that if the spectrum of $I - S$ is bounded in absolute value by one, then the parameter $\hat\beta_k \to \beta_\infty$, and hence we have pointwise convergence of $\hat m_k(x)$ to some $m_\infty(x)$, whose properties depend on $S(x)$.

To define the data splitting and cross validation stopping rules, one divides the sample into two disjoint subsets: a learning set $\mathcal{L}$, which is used to estimate the smoother $\hat m^{\mathcal{L}}$, and a testing set $\mathcal{T}$, on which predictions from the smoother are compared to the observations. The data splitting selector for the number of iterations is

(5.5)  $\hat k_{DS} = \mathrm{argmin}_{k \in \mathcal{K}} \sum_{i \in \mathcal{T}} \left( Y_i - \hat m^{\mathcal{L}}_k(X_i) \right)^2.$

One-fold cross validation, or simply cross validation, and more generally $L$-fold cross validation, averages the prediction error over all partitions of the data into learning and testing sets, with fixed size of the testing set $|\mathcal{T}| = L$. This leads to

(5.6)  $\hat k_{CV} = \mathrm{argmin}_{k \in \mathcal{K}} \sum_{|\mathcal{T}| = L} \sum_{i \in \mathcal{T}} \left( Y_i - \hat m^{\mathcal{L}}_k(X_i) \right)^2.$

We rely on the expansive literature on model selection to provide insight into the statistical properties of the stopped boosted smoother. For example, Theorem 3.2 of Li [27] describes the asymptotic behavior of the generalized cross-validation (GCV) stopping rule applied to spline smoothers.

Theorem 5.1 (Li, 1987). Assume that Li's assumptions are verified for the smoother $S$. Then

$\frac{\| m - S_{\hat k_{GCV}} Y \|^2}{\inf_{k \in \mathcal{K}} \| m - S_k Y \|^2} \to 1 \quad \text{in probability.}$

Results on the finite sample performance of data splitting for arbitrary smoothers are presented in Theorem 1 of Hengartner et al. [21], who proved the following oracle inequality.

Theorem 5.2. For each $k$ in $\mathcal{K}$, $\lambda > 0$ and $\alpha > 0$, we have

$P\left\{ \frac{1}{m}\sum_{i=n+1}^{n+m} (\hat m_{\hat k_{DS}} - m)^2(X_i) - (1 - \alpha)\,\frac{1}{m}\sum_{i=n+1}^{n+m} (\hat m_k - m)^2(X_i) \ge \lambda \right\} \le |\mathcal{K}|\,\sqrt{\frac{32(1 + \alpha)\sigma^2\pi}{\alpha m \lambda}}\left[ \exp\!\left( \frac{\alpha m \lambda}{8(1 + \alpha)\sigma^2} \right) - 1 \right]^{-1}.$

Example continued with smoothing splines. Figure 5 shows the three pilot smoothers (smoothing splines with different smoothing parameters) considered in the simulation study in Section 6.
[Fig 5. True function $m_1$ (plain line) and the different pilot smoothing spline smoothers, $S(\lambda_1)$ (dotted line), $S(\lambda_2)$ (dashed line), $S(\lambda_3)$ (dash-dotted line), for the 50 data points of one replication (Gaussian error).]

Starting with the smoothest pilot smoother $S(\lambda_1)$, the Generalized Cross Validation criterion stops after 1389 iterations. Starting with smoother $S(\lambda_2)$, GCV stopped after 23 iterations, while starting with the noisiest pilot $S(\lambda_3)$, GCV stopped after one iteration. It is remarkable how similar the final smoothers are.

[Fig 6. True function $m_1$ (plain line) and the different pilot smoothing spline smoothers, $S(\lambda_1)$ (dashed line), $S(\lambda_2)$ (dotted line), $S(\lambda_3)$ (dash-dotted line), for the same 50 data points as in Figure 5 of one replication (Gaussian error).]

The final selected estimators are very close to one another, despite the different pilot smoothers and the different numbers of iterations selected by the GCV criterion. Despite the flatness of the pilot smoother $S(\lambda_1)$, it succeeds after 1389 iterations in capturing the signal. Note that larger smoothing
parameters $\lambda$ are associated with weaker learners that require a larger number of bias correction iterations before they become desirable smoothers according to the generalized cross validation criterion. A close examination of Figure 6 shows that using the less biased estimator $S(\lambda_3)$ leads to the worst final estimator. This can be explained as follows: if the pilot smoother is not biased enough, after the first step almost no signal is left in the residuals and the iterative bias reduction is stopped.

We remark again that one does not need to keep the same smoother throughout the iterative bias correcting scheme. We conjecture that there are advantages to using weaker smoothers later in the iterative scheme, and shall investigate this in a forthcoming paper.

6. Simulations. This section presents selected results from our simulation study investigating the statistical properties of the various data driven stopping rules. The simulations examine, within the framework set by Hurvich et al. [24], the impact on performance of the stopping rule, the smoother type, the smoothness of the pilot smoother, the sample size, the true regression function, and the relative variance of the errors as measured by the signal to noise ratio.

We examine the influence of the various factors on the performance of the selectors, with 100 simulation replications and a random uniform grid in $[0, 1]$. The error standard deviation is $\sigma = 0.2 R_g$, where $R_g$ is the range of $g(x)$ over $x \in [0, 1]$. For each setting of factors, we have

(A) sample size: $n = 50$, 100 and 500;
(B) the following 3 regression functions, most of which were used in earlier studies:
    1. $m(x) = \sin(5\pi x)$,
    2. $m(x) = 1 - 48x + 218x^2 - 315x^3 + 145x^4$,
    3. $m(x) = \exp(x - \tfrac{1}{3})\,\mathbb{1}\{x < \tfrac{1}{3}\} + \exp[-2(x - \tfrac{1}{3})]\,\mathbb{1}\{x \ge \tfrac{1}{3}\}$;
(C) error distribution: Gaussian and Student(5);
(D) pilot smoothers: smoothing splines, Gaussian kernel, $K$-nearest neighbor type smoother;
(E) three starting smoothers: $S_1$, $S_2$ and $S_3$, by decreasing order of smoothing.

For each setting, we compute the ideal number of iterations at the data points $\{X_i\}_{i=1}^n$,

$k_{opt} = \mathrm{argmin}_{k \in \mathcal{K}} \sum_{i=1}^n \| m(X_i) - \hat m_k(X_i) \|^2.$

Since the results are numerous, we report here a summary focusing on the main objectives of the paper. First of all, do the stopping procedures proposed in Section 5 work? Figure 7 represents the kernel density estimates of the log ratios of the selected number of iterations to the ideal number of iterations for the smoothing spline type smoother.

[Fig 7. Estimated density of $\log(\hat k / k_{opt})$, with $\hat k$ evaluated by different stopping criteria: GCV, CV (leave one out), 5-fold CV (CV5), data splitting (DS), AIC and corrected AIC (AICc). The density is estimated on 100 replications for function $m_1$, with Gaussian error, spline smoother $S_2$ and $n = 50$ data points.]

Obviously, negative values indicate undersmoothing ($\hat k$ smaller than $k_{opt}$, that is, not enough bias reduction) while positive values indicate oversmoothing. The results remain essentially unchanged over the range of starting values, regression functions and smoother types we considered in our simulation study. For small data sets ($n = 50$), the stopping rule based on data splitting produced values of $\hat k$ that were very variable. A similar observation about the variability of bandwidth selection from data splitting was made in [21]. We also found that the five fold cross validation stopping rule produced highly variable values of $\hat k$.

The AIC stopping rule selects values $\hat k$ that are often too big (oversmoothing) and sometimes selects the largest possible value of $\hat k \in \mathcal{K}$. In those cases, the curve of $k$ versus AIC (not shown) exhibits two minima, a local one around the true value and the global one at the boundary. This can be attributed to the fact that the penalty term used by AIC is too small. The AICc criterion uses a larger penalty term, which leads to smaller values of $\hat k$; in fact, the selected values are typically smaller than the optimal one. The penalty associated with GCV lies in between the AICc penalty and the AIC penalty, and produces in practice values of $\hat k$ that are closer to the optimum than either AIC or AICc. Finally, the leave one out cross-validation selection rule produces $\hat k$ that are typically larger than the optimal one.

Investigation of the MSE as a function of the number of iterations $k$ reveals that, in the examples we considered, that function decreases rapidly towards its minimum and then remains relatively flat over a range of values to the right of the minimum. It follows that the loss from stopping after $k_{opt}$ is smaller than the loss from stopping before $k_{opt}$. We verify this empirically as follows: for each estimate, we calculate the approximation to the integrated mean squared error between the estimator $\hat m_{\hat k}$ and the true regression function $m$,

$MSE(\hat m_{\hat k}) = \frac{1}{100} \sum_{x \in G} | m(x) - \hat m_{\hat k}(x) |^2,$

where $G$ is a fixed grid of 100 regularly spaced points in the unit interval $[0, 1]$. We partition the calculated integrated mean squared errors depending on whether $\hat k$ is bigger or smaller than $k_{opt}$.
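For concreteness, here is how $\hat k_{GCV}$, $k_{opt}$ and the fitted curve of one replication can be computed (our sketch, not the authors' code; it uses a Gaussian-kernel pilot rather than a smoothing spline, and the bandwidth, seed and noise scaling are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = np.sort(rng.uniform(0.0, 1.0, n))
m_true = np.sin(5 * np.pi * x)
y = m_true + 0.2 * (m_true.max() - m_true.min()) * rng.standard_normal(n)

d = (x[:, None] - x[None, :]) / 0.2
K = np.exp(-0.5 * d ** 2)
S = K / K.sum(axis=1, keepdims=True)             # oversmoothing Gaussian-kernel pilot

I = np.eye(n)
M = I.copy()                                     # M holds (I - S)^k, starting at k = 0
gcv, sse = [], []
for k in range(1, 2001):
    M = M @ (I - S)
    Sk = I - M                                   # boosted smoother matrix S_k = I - (I - S)^k
    fit = Sk @ y
    sigma2 = np.mean((y - fit) ** 2)
    gcv.append(np.log(sigma2) - 2.0 * np.log(1.0 - np.trace(Sk) / n))   # criterion (5.2)
    sse.append(np.sum((m_true - fit) ** 2))      # uses the true m: available only in simulations

k_gcv = 1 + int(np.argmin(gcv))                  # data-driven stopping iteration
k_opt = 1 + int(np.argmin(sse))                  # ideal number of iterations
print(k_gcv, k_opt)
```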
Figure 8 presents the boxplots of the integrated mean squared error when $\hat k$ over-estimates $k_{opt}$ and when it under-estimates $k_{opt}$, and clearly shows that over-estimating $k_{opt}$ leads to a smaller integrated mean squared error than under-estimating $k_{opt}$.

[Fig 8. Boxplot of $MSE(\hat m_{\hat k})$ when $\hat k_{GCV} > k_{opt}$ (denoted GCV+), of the mean squared error of $\hat f_{\hat k_{GCV}}$ when $\hat k_{GCV} \le k_{opt}$ (denoted GCV-), and the same boxplots with the leave one out stopping criterion. Mean squared errors are estimated on a grid of 100 points regularly spaced between 0 and 1, over 100 replications for function $m_1$, with Gaussian error, spline smoother $S_2$ and $n = 50$ data points.]

For bigger data sets, say $n = 100$ or larger, most of the stopping criteria behave the same, except for the modified AIC, which tends to select a smaller number of iterations $k$ than the ideal one. One fold cross-validation is rather computationally intensive, as the usual relation between the cross validated estimator at $X_i$ and the full data estimator is no longer valid [e.g. 19, p. 47]. These conclusions remain true for kernel smoothers and nearest neighbor smoothers. However, if the pilot smoother is not smooth enough (not biased enough), then the number of iterations is too small to allow us to discriminate between the different stopping rules. Such initial smoothers, which we call wiggly learners, are almost unbiased and therefore there is little value in applying an iterative bias correction scheme. In conclusion, for small data sets our simulations show that both GCV and leave one out cross-validation work well, and for bigger data sets we recommend using GCV.

Tables 1 and 2 below report the finite sample performance of the boosted smoother stopped by the GCV criterion. Each entry in the tables reports the median number of iterations and the median mean square error over 100 simulations. As expected, a larger smoothing parameter of the initial smoother requires more iterations of the boosting algorithm to reach its optimum. Interestingly, the selected smoother starting with a very smooth smoother has a slightly smaller mean squared error. To quantify the benefits of the iterative bias correction scheme, the last column of the tables gives the mean squared error of the original smoother with smoothing parameter selected using GCV. In all cases, the iterative bias correction has a smaller mean squared error than the "one-step" smoother, with improvements ranging from 15% to 30%. Table 1 presents the results for smoothing splines.

Table 1. Median over 100 simulations of the number of iterations and of the MSE for the smoothing splines smoother, n = 50 data points. S(λ̂_GCV) denotes the traditional smoothing splines estimate with λ chosen by GCV.

Function   error      k̂_1    S_k̂GCV(λ1)   k̂_2   S_k̂GCV(λ2)   k̂_3   S_k̂GCV(λ3)   S(λ̂_GCV)
m_1        Gaussian   4077   0.0273        65    0.0282        2     0.0293        0.0379
m_1        Student    4115   0.0273        70    0.0286        2     0.0296        0.0352
m_2        Gaussian   1219   0.0798        21    0.0845        1     0.0837        0.0829
m_2        Student    1307   0.0887        22    0.0944        1     0.0932        0.0937
m_3        Gaussian    135   0.0014         3    0.0014        1     0.0016        0.0016
m_3        Student     147   0.0016         3    0.0016        1     0.0018        0.0019
Table 2 presents the results for kernel smoothers with a Gaussian kernel.

Table 2. Median over 100 simulations of the number of iterations and of the MSE for the Gaussian kernel smoother, n = 50 data points. S(ĥ_AICc) denotes the kernel estimate with bandwidth chosen by the modified AIC criterion.

Function   error      k̂_1    S_k̂GCV(h1)   k̂_2   S_k̂GCV(h2)   k̂_3   S_k̂GCV(h3)   S(ĥ_AICc)
m_1        Gaussian    385   0.0231        27    0.0254        4     0.0368        0.04857
m_1        Student     360   0.0221        25    0.0262        4     0.0353        0.05199
m_2        Gaussian    330   0.0477       128    0.0581       14     0.0782        0.1175
m_2        Student    1621   0.0563       160    0.0660       16     0.0754        0.1184
m_3        Gaussian     30   0.0017         7    0.0016        2     0.0014        0.00178
m_3        Student      29   0.0017         8    0.0016        2     0.0016        0.0018

The simulation results reported in the above tables show that the iterative bias reduction scheme works well in practice, even for moderate sample sizes. While starting with a very smooth pilot requires more iterations, the mean squared error of the resulting smoother is somewhat smaller compared to a noisier initial smoother. Figures 5 and 6 also support this claim.

7. Discussion. In this paper, we make the connection between iterative bias correction and the $L_2$ boosting algorithm, thereby providing a new interpretation for the latter. A link between bias reduction and boosting was suggested by [30] in his discussion of the seminal paper [17], and explored in Di Marzio and Taylor [8, 9] for the special case of kernel smoothers. In this paper, we show that this interpretation holds for general linear smoothers.

It was surprising to us that not all smoothers are suitable for boosting. We show that many weak learners, such as the $k$-nearest neighbor smoother and some kernel smoothers, are not stable under $L_2$ boosting. Our results extend and complement the recent results of Di Marzio and Taylor [9].

Iterating the boosting algorithm until convergence is not desirable; better smoothers result if one stops the iterative scheme. We have explored, via simulations, various data driven stopping rules and have found that, for linear smoothers, the Generalized Cross Validation criterion works very well, even for moderate sample sizes of 50. Our simulations show that optimally correcting the bias (by stopping the $L_2$ boosting algorithm after a suitable number of iterations) produces better smoothers than the one with the best data-dependent smoothing parameter.

Finally, the iterative bias correction scheme can be readily extended to multivariate covariates $X$, as in Buhlmann [3].

APPENDIX A: APPENDIX

Proof of Theorem 2.1. To show (2.5), let $\Sigma = I + (I - S) + \cdots + (I - S)^{k-1}$. The conclusion follows from a telescoping sum argument applied to

$S\Sigma = \Sigma - (I - S)\Sigma = I - (I - S)^k.$

Proof of Theorem 2.3. We have

$\|\hat b_k\|^2 = \| -(I - S)^{k-1} S Y \|^2 = \| (I - S)(I - S)^{k-2} S Y \|^2 \le \|I - S\|^2 \|\hat b_{k-1}\|^2 \le \|\hat b_{k-1}\|^2,$

where the last inequality follows from the assumptions on the spectrum of $I - S$. Similarly, one shows that

$\|R_k\|^2 = \| (I - S)^k Y \|^2 \le \|I - S\|^2 \|R_{k-1}\|^2 < \|R_{k-1}\|^2.$

Proof of Theorem 3.1. To simplify the exposition, let us assume that the $X_i$'s are ordered.
Consider the $K$-nn smoother; its matrix $S$ has general term

$S_{ij} = \frac{1}{K} \quad \text{if } X_j \in K\text{-nn}(X_i).$

In order to bound the singular values of $I - S$, consider the eigenvalues of $(I - S)(I - S)'$, which are the squares of the singular values of $I - S$. Since $A = (I - S)(I - S)'$ is symmetric, we have for any vector $x$ that

(A.1)  $\lambda_n \le \frac{x' A x}{x' x} \le \lambda_1.$

Let us find a vector $x$ such that $x' A x > x' x$. First notice that $A_{ii} = 1 - \frac{1}{K}$. Next, consider the vector $x$ of $\mathbb{R}^n$ that is zero everywhere except at positions $i - l_1$, $i$ and $i + l_2$, where its values are $-1$, $2$ and $-1$, respectively. For this choice, we expand $x' A x$ to get

$x' A x = A_{i-l_1, i-l_1} + 4 A_{i,i} + A_{i+l_2, i+l_2} - 4 A_{i-l_1, i} - 4 A_{i, i+l_2} + 2 A_{i+l_2, i-l_1} = 6 - \frac{6}{K} - 4 A_{i-l_1, i} - 4 A_{i, i+l_2} + 2 A_{i+l_2, i-l_1}.$

To show that this last quantity is larger than $x^t x = 6$, we need to suitably bound the off-diagonal elements of $A = I - S - S' + S S'$. To bound $A_{ij}$, where $j = i + l$ and $l < K$, we need to consider three cases:

1. If $X_i$ belongs to the $K$-nn of $X_j$ and vice versa, then $S_{ij} = S'_{ji} = 1/K$. This does not mean that all the $K$ nearest neighbors of $X_i$ are the same as those of $X_j$, but if that is the case, then $(SS')_{ij} \le K/K^2$, and otherwise, in the pessimistic case, we bound $(SS')_{ij} \ge (l + 1)/K^2$. It therefore follows that

$\frac{l + 1}{K^2} - \frac{2}{K} \le A_{i, i+l} \le \frac{K}{K^2} - \frac{2}{K} = -\frac{1}{K}.$

2. If $X_i$ belongs to the $K$-nn of $X_j$ (so $S_{ij} = 1/K$) but $X_j$ does not belong to the $K$-nn of $X_i$, then $S'_{ji} = 0$. There is a maximum of $K - 1$ points that are in the $K$-nn of $X_i$ and in the $K$-nn of $X_j$, so $(SS')_{ij} \le (K - 1)/K^2$. In the pessimistic case, there is only one such point, which leads to the bound

$\frac{1}{K^2} - \frac{1}{K} \le A_{i, i+l} \le \frac{K - 1}{K^2} - \frac{1}{K} \le -\frac{1}{K^2}.$

3. If $X_i$ does not belong to the $K$-nn of $X_j$ (so $S_{ij} = 0$) and $X_j$ does not belong to the $K$-nn of $X_i$ (so $S'_{ji} = 0$), there are potentially as many as $l - 1$ points that are in the $K$-nn of $X_i$ and in the $K$-nn of $X_j$, so that $(SS')_{ij} \le (l - 1)/K^2$. In that case

$0 \le A_{ij} \le \frac{l - 1}{K^2} \le \frac{K - 2}{K^2}.$

With these bounds for the off-diagonal terms, we are able to bound $x' A x$ from below. Before continuing, we need to discuss the relative positions of the points $X_{i-l_1}$, $X_i$ and $X_{i+l_2}$. We choose them such that

$X_{i-l_1} \in K\text{-nn}(X_i) \quad \text{and} \quad X_i \in K\text{-nn}(X_{i-l_1}).$

For this choice, we calculate

$\frac{l_1 + 1 - 2K}{K^2} \le A_{i-l_1, i} \le -\frac{1}{K}, \qquad \frac{l_2 + 1 - 2K}{K^2} \le A_{i, i+l_2} \le -\frac{1}{K},$

so that

$6 - \frac{6}{K} + \frac{8}{K} + 2 A_{i+l_2, i-l_1} \le x' A x \le 6 + \frac{2}{K} + 2 A_{i+l_2, i-l_1}.$

The latter shows that $x' A x > x' x$ whenever $A_{i+l_2, i-l_1} > -\frac{1}{K}$, which is always true if the condition $X_{i-l_1} \notin K\text{-nn}(X_{i+l_2})$ or $X_{i+l_2} \notin K\text{-nn}(X_{i-l_1})$ is satisfied, because in such a case we have $-\frac{1}{K} < A_{i-l_1, i+l_2} \le \frac{1}{K^2}$.

Proof of Theorem 3.2. Let $X_1, \ldots, X_n$ be an i.i.d. sample from a density $f$ that is bounded away from zero on a compact set strictly included in the support of $f$. Consider without loss of generality that $f(x) \ge c > 0$ for all $|x| < b$. We are interested in the sign of the quadratic form $u^t A u$, where the individual entries $A_{ij}$ of the matrix $A$ are

$A_{ij} = \frac{K_h(X_i - X_j)}{\sqrt{\sum_l K_h(X_i - X_l)}\,\sqrt{\sum_l K_h(X_j - X_l)}}.$
Proof of Theorem 3.2. Let $X_1, \ldots, X_n$ be an i.i.d. sample from a density $f$ that is bounded away from zero on a compact set strictly included in the support of $f$. Assume without loss of generality that $f(x) \ge c > 0$ for all $|x| < b$. We are interested in the sign of the quadratic form $u^t A u$, where the individual entries of the matrix $A$ are
\[
A_{ij} = \frac{K_h(X_i - X_j)}{\sqrt{\sum_l K_h(X_i - X_l)}\,\sqrt{\sum_l K_h(X_j - X_l)}} .
\]
Recall the definition of the scaled kernel $K_h(\cdot) = K(\cdot/h)/h$. If $v$ is the vector with coordinates $v_i = u_i / \sqrt{\sum_l K_h(X_i - X_l)}$, then $u^t A u = v^t \mathcal{K} v$, where $\mathcal{K}$ is the matrix with entries $K_h(X_i - X_j)$. Thus any conclusion about the quadratic form $v^t \mathcal{K} v$ carries over to the quadratic form $u^t A u$.

To show the existence of a negative eigenvalue of $\mathcal{K}$, we construct a vector $U = (U_1(X_1), \ldots, U_n(X_n))$ for which the quadratic form
\[
U^t \mathcal{K} U = \sum_{j=1}^{n} \sum_{k=1}^{n} U_j(X_j)\, U_k(X_k)\, K_h(X_j - X_k)
\]
converges in probability to a negative quantity as the sample size grows to infinity. We show the latter by evaluating the expectation of the quadratic form and applying the weak law of large numbers.

Let $\varphi(x)$ be a real function in $L_2$; define its Fourier transform
\[
\hat\varphi(t) = \int e^{-2i\pi t x} \varphi(x)\, dx
\]
and its inverse Fourier transform by
\[
\hat\varphi^{\mathrm{inv}}(t) = \int e^{2i\pi t x} \varphi(x)\, dx .
\]
For kernels $K(\cdot)$ that are real symmetric probability densities, we have $\hat K(t) = \hat K^{\mathrm{inv}}(t)$. From Bochner's theorem, we know that if the kernel $K(\cdot)$ is not positive definite, then there exists a bounded symmetric set $\mathcal{A}$ of positive Lebesgue measure (denoted by $|\mathcal{A}|$) such that
\[
\hat K(t) < 0 \quad \text{for all } t \in \mathcal{A} . \tag{A.2}
\]
Let $\hat\varphi(t) \in L_2$ be a real symmetric function supported on $\mathcal{A}$ and bounded by $B$ (i.e. $|\hat\varphi(t)| \le B$). Its inverse Fourier transform
\[
\varphi(x) = \int_{-\infty}^{\infty} e^{-2\pi i x t}\, \hat\varphi(t)\, dt
\]
is integrable and, by Parseval's identity, $\|\varphi\|^2 = \|\hat\varphi\|^2 \le B^2 |\mathcal{A}| < \infty$. Using the following version of Parseval's identity [see 14, p. 620],
\[
\int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} \varphi(x)\varphi(y) K(x-y)\, dx\, dy = \int_{-\infty}^{\infty} |\hat\varphi(t)|^2 \hat K(t)\, dt ,
\]
which, combined with equation (A.2), leads us to conclude that
\[
\int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} \varphi(x)\varphi(y) K(x-y)\, dx\, dy < 0 .
\]
Consider the vector $U$ with coordinates
\[
U_j = \frac{1}{nh}\, \frac{\varphi(X_j/h)}{f(X_j)}\, I(|X_j| < b), \qquad j = 1, \ldots, n .
\]
With this choice, the expected value of the quadratic form is
\[
E[Q] = E\Big[\sum_{j,k=1}^{n} U_j(X_j) U_k(X_k) K_h(X_j - X_k)\Big]
= \frac{1}{n}\int_{-b}^{b} \frac{\varphi(s/h)^2}{f(s)\, h^2}\, K_h(0)\, ds
+ \frac{n^2 - n}{n^2}\int_{-b}^{b}\!\!\int_{-b}^{b} \frac{1}{h^2}\, \varphi(s/h)\varphi(t/h)\, K_h(s-t)\, ds\, dt
= I_1 + I_2 .
\]
We bound the first integral:
\[
I_1 = \frac{K_h(0)}{n h^2}\int_{-b}^{b} \frac{\varphi(s/h)^2}{f(s)}\, ds
\le \frac{K_h(0)}{n c h}\int_{-b/h}^{b/h} \varphi(u)^2\, du
\le \frac{B^2 |\mathcal{A}|\, K(0)}{c h^2}\, n^{-1} .
\]
Observe that for any fixed value of $h$, the latter can be made arbitrarily small by choosing $n$ large enough. We evaluate the second integral by noting that
\[
I_2 = \Big(1 - \frac{1}{n}\Big) h^{-2} \int_{-b}^{b}\!\!\int_{-b}^{b} \varphi(s/h)\varphi(t/h)\, K_h(s-t)\, ds\, dt
= \Big(1 - \frac{1}{n}\Big) h^{-2} \int_{-b}^{b}\!\!\int_{-b}^{b} \varphi(s/h)\varphi(t/h)\, \frac{1}{h} K\Big(\frac{s}{h} - \frac{t}{h}\Big)\, ds\, dt
= \Big(1 - \frac{1}{n}\Big) h^{-1} \int_{-b/h}^{b/h}\!\!\int_{-b/h}^{b/h} \varphi(u)\varphi(v)\, K(u-v)\, du\, dv . \tag{A.3}
\]
By the dominated convergence theorem, the value of the last integral converges to $\int_{-\infty}^{\infty} |\hat\varphi(t)|^2 \hat K(t)\, dt < 0$ as $h$ goes to zero. Thus, for $h$ small enough, (A.3) is negative, and it follows that we can make $E[Q] < 0$ by taking $n \ge n_0$, for some large $n_0$. Finally, convergence in probability of the quadratic form to its expectation is guaranteed by the weak law of large numbers for U-statistics (see Grams and Serfling [18] for example). The conclusion of the theorem follows.
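The mechanism behind this proof, namely that a kernel whose Fourier transform changes sign produces kernel matrices with negative eigenvalues, can also be checked numerically. The sketch below is illustrative only; the quadrature grid, the bandwidth $h = 0.1$ and the simulated design are arbitrary choices of ours.

```python
import numpy as np

def fourier_transform(kernel, t, grid=None):
    """Riemann-sum approximation of K_hat(t) = int exp(-2 i pi t x) K(x) dx for a
    symmetric kernel supported on [-1, 1]; by symmetry this is a cosine integral."""
    if grid is None:
        grid = np.linspace(-1.0, 1.0, 20001)
    return np.sum(np.cos(2.0 * np.pi * t * grid) * kernel(grid)) * (grid[1] - grid[0])

uniform = lambda u: 0.5 * (np.abs(u) <= 1.0)
epanechnikov = lambda u: 0.75 * np.clip(1.0 - u ** 2, 0.0, None)

ts = np.linspace(0.1, 3.0, 60)
print("min of the uniform kernel transform      :",
      min(fourier_transform(uniform, t) for t in ts))        # negative
print("min of the Epanechnikov kernel transform :",
      min(fourier_transform(epanechnikov, t) for t in ts))   # negative

# Finite-sample counterpart: the kernel matrix {K_h(X_i - X_j)} picks up
# negative eigenvalues when the kernel is not positive definite.
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.0, 1.0, 40))
h = 0.1
K_mat = uniform((x[:, None] - x[None, :]) / h) / h
print("smallest eigenvalue of the uniform kernel matrix:",
      np.linalg.eigvalsh(K_mat).min())                        # typically negative
```

A Gaussian kernel, whose transform is everywhere positive, does not exhibit this behavior, which is consistent with the good performance of the boosted Gaussian kernel smoother in the simulations.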
Proof of Proposition 3.3. We are interested in the sign of the quadratic form $u^t \mathcal{K} u$ (see the proof of Theorem 3.2). Recall that if $\mathcal{K}$ is positive semidefinite, then all its principal minors [see 23, p. 398] are nonnegative. In particular, we can show that $A$ is not positive semidefinite by producing a $3 \times 3$ principal minor with negative determinant. To this end, take the principal minor $K[3]$ obtained by keeping the rows and columns $(i_1, i_2, i_3)$. Without loss of generality, assume that $X_{i_1} < X_{i_2} < X_{i_3}$. The determinant of $K[3]$ is
\[
\det(K[3]) = K_h(0)\big[K_h(0)^2 - K_h(X_{i_3}-X_{i_2})^2\big]
- K_h(X_{i_2}-X_{i_1})\big[K_h(0) K_h(X_{i_2}-X_{i_1}) - K_h(X_{i_3}-X_{i_2}) K_h(X_{i_3}-X_{i_1})\big]
+ K_h(X_{i_3}-X_{i_1})\big[K_h(X_{i_2}-X_{i_1}) K_h(X_{i_3}-X_{i_2}) - K_h(0) K_h(X_{i_3}-X_{i_1})\big] .
\]
Let us evaluate this quantity for the uniform and Epanechnikov kernels.

Uniform kernel. Let $h$ be larger than the minimum distance between three consecutive points, and choose the indices $i_1, i_2, i_3$ such that $X_{i_2}-X_{i_1} < h$, $X_{i_3}-X_{i_2} < h$ and $X_{i_3}-X_{i_1} > h$. With this choice, we readily calculate
\[
\det(K[3]) = 0 - K_h(0)\big[K_h(0)^2 - 0\big] - 0 < 0 .
\]
Since a principal minor of $\mathcal{K}$ is negative, we conclude that $\mathcal{K}$ and $A$ are not positive semidefinite.

Epanechnikov kernel. For fixed $i_1, i_2, i_3$, write $x = X_{i_2}-X_{i_1}$ and $y = X_{i_3}-X_{i_2}$, and assume that $h > \min(x, y)$. The determinant $\det(K[3])$ is then a bivariate function of $x$ and $y$ (since $X_{i_3}-X_{i_1} = x + y$). Numerical evaluation of this function shows that as soon as the range of the three points is smaller than the bandwidth, the determinant of $K[3]$ is negative.

[Figure 9: Contour plot of $\det(K[3])$ as a function of $(x, y)$.]

Thus a principal minor of $\mathcal{K}$ is negative and, as a result, $\mathcal{K}$ and $A$ are not positive semidefinite.
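The numerical evaluation invoked for the Epanechnikov kernel can be reproduced with a few lines of code. The sketch below is illustrative; we take $h = 1$, which simply amounts to measuring the gaps $x$ and $y$ in units of the bandwidth.

```python
import numpy as np

def epanechnikov(u):
    return 0.75 * np.clip(1.0 - u ** 2, 0.0, None)

def det_K3(x, y, h=1.0):
    """Determinant of the 3x3 principal minor K[3] for three ordered points with
    consecutive gaps x and y (so the outer gap is x + y)."""
    Kh = lambda d: epanechnikov(d / h) / h
    K3 = np.array([[Kh(0.0),   Kh(x),   Kh(x + y)],
                   [Kh(x),     Kh(0.0), Kh(y)],
                   [Kh(x + y), Kh(y),   Kh(0.0)]])
    return np.linalg.det(K3)

h = 1.0
gaps = np.linspace(0.01, 0.99, 99)
dets = np.array([[det_K3(x, y, h) for y in gaps] for x in gaps])
inside = (gaps[:, None] + gaps[None, :]) < h   # range of the three points < bandwidth

print("max det(K[3]) when x + y < h:", dets[inside].max())   # negative, as in Figure 9
print("det(K[3]) at x = y = 0.6    :", det_K3(0.6, 0.6, h))  # positive: outer gap > h
```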
REFERENCES

[1] H. Akaike. Information theory and an extension of the maximum likelihood principle. In B. N. Petrov and B. F. Csaki, editors, Second International Symposium on Information Theory, pages 267-281, Budapest, 1973. Akademiai Kiado.
[2] L. Breiman. Arcing classifiers (with discussion). Ann. Statist., 26:801-849, 1998.
[3] P. Bühlmann. Boosting for high-dimensional linear models. Ann. Statist., 34:559-583, 2006.
[4] P. Bühlmann and B. Yu. Boosting with the $L_2$ loss: regression and classification. J. Amer. Statist. Assoc., 98:324-339, 2003.
[5] A. Buja, T. Hastie, and R. Tibshirani. Linear smoothers and additive models. Ann. Statist., 17:453-510, 1989.
[6] W. Cleveland. Robust locally weighted regression and smoothing scatterplots. J. Amer. Statist. Assoc., 74:829-836, 1979.
[7] P. Craven and G. Wahba. Smoothing noisy data with spline functions. Numerische Mathematik, 31:377-403, 1979.
[8] M. Di Marzio and C. Taylor. Boosting kernel density estimates: a bias reduction technique? Biometrika, 91:226-233, 2004.
[9] M. Di Marzio and C. Taylor. Multiple kernel regression smoothing by boosting. Submitted, 2007.
[10] M. Di Marzio and C. Taylor. On boosting kernel regression. To appear in J. Statist. Plann. Inference, 2008.
[11] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression (with discussion). Ann. Statist., 32:407-451, 2004.
[12] R. Eubank. Spline Smoothing and Nonparametric Regression. Marcel Dekker, New York, 1988.
[13] J. Fan and I. Gijbels. Local Polynomial Modelling and Its Applications: Theory and Methodologies. Chapman & Hall, New York, 1996.
[14] W. Feller. An Introduction to Probability Theory and Its Applications, volume 2. Wiley, New York, 1966.
[15] Y. Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121:256-285, 1995.
[16] J. Friedman. Greedy function approximation: a gradient boosting machine. Ann. Statist., 29:1189-1232, 2001.
[17] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Ann. Statist., 28:337-407, 2000.
[18] W. Grams and R. Serfling. Convergence rates for U-statistics and related statistics. Ann. Statist., 1:153-160, 1973.
[19] T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman & Hall, 1995.
[20] N. Hengartner and E. Matzner-Løber. Asymptotic unbiased density estimators. Accepted in ESAIM, 2007.
[21] N. Hengartner, M. Wegkamp, and E. Matzner-Løber. Bandwidth selection for local linear regression smoothers. J. R. Statist. Soc. B, 64:1-14, 2002.
[22] N. Hjort and I. Glad. Nonparametric density estimation with a parametric start. Ann. Statist., 23:882-904, 1995.
[23] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, New York, 1985.
[24] C. Hurvich, J. Simonoff, and C.-L. Tsai. Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion. J. R. Statist. Soc. B, 60:271-294, 1998.
[25] M. Jones, O. Linton, and J. Nielsen. A simple and effective bias reduction method for kernel density estimation. Biometrika, 82:327-338, 1995.
[26] K.-C. Li. From Stein's unbiased risk estimates to the method of generalized cross-validation. Ann. Statist., 13:1352-1377, 1985.
[27] K.-C. Li. Asymptotic optimality for $C_p$, $C_L$, cross-validation and generalized cross-validation: discrete index set. Ann. Statist., 15:958-975, 1987.
[28] C. L. Mallows. Some comments on $C_p$. Technometrics, 15:661-675, 1973.
[29] E. Nadaraya. On estimating regression. Theory of Probability and its Applications, 9:134-137, 1964.
[30] G. Ridgeway. Additive logistic regression: a statistical view of boosting: discussion. Ann. Statist., 28:393-400, 2000.
[31] R. Schapire. The strength of weak learnability. Machine Learning, 5:197-227, 1990.
[32] L. Schwartz. Analyse IV: applications à la théorie de la mesure. Hermann, Paris, 1993.
[33] G. Schwarz. Estimating the dimension of a model. Ann. Statist., 6:461-464, 1978.
[34] J. Simonoff. Smoothing Methods in Statistics. Springer, New York, 1996.
[35] J. Tukey. Exploratory Data Analysis. Addison-Wesley, 1977.
[36] F. Utreras. Natural spline functions, their associated eigenvalue problem. Numer. Math., pages 107-117, 1983.
[37] G. Wahba. Spline Models for Observational Data. SIAM, Philadelphia, 1990.
[38] G. Watson. Smooth regression analysis. Sankhyā Ser. A, 26:359-372, 1964.
[39] E. Whittaker. On a new method of graduation. Proc. Edinburgh Math. Soc., 41:63-75, 1923.

Address of P.-A. Cornillon
UMR ASB - Montpellier SupAgro
34060 Montpellier Cedex 1
E-mail: pierre-andre.cornillon@supagro.inra.fr

Address of N. Hengartner
Los Alamos National Laboratory, NM, USA
E-mail: nickh@lanl.gov

Address of E. Matzner-Løber
Statistics, IRMAR UMR 6625, Univ. Rennes 2, 35043 Rennes, France
E-mail: eml@uhb.fr
