Comment: Boosting Algorithms: Regularization, Prediction and Model Fitting
📝 Original Info
- Title: Comment: Boosting Algorithms: Regularization, Prediction and Model Fitting
- ArXiv ID: 0804.2770
- Date: 2008-12-18
- Authors: Original paper ("Boosting Algorithms: Regularization, Prediction and Model Fitting", referred to as "BH" below): Bühlmann and Hothorn. Author(s) of this comment (the document analyzed here): not stated in the extracted text, so they cannot be confirmed; the text repeatedly cites Hastie, Tibshirani and Friedman (2001) as the authors' own book.
📝 Abstract
Comment on "Boosting Algorithms: Regularization, Prediction and Model Fitting" [arXiv:0804.2752]
💡 Deep Analysis
📄 Full Content
• develop a limiting version of L2-boost in which the step-length ν went to zero;
• show that this limiting version gave paths identical to the lasso, as was hinted in that chapter.
The result was three very similar varieties of the LARS algorithm, namely lasso, LAR and infinitesimal forward stagewise (iFSLR) (package lars for R, available from CRAN). iFSLR is indeed the limit of L2-boost as ν ↓ 0, with piecewise-linear coefficient profiles, but is not always the same as the lasso.
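As an illustration (not taken from the comment itself), all three path algorithms can be run through the lars package in R; the diabetes data shipped with lars is used here purely as a stand-in dataset.

```r
# Sketch: fitting the three LARS-family paths with the lars package.
# Assumes a numeric predictor matrix `x` and response vector `y`;
# the diabetes data below is a placeholder example.
library(lars)
data(diabetes)
x <- diabetes$x
y <- diabetes$y

fit.lasso <- lars(x, y, type = "lasso")              # lasso path
fit.lar   <- lars(x, y, type = "lar")                # least angle regression
fit.fs    <- lars(x, y, type = "forward.stagewise")  # iFSLR path

# The three profiles coincide when the lasso profiles are monotone.
par(mfrow = c(1, 3))
plot(fit.lasso); plot(fit.lar); plot(fit.fs)
```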
On a slight technical note, the version of L2-boost proposed in BH is slightly different from that in Hastie, Tibshirani and Friedman (2001). Compare their coefficient-update rules (1) and (2).
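The displays (1) and (2) did not survive extraction. The following LaTeX is a hedged reconstruction of what is being compared, based on the standard formulations: the incremental forward stagewise rule of Hastie, Tibshirani and Friedman (2001) moves the selected coefficient by a fixed amount ν in the direction of the correlation sign, whereas the BH L2-boost rule adds ν times the least-squares coefficient of the current residual on the selected variable.

```latex
% Hedged reconstruction (not verbatim from the comment).
% r is the current residual and x_j the selected predictor.
% (1) Incremental forward stagewise, Hastie, Tibshirani & Friedman (2001):
\hat\beta_j \;\leftarrow\; \hat\beta_j + \nu\,\operatorname{sign}\langle x_j, r\rangle ;
% (2) Componentwise L2-boost as in BH:
\hat\beta_j \;\leftarrow\; \hat\beta_j + \nu\,\frac{\langle x_j, r\rangle}{\langle x_j, x_j\rangle} .
```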
Despite the difference, they both have the same limit, which is computed exactly for squared-error loss by the type="forward.stagewise" option in the package lars. As ν gets very small, initially the same coefficient tends to get continuously updated by infinitesimal amounts (hence linearly). Eventually a second variable ties with the first for coefficient updates, which they share in a balanced way while remaining tied. Then a third joins in, and so on. Using simple least-squares computations, the LARS algorithm computes the entire iFSLR path at the same cost as a single multiple-least-squares fit. Note that in this limiting case, we can no longer index the sequence by step-number m as in (1) or (2), but must resort to some other measure, such as the L1 arc-length of the coefficient profile (Hastie, Taylor, Tibshirani and Walther (2007)).
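A minimal sketch of this limiting behaviour (illustrative only, not code from the comment): the explicit loop below implements the sign-update variant (the style of (1) above) with a small step-length, and traces essentially the same coefficient profile that lars(type = "forward.stagewise") computes exactly. The diabetes data is again a placeholder.

```r
# Sketch: explicit small-step forward stagewise vs. the exact lars path.
library(lars)
data(diabetes)
x <- scale(diabetes$x)           # standardize predictors
y <- diabetes$y - mean(diabetes$y)

nu <- 0.001                      # small step-length
nsteps <- 5000
beta <- rep(0, ncol(x))
path <- matrix(0, nsteps, ncol(x))
r <- y
for (m in seq_len(nsteps)) {
  cors <- drop(crossprod(x, r))            # inner products with residual
  j <- which.max(abs(cors))                # most correlated predictor
  beta[j] <- beta[j] + nu * sign(cors[j])  # tiny update of one coefficient
  r <- r - nu * sign(cors[j]) * x[, j]     # update residual accordingly
  path[m, ] <- beta
}

# Exact infinitesimal forward stagewise (iFSLR) path for comparison:
fit.fs <- lars(x, y, type = "forward.stagewise")
matplot(rowSums(abs(path)), path, type = "l",
        xlab = "L1 norm of coefficients", ylab = "coefficients")
```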
Lasso and iFSLR are not always the same. In high-dimensional problems with correlated predictors, lasso profiles can quickly become wiggly, whereas iFSLR profiles tend to be much smoother and monotone (Hastie et al., 2007). Efron et al. (2004) establish a sufficient positive cone condition on the model matrix X, which effectively limits the amount of correlation between the variables and guarantees that lasso and iFSLR are the same; in particular, if the lasso profiles are monotone, all three algorithms are identical.
The authors propose a simple formula for the degrees of freedom of an L2-boosted model. They construct the hat matrix B_m that computes the fit at iteration m, and then use df(m) = trace(B_m). They are in effect treating the model at stage m as if it were computed by a predetermined sequence of linear updates. If this were the case, their formula would be spot on, by the accepted definitions of effective degrees of freedom for linear operators (Hastie et al., 2001; Efron et al., 2004). They acknowledge that this is an approximation (since the sequence was not predetermined, but rather adaptively chosen), but do not elaborate. In fact this approximation can be very badly off. Figure 1 shows the true degrees of freedom df_T(k) plotted against df(k) for two examples. We see that df(k) always underestimates df_T(k). We now discuss the details of these examples, and the basis for these claims.
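The comment does not spell out here how df_T(k) is obtained; a standard route (and presumably what underlies Figure 1) is the covariance definition of effective degrees of freedom, df_T(k) = Σ_i cov(ŷ_i(k), y_i) / σ², estimated by simulation. The sketch below is a hedged illustration of that recipe, not code from the comment; the data-generating model is an arbitrary placeholder.

```r
# Sketch: Monte Carlo estimate of true degrees of freedom
# df_T(k) = sum_i cov(yhat_i at step k, y_i) / sigma^2
# for the forward-stagewise (iFSLR) fit at each lars step k.
library(lars)

set.seed(1)
n <- 67; p <- 9
x <- matrix(rnorm(n * p), n, p)            # placeholder design
beta0 <- c(3, 1.5, 0, 0, 2, rep(0, p - 5)) # placeholder true coefficients
mu <- drop(x %*% beta0)
sigma <- 1

nsim <- 500
K <- 9                                      # lars steps to track
yhat <- array(NA, c(nsim, n, K))
ysim <- matrix(NA, nsim, n)

for (b in seq_len(nsim)) {
  y <- mu + sigma * rnorm(n)
  fit <- lars(x, y, type = "forward.stagewise")
  for (k in seq_len(K)) {
    yhat[b, , k] <- predict(fit, newx = x, s = k,
                            type = "fit", mode = "step")$fit
  }
  ysim[b, ] <- y
}

# Sum over observations of cov(yhat_i, y_i), divided by sigma^2.
dfT <- sapply(seq_len(K), function(k) {
  sum(sapply(seq_len(n), function(i) cov(yhat[, i, k], ysim[, i]))) / sigma^2
})
dfT
```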
The left example is the prostate data (Hastie et al., 2001, Figure 10.12) and has 67 observations and 9 predictors (including intercept). The right example fits a univariate piecewise-constant spline model of the form f (x) = 50 j=1 β j h j (x), where the h j (x) = I(x ≥ c j ) are a sequence of Haar basis functions with predefined knots c j at the unique values of the input values x i . There are 50 observations and 50 predictors. In both problems we fit the limiting L 2 -boost model iFSLR, using the lars/forward.stagewise procedure. Figure 2 shows the coefficient profiles.
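To make the second example concrete, here is a hedged sketch of how such a Haar (piecewise-constant) basis could be constructed and passed to the same lars/forward.stagewise procedure; the simulated x and y below are placeholders, not the data used in the comment.

```r
# Sketch: univariate piecewise-constant (Haar) basis h_j(x) = I(x >= c_j),
# with knots c_j at the unique input values, fit by forward stagewise.
library(lars)

set.seed(2)
n <- 50
x <- sort(runif(n))
y <- sin(4 * x) + rnorm(n, sd = 0.3)       # placeholder response

# Drop the first knot: I(x >= min(x)) is identically 1 and would be a
# constant column (it is absorbed by the intercept that lars fits).
knots <- sort(unique(x))[-1]
H <- outer(x, knots, FUN = ">=") * 1        # Haar design matrix

fit <- lars(H, y, type = "forward.stagewise")
plot(fit)                                   # coefficient profiles
```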
In this case, using the results in Efron et al. (2004), it can be deduced that the equivalent limiting version of the hat matrix (5.6) of BH simplifies to a similar but more compact expression:
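The display itself is missing from the extracted text. Based on the product form of BH's (5.6) and the definitions of H_j and γ_j given just below, a plausible reconstruction (an assumption, not verbatim) is:

```latex
% Assumed reconstruction of the limiting hat matrix (not verbatim from the comment).
% H_j is the least-squares hat matrix for the active variables on the j-th
% piecewise-linear segment, and gamma_j the relative arc-length travelled on it.
B_k \;=\; I \;-\; \prod_{j=1}^{k}\bigl(I - \gamma_j H_j\bigr)
```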
Here k indexes the step number in the lars algorithm, where the steps delineate the breakpoints in the piecewise-linear path. H_j is the hat matrix corresponding to the variables involved in the jth portion of the piecewise-linear path, and γ_j is the relative distance in arc-length traveled along this piece until the next variable joins the active set (relative to the arc-length of the step that went all the way to the least squares fit). Using the BH definition, we would compute df(k) = trace(B_k) (vertical axis in Figure 1).
[Fig. 2 caption: Coefficient profiles for the iFSLR algorithm for the two examples. Both profiles are monotone, and are identical to the lasso profiles on these examples. In this case the df increment by 1 exactly at every vertical break-point line.]
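For concreteness, here is a hedged sketch of how df(k) = trace(B_k) could be computed from the reconstructed product expression above, given the active sets and relative arc-lengths γ_j of each LARS segment; the helper is illustrative only and not from the comment.

```r
# Sketch: df(k) = trace(B_k) with B_k = I - prod_{j<=k} (I - gamma_j * H_j),
# where H_j is the hat matrix of the active variables on segment j.
# `active` is a list of column-index vectors and `gamma` a numeric vector of
# relative arc-lengths; both are assumed supplied (illustrative placeholder).
df_bh <- function(x, active, gamma, k) {
  n <- nrow(x)
  P <- diag(n)                                  # running product of (I - gamma_j H_j)
  for (j in seq_len(k)) {
    Xa <- x[, active[[j]], drop = FALSE]
    Hj <- Xa %*% solve(crossprod(Xa), t(Xa))    # hat matrix for segment j
    P <- (diag(n) - gamma[j] * Hj) %*% P        # latest factor applied on the left
  }
  sum(diag(diag(n) - P))                        # trace(B_k)
}
```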
These two examples were chosen carefully, for they both satisfy the positive cone condition mentioned above. In particular, the iFSLR path is the lasso path in both cases, and the active set grows monotonically, by one variable at each step.