Scaled Sparse Linear Regression
Scaled sparse linear regression jointly estimates the regression coefficients and noise level in a linear model. It chooses an equilibrium with a sparse regression method by iteratively estimating the noise level via the mean residual square and scaling the penalty in proportion to the estimated noise level. The iterative algorithm costs little beyond the computation of a path or grid of the sparse regression estimator for penalty levels above a proper threshold. For the scaled lasso, the algorithm is a gradient descent in a convex minimization of a penalized joint loss function for the regression coefficients and noise level. Under mild regularity conditions, we prove that the scaled lasso simultaneously yields an estimator for the noise level and an estimated coefficient vector satisfying certain oracle inequalities for prediction, the estimation of the noise level and the regression coefficients. These inequalities provide sufficient conditions for the consistency and asymptotic normality of the noise level estimator, including certain cases where the number of variables is of greater order than the sample size. Parallel results are provided for the least squares estimation after model selection by the scaled lasso. Numerical results demonstrate the superior performance of the proposed methods over an earlier proposal of joint convex minimization.
💡 Research Summary
The paper introduces a novel framework called “scaled sparse linear regression” for jointly estimating the regression coefficients and the error variance in high‑dimensional linear models. Traditional ℓ₁‑penalized methods such as the lasso require a penalty level λ that is typically proportional to the unknown noise standard deviation σ. In practice λ is chosen by cross‑validation, which is computationally intensive and lacks solid theoretical guarantees for variable selection consistency.
To address this, the authors propose an iterative algorithm that alternates between (i) estimating σ by the square root of the mean squared residual, σ̂ = ‖y – Xβ̂‖₂/√n, and (ii) scaling the penalty λ in proportion to the current σ estimate (λ = σ̂·λ₀, where λ₀ is a fixed tuning constant). The coefficient vector β̂ is then obtained by solving the corresponding penalized least‑squares problem (the lasso, or a concave‑penalty variant) at the updated λ. The process repeats until convergence. A degrees‑of‑freedom adjustment a ≥ 0 can be incorporated by dividing the residual sum of squares by (1 – a)n instead of n, with a = 0 reproducing the algorithm of Sun & Zhang (2010).
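The alternating steps above can be sketched in a few lines of dependency‑free Python. This is an illustrative sketch, not the authors' implementation: the function names (`soft_threshold`, `lasso_cd`, `scaled_lasso`), the plain coordinate‑descent lasso solver, the null‑model starting value for σ, and the stopping rules are all assumptions made for the example; the paper itself works from a precomputed lasso path.

```python
import math


def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, beta=None, sweeps=200, tol=1e-10):
    """Coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1.

    X is a list of n rows of length p; returns (coefficients, residuals).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p if beta is None else list(beta)
    resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(sweeps):
        max_step = 0.0
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            # Correlation of column j with the residual, with beta_j added back.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_ss[j] * beta[j]
            new = soft_threshold(rho, lam) / col_ss[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
            max_step = max(max_step, abs(delta))
        if max_step < tol:
            break
    return beta, resid


def scaled_lasso(X, y, lam0, a=0.0, tol=1e-8, max_iter=100):
    """Alternate sigma <- ||y - X beta|| / sqrt((1 - a) n) and a lasso fit at sigma * lam0."""
    n = len(y)
    # Assumed starting value: noise level of the null model (beta = 0).
    sigma = math.sqrt(sum(v * v for v in y) / ((1.0 - a) * n))
    beta = None
    for _ in range(max_iter):
        # Step (ii): lasso at the penalty scaled by the current noise estimate.
        beta, resid = lasso_cd(X, y, sigma * lam0, beta)
        # Step (i): re-estimate the noise level from the residuals.
        new_sigma = math.sqrt(sum(r * r for r in resid) / ((1.0 - a) * n))
        done = abs(new_sigma - sigma) <= tol * max(sigma, 1e-12)
        sigma = new_sigma
        if done:
            break
    return beta, sigma
```

Calling `scaled_lasso(X, y, lam0)` returns the pair (β̂, σ̂). Each lasso fit is warm‑started from the previous coefficients, so later iterations are cheap. Multiplying `y` by a constant multiplies both outputs by the same constant, which gives a quick sanity check on an implementation.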
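The authors show that this alternating scheme is equivalent to minimizing a jointly convex loss function
L_{λ₀}(β,σ) = ‖y – Xβ‖₂²/(2nσ) + (1 – a)σ/2 + λ₀‖β‖₁,
which is a scaled version of Huber’s concomitant loss combined with an ℓ₁ penalty. For fixed β, setting the derivative in σ to zero gives σ̂ = ‖y – Xβ‖₂/√((1 – a)n), which is step (i) with the degrees‑of‑freedom adjustment; for fixed σ, minimizing over β is the penalized least‑squares problem at penalty level λ = σλ₀, which is step (ii). Because the loss is jointly convex in (β,σ), the algorithm converges to the global minimizer, and the estimators are equivariant under scaling of the response.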
The theoretical contributions are threefold. First, the authors derive a prediction oracle inequality that bounds the prediction error of the scaled lasso by the best possible linear predictor plus an additional ℓ₁ cost of order λ∑_j min(λ,|β*_j|). This inequality holds for any λ and any sparsity pattern, and it is expressed in terms of the compatibility factor κ(ξ,T), which captures the geometry of the design matrix.
Second, they establish consistency and asymptotic normality of the noise‑level estimator σ̂. By choosing λ₀ = A√(2log p/n) with A > (ξ+1)/(ξ–1) and assuming the compatibility factor is bounded away from zero, they prove that the relative error |σ̂/σ* – 1| converges to zero in probability. Moreover, under mild additional conditions, n^{1/2}(σ̂ – σ*) converges in distribution to N(0,σ²/2), providing a simple, data‑driven estimator of σ that enjoys optimal statistical properties even when p≫n.
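As a rough illustration of how these two results might be used, the sketch below computes the penalty constant λ₀ = A√(2 log p/n) and a normal‑approximation confidence interval for the noise level. The function names are hypothetical, and plugging σ̂ into the limiting variance σ²/2 is an assumption of this example (a standard plug‑in step) rather than a procedure taken from the paper. The caller has to supply A, which the text requires to exceed (ξ+1)/(ξ–1).

```python
import math
from statistics import NormalDist


def universal_lam0(n, p, A):
    """Penalty constant lam0 = A * sqrt(2 * log(p) / n)."""
    return A * math.sqrt(2.0 * math.log(p) / n)


def sigma_confidence_interval(sigma_hat, n, level=0.95):
    """Interval based on sqrt(n) * (sigma_hat - sigma) being roughly N(0, sigma^2 / 2).

    The unknown sigma in the limiting variance is replaced by sigma_hat
    (plug-in assumption), so the standard error is sigma_hat / sqrt(2 n).
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * sigma_hat / math.sqrt(2.0 * n)
    return sigma_hat - half_width, sigma_hat + half_width
```

For example, `sigma_confidence_interval(1.0, 200)` has half‑width 1.96/√400 ≈ 0.098, so the interval narrows at the usual 1/√n rate regardless of the number of variables p.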
Third, they sharpen the convergence rates by bounding the ℓ₁‑error of β̂. They derive an explicit bound µ(λ,ξ) that improves earlier constants and accommodates many small coefficients in β*. Using this bound they show that the error in σ̂ shrinks at the rate τ* = √(λ₀ µ(σλ₀,ξ)/σ), which is essentially the square of the earlier τ₀ rate, yielding faster convergence of the variance estimator.
The paper also extends the results to post‑selection inference: after the scaled lasso selects a model, ordinary least squares (OLS) applied to the selected variables inherits the same oracle inequalities and variance‑estimation guarantees.
Extensive simulations and a real‑data example (e.g., a genomics dataset) demonstrate that the scaled lasso and the scaled OLS outperform the joint penalized maximum‑likelihood estimator of Städler et al. (2010) and its bias‑corrected version. The scaled methods achieve lower prediction mean‑squared error, more accurate σ̂, and comparable or better variable‑selection performance, while requiring only the computation of a regular lasso solution path.
In summary, the paper provides a computationally cheap, theoretically sound, and practically effective method for simultaneous coefficient and noise‑variance estimation in high‑dimensional linear regression. By embedding variance estimation directly into the penalization scheme, it eliminates the need for costly cross‑validation, yields provable oracle properties, and opens the door to reliable inference after model selection.