High-dimensional variable selection
This paper explores the following question: what kind of statistical guarantees can be given when doing variable selection in high-dimensional models? In particular, we look at the error rates and power of some multi-stage regression methods. In the first stage we fit a set of candidate models. In the second stage we select one model by cross-validation. In the third stage we use hypothesis testing to eliminate some variables. We refer to the first two stages as “screening” and the last stage as “cleaning.” We consider three screening methods: the lasso, marginal regression, and forward stepwise regression. Our method gives consistent variable selection under certain conditions.
💡 Research Summary
The paper addresses the challenging problem of variable selection in high‑dimensional linear regression, where the number of covariates p can vastly exceed the sample size n. The authors propose a two‑stage “screen‑clean” framework that separates the task into a rapid dimension‑reduction phase (screening) and a rigorous inferential phase (cleaning). In the screening stage three distinct algorithms are examined: the Lasso, marginal (univariate) regression, and forward stepwise regression. Each method produces a candidate set of predictors that is substantially smaller than the original p. To avoid over‑fitting and to choose the tuning parameters (the Lasso penalty λ, the marginal p‑value cutoff, or the number of steps in forward selection), the authors employ K‑fold cross‑validation, which they show empirically selects models with near‑optimal predictive risk.
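To make the screening stage concrete, here is a minimal sketch of the marginal‑regression screener (the simplest of the three): rank covariates by the absolute marginal correlation with the response and keep the strongest d. The function name `marginal_screen` and the fixed cutoff `d` are illustrative assumptions, not the paper's code; in practice d (or the equivalent tuning parameter for the Lasso or forward stepwise) would be chosen by cross‑validation as described above.

```python
import numpy as np

def marginal_screen(X, y, d):
    """Keep the d covariates with the largest absolute marginal
    correlation with y -- the marginal-regression screener.
    (The Lasso or forward stepwise could be substituted here;
    d stands in for a cross-validated tuning parameter.)"""
    # Standardize so marginal regression coefficients reduce to
    # correlations with the response.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()
    corr = np.abs(Xs.T @ ys) / len(y)
    # Indices of the d strongest marginal associations.
    return np.sort(np.argsort(corr)[::-1][:d])
```

With an independent design and strong signals, the true variables reliably rank at the top; correlated designs are where the Lasso's joint fit tends to screen more accurately than these one‑at‑a‑time correlations.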
Once a reduced model has been obtained, the cleaning stage applies formal hypothesis testing to each retained coefficient. The null hypothesis H0: βj = 0 is tested using the usual t‑statistic based on the ordinary‑least‑squares fit of the screened model, with standard errors adjusted for heteroskedasticity. Because many tests are performed simultaneously, the authors control the false discovery rate (FDR) at a pre‑specified level (e.g., 0.05) using the Benjamini–Hochberg procedure. This step eliminates variables that survive the screening but lack sufficient statistical evidence, thereby improving the specificity of the final selected set.
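The cleaning stage can be sketched as follows: fit OLS on the screened (now low‑dimensional) design, compute a two‑sided p‑value per coefficient, and apply the Benjamini–Hochberg step‑up rule. As simplifications not taken from the paper, the sketch uses classical homoskedastic standard errors (rather than heteroskedasticity‑adjusted ones) and a normal approximation to the t reference distribution.

```python
import math
import numpy as np

def ols_pvalues(X, y):
    """Two-sided p-values for H0: beta_j = 0 from an OLS fit of the
    screened design X. Classical (homoskedastic) standard errors and
    a normal approximation to the t distribution are simplifications."""
    n, d = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - d)          # residual variance estimate
    se = np.sqrt(sigma2 * np.diag(XtX_inv))   # standard errors
    t = beta / se
    return np.array([math.erfc(abs(tj) / math.sqrt(2)) for tj in t])

def benjamini_hochberg(pvals, q=0.05):
    """Boolean rejection mask controlling FDR at level q (BH step-up)."""
    m = len(pvals)
    order = np.argsort(pvals)
    # Largest k with p_(k) <= q * k / m; reject everything below it.
    passed = pvals[order] <= q * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    cutoff = pvals[order][k - 1] if k else -1.0
    return pvals <= cutoff
```

Variables whose mask entry is False are dropped from the final model, which is exactly the specificity‑improving step the summary describes.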
The theoretical contribution rests on two key assumptions. First, a “beta‑min” condition requires that every truly non‑zero coefficient exceeds a threshold of order C·√(log p/n) in absolute value; this ensures that genuine signals are strong enough to survive the screening step. Second, the design matrix X must satisfy a restricted eigenvalue (or compatibility) condition, which guarantees that ℓ1‑penalized methods such as the Lasso can recover the support with high probability. Under these conditions the authors prove that the overall procedure achieves an error probability that decays exponentially in n, specifically O(s·exp(−c·n·C²)), where s is the true sparsity level. Moreover, they show that the cleaning stage retains high power: even when signal strengths are close to the beta‑min boundary, the FDR‑controlled testing still detects a large fraction of true variables.
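As a quick numerical illustration of the beta‑min scale (with the theory's unspecified constant C set to 1 purely for concreteness):

```python
import math

def beta_min_threshold(n, p, C=1.0):
    """Smallest signal magnitude, C * sqrt(log p / n), that the
    beta-min condition requires of every nonzero coefficient.
    C is an unspecified theory constant; C = 1 is illustrative only."""
    return C * math.sqrt(math.log(p) / n)

# In the simulation regime below (n = 200, p = 10,000) the threshold
# is roughly 0.21, so coefficients much smaller than that in absolute
# value fall below the detectability scale the theory requires.
print(round(beta_min_threshold(200, 10_000), 3))  # ~0.215
```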
Extensive simulations illustrate the practical performance. In a setting with n = 200 and p = 10,000, the Lasso screening yields the lowest false discovery rate (≈3 %) and the highest true positive rate (≈92 %) but incurs the greatest computational cost. Marginal regression is computationally trivial (seconds) yet suffers a higher FDR (≈12 %). Forward stepwise occupies an intermediate position in both speed and accuracy. After applying the Benjamini–Hochberg cleaning, all three methods achieve FDR ≤ 0.05 while preserving true positive rates above 80 %, confirming the robustness of the two‑stage approach.
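The two error metrics reported above can be computed from a selected set and the true support with a small helper (a hypothetical utility, not code from the paper):

```python
def fdr_tpr(selected, truth):
    """Empirical false discovery rate and true positive rate of a
    selected variable set against the true support."""
    selected, truth = set(selected), set(truth)
    false_disc = len(selected - truth)
    fdr = false_disc / max(len(selected), 1)        # share of selections that are wrong
    tpr = len(selected & truth) / max(len(truth), 1)  # share of true variables found
    return fdr, tpr
```

For example, selecting {0, 1, 2, 9} when the true support is {0, 1, 2, 3, 4} gives an FDR of 0.25 and a TPR of 0.6.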
The methodology is also applied to a real genomic data set containing roughly 20,000 gene expression measurements on 300 samples. Using Lasso screening followed by FDR‑controlled cleaning, the authors recover 12 of 15 previously reported disease‑associated genes, with an empirical FDR of 0.04. Marginal and forward stepwise screenings produce comparable but slightly less precise results, highlighting the trade‑off between computational speed and selection accuracy.
In summary, the paper presents a coherent, theoretically justified, and empirically validated pipeline for high‑dimensional variable selection. By decoupling rapid screening from rigorous cleaning, it offers a flexible template that can be adapted to other models (e.g., generalized linear models) and extended with Bayesian or non‑linear screening techniques in future work.