Ultrahigh dimensional variable selection: beyond the linear model


Variable selection in high-dimensional space characterizes many contemporary problems in scientific discovery and decision making. Many frequently-used techniques are based on independence screening; examples include correlation ranking (Fan and Lv, 2008) or feature selection using a two-sample t-test in high-dimensional classification (Tibshirani et al., 2003). Within the context of the linear model, Fan and Lv (2008) showed that this simple correlation ranking possesses a sure independence screening property under certain conditions and that its revision, called iterative sure independence screening (ISIS), is needed when the features are marginally unrelated but jointly related to the response variable. In this paper, we extend ISIS, without explicit definition of residuals, to a general pseudo-likelihood framework, which includes generalized linear models as a special case. Even in the least-squares setting, the new method improves ISIS by allowing variable deletion in the iterative process. Our technique allows us to select important features in high-dimensional classification where the popularly used two-sample t-method fails. A new technique is introduced to reduce the false discovery rate in the feature screening stage. Several simulated and two real data examples are presented to illustrate the methodology.


💡 Research Summary

The paper tackles the problem of variable selection when the number of predictors far exceeds the sample size – a setting that is increasingly common in modern scientific and data‑driven applications. Classical high‑dimensional screening methods, such as correlation ranking (Fan & Lv, 2008) or the two‑sample t‑test used in high‑dimensional classification (Tibshirani et al., 2003), rely on marginal associations between each predictor and the response. While Fan and Lv proved that simple correlation‑based Sure Independence Screening (SIS) enjoys a “sure screening” property under certain regularity conditions, SIS fails when important predictors are marginally unrelated but jointly informative. Iterative Sure Independence Screening (ISIS) was introduced to address this by repeatedly fitting a model, computing residuals, and re‑screening the remaining variables. However, ISIS is tied to the linear‑model residual and therefore cannot be directly applied to generalized linear models (GLMs) or other non‑linear likelihood‑based frameworks.
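The marginal screening idea underlying SIS is simple enough to sketch in a few lines: rank every predictor by its absolute correlation with the response and keep the top d. The function name, the choice of d, and the toy data below are our illustrative choices, not the paper's.

```python
import numpy as np

def sis_screen(X, y, d):
    """Rank predictors by absolute marginal correlation with the response
    and keep the top d (illustrative sketch of correlation-based SIS)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # componentwise marginal correlations between each column of X and y
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.argsort(-np.abs(corr))[:d]

# toy example: p = 1000 predictors, n = 100, the first 3 are true signals
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 1000))
y = X[:, 0] + 0.8 * X[:, 1] - 0.8 * X[:, 2] + 0.1 * rng.standard_normal(100)
kept = sis_screen(X, y, d=20)  # should contain indices 0, 1, 2
```

With strong marginal signals, as here, the true predictors survive the screen; the point of ISIS is precisely the settings where this single-pass ranking misses jointly informative variables.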

The authors propose a substantial generalization of ISIS that eliminates the need for explicit residuals and works within a pseudo‑likelihood framework. At each iteration they (i) fit the current set of selected variables by maximizing a pseudo‑likelihood appropriate for the model (e.g., logistic likelihood for binary outcomes, Poisson likelihood for count data), (ii) compute the corresponding score vector (the gradient of the log‑pseudo‑likelihood) which plays the role of a residual, and (iii) evaluate the absolute correlation (or another association measure) between this score and each remaining candidate predictor. The top‑k candidates are added, and simultaneously a deletion step evaluates the contribution of already‑selected variables (via likelihood‑based criteria or penalized regression) and removes those whose marginal contribution falls below a data‑driven threshold. This bidirectional updating (addition + deletion) distinguishes the new method from the original ISIS and improves model parsimony, especially in ultra‑high‑dimensional settings.
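One addition-deletion iteration can be sketched in the least-squares special case, where the score of the Gaussian pseudo-likelihood reduces to the ordinary residual. The function name, `n_add`, and the deletion tolerance below are our illustrative choices; the paper's deletion step uses likelihood-based or penalized criteria rather than this naive coefficient threshold.

```python
import numpy as np

def isis_step(X, y, selected, n_add=5, del_tol=1e-3):
    """One addition-deletion iteration of the generalized ISIS, sketched
    for the least-squares case (score vector = residual)."""
    n, p = X.shape
    # (i) fit the currently selected variables (OLS = Gaussian pseudo-likelihood)
    Xs = X[:, selected]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    # (ii) the score of the Gaussian log-likelihood is the residual vector
    score = y - Xs @ beta
    # (iii) screen the remaining candidates against the score
    rest = [j for j in range(p) if j not in selected]
    corr = np.abs(X[:, rest].T @ score) / (
        np.linalg.norm(X[:, rest], axis=0) * np.linalg.norm(score) + 1e-12)
    additions = [rest[j] for j in np.argsort(-corr)[:n_add]]
    new_set = list(selected) + additions
    # deletion step: refit and drop variables with a negligible coefficient
    # (crude stand-in for the paper's likelihood-based deletion criterion)
    beta_new, *_ = np.linalg.lstsq(X[:, new_set], y, rcond=None)
    return [j for j, b in zip(new_set, beta_new) if abs(b) > del_tol]

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 500))
y = X[:, 0] + X[:, 1] + 0.5 * X[:, 2] + 0.1 * rng.standard_normal(200)
sel = isis_step(X, y, selected=[0])  # starting from a partial model
```

Replacing the residual with the gradient of a logistic or Poisson log-likelihood gives the GLM versions described above.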

A further innovation is a false‑discovery‑rate (FDR) control mechanism for the screening stage. Traditional multiple‑testing corrections (Bonferroni, Benjamini–Hochberg) become overly conservative when p≫n, leading to loss of power. The authors instead estimate an empirical null distribution of the screening statistics and set a data‑adaptive cutoff that targets a desired FDR level, thereby retaining weak but genuine signals while limiting spurious inclusions.
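A permutation scheme gives one concrete way to build such an empirical null: break the X-y association by permuting the response, recompute the screening statistics, and take a high quantile of their absolute values as the cutoff. This is a hedged sketch, the paper's construction may differ in detail, and `n_perm` and `q` are our illustrative choices.

```python
import numpy as np

def permutation_cutoff(X, y, n_perm=20, q=0.99, seed=0):
    """Estimate a data-driven screening threshold from an empirical null
    obtained by permuting the response (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    norms = np.linalg.norm(Xc, axis=0) + 1e-12
    null_stats = []
    for _ in range(n_perm):
        yp = rng.permutation(y)          # destroys any real X-y association
        ypc = yp - yp.mean()
        null_stats.append(np.abs(Xc.T @ ypc) / (norms * np.linalg.norm(ypc) + 1e-12))
    return np.quantile(np.concatenate(null_stats), q)

rng = np.random.default_rng(2)
X = rng.standard_normal((150, 2000))
y = 1.5 * X[:, 0] + 0.1 * rng.standard_normal(150)
cut = permutation_cutoff(X, y)
obs = np.abs((X - X.mean(0)).T @ (y - y.mean())) / (
    np.linalg.norm(X - X.mean(0), axis=0) * np.linalg.norm(y - y.mean()) + 1e-12)
kept = np.flatnonzero(obs > cut)  # variables surviving the adaptive cutoff
```

Because the cutoff adapts to the observed null scale rather than to a worst-case bound, it is far less conservative than a Bonferroni correction at p ≫ n.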

Theoretical contributions include proofs that the generalized ISIS retains the sure‑screening property and achieves model‑selection consistency under a set of assumptions: (1) sub‑Gaussian tails for the predictor distribution, (2) a minimum signal strength condition that scales with log p, and (3) smoothness and strong convexity of the pseudo‑likelihood. The proofs extend the original SIS/ISIS arguments by bounding the deviation of the score vector and invoking concentration inequalities for high‑dimensional sub‑Gaussian processes.
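In generic notation (ours, not necessarily the paper's), a sure-screening guarantee of this type takes the following form, with the minimum-signal and dimensionality conditions entering as stated:

```latex
% M_* = true index set, \widehat{M} = screened set (notation ours)
P\bigl(M_* \subseteq \widehat{M}\bigr) \;\longrightarrow\; 1
\quad \text{as } n \to \infty,
\qquad \text{provided } \min_{j \in M_*} |\beta_j| \ge c\, n^{-\kappa}
\ \text{ and }\ \log p = o\bigl(n^{1-2\kappa}\bigr)
\ \text{ for some } \kappa \in [0, 1/2).
```

The exact exponents and constants depend on the assumptions (1)-(3) above; this display only conveys the shape of the result.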

Extensive simulations illustrate the method's performance across three model families: (a) standard linear regression, (b) logistic regression for binary classification, and (c) Poisson regression for count outcomes. In each scenario p is set to 10,000 while n ≈ 200, mimicking ultra-high-dimensional regimes. Compared with SIS, ISIS, Lasso, Elastic Net, and two-sample t-screening, the proposed algorithm consistently yields higher precision and recall in variable selection, better prediction (lower RMSE in regression, higher AUC in classification), and a markedly reduced false-discovery rate. Notably, in settings where t-screening captures no true signal (because the marginal differences are null), the pseudo-likelihood ISIS still recovers a substantial proportion of the true predictors.
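For concreteness, the selection metrics reported in such comparisons are simple set operations between the selected variables and the true support. The helper below is a generic evaluation sketch, not code from the paper.

```python
def selection_metrics(selected, truth):
    """Precision, recall, and false-discovery proportion of a selected
    variable set against the true support (generic evaluation helper)."""
    sel, tru = set(selected), set(truth)
    tp = len(sel & tru)                       # correctly selected variables
    precision = tp / len(sel) if sel else 0.0
    recall = tp / len(tru) if tru else 0.0
    fdp = 1.0 - precision if sel else 0.0     # false-discovery proportion
    return precision, recall, fdp

# e.g. a screen keeping {0, 1, 7} when the truth is {0, 1, 2}:
p, r, fdp = selection_metrics([0, 1, 7], [0, 1, 2])  # → (2/3, 2/3, 1/3)
```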

Two real‑world applications demonstrate practical relevance. The first uses a gene‑expression dataset with several thousand genes measured on a few hundred tumor and normal samples. The method selects a few hundred genes that achieve an AUC exceeding 0.95, outperforming both SIS and penalized regression approaches while using far fewer features. The second application involves spam‑email detection based on a bag‑of‑words representation with tens of thousands of word‑features. After a few iterations the algorithm reduces the feature set to fewer than 50 words yet maintains classification accuracy above 95%, illustrating the power of the deletion step in producing compact, interpretable models.

In summary, the paper extends the scope of iterative independence screening from the narrow linear‑model world to a broad class of pseudo‑likelihood models, introduces a principled variable‑deletion mechanism, and provides a data‑driven FDR control for the screening stage. The combination of rigorous theoretical guarantees, extensive simulation evidence, and successful real‑data case studies positions this methodology as a versatile tool for ultra‑high‑dimensional variable selection in modern statistical learning, bioinformatics, and related fields.

