Sure independence screening in generalized linear models with NP-dimensionality
Ultrahigh-dimensional variable selection plays an increasingly important role in contemporary scientific discoveries and statistical research. Among others, Fan and Lv [J. R. Stat. Soc. Ser. B Stat. Methodol. 70 (2008) 849-911] proposed an independent screening framework by ranking the marginal correlations. They showed that the correlation ranking procedure possesses a sure independence screening property within the context of the linear model with Gaussian covariates and responses. In this paper, we propose a more general version of the independent learning that ranks the maximum marginal likelihood estimates or the maximum marginal likelihood itself in generalized linear models. We show that the proposed methods, with Fan and Lv [J. R. Stat. Soc. Ser. B Stat. Methodol. 70 (2008) 849-911] as a very special case, also possess the sure screening property with vanishing false selection rate. The conditions under which the independence learning possesses the sure screening property are surprisingly simple. This justifies the applicability of such a simple method to a wide spectrum of problems. We quantify explicitly the extent to which the dimensionality can be reduced by independence screening, which depends on the interactions of the covariance matrix of covariates and true parameters. Simulation studies are used to illustrate the utility of the proposed approaches. In addition, we establish an exponential inequality for the quasi-maximum likelihood estimator which is useful for high-dimensional statistical learning.
💡 Research Summary
The paper addresses the challenge of variable selection in ultra‑high‑dimensional settings where the number of covariates (p_n) can grow exponentially with the sample size (n), the so‑called NP‑dimensionality of the title. Building on the “sure independence screening” (SIS) framework introduced by Fan and Lv (2008), which ranks variables by marginal correlations in a linear Gaussian model, the authors propose a far more general screening procedure that works for any generalized linear model (GLM).
The key idea is to evaluate each predictor (X_j) separately by fitting a univariate GLM that contains only that predictor (all other covariates are omitted). From this marginal model the authors extract either the maximum marginal likelihood (MML) itself or the corresponding parameter estimate (\hat\beta_j^{(m)}). These quantities serve as importance scores; variables are then ranked by the absolute value of the scores and the top (d_n) variables are retained, where (d_n) is typically chosen as a modest fraction of (n) (e.g., (d_n = \lfloor n/\log n\rfloor)). In the special case of a linear model with Gaussian errors, the MML reduces to the ordinary least‑squares estimate, and the ranking coincides with the marginal correlation ranking of the original SIS, showing that the new method truly generalises the earlier approach.
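To make the fit‑and‑rank step concrete, here is a minimal NumPy sketch for the logistic (binary GLM) case. The function names, the Newton–Raphson solver, and the stabilising ridge term are illustrative assumptions, not the authors' code; the ranking rule (absolute marginal estimate, keep the top (d_n = \lfloor n/\log n\rfloor)) follows the description above.

```python
import numpy as np

def fit_marginal_logistic(x, y, n_iter=25):
    """Newton-Raphson fit of the univariate logistic model
    P(Y = 1 | X_j) = sigmoid(alpha + beta_j * X_j); returns the slope beta_j."""
    X = np.column_stack([np.ones_like(x), x])
    theta = np.zeros(2)
    for _ in range(n_iter):
        eta = np.clip(X @ theta, -30.0, 30.0)   # guard against exp overflow
        p = 1.0 / (1.0 + np.exp(-eta))
        w = p * (1.0 - p)
        grad = X.T @ (y - p)
        hess = (X * w[:, None]).T @ X + 1e-8 * np.eye(2)  # tiny ridge for stability
        theta = theta + np.linalg.solve(hess, grad)
    return theta[1]

def mml_sis(X, y, d_n=None):
    """Rank predictors by the absolute marginal MLE and keep the top d_n."""
    n, p = X.shape
    if d_n is None:
        d_n = int(n / np.log(n))  # the typical choice, floor(n / log n)
    scores = np.array([abs(fit_marginal_logistic(X[:, j], y)) for j in range(p)])
    return np.argsort(scores)[::-1][:d_n]
```

Each marginal fit touches only one column of the design matrix, which is what makes the procedure cheap and embarrassingly parallel across predictors.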
The authors establish two fundamental theoretical guarantees under relatively mild conditions. First, a sure screening property: with probability tending to one, the retained set (\widehat{\mathcal{M}}) contains the true active set (\mathcal{M}_*). This holds provided (i) the minimum eigenvalue of the covariance matrix of the covariates is bounded away from zero, (ii) the smallest non‑zero coefficient satisfies (\min_{j\in\mathcal{M}_*}|\beta_j^*| \ge C\sqrt{\log p_n/n}), and (iii) the GLM satisfies standard regularity conditions (smooth link, exponential‑family tails). Second, a controlled false selection rate: the number of noise variables that survive the screening step is only of polynomial order (O(n^{\kappa})) for some (\kappa<1), implying that the post‑screening dimensionality is low enough for refined penalised methods (Lasso, SCAD, MCP) to be applied safely.
A central technical contribution is an exponential inequality for the quasi‑maximum likelihood estimator (QMLE) in high dimensions. The inequality shows that, uniformly over all predictors, the estimation error (|\hat\beta_j^{(m)}-\beta_j^*|) is bounded by a term of order (\sqrt{\log p_n/n}) with exponentially small tail probability. This result not only underpins the sure‑screening proof but also provides a useful tool for other high‑dimensional learning problems involving QMLE.
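A schematic form of this result, with constants suppressed and the paper's additional rate parameters omitted for readability, is the following (the specific constants (C, C_1, C_2) here are illustrative):

```latex
% Pointwise tail bound for each marginal QMLE (schematic; constants suppressed):
\mathbb{P}\left( \bigl|\hat{\beta}_j^{(m)} - \beta_j^{*}\bigr| \ge t \right)
  \le C_1 \exp\left( - C_2\, n t^{2} \right).
% Taking t = C \sqrt{\log p_n / n} and a union bound over j = 1, \dots, p_n:
\mathbb{P}\left( \max_{1 \le j \le p_n}
    \bigl|\hat{\beta}_j^{(m)} - \beta_j^{*}\bigr|
    \ge C \sqrt{\tfrac{\log p_n}{n}} \right)
  \le C_1\, p_n^{\,1 - C_2 C^{2}} \longrightarrow 0
  \quad \text{for } C \text{ large enough.}
```

The union-bound step is what lets the pointwise tail control extend uniformly over exponentially many predictors.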
The paper quantifies how much dimensionality can be reduced by the screening. The reduction depends on the interaction between the covariance structure of the covariates and the true coefficient vector. For example, when covariates are block‑correlated, the effective number of retained variables after screening is roughly the number of blocks, regardless of the total number of variables inside each block.
Extensive simulations are presented for logistic and Poisson regressions. The authors vary (p_n) from (10^3) up to (10^6), and consider three correlation structures: independent, AR(1) with (\rho=0.5), and block correlation. Across all scenarios, the proposed marginal‑likelihood SIS (MML‑SIS) achieves recall rates above 0.95 and precision above 0.90, dramatically outperforming the original correlation‑based SIS, especially when predictors are strongly correlated. The size of the screened set is close to the target (d_n), typically only 1.2–1.5 times larger, which is easily manageable for subsequent penalised regression.
Real‑data applications include a gene‑expression study (200 samples, 20 000 genes) and a text‑classification task (500 documents, 100 000 words). After applying MML‑SIS, the authors fit a Lasso model on the reduced set. Compared with the classic SIS, the new method yields lower cross‑validation error (≈8 % reduction for the gene data, ≈12 % for the text data) and selects more biologically plausible genes and more informative words.
The authors discuss practical aspects: the computational cost of MML‑SIS is (O(np_n)) because each marginal GLM can be fitted independently, making the procedure trivially parallelisable. They also note limitations: the method assumes the response belongs to an exponential family and that the link function is correctly specified; extreme multicollinearity may still impair the ranking. Future work is suggested on extending the theory to non‑exponential families, adaptive choices of (d_n), and integration with deep learning architectures for ultra‑large‑scale screening.
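The (O(np_n)) cost is easiest to see in the linear‑Gaussian special case, where (as noted above) the marginal MLE reduces to the univariate least‑squares slope: all (p_n) scores then come from a single centred matrix‑vector pass rather than (p_n) separate model fits. The sketch below and its names are illustrative:

```python
import numpy as np

def marginal_scores_linear(X, y):
    """All p univariate OLS slopes at once: one O(n * p) matrix-vector pass.

    For predictor j the marginal slope is cov(X_j, y) / var(X_j); in the
    linear-Gaussian case ranking |slope| matches the marginal-MLE ranking.
    """
    Xc = X - X.mean(axis=0)                       # centre each column
    yc = y - y.mean()
    slopes = (Xc.T @ yc) / np.einsum('ij,ij->j', Xc, Xc)
    return np.abs(slopes)
```

For a general GLM the per‑predictor fit is iterative, but each iteration is still O(n) per column, so the overall O(np_n) order is preserved.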
In conclusion, the paper delivers a simple yet theoretically solid screening technique that broadens the applicability of sure independence screening from linear Gaussian models to the full class of generalized linear models. By leveraging marginal maximum likelihood information, it retains the computational elegance of SIS while guaranteeing sure screening and low false‑selection rates under realistic high‑dimensional conditions. This makes it a valuable preprocessing tool for modern statistical learning pipelines dealing with millions of variables.