Robust Ultra-High-Dimensional Variable Selection With Correlated Structure Using Group Testing

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

**Background:** High-dimensional genomic data exhibit strong group correlation structures that challenge conventional feature selection methods, which often assume feature independence or rely on pre-defined pathways and are sensitive to outliers and model misspecification.

**Methods:** We propose the Dorfman screening framework, a multi-stage procedure that forms data-driven variable groups via hierarchical clustering, performs group and within-group hypothesis testing, and refines selection using the elastic net or adaptive elastic net. Robust variants incorporate OGK-based covariance estimation, rank-based correlation, and Huber-weighted regression to handle contaminated and non-normal data.

**Results:** In simulations, Dorfman-Sparse-Adaptive-EN performed best under normal conditions, while Robust-OGK-Dorfman-Adaptive-EN showed clear advantages under data contamination, outperforming classical Dorfman and competing methods. Applied to NSCLC gene expression data for trametinib response, robust Dorfman methods achieved the lowest prediction errors and enriched recovery of clinically relevant genes.

**Conclusions:** The Dorfman framework provides an efficient and robust approach to genomic feature selection. Robust-OGK-Dorfman-Adaptive-EN offers strong performance under both ideal and contaminated conditions and scales to ultra-high-dimensional settings, making it well suited for modern genomic biomarker discovery.


💡 Research Summary

The paper addresses a fundamental challenge in modern biomedical research: selecting predictive variables from ultra‑high‑dimensional genomic data that are characterized by strong within‑group correlations and frequent departures from classical linear model assumptions (e.g., non‑normal errors, outliers, batch effects). Traditional high‑dimensional screening methods such as Sure Independence Screening (SIS), LASSO, and Elastic Net treat predictors as marginally independent, while group‑regularized extensions (Group LASSO, Sparse Group LASSO) require pre‑specified pathway annotations that are often incomplete or context‑specific. Moreover, these approaches can be computationally burdensome and fragile to contamination.

To overcome these limitations, the authors propose a novel “Dorfman‑Screening” framework that adapts the classic group‑testing concept (originally devised for infectious‑disease screening) to continuous‑outcome regression. The procedure consists of three modular stages, each of which can be swapped with alternative techniques without affecting the overall pipeline.

Stage 1 – Data‑driven grouping:
Predictors are clustered into groups using either (A) a simple Pearson correlation matrix transformed into a dissimilarity measure (1 − |R|) followed by average‑linkage hierarchical clustering, or (B) a sparse precision matrix estimated via the graphical LASSO. In the latter case, a grid of penalty parameters ρ is explored, and the optimal ρ* is chosen by maximizing clustering purity (adjusted Rand index). The dendrogram is then cut at a height h selected through 5‑fold cross‑validation that minimizes the downstream prediction RMSE. For robustness, the authors also implement (i) Spearman rank correlation with dynamic tree cutting, and (ii) an OGK‑based robust covariance estimator combined with graphical LASSO to obtain a sparse, outlier‑resistant correlation matrix.
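Variant (A) of this stage can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the simulated data, the induced correlated pair, and the fixed cut height `h` are assumptions (the paper tunes `h` by 5-fold cross-validation on downstream RMSE).

```python
# Sketch of Stage 1, variant A: correlation-based hierarchical grouping.
# Simulated data and the cut height h are illustrative assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
n, p = 100, 40
X = rng.standard_normal((n, p))
X[:, 1] += 0.9 * X[:, 0]          # induce one strongly correlated pair

R = np.corrcoef(X, rowvar=False)  # p x p Pearson correlation matrix
D = 1.0 - np.abs(R)               # dissimilarity: 1 - |R|
np.fill_diagonal(D, 0.0)
D = (D + D.T) / 2.0               # enforce exact symmetry

Z = linkage(squareform(D, checks=False), method="average")
h = 0.6                           # cut height; chosen by CV in the paper
groups = fcluster(Z, t=h, criterion="distance")  # group label per variable
print(groups[0] == groups[1])     # the correlated pair shares a group
```

The robust variants swap in a Spearman rank correlation or an OGK-based covariance estimate for `R` before the same clustering step.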

Stage 2 – Two‑level Dorfman testing:
For each predefined group g, a linear model including all variables in that group is fitted. An F‑test evaluates the global null hypothesis H0,g: βg = 0. Groups with p‑value < α1 are declared “active.” Within each active group, a multivariate regression is refitted and individual t‑statistics are used to test H0,j: βj = 0 for each variable j. Variables with p‑value < α2 are retained. The thresholds (h, α1, α2) are jointly tuned via cross‑validation to control type‑I error while preserving power. In the robust variant, ordinary least squares is replaced by Huber M‑estimation, which down‑weights outlying observations.

Stage 3 – Final penalized regression:
The union of variables surviving Stage 2 (denoted S1) is fed into an Elastic Net (EN) model, which in the standard parameterization solves

$$\hat{\beta} = \arg\min_{\beta} \; \frac{1}{2n}\,\lVert y - X_{S_1}\beta \rVert_2^2 \;+\; \lambda \left( \alpha \lVert \beta \rVert_1 + \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2 \right),$$

where λ controls the overall shrinkage and α trades off the lasso (ℓ1) and ridge (ℓ2) penalties; the adaptive variant attaches data-driven weights to the ℓ1 term.
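A minimal sketch of this refitting step, using scikit-learn's `ElasticNetCV` as a stand-in for the paper's EN solver (the simulated data and the screened set `S1` are illustrative assumptions):

```python
# Sketch of Stage 3: elastic-net refit restricted to the screened set S1.
# Data, coefficients, and S1 are illustrative assumptions.
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(2)
n, p = 150, 20
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.standard_normal(n)

S1 = [0, 1, 2, 3, 4]                       # variables surviving Stage 2
enet = ElasticNetCV(l1_ratio=0.5, cv=5)    # lambda chosen by cross-validation
enet.fit(X[:, S1], y)

# final selected set: nonzero coefficients within S1
active = [S1[j] for j, b in enumerate(enet.coef_) if abs(b) > 1e-8]
print(active)
```

The adaptive-EN variant would rescale the penalty per coefficient using weights from an initial fit, shrinking noise variables more aggressively than true signals.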

