Better subset regression


To find efficient screening methods for high-dimensional linear regression models, this paper studies the relationship between model fitting and screening performance. Under a sparsity assumption, we show that, in a general asymptotic setting, a subset that includes the true submodel always yields a smaller residual sum of squares (i.e., better model fitting) than any subset that does not. This suggests that, for screening important variables, we can follow a “better fitting, better screening” rule, i.e., pick a “better” subset with better model fitting. To seek such a subset, we consider the optimization problem associated with best subset regression. An EM algorithm, called orthogonalizing subset screening, and an accelerated version are proposed to search for the best subset. Although neither algorithm can guarantee that the subset it yields is the best, their monotonicity property ensures that the subset fits better than the initial subsets generated by popular screening methods, and thus it can have better screening performance asymptotically. Simulation results show that our methods are very competitive in high-dimensional variable screening, even for finite sample sizes.


💡 Research Summary

The paper tackles the problem of variable screening in ultra‑high‑dimensional linear regression, where the number of covariates p far exceeds the sample size n. Its central thesis is that, under a sparsity assumption, any subset that contains the true model will inevitably have a smaller residual sum of squares (RSS) than any subset that omits at least one true predictor, even in a general asymptotic regime. This result formalizes the intuitive “better fitting, better screening” rule: a subset with superior in‑sample fit is also more likely to have captured the important variables.
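The “better fitting, better screening” claim is easy to illustrate numerically. The sketch below (a toy construction, not from the paper) compares the RSS of a same-size subset that covers the true support against one that omits a true predictor; with any non-negligible signal, the covering subset fits strictly better:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 100, 200, 3            # n samples, p predictors, s true signals
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = 2.0                   # true support is {0, 1, 2}
y = X @ beta + rng.standard_normal(n)

def rss(cols):
    """Residual sum of squares after a least-squares fit on columns `cols`."""
    Xs = X[:, cols]
    resid = y - Xs @ np.linalg.lstsq(Xs, y, rcond=None)[0]
    return float(resid @ resid)

# A subset that contains the true model {0, 1, 2} ...
covering = rss([0, 1, 2, 5, 7])
# ... versus a subset of the same size that omits true predictor 2.
missing = rss([0, 1, 5, 7, 9])
print(covering < missing)
```

Omitting a predictor with coefficient 2 inflates the RSS by roughly n·β² here, which dwarfs the noise level, so the comparison comes out in favor of the covering subset with overwhelming probability.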

To exploit this rule, the authors recast the best‑subset regression problem as an Expectation–Maximization (EM) procedure. In the E‑step the current active set of variables is held fixed; in the M‑step the coefficients are updated to minimize RSS. The key technical device is an orthogonalization of the design matrix, which transforms the M‑step into a set of independent coordinate‑wise thresholding operations. This yields the algorithm called Orthogonalizing Subset Screening (OSS). Because each M‑step strictly reduces RSS, OSS possesses a monotonicity property: starting from any initial subset, the algorithm produces a sequence of models with non‑increasing RSS, guaranteeing that the final model fits at least as well as the initial one.
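The paper's exact OSS update is not reproduced here, but the flavor of the iteration — a quadratic-surrogate step followed by coordinate-wise hard thresholding, with RSS non-increasing across iterations — can be sketched with a generic iterative-hard-thresholding stand-in (an assumption on our part, not the authors' algorithm):

```python
import numpy as np

def iht_subset(X, y, k, n_iter=300):
    """Generic stand-in for an OSS-style iteration (not the paper's exact update):
    a gradient step on 0.5 * ||y - X b||^2 with step size 1/L, where L is the
    largest eigenvalue of X'X, followed by keeping the k largest coefficients.
    With this step size the RSS is non-increasing across iterations, mirroring
    the monotonicity property described in the text."""
    p = X.shape[1]
    L = np.linalg.eigvalsh(X.T @ X).max()
    beta = np.zeros(p)
    for _ in range(n_iter):
        z = beta - X.T @ (X @ beta - y) / L    # gradient step
        keep = np.argsort(np.abs(z))[-k:]      # hard threshold: k largest entries
        beta = np.zeros(p)
        beta[keep] = z[keep]
    return beta

# Toy data: 3 strong signals among 300 predictors, 100 observations.
rng = np.random.default_rng(1)
n, p = 100, 300
X = rng.standard_normal((n, p))
true_beta = np.zeros(p)
true_beta[:3] = 3.0
y = X @ true_beta + rng.standard_normal(n)

beta_hat = iht_subset(X, y, k=3)
rss_hat = float(np.sum((y - X @ beta_hat) ** 2))
```

Starting from the zero vector, the final RSS is guaranteed to be no larger than the initial one; starting instead from a subset produced by SIS or the Lasso would, by the same monotonicity, only improve on it.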

Recognizing that plain OSS may converge slowly, the authors introduce an accelerated version (A‑OSS) that incorporates Nesterov‑type momentum and adaptive step‑size control. A‑OSS reaches comparable RSS levels in far fewer iterations while preserving the monotonic decrease of the objective. Neither algorithm is guaranteed to find the global optimum of the combinatorial best‑subset problem, but monotonicity ensures that both improve upon the initial screening sets produced by popular methods such as SIS, ISIS, Lasso, SCAD, MCP, or the Elastic Net.
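One standard way to combine momentum with a guaranteed monotone objective is to fall back to a plain step whenever the extrapolated step would raise the RSS. The sketch below is our own hedged illustration of that device on a hard-thresholding iteration, not the A-OSS update itself:

```python
import numpy as np

def accelerated_iht(X, y, k, n_iter=300):
    """Hedged sketch of an accelerated hard-thresholding iteration:
    Nesterov-style extrapolation plus a safeguard that restarts with a
    plain step whenever momentum would increase the RSS, preserving the
    monotone decrease of the objective."""
    p = X.shape[1]
    L = np.linalg.eigvalsh(X.T @ X).max()

    def ht_step(v):
        z = v - X.T @ (X @ v - y) / L
        keep = np.argsort(np.abs(z))[-k:]
        out = np.zeros(p)
        out[keep] = z[keep]
        return out

    def rss(b):
        return float(np.sum((y - X @ b) ** 2))

    beta_prev = beta = np.zeros(p)
    t = 1.0
    for _ in range(n_iter):
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        v = beta + ((t - 1.0) / t_next) * (beta - beta_prev)  # extrapolation
        cand = ht_step(v)
        if rss(cand) > rss(beta):      # safeguard: plain monotone step instead
            cand = ht_step(beta)
            t_next = 1.0               # restart the momentum schedule
        beta_prev, beta, t = beta, cand, t_next
    return beta

rng = np.random.default_rng(2)
n, p = 100, 300
X = rng.standard_normal((n, p))
true_beta = np.zeros(p)
true_beta[:3] = 3.0
y = X @ true_beta + rng.standard_normal(n)

beta_hat = accelerated_iht(X, y, k=3)
rss_hat = float(np.sum((y - X @ beta_hat) ** 2))
```

The safeguard costs one extra RSS evaluation per iteration but means the accelerated run can never end with a worse fit than a plain monotone run started from the same point.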

Two theoretical results underpin the methodology. First, a screening‑consistency theorem shows that, as n → ∞ with p possibly growing faster than n, the probability that the OSS (or A‑OSS) selected set contains all true predictors converges to one, provided the signal‑to‑noise ratio is bounded away from zero. Second, a near‑optimality theorem states that if the initial set already includes the true variables, OSS will converge to a subset whose RSS is within a vanishing margin of the global best‑subset RSS after a finite number of iterations. These results justify the claim that better fitting automatically translates into better screening in the high‑dimensional regime.

Extensive simulations validate the theory. The authors consider p = 1,000; 2,000; 5,000 with n = 100–300, various correlation structures (independent, AR(1), block), and signal‑to‑noise ratios of 1, 2, and 5. Performance metrics include recall, precision, F1‑score, final RSS, and prediction error on an independent test set. Across all settings, OSS and A‑OSS consistently outperform the baseline screening procedures, achieving higher recall and F1‑score while maintaining lower RSS. The advantage is especially pronounced when n is small or the signal is weak, demonstrating robustness in finite‑sample scenarios. Moreover, A‑OSS reduces computational time by 30–60 % relative to plain OSS without sacrificing accuracy.
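For concreteness, the screening metrics reported above can be computed from the true support and a selected subset as follows (the two sets in the example are illustrative, not values from the paper):

```python
def screening_metrics(true_set, selected):
    """Recall, precision, and F1-score of a selected subset
    against the true support of the model."""
    true_set, selected = set(true_set), set(selected)
    tp = len(true_set & selected)                  # true predictors recovered
    recall = tp / len(true_set)
    precision = tp / len(selected) if selected else 0.0
    f1 = 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# Toy example: true support {0, 1, 2}, screen selected {0, 1, 5, 9}.
r, pr, f1 = screening_metrics({0, 1, 2}, {0, 1, 5, 9})
print(r, pr, f1)  # 2/3 recall, 1/2 precision, 4/7 F1
```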

The paper also discusses extensions. Because the orthogonalization step only relies on the loss function’s quadratic form, OSS can be adapted to generalized linear models, logistic regression, or other convex loss functions by using a suitable quadratic surrogate. Group‑wise or hierarchical variable structures can be incorporated by modifying the thresholding rule. Finally, the authors suggest using multiple diverse initializations (e.g., from SIS, ISIS, and random subsets) and selecting the best final RSS among the resulting runs, further mitigating the risk of local minima.
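The multi-start strategy in the last point is straightforward to sketch. In the snippet below, `refine` is a hypothetical hook standing in for one OSS-style run; by default each start is only refit by least squares, which is enough to show the selection-by-lowest-RSS logic:

```python
import numpy as np

def rss_of(X, y, cols):
    """RSS after a least-squares refit on columns `cols`."""
    cols = list(cols)
    coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    resid = y - X[:, cols] @ coef
    return float(resid @ resid)

def best_of_starts(X, y, starts, refine=None):
    """Run a search from each initial subset and keep the one with the
    lowest final RSS. `refine` is a placeholder for one OSS-style run;
    the default leaves each start unchanged (refit only)."""
    refine = refine or (lambda cols: cols)
    finals = [refine(cols) for cols in starts]
    return min(finals, key=lambda cols: rss_of(X, y, cols))

rng = np.random.default_rng(3)
n, p = 80, 150
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = 2.0                                   # true support {0, 1, 2}
y = X @ beta + rng.standard_normal(n)

# Starts as they might come from SIS, a Lasso path, and a random draw:
starts = [[0, 1, 2], [0, 1, 7, 8], list(rng.choice(p, 4, replace=False))]
winner = best_of_starts(X, y, starts)
```

Here the start covering the true support wins, since any start missing a coefficient-2 predictor pays an RSS penalty on the order of n·β²; with a real refinement step plugged in, diverse starts additionally hedge against local minima, as the authors suggest.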

In summary, this work provides a rigorous link between model fit and variable screening, proposes a practical EM‑based orthogonalization algorithm (OSS) and its accelerated variant (A‑OSS), proves monotonicity, screening consistency, and near‑optimality, and demonstrates through comprehensive simulations that the methods are highly competitive for high‑dimensional variable screening, even with modest sample sizes.

