Safe Feature Elimination for the LASSO and Sparse Supervised Learning Problems
We describe a fast method to eliminate features (variables) in ℓ₁-penalized least-squares regression (or LASSO) problems. The elimination of features leads to a potentially substantial reduction in running time, especially for large values of the penalty parameter. Our method is not heuristic: it only eliminates features that are guaranteed to be absent after solving the LASSO problem. The feature elimination step is easy to parallelize and can test each feature for elimination independently. Moreover, the computational effort of our method is negligible compared to that of solving the LASSO problem; roughly, it is the same as a single gradient step. Our method extends the scope of existing LASSO algorithms to treat larger data sets, previously out of their reach. We show how our method can be extended to general ℓ₁-penalized convex problems and present preliminary results for the Sparse Support Vector Machine and Logistic Regression problems.
💡 Research Summary
The paper introduces a provably safe feature‑elimination technique for ℓ₁‑regularized learning problems, most notably the LASSO, and shows how it can be applied to other sparse convex models such as the Sparse Support Vector Machine and ℓ₁‑penalized Logistic Regression. The method is built on the dual formulation of the LASSO and the Karush‑Kuhn‑Tucker (KKT) optimality conditions. By constructing a “safe region” that is guaranteed to contain the optimal dual variable, the authors derive an upper bound on the absolute inner product between each feature column and any feasible dual point. If this bound is strictly smaller than the regularization parameter λ, the corresponding primal coefficient must be zero in the optimal solution, and the feature can be discarded before solving the main problem.
The safe region can be chosen as a simple Euclidean ball or an ℓ₁‑cone centered at a readily available dual estimate (for example, the dual of a previously solved λ or a single gradient step). Computing the bound for all p features costs only O(np) operations, essentially the same as one gradient evaluation, and each feature can be tested independently, making the procedure embarrassingly parallel. Because the bound is derived from rigorous convex‑analysis arguments, no heuristic approximations are involved; eliminated features are guaranteed never to appear in the final model, preserving exact optimality.
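The mechanism described above can be sketched in a few lines. The function below is a hypothetical illustration in the spirit of a ball-based safe test for the LASSO min½‖Xw − y‖² + λ‖w‖₁, using y itself as the dual estimate; the exact region and constants in the paper may differ:

```python
import numpy as np

def safe_screen_lasso(X, y, lam):
    """Sketch of a ball-based safe screening test for the LASSO.
    Returns a boolean mask: True = feature may be active,
    False = feature is provably zero at the optimum under this bound."""
    # Correlation of each feature with the response: one O(np) pass,
    # essentially the cost of a single gradient evaluation.
    corr = X.T @ y
    lam_max = np.abs(corr).max()          # smallest lam for which w* = 0
    col_norms = np.linalg.norm(X, axis=0)
    # Upper bound on |x_j^T theta| over a dual ball shrinking as lam -> lam_max.
    bound = np.abs(corr) + col_norms * np.linalg.norm(y) * (lam_max - lam) / lam_max
    # Features whose bound falls strictly below lam can be discarded.
    return bound >= lam
```

Each entry of `bound` is computed independently of the others, which is why the test parallelizes trivially across features.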
The authors extend the same reasoning to any convex loss with a Lipschitz‑continuous gradient. For the Sparse SVM (hinge loss + ℓ₁) and Logistic Regression (log‑loss + ℓ₁), they formulate the appropriate dual problems, identify the Lipschitz constants, and construct analogous safe regions. The resulting screening rules are identical in spirit to the LASSO case, differing only in the constants that define the region.
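Because only the region's constants change across losses, the rule always has the same shape: compare a per-feature upper bound on |x_jᵀθ| against λ. A loss-agnostic template, where the `dual_bound` callable is a hypothetical stand-in for whatever bound the chosen loss yields:

```python
import numpy as np

def generic_safe_screen(X, dual_bound, lam):
    """Loss-agnostic screening template. `dual_bound(x_j)` is assumed to
    return an upper bound on |x_j @ theta| over a region guaranteed to
    contain the optimal dual variable; only this bound changes per loss."""
    keep = np.array([dual_bound(X[:, j]) >= lam for j in range(X.shape[1])])
    return keep  # False entries are provably zero in the primal optimum
```

For a Euclidean ball of radius `r` around a dual estimate `theta0`, for instance, `dual_bound` would be `lambda x: abs(x @ theta0) + r * np.linalg.norm(x)`.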
Empirical evaluation on synthetic data and large‑scale real‑world datasets (text classification and image feature selection) demonstrates that the pre‑screening step removes 70–90% of the variables when λ is large, leading to a 2‑ to 5‑fold reduction in the total runtime of state‑of‑the‑art solvers such as coordinate descent and LARS. In memory‑constrained settings, the dimensionality reduction enables problems that previously exceeded RAM limits to be solved entirely in memory. The authors also discuss a path‑screening strategy: as λ is decreased along a regularization path, the solution at the previous (larger) λ supplies a dual estimate that tightens the safe region for the next value, so screening remains cheap and effective at every grid point.
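The path strategy can be outlined as a loop over a decreasing λ grid, re-applying a safe test at each point. The `safe_test` callable below is a hypothetical placeholder for any safe rule returning a keep-mask; a real implementation would also feed the previous solution's dual point back into the test:

```python
import numpy as np

def screen_path(X, y, lambdas, safe_test):
    """Sketch of screening along a regularization path.
    `safe_test(X, y, lam)` is an assumed safe rule returning a boolean
    keep-mask; the solver at each lam then only sees surviving columns."""
    results = []
    for lam in sorted(lambdas, reverse=True):   # largest lam first
        keep = safe_test(X, y, lam)             # one cheap O(np) pass per lam
        results.append((lam, np.flatnonzero(keep)))
    return results
```

Note that the surviving set generally grows as λ shrinks, so the screening pass must be repeated at each grid point rather than carried over unchanged.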
Overall, the contribution is a theoretically sound, computationally cheap, and easily parallelizable preprocessing step that dramatically expands the practical applicability of ℓ₁‑regularized methods to massive data sets. The paper suggests future extensions to non‑convex ℓ₀ approximations, group‑lasso structures, and sparsity‑inducing regularizers in deep learning, indicating a broad potential impact across modern machine‑learning pipelines.