Structured Variable Selection with Sparsity-Inducing Norms
We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual $\ell_1$-norm and the group $\ell_1$-norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. We also present an efficient active set algorithm, and analyze the consistency of variable selection for least-squares linear regression in low and high-dimensional settings.
💡 Research Summary
The paper addresses the problem of linear supervised learning with a focus on structured sparsity, proposing a novel regularization framework that extends the classic ℓ₁ norm and the group ℓ₁ norm to allow overlapping groups of variables. The authors define a structured sparsity‑inducing norm Ω(w)=∑_{g∈𝔊}‖w_g‖₂, where 𝔊 is a collection of possibly overlapping subsets (groups) of the coefficient vector w, and w_g denotes the sub‑vector restricted to group g. By permitting overlaps, the norm can encode complex prior knowledge about admissible non‑zero patterns, such as hierarchical, tree‑structured, or grid‑structured relationships among features.
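The norm is straightforward to compute directly from its definition. The sketch below evaluates Ω(w) for a small hierarchical (nested, hence overlapping) group structure; the particular vector and groups are illustrative choices, not taken from the paper.

```python
import numpy as np

def omega(w, groups):
    """Structured sparsity-inducing norm: the sum of Euclidean norms of w
    restricted to each group, where groups may overlap."""
    return sum(np.linalg.norm(w[list(g)]) for g in groups)

# Illustrative nested groups on 4 variables (a hierarchical structure).
w = np.array([1.0, -2.0, 0.0, 0.5])
groups = [[0, 1, 2, 3], [1, 2, 3], [2, 3]]
print(omega(w, groups))  # sqrt(5.25) + sqrt(4.25) + 0.5 ≈ 4.8528
```

Note that a coefficient belonging to several groups (here, index 3 appears in all three) is penalized through every group containing it, which is what biases the solution toward the structured zero patterns discussed below.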
A major contribution is the systematic analysis of the mapping between the group collection 𝔊 and the set of feasible support patterns ℘. The authors present two complementary algorithms: a forward procedure that, given 𝔊, enumerates all support patterns that can arise as optimal solutions, and a backward procedure that, given a desired support pattern, constructs a minimal collection of groups that will enforce it. Both procedures exploit the partial‑order structure induced by group inclusion and run in polynomial time, making it practical to design norms that precisely reflect domain‑specific sparsity structures.
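The forward direction can be illustrated by brute force: the candidate zero patterns are the unions of groups (including the empty union), and the admissible nonzero patterns are their complements. This exhaustive sketch is exponential in the number of groups and is only meant to make the characterization concrete; the paper's algorithms exploit the partial order on groups to avoid such enumeration.

```python
from itertools import combinations

def nonzero_patterns(groups, p):
    """Brute-force sketch of the forward procedure: enumerate all unions of
    groups (the candidate zero patterns) and return the complements, i.e.
    the admissible nonzero support patterns over {0, ..., p-1}."""
    zeros = {frozenset()}  # the empty union allows a fully dense solution
    for r in range(1, len(groups) + 1):
        for combo in combinations(groups, r):
            zeros.add(frozenset().union(*combo))
    full = frozenset(range(p))
    return sorted({full - z for z in zeros}, key=sorted)

# Two overlapping groups on 3 variables.
print(nonzero_patterns([{0, 1}, {1, 2}], 3))
```

For the two groups {0,1} and {1,2}, the admissible nonzero patterns are ∅, {0}, {2}, and {0,1,2}; notably {0,2} is not admissible, showing how overlap restricts the supports beyond what the plain ℓ₁ norm allows.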
Because Ω is non‑separable and non‑smooth, standard coordinate descent and simple sub‑gradient methods are inefficient. The paper introduces an active‑set algorithm tailored to this norm. Starting from a small active set of variables, the method iteratively checks the Karush‑Kuhn‑Tucker (KKT) optimality conditions; when a violation is detected, the most offending group is added to the active set. Within the current active set, descent directions are computed using conditional gradients and Lagrange duality. The algorithm converges to the global optimum while keeping memory and computational costs proportional to the size of the active set rather than the total number of variables.
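The overall shape of such a loop can be sketched as follows. This is a loose, simplified approximation under several assumptions of mine: the optimality check below scores each inactive group by the norm of its correlation with the residual (a group-lasso-style heuristic, not the paper's exact KKT test), and the restricted problem is solved by plain subgradient descent rather than the conditional-gradient machinery the paper uses.

```python
import numpy as np

def active_set_sketch(X, y, groups, lam, n_outer=20, n_inner=500, lr=1e-2):
    """Simplified active-set loop for least squares penalized by an
    overlapping-group norm.  Illustrative only: check inactive groups,
    add the worst violator, re-solve on the active variables."""
    n, p = X.shape
    w = np.zeros(p)
    active = set()
    for _ in range(n_outer):
        # Optimality check on groups not yet fully active (heuristic
        # stand-in for the paper's exact KKT conditions).
        r = y - X @ w
        scores = {i: np.linalg.norm(X[:, sorted(g)].T @ r) / n
                  for i, g in enumerate(groups) if not set(g) <= active}
        if not scores or max(scores.values()) <= lam:
            break  # no violated group: stop with the current iterate
        active |= set(groups[max(scores, key=scores.get)])
        idx = sorted(active)
        # Restricted solve by subgradient descent on the active variables.
        for _ in range(n_inner):
            grad = X[:, idx].T @ (X[:, idx] @ w[idx] - y) / n
            sub = np.zeros(len(idx))
            for g in groups:
                gi = [idx.index(j) for j in g if j in active]
                ng = np.linalg.norm(w[idx][gi]) if gi else 0.0
                if ng > 0:  # at 0 the zero vector is a valid subgradient
                    sub[gi] += w[idx][gi] / ng
            w[idx] -= lr * (grad + lam * sub)
    return w
```

A large λ passes the check immediately and returns the all-zero solution; a small λ grows the active set group by group, which is the source of the memory savings described above.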
Theoretical analysis covers both low‑dimensional (p < n) and high‑dimensional (p ≫ n) regimes. In the low‑dimensional case, under mild conditions on the design matrix and with a sufficiently small regularization parameter λ, the estimator recovers the exact support of the true underlying model with probability tending to one as the sample size grows. In the high‑dimensional setting, the authors assume a restricted isometry‑type condition and bounded correlations among features. They prove that if λ scales as √(log p / n), the estimator achieves sparsistency (consistent variable selection) and prediction‑error bounds matching the best known results for the non‑overlapping group lasso, while offering finer control over the support thanks to the overlapping groups.
Empirical evaluations are conducted on synthetic data designed with hierarchical and grid‑like group structures, as well as on real‑world datasets from genomics (gene expression) and computer vision (pixel‑based regression). Across all experiments, the proposed structured norm outperforms plain ℓ₁, standard group lasso, and several composite penalties in terms of support recovery accuracy, prediction mean‑squared error, and interpretability of the selected features. The active‑set solver scales to problems with tens of thousands of variables, demonstrating practical viability.
In summary, the paper makes three key advances: (1) a principled definition of overlapping‑group sparsity norms together with algorithms to translate between groups and admissible support patterns; (2) an efficient active‑set optimization scheme that exploits the structure of the norm; and (3) rigorous consistency results for both low‑ and high‑dimensional linear regression. These contributions provide a powerful toolkit for incorporating rich prior knowledge into sparse linear models and open avenues for extending structured sparsity concepts to non‑linear models and deep learning architectures.