Sparse group lasso and high dimensional multinomial classification


The sparse group lasso optimization problem is solved using a coordinate gradient descent algorithm. The algorithm is applicable to a broad class of convex loss functions. Convergence of the algorithm is established, and the algorithm is then used to investigate the performance of the multinomial sparse group lasso classifier. On three different real data examples, the multinomial sparse group lasso clearly outperforms the multinomial lasso, both in terms of achieved classification error rate and in terms of including fewer features in the classifier. The run-time of our sparse group lasso implementation is of the same order of magnitude as that of the multinomial lasso algorithm implemented in the R package glmnet, and our implementation scales well with the problem size. One of the high-dimensional examples considered is a 50-class classification problem with 10k features, which amounts to estimating 500k parameters. The implementation is available as the R package msgl.


💡 Research Summary

The paper addresses the challenge of simultaneous variable selection and accurate classification in high‑dimensional multinomial problems by employing the sparse group lasso (SGL) penalty. SGL combines an ℓ₁ term, which induces sparsity at the level of individual features, with a group‑wise ℓ₂ term, which encourages entire pre‑defined groups of features to enter or leave the model together. This dual‑level regularization is particularly attractive when predictors naturally form groups (e.g., genes belonging to pathways, words belonging to topics) but one also wishes to discard irrelevant variables within the selected groups.
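The two penalty terms can be made concrete with a minimal, language-agnostic sketch. This is not the paper's implementation; the group weights √|g| are a common size-based choice, and `lam1`/`lam2` stand in for whatever regularization weights are used:

```python
import math

def sgl_penalty(beta, groups, lam1, lam2):
    """Sparse group lasso penalty:
    lam1 * ||beta||_1  +  lam2 * sum_g sqrt(|g|) * ||beta_g||_2.

    beta   : flat list of coefficients
    groups : list of index lists, a partition of the coefficients
    lam1, lam2 : weights for the l1 and group-wise l2 terms
    """
    # l1 part: sparsity at the level of individual coefficients
    l1 = lam1 * sum(abs(b) for b in beta)
    # group l2 part: each group penalized by its Euclidean norm,
    # scaled by sqrt(group size) so larger groups are not favored
    l2 = lam2 * sum(
        math.sqrt(len(g)) * math.sqrt(sum(beta[j] ** 2 for j in g))
        for g in groups
    )
    return l1 + l2
```

Because the ℓ₂ norm of a group is not differentiable at zero, the group term can zero out a whole group at once, while the ℓ₁ term zeros out individual coefficients inside surviving groups.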

To solve the resulting convex optimization problem, the authors develop a coordinate gradient descent (CGD) algorithm. At each iteration a single coordinate (or a single group of coordinates) is selected and the corresponding sub‑problem is solved while the remaining parameters stay fixed; the sub‑problem admits a closed‑form solution because the SGL penalty is separable across coordinates and groups. A back‑tracking line search satisfying an Armijo condition determines the step size, guaranteeing sufficient decrease of the objective. The authors prove global convergence under standard assumptions: the loss function must have a Lipschitz‑continuous gradient, the penalty must be closed and convex, and the overall objective must be bounded below. Notably, the proof does not require strong convexity, which makes the method applicable to a wide range of loss functions, including the multinomial logistic loss.
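The closed form for a single group can be sketched as a two-stage shrinkage: elementwise soft-thresholding for the ℓ₁ part, then a group-level shrinkage for the ℓ₂ part. This is a generic illustration of the SGL proximal step under these assumptions, not the msgl source code; `v` plays the role of a gradient-adjusted group of coefficients and `step` a line-search step size:

```python
import math

def soft_threshold(z, t):
    # Elementwise l1 shrinkage: sign(z) * max(|z| - t, 0)
    return math.copysign(max(abs(z) - t, 0.0), z)

def sgl_group_prox(v, lam1, lam2, step):
    """Closed-form SGL update for one group of coefficients.

    Stage 1: soft-threshold each coefficient by lam1 * step.
    Stage 2: shrink the surviving group toward zero by its l2 norm;
    if the thresholded norm is at most lam2 * step, the whole group
    is set to zero, i.e. excluded from the model.
    """
    s = [soft_threshold(x, lam1 * step) for x in v]
    norm = math.sqrt(sum(x * x for x in s))
    if norm <= lam2 * step:
        return [0.0] * len(v)            # entire group dropped
    scale = 1.0 - lam2 * step / norm     # group-level shrinkage factor
    return [scale * x for x in s]
```

The separability of the penalty is what makes this per-group formula exact: updating one group never changes the penalty contribution of any other group.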

The algorithm is implemented in the R package msgl (multinomial sparse group lasso). The implementation mirrors the interface of the widely used glmnet package, allowing users to switch between multinomial lasso and multinomial SGL with minimal code changes. Computational experiments are conducted on three real‑world data sets:

  1. A breast‑cancer microarray data set (≈2 000 features, 3 classes).
  2. A text classification data set derived from news articles (≈5 000 features, 5 classes).
  3. A large‑scale image‑based problem with 50 classes and 10 000 features, requiring estimation of 500 000 coefficients.

For each data set, a grid search over the two regularization parameters (λ₁ for the ℓ₁ part, λ₂ for the group ℓ₂ part) is performed via cross‑validation. The results show that multinomial SGL consistently outperforms multinomial lasso (the glmnet implementation) in terms of classification error: error reductions range from 2 to 5 percentage points. Moreover, SGL selects substantially fewer features—typically a 30‑40 % reduction—while preserving or improving predictive performance. In the 50‑class experiment, SGL retains only about 6 200 of the 10 000 raw features yet achieves a lower error rate than lasso, which retains nearly all features.
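The tuning procedure above amounts to a cross-validated grid search over the two regularization weights. A minimal skeleton, where `fit`, `error`, and the fold splits are placeholders for whatever fitting and scoring routines are in use (not the msgl API):

```python
from itertools import product

def cv_grid_search(fit, error, folds, lam1_grid, lam2_grid):
    """Pick (lam1, lam2) minimizing average cross-validation error.

    fit(train, lam1, lam2) -> fitted model
    error(model, test)     -> misclassification rate on held-out data
    folds                  -> list of (train, test) splits
    Returns (best_error, best_lam1, best_lam2).
    """
    best = None
    for lam1, lam2 in product(lam1_grid, lam2_grid):
        cv_err = sum(error(fit(tr, lam1, lam2), te)
                     for tr, te in folds) / len(folds)
        if best is None or cv_err < best[0]:
            best = (cv_err, lam1, lam2)
    return best
```

In practice, packages such as glmnet fit a whole path of solutions per grid point with warm starts, which is far cheaper than refitting each (λ₁, λ₂) pair from scratch.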

Runtime analysis reveals that the CGD‑based SGL solver scales linearly with both the number of features and the number of classes. On the largest data set, the total wall‑clock time is on the order of a few minutes, comparable to glmnet’s multinomial lasso. Memory consumption also grows linearly, and the algorithm remains stable when the problem size is increased by factors of two.

The authors conclude that the proposed CGD algorithm provides a theoretically sound and practically efficient tool for high‑dimensional multinomial classification with structured sparsity. They suggest future extensions such as handling overlapping or hierarchical groups, incorporating non‑convex penalties, and applying the framework to other convex loss functions (e.g., hinge loss for multiclass SVMs). The availability of the msgl package lowers the barrier for researchers and practitioners to adopt sparse group regularization in multinomial settings.

