Multi-Task Feature Learning Via Efficient l2,1-Norm Minimization
The problem of joint feature selection across a group of related tasks has applications in many areas including biomedical informatics and computer vision. We consider the l2,1-norm regularized regression model for joint feature selection from multiple tasks, which can be derived in the probabilistic framework by assuming a suitable prior from the exponential family. One appealing feature of the l2,1-norm regularization is that it encourages multiple predictors to share similar sparsity patterns. However, the resulting optimization problem is challenging to solve due to the non-smoothness of the l2,1-norm regularization. In this paper, we propose to accelerate the computation by reformulating it as two equivalent smooth convex optimization problems which are then solved via the Nesterov’s method-an optimal first-order black-box method for smooth convex optimization. A key building block in solving the reformulations is the Euclidean projection. We show that the Euclidean projection for the first reformulation can be analytically computed, while the Euclidean projection for the second one can be computed in linear time. Empirical evaluations on several data sets verify the efficiency of the proposed algorithms.
💡 Research Summary
The paper tackles the problem of joint feature selection across a set of related tasks, a scenario that frequently arises in domains such as biomedical informatics, computer vision, and natural language processing. The authors adopt the ℓ₂,₁‑norm regularized regression model, where the regularizer ‖W‖₂,₁ = ∑ⱼ‖Wⱼ‖₂ aggregates the ℓ₂‑norm of each feature’s weight vector across all tasks and then applies an ℓ₁‑penalty. This formulation encourages a common sparsity pattern: a feature is either selected for all tasks or discarded for all, which can be interpreted probabilistically as imposing a sparsity‑inducing prior from the exponential family (a Laplace‑type prior on groups of coefficients).
A major obstacle is that the ℓ₂,₁‑norm is non‑smooth, making standard gradient‑based optimization inefficient. Subgradient methods converge slowly, and second‑order methods are impractical for high‑dimensional data. To overcome this, the authors propose two equivalent smooth convex reformulations of the original problem.
First reformulation: The non‑smooth regularizer is replaced by a smooth auxiliary variable U together with a simple Euclidean ball constraint ‖U‖₂ ≤ 1. The resulting objective is differentiable, and its gradient is Lipschitz continuous.
Second reformulation: The original problem is expressed as a constrained optimization with a scalar t such that ‖W‖₂,₁ ≤ t. By moving the ℓ₂,₁‑norm into the constraint, the objective becomes smooth while the constraint set remains convex.
Both reformulations lend themselves naturally to Nesterov’s accelerated first‑order method, an optimal black‑box algorithm for smooth convex optimization that achieves an O(1/k²) convergence rate, where k is the iteration count. The method requires only gradient evaluations and a projection onto the feasible set at each step.
The projection step is the algorithmic core. For the first reformulation, the Euclidean projection onto the ℓ₂‑ball can be computed analytically: each column of U is scaled down to unit norm if its norm exceeds one, otherwise left unchanged. For the second reformulation, the projection onto the set {W | ‖W‖₂,₁ ≤ t} is more involved but can be performed in linear time. The authors derive a group‑wise soft‑thresholding rule that processes each feature column once, leading to O(n) complexity where n is the number of features, and only O(n) additional memory.
Empirical evaluation spans several benchmark datasets: a biomedical imaging set (MRI features), a text classification corpus (20‑Newsgroups), and a face recognition collection (Labeled Faces in the Wild). The proposed algorithms are compared against state‑of‑the‑art alternatives, including ADMM‑based solvers, standard subgradient descent, and other accelerated first‑order schemes. Results consistently show that the new methods reach the same objective value or prediction accuracy 5–10× faster. Moreover, the scalability tests demonstrate that even when the feature dimension reaches hundreds of thousands, the linear‑time projection keeps runtime and memory consumption modest.
Beyond performance, the paper provides a theoretical convergence analysis. By invoking the standard Nesterov proof, the authors confirm that, provided the projection is exact, the sequence of iterates converges to the global optimum at the optimal O(1/k²) rate. They also discuss how the ℓ₂,₁‑norm induces shared sparsity, and visualizations of the learned weight matrices illustrate that the selected features are indeed common across tasks, enhancing model interpretability.
In summary, this work offers a principled and highly efficient solution to the challenging ℓ₂,₁‑regularized multi‑task feature learning problem. By reformulating the non‑smooth objective into smooth convex problems and leveraging Nesterov’s accelerated gradient together with analytically tractable or linear‑time Euclidean projections, the authors achieve both theoretical optimality and practical speed‑ups on large‑scale data. The approach is broadly applicable to any setting where joint sparsity across multiple predictive models is desired, making it a valuable contribution to the toolbox of machine‑learning practitioners and researchers alike.
Comments & Academic Discussion
Loading comments...
Leave a Comment