Multi-Stage Multi-Task Feature Learning via Adaptive Threshold
Multi-task feature learning aims to identify the features shared among tasks in order to improve generalization. It has been shown that minimizing non-convex learning models can yield better solutions than their convex alternatives. Accordingly, a non-convex model based on the capped-$\ell_{1},\ell_{1}$ regularization was proposed in \cite{Gong2013}, together with an efficient multi-stage multi-task feature learning algorithm (MSMTFL). However, this algorithm uses a prescribed fixed threshold in the definition of the capped-$\ell_{1},\ell_{1}$ regularization, and this lack of adaptivity may result in suboptimal performance. In this paper we propose to employ an adaptive threshold in the capped-$\ell_{1},\ell_{1}$ regularized formulation, where the corresponding variant of MSMTFL incorporates an additional component to adaptively determine the threshold value. This variant is expected to achieve better feature selection performance than the original MSMTFL algorithm. In particular, the embedded adaptive threshold component comes from our previously proposed iterative support detection (ISD) method \cite{Wang2010}. Empirical studies on both synthetic and real-world data sets demonstrate the effectiveness of the new variant over the original MSMTFL.
💡 Research Summary
The paper addresses a key limitation of the existing Multi‑Stage Multi‑Task Feature Learning (MSMTFL) algorithm, namely its reliance on a fixed threshold θ in the capped‑ℓ₁,ℓ₁ regularizer. While the capped‑ℓ₁,ℓ₁ penalty is a non‑convex surrogate that better approximates the ideal ℓ₀,ℓ₀ sparsity than convex alternatives, the performance of MSMTFL heavily depends on the choice of θ, which is data‑dependent and difficult to set a priori. To overcome this, the authors propose an adaptive‑threshold variant, MSMTFL‑AT, that integrates an iterative support detection (ISD) mechanism originally designed for single‑task sparse recovery. The core idea is to estimate θ at each stage by applying the “first significant jump” heuristic: after sorting the ℓ₁‑norms of the rows of the current weight matrix, the algorithm identifies the largest gap between consecutive values and uses the corresponding magnitude as the new threshold. This procedure automatically distinguishes near‑zero rows (irrelevant features) from truly active rows.
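The jump rule described above can be sketched in a few lines. This is a minimal illustration of the summary's "largest gap" reading of the heuristic; the function name and the exact tie-breaking are assumptions, not the paper's ISD implementation:

```python
import numpy as np

def jump_threshold(W):
    """Pick an adaptive threshold from the row l1-norms of a weight matrix.

    W : (d, m) array with one row per feature and one column per task.
    Returns theta chosen just below the largest gap in the sorted norms,
    so rows with norm <= theta are treated as irrelevant features.
    """
    row_norms = np.sort(np.abs(W).sum(axis=1))  # ascending row l1-norms
    gaps = np.diff(row_norms)                   # consecutive differences
    j = int(np.argmax(gaps))                    # location of the largest gap
    return row_norms[j]                         # magnitude just below the jump
```

For a matrix whose rows split into a few near-zero rows and a few strong rows, the returned threshold equals the largest of the near-zero row norms, so thresholding at that value separates the two groups.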
Algorithmically, MSMTFL‑AT proceeds as follows. Starting from an initial λ and a relatively large θ, each stage solves a weighted ℓ₁‑regularized problem (identical to the first stage of the original MSMTFL) to obtain a provisional weight matrix Ŵ⁽ℓ⁾. The row norms of Ŵ⁽ℓ⁾ are then fed into the “first significant jump” rule to produce an updated θ⁽ℓ⁾. Based on this new threshold, the per‑row penalty parameters λ_j are re‑assigned (λ for rows whose norm falls below θ⁽ℓ⁾, zero for rows that exceed it and are therefore treated as detected support), and the process repeats. Consequently, θ gradually shrinks as the algorithm progresses, allowing a broad initial search space that becomes increasingly focused on the true support set.
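The staged procedure above can be sketched as an outer loop; the inner weighted ℓ₁ solve is left as a pluggable callback (`solve_weighted_l1` is a hypothetical interface standing in for the weighted ℓ₁-regularized least-squares stage of MSMTFL, which is not implemented here):

```python
import numpy as np

def msmtfl_at_outer(solve_weighted_l1, d, lam, theta0, n_stages=5):
    """Outer loop of the adaptive-threshold variant (sketch only).

    solve_weighted_l1(penalties) -> (d, m) weight matrix; a stand-in for
    the weighted l1-regularized inner problem of MSMTFL.
    """
    penalties = np.full(d, lam)   # stage 1: uniform penalty, as in MSMTFL
    theta = theta0
    W = None
    for _ in range(n_stages):
        W = solve_weighted_l1(penalties)
        row_norms = np.abs(W).sum(axis=1)
        # Adaptive threshold: largest gap in the sorted row norms
        # ("first significant jump" heuristic borrowed from ISD).
        sorted_norms = np.sort(row_norms)
        j = int(np.argmax(np.diff(sorted_norms)))
        theta = sorted_norms[j]
        # Re-assign per-row penalties: keep penalizing rows that still
        # look irrelevant (norm <= theta); free the detected support rows.
        penalties = np.where(row_norms <= theta, lam, 0.0)
    return W, theta
```

With a toy "solver" that merely subtracts each row's penalty from a fixed signal, the loop settles on a small θ and removes the penalty from the strong rows after the first stage, illustrating the shrinking-threshold behavior described above.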
The authors provide an intuitive theoretical justification. As θ → 0, the (suitably normalized) capped‑ℓ₁,ℓ₁ regularizer converges to the ℓ₀,ℓ₀ count, so a sufficiently small adaptive θ yields an almost exact sparsity model. Moreover, a large initial θ mitigates the risk of getting trapped in poor local minima, while the subsequent reductions progressively steer the iterates toward the true support. Although a rigorous convergence proof for the adaptive scheme is left for future work, the method inherits the reproducibility of the original MSMTFL: each inner optimization is convex, and the overall procedure is a deterministic re‑weighting scheme.
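The limiting argument can be made explicit. With the conventional 1/θ normalization of the capped penalty (a standard scaling, assumed here rather than quoted from the summary), each row contributes an indicator in the limit, and the sum recovers the number of nonzero rows:

```latex
\frac{1}{\theta}\min\bigl(\|w_j\|_1,\,\theta\bigr)
\;\xrightarrow[\;\theta\to 0^{+}\;]{}\;
\mathbb{I}\bigl(\|w_j\|_1 \neq 0\bigr),
\qquad\text{hence}\qquad
\frac{1}{\theta}\sum_{j=1}^{d}\min\bigl(\|w_j\|_1,\,\theta\bigr)
\;\longrightarrow\;
\#\bigl\{\, j : \|w_j\|_1 \neq 0 \,\bigr\}.
```

For any fixed nonzero row, θ eventually drops below its norm and the capped term saturates at θ, giving the value 1 after normalization; zero rows contribute 0 throughout, which is exactly the row-sparsity (ℓ₀,ℓ₀) count.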
Empirical evaluation includes both synthetic and real‑world experiments. In synthetic tests (d = 200 features, m = 10 tasks, varying noise levels), MSMTFL‑AT achieves support recovery rates of 85 % or higher, outperforming the original MSMTFL by 5–10 %. On real datasets—gene expression data for disease prediction and a news‑article classification task—MSMTFL‑AT reduces root‑mean‑square error (or improves classification accuracy) by 3–7 % relative to MSMTFL, while also yielding a more compact model (≈40 % feature reduction). Sensitivity analyses demonstrate that the adaptive threshold is robust to the choice of the initial θ and λ.
In conclusion, the paper presents a practical and effective enhancement to MSMTFL by making the critical threshold data‑driven. The adaptive‑threshold approach retains the advantages of the non‑convex capped‑ℓ₁,ℓ₁ formulation while automatically tailoring the sparsity level to each problem, leading to better feature selection and predictive performance. Future directions include a formal proof of convergence for the adaptive scheme, exploration of alternative adaptive‑threshold strategies (e.g., Bayesian sparsity priors), and scaling the method to ultra‑high‑dimensional multi‑task settings.