Structured Sparsity and Generalization
We present a data dependent generalization bound for a large class of regularized algorithms which implement structured sparsity constraints. The bound can be applied to standard squared-norm regularization, the Lasso, the group Lasso, some versions of the group Lasso with overlapping groups, multiple kernel learning and other regularization schemes. In all these cases competitive results are obtained. A novel feature of our bound is that it can be applied in an infinite dimensional setting such as the Lasso in a separable Hilbert space or multiple kernel learning with a countable number of kernels.
💡 Research Summary
The paper develops a unified, data‑dependent generalization bound for a broad family of regularized linear learning algorithms that enforce structured sparsity. The authors model the regularizer as an infimum convolution over a (finite or countably infinite) set M of symmetric bounded linear operators on a real Hilbert space H. For any parameter vector β, the norm ‖β‖_M is defined as the minimal ℓ₁‑type cost of a decomposition β = Σ_{M∈M} M v_M, where each v_M lies in H. The dual norm admits a remarkably simple expression, ‖z‖_M* = sup_{M∈M} ‖M z‖, which makes the subsequent complexity analysis tractable.
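As a quick numerical illustration (helper names here are hypothetical, not from the paper), the dual norm ‖z‖_M* = sup_{M∈M} ‖M z‖ is straightforward to evaluate for a finite operator set; with coordinate projections it should reduce to the ℓ∞ norm, matching the Lasso instantiation discussed later:

```python
import numpy as np

def dual_norm(z, operators):
    """Dual norm ||z||_M* = sup_{M in M} ||M z|| for a finite operator set M."""
    return max(np.linalg.norm(M @ z) for M in operators)

d = 4
# Coordinate projections P_k: the operator set that recovers the Lasso.
projections = [np.diag(np.eye(d)[k]) for k in range(d)]

z = np.array([1.0, -3.0, 2.0, 0.5])
print(dual_norm(z, projections))        # -> 3.0
print(np.linalg.norm(z, ord=np.inf))    # -> 3.0, i.e. the same l_inf norm
```

The agreement of the two printed values checks the claimed identity ‖z‖_M* = ‖z‖∞ for this operator family.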
The central technical object is the empirical Rademacher complexity
R_M(x) = (2/n) E sup_{‖β‖_M ≤ 1} Σ_{i=1}^n ε_i ⟨β, x_i⟩,
where the ε_i are independent Rademacher variables. The authors prove two complementary upper bounds on R_M(x). The first bound (Theorem 2) depends on the supremum over M of the summed squared operator norms applied to the sample and includes a logarithmic factor ln|M| when M is finite. The second bound replaces the sample‑dependent term with a distribution‑dependent second‑moment quantity R₂ = 𝔼 sup_{M∈M}‖M X‖². When R₂ is finite, the bound becomes essentially dimension‑free, decaying as 1/√n up to a term logarithmic in R₂. This result is novel because it remains valid even when M is infinite, provided the second‑moment condition holds.
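For the Lasso operator set the supremum inside R_M(x) has a closed form — the supremum of ⟨β, z⟩ over the ℓ₁ ball equals the dual norm ‖z‖∞ — so the empirical Rademacher complexity can be estimated by Monte Carlo. A minimal sketch with synthetic data (the function name and sample sizes are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50
x = rng.standard_normal((n, d))  # synthetic sample, rows are the x_i

def empirical_rademacher_l1(x, n_trials=2000, rng=rng):
    """Monte Carlo estimate of R(x) = (2/n) E sup_{||b||_1 <= 1} sum_i e_i <b, x_i>.

    The supremum over the l1 ball equals the dual (l_inf) norm of sum_i e_i x_i.
    """
    n = x.shape[0]
    eps = rng.choice([-1.0, 1.0], size=(n_trials, n))      # Rademacher draws
    sups = np.linalg.norm(eps @ x, ord=np.inf, axis=1)     # sup per draw
    return 2.0 / n * sups.mean()

print(empirical_rademacher_l1(x))
```

Rerunning with larger n shows the estimate shrinking roughly like √(ln d / n), in line with the Lasso bound below.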
The paper then instantiates the general theory for several well‑known regularizers:
- Euclidean (ridge) regularization (M = {I}) recovers the classic (1/n)√(Σᵢ‖x_i‖²) rate up to a constant factor.
- Lasso (M = {P₁,…,P_d}, where each P_k projects onto a coordinate axis) yields ‖β‖_M = ‖β‖₁ and ‖z‖_M* = ‖z‖∞. The bound matches existing results, with a leading term of order √{(2 + ln d)/n}. In infinite‑dimensional ℓ₂ spaces the same bound holds under the mild condition 𝔼‖X‖² < ∞.
- Weighted Lasso introduces coordinate‑specific penalties α_k, leading to ‖β‖_M = Σ α_k^{-1}|β_k| and dual norm sup_k α_k|z_k|. If the weight vector α belongs to ℓ₂, the second‑moment term becomes Σ α_k², again delivering a dimension‑free bound.
- Group Lasso (non‑overlapping groups) uses orthogonal projection operators onto group subspaces, giving ‖β‖_M = Σ_ℓ ‖P_{J_ℓ}β‖₂ and dual norm max_ℓ ‖P_{J_ℓ}z‖₂. The bound depends logarithmically on the number of groups r.
- Overlapping Group Lasso is handled by the same framework despite non‑orthogonal ranges; the authors show that the overlapping norm is bounded by a simpler group norm, leading to comparable guarantees.
- Cone‑generated regularizers (e.g., Ω_Λ(β) = inf_{λ∈Λ}½ Σ (β_j²/λ_j + λ_j) with Λ a convex cone) fit the operator‑based description by constructing diagonal matrices from extreme points of the cone. When the set of extreme points is countable, the bound again reduces to a log‑|E(Λ)| term.
- Multiple Kernel Learning treats H as a direct sum of Hilbert spaces H_j and M as the set of projections onto each component. The resulting bound involves the trace of each kernel matrix K_j and a logarithmic factor in the number of kernels |J|. For countably many kernels, the condition 𝔼 K_j(X,X) < ∞ suffices.
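The group Lasso instantiation above is easy to check numerically. A sketch of the primal norm Σ_ℓ ‖P_{J_ℓ}β‖₂ and its dual max_ℓ ‖P_{J_ℓ}z‖₂ for non‑overlapping index groups (helper names are hypothetical):

```python
import numpy as np

def group_norm(z, groups):
    """Group Lasso norm: sum over groups l of ||P_{J_l} z||_2."""
    return sum(np.linalg.norm(z[list(g)]) for g in groups)

def group_dual_norm(z, groups):
    """Dual of the group Lasso norm: max over groups l of ||P_{J_l} z||_2."""
    return max(np.linalg.norm(z[list(g)]) for g in groups)

z = np.array([3.0, 4.0, 1.0, 0.0, 2.0, 2.0])
groups = [(0, 1), (2, 3), (4, 5)]       # non-overlapping index groups J_l

print(group_dual_norm(z, groups))       # -> 5.0: group (0, 1) has norm sqrt(9 + 16)
print(group_norm(z, groups))            # 5 + 1 + sqrt(8)
```

With singleton groups both functions reduce to the Lasso case (‖·‖₁ and ‖·‖∞), illustrating how the operator framework nests these regularizers.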
Finally, the authors combine the Rademacher bound with a standard Lipschitz‑loss argument (Theorem 1) to obtain high‑probability generalization guarantees: for any β with ‖β‖_M ≤ 1, the expected loss is bounded by the empirical loss plus L·R_M plus an O(√{ln(1/δ)/n}) term. The proof relies on McDiarmid’s bounded‑difference inequality and careful manipulation of the operator‑norm structure.
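Schematically, the resulting high‑probability guarantee has the form below, where L is the Lipschitz constant of the loss and c is an absolute constant left unspecified in this summary:

```latex
\text{With probability at least } 1-\delta,\ \text{for all } \beta \text{ with } \|\beta\|_{\mathcal{M}} \le 1:
\qquad
\mathbb{E}\,\ell\big(\langle \beta, X\rangle, Y\big)
\;\le\;
\frac{1}{n}\sum_{i=1}^{n} \ell\big(\langle \beta, x_i\rangle, y_i\big)
\;+\; L\, R_{\mathcal{M}}(\mathbf{x})
\;+\; c\,\sqrt{\frac{\ln(1/\delta)}{n}}.
```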
Overall, the paper offers a powerful, unifying perspective on structured sparsity regularization, delivering data‑dependent, often dimension‑free generalization bounds that encompass many classical and modern learning settings, including infinite‑dimensional and countably infinite kernel families.