Regularization Strategies and Empirical Bayesian Learning for MKL

Multiple kernel learning (MKL), structured sparsity, and multi-task learning have recently received considerable attention. In this paper, we show how different MKL algorithms can be understood as applications of either regularization on the kernel weights or block-norm-based regularization, which is more common in structured sparsity and multi-task learning. We show that these two regularization strategies can be systematically mapped to each other through a concave conjugate operation. When the kernel-weight-based regularizer is separable into components, we can naturally consider a generative probabilistic model behind MKL. Based on this model, we propose learning algorithms for the kernel weights through the maximization of marginal likelihood. We show through numerical experiments that $\ell_2$-norm MKL and Elastic-net MKL achieve comparable accuracy to uniform kernel combination. Although uniform kernel combination might be preferable for its simplicity, $\ell_2$-norm MKL and Elastic-net MKL can learn the usefulness of the information sources represented as kernels. In particular, Elastic-net MKL achieves sparsity in the kernel weights.


💡 Research Summary

This paper revisits multiple kernel learning (MKL) from a unified regularization and Bayesian perspective, showing that the seemingly disparate families of MKL algorithms—those that directly regularize kernel weights and those that impose block‑norm penalties on the associated function spaces—are in fact two sides of the same optimization problem. The authors first formalize the two approaches. In the weight‑regularization view, a penalty φ(β) is applied directly to the vector of kernel coefficients β (e.g., ℓ₁, ℓ₂, Elastic‑net). In the block‑norm view, each kernel induces a reproducing‑kernel Hilbert space (RKHS) with a norm ‖·‖ₖ on its functions; a composite regularizer ψ (often an ℓ₁/ℓ₂ or ℓ₂/ℓ₂ block norm) is then imposed on the vector of function norms (‖f₁‖₁, …, ‖f_K‖_K). By employing the concave conjugate (Fenchel‑Legendre) transformation, the paper demonstrates that when φ is separable across kernels, its conjugate ψ exactly yields the block‑norm formulation, and vice versa. This mapping clarifies that the dual variables in the block‑norm formulation correspond to the kernel weights in the weight‑regularization formulation, establishing a systematic bridge between structured sparsity methods and classic MKL.
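A minimal numeric sketch of the mapping in its simplest instance (an illustration, not the paper's general derivation): for the separable penalty φ(β) = Σₖ βₖ, minimizing ‖fₖ‖²/βₖ + βₖ over βₖ > 0 gives the optimum βₖ = ‖fₖ‖ and the value 2‖fₖ‖, so eliminating the kernel weights leaves an ℓ₁‑type block norm on the function norms. The grid search below checks this identity numerically.

```python
import numpy as np

def weight_penalized(x, betas):
    # Kernel-weight view for one kernel: quadratic-over-weight term ||f_k||^2 / beta_k
    # plus the separable penalty phi(beta_k) = beta_k.
    # x is the function norm ||f_k||; betas is an array of candidate kernel weights.
    return x**2 / betas + betas

# Minimizing over beta_k recovers 2|x|, i.e. an l1-type block norm on ||f_k||.
betas = np.linspace(1e-4, 10, 200001)
for x in [0.3, 1.0, 2.5]:
    val = weight_penalized(x, betas).min()
    assert abs(val - 2 * abs(x)) < 1e-3   # min over beta equals 2*||f_k||
```

The same mechanics apply to other separable penalties: swapping in a different φ changes which block norm is induced, which is exactly the conjugate correspondence the paper formalizes.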

When the weight‑regularizer φ is separable, the authors construct a generative probabilistic model: each kernel k is associated with a latent function fₖ drawn from a prior p(fₖ|βₖ) whose scale is governed by βₖ. The overall predictor is the sum y = Σₖ fₖ(x) + ε, with ε representing observation noise. Integrating out the latent functions yields a marginal likelihood that depends on β. Maximizing this marginal likelihood (empirical Bayes) leads to an objective identical to the original MKL formulation with φ(β) as the regularizer. Consequently, learning β can be performed by maximizing the marginal likelihood, which the authors accomplish via an EM‑like scheme: (i) given β, compute the MAP estimate of the latent functions; (ii) given the functions, update β by a closed‑form or gradient step on the marginal‑likelihood term. This procedure recovers the standard MKL dual updates, but now enjoys a clear probabilistic interpretation and the possibility of incorporating richer priors.
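Concretely, with Gaussian priors on the fₖ scaled by βₖ and Gaussian noise, integrating out the latent functions gives y ~ N(0, Σₖ βₖKₖ + σ²I), the usual Gaussian‑process marginal likelihood with the β‑weighted combined kernel. The alternating scheme might then be sketched as below; this is a hypothetical Python illustration (the function `mkl_em`, the ℓ₁‑normalized weight update, and all parameter choices are assumptions for the sketch, not the paper's exact update equations).

```python
import numpy as np

def mkl_em(Ks, y, sigma2=0.1, iters=30):
    """EM-like alternating sketch for learning kernel weights (hypothetical).

    Ks: list of (n, n) kernel matrices; y: length-n target vector.
    Alternates (i) a MAP fit of the latent functions given beta with
    (ii) a multiplicative, l1-normalized update of beta.
    """
    n = len(y)
    beta = np.full(len(Ks), 1.0 / len(Ks))      # start from uniform weights
    for _ in range(iters):
        # (i) Given beta: MAP estimate via the combined kernel K_beta = sum_k beta_k K_k.
        K_beta = sum(b * K for b, K in zip(beta, Ks))
        alpha = np.linalg.solve(K_beta + sigma2 * np.eye(n), y)
        # (ii) Given the functions: ||f_k||_{H_k}^2 = beta_k^2 * alpha' K_k alpha,
        # then renormalize the weights onto the simplex (l1-type update).
        norms = np.array([b * np.sqrt(max(alpha @ K @ alpha, 0.0))
                          for b, K in zip(beta, Ks)])
        beta = norms / (norms.sum() + 1e-12)
    return beta, alpha
```

Run on a toy problem where the targets are generated from the first kernel's feature space, a scheme of this shape concentrates weight on the informative kernel while the weights stay on the simplex, mirroring the sparsity behavior discussed in the experiments.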

The experimental section evaluates three configurations on several benchmark datasets (image classification on CIFAR‑10 and Caltech‑101, text classification on 20 Newsgroups, etc.). The configurations are: (1) ℓ₂‑norm MKL (pure β‑ℓ₂ regularization), (2) Elastic‑net MKL (a mixture of ℓ₁ and ℓ₂ on β), and (3) uniform kernel combination (βₖ = 1/K). Results show that both ℓ₂‑norm and Elastic‑net MKL achieve classification accuracies comparable to the uniform baseline, confirming that sophisticated weight learning does not sacrifice predictive performance. More importantly, Elastic‑net MKL yields many βₖ exactly zero, thereby providing a sparse kernel selection that enhances model interpretability and reduces computational load at test time. The empirical Bayes learning of β proves to be computationally efficient, requiring no separate cross‑validation for hyper‑parameters and converging in a number of iterations similar to traditional MKL solvers.

In the discussion, the authors highlight the practical implications of the conjugate mapping. Practitioners can choose the regularization form that best matches domain knowledge: if a few kernels are expected to be highly informative, an ℓ₁‑type penalty (or Elastic‑net) is appropriate; if all kernels are believed to contribute modestly, an ℓ₂‑type penalty may be preferable. The Bayesian formulation further allows the incorporation of prior information about kernel relevance, potentially improving robustness in noisy or high‑dimensional settings.

In conclusion, the paper provides a rigorous theoretical link between weight‑based and block‑norm regularizations in MKL, introduces a generative Bayesian model that justifies empirical‑Bayes learning of kernel weights, and demonstrates empirically that ℓ₂‑norm and Elastic‑net MKL can match the performance of naive uniform combination while offering additional benefits such as sparsity and interpretability. Future work is suggested in extending the Bayesian framework to non‑Gaussian priors, multi‑task learning scenarios, and scalable optimization techniques for very large kernel libraries.