Deep Dive into Non-Sparse Regularization for Multiple Kernel Learning

Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this 1-norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures, we generalize MKL to arbitrary norms. We offer new insights into the connections between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, such as p-norms with p > 1. Empirically, we demonstrate that the interleaved optimization strategies are much faster than the commonly used wrapper approaches. A theoretical analysis and an experiment on controlled artificial data shed light on the appropriateness of sparse, non-sparse, and $\ell_\infty$-norm MKL in various scenarios. Empirical applications of p-norm MKL to three real-world problems from computational biology show that non-sparse MKL achieves accuracies that surpass the state-of-the-art.
Kernels allow one to decouple machine learning algorithms from data representations. Finding an appropriate data representation via a kernel function immediately opens the door to a vast world of powerful machine learning models (e.g., Schölkopf and Smola, 2002) with many efficient and reliable off-the-shelf implementations. This has propelled the dissemination of machine learning techniques to a wide range of diverse application domains.
Finding an appropriate data abstraction, or even engineering the best kernel, for the problem at hand is not always trivial, though. Starting with cross-validation (Stone, 1974), which is probably the most prominent approach to general model selection, a great many approaches to selecting the right kernel(s) have been deployed in the literature.
Kernel target alignment (Cristianini et al., 2002; Cortes et al., 2010b) aims at learning the entries of a kernel matrix by using the outer product of the label vector as the ground truth. Chapelle et al. (2002) and Bousquet and Herrmann (2002) minimize estimates of the generalization error of support vector machines (SVMs) using a gradient descent algorithm over the set of parameters. Ong et al. (2005) study hyperkernels on the space of kernels; alternative approaches include selecting kernels by DC programming (Argyriou et al., 2008) and semi-infinite programming (Özögür-Akyüz and Weber, 2008; Gehler and Nowozin, 2008). Although finding non-linear kernel mixtures (Gönen and Alpaydin, 2008; Varma and Babu, 2009) generally results in non-convex optimization problems, Cortes et al. (2009b) show that convex relaxations may be obtained for special cases.
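To make the alignment criterion concrete, the sketch below computes the (uncentered) kernel target alignment $\langle K, yy^\top\rangle_F / (\|K\|_F \, \|yy^\top\|_F)$ between a kernel matrix and the label outer product. The function name and the toy kernels are illustrative choices, not from the original text; this is a minimal sketch of the criterion, not the full learning procedure of Cristianini et al.

```python
import numpy as np

def kernel_target_alignment(K, y):
    """Alignment <K, yy^T>_F / (||K||_F * ||yy^T||_F), in [-1, 1] for PSD K."""
    Y = np.outer(y, y)  # ideal kernel induced by the labels
    return np.sum(K * Y) / (np.linalg.norm(K) * np.linalg.norm(Y))

y = np.array([1.0, -1.0, 1.0, -1.0])
K_ideal = np.outer(y, y)   # perfectly label-aligned kernel -> alignment 1.0
K_blind = np.eye(4)        # label-agnostic (identity) kernel -> lower alignment

a_ideal = kernel_target_alignment(K_ideal, y)
a_blind = kernel_target_alignment(K_blind, y)
```

A higher alignment indicates that the kernel's notion of similarity agrees with the class structure, which is why it serves as a cheap proxy for kernel selection.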
However, learning arbitrary kernel combinations is too general a problem to admit a guaranteed optimal solution; by focusing on a restricted scenario, it is possible to achieve guaranteed optimality. In their seminal work, Lanckriet et al. (2004) consider training an SVM along with optimizing the linear combination of several positive semi-definite matrices, $K = \sum_{m=1}^{M} \theta_m K_m$, subject to the trace constraint $\mathrm{tr}(K) \leq c$ and requiring a valid combined kernel $K \succeq 0$. This spawned the new field of multiple kernel learning (MKL), the automatic combination of several kernel functions. Lanckriet et al. (2004) show that their specific version of the MKL task can be reduced to a convex optimization problem, namely a semidefinite program (SDP). Though convex, the SDP approach is computationally too expensive for practical applications. Thus much of the subsequent research focuses on devising more efficient optimization procedures.
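The combination $K = \sum_{m=1}^{M} \theta_m K_m$ can be sketched in a few lines; with non-negative coefficients, positive semi-definiteness of the summands carries over to $K$, which is the simplification exploited by later MKL formulations. The helper names below are illustrative, not from the paper.

```python
import numpy as np

def combine_kernels(kernels, theta):
    """Linear combination K = sum_m theta_m * K_m (non-negative theta keeps K PSD)."""
    assert np.all(theta >= 0), "non-negative mixing coefficients assumed"
    return sum(t * K for t, K in zip(theta, kernels))

rng = np.random.default_rng(0)

def random_psd(n):
    # A A^T is always positive semi-definite
    A = rng.standard_normal((n, n))
    return A @ A.T

kernels = [random_psd(5) for _ in range(3)]
theta = np.array([0.5, 0.3, 0.2])
K = combine_kernels(kernels, theta)

# tr(K) is linear in theta, so the constraint tr(K) <= c can be
# enforced simply by rescaling the mixing coefficients.
trace_K = np.trace(K)
min_eig = np.linalg.eigvalsh(K).min()
```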
One conceptual milestone in developing MKL into a tool of practical utility is simply to constrain the mixing coefficients θ to be non-negative: by obviating the complex constraint $K \succeq 0$, this small restriction allows one to transform the optimization problem into a quadratically constrained program, hence drastically reducing the computational burden. While the original MKL objective is stated and optimized in dual space, alternative formulations have been studied. For instance, Bach et al. (2004) found a corresponding primal problem, and Rubinstein (2005) decomposed the MKL problem into a min-max problem that can be optimized by mirror-prox algorithms (Nemirovski, 2004). The min-max formulation was independently proposed by Sonnenburg et al. (2005), who use it to recast MKL training as a semi-infinite linear program (SILP). Solving the latter with column generation (e.g., Nash and Sofer, 1996) amounts to repeatedly training an SVM on a mixture kernel while iteratively refining the mixture coefficients θ. This immediately lends itself to a convenient implementation by a wrapper approach. Such wrapper algorithms directly benefit from efficient SVM optimization routines (cf., e.g., Fan et al., 2005; Joachims, 1999) and are now commonly deployed in recent MKL solvers (e.g., Rakotomamonjy et al., 2008; Xu et al., 2009), thereby allowing for large-scale training (Sonnenburg et al., 2005, 2006a). However, the complete training of several SVMs can still be prohibitive for large data sets. For this reason, Sonnenburg et al. (2005) also propose to interleave the SILP with the SVM training, which reduces the training time drastically. Alternative optimization schemes include level-set methods (Xu et al., 2009) and second-order approaches (Chapelle and Rakotomamonjy, 2008). Szafranski et al. (2010), Nath et al. (2009), and Bach (2009) study composite and hierarchical kernel learning approaches. Finally, Zien and Ong (2007) and Ji et al. (2009) provide extensions for multi-class and multi-label settings, respectively.
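The alternation between SVM training and refinement of θ can be illustrated by the θ-step alone. A closed-form update of the form $\theta_m \propto \|w_m\|^{2/(p+1)}$, rescaled so that $\|\theta\|_p = 1$, is commonly used in $\ell_p$-norm MKL solvers; the sketch below shows only this update, assuming the per-kernel weight norms $\|w_m\|$ have already been obtained from an SVM trained on the current mixture kernel. Function name and inputs are illustrative.

```python
import numpy as np

def lp_theta_update(w_norms, p):
    """One mixing-coefficient update for l_p-norm MKL (sketch, p > 1 assumed):
    theta_m proportional to ||w_m||^(2/(p+1)), rescaled so ||theta||_p = 1."""
    theta = np.asarray(w_norms, dtype=float) ** (2.0 / (p + 1.0))
    return theta / np.linalg.norm(theta, ord=p)

# Per-kernel weight norms from a (hypothetical) SVM trained on the mixture.
theta = lp_theta_update([2.0, 1.0, 0.5], p=2.0)
```

In a wrapper solver this update alternates with a full SVM training per iteration; the interleaved strategies discussed above instead perform such θ-steps inside the SVM solver's own loop, which is what makes them markedly faster.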
Today, there exist two major families of multiple kernel learning models. The first is characterized by Ivanov regularization (Ivanov et al., 2002) over the mixing coefficients (Rakotomamonjy et al., 2007; Zien and Ong, 2007). For the Tikhonov-regularized optimization problem (Tikhonov and Arsenin, 1977), there is an additional parameter.