Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning


For supervised and unsupervised learning, positive definite kernels allow the use of large and potentially infinite-dimensional feature spaces with a computational cost that depends only on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the l1-norm or the block l1-norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to nonlinear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.


💡 Research Summary

The paper addresses a fundamental limitation of conventional kernel‑based learning: while positive‑definite kernels enable the use of very high‑dimensional (even infinite) feature spaces, the usual regularization with Euclidean or Hilbert‑space norms does not promote sparsity or interpretability. To overcome this, the authors propose a hierarchical multiple kernel learning (HMKL) framework that combines l1‑type sparsity with a directed‑acyclic‑graph (DAG) structure over a large collection of basis kernels.

First, the authors assume that a complex kernel can be decomposed as a sum of many elementary kernels (K = \sum_{k=1}^{m} K_k). Each elementary kernel corresponds to a node in a DAG, where edges encode a hierarchical relationship (e.g., low‑order polynomial kernels are ancestors of higher‑order ones, or short‑length string kernels are ancestors of longer‑length ones). The hierarchy enforces that a child kernel can be selected only if its parent is selected, thereby inducing a structured sparsity pattern.
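The decomposition into elementary kernels can be made concrete with a small sketch. Below we use the scalar polynomial kernel (1 + xy)^d, which expands via the binomial theorem into monomial basis kernels K_k(x, y) = (xy)^k ordered by degree; in the scalar case the DAG degenerates to a chain k = 0, 1, ..., d, whereas the paper's multivariate setting yields a genuine DAG over monomials. Function and variable names here are ours, not the paper's:

```python
from math import comb

def basis_kernels(x, y, d):
    # Elementary basis kernels for scalar inputs: K_k(x, y) = (x*y)**k,
    # indexed by degree k (the node's depth in the chain-shaped hierarchy).
    return [(x * y) ** k for k in range(d + 1)]

def full_kernel(x, y, d):
    # The full polynomial kernel (1 + x*y)**d recovered as a weighted sum
    # of the basis kernels, with binomial coefficients as the weights.
    return sum(comb(d, k) * Kk for k, Kk in enumerate(basis_kernels(x, y, d)))

# Sanity check: the sum of basis kernels matches the closed-form kernel.
assert abs(full_kernel(0.5, 2.0, 3) - (1 + 0.5 * 2.0) ** 3) < 1e-12
```

Selecting a prefix of this chain (degrees 0 through k) corresponds to restricting the predictor to polynomials of degree at most k, which is exactly the structured sparsity pattern the hierarchy is meant to induce.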

The regularizer combines an ordinary l1‑norm on individual kernel weights with a block‑l1 (group‑l1) norm that acts on whole sub‑trees of the DAG. Mathematically, the learning problem is expressed as

\min_{f_1, \ldots, f_m} \; \frac{1}{n} \sum_{i=1}^{n} \ell\left(y_i, \sum_{v \in V} f_v(x_i)\right) + \lambda \sum_{v \in V} d_v \left( \sum_{w \in D(v)} \|f_w\|^2 \right)^{1/2}

where V is the set of DAG nodes, D(v) is the set of descendants of node v (including v itself), and d_v > 0 is a weight on node v. When D(v) = \{v\}, the corresponding term reduces to an ordinary l1 penalty on \|f_v\|; for larger sub-trees it is a block-l1 penalty that zeroes out entire descendant sets at once.
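The combinatorial structure of this sub-tree penalty can be illustrated with a short sketch. The function below evaluates the block-l1 regularizer for given per-kernel norms and a descendant map over a small chain-shaped DAG; all names are illustrative, not from the paper:

```python
import math

def hierarchical_penalty(norms, descendants, d=None):
    """Structured sparsity penalty: sum over nodes v of
    d_v * sqrt(sum of ||f_w||^2 over descendants w of v).

    norms:       dict node -> per-kernel norm ||f_v||
    descendants: dict node -> set of descendants of v (including v itself)
    d:           optional dict node -> weight d_v (defaults to 1.0)
    """
    d = d or {v: 1.0 for v in norms}
    return sum(
        d[v] * math.sqrt(sum(norms[w] ** 2 for w in descendants[v]))
        for v in norms
    )

# Chain DAG 0 -> 1 -> 2, with node 0 the root.
desc = {0: {0, 1, 2}, 1: {1, 2}, 2: {2}}

# Only the root is active: the penalty reduces to d_0 * ||f_0|| = 1.0.
assert abs(hierarchical_penalty({0: 1.0, 1: 0.0, 2: 0.0}, desc) - 1.0) < 1e-12

# A leaf activated without its ancestors is penalized by every block that
# contains it, which is why solutions select a node only with its parents.
assert hierarchical_penalty({0: 0.0, 1: 0.0, 2: 1.0}, desc) == 3.0
```

The second assertion shows the mechanism behind the hierarchy constraint: a deep node appears in the blocks of all its ancestors, so turning it on alone incurs the full stack of penalties, and sparse solutions therefore select connected sets of nodes rooted at the top of the DAG.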

