High-Dimensional Non-Linear Variable Selection through Hierarchical Kernel Learning
We consider the problem of high-dimensional non-linear variable selection for supervised learning. Our approach is based on performing linear selection among exponentially many appropriately defined positive definite kernels that characterize non-linear interactions between the original variables. To select efficiently from these many kernels, we use the natural hierarchical structure of the problem to extend the multiple kernel learning framework to kernels that can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a graph-adapted sparsity-inducing norm, in polynomial time in the number of selected kernels. Moreover, we study the consistency of variable selection in high-dimensional settings, showing that under certain assumptions, our regularization framework allows a number of irrelevant variables which is exponential in the number of observations. Our simulations on synthetic datasets and datasets from the UCI repository show state-of-the-art predictive performance for non-linear regression problems.
💡 Research Summary
The paper tackles the challenging problem of non‑linear variable selection in settings where the number of covariates far exceeds the number of observations. Traditional linear sparsity methods (e.g., Lasso) cannot capture interactions, while naïve non‑linear approaches that enumerate all possible feature interactions quickly become computationally infeasible because the number of candidate kernels grows exponentially with the number of variables.
The authors propose a hierarchical kernel learning (HKL) framework that systematically organizes an exponential family of positive‑definite kernels into a directed acyclic graph (DAG). Each node of the DAG corresponds to a kernel built on a specific subset of the original variables, and edges encode the inclusion relationship between subsets (a parent node contains all variables of its child). This hierarchical structure reflects the natural “ancestor‑descendant” constraint: a kernel can be active only if all its ancestors are active.
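To make the DAG structure concrete, here is a minimal sketch (not from the paper; the function names and representation are illustrative) that enumerates kernel nodes as variable subsets up to a given interaction order and checks the ancestor‑closure property described above:

```python
from itertools import combinations

def build_dag(variables, max_order):
    """Enumerate kernel nodes as variable subsets up to max_order.

    Each node is a frozenset of variable indices; edges go from a
    subset to its immediate supersets (one extra variable), encoding
    the inclusion relationship between subsets.
    """
    nodes = [frozenset(s)
             for k in range(max_order + 1)
             for s in combinations(variables, k)]
    edges = {v: [w for w in nodes
                 if len(w) == len(v) + 1 and v < w]  # v is a proper subset of w
             for v in nodes}
    return nodes, edges

def is_ancestor_closed(selected):
    """A kernel may be active only if every subset of its variables
    (i.e., every ancestor in the DAG) is also active."""
    return all(frozenset(t) in selected
               for s in selected
               for k in range(len(s))
               for t in combinations(s, k))
```

For three variables and interactions up to order two, this yields 7 nodes; the set {∅, {0}} is ancestor‑closed, while {{0, 1}} on its own is not, since it omits its ancestors ∅, {0}, and {1}.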
To enforce a sparsity pattern that respects this hierarchy, the authors introduce a graph‑adapted norm:
$$\Omega(f) \;=\; \sum_{v \in V} d_v \Bigg( \sum_{w \in D(v)} \| f_w \|^2 \Bigg)^{1/2},$$

where $V$ is the node set of the DAG, $D(v)$ denotes the set of descendants of node $v$ (including $v$ itself), $d_v > 0$ is a node weight, and $f_w$ is the component of the predictor associated with kernel $w$. Because each group couples a node with all of its descendants, zeroing a group zeroes the entire sub‑DAG rooted at $v$; as a result, the set of selected kernels is always closed under taking ancestors, so the hierarchical constraint is enforced automatically.
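As a toy illustration of how such a graph‑adapted norm behaves, the following sketch (assuming the standard HKL form, a weighted sum of ℓ2 norms over descendant groups, with subsets as nodes and supersets as descendants) computes the norm for a small coefficient dictionary; the data structures are illustrative, not the paper's implementation:

```python
import math

def hkl_norm(theta, weights):
    """Graph-adapted norm: for each node v, take the l2 norm of the
    coefficient blocks of all descendants D(v) (here, supersets of v),
    scaled by a positive weight d_v, and sum over nodes.

    theta:   dict mapping frozenset (variable subset) -> coefficient list
    weights: dict mapping frozenset -> weight d_v > 0
    """
    total = 0.0
    for v, d_v in weights.items():
        # Sum of squared coefficients over the descendant group of v.
        sq = sum(x * x
                 for w, block in theta.items()
                 if v <= w  # w is a descendant of v (superset or equal)
                 for x in block)
        total += d_v * math.sqrt(sq)
    return total
```

With unit weights and coefficients 0, 3, 4 on the chain ∅ → {0} → {0, 1}, the groups contribute √25 + √25 + √16 = 14: shrinking a node's group to zero necessarily kills every descendant's coefficients as well, which is exactly the hierarchical selection effect.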