Additive Gaussian Processes
Authors: David Duvenaud, Hannes Nickisch, Carl Edward Rasmussen
David Duvenaud (Department of Engineering, Cambridge University, dkd23@cam.ac.uk), Hannes Nickisch (MPI for Intelligent Systems, Tübingen, Germany, hn@tue.mpg.de), Carl Edward Rasmussen (Department of Engineering, Cambridge University, cer54@cam.ac.uk)

Abstract

We introduce a Gaussian process model of functions which are additive. An additive function is one which decomposes into a sum of low-dimensional functions, each depending on only a subset of the input variables. Additive GPs generalize both Generalized Additive Models, and the standard GP models which use squared-exponential kernels. Hyperparameter learning in this model can be seen as Bayesian Hierarchical Kernel Learning (HKL). We introduce an expressive but tractable parameterization of the kernel function, which allows efficient evaluation of all input interaction terms, whose number is exponential in the input dimension. The additional structure discoverable by this model results in increased interpretability, as well as state-of-the-art predictive power in regression tasks.

1 Introduction

Most statistical regression models in use today are of the form:

g(y) = f(x_1) + f(x_2) + ... + f(x_D).

Popular examples include logistic regression, linear regression, and Generalized Linear Models [1]. This family of functions, known as Generalized Additive Models (GAMs) [2], is typically easy to fit and interpret. Some extensions of this family, such as smoothing-splines ANOVA [3], add terms depending on more than one variable. However, such models generally become intractable and difficult to fit as the number of terms increases.

At the other end of the spectrum are kernel-based models, which typically allow the response to depend on all input variables simultaneously. These have the form:

y = f(x_1, x_2, ..., x_D).

A popular example would be a Gaussian process model using a squared-exponential (or Gaussian) kernel.
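To make the contrast concrete, the second family can be sketched as a squared-exponential kernel over all D inputs at once (a minimal NumPy illustration; the lengthscale `ell` and variance `sf2` are illustrative defaults, not values from the paper):

```python
import numpy as np

def se_kernel(X, Xp, ell=1.0, sf2=1.0):
    """Squared-exponential kernel k(x, x') = sf2 * exp(-||x - x'||^2 / (2 ell^2)).

    Every input dimension enters jointly, so the response depends on
    all D variables simultaneously -- the SE-GP model described above.
    """
    sq_dists = np.sum((X[:, None, :] - Xp[None, :, :]) ** 2, axis=-1)
    return sf2 * np.exp(-0.5 * sq_dists / ell**2)

X = np.random.randn(5, 3)          # 5 points in D = 3 dimensions
K = se_kernel(X, X)
print(K.shape)                     # (5, 5)
print(np.allclose(np.diag(K), 1))  # k(x, x) = sf2 = 1 on the diagonal
```

A GAM, by contrast, would build its covariance from one-dimensional pieces only, which is the direction developed in the rest of the paper.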
We denote this model as SE-GP. This model is much more flexible than the GAM, but its flexibility makes it difficult to generalize to new combinations of input variables.

In this paper, we introduce a Gaussian process model that generalizes both GAMs and the SE-GP. This is achieved through a kernel which allows additive interactions of all orders, ranging from first-order interactions (as in a GAM) all the way to Dth-order interactions (as in an SE-GP). Although this kernel amounts to a sum over an exponential number of terms, we show how to compute this kernel efficiently, and introduce a parameterization which limits the number of hyperparameters to O(D). A Gaussian process with this kernel function (an additive GP) constitutes a powerful model that allows one to automatically determine which orders of interaction are important. We show that this model can significantly improve modeling efficacy, and has major advantages for model interpretability. This model is also extremely simple to implement, and we provide example code.

We note that a similar breakthrough has recently been made, called Hierarchical Kernel Learning (HKL) [4]. HKL explores a similar class of models, and sidesteps the possibly exponential number of interaction terms by cleverly selecting only a tractable subset. However, this method suffers considerably from the fact that cross-validation must be used to set hyperparameters. In addition, the machinery necessary to train these models is immense. Finally, on real datasets, HKL is outperformed by the standard SE-GP [4].
[Figure 1: A first-order additive kernel, and a product kernel. Left: a draw from a first-order additive kernel corresponds to a sum of draws from one-dimensional kernels. Right: functions drawn from a product kernel prior have weaker long-range dependencies, and less long-range structure.]

2 Gaussian Process Models

Gaussian processes are a flexible and tractable prior over functions, useful for solving regression and classification tasks [5]. The kind of structure which can be captured by a GP model is mainly determined by its kernel: the covariance function. One of the main difficulties in specifying a Gaussian process model is in choosing a kernel which can represent the structure present in the data. For small to medium-sized datasets, the kernel has a large impact on modeling efficacy.

Figure 1 compares, for two-dimensional functions, a first-order additive kernel with a second-order kernel. We can see that a GP with a first-order additive kernel is an example of a GAM: each function drawn from this model is a sum of orthogonal one-dimensional functions. Compared to functions drawn from the higher-order GP, draws from the first-order GP have more long-range structure. We can expect many natural functions to depend only on sums of low-order interactions.
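The correspondence illustrated in the left panel of Figure 1 — a draw from a first-order additive prior is a sum of draws from one-dimensional priors — can be sketched numerically. The following NumPy code is a hypothetical illustration with SE base kernels and arbitrary lengthscales:

```python
import numpy as np

rng = np.random.default_rng(0)

def se_1d(x, xp, ell=1.0):
    """One-dimensional squared-exponential base kernel."""
    return np.exp(-0.5 * (x[:, None] - xp[None, :]) ** 2 / ell**2)

n = 50
x1 = np.linspace(-4, 4, n)
x2 = np.linspace(-4, 4, n)

jitter = 1e-8 * np.eye(n)  # numerical stabilizer for the Cholesky inside the sampler
# Draw f1 ~ GP(0, k1) and f2 ~ GP(0, k2), each a one-dimensional function.
f1 = rng.multivariate_normal(np.zeros(n), se_1d(x1, x1) + jitter)
f2 = rng.multivariate_normal(np.zeros(n), se_1d(x2, x2) + jitter)

# f(x1, x2) = f1(x1) + f2(x2) is then a draw from the GP whose kernel is
# k1(x1, x1') + k2(x2, x2'): the first-order additive (GAM) prior on the grid.
f = f1[:, None] + f2[None, :]
print(f.shape)  # (50, 50)
```

Because the two one-dimensional draws are independent, their sum has exactly the summed covariance, which is why the additive structure is visible as long ridges in the sampled surface.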
For example, the price of a house or car will presumably be well approximated by a sum of prices of individual features, such as a sun-roof. Other parts of the price may depend jointly on a small set of features, such as the size and building materials of a house. Capturing these regularities will mean that a model can confidently extrapolate to unseen combinations of features.

3 Additive Kernels

We now give a precise definition of additive kernels. We first assign each dimension i ∈ {1, ..., D} a one-dimensional base kernel k_i(x_i, x'_i). We then define the first-order, second-order and nth-order additive kernels as:

k_{add_1}(\mathbf{x}, \mathbf{x}') = \sigma_1^2 \sum_{i=1}^{D} k_i(x_i, x'_i)   (1)

k_{add_2}(\mathbf{x}, \mathbf{x}') = \sigma_2^2 \sum_{i=1}^{D} \sum_{j=i+1}^{D} k_i(x_i, x'_i) \, k_j(x_j, x'_j)   (2)

k_{add_n}(\mathbf{x}, \mathbf{x}') = \sigma_n^2 \sum_{1 \le i_1 < i_2 < \cdots < i_n \le D} \; \prod_{d=1}^{n} k_{i_d}(x_{i_d}, x'_{i_d})   (3)
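Equations (1) and (2) can be evaluated without enumerating all pairs: the second-order kernel is the second elementary symmetric polynomial of the base kernels, which follows from the standard identity e2 = (e1² − Σ_i k_i²)/2. The sketch below (NumPy; SE base kernels and unit lengthscales are illustrative choices, not the paper's settings) checks this against the explicit double sum:

```python
import numpy as np

def base_kernels(X, Xp, ells):
    """Stack of 1-D SE base kernels k_i(x_i, x'_i); returns shape (D, N, M)."""
    diffs = X.T[:, :, None] - Xp.T[:, None, :]
    return np.exp(-0.5 * diffs**2 / ells[:, None, None] ** 2)

def additive_orders_1_2(K):
    """First- and second-order additive kernels from the base-kernel stack K.

    k_add1 = e1 = sum_i k_i                          (equation 1, sigma = 1)
    k_add2 = e2 = (e1^2 - sum_i k_i^2) / 2           (equation 2 via power sums)
    """
    e1 = K.sum(axis=0)
    p2 = (K**2).sum(axis=0)
    e2 = 0.5 * (e1**2 - p2)
    return e1, e2

rng = np.random.default_rng(1)
D, N = 4, 6
X = rng.standard_normal((N, D))
K = base_kernels(X, X, ells=np.ones(D))

e1, e2 = additive_orders_1_2(K)

# Check e2 against the explicit double sum over i < j in equation (2).
brute = sum(K[i] * K[j] for i in range(D) for j in range(i + 1, D))
print(np.allclose(e2, brute))  # True
```

The same power-sum idea extends to higher orders, which is what makes all D interaction orders tractable despite the exponential number of terms in equation (3).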