Collaborative Filtering with Side Information: a Gaussian Process Perspective


Authors: Hyunjik Kim, Xiaoyu Lu, Seth Flaxman, Yee Whye Teh

Hyunjik Kim (hkim@stats.ox.ac.uk), Xiaoyu Lu (xiaoyu.lu@stats.ox.ac.uk), Seth Flaxman (flaxman@stats.ox.ac.uk), Yee Whye Teh (y.w.teh@stats.ox.ac.uk)
Department of Statistics, University of Oxford

Abstract

We tackle the problem of collaborative filtering (CF) with side information, through the lens of Gaussian Process (GP) regression. Driven by the idea of using the kernel to explicitly model user-item similarities, we formulate the GP in a way that allows the incorporation of low-rank matrix factorisation, arriving at our model, the Tucker Gaussian Process (TGP). Consequently, TGP generalises classical Bayesian matrix factorisation models, and goes beyond them to give a natural and elegant method for incorporating side information, giving enhanced predictive performance for CF problems. Moreover, we show that it is a novel model for regression, especially well-suited to grid-structured data and problems where the dependence on covariates is close to being separable.

1 Introduction

Collaborative filtering (CF) defines a branch of techniques for tackling the following supervised learning problem: making predictions (filtering) about the preferences of a user, based on information regarding the preferences of many users (collaboration). We are given data in the form of a partially observed rating matrix R, where R_ij is the rating of user u_i on movie v_j for i = 1, ..., n_1, j = 1, ..., n_2. CF aims to predict missing entries of R by only using the observed entries. Hitherto, matrix factorisation approaches [3, 11, 18, 32] have been the basis for many successful CF models. These model R as a product of two low-rank matrices, R ≈ UV^T, hence R_ij ≈ Σ_k U_ik V_jk.
On the other hand, content-based filtering predicts user ratings based on attributes of users (e.g. age, sex) and items (e.g. genre). CF with side information is a combination of the two, aiming to predict user ratings using both ratings data and user/item attributes. There has been a wide range of work on CF with side information, mostly building on the framework of matrix factorisation. Suppose user/item side information is given in the form of feature matrices F = [ω(u_1), ..., ω(u_{n_1})]^T ∈ R^{n_1 × r} and G = [ω'(v_1), ..., ω'(v_{n_2})]^T ∈ R^{n_2 × r}. Matrix co-factorisation [28] attempts to factorise F, G and R simultaneously, whereas the Regression-based Latent Factor Model [1] assumes instead that U and V are linear in F and G. Bayesian Matrix Factorization with Side Information (BMFSI) [19] gives an additive model in the sense that R is assumed to be the sum of the standard matrix factorisation prediction UV^T and linear contributions of F and G. Hierarchical Bayesian Matrix Factorization with Side Information [17] is an extension of BMFSI with Gaussian-Wishart hyperpriors on the prior mean and variance of U and V.

Gaussian Processes (GPs) are a popular class of Bayesian nonparametric priors over functions [23], and have served as flexible models across a range of machine learning tasks, e.g. regression, classification, dimensionality reduction [12] and CF [13, 39]. In a regression setting with input and output pairs, a key advantage of GPs is that we can use the kernel to explicitly model similarity in the outputs between a pair of input values. We use this to model similarity between users/items given side information, forming the outset of the paper. We model the ratings as R_ij ~ N(f(u_i, v_j), σ²) and f ~ GP(0, k_1 × k_2), where kernels k_1 and k_2 model user and item similarities respectively.
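The product-kernel construction for pairs of observed ratings can be sketched in a few lines of numpy. This is a minimal illustration only: the RBF choice for the user/item kernels and the toy features are assumptions for the sketch, not the paper's exact setup.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    """Squared-exponential kernel matrix between rows of A and rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / lengthscale ** 2)

rng = np.random.default_rng(0)
user_feats = rng.normal(size=(4, 2))   # toy side information for 4 users
item_feats = rng.normal(size=(3, 2))   # toy side information for 3 items

# Observed (user, item) index pairs for three ratings.
ui = np.array([0, 1, 2])
vj = np.array([0, 0, 2])

# Product kernel on rating pairs:
# k((u_i, v_j), (u_i', v_j')) = k1(u_i, u_i') * k2(v_j, v_j')
K = rbf(user_feats[ui], user_feats[ui]) * rbf(item_feats[vj], item_feats[vj])
```

Two ratings are highly correlated a priori only when both the users and the items are similar, which is exactly the modelling assumption stated above.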
However, a direct GP regression application is infeasible due to its O(N³) computational cost, because for most CF problems the number of ratings N ranges from a hundred thousand to hundreds of millions. Low-rank matrix factorisation remedies this problem, and also underlies the state-of-the-art approaches for CF. Hence it is natural to look for a connection between GPs and low-rank matrix factorisation, which is the motivation and contribution of our work.

To develop a framework for this connection, we first propose a novel approximation scheme for GPs. Our starting point is the Kronecker structure that arises naturally when working with kernels that are products of simpler constituent kernels (say each dependent on one covariate dimension). Coupled with the weight space view of GPs, we can represent a draw from the GP as a product between a random weight tensor and a collection of feature vectors (one for each constituent kernel). The weight tensor can be very large for high dimensional problems, and our proposal is to approximate it using a low-rank Tucker decomposition [34] instead. This reduces the effective number of parameters that need to be learnt, and forms the link between GPs and matrix factorisation methods in CF. Thus we arrive at our model, the Tucker Gaussian Process (TGP). We make the following contributions:

• TGP is an elegant and effective method for modelling user/item similarities via kernels to exploit side information. As far as we know, ours is the first work to use GPs for modelling similarities via side information, with explicit correspondence between similarities and the kernel.
• TGP generalises classical Bayesian matrix factorisation models [26, 27], bridging the gap between matrix factorisation methods and GP methods in CF.
• Sub-linear scaling of TGP, achieved by stochastic gradient descent, makes it suitable for CF problems that typically have large data sets infeasible for GPs.
• TGP is applicable to certain regression problems where the regression function is separable in the covariates. We reason that the Tucker decomposition acts as a regulariser for the GP that helps control overfitting, and verify that TGP outperforms GPs in generalisation performance for these problems.

Outline of paper. We formulate TGP in the general regression setting in Section 2, and describe its central application to CF with side information in Section 3, followed by related work and discussion in Section 4. We present experimental results in Section 5 and conclude in Section 6.

2 Tucker Gaussian Process Regression

2.1 Tucker GP Regression

[Figure 1: Tensor (a) and Tucker (b) decomposition representations of the regression function for D = 3.]

Consider a regression problem with inputs x_1, ..., x_N ∈ X and corresponding observations y_1, ..., y_N ∈ R. We assume y_i | x_i ~ N(f(x_i), σ²) for some f : X → R, and that the observations are independent. The aim is to learn f. One approach is to put a Gaussian Process (GP) prior on f, with zero mean and covariance k. The training then consists of computing the posterior GP. The problem of using this in a CF setting is that training costs O(N³) operations. N, the number of ratings, usually ranges from a hundred thousand to hundreds of millions in CF, making inference infeasible.

The weight space view of GPs offers a natural way of dealing with the problem: suppose there exists a feature map φ : X → R^n (where n is the number of features) such that k(x, x') = φ(x)^T φ(x') for all x, x' ∈ X. Then the GP is equivalent to Bayesian linear regression with feature
vectors used for each row of the design matrix [23]:

y | x, θ ~ N(f(x), σ²),  f(x) = θ^T φ(x),  θ ~ N(0, I), θ ∈ R^n    (1)

Now training takes O(N n²) time, and is scalable for n ≪ N. Consider the case of product kernels, where the kernel can be written as follows:

k(x_i, x_j) = Π_{d=1}^D k_d(x_i, x_j)    (2)

and suppose there are feature maps φ_d : X → R^n such that k_d(x_i, x_j) = φ_d(x_i)^T φ_d(x_j). Then we can write k(x_i, x_j) = φ(x_i)^T φ(x_j), where φ(x) = ⊗_{d=1}^D φ_d(x) is the Kronecker product of the φ_d. Returning to (1):

f(x) = θ^T φ(x) = θ^T (⊗_{d=1}^D φ_d(x)) = θ ×_{d=1}^D φ_d(x)    (3)

where θ has been reshaped as a D-dimensional tensor in R^{n × ... × n} in the rightmost expression, as in Figure 1a. We define the tensor product notation as follows:

θ ×_{d=1}^D φ_d := vec(θ)^T ⊗_{d=1}^D φ_d = Σ_{i_1,...,i_D=1}^n θ_{i_1,...,i_D} Π_{d=1}^D (φ_d)_{i_d}

We refer to (3) as the full-rank model, and use θ as a tensor for the rest of the paper. This full-rank model is problematic in high dimensions: the size of θ grows as n^D, so the function computation becomes infeasible. Thus we introduce the novel Tucker Gaussian Process (TGP) model, where we circumvent this problem by approximating θ using a low-rank Tucker decomposition [34]. This is a tensor-matrix product between a low-rank core tensor W ∈ R^{r × ... × r} of dimension D and matrices U^(1), ..., U^(D) ∈ R^{n × r}, as in Figure 1b. We denote θ ≈ W ×_{d=1}^D U^(d)T, where the (i_1, ..., i_D)th entry is W ×_{d=1}^D U^(d)_{i_d}, with U^(d)_{i_d} a column vector representing the i_d-th row of U^(d). Here n is the number of features in each dimension and r is the rank. Note that we are free to use a different n and r for each dimension, but assume these are the same for convenience of notation. We must also place suitable priors on W and U^(d) to match the iid N(0, 1) prior on each entry of θ.
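The identity in (3), that an inner product with a Kronecker product of feature vectors equals a contraction of the reshaped weight tensor, can be verified directly. A small numpy sketch for D = 3 with arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
D, n = 3, 4
phis = [rng.normal(size=n) for _ in range(D)]   # phi_1(x), ..., phi_D(x)
theta = rng.normal(size=n ** D)                 # flat weight vector, N(0, I) draw

# Left-hand side: inner product with the Kronecker product of the features.
kron = phis[0]
for p in phis[1:]:
    kron = np.kron(kron, p)
lhs = theta @ kron

# Right-hand side: reshape theta into an n x n x n tensor and contract each mode.
T = theta.reshape((n,) * D)
rhs = np.einsum('ijk,i,j,k->', T, *phis)
```

The two agree because numpy's C-order reshape matches the index ordering of `np.kron`, i.e. kron(a, b)[i*n + j] = a[i] b[j].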
We place iid priors N(0, 1) on each entry of W, and N(0, σ_u²) on each entry of U^(d). Setting σ_u² = 1/r, we match the first two moments of W ×_{d=1}^D U^(d)T and θ. We then prove in Appendix B that each entry of W ×_{d=1}^D U^(d)T converges in distribution to N(0, 1) as r → ∞. In summary, our TGP regression model approximating data from a GP with product kernel (2) and homoscedastic noise is:

y | x ~ N(f(x), σ²),  f(x) = W ×_{d=1}^D (U^(d)T φ_d(x))    (4)

where φ_d : X → R^n are feature maps such that k_d(x_i, x_j) = φ_d(x_i)^T φ_d(x_j), and we have iid N(0, 1) and N(0, 1/r) priors on the entries of W and U^(d) respectively.

2.2 Choice of Feature Map

So far, we have assumed that the kernels k_d can be written as the inner product of feature vectors: k_d(x_i, x_j) = φ_d(x_i)^T φ_d(x_j). We investigate the situations where this assumption holds. When it does not hold, we explore other choices of φ that approximate k_d.

Identity features. One case where we can write kernels as inner products of features is with identity kernels k_d(x_i, x_j) = δ_ij. The features are unit vectors: φ_d(x_i) = e_i := (0, ..., 0, 1, 0, ...)^T with the non-zero at the ith entry, hence U^(d)T φ_d(x_i) = U^(d)_i. However, this implies U^(d) ∈ R^{N × r} (or R^{n_d × r} for inputs on a grid), so for N (or n_d) too big, computations can become too costly both in time and memory. A workaround is to use feature hashing [35] to obtain shorter features whose inner products are unbiased estimates of inner products of the original features. This technique can be applied to arbitrary features where the number of features is too large. See Appendix C for details.
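A minimal sketch of evaluating (4) without ever forming the n^D weight tensor; the sizes and random features are arbitrary, and the cost per evaluation is the O(nrD + r^D D) quoted in Section 2.3:

```python
import numpy as np

rng = np.random.default_rng(0)
D, n, r = 3, 20, 5
W = rng.normal(size=(r, r, r))                       # core tensor, iid N(0, 1)
Us = [rng.normal(scale=1 / np.sqrt(r), size=(n, r))  # factors, iid N(0, 1/r)
      for _ in range(D)]
phis = [rng.normal(size=n) for _ in range(D)]        # feature vectors phi_d(x)

# psi_d(x) = U^(d)^T phi_d(x) in R^r, then contract with the core tensor:
psis = [U.T @ phi for U, phi in zip(Us, phis)]
f = np.einsum('abc,a,b,c->', W, *psis)               # f(x) = W x_d (U^(d)^T phi_d(x))
```

Reconstructing the full θ ≈ W ×_d U^(d)T and contracting it with the φ_d gives the same value, but costs O(n^D) memory.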
We can also deal with cases where the data lies on a grid using Cholesky features, or, in the most general case where the data does not lie on a grid and k_d cannot be expressed as the inner product of finite feature vectors, using random Fourier features. See Appendix E for details.

2.3 Learning

In TGP we would like to learn the posterior distribution of U and W. The simplest and fastest method of learning is maximum a posteriori (MAP), whereby we approximate the posterior with point estimates Û, Ŵ = argmax_{U,W} p(U, W | y). For the optimisation we may use stochastic gradient descent (SGD) to approximate the full gradient, giving a time complexity of O(m(nrD + r^D D)) operations for computing the stochastic gradient on a mini-batch of size m, which is sublinear in N. See Appendix A for details.

The problem with a MAP estimate for U, W is that only the posterior mode is used, and the uncertainty encoded in the shape of the posterior distribution is ignored. In a Bayesian setting, we wish to use samples from the posterior and average predictions over samples. For data where we can afford an O(N) runtime, we may use sampling algorithms such as Hamiltonian Monte Carlo (HMC) [7, 16]. The runtime for each HMC leapfrog step is O(N(nrD + r^D D)), the same time complexity as a step of full-batch gradient descent.

3 TGP for Collaborative Filtering with Side Information

In order to apply TGP to CF, let us first formulate the problem using GPs. It is natural to model this as a supervised regression problem with R_ij ~ N(f(u_i, v_j), σ²) and prior f ~ GP(0, k) [39]. Note that this is particularly suitable with side information, since kernels can be interpreted as measures of similarity; we can design k to encode similarities between users/movies given by the side information. Hence we may further exploit the use of GPs for addressing this problem.
In particular we use a product kernel k((u_i, v_j), (u_{i'}, v_{j'})) = k_1(u_i, u_{i'}) k_2(v_j, v_{j'}), since we expect similar ratings for two user/movie pairs if the users are similar and the movies are similar. When there is no side information, it is sensible to use identity kernels k_1(u_i, u_{i'}) = δ_{u_i u_{i'}}, k_2(v_j, v_{j'}) = δ_{v_j v_{j'}}, i.e. distinct users and movies are not similar a priori. With side information, we may add on further kernels κ_1, κ_2 modelling similarity between users/movies:

k_1(u_i, u_{i'}) = a_1² δ_{u_i u_{i'}} + b_1² κ_1(u_i, u_{i'}),  k_2(v_j, v_{j'}) = a_2² δ_{v_j v_{j'}} + b_2² κ_2(v_j, v_{j'}),

where the a and b are parameters controlling the extent to which similarity in side information leads to similarity in preference. However, it is not immediately clear how this single GP framework relates to the matrix factorisation approach. We show that our proposed TGP forms a natural connection between these two approaches, and that we recover classic matrix factorisation models as a special case.

To apply TGP, first note that we have D = 2, and the Tucker decomposition is simply a low-rank matrix factorisation. Using the notation U, V instead of U^(1), U^(2), we have that θ ≈ U W V^T, hence f(u_i, v_j) = φ_1(u_i)^T U W (φ_2(v_j)^T V)^T. With the identity kernel, we have unit vector features φ_1(u_i) = e_i ∈ R^{n_1} and φ_2(v_j) = e_j ∈ R^{n_2}. TGP therefore simplifies to:

R_ij ~ N(f(u_i, v_j), σ²),  f(u_i, v_j) = U_i^T W V_j    (5)

with iid N(0, σ_u²) priors on each entry of U, V, where U_i, V_j are column vectors representing the ith and jth rows of U and V respectively. Note that with W = I fixed, we recover Probabilistic Matrix Factorization (PMF) [27], a particularly effective Bayesian model in the matrix factorization framework.
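A hedged sketch of model (5) fitted by minibatch SGD on the MAP objective, with manual gradients. The synthetic ratings, learning rate, and sizes are illustrative assumptions only; note the N/m rescaling of the minibatch likelihood term so that each step is a stochastic estimate of the full gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, r = 30, 40, 5
sigma2, sigma2_u = 1.0, 1.0 / r           # noise variance and prior variance

# Synthetic ratings from a ground-truth low-rank model (illustrative only).
U0, V0 = rng.normal(size=(n1, r)), rng.normal(size=(n2, r))
N = 500
I, J = rng.integers(0, n1, N), rng.integers(0, n2, N)
Y = (U0[I] * V0[J]).sum(1) / r + 0.1 * rng.normal(size=N)

U = rng.normal(scale=np.sqrt(sigma2_u), size=(n1, r))
V = rng.normal(scale=np.sqrt(sigma2_u), size=(n2, r))
W = np.eye(r)                             # W = I recovers PMF; learning W gives TGP

def mse():
    return np.mean((np.einsum('br,rs,bs->b', U[I], W, V[J]) - Y) ** 2)

mse_init, lr, m = mse(), 1e-3, 100
for step in range(1000):
    b = rng.integers(0, N, m)             # minibatch of ratings
    i, j, y = I[b], J[b], Y[b]
    err = (np.einsum('br,rs,bs->b', U[i], W, V[j]) - y) * (N / m) / sigma2
    gU, gV = np.zeros_like(U), np.zeros_like(V)
    np.add.at(gU, i, err[:, None] * (V[j] @ W.T))   # accumulate repeated indices
    np.add.at(gV, j, err[:, None] * (U[i] @ W))
    gW = U[i].T @ (err[:, None] * V[j])
    U -= lr * (gU + U / sigma2_u)         # quadratic penalties from Gaussian priors
    V -= lr * (gV + V / sigma2_u)
    W -= lr * (gW + W)                    # iid N(0, 1) prior on W when learning it
```

Each step touches only m ratings, which is the source of the sublinear scaling claimed in Section 2.3.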
An extension is Bayesian PMF (BPMF) [26], where priors are placed on the prior mean and covariance of U_i, V_j. Should we decide to learn W in TGP, interesting parallels arise between our model and BPMF. Observe from the following that learning W can be a proxy for learning the prior mean and covariance of U and V, as is done in the BPMF model:

U_i ~ N(μ_u, Λ_u), V_j ~ N(μ_v, Λ_v)
⇒ U_i = μ_u + L_u u_i, V_j = μ_v + L_v v_j, where u_i, v_j ~ N(0, I), Λ_u = L_u L_u^T, Λ_v = L_v L_v^T
⇒ U_i^T V_j = μ_u^T μ_v + μ_u^T L_v v_j + u_i^T L_u^T μ_v + u_i^T L_u^T L_v v_j = U'_i^T W V'_j

where U'_i^T = [u_i^T, 1], W = [L_u^T L_v, L_u^T μ_v; μ_u^T L_v, μ_u^T μ_v], V'_j = [v_j; 1], as displayed in Figure 2.

[Figure 2: Bayesian PMF reparametrised.]

So a full W with standard iid Gaussian priors on U, V can capture the effects of modelling U, V with non-zero means and full covariances for each row of U, V, as in BPMF.

Returning to the case with side information, suppose it is given in the form of vectors ω_1(u_i), ω_2(v_j), and that we expect users/movies with similar ω to show similar preferences / be preferred by similar users. For example, we can encode the user age into ω_1 and the movie genre into ω_2, and define κ_1(u_i, u_{i'}) = ω_1(u_i)^T ω_1(u_{i'}), κ_2(v_j, v_{j'}) = ω_2(v_j)^T ω_2(v_{j'}). The feature vector is now φ_d(u_i) = [a_d e_i^T, b_d ω_d(u_i)^T]^T for d = 1, 2, and we have f(u_i, v_j) = φ_1(u_i)^T U W V^T φ_2(v_j).

4 Related Work and Discussion

Modelling data in the form of matrices and tensors has been studied in the field of multi-way data analysis and relational learning. The key idea here is to factorise the data tensor, with two notable forms of factorisation: PARAFAC [4] and Tucker [34]. There are a few works in these domains that relate to GPs.
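Returning briefly to the BPMF reparametrisation derived above: the identity U_i^T V_j = U'_i^T W V'_j holds for arbitrary means and (Cholesky-like) factors, and can be checked numerically. A small sketch, with μ_u, μ_v, L_u, L_v chosen at random:

```python
import numpy as np

rng = np.random.default_rng(0)
r = 3
mu_u, mu_v = rng.normal(size=r), rng.normal(size=r)
L_u = np.tril(rng.normal(size=(r, r)))      # Cholesky-like factors of the covariances
L_v = np.tril(rng.normal(size=(r, r)))
u, v = rng.normal(size=r), rng.normal(size=r)

# BPMF-style latent vectors with non-zero means and full covariances.
U_i = mu_u + L_u @ u
V_j = mu_v + L_v @ v

# Equivalent bilinear form: augment with a constant 1 and absorb means into W.
W = np.block([[L_u.T @ L_v,           (L_u.T @ mu_v)[:, None]],
              [(mu_u @ L_v)[None, :], np.array([[mu_u @ mu_v]])]])
U_aug = np.concatenate([u, [1.0]])
V_aug = np.concatenate([v, [1.0]])
assert np.isclose(U_i @ V_j, U_aug @ W @ V_aug)
```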
InfTucker [38] uses the Tucker decomposition directly on the data tensor, and uses a non-linear transformation of the parameters U^(d) for the regression function, contrary to TGP, which is linear in the parameters. DinTucker [31] tries to scale up InfTucker by splitting up the observed tensor into subarrays. [15] motivate their model using the PARAFAC decomposition instead of Tucker, expressing the regression function as a sum of products of local functions; these local functions are each modelled by GPs. However, the TGP is motivated from a single GP on the input space. [33] again model the regression function as a sum of products of local functions, which live in the RKHS of some kernel, analogous to the feature maps in TGP, but there is no mention of GPs or of how their model relates to low-rank tensor decomposition. [20] deal with the classification problem where each input is a tensor, so there is one label per tensor; they define a GP over the space of tensors, and it is unclear whether they actually use low-rank tensor decomposition. For TGP we deal with regression, work in the setting where each element of the data tensor corresponds to a response, and apply the Tucker decomposition to the parameters. [30] define a GP over the parameter space, whereas TGP is a multilinear expression in the parameters and feature maps, approximating a GP over the input space. In short, these models make completely different assumptions to TGP, and thus are useful for different CF applications: none use side information (it is unclear how this would be possible given their model assumptions), and none relate to the Bayesian matrix factorisation literature.

There are closer connections between our model and the Stochastic Relational Model [39] in relational learning. It is a special case of our model with W = I and D = 2.
The key differences lie in the inference: we use features to build on the weight-space view of GPs, whereas [39] work with GPs in the function-space view. This complicates learning for kernels which cannot be expressed as an inner product of features; the authors resort to a Laplace approximation for finding maximum likelihood estimates of parameters. For such kernels we use random feature maps (see Appendix E), making learning simple and more computationally efficient.

In the domain of matrix factorisation, [13] use a GP Latent Variable Model (GP-LVM) [12]. They learn a latent vector for each movie, and pass it through a zero-mean GP with squared exponential (SE) kernel, with one GP per user; in TGP we use one GP for all users and items. For a CF application, they incorporate side information about movies by taking the product of these kernels with an SE kernel in the movie features. Our model is more flexible in that we can take into account both user and item similarities simultaneously.

From the perspective of GP regression, we analyse the regression function of TGP to understand the regression problems for which it will be effective. Recall that the regression function f(x) in (4) can be seen as W ×_{d=1}^D ψ_d(x), where ψ_d(x) = U^(d)T φ_d(x) are lower-dimensional features in R^r (i.e. the U^(d) multiplied by the φ_d(x) in Figure 1b). With this new formulation, we have:

f(x) = W ×_{d=1}^D ψ_d(x) = Σ_{i_1,...,i_D=1}^r W_{i_1...i_D} Π_{d=1}^D (ψ_d(x))_{i_d}    (6)

Hence learning W and (U^(d))_{d=1}^D can be interpreted as learning features ψ_d as well as their weights for the regression function, i.e. learning a linear combination of products of these features. In the case where ψ_d(x) is only a function of the dth dimension of x, each Π_{d=1}^D (ψ_d(x))_{i_d} is separable in the dimensions.
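For D = 2, the r² separable summands in (6) can be extracted explicitly; this is the decomposition behind the additive-components analysis of Figure 4b. A small numpy sketch with arbitrary features and sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 10, 2
U1, U2 = rng.normal(size=(n, r)), rng.normal(size=(n, r))
W = rng.normal(size=(r, r))
phi1, phi2 = rng.normal(size=n), rng.normal(size=n)   # phi_d(x) for one input x

psi1, psi2 = U1.T @ phi1, U2.T @ phi2                 # psi_d(x) in R^r

# The r^2 summands W_{i1 i2} (psi_1)_{i1} (psi_2)_{i2} from (6):
components = W * np.outer(psi1, psi2)
f = components.sum()
assert np.isclose(f, psi1 @ W @ psi2)                 # the summands add up to f(x)
```

Evaluating each component on a grid of inputs, rather than a single x, yields one heatmap per summand.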
Modelling data with sums of separable functions has been studied in [2], and its effectiveness for regression is shown by promising results on various synthetic and real data. Such additive models arise frequently in the context of ensemble learning, such as boosting and BART [5], where a linear combination of many weak learners is used to build a single strong learner. We may interpret our model in this framework, where the Π_{d=1}^D (ψ_d(x))_{i_d} are the weak learners that share parameters, and the W_{i_1...i_D} are the corresponding weights. With this alternative interpretation in mind, we may expect TGP to perform well in cases where the data displays an additive structure, with the additive components arising from a product of features on each dimension. Hence we interpret TGP as a modified GP where the approximation acts as a regulariser towards such simpler functions, which can actually lead to enhanced generalisation performance by controlling overfitting. We thus compare its performance to GPs on spatio-temporal data sets where it is reasonable to expect separability in longitude and latitude, or in time and space.

Based on Section 2.2, we also see that our model is particularly well-suited to modelling grid-structured data. The difference between our model and that in [25] is that we have Kronecker structure in the features φ, whereas they exploit Kronecker structure in the data. Moreover, our model can deal with data not on a grid, as well as data on a grid with many missing observations, since observations are not needed for constructing the features.

Going back to CF, recall from Section 3 that the low-rank matrix factorisation model has R_ij ≈ Σ_k U_ik V_jk, a sum of products of parameters (features) in each dimension. TGP generalises this to modelling a linear combination of products of features, hence we may expect it to perform well for this task. Also note the grid structure, since users and items are categorical variables.
5 Experimental Results

Collaborative filtering. We use the MovieLens 100K data¹, which consists of 100,000 ratings in {1, ..., 5} from 943 users on 1682 movies. User age, gender and occupation are given, as well as the genre of the movies. We represent this side information with binary vectors for ω_1(u_i), ω_2(v_j) and use the formulation in (7) in Appendix F. We bin the age into five categories, and there are 20 occupations and 18 genres. Thus ω_1(u_i) ∈ R^{5+2+20} has 3 non-zero entries, one for each feature, and ω_2(v_j) ∈ R^{18} can have multiple non-zero entries, since each movie can belong to many genres. We report the mean and standard deviation of the test RMSE on the five 80:20 train/test splits that come with the data, as this offers a sensible means of comparison with other algorithms. N is too large for HMC, hence we use SGD to obtain MAP estimates for the parameters, and compare different configurations: learning W / fixing it to be the identity, and using / not using side information, along with BPMF initialised by PMF². We use mini-batches of size 100, and set r = 15 for all models as it gives the best results for PMF and BPMF. We used a grid search and cross-validation for tuning hyperparameters, the recommended method in big-N settings where the number of hyperparameters is not too large. SGD was not very sensitive to mini-batch size, and finding the range of hyperparameters was straightforward. See Appendix F for details.

¹ Obtained from http://grouplens.org/datasets/movielens/100k/

Table 1: Test RMSE on MovieLens 100K.

Model                        Test RMSE
BPMF                         0.9024 ± 0.0050
TGP, W = I (PMF)             0.9395 ± 0.0115
TGP, learn W                 0.9270 ± 0.0097
TGP, W = I, side-info        0.9014 ± 0.0061
TGP, learn W, side-info      0.8995 ± 0.0062

From Table 1 it is evident that TGP makes good use of side information, since the RMSE decreases significantly with side information.
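The binary side-information encoding and the combined feature vector φ_1(u_i) = [a_1 e_i; b_1 ω_1(u_i)] from Section 3 can be sketched as follows. The attribute indices and the values of a_1, b_1 are hypothetical placeholders, not the tuned hyperparameters used in the experiments:

```python
import numpy as np

n1 = 943                                 # number of users in MovieLens 100K
n_age_bins, n_genders, n_occupations = 5, 2, 20
a1, b1 = 1.0, 0.5                        # hypothetical kernel weights

def omega_user(age_bin, gender, occupation):
    """Binary side-information vector with one non-zero entry per attribute."""
    w = np.zeros(n_age_bins + n_genders + n_occupations)
    w[age_bin] = 1.0
    w[n_age_bins + gender] = 1.0
    w[n_age_bins + n_genders + occupation] = 1.0
    return w

def phi_user(i, age_bin, gender, occupation):
    """Combined feature phi_1(u_i) = [a_1 e_i ; b_1 omega_1(u_i)]."""
    e = np.zeros(n1)
    e[i] = 1.0
    return np.concatenate([a1 * e, b1 * omega_user(age_bin, gender, occupation)])

phi = phi_user(10, age_bin=2, gender=0, occupation=7)
```

Each user feature has exactly four non-zeros: the identity indicator plus one entry per attribute.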
Learning W instead of fixing it helps predictive performance, but does not perform as well as BPMF. One reason is that our Gaussian prior on W is not equivalent to the Gaussian-Wishart priors on the mean and variance of U_i, V_j in BPMF. Another reason is that we are resorting to a MAP estimate; if we could instead sample from the posterior and average predictions over these samples, we would expect enhanced predictions. However, note that using TGP with side information and learning W, we are able to get comparable or superior results to BPMF, even with a MAP estimate. We expect further improvements not only with sampling but also by using more sophisticated kernels that make better use of the side information; for example, using different hyperparameter coefficients for the different types of features. In so far as comparison was possible, these numbers are comparable to state-of-the-art algorithms in Section 4. A direct comparison was not possible as each uses different methods for evaluation.

Regression on spatial data. We use the California house prices data from the 1990 census³, which consists of average house prices for 20,640 different locations in California. We only use the covariates longitude and latitude, and whiten them along with the log-transformed house prices to each have zero mean and unit variance. We chose this data set as spatial data sometimes exhibit separability in the different dimensions. Moreover, the data is clustered in urban areas, hence an additive model with each component describing different sections of California may be desirable. Using a random 50:50 train/test split, we report the RMSE of the model on the training set and test set after training. We first fit a GP to the data with a squared exponential (SE) kernel on each dimension using the GPML toolbox [22], optimising the hyperparameters by type-II maximum likelihood. Then using these hyperparameters we generate random Fourier features for φ. See Appendices D and E for details.
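A standard random Fourier feature construction for the SE kernel, following [21]; the lengthscale and sizes below are arbitrary, and the paper's exact recipe is in Appendix E:

```python
import numpy as np

def rff(X, n_feat, lengthscale, rng):
    """Random Fourier features whose inner products approximate the SE kernel
    k(x, x') = exp(-||x - x'||^2 / (2 * lengthscale^2))."""
    d = X.shape[1]
    Wf = rng.normal(scale=1.0 / lengthscale, size=(d, n_feat))  # spectral frequencies
    b = rng.uniform(0.0, 2 * np.pi, size=n_feat)                # random phases
    return np.sqrt(2.0 / n_feat) * np.cos(X @ Wf + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))                 # e.g. whitened (longitude, latitude)
Phi = rff(X, n_feat=2000, lengthscale=0.7, rng=rng)

# Phi @ Phi.T approximates the exact SE Gram matrix.
K_approx = Phi @ Phi.T
K_exact = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1) / (2 * 0.7 ** 2))
```

The approximation error decays as O(1/sqrt(n_feat)), so a few thousand features already give a close match on a small input set.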
We implemented both the full-rank model and TGP with n = 25, 50, 100, 200 in Stan [29], which uses HMC with the No-U-Turn Sampler (NUTS) [10] for inference. Note that for both models n refers to the length of the features φ_d(x). For TGP, we use 300 warmup iterations and a further 300 samples on 4 different chains, and use the mean prediction across the samples. For full-rank, we take the same number of samples and chains, but only use 50 warmup draws, as we diagnosed that convergence was reached by this point (looking at the Gelman-Rubin statistic [8] and effective sample size). The convergence statistics for TGP are in Appendix G.

We can see from Figure 3 that some TGP models give lower test RMSE and higher train RMSE than the GP and the full-rank model. In fact TGP with r = 5 consistently shows higher predictive performance than full-rank for all values of n, and for n ≥ 100 TGP with r = 10 outperforms the GP. This indicates that TGP is an effective regulariser towards simpler regression functions, namely a linear combination of separable functions. We expect bigger gains for TGP with more warmup iterations, since the convergence diagnostics suggest that TGP has not quite fully mixed by 300 iterations.

² Code obtained from http://www.cs.toronto.edu/~rsalakhu/BPMF.html
³ Obtained from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/cadata

[Figure 3: (a) Train RMSE and (b) test RMSE for GP, full-rank, and TGP with r = 2, 5, 10 and n = 25, 50, 100, 200 on the California house price data.]
[Figure 4: (a) Top: heatmap of true log house price values; bottom: TGP predictions for r = 2, n = 200. (b) Additive components for TGP.]

We further investigate the predictions of TGP by analysing the additive components in the prediction for r = 2. We see in Figure 4b that the components are quite different. The upper two components
show complementary predictions in the Bay area (North-West) and the central area, whereas the bottom two show complementary predictions in the Los Angeles area (South-East). This confirms the hypothesis that the different additive components will learn different sections of the data. See Appendix G for zoomed-in plots, and Appendix H for experimental results on spatio-temporal data with grid structure.

[Figure 4(b) detail: heatmaps of the four additive components (summands in (6)) of the predictions for TGP with r = 2, n = 200. We only use the last sample in the Markov chain, to get a better indication of the structure. Red indicates high log price and blue indicates low, and the same colour scheme is applied to all four subplots. To accentuate the differences in the predictive values, we colour values by the percentile they belong to, instead of a uniform colouring. See Appendix G for the uniform colouring.]

6 Conclusion

We have introduced the Tucker Gaussian Process (TGP), a regression model that regularises a GP towards simpler regression functions, in particular a linear combination of separable functions. We motivate it as a solution to collaborative filtering (CF), by using feature maps and a low-rank Tucker decomposition on the parameters in the weight-space view of GPs. In particular, we have highlighted the effectiveness of TGP in CF with side information, a domain where outputs can be effectively modelled as a linear combination of functions separable in the covariates. We believe that this is the largest contribution of our paper: after showing that PMF and BPMF are special cases of the TGP, we extended it to exploit the user/item side information for better predictions; the kernel of the GP we approximate can be designed to encode similarities between different users and items, a particularly neat and natural method for modelling similarity.
In doing so, we bring together matrix factorisation methods and GP methods in CF , as well as scaling up GP methods in CF . W e confirm experimentally that side information enhances the predictiv e performance of TGP in collaborati ve filtering. W e hav e also shown that for problems where one might e xpect separability in the cov ariates such as prediction for spatio-temporal data sets, the TGP ef fecti vely controls o verfitting and outperforms GPs in prediction. W e also point out that exact Cholesky features can be used with TGP in the case of grid-structured data, and random feature maps can be used for arbitrary kernels. Acknowledgments HK, XL, SF and YWT’ s research leading to these results has recei v ed funding from the European Research Council under the European Union’ s Sev enth Framework Programme (FP7/2007-2013) ERC grant agreement no. 617071. References [1] D. Agarwal and B. Chen. Regression-based latent factor models. In ACM SIGKDD , 2009. [2] G. Beylkin, J. Garck e, and M. Mohlenkamp. Multiv ariate regression and machine learning with sums of separable functions. SIAM Journal on Scientific Computing , 2009. [3] D. Billsus and M. Pazzani. Learning collaborative information filters. In ICML , 1998. [4] R. Bro. P ARAF A C. T utorial and applications. Chemometrics and Intelligent Laboratory Systems , 1997. [5] H. Chipman, E. George, and R. McCulloch. BAR T: Bayesian additiv e regression trees. The Annals of Applied Statistics , 2010. [6] P . Drineas and M. Mahoney . On the Nyström method for approximating a Gram matrix for improved kernel-based learning. JMLR , 2005. [7] S. Duane, A. Kennedy , B.Pendleton, and D.Roweth. Hybrid Monte Carlo. Physics letters B , 1987. [8] A. Gelman and D. Rubin. Inference from iterativ e simulation using multiple sequences. Statistical Science , 1992. [9] P . Hall and C. Heyde. Martingale Limit Theory and its Application . Academic press, 2014. [10] M. Hoffman and A. Gelman. 
The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. The Journal of Machine Learning Research, 2014.
[11] Y. Koren. The BellKor solution to the Netflix Grand Prize. Netflix Prize documentation, 81:1–10, 2009.
[12] N. Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. In NIPS, 2004.
[13] N. Lawrence and R. Urtasun. Non-linear matrix factorization with Gaussian processes. In ICML, 2009.
[14] Y. Ma, T. Chen, and E. Fox. A complete recipe for stochastic gradient MCMC. In NIPS, 2015.
[15] M. Imaizumi and K. Hayashi. Doubly decomposing nonparametric tensor regression. In ICML, 2016.
[16] R. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2011.
[17] S. Park, Y. Kim, and S. Choi. Hierarchical Bayesian matrix factorization with side information. In IJCAI, 2013.
[18] M. Piotte and M. Chabbert. The Pragmatic Theory solution to the Netflix Grand Prize. Netflix Prize documentation, 2009.
[19] I. Porteous and M. Welling. Bayesian matrix factorization with side information and Dirichlet process mixtures. In AAAI, 2010.
[20] Q. Zhao, L. Zhang, and A. Cichocki. A tensor-variate Gaussian process for classification of multidimensional structured data. In AAAI, 2013.
[21] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007.
[22] C. Rasmussen and H. Nickisch. Gaussian Processes for Machine Learning (GPML) toolbox, 2010.
[23] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2005.
[24] W. Rudin. Fourier Analysis on Groups. AMS, 1964.
[25] Y. Saatçi. Scalable Inference for Structured Gaussian Process Models. PhD thesis, University of Cambridge, 2011.
[26] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In ICML, 2008.
[27] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In NIPS, 2008.
[28] A.
Singh and G. Gordon. Relational learning via collective matrix factorization. In ACM SIGKDD, 2008.
[29] Stan Development Team. Stan: A C++ library for probability and sampling, version 1.0, 2012.
[30] S. Zhe, K. Zhang, P. Wang, K. Lee, Z. Xu, Y. Qi, and Z. Ghahramani. Distributed flexible nonlinear tensor factorization. In NIPS, 2016.
[31] S. Zhe, Y. Qi, Y. Park, Z. Xu, I. Molloy, and S. Chari. DinTucker: Scaling up Gaussian process models on large multidimensional arrays. In AAAI, 2016.
[32] A. Töscher, M. Jahrer, and R. M. Bell. The BigChaos solution to the Netflix Grand Prize. Netflix Prize documentation, pages 1–52, 2009.
[33] T. Suzuki, H. Kanagawa, H. Kobayashi, N. Shimizu, and Y. Tagami. Minimax optimal alternating minimization for kernel nonparametric tensor learning. In NIPS, 2016.
[34] L. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 1966.
[35] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg. Feature hashing for large scale multitask learning. In ICML, 2009.
[36] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.
[37] C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In NIPS, 2001.
[38] Z. Xu, F. Yan, and Y. Qi. Infinite Tucker decomposition: Nonparametric Bayesian models for multiway data analysis. In ICML, 2012.
[39] K. Yu, W. Chu, S. Yu, V. Tresp, and Z. Xu. Stochastic relational models for discriminative link prediction. In NIPS, 2006.

Appendix

A Learning Algorithms for TGP

We give detailed derivations of various inference algorithms for the TGP. We have a set of $N$ observations $y_i \in \mathbb{R}$ corresponding to a set of inputs $x_i \in \mathcal{X}$, and we wish to regress $\mathbf{y} = (y_i)_{i=1}^N$ on $X = (x_i)_{i=1}^N$.
We assume that the data generating mechanism takes the form
$$\mathbf{y} = f(X) + \epsilon, \qquad \epsilon \sim N(0, \sigma^2 I_N)$$
where $f(X) = (f(x_i))_{i=1}^N \in \mathbb{R}^N$, and also that the regression function takes the following form:
$$f(x) = \mathbf{w}^\top \bigotimes_{d=1}^D \big( U^{(d)\top} \phi_d(x) \big)$$
where

- $W \in \mathbb{R}^{r \times \dots \times r}$ is a $D$-dimensional tensor whose entries are iid $N(0, \sigma_w^2)$;
- $\mathbf{w} = \mathrm{vec}(W)$ is the vector obtained when flattening the tensor $W$, such that $W \times_{d=1}^D v_d = \mathbf{w}^\top \bigotimes_{d=1}^D v_d$ for all $v_d \in \mathbb{R}^r$;
- $(\phi_d(x))_{d=1}^D$ are the features in $\mathbb{R}^n$ extracted from $x$;
- $(U^{(d)})_{d=1}^D$ are a set of real $n \times r$ matrices with $U^{(d)}_{jl} \overset{iid}{\sim} N(0, \sigma_u^2)$.

We assume $n > r$, and wish to learn $\mathbf{w}$ and the $U^{(d)}$ from the data. Note from the second point that $\nabla_{\mathbf{w}} \big( W \times_{d=1}^D v_d \big) = \bigotimes_{d=1}^D v_d$. For $D = 2$, for example, if $g(U) = s^\top U t$ for some matrix $U$ and vectors $s, t$, then $\nabla_{\mathbf{u}} g(U) = s \otimes t$ where $\mathbf{u} = \mathrm{vec}(U)$.

First we give the complexity for calculating $f(x)$. Computing $\psi_d(x_i) = U^{(d)\top} \phi_d(x_i)$ for all $d$ requires $O(nrD)$ time, then $\mathbf{w}^\top \bigotimes_{d=1}^D \psi_d(x_i)$ takes $O(r^D)$ time. So a prediction given $\phi, U, \mathbf{w}$ takes $O(nrD + r^D)$ time.

The quantity of interest for MAP and HMC is the log joint distribution $p(\mathbf{y}, U, \mathbf{w}) = p(\mathbf{y} \mid U, \mathbf{w}) \, p(U) \, p(\mathbf{w})$. In full, up to additive constants, this is:
$$\log p(\mathbf{y} \mid U, \mathbf{w}) + \log p(U) + \log p(\mathbf{w}) = -\frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - f(x_i))^2 - \frac{1}{2\sigma_u^2} \sum_{k=1}^D \mathrm{tr}(U^{(k)\top} U^{(k)}) - \frac{1}{2\sigma_w^2} \mathbf{w}^\top \mathbf{w}.$$
This has the following derivatives:
$$\nabla_{\mathbf{w}} \log p(\mathbf{w}) = -\frac{1}{\sigma_w^2}\mathbf{w}, \qquad \nabla_{U^{(k)}} \log p(U) = -\frac{1}{\sigma_u^2}U^{(k)}$$
$$\nabla_{\mathbf{w}} \log p(y_i \mid U, \mathbf{w}) = \frac{1}{\sigma^2}(y_i - f(x_i)) \bigotimes_{d=1}^D \psi_d(x_i)$$
$$\nabla_{\mathbf{u}^{(k)}} \log p(y_i \mid U, \mathbf{w}) = \frac{1}{\sigma^2}(y_i - f(x_i)) \, \phi_k(x_i) \otimes \big( W \times_{d \neq k} \psi_d(x_i) \big)$$
with the following definitions:

- $\mathbf{u}^{(k)} = \mathrm{vec}(U^{(k)}) \in \mathbb{R}^{nr}$;
- $(W \times_{d \neq k} v_d)_l := W \times_{d=1}^D v'_d$ where $v'_d = v_d$ for $d \neq k$ and $v'_k = e_l \in \mathbb{R}^r$, the unit vector with its non-zero at the $l$th entry.

The last derivative holds since $f(x) = \phi_k(x)^\top U^{(k)} \big( W \times_{d \neq k} \psi_d(x) \big)$.
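As a concrete illustration of the prediction cost analysis above, $f(x)$ can be computed by first forming each $\psi_d$ and then contracting the core tensor $W$ with the $\psi_d$ one mode at a time. A minimal numpy sketch (function and variable names are ours, not from the paper's code):

```python
import numpy as np

def tgp_predict(W, U, phi):
    """Evaluate f(x) = w^T (x)_d (U^(d)T phi_d(x)) for a single input.

    W   : (r, ..., r) core tensor with D modes
    U   : list of D matrices, each of shape (n, r)
    phi : list of D feature vectors, each of length n
    """
    # psi_d = U^(d)T phi_d(x), each of length r: O(n r D) total
    psis = [Ud.T @ pd for Ud, pd in zip(U, phi)]
    # contract the core tensor with each psi_d in turn: O(r^D)
    out = W
    for psi in psis:
        out = np.tensordot(out, psi, axes=([0], [0]))
    return float(out)

# check against the explicit D = 2 formula f = psi_1^T W psi_2
rng = np.random.default_rng(0)
n, r = 5, 3
W = rng.normal(size=(r, r))
U = [rng.normal(size=(n, r)) for _ in range(2)]
phi = [rng.normal(size=n) for _ in range(2)]
psi1, psi2 = U[0].T @ phi[0], U[1].T @ phi[1]
assert np.isclose(tgp_predict(W, U, phi), psi1 @ W @ psi2)
```

Each `tensordot` call removes one mode of the core, so after $D$ contractions a scalar remains, matching the $O(nrD + r^D)$ cost stated above.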
Computing $W \times_{d \neq k} \psi_d(x_i)$ takes $O(r^D)$ for each $k$, hence $O(r^D D)$ over all $k$. So, using mini-batches $\{x_{t_1}, \dots, x_{t_m}\}$ for SGD, we have the following updates for MAP:
$$\mathbf{w} \leftarrow \mathbf{w} + \frac{\epsilon_w}{2} \Big( \nabla_{\mathbf{w}} \log p(\mathbf{w}) + \frac{N}{m} \sum_{i=1}^m \nabla_{\mathbf{w}} \log p(y_{t_i} \mid x_{t_i}, \mathbf{w}, U) \Big)$$
$$\mathbf{u}^{(k)} \leftarrow \mathbf{u}^{(k)} + \frac{\epsilon_k}{2} \Big( \nabla_{\mathbf{u}^{(k)}} \log p(U) + \frac{N}{m} \sum_{i=1}^m \nabla_{\mathbf{u}^{(k)}} \log p(y_{t_i} \mid x_{t_i}, \mathbf{w}, U) \Big)$$
with time complexity $O(m(nrD + r^D D))$.

Gathering the parameters into a vector $\theta = (\mathbf{w}, U^{(1)}, \dots, U^{(D)})$, the HMC algorithm runs as follows:

1. Initialise the Markov chain by drawing $\theta_0 = (\mathbf{w}_0, U^{(1)}_0, \dots, U^{(D)}_0)$ from its prior distribution.
2. For $t = 0, \dots, T$:
   (a) Initialise momenta $p \sim N(0, I_Q)$ and $V^{(k)} \sim N(0, I_{n \times r})$ for all $k$; set
       $H_t \leftarrow -\log p(\mathbf{w}_t) - \sum_{i=1}^N \log p(y_i \mid x_i, \theta_t) + \frac{1}{2} \sum_{k=1}^D \mathrm{tr}(V^{(k)\top} V^{(k)}) + \frac{1}{2} p^\top p$
       and $\theta = (\mathbf{w}, U^{(1)}, \dots, U^{(D)}) \leftarrow \theta_t = (\mathbf{w}_t, U^{(1)}_t, \dots, U^{(D)}_t)$.
   (b) For $l = 1, \dots, L$:
       i. $p \leftarrow p + \frac{\epsilon_w}{2} \big( \nabla_{\mathbf{w}} \log p(\mathbf{w}) + \sum_{i=1}^N \nabla_{\mathbf{w}} \log p(y_i \mid x_i, \theta) \big)$;
          for $k = 1, \dots, D$: $V^{(k)} \leftarrow V^{(k)} + \frac{\epsilon_k}{2} \sum_{i=1}^N \nabla_{U^{(k)}} \log p(y_i \mid x_i, \theta)$.
       ii. $\mathbf{w} \leftarrow \mathbf{w} + \epsilon_w p$; for $k = 1, \dots, D$: $\mathbf{u}^{(k)} \leftarrow \mathbf{u}^{(k)} + \epsilon_k \, \mathrm{vec}(V^{(k)})$.
       iii. Same as i.
   (c) $H^* \leftarrow -\log p(\mathbf{w}) - \sum_{i=1}^N \log p(y_i \mid x_i, \theta) + \frac{1}{2} \sum_{k=1}^D \mathrm{tr}(V^{(k)\top} V^{(k)}) + \frac{1}{2} p^\top p$;
       draw $u \sim \mathrm{Unif}[0, 1]$; if $u \leq \exp(H_t - H^*)$, set $\theta_{t+1} = (\mathbf{w}_{t+1}, U^{(1)}_{t+1}, \dots, U^{(D)}_{t+1}) \leftarrow \theta$; else $\theta_{t+1} \leftarrow \theta_t$.

From the previous computations, it is easy to see that each update requires $O(LN(nrD + r^D D))$ operations.

B Elementwise convergence of TGP to N(0, 1)

Definition B.1 (Martingale difference sequence). A martingale difference sequence with respect to a filtration $(\mathcal{F}_p)_{p \in \{0, 1, \dots, r\}}$ is a real-valued sequence of random variables $X_1, \dots, X_r$ that satisfies:
1. $X_p$ is $\mathcal{F}_p$-measurable;
2. $E(|X_p|) < \infty$;
3. $E(X_p \mid \mathcal{F}_{p-1}) = 0$ a.s.
for all $p \in \{1, \dots, r\}$.
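The inner loop (b) above is the standard leapfrog integrator, which for a small step size nearly conserves the Hamiltonian $H$, so that most proposals are accepted in step (c). This can be sanity-checked on a toy standard-normal target rather than the TGP posterior (a sketch; the target and all settings are our choices for illustration):

```python
import numpy as np

def leapfrog(theta, p, grad_logp, eps, L):
    """Leapfrog integration of Hamiltonian dynamics (steps (b)i-iii)."""
    p = p + 0.5 * eps * grad_logp(theta)      # initial half step on momentum
    for _ in range(L - 1):
        theta = theta + eps * p               # full step on position
        p = p + eps * grad_logp(theta)        # full step on momentum
    theta = theta + eps * p
    p = p + 0.5 * eps * grad_logp(theta)      # final half step on momentum
    return theta, p

# toy target: standard normal, log p(theta) = -theta^T theta / 2 + const
grad_logp = lambda th: -th
rng = np.random.default_rng(0)
theta0, p0 = rng.normal(size=3), rng.normal(size=3)
H0 = 0.5 * theta0 @ theta0 + 0.5 * p0 @ p0    # H = -log p(theta) + p^T p / 2
theta1, p1 = leapfrog(theta0, p0, grad_logp, eps=0.01, L=100)
H1 = 0.5 * theta1 @ theta1 + 0.5 * p1 @ p1
assert abs(H1 - H0) < 1e-3                    # leapfrog energy error is O(eps^2)
```

In the TGP algorithm, `theta` is the concatenation of $\mathbf{w}$ and the $\mathbf{u}^{(k)}$, and `grad_logp` is the sum of the prior and likelihood gradients derived in Appendix A.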
Theorem 1 (Martingale central limit theorem [9]). Let $X = \{X_1, \dots, X_r\}$ be a sequence of random variables satisfying the following conditions:
1. $X$ is a martingale difference sequence with respect to a filtration $(\mathcal{F}_p)_{p \in \{0, 1, \dots, r\}}$;
2. $\sum_{p=1}^r E(X_p^2 \mid \mathcal{F}_{p-1}) \overset{p}{\to} 1$ as $r \to \infty$;
3. $\sum_{p=1}^r E(X_p^2 I(|X_p| > \epsilon) \mid \mathcal{F}_{p-1}) \overset{p}{\to} 0$ as $r \to \infty$, for all $\epsilon > 0$.
Then the sums $S_r = \sum_{p=1}^r X_p \overset{d}{\to} N(0, 1)$ as $r \to \infty$.

Proposition 1. Let the $n \times r$ matrices $U^{(d)} \overset{iid}{\sim} N(0, \frac{1}{r} I)$ for $d = 1, \dots, D$, and let $W \sim N(0, I)$, where $W \in \mathbb{R}^{r \times \dots \times r}$ is a $D$-dimensional tensor. Then each element of $W \times_{d=1}^D U^{(d)\top}$ converges in distribution to $N(0, 1)$ as $r \to \infty$.

Proof. Suppose first that $D = 2$. It suffices to show that for $u, v \in \mathbb{R}^r$ with $u, v \overset{iid}{\sim} N(0, \frac{1}{r} I)$ and $W \in \mathbb{R}^{r \times r}$ with $W \sim N(0, I)$, we have $u^\top W v \overset{d}{\to} N(0, 1)$ as $r \to \infty$. We define, for each $r \in \mathbb{N}$:
$$S_0 = 0, \qquad S_p := \sum_{i,j=1}^p u_i W_{ij} v_j, \qquad X_p := S_p - S_{p-1} = u_p W_{pp} v_p + \sum_{i=1}^{p-1} \big( u_p W_{pi} v_i + u_i W_{ip} v_p \big)$$
$$\mathcal{F}_0 := \{\emptyset, \Omega\}, \text{ where } \Omega \text{ is the sample space for the random variables } u, v, W$$
$$\mathcal{F}_p := \sigma\big( (u_i, v_j, W_{ij})_{i,j=1}^p \big), \text{ the sigma algebra generated by these random variables, for } p \in \{1, \dots, r\}.$$
So we have that $S_r = u^\top W v$, hence it suffices to check conditions 1, 2, 3 of the martingale CLT.

We first show condition 1, that $X$ is a martingale difference sequence. It is clear that $X_p$ is $\mathcal{F}_p$-measurable by definition of $\mathcal{F}_p$. To show that $X_p$ is integrable, we have:
$$E(|X_p|) \leq E(|u_p W_{pp} v_p|) + \sum_{i=1}^{p-1} \big( E(|u_p W_{pi} v_i|) + E(|u_i W_{ip} v_p|) \big) \leq \sqrt{E(u_p^2 W_{pp}^2 v_p^2)} + \sum_{i=1}^{p-1} \Big( \sqrt{E(u_p^2 W_{pi}^2 v_i^2)} + \sqrt{E(u_i^2 W_{ip}^2 v_p^2)} \Big) = \frac{1}{r} + (p-1)\Big(\frac{1}{r} + \frac{1}{r}\Big) < \infty$$
by the inequality $E(|X|)^2 \leq E(X^2)$ (shown using convexity of $g: x \mapsto x^2$ and Jensen's inequality) and independence of $u, v, W$.
We also have:
$$E(X_p \mid \mathcal{F}_{p-1}) = E(u_p W_{pp} v_p) + \sum_{i=1}^{p-1} \big( E(u_p W_{pi}) v_i + u_i E(W_{ip} v_p) \big) = 0$$
since $u_p, v_p, W_{pi}, W_{ip}$ are independent of $\mathcal{F}_{p-1}$ and have zero mean. Hence $X$ forms a martingale difference sequence.

To verify the next two conditions, we first prove a lemma that will help us do so. This is the generalisation of Chebyshev's inequality to higher moments:

Lemma 2. Suppose $X$ is a random variable with bounded $n$th moment for some $n \in \mathbb{N}$. Then
$$P(|X - E(X)| > \epsilon) \leq \frac{E(|X - E(X)|^n)}{\epsilon^n} \quad \forall \epsilon > 0.$$

Proof. Without loss of generality, assume $E(X) = 0$. Then
$$P(|X| > \epsilon) = E[I(|X| > \epsilon)] = \frac{1}{\epsilon^n} E[\epsilon^n I(|X| > \epsilon)] \leq \frac{1}{\epsilon^n} E[|X|^n I(|X| > \epsilon)] \leq \frac{E[|X|^n]}{\epsilon^n}.$$

Note that Lemma 2 shows that convergence in $L^n$ implies convergence in probability. So to show conditions 2 and 3 of the martingale CLT, it suffices to show that the expectations of the quantities on the left hand sides converge to the right hand sides as scalars:

2'. $\sum_{p=1}^r E(X_p^2) \to 1$ as $r \to \infty$.
3'. $\sum_{p=1}^r E(X_p^2 I(|X_p| > \epsilon)) \to 0$ as $r \to \infty$, for all $\epsilon > 0$.

Let us show 2'. Note that all cross terms in $E(X_p^2)$ cancel since all terms have mean 0. So we have:
$$E(X_p^2) = E(u_p^2 W_{pp}^2 v_p^2) + \sum_{i=1}^{p-1} \big( E(u_p^2 W_{pi}^2 v_i^2) + E(u_i^2 W_{ip}^2 v_p^2) \big) = \frac{1}{r^2} + (p-1)\Big(\frac{1}{r^2} + \frac{1}{r^2}\Big) = \frac{2p-1}{r^2}$$
$$\Rightarrow \sum_{p=1}^r E(X_p^2) = \frac{2}{r^2} \sum_{p=1}^r p - r \cdot \frac{1}{r^2} = \frac{2}{r^2} \cdot \frac{(r+1)r}{2} - \frac{1}{r} = 1.$$

To show 3', we first note that for a random variable $X$,
$$\int_\delta^\infty I(X > t) \, dt = (X - \delta) I(X > \delta) \quad \text{for } \delta \in \mathbb{R}.$$
Setting $X = X_p^2$, $\delta = \epsilon^2$ and rearranging, we have:
$$X_p^2 I(|X_p| > \epsilon) = X_p^2 I(X_p^2 > \epsilon^2) = \epsilon^2 I(X_p^2 > \epsilon^2) + \int_{\epsilon^2}^\infty I(X_p^2 > t) \, dt = \epsilon^2 I(|X_p| > \epsilon) + \int_\epsilon^\infty 2s \, I(|X_p| > s) \, ds$$
by the change of variables $t = s^2$, so
$$E[X_p^2 I(|X_p| > \epsilon)] = \epsilon^2 P(|X_p| > \epsilon) + \int_\epsilon^\infty 2s \, P(|X_p| > s) \, ds.$$
Now we would like to use Lemma 2 to upper bound the right hand side.
Note that we want to use an even $n$ so that $E[|X|^n] = E(X^n)$, since we know how to compute $E(X_p^n)$ but not $E[|X_p|^n]$. Also note that $P(|X_p| > s)$ can be bounded by $E(X_p^n)/s^n$. So we want $n > 2$ for the bound on the integral to be finite. Hence we use $n = 4$, and show that $E(X_p^4)$ is sufficiently small that even when we sum over $p = 1, \dots, r$, the upper bound tends to 0 as $r \to \infty$.

First we compute $E(X_p^4)$. Recall the multinomial theorem:
$$(x_1 + x_2 + \dots + x_m)^n = \sum_{k_1 + k_2 + \dots + k_m = n} \binom{n}{k_1, k_2, \dots, k_m} \prod_{1 \leq t \leq m} x_t^{k_t}, \qquad \binom{n}{k_1, k_2, \dots, k_m} = \frac{n!}{k_1! \, k_2! \cdots k_m!}.$$
Applying this to $X_p^4$ and taking the expectation, we see that the only cross terms that survive are products of even powers of the terms, namely where two of the $k_i$ are 2 and the rest are 0. Noting that $\binom{4}{2,2} = 6$, and that $E(X^4) = 3\sigma^4$ for $X \sim N(0, \sigma^2)$, we have:
$$E[X_p^4] = E\Big[ u_p^4 W_{pp}^4 v_p^4 + \sum_{i=1}^{p-1} \big( u_p^4 W_{pi}^4 v_i^4 + u_i^4 W_{ip}^4 v_p^4 \big) \Big] + 6 E\Big[ (u_p^2 W_{pp}^2 v_p^2) \sum_{i=1}^{p-1} \big( u_p^2 W_{pi}^2 v_i^2 + u_i^2 W_{ip}^2 v_p^2 \big) \Big] + 6 E\Big[ \sum_{i \neq j}^{p-1} \big( u_p^2 W_{pi}^2 v_i^2 \, u_p^2 W_{pj}^2 v_j^2 + u_i^2 W_{ip}^2 v_p^2 \, u_j^2 W_{jp}^2 v_p^2 \big) \Big] + 6 E\Big[ \sum_{i,j=1}^{p-1} u_p^2 W_{pi}^2 v_i^2 \, u_j^2 W_{jp}^2 v_p^2 \Big]$$
$$= (2p-1) \cdot \frac{3}{r^2} \cdot 3 \cdot \frac{3}{r^2} + 6 \cdot 2(p-1) \cdot \frac{3}{r^2} \cdot \frac{1}{r} \cdot \frac{1}{r} + 6 \cdot 2 \binom{p-1}{2} \frac{3}{r^2} \cdot \frac{1}{r^2} + 6(p-1)^2 \frac{1}{r^2} \cdot \frac{1}{r^2} = \frac{3}{r^4}(8p^2 + 8p - 7).$$
So $P(|X_p| > \epsilon) \leq \frac{3}{\epsilon^4 r^4}(8p^2 + 8p - 7)$. Hence
$$E[X_p^2 I(|X_p| > \epsilon)] \leq \frac{3}{\epsilon^2 r^4}(8p^2 + 8p - 7) + \int_\epsilon^\infty 2s \cdot \frac{3}{s^4 r^4}(8p^2 + 8p - 7) \, ds = \frac{3}{r^4}(8p^2 + 8p - 7) \Big( \frac{1}{\epsilon^2} + \int_\epsilon^\infty \frac{2}{s^3} \, ds \Big) = \frac{3C}{r^4}(8p^2 + 8p - 7)$$
where $C := \frac{1}{\epsilon^2} + \int_\epsilon^\infty \frac{2}{s^3} \, ds = \frac{2}{\epsilon^2}$ is a constant depending only on $\epsilon$. So
$$\sum_{p=1}^r E[X_p^2 I(|X_p| > \epsilon)] \leq \frac{3C}{r^4} \sum_{p=1}^r (8p^2 + 8p - 7) = O\Big(\frac{1}{r}\Big) \to 0 \text{ as } r \to \infty$$
since $\sum_{p=1}^r (8p^2 + 8p - 7) = O(r^3)$.
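Proposition 1 can also be checked empirically for $D = 2$: sampling $u, v \sim N(0, \frac{1}{r}I)$ and $W \sim N(0, I)$ many times for a moderately large $r$, the statistic $u^\top W v$ should have mean close to 0 and variance close to 1. A quick numerical check (not part of the proof; sample sizes are our choices):

```python
import numpy as np

# Empirical check of Proposition 1 for D = 2: u^T W v approx N(0, 1) for large r.
rng = np.random.default_rng(0)
r, n_samples = 100, 5000
samples = np.empty(n_samples)
for s in range(n_samples):
    u = rng.normal(scale=1 / np.sqrt(r), size=r)   # u ~ N(0, I/r)
    v = rng.normal(scale=1 / np.sqrt(r), size=r)   # v ~ N(0, I/r)
    W = rng.normal(size=(r, r))                    # W ~ N(0, I)
    samples[s] = u @ W @ v
assert abs(samples.mean()) < 0.05      # mean should be near 0
assert abs(samples.var() - 1.0) < 0.1  # variance should be near 1
```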
So we have shown conditions 1, 2', 3'; hence by the martingale CLT we have
$$S_r = u^\top W v \overset{d}{\to} N(0, 1) \text{ as } r \to \infty.$$
We can prove the claim for $D > 2$ in a similar fashion.

C Feature Hashing

Suppose we have features $\phi(x) \in \mathbb{R}^n$. When $n$ is too large, we may use feature hashing [35] to reduce the dimensionality of $\phi$:

Lemma 3. Let $h: \{1, \dots, n\} \to \{1, \dots, m\}$ be a hash function for $m \ll n$, i.e. $P(h(i) = j) = \frac{1}{m}$ for all $j \in \{1, \dots, m\}$. Also let $\xi: \{1, \dots, n\} \to \{\pm 1\}$ be a hash function. Define $\bar{\phi}(x) \in \mathbb{R}^m$ by $\bar{\phi}_j(x) = \sum_{i: h(i) = j} \xi(i) \phi_i(x)$. Then
$$E[\bar{\phi}(x)^\top \bar{\phi}(x')] = \phi(x)^\top \phi(x'), \qquad \mathrm{Var}[\bar{\phi}(x)^\top \bar{\phi}(x')] = O\Big(\frac{1}{m}\Big).$$

D Random Fourier Features

Theorem 4 (Bochner's theorem [24]). A stationary kernel $k(d)$ is positive definite if and only if $k(d)$ is the Fourier transform of a non-negative measure.

For RFF, the kernel can be approximated by the inner product of random features given by samples from its spectral density, in a Monte Carlo approximation, as follows:
$$k(x - y) = \int_{\mathbb{R}^D} e^{iv^\top(x-y)} \, dP(v) \propto \int_{\mathbb{R}^D} p(v) \, e^{iv^\top(x-y)} \, dv = E_{p(v)}\big[ e^{iv^\top x} (e^{iv^\top y})^* \big] = E_{p(v)}\big[ \mathrm{Re}\big( e^{iv^\top x} (e^{iv^\top y})^* \big) \big] \approx \frac{1}{n} \sum_{k=1}^n \mathrm{Re}\big( e^{iv_k^\top x} (e^{iv_k^\top y})^* \big) = E_b[\phi(x)^\top \phi(y)]$$
where $\phi(x) = \sqrt{\frac{2}{n}} \big( \cos(v_1^\top x + b_1), \dots, \cos(v_n^\top x + b_n) \big)$, with the spectral frequencies $v_k$ iid samples from $p(v)$ and the $b_k$ iid samples from $\mathrm{Unif}[0, 2\pi]$.

For a one-dimensional squared exponential kernel $k(x, y) = \sigma_f^2 \exp\big( -\frac{(x-y)^2}{2l^2} \big)$, the spectral density is $N(0, l^{-2})$. So we use features $\phi(x) = \sigma_f \sqrt{\frac{2}{n}} \big( \cos(v_1^\top x + b_1), \dots, \cos(v_n^\top x + b_n) \big)$ where the $v_k$ are iid samples from $N(0, l^{-2})$ and the $b_k$ are iid samples from $\mathrm{Unif}[0, 2\pi]$.
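A minimal numpy sketch of these squared-exponential RFF (function and variable names are ours): for a large enough number of features, $\phi(x)^\top \phi(y)$ should be close to the exact kernel value.

```python
import numpy as np

def se_rff(X, n_feat, lengthscale=1.0, sigma_f=1.0, rng=None):
    """Random Fourier features for the 1D squared exponential kernel.

    Returns a (len(X), n_feat) feature matrix Phi, with Phi @ Phi.T
    approximating the SE Gram matrix on X.
    """
    rng = rng or np.random.default_rng()
    v = rng.normal(0.0, 1.0 / lengthscale, size=n_feat)  # spectral frequencies ~ N(0, l^-2)
    b = rng.uniform(0.0, 2 * np.pi, size=n_feat)         # random phases ~ Unif[0, 2pi]
    return sigma_f * np.sqrt(2.0 / n_feat) * np.cos(np.outer(X, v) + b)

rng = np.random.default_rng(0)
x, y, l = 0.3, 1.1, 1.0
Phi = se_rff(np.array([x, y]), n_feat=5000, lengthscale=l, rng=rng)
approx = Phi[0] @ Phi[1]
exact = np.exp(-(x - y) ** 2 / (2 * l ** 2))
assert abs(approx - exact) < 0.05  # Monte Carlo error shrinks as O(1/sqrt(n_feat))
```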
E Choice of Feature Map

Cholesky features. Consider data with inputs lying on a $D$-dimensional grid: $x_i \in \mathcal{X} = \times_{d=1}^D \mathcal{X}^{(d)}$ with $|\mathcal{X}^{(d)}| = n_d$ finite, where $k_d(x_i, x_j)$ only depends on the values that $x_i, x_j$ take in $\mathcal{X}^{(d)}$. The $\mathcal{X}^{(d)}$ can be, for example, a finite set of points in Euclidean space, or the set of values a categorical variable can take. Then the Gram matrix $K$, containing the values of the kernel evaluated at each pair of points on the full grid, can be written as $K = \otimes_{d=1}^D K^{(d)}$, a Kronecker product of the Gram matrices $K^{(d)} \in \mathbb{R}^{n_d \times n_d}$ on each dimension [25]. The same holds for the Cholesky factor $L$ with $K = LL^\top$: we have $L = \otimes_{d=1}^D L^{(d)}$ where $K^{(d)} = L^{(d)} L^{(d)\top} \in \mathbb{R}^{n_d \times n_d}$. Then we define $\phi_d(x_i)$ to be the $i$th row of $L^{(d)}$, so that $k_d(x_i, x_j) = K^{(d)}_{ij} = \phi_d(x_i)^\top \phi_d(x_j)$. In general a Cholesky decomposition for an $m \times m$ matrix takes $O(m^3)$ to compute, so the $\phi_d(x_i)$ for $i = 1, \dots, N$ require $O(n_d^3)$ to compute in total. Hence the computation of features is feasible even for large $N$, as long as the $n_d$ are reasonably small.

Random feature maps. In most cases the data does not lie on a grid, nor can $k_d$ be expressed as the inner product of finite feature vectors. In this case we can use random feature maps $\phi_d: \mathcal{X} \to \mathbb{R}^n$ with $E[\phi_d(x)^\top \phi_d(x')] = k_d(x, x')$. An example is random Fourier features (RFF) [21] for stationary kernels, where $\mathrm{Var}[\phi_d(x)^\top \phi_d(x')] = O(\frac{1}{n})$. So we are introducing a further approximation $k_d(x, x') \approx \phi_d(x)^\top \phi_d(x')$, with more accurate approximations for larger $n$. This is feasible even for large $N$, as $\phi_d(x)$ only takes $O(n)$ computation. See Appendix D for details. For non-stationary kernels, we can obtain features by Nyström methods [37, 6], which use a set of $n$ inducing points to approximate $K$.
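The Kronecker identity behind the Cholesky features is easy to verify numerically on a small 2D grid (a sketch with made-up Gram matrices; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_gram(m, rng):
    """A small random positive definite Gram matrix."""
    A = rng.normal(size=(m, m))
    return A @ A.T + m * np.eye(m)

# per-dimension Gram matrices for a 3 x 4 grid
K1, K2 = random_gram(3, rng), random_gram(4, rng)
L1, L2 = np.linalg.cholesky(K1), np.linalg.cholesky(K2)

# the Cholesky factor of a Kronecker product is the Kronecker product of
# the Cholesky factors: K = K1 (x) K2 = (L1 (x) L2)(L1 (x) L2)^T
L = np.kron(L1, L2)
assert np.allclose(L @ L.T, np.kron(K1, K2))

# phi_d(x_i) is the i-th row of L^(d); inner products of rows recover k_d
assert np.allclose(L1 @ L1.T, K1)
```

Since only the small per-dimension factors $L^{(d)}$ are ever computed, the full $\prod_d n_d \times \prod_d n_d$ Cholesky factor never needs to be materialised.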
The kernel is evaluated for each pair of inducing points and also between the inducing points and the data, giving matrices $K_{nn}$ and $K_{Nn}$. Then $K \approx \hat{K} = K_{Nn} K_{nn}^{-1} K_{Nn}^\top = \Phi^\top \Phi$ where $\Phi = L_{nn}^{-1} K_{Nn}^\top$ and $K_{nn} = L_{nn} L_{nn}^\top$. Hence the columns of $\Phi$ can be defined to be the Nyström features.

F Collaborative Filtering

F.1 Using Binary Vectors for Side Information

Note that if the side information $\omega_1(u_i), \omega_2(v_j)$ are binary vectors with non-zeros at indices $\mathcal{I}_i, \mathcal{J}_j$ respectively, we have:
$$f(u_i, v_j) = \Big( a_1 U_i + b_1 \sum_{k \in \mathcal{I}_i} U_{n_1 + k} \Big)^\top W \Big( a_2 V_j + b_2 \sum_{k \in \mathcal{J}_j} V_{n_2 + k} \Big)$$
which can be reparametrised to:
$$f(u_i, v_j) = a \Big( U_i + b \sum_{k \in \mathcal{I}_i} U_{n_1 + k} \Big)^\top W \Big( V_j + c \sum_{k \in \mathcal{J}_j} V_{n_2 + k} \Big) \qquad (7)$$

F.2 Hyperparameter Tuning for MovieLens 100K

Hyperparameters were tuned on the following values. For PMF and fixed-$W$ TGP: $\sigma_u = [0.3, 0.1, 0.03]$, $\sigma^2 = [1.0, 0.1, 0.01, 0.001]$, $\epsilon_u = [10^{-5}, 10^{-6}, 10^{-7}]$, where $\epsilon_u, \epsilon_w$ are the step sizes for SGD on $U/V$ and $W$ respectively. We noticed that for a fixed $W$ the model overfits quickly, in fewer than 30 epochs, whereas when learning $W$ the test RMSE decreases steadily. So we used a different grid of parameters for tuning the models where $W$ is learned: $\sigma_u = [0.3, 0.1]$, $\sigma^2 = [1.0, 0.75]$, $\epsilon_u, \epsilon_w = [10^{-5}, 10^{-6}]$. For models with side information, we tuned on $a = [0.25, 0.5, 0.75]$ and $b, c = [0.15, 0.3, 0.45]$.

G California House Prices Data

Figure 5: Heatmap showing the four additive components of predictions of the last sample of TGP for r = 2, n = 200, using the uniform colouring scheme.

Figure 6: Zoom in on the LA area of Figure 4b.

Figure 7: Zoom in on the Bay Area of Figure 4b.

Table 2: Mean and standard deviation of the Gelman-Rubin statistic for HMC on TGP.

Model       | n = 25      | n = 50      | n = 100     | n = 200
TGP, r = 2  | 2.67 ± 1.34 | 2.55 ± 1.37 | 2.10 ± 0.70 | 1.92 ± 0.67
TGP, r = 5  | 1.06 ± 0.27 | 1.06 ± 0.19 | 1.15 ± 0.11 | 1.11 ± 0.11
TGP, r = 10 | 1.00 ± 0.03 | 1.02 ± 0.04 | 1.01 ± 0.02 | 1.06 ± 0.03

Table 3: Mean and standard deviation of Effective Sample Size (out of 1200) for HMC on TGP.

Model       | n = 25    | n = 50    | n = 100   | n = 200
TGP, r = 2  | 230 ± 459 | 5 ± 17    | 11 ± 81   | 12 ± 74
TGP, r = 5  | 244 ± 118 | 121 ± 70  | 34 ± 64   | 42 ± 79
TGP, r = 10 | 692 ± 152 | 196 ± 93  | 310 ± 111 | 96 ± 165

H Irish Wind Data

Regression on spatio-temporal data with grid structure. We use the Irish wind data⁴, giving daily average wind speeds for 12 locations in Ireland between 1961 and 1978 (78,888 observations). We only use the covariates longitude, latitude and time. Note that a 2D grid structure arises for the data when we treat the spatial covariates as one dimension and time as another. Again we whiten each covariate and the observations, and use 20,000 randomly chosen data points for training and the rest for test. Using an isotropic SE kernel for space, and the sum of a periodic kernel and an SE kernel for time (to model annual periodicity and a global trend), we first fit a GP efficiently exploiting the grid structure [25]. The optimised hyperparameters are then used to construct Cholesky features. Again we use NUTS for inference on both the full-rank model and TGP, using 4 chains with 100 warmup draws and 100 samples.

Table 4: Train/Test RMSE on Irish wind data.

Model       | Train RMSE | Test RMSE
GP          | 4.8822     | 4.9915
Full-rank   | 4.8816     | 4.9898
TGP, r = 2  | 4.9120     | 4.9753
TGP, r = 5  | 4.8996     | 4.9735
TGP, r = 10 | 4.8913     | 4.9754

All models show good convergence after 100 warmup draws, indicated by the aforementioned convergence diagnostics. Looking at Table 4, we see similar patterns in the results for the wind data as for the house prices data: the GP, which is equivalent to the full-rank model with Cholesky features (confirmed by similar train/test RMSE), shows lower training error than TGP, whereas TGP shows superior predictive performance.
These results again suggest that TGP is an effective regulariser towards simpler regression functions compared to GPs.

⁴ Obtained from http://www.inside-r.org/packages/cran/gstat/docs/wind

Figure 8: The predictions for TGP with r = 5 at the 12 locations. The light blue lines are the true observations, the yellow are the mean predictions, and the blue show the 2.5% and 97.5% percentiles of predictions across samples.

I Future Work

Note that TGP can easily be extended to non-Gaussian likelihoods, since all we need for SGD and HMC is for the likelihood and priors to be analytic and differentiable in the parameters. For very high dimensions, where even the $r^D$ entries in $W$ are undesirable, we can use a sparse representation of $W$ with, say, $Q$ non-zeros. All derivations carry forward, and we obtain time complexity $O(m(nrD + QD))$ for gradient computations in SGD. It would be interesting to compare TGP against other algorithms suitable for high-dimensional data. Furthermore, it would be desirable to have a sampling algorithm that scales sub-linearly, to benefit from the Bayesian approach to learning when $N$ is large and HMC is infeasible. One example is Stochastic Gradient Langevin Dynamics (SGLD) [36], among many other stochastic gradient MCMC [14] algorithms. We have also tried mean-field variational inference, but results were poor compared to HMC. Moreover, a more efficient method of tuning hyperparameters than cross-validation would be ideal, especially for big-$N$ settings with many kernel hyperparameters. One potential solution is the fully Bayesian approach, learning hyperparameters directly by imposing priors and sampling or MAP. We leave these extensions for future work.
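To illustrate the SGLD direction mentioned above (a toy sketch on a conjugate Gaussian model, not the TGP posterior; the model and all settings are our choices), each SGLD update is a minibatch gradient step with properly scaled injected Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy model: x_i ~ N(theta, 1) with prior theta ~ N(0, 1)
N, m = 1000, 50                        # dataset size, minibatch size
data = rng.normal(1.0, 1.0, size=N)    # data generated with true mean 1

eps = 1e-3                             # step size
theta, samples = 0.0, []
for step in range(3000):
    batch = rng.choice(data, size=m, replace=False)
    grad_prior = -theta                            # d/dtheta log N(theta; 0, 1)
    grad_lik = (N / m) * np.sum(batch - theta)     # rescaled minibatch gradient
    # SGLD: half-step on the stochastic gradient plus N(0, eps) noise
    theta += 0.5 * eps * (grad_prior + grad_lik) + rng.normal(0.0, np.sqrt(eps))
    if step >= 1000:                               # discard burn-in
        samples.append(theta)

# exact posterior is N(N*xbar/(N+1), 1/(N+1)); the chain should hover near it
post_mean = N * data.mean() / (N + 1)
assert abs(np.mean(samples) - post_mean) < 0.2
```

For the TGP, the same update would reuse the minibatch gradients of Appendix A in place of `grad_prior` and `grad_lik`, giving per-step cost independent of $N$.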