Variational Inducing Kernels for Sparse Convolved Multiple Output Gaussian Processes
Mauricio A. Álvarez (alvarezm@cs.man.ac.uk), School of Computer Science, University of Manchester, Manchester, UK, M13 9PL
David Luengo (luengod@ieee.org), Dep. Teoría de Señal y Comunicaciones, Universidad Carlos III de Madrid, 28911 Leganés, Spain
Michalis K. Titsias (mtitsias@cs.man.ac.uk) and Neil D. Lawrence (neill@cs.man.ac.uk), School of Computer Science, University of Manchester, Manchester, UK, M13 9PL

Abstract

Interest in multioutput kernel methods is increasing, whether under the guise of multitask learning, multisensor networks or structured output data. From the Gaussian process perspective a multioutput Mercer kernel is a covariance function over correlated output functions. One way of constructing such kernels is based on convolution processes (CP). A key problem for this approach is efficient inference. Álvarez and Lawrence (2009) recently presented a sparse approximation for CPs that enabled efficient inference. In this paper, we extend this work in two directions: we introduce the concept of variational inducing functions to handle potential non-smooth functions involved in the kernel CP construction, and we consider an alternative approach to approximate inference based on variational methods, extending the work by Titsias (2009) to the multiple output case. We demonstrate our approaches on prediction of school marks, compiler performance and financial time series.

1. Introduction

Gaussian processes (GPs) are flexible non-parametric models which allow us to specify prior distributions over functions and perform inference about them. A limiting characteristic of GPs is the fact that the computational cost of inference is in general O(N³), N being the number of data points, with an associated storage requirement of O(N²). In recent years a lot of progress (Csató and Opper, 2001; Lawrence et al., 2003; Seeger et al., 2003; Snelson and Ghahramani, 2006; Quiñonero Candela and Rasmussen, 2005) has been made with approximations that allow inference in O(K²N) (with an associated storage of O(KN)), where K is a user specified number. This has made GPs practical for a range of larger scale inference problems.

In this paper we are specifically interested in developing priors over multiple functions. While such priors can be trivially specified by considering the functions to be independent, our focus is on priors which specify correlations between the functions. Most attempts to apply such priors so far (Teh et al., 2005; Osborne et al., 2008; Bonilla et al., 2008) have focussed on what is known in the geostatistics community as the "linear model of coregionalization" (LMC) (Journel and Huijbregts, 1978; Goovaerts, 1997). In these models the different outputs are assumed to be linear combinations of a set of one or more "latent functions", so that the d-th output, f_d(x), is given by

    f_d(x) = ∑_{q=1}^{Q} a_{d,q} u_q(x),    (1)

where u_q(x) is one of Q latent functions that, weighted by {a_{d,q}}_{q=1}^{Q}, sum to form each of the D outputs. GP priors are placed, independently, over each of the latent functions, inducing a correlated covariance function over {f_d(x)}_{d=1}^{D}. Approaches to multi-task learning arising in the kernel community (see for example Evgeniou et al., 2005) can also be seen to be instances of the LMC framework.
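To make the LMC construction in eq. (1) concrete, the following sketch (an illustration, not code from the paper) assembles the multioutput covariance implied by the model when each latent function u_q has its own squared exponential covariance. The mixing weights a_{d,q}, lengthscales and input grid are arbitrary values chosen for the example.

```python
import numpy as np

def rbf(X, X2, lengthscale=1.0):
    """Squared exponential covariance k(x, x') = exp(-||x - x'||^2 / (2 l^2))."""
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X @ X2.T
    return np.exp(-0.5 * d2 / lengthscale**2)

def lmc_covariance(X, A, lengthscales):
    """Covariance of the outputs in eq. (1), f_d(x) = sum_q a_{d,q} u_q(x).

    A is a D x Q matrix of weights a_{d,q}; each latent u_q is an independent GP
    with its own RBF covariance. Returns the (N D) x (N D) matrix whose N x N
    blocks are Cov[f_d, f_d'] = sum_q a_{d,q} a_{d',q} K_q.
    """
    D, Q = A.shape
    N = X.shape[0]
    K = np.zeros((N * D, N * D))
    for q in range(Q):
        Kq = rbf(X, X, lengthscales[q])
        # coregionalization structure: (A[:, q] A[:, q]^T) kron K_q
        K += np.kron(np.outer(A[:, q], A[:, q]), Kq)
    return K

X = np.linspace(0, 1, 5)[:, None]                     # N = 5 shared inputs
A = np.array([[1.0, 0.5], [0.3, -0.8], [0.9, 0.2]])   # D = 3 outputs, Q = 2 latent functions
print(lmc_covariance(X, A, lengthscales=[0.2, 0.5]).shape)   # (15, 15)
```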
We wish to go beyond the LMC framework; in particular, our focus is on convolution processes (Higdon, 2002; Boyle and Frean, 2005). Using CPs for multi-output GPs was proposed by Higdon (2002) and introduced to the machine learning audience by Boyle and Frean (2005). Convolution processes allow the integration of prior information from physical models, such as ordinary differential equations, into the covariance function. Álvarez et al. (2009), inspired by Gao et al. (2008), have demonstrated how first and second order differential equations, as well as partial differential equations, can be accommodated in a covariance function. Their interpretation is that the set of latent functions is a set of latent forces, and they term the resulting models "latent force models". The covariance functions for these models are derived through convolution processes (CPs). In the CP framework, output functions are generated by convolving Q latent processes {u_q(x)}_{q=1}^{Q} with kernel functions¹ G_{d,q}(x), associated to each output d and latent force q, so that we have

    f_d(x) = ∑_{q=1}^{Q} ∫ G_{d,q}(x − z) u_q(z) dz.    (2)

The LMC can be seen as a particular case of the CP, in which the kernel functions G_{d,q}(x) correspond to scaled Dirac δ-functions, G_{d,q}(x − z) = a_{d,q} δ(x − z). In latent force models the convolving kernel, G_{d,q}(·), is the Green's function associated to a particular differential equation. A practical problem associated with the CP framework is that in these models inference has computational complexity O(N³D³) and storage requirements O(N²D²).

Recently, Álvarez and Lawrence (2009) introduced an efficient approximation for inference in this multi-output GP model. The idea was to exploit a conditional independence assumption over the output functions {f_d(x)}_{d=1}^{D} given a finite number of observations of the latent functions, {{u_q(x_k)}_{k=1}^{K}}_{q=1}^{Q}. This led to approximations that are very similar in spirit to the PITC and FITC approximations of Snelson and Ghahramani (2006); Quiñonero Candela and Rasmussen (2005). In this paper we build on the work of Álvarez and Lawrence. Their approximation was inspired by the fact that if the latent functions are observed in their entirety, the output functions are conditionally independent of one another (as can be seen in (2)).

We extend the previous work presented in Álvarez and Lawrence (2009) in two ways. First, a problem with the FITC and PITC approximations can be their propensity to overfit when inducing inputs are optimized. A solution to this problem was given in recent work by Titsias (2009), who provides a sparse GP approximation that has an associated variational bound. In this paper we show how the ideas of Titsias can be extended to the multiple output case. Second, we notice that if the locations of the inducing points, {x_k}_{k=1}^{K}, are close relative to the length scale of the latent function, the PITC approximation will be accurate. However, if the length scale becomes small the approximation requires very many inducing points. In the worst case, the latent process could be white noise (as suggested by Higdon (2002) and implemented by Boyle and Frean (2005)). In this case the approximation will fail completely.

¹ Not kernels in the Mercer sense, but kernels in the normal sense.
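The convolution integrals implied by eq. (2) are tractable for the Gaussian forms used later in the experiments, since convolving Gaussian densities simply adds their variances. The sketch below (illustrative only, with hypothetical scale factors and variances) evaluates the resulting cross-covariance between two outputs driven by a single latent force.

```python
import numpy as np

def gaussian_density(r, var):
    """N(r | 0, var) evaluated elementwise."""
    return np.exp(-0.5 * r**2 / var) / np.sqrt(2.0 * np.pi * var)

def cp_cross_covariance(X, X2, S_d, S_dp, var_d, var_dp, var_q):
    """Cov[f_d(x), f_d'(x')] under eq. (2) with Q = 1 and Gaussian forms.

    Assumes G_{d,q}(r) = S_d N(r | 0, var_d) and k_q(z, z') = N(z - z' | 0, var_q),
    so the double convolution integral collapses to
    S_d S_d' N(x - x' | 0, var_d + var_d' + var_q).
    """
    diff = X[:, None] - X2[None, :]
    return S_d * S_dp * gaussian_density(diff, var_d + var_dp + var_q)

x = np.linspace(0, 1, 4)
print(cp_cross_covariance(x, x, S_d=1.0, S_dp=0.7, var_d=0.01, var_dp=0.02, var_q=0.1))
```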
We further develop the variational approximation to allow us to work with rapidly fluctuating latent functions (including white noise). This is achieved by augmenting the output functions with one or more additional functions. We refer to these additional outputs as the inducing functions. Our variational approximation is developed through the inducing functions. There are also smoothing kernels associated with the inducing functions. The quality of the variational approximation can be controlled both through these inducing kernels and through the number and location of the inducing inputs. Our approximation allows us to consider latent force models with a larger number of states, D, and data points, N. The use of inducing kernels also allows us to extend the inducing variable approximation of the latent force model framework to systems of stochastic differential equations (SDEs). In this paper we apply the variational inducing kernel approximation to different real world datasets, including a multivariate financial time series example.

An idea similar to the inducing functions introduced in this paper was simultaneously proposed by Lázaro-Gredilla and Figueiras-Vidal (2010), who introduced the concept of inducing features to improve performance over the pseudo-inputs approach of Snelson and Ghahramani (2006) in sparse GP models. Our use of inducing functions and inducing kernels is motivated by the necessity to deal with non-smooth latent functions in the convolution process model of multiple outputs.

2. Multiple Output Gaussian Processes

Let y_d ∈ R^N, where d = 1, ..., D, be the observed data associated with the output function y_d(x). For simplicity, we assume that all the observations associated with different outputs are evaluated at the same inputs X (although this assumption is easily relaxed). We will often use the stacked vector y = (y_1, ..., y_D) to collectively denote the data of all the outputs. Each observed vector y_d is assumed to be obtained by adding independent Gaussian noise to a vector of function values f_d, so that the likelihood is p(y_d | f_d) = N(y_d | f_d, σ_d² I), where f_d is defined via (2). More precisely, the assumption in (2) is that a function value f_d(x) (the noise-free version of y_d(x)) is generated from a common pool of Q independent latent functions {u_q(x)}_{q=1}^{Q}, each having a covariance function (Mercer kernel) given by k_q(x, x′). Notice that the outputs share the same latent functions, but they also have their own set of parameters ({α_{d,q}}_{q=1}^{Q}, σ_d²), where α_{d,q} are the parameters of the smoothing kernel G_{d,q}(·). Because convolution is a linear operation, the covariance between any pair of function values f_d(x) and f_{d′}(x′) is given by

    k_{f_d,f_{d′}}(x, x′) = Cov[f_d(x), f_{d′}(x′)] = ∑_{q=1}^{Q} ∫ G_{d,q}(x − z) ∫ G_{d′,q}(x′ − z′) k_q(z, z′) dz′ dz.

This covariance function is used to define a fully-coupled GP prior p(f_1, ..., f_D) over all the function values associated with the different outputs. The joint probability distribution of the multioutput GP model can be written as

    p({y_d, f_d}_{d=1}^{D}) = ∏_{d=1}^{D} p(y_d | f_d) p(f_1, ..., f_D).
The GP prior p(f_1, ..., f_D) has a zero mean vector and an (ND) × (ND) covariance matrix K_{f,f}, where f = (f_1, ..., f_D), which consists of N × N blocks of the form K_{f_d,f_{d′}}. Elements of each block are given by k_{f_d,f_{d′}}(x, x′) evaluated at all pairs of training inputs. Each such block is either a cross-covariance or a covariance matrix of a pair of outputs. Prediction using the above GP model, as well as the maximization of the marginal likelihood p(y) = N(y | 0, K_{f,f} + Σ), where Σ = diag(σ_1² I, ..., σ_D² I), requires O(N³D³) time and O(N²D²) storage, which rapidly becomes infeasible even when only a few hundred outputs and data points are considered. Therefore approximate or sparse methods are needed in order to make the above multioutput GP model practical.

3. PITC-like approximation for Multiple Output Gaussian Processes

Before we propose our variational sparse inference method for multioutput GP regression in Section 4, we review the sparse method proposed by Álvarez and Lawrence (2009). This method is based on a likelihood approximation. More precisely, each output function y_d(x) is independent of the other output functions given the full length of each latent function u_q(x). This means that the likelihood of the data factorizes according to

    p(y | u) = ∏_{d=1}^{D} p(y_d | u) = ∏_{d=1}^{D} p(y_d | f_d),

with u = {u_q}_{q=1}^{Q} the set of latent functions. The sparse method in Álvarez and Lawrence (2009) makes use of this factorization by assuming that it remains valid even when we are only allowed to exploit the information provided by a finite set of function values, u_q, instead of the full-length function u_q(x) (which involves uncountably many points). Let u_q, for q = 1, ..., Q, be a K-dimensional vector of values from the function u_q(x) which are evaluated at the inputs Z = {z_k}_{k=1}^{K}. These points are commonly referred to as inducing inputs. The vector u = (u_1, ..., u_Q) denotes all these variables. The sparse method approximates the exact likelihood function p(y | u) with the likelihood

    p(y | u) = ∏_{d=1}^{D} p(y_d | u) = ∏_{d=1}^{D} N(y_d | μ_{f_d|u}, Σ_{f_d|u} + σ_d² I),

where μ_{f_d|u} = K_{f_d,u} K_{u,u}^{−1} u and Σ_{f_d|u} = K_{f_d,f_d} − K_{f_d,u} K_{u,u}^{−1} K_{u,f_d} are the mean and covariance matrices of the conditional GP priors p(f_d | u). The matrix K_{u,u} is a block diagonal covariance matrix where the q-th block K_{u_q,u_q} is obtained by evaluating k_q(z, z′) at the inducing inputs Z. Further, the matrix K_{f_d,u} has entries defined by the cross-covariance function

    Cov[f_d(x), u_q(z)] = ∫ G_{d,q}(x − z′) k_q(z′, z) dz′.

The variables u follow the GP prior p(u) = N(u | 0, K_{u,u}) and can be integrated out to give the following approximation to the exact marginal likelihood:

    p(y | θ) = N(y | 0, D + K_{f,u} K_{u,u}^{−1} K_{u,f} + Σ).    (3)

Here, D is a block-diagonal matrix, where each block in the diagonal is given by K_{f_d,f_d} − K_{f_d,u} K_{u,u}^{−1} K_{u,f_d} for all d. This approximate marginal likelihood represents exactly each diagonal (output-specific) block K_{f_d,f_d}, while each off-diagonal (cross-output) block K_{f_d,f_{d′}} is approximated by the Nyström matrix K_{f_d,u} K_{u,u}^{−1} K_{u,f_{d′}}.
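The structure of the approximate marginal likelihood in eq. (3), exact diagonal blocks and Nyström off-diagonal blocks, can be assembled directly once the covariance blocks are available. The following sketch (an illustration, assuming the blocks have already been computed by some kernel routine) builds that covariance matrix.

```python
import numpy as np

def pitc_covariance(Kff_blocks, Kfu_blocks, Kuu, noise_vars):
    """Covariance matrix of the PITC approximation in eq. (3).

    Kff_blocks: list of D arrays (N x N), exact per-output blocks K_{f_d, f_d}.
    Kfu_blocks: list of D arrays (N x QK), cross-covariances K_{f_d, u}.
    Kuu:        (QK x QK) block-diagonal inducing covariance K_{u, u}.
    noise_vars: list of D noise variances sigma_d^2.

    Off-diagonal blocks are the Nystrom terms K_{f_d,u} Kuu^{-1} K_{u,f_d'};
    the diagonal blocks are kept exact and the output noise is added.
    """
    Kfu = np.vstack(Kfu_blocks)                 # (ND x QK)
    cov = Kfu @ np.linalg.solve(Kuu, Kfu.T)     # low-rank Nystrom part
    start = 0
    for Kff_d, s2 in zip(Kff_blocks, noise_vars):
        N = Kff_d.shape[0]
        idx = slice(start, start + N)
        cov[idx, idx] = Kff_d + s2 * np.eye(N)  # restore exact diagonal block
        start += N
    return cov
```

The approximate log marginal likelihood of eq. (3) is then the log density of the stacked data vector y under a zero-mean Gaussian with this covariance.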
The above sparse method has a similar structure to the PITC approximation introduced for single-output regression (Quiñonero Candela and Rasmussen, 2005). Because of this similarity, Álvarez and Lawrence (2009) call their multioutput sparse approximation PITC as well. Two of the properties of this PITC approximation, which can also be its limitations, are:

1. It assumes that all latent functions in u are smooth.
2. It is based on a modification of the initial full GP model. This implies that the inducing inputs Z are extra kernel hyperparameters in the modified GP model.

Because of point 1, the method is not applicable when the latent functions are white noise processes. An important class of problems where we have to deal with white noise processes arises in linear SDEs, where the above sparse method is currently not applicable. Because of point 2, the maximization of the marginal likelihood in eq. (3) with respect to (Z, θ), where θ are model hyperparameters, may be prone to overfitting, especially when the number of variables in Z is large. Moreover, fitting a modified sparse GP model implies that the full GP model is not approximated in a systematic and rigorous way, since there is no distance or divergence between the two models that is minimized.

In the next section, we address point 1 above by introducing the concept of variational inducing kernels, which allow us to efficiently sparsify multioutput GP models having white noise latent functions. Further, these inducing kernels are incorporated into the variational inference method of Titsias (2009) (thus addressing point 2), which treats the inducing inputs Z, as well as other quantities associated with the inducing kernels, as variational parameters. The whole variational approach provides us with a very flexible approximation framework, robust to overfitting, that overcomes the limitations of the PITC approximation.

4. Sparse variational approximation

In this section, we introduce the concept of variational inducing kernels (VIKs). VIKs give us a way to define more general inducing variables that have larger approximation capacity than the u inducing variables used earlier and, importantly, allow us to deal with white noise latent functions. To motivate the idea, we first explain why the u variables can work when the latent functions are smooth and fail when these functions become white noise.

In PITC, we assume each latent function u_q(x) is smooth and we sparsify the GP model by introducing inducing variables u_q, which are direct observations of the latent function u_q(x) at particular input points. Because of the latent function's smoothness, the u_q variables also carry information about other points of the function through the imposed prior over the latent function. So, having observed u_q, we can reduce the uncertainty of the whole function. With the vector of inducing variables u, if chosen to be sufficiently large relative to the length scales of the latent functions, we can efficiently represent the functions {u_q(x)}_{q=1}^{Q} and subsequently the variables f, which are just convolved versions of the latent functions.² When the reconstruction of f from u is perfect, the conditional prior p(f | u) becomes a delta function and the sparse PITC approximation becomes exact. Figure 1(a) shows a cartoon description of the summarization of u_q(x) by u_q.

² This idea is like a "soft version" of the Nyquist–Shannon sampling theorem. If the latent functions were bandlimited, we could compute exact results given a high enough number of inducing points. In general they will not be bandlimited, but for smooth functions low frequency components will dominate over high frequencies, which quickly fade away.
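The claim of this paragraph can be checked numerically: for a smooth latent covariance, conditioning on a modest number of inducing values removes most of the prior uncertainty over the whole input range. The snippet below is an illustration of that effect (a unit-variance RBF covariance with an arbitrary lengthscale, not a construction taken from the paper).

```python
import numpy as np

def rbf(x, x2, ell=0.2):
    """Squared exponential covariance of the latent function u_q (unit variance)."""
    return np.exp(-0.5 * (x[:, None] - x2[None, :])**2 / ell**2)

def conditional_variance(x_star, z, ell=0.2):
    """Var[u_q(x*) | u_q], where u_q holds the values of u_q at the inducing inputs z."""
    Kzz = rbf(z, z, ell) + 1e-8 * np.eye(len(z))
    Ksz = rbf(x_star, z, ell)
    return 1.0 - np.sum(Ksz * np.linalg.solve(Kzz, Ksz.T).T, axis=1)

x_star = np.linspace(0, 1, 200)
z = np.linspace(0, 1, 10)                      # inducing inputs spaced below the lengthscale
print(conditional_variance(x_star, z).max())   # far below the prior variance of 1
```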
In contrast, when some of the latent functions are white noise processes the sparse approximation will fail. If u_q(z) is white noise,³ it has covariance function δ(z − z′). Such processes naturally arise in the application of stochastic differential equations (see Section 7) and are the ultimate non-smooth processes, where two values u_q(z) and u_q(z′) are uncorrelated whenever z ≠ z′. When we apply the sparse approximation, a vector of "white-noise" inducing variables u_q does not carry information about u_q(z) at any input z that differs from all inducing inputs Z. In other words, there is no additional information in the conditional prior p(u_q(z) | u_q) over the unconditional prior p(u_q(z)). Figure 1(b) shows a pictorial representation. This lack of structure makes it impossible to exploit the correlations in standard sparse methods like PITC.⁴

Our solution to this problem is the following. We define a more powerful form of inducing variable, one based not on the latent function at a point, but given by the convolution of the latent function with a smoothing kernel. More precisely, let us replace each inducing vector u_q with the variables λ_q, which are evaluated at the inputs Z and are defined according to

    λ_q(z) = ∫ T_q(z − v) u_q(v) dv,    (4)

where T_q(x) is a smoothing kernel (e.g. Gaussian) which we call the inducing kernel (IK). This kernel is not necessarily related to the model's smoothing kernels. These newly defined inducing variables can carry information about u_q(z) not only at a single input location but from the entire input space. Figure 1(c) shows how the inducing kernel generates the artificial construction λ_q(x), which sheds some light on the otherwise obscure inducing points.

³ Such a process can be thought of as the "time derivative" of the Wiener process.
⁴ Returning to our sampling theorem analogy, the white noise process has infinite bandwidth. It is therefore impossible to represent it by observations at a few fixed inducing points.

Figure 1: Panels: (a) Latent function is smooth; (b) Latent function is noise; (c) Generation of an inducing function, u_q(x) ∗ T_q(x) = λ_q(x). With a smooth latent function as in (a), we can use some inducing variables u_q (red dots) from the complete latent process u_q(x) (in black) to generate smoothed versions (for example the one in blue), with uncertainty described by p(u_q(x) | u_q). However, with a white noise latent function as in (b), choosing inducing variables u_q (red dots) from the latent process (in black) does not give us a clue about other points (for example the blue dots). In (c) the inducing function λ_q(x) acts as a surrogate for a smooth function. Indirectly, it contains information about the inducing points and it can be used in the computation of the lower bound. In this context, the symbol ∗ refers to the convolution integral.
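The construction in eq. (4) can be mimicked numerically by drawing a discretized white noise process on a fine grid and integrating it against a Gaussian inducing kernel, in the spirit of Figure 1(c). The grid, kernel width and inducing inputs below are arbitrary illustrative choices, not values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
v = np.linspace(0, 1, 2000)                    # fine grid over the input space
dv = v[1] - v[0]
u = rng.standard_normal(v.size) / np.sqrt(dv)  # discretized white noise u_q

def T(r, ell=0.05):
    """Gaussian inducing kernel T_q."""
    return np.exp(-0.5 * r**2 / ell**2) / np.sqrt(2.0 * np.pi * ell**2)

z = np.linspace(0, 1, 10)                      # inducing inputs Z
# lambda_q(z_k) = int T_q(z_k - v) u_q(v) dv, approximated by a Riemann sum
lam = np.array([np.sum(T(zk - v) * u) * dv for zk in z])
print(lam)   # smooth functionals of the rough process: they carry global information
```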
We can even allow a separate IK for each inducing point; that is, if the set of inducing points is Z = {z_k}_{k=1}^{K}, then

    λ_q(z_k) = ∫ T_{q,k}(z_k − v) u_q(v) dv,

with the advantage of associating to each inducing point z_k its own set of adaptive parameters in T_{q,k}. For the PITC approximation, this adds more hyperparameters to the likelihood, perhaps leading to overfitting. However, in the variational approximation we define all these new parameters as variational parameters and therefore they do not cause the model to overfit. We use the notation λ to refer to the set of inducing functions {λ_q}_{q=1}^{Q}.

If u_q(z) has a white noise⁵ GP prior, the covariance function for λ_q(x) is

    Cov[λ_q(x), λ_q(x′)] = ∫ T_q(x − z) T_q(x′ − z) dz    (5)

and the cross-covariance function between f_d(x) and λ_q(x′) is

    Cov[f_d(x), λ_q(x′)] = ∫ G_{d,q}(x − z) T_q(x′ − z) dz.    (6)

Notice that this cross-covariance function, unlike the case of u inducing variables, maintains a weighted integration over the whole input space. This implies that a single inducing variable λ_q(x) can properly propagate information from the full-length process u_q(x) into the set of outputs f. It is possible to combine the IKs defined above with the PITC approximation of Álvarez and Lawrence (2009), but in this paper our focus will be on applying them within the variational framework of Titsias (2009). We therefore refer to the kernels as variational inducing kernels (VIKs).

⁵ It is straightforward to generalize the method for rough latent functions that are not white noise or to combine smooth latent functions with white noise.

5. Variational inference for sparse multiple output Gaussian Processes

We now extend the variational inference method of Titsias (2009) to deal with multiple outputs and incorporate it into the VIK framework. We compactly write the joint probability model p({y_d, f_d}_{d=1}^{D}) as p(y, f) = p(y | f) p(f). The first step of the variational method is to augment this model with inducing variables. For our purpose, suitable inducing variables are defined through VIKs. More precisely, let λ = (λ_1, ..., λ_Q) be the whole vector of inducing variables, where each λ_q is a K-dimensional vector of values obtained according to eq. (4). The role of λ_q is to carry information about the latent function u_q(z). Each λ_q is evaluated at the inputs Z and has its own VIK, T_q(x), that depends on parameters θ_{T_q}. We denote these parameters collectively as Θ = {θ_{T_q}}_{q=1}^{Q}. The λ variables augment the GP model according to

    p(y, f, λ) = p(y | f) p(f | λ) p(λ).

Here, p(λ) = N(λ | 0, K_{λ,λ}), where K_{λ,λ} is a block diagonal matrix in which each block K_{λ_q,λ_q} is obtained by evaluating the covariance function in eq. (5) at the inputs Z. Additionally, p(f | λ) = N(f | K_{f,λ} K_{λ,λ}^{−1} λ, K_{f,f} − K_{f,λ} K_{λ,λ}^{−1} K_{λ,f}), where the cross-covariance K_{f,λ} is computed through eq. (6). Because of the consistency condition ∫ p(f | λ) p(λ) dλ = p(f), performing exact inference in the above augmented model is equivalent to performing exact inference in the initial GP model. Crucially, this holds for any values of the augmentation parameters (Z, Θ). This is the key property that allows us to turn these augmentation parameters into variational parameters by applying approximate sparse inference.
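For Gaussian inducing kernels and Gaussian model kernels (the forms used in the experiments of Section 6), the integrals in eqs. (5) and (6) with a white noise latent process have simple closed forms, again because convolving two Gaussian densities yields a Gaussian whose variances add. The sketch below computes K_{λ,λ} and K_{f_d,λ} under that assumption; the scale and variance values are placeholders.

```python
import numpy as np

def gauss(r, var):
    """N(r | 0, var), the Gaussian form used for both smoothing and inducing kernels."""
    return np.exp(-0.5 * r**2 / var) / np.sqrt(2.0 * np.pi * var)

def K_lambda_lambda(Z, var_T):
    """Eq. (5) for a Gaussian T_q and white noise u_q:
    int T_q(x - z) T_q(x' - z) dz = N(x - x' | 0, 2 var_T)."""
    return gauss(Z[:, None] - Z[None, :], 2.0 * var_T)

def K_f_lambda(X, Z, S_dq, var_G, var_T):
    """Eq. (6) for Gaussian G_{d,q} (scaled by S_dq) and Gaussian T_q:
    int G_{d,q}(x - z) T_q(x' - z) dz = S_dq N(x - x' | 0, var_G + var_T)."""
    return S_dq * gauss(X[:, None] - Z[None, :], var_G + var_T)

Z = np.linspace(0, 1, 8)       # inducing inputs
X = np.linspace(0, 1, 50)      # training inputs for one output
print(K_lambda_lambda(Z, var_T=0.01).shape,
      K_f_lambda(X, Z, 1.0, 0.02, 0.01).shape)   # (8, 8) (50, 8)
```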
Our method now follows exactly the lines of Titsias (2009) (in Appendix A we present a detailed derivation of the bound based on the set of latent functions u_q(x)). We introduce the variational distribution q(f, λ) = p(f | λ) φ(λ), where p(f | λ) is the conditional GP prior defined earlier and φ(λ) is an arbitrary variational distribution. By minimizing the KL divergence between q(f, λ) and the true posterior p(f, λ | y), we can compute the following Jensen's lower bound on the true log marginal likelihood:

    F_V(Z, Θ) = log N(y | 0, K_{f,λ} K_{λ,λ}^{−1} K_{λ,f} + Σ) − (1/2) tr(Σ^{−1} K̃),

where Σ is the covariance matrix associated with the additive noise process and K̃ = K_{f,f} − K_{f,λ} K_{λ,λ}^{−1} K_{λ,f}. Note that this bound consists of two parts. The first part is the log of a GP prior, with the only difference that now the covariance matrix has a particular low rank form. This form allows the inversion of the covariance matrix to take place in O(NDK²) time rather than O(N³D³). The second part can be seen as a penalization term that regulates the estimation of the parameters. Notice also that only the diagonal of the exact covariance matrix K_{f,f} needs to be computed. Overall, the computation of the bound can be done efficiently in O(NDK²) time.

The bound can be maximized with respect to all parameters of the covariance function, both model parameters and variational parameters. The variational parameters are the inducing inputs Z and the parameters θ_{T_q} of each VIK, which are rigorously selected so that the KL divergence is minimized. In fact each VIK is also a variational quantity, and one could try different forms of VIKs in order to choose the one that gives the best lower bound.

The form of the bound is very similar to the projected process approximation, also known as the Deterministic Training Conditional (DTC) approximation (Csató and Opper, 2001; Seeger et al., 2003; Rasmussen and Williams, 2006). However, the bound has an additional trace term that penalizes the movement of inducing inputs away from the data. This term converts the DTC approximation into a lower bound and prevents overfitting. In what follows, we refer to this approximation as DTCVAR, where the VAR suffix refers to the variational framework.

The predictive distribution of a vector of test points y_∗ given the training data can also be found to be p(y_∗ | y, X, Z) = N(y_∗ | μ_{y_∗}, Σ_{y_∗}), with

    μ_{y_∗} = K_{f_∗,λ} A^{−1} K_{λ,f} Σ^{−1} y,    Σ_{y_∗} = K_{f_∗,f_∗} − K_{f_∗,λ} (K_{λ,λ}^{−1} − A^{−1}) K_{λ,f_∗} + Σ_∗,

where A = K_{λ,λ} + K_{λ,f} Σ^{−1} K_{f,λ}. Predictive means can be computed in O(NDK), whereas predictive variances require O(NDK²) computation.
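As an illustration of the quantities just introduced (not the authors' implementation), the sketch below evaluates the bound F_V and the predictive mean from precomputed covariance blocks. For clarity it assumes a single shared noise variance and forms the full low-rank-plus-noise covariance explicitly; a practical implementation would instead use the matrix inversion lemma to keep the cost at O(NDK²).

```python
import numpy as np

def log_gauss_zero_mean(y, cov):
    """log N(y | 0, cov) via a Cholesky factorization."""
    L = np.linalg.cholesky(cov)
    alpha = np.linalg.solve(L, y)
    return -0.5 * (y.size * np.log(2.0 * np.pi)
                   + 2.0 * np.sum(np.log(np.diag(L))) + alpha @ alpha)

def dtcvar_bound(y, Kfl, Kll, Kff_diag, noise_var):
    """F_V = log N(y | 0, Kfl Kll^{-1} Klf + Sigma) - 0.5 tr(Sigma^{-1} Ktilde),
    with Sigma = noise_var * I (a simplifying assumption for this sketch)."""
    Qff = Kfl @ np.linalg.solve(Kll, Kfl.T)      # low-rank term
    bound = log_gauss_zero_mean(y, Qff + noise_var * np.eye(y.size))
    bound -= 0.5 * np.sum(Kff_diag - np.diag(Qff)) / noise_var
    return bound

def dtcvar_predictive_mean(y, Kfl, Kll, Ksl, noise_var):
    """mu_{y*} = K_{f* lambda} A^{-1} K_{lambda f} Sigma^{-1} y,
    with A = K_{lambda lambda} + K_{lambda f} Sigma^{-1} K_{f lambda}."""
    A = Kll + Kfl.T @ Kfl / noise_var
    return Ksl @ np.linalg.solve(A, Kfl.T @ y / noise_var)
```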
6. Experiments

We present results of applying the proposed method to two real-world datasets, described below. We compare the results obtained using PITC, the intrinsic coregionalization model (ICM)⁶ employed in Bonilla et al. (2008), and the method using variational inducing kernels. For PITC we estimate the parameters through the maximization of the approximate marginal likelihood of equation (3) using a scaled conjugate gradient method. We use one latent function, and both the covariance function of the latent process, k_q(x, x′), and the kernel smoothing function, G_{d,q}(x), follow a Gaussian form, that is,

    k(x, x′) = N(x − x′ | 0, C),

where C is a diagonal matrix. For the DTCVAR approximations, we maximize the variational bound F_V. Optimization is also performed using scaled conjugate gradients. We use one white noise latent function and a corresponding inducing function. The inducing kernels and the model kernels follow the same Gaussian form as in the PITC case. Using this form for the covariance or kernel, all convolution integrals are solved analytically.

⁶ The intrinsic coregionalization model is a particular case of the linear model of coregionalization with one latent function (Goovaerts, 1997). See equation (1) with Q = 1.

6.1 Exam score prediction

In this experiment the goal is to predict the exam score obtained by a particular student belonging to a particular school. The data come from the Inner London Education Authority (ILEA)⁷ and consist of examination records from 139 secondary schools in years 1985, 1986 and 1987. It is a random 50% sample with 15362 students. The input space consists of features related to each student and features related to each school. From the multiple output point of view, each school represents one output and the exam score of each student a particular instantiation of that output.

⁷ Data is available at http://www.cmm.bristol.ac.uk/learning-training/multilevel-m-support/datasets.shtml

We follow the same preprocessing steps employed in Bonilla et al. (2008). The only features used are the student-dependent ones (year in which each student took the exam, gender, VR band and ethnic group), which are categorical variables. Each of them is transformed to a binary representation. For example, the possible values that the variable year of the exam can take are 1985, 1986 or 1987, represented as 100, 010 or 001. The transformation is also applied to the variables gender (two binary variables), VR band (four binary variables) and ethnic group (eleven binary variables), ending up with an input space of dimension 20. The categorical nature of the data restricts the input space to 202 unique input feature vectors. However, two students represented by the same input vector x and belonging to the same school d can obtain different exam scores. To reduce this noise in the data, we follow Bonilla et al. (2008) in taking the mean of the observations that, within a school, share the same input vector, and we use a simple heteroskedastic noise model in which the variance for each of these means is divided by the number of observations used to compute it.

The performance measure employed is the percentage of explained variance, defined as the total variance of the data minus the sum-squared error on the test set, expressed as a percentage of the total data variance. It can be seen as the percentage version of the coefficient of determination between the test targets and the predictions. The performance measure is computed for ten repetitions with 75% of the data in the training set and 25% of the data in the test set.

Figure 2 shows results using PITC, DTCVAR with one smoothing kernel, and DTCVAR with as many inducing kernels as inducing points (DTCVARS in the figure). For 50 inducing points all three alternatives lead to approximately the same results.
PITC keeps a relatively constant performance for all numbers of inducing points, while the DTCVAR approximations improve as the number of inducing points increases. This is consistent with the expected behaviour of the DTCVAR methods, since the trace term penalizes the model for a reduced number of inducing points. Notice that all the approximations outperform independent GPs and the best result of the intrinsic coregionalization model presented in Bonilla et al. (2008).

Figure 2: Exam score prediction results for the school dataset. Results include the mean of the percentage of explained variance over ten repetitions of the experiment, together with one standard deviation. In the bottom, SM X stands for the sparse method with X inducing points, DTCVAR refers to the DTC variational approximation with one smoothing kernel, and DTCVARS to the same approximation using as many inducing kernels as inducing points. Results using the ICM model and independent GPs (appearing as IND in the figure) have also been included.

6.2 Compiler prediction performance

In this dataset the outputs correspond to the speed-up of 11 C programs after some transformation sequence has been applied to them. The speed-up is defined as the execution time of the original program divided by the execution time of the transformed program. The input space consists of 13-dimensional binary feature vectors, where the presence of a one in these vectors indicates that the program has received that particular transformation. The dataset contains 88214 observations for each output and the same number of input vectors. All the outputs share the same input space. Due to technical requirements, it is important that the prediction of the speed-up for a particular program is made using few observations in the training set. We compare our results to the ones presented in Bonilla et al. (2008) and use N = 16, 32, 64 and 128 for the training set. The remaining 88214 − N observations are used for testing, employing the mean absolute error as the performance measure. The experiment is repeated ten times and standard deviations are also reported. We only include results for the average performance over the 11 outputs.

Figure 3 shows the results of applying independent GPs (IND in the figure), the intrinsic coregionalization model (ICM in the figure), PITC, DTCVAR with one inducing kernel (DTCVAR in the figure) and DTCVAR with as many inducing kernels as inducing points (DTCVARS in the figure). Since the training sets are small enough, we also include results of applying the GP generated using the full covariance matrix of the convolution construction (FULL GP in the figure). We repeated the experiment for different values of K, but show results only for K = N/2. Results for ICM and IND were obtained from Bonilla et al. (2008).

Figure 3: Mean absolute error and standard deviation over ten repetitions of the compiler experiment as a function of the number of training points. IND stands for independent GPs, ICM stands for the intrinsic coregionalization model, DTCVAR refers to the DTCVAR approximation using one inducing kernel, DTCVARS refers to the DTCVAR approximation using as many inducing kernels as inducing points, and FULL GP stands for the GP for the multiple outputs without any approximation.
In general, the DTCVAR variants outperform the ICM method, and also the independent GPs for N = 16, 32 and 64. In this case, using as many inducing kernels as inducing points improves the performance on average. All methods, including the independent GPs, are better than PITC. The size of the test set encourages the application of the sparse methods: for N = 128, making the prediction over the whole dataset using the full GP takes on average 22 minutes, while the prediction with DTCVAR takes 0.65 minutes. Using more inducing kernels improves the performance, but also makes the evaluation of the test set more expensive. For DTCVARS, the evaluation takes on average 6.8 minutes. Timing results are averages over the ten repetitions.

7. Stochastic Latent Force Models for Financial Data

The starting point of stochastic differential equations is a stochastic version of the equation of motion, which is called Langevin's equation:

    df(t)/dt = −C f(t) + S u(t),    (7)

where f(t) is the velocity of the particle, −C f(t) is a systematic friction term, u(t) is a randomly fluctuating external force, i.e. white noise, and S indicates the sensitivity of the output to the random fluctuations. In the mathematical probability literature, the above is written more rigorously as df(t) = −C f(t) dt + S dW(t), where W(t) is the Wiener process (standard Brownian motion). Since u(t) is a Gaussian process and the equation is linear, f(t) must also be a Gaussian process, which turns out to be the Ornstein–Uhlenbeck (OU) process.

Here, we are interested in extending the Langevin equation to model multivariate time series. The way the model in (7) is extended is by adding more output signals and more external forces. The forces can be either smooth (systematic or drift-type) forces or white noise forces. Thus, we obtain

    df_d(t)/dt = −D_d f_d(t) + ∑_{q=1}^{Q} S_{d,q} u_q(t),    (8)

where f_d(t) is the d-th output signal. Each u_q(t) can be either a smooth latent force that is assigned a GP prior with covariance function k_q(t, t′) or a white noise force that has a GP prior with covariance function δ(t − t′). That is, we have a composition of Q latent forces, where Q_s of them correspond to smooth latent forces and Q_o correspond to white noise processes. The intuition behind this combination of input forces is that the smooth part can be used to represent medium/long term trends that cause a departure from the mean of the output processes, whereas the stochastic part is related to short term fluctuations around the mean. A model that employs Q_s = 1 and Q_o = 0 was proposed by Lawrence et al. (2007) to describe protein transcription regulation in a single input motif (SIM) gene network.

Solving the differential equation (8), we obtain

    f_d(t) = e^{−D_d t} f_{d,0} + ∑_{q=1}^{Q} S_{d,q} ∫_0^t e^{−D_d (t − z)} u_q(z) dz,

where f_{d,0} arises from the initial condition. This model is now a special case of the multioutput regression model discussed in Sections 1 and 2, where each output signal y_d(t) = f_d(t) + ε_d(t) has mean function e^{−D_d t} f_{d,0} and each model kernel G_{d,q}(t − z) is equal to S_{d,q} e^{−D_d (t − z)}.
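For a single white noise force, the model kernel G_d(t − z) = S_d e^{−D_d(t − z)} gives the convolution integral defining Cov[f_d(t), f_{d′}(t′)] a simple closed form, which is the building block that turns this model into a special case of the multioutput GP of Section 2. The sketch below (an illustration with arbitrary parameter values) evaluates this cross-covariance.

```python
import numpy as np

def sde_cross_covariance(t, t2, S_d, S_dp, D_d, D_dp):
    """Cov[f_d(t), f_d'(t')] for eq. (8) driven by a single white noise force.

    With G_d(t - z) = S_d exp(-D_d (t - z)) and Cov[u(z), u(z')] = delta(z - z'),
    the double integral reduces to
    S_d S_d' exp(-D_d t - D_d' t') (exp((D_d + D_d') min(t, t')) - 1) / (D_d + D_d').
    """
    T, T2 = np.meshgrid(t, t2, indexing="ij")
    m = np.minimum(T, T2)
    return (S_d * S_dp * np.exp(-D_d * T - D_dp * T2)
            * (np.exp((D_d + D_dp) * m) - 1.0) / (D_d + D_dp))

t = np.linspace(0.0, 5.0, 6)
print(np.round(sde_cross_covariance(t, t, 1.0, 0.6, 0.8, 1.2), 3))
```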
The above model can also be viewed as a stochastic latent force model (SLFM), following the work of Álvarez et al. (2009).

Latent market forces

The application considered is the inference of missing data in a multivariate financial dataset: the foreign exchange rate w.r.t. the dollar of 10 of the top international currencies (Canadian Dollar [CAD], Euro [EUR], Japanese Yen [JPY], Great British Pound [GBP], Swiss Franc [CHF], Australian Dollar [AUD], Hong Kong Dollar [HKD], New Zealand Dollar [NZD], South Korean Won [KRW] and Mexican Peso [MXN]) and 3 precious metals (gold [XAU], silver [XAG] and platinum [XPT]).⁸ We considered all the data available for the calendar year of 2007 (251 working days). In this data there are several missing values: XAU, XAG and XPT have 9, 8 and 42 days of missing values respectively. On top of this, we also artificially introduced long sequences of missing data. Our objective is to model the data and test the effectiveness of the model by imputing these missing points. We removed a test set from the data by extracting contiguous sections from 3 currencies associated with very different geographic locations: we took days 50–100 from CAD, days 100–150 from JPY and days 150–200 from AUD. The remainder of the data comprised the training set, which consisted of 3051 points, with the test data containing 153 points. For preprocessing we removed the mean from each output and scaled the outputs so that they all had unit variance.

⁸ Data is available at http://fx.sauder.ubc.ca/data.html

It seems reasonable to suggest that the fluctuations of the 13 correlated financial time series are driven by a smaller number of latent market forces. We therefore modelled the data with up to six latent forces, which could be noise or smooth GPs. The GP priors for the smooth latent forces are assumed to have a squared exponential covariance function,

    k_q(t, t′) = (1/√(2π ℓ_q²)) exp( −(t − t′)² / (2 ℓ_q²) ),

where the hyperparameter ℓ_q is known as the lengthscale.

We present an example with Q = 4. For this value of Q, we consider all the possible combinations of Q_o and Q_s. The training was performed in all cases by maximizing the variational bound using the scaled conjugate gradient algorithm until convergence was achieved. The best performance in terms of achieving the highest value of F_V was obtained for Q_s = 1 and Q_o = 3. We compared against the LMC model for different numbers of latent functions in that framework. While our best model resulted in a standardized mean square error of 0.2795, the best LMC (with Q = 2) resulted in 0.3927. We plotted predictions from the latent market force model to characterize the performance when filling in missing data.

Figure 4: Panels: (a) CAD, (b) JPY and (c) AUD, real data and prediction. Predictions from the model with Q_s = 1 and Q_o = 3 are shown as solid lines for the mean and grey bars for error bars at 2 standard deviations. For CAD, JPY and AUD the data were artificially held out. The true values are shown as a dotted line. Crosses on the x-axes of all plots show the locations of the inducing inputs.
In Figure 4 we show the output signals obtained using the model with the highest bound (Q_s = 1 and Q_o = 3) for CAD, JPY and AUD. Note that the model performs better at capturing the deep drop in AUD than it does at capturing the fluctuations in CAD and JPY.

8. Conclusions

We have presented a variational approach to sparse approximations in convolution processes. Our main focus was to provide efficient mechanisms for learning in multiple output Gaussian processes when the latent function is fluctuating rapidly. In order to do so, we have introduced the concept of the inducing function, which generalizes the idea of the inducing point traditionally employed in sparse GP methods. The approach extends the variational approximation of Titsias (2009) to the multiple output case. Using our approach we can perform efficient inference on latent force models which are based around stochastic differential equations but also contain a smooth driving force. Our approximation is flexible and has been shown to be applicable to a wide range of datasets, including high-dimensional ones.

Acknowledgements

The authors would like to thank Edwin Bonilla for his valuable feedback with respect to the exam score prediction example and the compiler dataset example. We also thank the authors of Bonilla et al. (2008), who kindly made the compiler dataset available. DL has been partly financed by Comunidad de Madrid (project PRO-MULTIDIS-CM, S-0505/TIC/0233) and by the Spanish government (CICYT project TEC2006-13514-C02-01 and research grant JC2008-00219). MA and NL have been financed by a Google Research Award "Mechanistically Inspired Convolution Processes for Learning", and MA, NL and MT have been financed by EPSRC Grant No EP/F005687/1 "Gaussian Processes for Systems Identification with Applications in Systems Biology".

Appendix A. Variational Inducing Kernels

Recently, a method for variational sparse approximation for Gaussian process learning was introduced in Titsias (2009). In this appendix, we apply this methodology to a multiple output Gaussian process where the outputs have been generated through a so-called convolution process. For learning the parameters of the kernels involved, a lower bound on the true marginal likelihood can be maximized. This lower bound has a form similar to the marginal likelihood of the Deterministic Training Conditional (DTC) approximation plus an extra term which involves a trace operation. The computational complexity grows as O(NDK²), where N is the number of data points per output, D is the number of outputs and K is the number of inducing variables.

A.1 Computation of the lower bound

Given target data y and inputs X, the marginal likelihood of the original model is given by integrating over the latent functions,⁹

    p(y | X) = ∫ p(y | u, X) p(u) du.

⁹ Strictly speaking, the distributions associated with u correspond to random signed measures, in particular, Gaussian measures.

The prior over u is expressed as

    p(u) = ∫ p(u | λ) p(λ) dλ.

The augmented joint model can then be expressed as

    p(y, u, λ) = p(y | u) p(u | λ) p(λ).

With the inducing function λ, the marginal likelihood takes the form

    p(y | X) = ∫∫ p(y | u, X) p(u | λ) p(λ) dλ du.

Using Jensen's inequality, we obtain the following variational bound on the log likelihood,

    F_V(Z, Θ, φ(λ)) = ∫∫ q(u, λ) log [ p(y | u, X) p(u | λ) p(λ) / q(u, λ) ] dλ du,

where we have introduced q(u, λ) as the variational approximation to the posterior.
Following Titsias (2009), we now specify that the variational approximation should be of the form

    q(u, λ) = p(u | λ) φ(λ).

We can write our bound as

    F_V(Z, Θ, φ(λ)) = ∫ φ(λ) [ ∫ p(u | λ) log p(y | u) du + log ( p(λ) / φ(λ) ) ] dλ.

To compute this bound we first consider the integral

    log T(λ, y) = ∫ p(u | λ) log p(y | u) du.

Since this is simply the expectation of a Gaussian under a Gaussian, we can compute the result analytically as follows:

    log T(λ, y) = ∑_{d=1}^{D} ∫ p(u | λ) [ −(N/2) log 2π − (1/2) log |Σ| − (1/2) tr( Σ^{−1} ( y_d y_d^⊤ − 2 y_d f_d^⊤ + f_d f_d^⊤ ) ) ] du.

We need to compute E_{u|λ}[f_d] and E_{u|λ}[f_d f_d^⊤]. E_{u|λ}[f_d] is a vector with elements

    E_{u|λ}[f_d(x_n)] = ∑_{q=1}^{Q} ∫ G_{d,q}(x_n − z′) E_{u|λ}[u_q(z′)] dz′.

Assuming that the latent functions are independent GPs,

    E_{u|λ}[u_q(z′)] = E_{u_q|λ_q}[u_q(z′)] = k_{u_q λ_q}(z′, Z) K_{λ_q,λ_q}^{−1}(Z, Z) λ_q.

Then

    E_{u|λ}[f_d(x_n)] = ∑_{q=1}^{Q} k_{f_d λ_q}(x_n, Z) K_{λ_q λ_q}^{−1}(Z, Z) λ_q,

so that E_{u|λ}[f_d] can be expressed as E_{u|λ}[f_d] = K_{f_d λ} K_{λλ}^{−1} λ = α_d(X, λ) = α_d. On the other hand, E_{u|λ}[f_d f_d^⊤] is a matrix with elements

    E_{u|λ}[f_d(x_n) f_d(x_m)] = ∑_{q=1}^{Q} ∫ G_{d,q}(x_n − z) ∫ G_{d,q}(x_m − z′) E_{u|λ}[u_q(z) u_q(z′)] dz dz′ + α_d(x_n) α_d(x_m).

With independent GPs the term E_{u|λ}[u_q(z) u_q(z′)] can be expressed as

    E_{u|λ}[u_q(z) u_q(z′)] = k_{u_q u_q}(z, z′) − k_{u_q λ_q}(z, Z) K_{λ_q λ_q}^{−1}(Z, Z) k_{u_q λ_q}^⊤(z′, Z).

In this way

    E_{u|λ}[f_d f_d^⊤] = α_d α_d^⊤ + K_{f_d f_d} − K_{f_d λ} K_{λλ}^{−1} K_{λ f_d} = α_d α_d^⊤ + K̃_{dd},

with K̃_{dd} = K_{f_d f_d} − K_{f_d λ} K_{λλ}^{−1} K_{λ f_d}. The expression for log T(λ, y) is given as

    log T(λ, y) = log N(y | α, Σ) − (1/2) ∑_{d=1}^{D} tr( Σ^{−1} K̃_{dd} ).

The variational lower bound is now given as

    F_V(Z, Θ, φ) = ∫ φ(λ) log [ N(y | α, Σ) p(λ) / φ(λ) ] dλ − (1/2) ∑_{d=1}^{D} tr( Σ^{−1} K̃_{dd} ).    (9)

A free-form optimization over φ(λ) could now be performed, but it is far simpler to reverse Jensen's inequality on the first term; we then recover the value of the lower bound for the optimal φ(λ) without ever having to optimise φ(λ) explicitly. Reversing Jensen's inequality, we have

    F_V(Z, Θ) = log N(y | 0, K_{fλ} K_{λλ}^{−1} K_{λf} + Σ) − (1/2) ∑_{d=1}^{D} tr( Σ^{−1} K̃_{dd} ).

The form of φ(λ) which leads to this bound can be found as

    φ(λ) ∝ N(y | α, Σ) p(λ) = N(λ | Σ_{λ|y} K_{λλ}^{−1} K_{λf} Σ^{−1} y, Σ_{λ|y}) = N(λ | K_{λλ} A^{−1} K_{λf} Σ^{−1} y, K_{λλ} A^{−1} K_{λλ}),

with Σ_{λ|y} = ( K_{λλ}^{−1} + K_{λλ}^{−1} K_{λf} Σ^{−1} K_{fλ} K_{λλ}^{−1} )^{−1} = K_{λλ} A^{−1} K_{λλ} and A = K_{λλ} + K_{λf} Σ^{−1} K_{fλ}.

A.2 Predictive distribution

The predictive distribution for a new test point given the training data is also required. This can be expressed as

    p(y_∗ | y, X, Z) = ∫∫ p(y_∗ | u) q(u, λ) dλ du = ∫∫ p(y_∗ | u) p(u | λ) φ(λ) dλ du = ∫ p(y_∗ | u) [ ∫ p(u | λ) φ(λ) dλ ] du.

Using the Gaussian form for φ(λ) we can compute

    ∫ p(u | λ) φ(λ) dλ = ∫ N(u | k_{uλ} K_{λλ}^{−1} λ, k_{uu} − k_{uλ} K_{λλ}^{−1} k_{λu}) N(λ | K_{λλ} A^{−1} K_{λf} Σ^{−1} y, K_{λλ} A^{−1} K_{λλ}) dλ = N(u | k_{uλ} A^{−1} K_{λf} Σ^{−1} y, k_{uu} − k_{uλ} ( K_{λλ}^{−1} − A^{−1} ) k_{λu}).
This allows us to write the predictive distribution as

    p(y_∗ | y, X, Z) = ∫ N(y_∗ | f_∗, Σ_∗) N(u | μ_{u|λ}, Σ_{u|λ}) du = N(y_∗ | μ_{y_∗}, Σ_{y_∗}),

with μ_{y_∗} = K_{f_∗ λ} A^{−1} K_{λ f} Σ^{−1} y and Σ_{y_∗} = K_{f_∗ f_∗} − K_{f_∗ λ} ( K_{λλ}^{−1} − A^{−1} ) K_{λ f_∗} + Σ_∗.

A.3 Optimisation of the bound

Optimisation of the bound (with respect to the variational parameters and the parameters of the covariance functions) can be carried out through gradient based methods. We follow the notation of Brookes (2005), obtaining similar results to Lawrence (2007). This notation allows us to apply the chain rule for matrix derivatives in a straightforward manner. The resulting gradients can then be combined with gradients of the covariance functions with respect to their parameters to optimize the model. Let us define G: = vec G, where vec is the vectorization operator over the matrix G. For a function F_V(Z), the equivalence between ∂F_V(Z)/∂G and ∂F_V(Z)/∂G: is given through ∂F_V(Z)/∂G: = ( (∂F_V(Z)/∂G): )^⊤. The log-likelihood function is given as

    F_V(Z, Θ) ∝ −(1/2) log |Σ + K_{fλ} K_{λλ}^{−1} K_{λf}| − (1/2) tr[ ( Σ + K_{fλ} K_{λλ}^{−1} K_{λf} )^{−1} y y^⊤ ] − (1/2) tr( Σ^{−1} K̃ ),

where K̃ = K_{ff} − K_{fλ} K_{λλ}^{−1} K_{λf}. Using the matrix inversion lemma and its equivalent form for determinants, the above expression can be written, up to a constant, as

    F_V(Z, Θ) ∝ (1/2) log |K_{λλ}| − (1/2) log |A| − (1/2) log |Σ| − (1/2) tr[ Σ^{−1} y y^⊤ ] + (1/2) tr[ Σ^{−1} K_{fλ} A^{−1} K_{λf} Σ^{−1} y y^⊤ ] − (1/2) tr( Σ^{−1} K̃ ).

We can find ∂F_V(Z)/∂θ and ∂F_V(Z)/∂Z by applying the chain rule to F_V(Z, Θ), obtaining expressions for ∂F_V(Z)/∂K_{ff}, ∂F_V(Z)/∂K_{fλ} and ∂F_V(Z)/∂K_{λλ} and combining those with the relevant derivatives of the covariances with respect to Θ, Z and the parameters associated with the model kernels,

    ∂F/∂G: = (∂F_A/∂A:)(∂A:/∂G:) δ_{GK} + ∂F_G/∂G:,    (10)

where the subindex in F_E stands for those terms of F which depend on E, G is either K_{ff}, K_{λf} or K_{λλ}, and δ_{GK} is zero if G is equal to K_{ff} and one otherwise. For convenience we have used F ≡ F_V(Z, Θ). Next we present expressions for each partial derivative:

    ∂A:/∂Σ: = −( K_{λ,f} Σ^{−1} ⊗ K_{λ,f} Σ^{−1} ),
    ∂F_Σ/∂Σ: = −(1/2) ( Σ^{−1} H Σ^{−1} : )^⊤ + (1/2) ( Σ^{−1} K̃^⊤ Σ^{−1} : )^⊤,
    ∂A:/∂K_{λ,λ}: = I,
    ∂A:/∂K_{λ,f}: = ( K_{λ,f} Σ^{−1} ⊗ I ) + ( I ⊗ K_{λ,f} Σ^{−1} ) T_A,
    ∂F_{K_{f,f}}/∂K_{f,f}: = −(1/2) Σ^{−1}:,
    ∂F_A/∂A: = −(1/2) ( C: )^⊤,
    ∂F_{K_{λ,f}}/∂K_{λ,f}: = ( A^{−1} K_{λ,f} Σ^{−1} y y^⊤ Σ^{−1} : )^⊤ + ( K_{λ,λ}^{−1} K_{λ,f} Σ^{−1} : )^⊤,
    ∂F_{K_{λ,λ}}/∂K_{λ,λ}: = (1/2) ( K_{λ,λ}^{−1} : )^⊤ − (1/2) ( K_{λ,λ}^{−1} K_{λ,f} Σ^{−1} K_{f,λ} K_{λ,λ}^{−1} : )^⊤,

where C = A^{−1} + A^{−1} K_{λ,f} Σ^{−1} y y^⊤ Σ^{−1} K_{f,λ} A^{−1}, H = Σ − y y^⊤ + K_{f,λ} A^{−1} K_{λ,f} Σ^{−1} y y^⊤ + ( K_{f,λ} A^{−1} K_{λ,f} Σ^{−1} y y^⊤ )^⊤, and T_A is a vectorized transpose matrix (Brookes, 2005); we have not included its dimensions to keep the notation clear. We can substitute the above expressions into (10) to find the corresponding derivatives, so

    ∂F/∂K_{f,f}: = −(1/2) Σ^{−1}:.

We also have

    ∂F/∂K_{λ,f}: = −(1/2) ( C: )^⊤ [ ( K_{λ,f} Σ^{−1} ⊗ I ) + ( I ⊗ K_{λ,f} Σ^{−1} ) T_A ] + ( A^{−1} K_{λ,f} Σ^{−1} y y^⊤ Σ^{−1} : )^⊤ + ( K_{λ,λ}^{−1} K_{λ,f} Σ^{−1} : )^⊤ = ( [ −C K_{λ,f} Σ^{−1} + A^{−1} K_{λ,f} Σ^{−1} y y^⊤ Σ^{−1} + K_{λ,λ}^{−1} K_{λ,f} Σ^{−1} ] : )^⊤.
Finally, results for ∂F/∂K_{λ,λ}: and ∂F/∂Σ: are obtained as

    ∂F/∂K_{λ,λ}: = −(1/2) ( C: )^⊤ + (1/2) ( K_{λ,λ}^{−1} : )^⊤ − (1/2) ( K_{λ,λ}^{−1} K_{λ,f} Σ^{−1} K_{f,λ} K_{λ,λ}^{−1} : )^⊤,
    ∂F/∂Σ: = (1/2) ( Σ^{−1} ( K̃^⊤ − H ) Σ^{−1} : )^⊤ + (1/2) ( C: )^⊤ ( K_{λ,f} Σ^{−1} ⊗ K_{λ,f} Σ^{−1} ).

References

Mauricio Álvarez and Neil D. Lawrence. Sparse convolved Gaussian processes for multi-output regression. In NIPS, volume 21, pages 57–64. MIT Press, Cambridge, MA, 2009.

Mauricio Álvarez, David Luengo, and Neil D. Lawrence. Latent force models. In van Dyk and Welling (2009), pages 9–16.

Christopher M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, 2006.

Edwin V. Bonilla, Kian Ming Chai, and Christopher K. I. Williams. Multi-task Gaussian process prediction. In John C. Platt, Daphne Koller, Yoram Singer, and Sam Roweis, editors, NIPS, volume 20, Cambridge, MA, 2008. MIT Press.

Phillip Boyle and Marcus Frean. Dependent Gaussian processes. In Lawrence Saul, Yair Weiss, and Léon Bottou, editors, NIPS, volume 17, pages 217–224, Cambridge, MA, 2005. MIT Press.

Michael Brookes. The matrix reference manual. Available on-line, 2005. http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html.

Lehel Csató and Manfred Opper. Sparse representation for Gaussian process models. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, NIPS, volume 13, pages 444–450, Cambridge, MA, 2001. MIT Press.

Theodoros Evgeniou, Charles A. Micchelli, and Massimiliano Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615–637, 2005.
Pei Gao, Antti Honkela, Magnus Rattray, and Neil D. Lawrence. Gaussian process modelling of latent chemical species: Applications to inferring transcription factor activities. Bioinformatics, 24:i70–i75, 2008. doi: 10.1093/bioinformatics/btn278.

Pierre Goovaerts. Geostatistics for Natural Resources Evaluation. Oxford University Press, USA, 1997.

David M. Higdon. Space and space-time modelling using process convolutions. In C. Anderson, V. Barnett, P. Chatwin, and A. El-Shaarawi, editors, Quantitative Methods for Current Environmental Issues, pages 37–56. Springer-Verlag, 2002.

Andre G. Journel and Charles J. Huijbregts. Mining Geostatistics. Academic Press, London, 1978. ISBN 0-12391-050-1.

Neil D. Lawrence. Learning for larger datasets with the Gaussian process latent variable model. In Marina Meila and Xiaotong Shen, editors, AISTATS 11, San Juan, Puerto Rico, 21–24 March 2007. Omnipress.

Neil D. Lawrence, Matthias Seeger, and Ralf Herbrich. Fast sparse Gaussian process methods: The informative vector machine. In Sue Becker, Sebastian Thrun, and Klaus Obermayer, editors, NIPS, volume 15, pages 625–632, Cambridge, MA, 2003. MIT Press.

Neil D. Lawrence, Guido Sanguinetti, and Magnus Rattray. Modelling transcriptional regulation using Gaussian processes. In Bernhard Schölkopf, John C. Platt, and Thomas Hofmann, editors, NIPS, volume 19, pages 785–792. MIT Press, Cambridge, MA, 2007.

Miguel Lázaro-Gredilla and Aníbal Figueiras-Vidal. Inter-domain Gaussian processes for sparse inference using inducing features. In NIPS, volume 22, pages 1087–1095. MIT Press, Cambridge, MA, 2010.

Michael A. Osborne, Alex Rogers, Sarvapali D. Ramchurn, Stephen J. Roberts, and Nicholas R. Jennings. Towards real-time information processing of sensor network data using computationally efficient multi-output Gaussian processes. In Proceedings of the International Conference on Information Processing in Sensor Networks (IPSN 2008), 2008.

Joaquin Quiñonero Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005.

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006. ISBN 0-262-18253-X.

Matthias Seeger, Christopher K. I. Williams, and Neil D. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Christopher M. Bishop and Brendan J. Frey, editors, Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, Key West, FL, 3–6 January 2003.

Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Yair Weiss, Bernhard Schölkopf, and John C. Platt, editors, NIPS, volume 18, Cambridge, MA, 2006. MIT Press.

Yee Whye Teh, Matthias Seeger, and Michael I. Jordan. Semiparametric latent factor models. In Robert G. Cowell and Zoubin Ghahramani, editors, AISTATS 10, pages 333–340, Barbados, 6–8 January 2005. Society for Artificial Intelligence and Statistics.

Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In van Dyk and Welling (2009), pages 567–574.

David van Dyk and Max Welling, editors. AISTATS, Clearwater Beach, Florida, 16–18 April 2009. JMLR W&CP 5.