Technical Report, City University of Hong Kong (submitted for review on July 12, 2013)

On Approximate Inference for Generalized Gaussian Process Models

Lifeng Shang (lshang@cityu.edu.hk), Department of Computer Science, City University of Hong Kong
Antoni B. Chan (abchan@cityu.edu.hk), Department of Computer Science, City University of Hong Kong

Abstract

A generalized Gaussian process model (GGPM) is a unifying framework that encompasses many existing Gaussian process (GP) models, such as GP regression, classification, and counting. In the GGPM framework, the observation likelihood of the GP model is itself parameterized using the exponential family distribution (EFD). In this paper, we consider efficient algorithms for approximate inference on GGPMs using the general form of the EFD. A particular GP model and its associated inference algorithms can then be formed by changing the parameters of the EFD, thus greatly simplifying its creation for task-specific output domains. We demonstrate the efficacy of this framework by creating several new GP models for regressing to non-negative reals and to real intervals. We also consider a closed-form Taylor approximation for efficient inference on GGPMs, and elaborate on its connections with other model-specific heuristic closed-form approximations. Finally, we present a comprehensive set of experiments to compare approximate inference algorithms on a wide variety of GGPMs.

Keywords: Gaussian processes, Bayesian generalized linear models, non-parametric regression, exponential family, approximate inference

1. Introduction

In recent years, Gaussian processes (GPs) (Rasmussen and Williams, 2006), a non-parametric Bayesian approach to regression and classification, have been gaining popularity in machine learning and computer vision. For example, recent work by Kapoor et al.
(2010) has demonstrated promising results on object classification using GP classification and active learning. GPs have several properties that are desirable for solving complex regression tasks, such as those found in computer vision. First, due to the Bayesian formulation, GPs can be learned robustly from small training sets, which is important in tasks where the amount of training data is sparse compared to the dimension of the model (e.g., large-scale object recognition, tracking, 3d human pose modeling). Second, GP regression produces a predictive distribution, not just a single predicted value, thus providing a probabilistic approach to judging confidence in the predictions, e.g., for active learning. Third, GPs are based on kernel functions between the input examples, which allows for both a diverse set of image representations (e.g., bag-of-words, local-feature descriptors), and incorporation of prior knowledge about the computer vision task (e.g., modeling object structure). Finally, in the GP framework, the kernel hyperparameters can be learned by maximizing the marginal likelihood, or evidence, of the training data. This is typically more efficient than standard cross-validation (which requires a grid search), and allows for more expressive kernels, e.g., compound kernels that model different trends in the data, or multiple kernel learning, where features are optimally combined by weighting the kernel function of each feature.
Because of these advantages, GP regression and classification have been applied to many computer vision problems, such as object classification (Kapoor et al., 2010), human action recognition (Han et al., 2009), age estimation (Zhang and Yeung, 2010), eye-gaze recognition (Noris et al., 2008), tracking (Raskin et al., 2007), counting people (Chan et al., 2008; Chan and Vasconcelos, 2009), crowd flow modeling (Ellis et al., 2009), anomaly detection (Loy et al., 2009), stereo vision (Williams, 2006; Sinz et al., 2004), interpolation of range data (Plagemann et al., 2007), non-rigid shape recovery (Zhu et al., 2009), 3d human pose recovery (Bo and Sminchisescu, 2010; Urtasun and Darrell, 2008; Fergie and Galata, 2010; Zhao et al., 2008), and latent-space models of 3d human pose (Urtasun et al., 2005; Wang et al., 2008; Chen et al., 2009). However, despite their successes, many of these methods attempt to "shoe-horn" their computer vision task into the standard GP regression framework. In particular, while the standard GP regresses a continuous real-valued function, it is often used to predict discrete non-negative integers (crowd counts, Chan et al., 2008, or age, Zhang and Yeung, 2010), non-negative real numbers (disparity, Williams, 2006; Sinz et al., 2004, or depth, Plagemann et al., 2007), and real numbers on a fixed interval (pose angles, Bo and Sminchisescu, 2010; Urtasun and Darrell, 2008; Fergie and Galata, 2010; Zhao et al., 2008, or squashed optical flow, Loy et al., 2009). Hence, heuristics are often required to convert the real-valued GP prediction to a valid task-specific output, which is not optimal in the Bayesian setting. For example, in Chan et al. (2008), the real-valued GP prediction must be truncated and rounded to generate a proper count prediction, and it is not obvious how the predictive distribution over real values can be converted to one over counts.
Developing a new GP model for each of the above regression tasks requires first finding a suitable distribution for the output variable (e.g., Poisson distribution for counting numbers, Gamma distribution for positive reals, Beta distribution for a real interval), and then deriving an approximate inference algorithm. This task can be simplified considerably with recourse to a unifying framework, which we call a generalized Gaussian process model (GGPM) (Chan and Dong, 2011). The GGPM is inspired by the generalized linear model (GLM) (McCullagh and Nelder, 1989), which aims to consolidate parametric regression methods (e.g., least-squares regression, Poisson regression, logistic regression) into a unifying framework. Similarly, the GGPM unifies many Bayesian non-parametric regression methods using GP priors (e.g., GP regression, GP classification, and GP counting) through the exponential family.

With the GGPM, the observation likelihood of the output is itself parameterized using the generic form of the exponential family distribution. Approximate inference algorithms can then be derived that depend only on these EFD parameter functions, elucidating the terms (e.g., derivatives, moments) needed for each algorithm. Different GP models are then created by simply changing the parameters of the likelihood function, thus easing the development of new GP models for task-specific output domains. Note that this is analogous to GLMs (McCullagh and Nelder, 1989), where a common iteratively reweighted least-squares (IRLS) algorithm was derived to estimate all associated regression models.

This paper is intended both to survey existing GP regression models, and to develop a unifying regression framework for GP models and its associated approximate inference algorithms.
Besides further formalizing the GGPM framework, the contributions of this paper are four-fold: 1) we derive a closed-form approximate inference method for GGPMs, based on a Taylor approximation, and show that model-specific closed-form approximations from Kapoor et al. (2010); Chan and Vasconcelos (2009); Heikkinen et al. (2008) are special cases; 2) we analyze existing approximate inference algorithms (Laplace approximation, expectation propagation, variational approximations) using the generic form of the exponential family distribution; 3) using the GGPM framework, we propose several new GP models for regressing to non-negative and interval real outputs; 4) we conduct comprehensive experiments comparing the efficacy of the approximate inference algorithms on both synthetic and real data sets.

The remainder of the paper is organized as follows. In Section 2, we first review Gaussian process regression and related work. In Section 3, we introduce the GGPM framework, and in Section 4, we discuss existing and novel GP models within the GGPM framework. In Section 5, we derive efficient approximate inference algorithms. Next, we compare approximate posteriors in Section 6, and discuss initialization strategies for hyperparameter estimation in Section 7. Finally, in Section 8, we present experiments to compare the approximate inference algorithms, as well as demonstrate the efficacy of the new proposed models.

2. Gaussian processes and related work

In this section we review Gaussian process regression and other related work.

2.1 Gaussian process regression

Gaussian process regression (GPR) (Rasmussen and Williams, 2006) is a Bayesian approach to predicting a real-valued function f(x) of an input vector x ∈ ℝ^d (also known as the regressor or explanatory variable).
The function value is observed through a noisy observation (or measurement or output) y ∈ ℝ,

y = f(x) + ε,   (1)

where ε is zero-mean Gaussian noise with variance σ_n², i.e., ε ~ N(0, σ_n²). A zero-mean Gaussian process (GP) prior is placed on the function, yielding the GPR model

f ~ GP(0, k(x, x′)),   y | f(x) ~ N(f(x), σ_n²).   (2)

A GP is a random process that represents a distribution over functions, and is completely specified by its mean and covariance functions, m(x) and k(x, x′). For simplicity, we assume the mean function is zero. The covariance (or kernel) function k(x, x′) determines the class of functions that f can represent (e.g., linear, polynomial, etc.). Given any set of input vectors X = [x_1, ..., x_n], the GP specifies that the corresponding function values f = [f(x_1), ..., f(x_n)]ᵀ are jointly Gaussian, f | X ~ N(0, K), where K is the covariance (kernel) matrix with entries k(x_i, x_j).

The function f is estimated from a training set of input vectors X = [x_1, ..., x_n] and corresponding noisy observations y = [y_1, ..., y_n]ᵀ. First, given the inputs and noisy outputs {X, y}, the posterior distribution of the corresponding function values f is obtained with Bayes' rule,

p(f | X, y) = p(y | f) p(f | X) / ∫ p(y | f) p(f | X) df,   (3)

where p(y | f) = ∏_{i=1}^n p(y_i | f_i) is the observation likelihood, and the denominator is the marginal likelihood, p(y | X) = ∫ p(y | f) p(f | X) df.
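For the Gaussian likelihood, every term in (3) is Gaussian, so the posterior, the predictive distribution, and the evidence all have closed forms (given as (6)-(9) below). As a minimal illustrative sketch (our own example, not from the paper), the following NumPy snippet computes the predictive mean and variance and the log evidence; the RBF kernel, function names, and hyperparameter values are assumptions made for the illustration:

```python
import numpy as np

def rbf(A, B, ell=1.0, sf=1.0):
    """Squared-exponential kernel k(x, x') = sf^2 exp(-||x - x'||^2 / (2 ell^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf**2 * np.exp(-0.5 * d2 / ell**2)

def gpr_predict(X, y, Xs, sn2=0.1):
    """Closed-form GPR: predictive mean/variance and log marginal likelihood."""
    A = rbf(X, X) + sn2 * np.eye(len(X))         # K + sigma_n^2 I
    alpha = np.linalg.solve(A, y)
    ks = rbf(X, Xs)                              # n x m cross-covariances k_*
    mu = ks.T @ alpha                            # predictive mean
    var = rbf(Xs, Xs).diagonal() - np.einsum(
        "ij,ij->j", ks, np.linalg.solve(A, ks))  # predictive variance
    _, logdet = np.linalg.slogdet(A)
    evidence = -0.5 * (y @ alpha + logdet + len(X) * np.log(2 * np.pi))
    return mu, var, evidence

# Noisy observations of a smooth function: with small noise variance,
# the posterior mean nearly interpolates the training outputs.
X = np.linspace(0, 5, 20)[:, None]
y = np.sin(X).ravel()
mu, var, ev = gpr_predict(X, y, X, sn2=1e-4)
```

Solving against the Cholesky-factorizable matrix K + σ_n²I (rather than inverting it) is the standard numerically stable way to evaluate these expressions.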
To predict a function value f* = f(x*) from a novel input x*, the posterior distribution in (3) is marginalized to obtain the predictive distribution (i.e., an average over all possible latent function values),

p(f* | X, x*, y) = ∫ p(f* | f, X, x*) p(f | X, y) df.   (4)

Finally, the distribution of the predicted noisy observation y* is obtained by marginalizing over f*,

p(y* | X, x*, y) = ∫ p(y* | f*) p(f* | X, x*, y) df*.   (5)

Since the observation likelihood and posterior are both Gaussian, the predictive distributions in (4, 5) are both Gaussian, with parameters that can be computed in closed form (Rasmussen and Williams, 2006),

µ* = k*ᵀ (K + σ_n² I)⁻¹ y,   σ*² = k** − k*ᵀ (K + σ_n² I)⁻¹ k*,   (6)
p(f* | X, x*, y) = N(f* | µ*, σ*²),   p(y* | X, x*, y) = N(y* | µ*, σ*² + σ_n²),   (7)

where k* = [k(x*, x_1), ..., k(x*, x_n)]ᵀ and k** = k(x*, x*). The hyperparameters α of the kernel function and the observation noise σ_n² are typically estimated by maximizing the marginal likelihood, or evidence, of the training data (also called Type-II maximum likelihood),

{α̂, σ̂_n²} = argmax_{α, σ_n²} log p(y | X),   (8)
log p(y | X) = −½ yᵀ (K + σ_n² I)⁻¹ y − ½ log |K + σ_n² I| − (n/2) log 2π.   (9)

The marginal likelihood measures the data fit, averaged over all probable functions. Hence, the kernel hyperparameters are selected so that each probable latent function will model the data well.

2.2 GP classification and other GP models

For Gaussian process classification (GPC) (Nickisch and Rasmussen, 2008; Kuss and Rasmussen, 2005; Rasmussen and Williams, 2006), a GP prior is again placed on the function f, which is then "squashed" through a sigmoid function to obtain the probability of the class y ∈ {0, 1},

f ~ GP(0, k(x, x′)),   p(y = 1 | f(x)) = σ(f(x)),   (10)

where σ(f) is a sigmoid function, e.g.
the logistic or probit function, which maps a real number to the range [0, 1]. However, since the observation likelihood is no longer Gaussian, computing the posterior and predictive distributions in (3, 4, 5) is no longer analytically tractable. This has led to the development of several approximate inference algorithms for GPC, such as Markov-chain Monte Carlo (MCMC) (Nickisch and Rasmussen, 2008), variational bounds (Gibbs and Mackay, 2000), Laplace approximation (Williams and Barber, 1998), and expectation propagation (EP) (Minka, 2001; Rasmussen and Williams, 2006). As an alternative to approximate inference, the classification task itself can be approximated as a GP regression problem, where the observations are set to

Table 1: Bayesian regression methods using GP priors. Abbreviations (inference): MCMC (Markov-chain Monte Carlo), LA (Laplace approximation), EP (expectation propagation), KLD (KL-divergence minimization), VB (variational bounds), TA (Taylor approximation).

- regression (Gaussian likelihood, Y = ℝ): exact inference, RW06.
- classification / label regression (logit/probit, Y = {0, 1}): MCMC N97; LA WB98; EP M01, NR08; VB GM97, NR08; TA KG06.
- robust regression (Laplace, Y = ℝ): EP K06; KLD MA09; VB RN10†.
- counting (Poisson, Y = ℤ₊): MCMC D98, VV07, S11; LA V10; EP V10; KLD #; TA CV09 (BPR).
- counting (COM-Poisson, Y = ℤ₊): LA #; EP #; KLD #; TA #.
- robust counting (negative binomial, Y = ℤ₊): MCMC S11; LA V11†; EP V11†; KLD #; TA #.
- occurrence (binomial, Y = ℤ_N⁺): MCMC PS04; LA V11†; EP V11†; KLD #; TA #.
- range regression (beta, Y = [0, 1]): LA #; EP #; KLD #; TA #.
- non-negative regression (Gamma, Y = ℝ₊): LA #; EP #; KLD #; TA #.
- non-negative regression (inverse Gaussian, Y = ℝ₊): LA #; EP #; KLD #; TA #.
- robust regression (Student-t, Y = ℝ): MCMC N97; LA V09; EP J11, K06, RN10†; KLD K06, RN10, MA09∗.
Citations: CV09 (Chan and Vasconcelos, 2009), D98 (Diggle et al., 1998), GM97 (Gibbs and Mackay, 2000), J11 (Jylänki et al., 2011), K06 (Kuss, 2006), KG06 (Kim and Ghahramani, 2006), M01 (Minka, 2001), MA09 (Manfred and Archambeau, 2009), N97 (Neal, 1997), NR08 (Nickisch and Rasmussen, 2008), PS04 (Paciorek and Schervish, 2004), RN10 (Rasmussen and Nickisch, 2010), RW06 (Rasmussen and Williams, 2006), S11 (Savitsky et al., 2011), V09 (Vanhatalo et al., 2009), V10 (Vanhatalo et al., 2010), V11 (Vanhatalo et al., 2011), VV07 (Vanhatalo and Vehtari, 2007), WB98 (Williams and Barber, 1998). Other: ℤ_N⁺ = {0, 1, 2, ..., N}; BPR (Bayesian Poisson regression); †: part of a software toolbox; ∗: considered the Cauchy distribution (Student-t with 1 d.o.f.); #: introduced in this paper by our model.

y ∈ {−1, +1} and standard GPR is applied. This is a computationally efficient alternative called label regression (Nickisch and Rasmussen, 2008) (or least-squares classification in Rasmussen and Williams, 2006, and also discussed in Tresp, 2000, as a "fast version" for two-class classification), and has shown promising results in object recognition (Kapoor et al., 2010).

GPR has been extended in several ways for other regression tasks with univariate outputs. Robust GP regression can be obtained by replacing the Gaussian observation likelihood with the Laplace or Cauchy likelihood (Manfred and Archambeau, 2009), or with a Student-t likelihood (Jylänki et al., 2011). Paciorek and Schervish (2004) uses a binomial likelihood to model event occurrence data. In spatial statistics, the kriging method was developed for interpolating spatial data, and essentially uses the same model as GP regression. Diggle et al. (1998); Vanhatalo and Vehtari (2007); Vanhatalo et al. (2010) extend this framework for modeling counting observations, by assuming a Poisson observation likelihood and a GP spatial prior.
Similarly, Chan and Vasconcelos (2009) develops a method for Bayesian Poisson regression, by applying a Gaussian prior on the linear weights of Poisson regression. Using a log-gamma approximation and kernelizing yields a closed-form solution that resembles GPR with specific output-dependent noise. Finally, Savitsky et al. (2011); Savitsky and Vannucci (2010) presents a Bayesian formulation of the Cox hazard model, by replacing the linear covariate with a GP prior, and also studies GP counting models using the Poisson and negative binomial likelihoods, in the context of Bayesian variable selection. Table 1 summarizes the previous work on Bayesian regression models using GP priors, along with the approximate inference algorithms proposed for them. The goal of this paper is to generalize many of these models into a unified framework.

Finally, GP models have also been extended to model multivariate observations, i.e., vector outputs. Chu and Ghahramani (2005) proposes GP ordinal regression (i.e., ranking) using a multi-probit likelihood, while multiclass classification is obtained using a probit (Girolami and Rogers, 2006; Kim and Ghahramani, 2006) or softmax (Williams and Barber, 1998) sigmoid function. Teh et al. (2005) linearly mixes independent GP priors to obtain a semiparametric latent factor model. In this paper, we only consider univariate outputs with a single GP prior; extending the GGPM to multivariate outputs is a topic of future work.

2.3 Related work on GGPMs

Previous works on GGPMs include Diggle et al. (1998), which focuses on geostatistics (extending kriging) using Poisson- and binomial-GGPMs, Paciorek and Schervish (2004), which uses a binomial-GGPM in the context of testing non-stationary covariance functions, and Savitsky et al. (2011), which is mainly interested in variable selection by adding priors to the kernel hyperparameters of a GGPM.
All these works (Diggle et al., 1998; Paciorek and Schervish, 2004; Savitsky et al., 2011) perform inference using MCMC, by plugging in different likelihood functions without exploiting the exponential family form. MCMC tends to be slow, and convergence problems were observed in Diggle et al. (1998). In contrast, this paper focuses on efficient algorithms for approximate inference, and derives their general forms by exploiting the exponential family form. By doing so, we can create a "plug-and-play" aspect to GP models, which we exploit later to create several novel GP models with very little extra work. Seeger (2004); Tresp (2000); Shi and Choi (2011) also briefly mention the connection between the assumed output likelihoods of GPR/GPC and the exponential family, but do not study approximate inference or new models in depth.

The GGPM can be interpreted as a Bayesian approach to generalized linear models (GLMs) (McCullagh and Nelder, 1989), where a Gaussian prior is placed on the linear weights (or equivalently a GP prior with a linear kernel is placed on the systematic component, i.e., the latent function). Previous works on Bayesian GLMs consider different priors. A typical approach is to form a Bayesian hierarchical GLM by applying a conjugate prior on the parameters (e.g., Albert, 1988; Das and Dey, 2007; Bedrick et al., 1996). Recent work focuses on inducing sparsity in the latent function; e.g., Seeger et al. (2007) and Nickisch and Seeger (2009) assume a factorial heavy-tailed prior distribution, but are not kernelizable due to the factorial assumption. Hannah et al. (2011) proposes a mixture of GLMs, based on a Dirichlet process, to allow different regression parameters in different areas of the input space, and performs inference using MCMC. When used with a non-linear kernel, the GGPM is a Bayesian kernelized GLM for non-linear regression. Zhang et al.
(2010) also proposes a Bayesian kernelized GLM, using a hierarchical model with a sparse prior (a mixture of a point mass and Silverman's g-prior), and evaluates their models on classification problems. Finally, Cawley et al. (2007) proposes a non-Bayesian version of a GLM, called a generalised kernel machine (GKM), which is based on kernelizing the iteratively reweighted least-squares (IRWLS) algorithm. The GGPM is a Bayesian formulation of the GKM.

With respect to our previous work (Chan and Dong, 2011), this paper presents more algorithms for approximate inference (e.g., variational methods), and in more depth. We also propose novel GP models for regression to non-negative reals and real intervals, and show more connections with heuristic methods using the Taylor approximation. Furthermore, comprehensive experiments are presented to compare the performance of the inference algorithms on a wide variety of likelihood functions.

3. Generalized Gaussian process models

In this section, we introduce the generalized Gaussian process model, a non-parametric Bayesian regression model that encompasses many existing GP models.

3.1 Exponential family distributions

We first note that different GP models are obtained by changing the form of the observation likelihood p(y | f). The standard GPR assumes a Gaussian observation likelihood, while GPC essentially uses a Bernoulli distribution, and Chan and Vasconcelos (2009) uses a Poisson likelihood for counting.
These likelihood functions are all instances of the single-parameter exponential family distribution (Duda et al., 2001), with likelihood given by

p(y | θ, φ) = exp{ (1/a(φ)) [T(y)θ − b(θ)] + c(φ, y) },   (11)

where y ∈ 𝒴 is the observation from a set of possible values 𝒴 (e.g., real numbers, counting numbers, binary class labels), θ is the natural parameter of the exponential family distribution, and φ is the dispersion parameter. T(y) is the sufficient statistic (e.g., the logit function for the beta distribution), a(φ) and c(φ, y) are known functions, and b(θ) = log ∫ exp( (1/a(φ)) T(y)θ + c(φ, y) ) dy is the log-partition function, which normalizes the distribution. The mean and variance of the sufficient statistic T(y) are functions of b(θ) and a(φ),

µ = E[T(y) | θ] = ḃ(θ),   var[T(y) | θ] = b̈(θ) a(φ),   (12)

where ḃ(θ) and b̈(θ) are the first and second derivatives of b w.r.t. θ. The exponential family generalizes a wide variety of distributions for different output domains, and hence a unifying framework can be created by analyzing a GP model where the likelihood takes the generic form of (11).

3.2 Generalized Gaussian process models

We now consider a framework for a generic Bayesian model that regresses from inputs x ∈ ℝ^d to outputs y ∈ 𝒴, which encompasses many popular GP models. Following the formulation of GLMs (McCullagh and Nelder, 1989), the model is composed of three components:

1. a latent function, η(x) ~ GP(0, k(x, x′)), which is a function of the inputs, modeled with a GP prior;
2. a random component, p(y | θ, φ), that models the output as an exponential family distribution with parameter θ and dispersion φ;
3. a link function, η = g(µ), that relates the mean of the sufficient statistic with the latent function.

Formally, the GGPM is specified by

η(x) ~ GP(0, k(x, x′)),   y ~ p(y | θ, φ),   g(E[T(y) | θ]) = η(x).
(13)

The mean of the sufficient statistic is related to the latent function η(x) through the inverse-link function, i.e., µ = g⁻¹(η(x)). The advantage of the link function is that it allows direct specification of prior knowledge about the functional relationship between the output mean and the latent function η(x). On the other hand, the effect of the GP kernel function is to adaptively warp (or completely override) the link function to fit the data. While many trends can be represented by the GP kernel function (e.g., polynomial functions), it is important to note that some functions (e.g., logarithms) cannot be naturally represented by a kernel function, due to its positive-definite constraint. Hence, directly specifying the link function is necessary for these cases.

Substituting (12) for the mean, we obtain the parameter θ as a function of the latent function,

η(x) = g(E[T(y) | θ]) = g(ḃ(θ))  ⇒  g⁻¹(η(x)) = ḃ(θ)  ⇒  θ(η(x)) = ḃ⁻¹(g⁻¹(η(x))),   (14)

where ḃ⁻¹ is the inverse of the first derivative of b(·). Using (14), another form of the GGPM that directly relates the latent function with the parameter is

η(x) ~ GP(0, k(x, x′)),   y ~ p(y | θ(η(x)), φ),   θ(η(x)) = ḃ⁻¹(g⁻¹(η(x))).   (15)

The GGPM unifies many Bayesian regression methods using GP priors (e.g., all of Table 1 except the Student-t distribution), with each model arising from a specific instantiation of the parameter functions, ℰ = {a(φ), b(θ), c(φ, y), θ(η), T(y)}. Given a set of training examples and a novel input, the predictive distribution is obtained by marginalizing over the posterior of the latent function η(x), similar to standard GPR/GPC (Rasmussen and Williams, 2006). The dispersion φ is treated as a hyperparameter, which can be estimated along with the kernel hyperparameters by maximizing the marginal likelihood.
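To make the generic form concrete, the following illustrative Python check (our own example, not from the paper) instantiates (11) for the Poisson likelihood, with T(y) = y, a(φ) = 1, b(θ) = e^θ, c(φ, y) = −log y!, and natural parameter θ = log λ, and confirms both that it reproduces the usual Poisson log-pmf and that the mean identity (12), E[T(y)] = ḃ(θ) = e^θ = λ, holds:

```python
import math

def efd_loglik(y, theta, a_phi, b, c):
    """Generic single-parameter EFD log-likelihood of eq. (11), assuming T(y) = y."""
    return (y * theta - b(theta)) / a_phi + c(y)

lam = 3.5                  # Poisson rate (an arbitrary value for the check)
theta = math.log(lam)      # natural parameter theta = log(lambda)
log_c = lambda y: -math.lgamma(y + 1)   # c(phi, y) = -log(y!)

# (11) with the Poisson parameter functions equals the direct Poisson log-pmf.
for y in range(8):
    lp = efd_loglik(y, theta, 1.0, math.exp, log_c)
    direct = y * math.log(lam) - lam - math.lgamma(y + 1)
    assert abs(lp - direct) < 1e-12

# (12): the mean of T(y) = y equals b_dot(theta) = exp(theta) = lambda.
mean = sum(y * math.exp(efd_loglik(y, theta, 1.0, math.exp, log_c))
           for y in range(200))
assert abs(mean - lam) < 1e-9
```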
3.3 Canonical link function

One common choice is to select g(·) such that θ(η(x)) = η(x). This is called the canonical link function, and is obtained with g(·) = ḃ⁻¹(·) and g⁻¹(·) = ḃ(·). For the canonical link function, the GGPM simplifies to

η(x) ~ GP(0, k(x, x′)),   y ~ p(y | θ = η(x), φ),   (16)

and the output mean is related to the latent function by µ = ḃ(η(x)).

3.4 Inference and prediction

Inference on GGPMs follows closely that of standard GPR/GPC (Rasmussen and Williams, 2006). Given a set of training examples, input vectors X = [x_1, ..., x_n] and corresponding observations y = [y_1, ..., y_n]ᵀ, the goal is to generate a predictive distribution of the output y* corresponding to a novel input x*. The predictive distribution is obtained in two steps. The first step is to calculate the posterior distribution of the latent function values η, which best explains the training examples {X, y}. This corresponds to the training or parameter-estimation phase of a regression model. In the second step, the distribution of the latent function value η(x*) for the novel input x* is calculated, followed by the predictive distribution for y*.

Formally, given the training inputs X = [x_1, ..., x_n], the latent values η_i = η(x_i) are jointly Gaussian, according to the GP prior, p(η | X) = N(η | 0, K), where η = [η_1, ..., η_n]ᵀ and [K]_{i,j} = k(x_i, x_j) is the kernel matrix.
The posterior distribution of η is obtained by further conditioning on the training outputs y, and applying Bayes' rule,

p(η | X, y) = p(y | θ(η)) p(η | X) / p(y | X),   (17)

where p(y | θ(η)) is the observation likelihood¹, and p(y | X) is the marginal likelihood, or evidence,

p(y | X) = ∫ p(y | θ(η)) p(η | X) dη.   (18)

The posterior p(η | X, y) in (17) is the distribution of the latent values for all inputs X that can best describe the provided input/output pairs. Next, the predictive distribution of y* for a novel input x* is obtained by first predicting the latent function value η* = η(x*). Conditioned on all the inputs {X, x*}, the joint distribution of the latent function values {η, η*} is also Gaussian, via the GP prior, and the conditional distribution is obtained using the conditional Gaussian theorem,

p(η* | η, X, x*) = N(η* | k*ᵀ K⁻¹ η, k** − k*ᵀ K⁻¹ k*),   (19)

where k* = [k(x_1, x*), ..., k(x_n, x*)]ᵀ and k** = k(x*, x*). The predictive distribution of η* is then obtained by marginalizing over the posterior distribution in (17) (i.e., averaging over all possible latent functions),

p(η* | X, x*, y) = ∫ p(η* | η, X, x*) p(η | X, y) dη.   (20)

Finally, the predictive distribution of y* is obtained by marginalizing over η*,

p(y* | X, x*, y) = ∫ p(y* | θ(η*)) p(η* | X, x*, y) dη*.   (21)

Note that for most non-Gaussian likelihoods, the posterior and predictive distributions in (17, 18, 20, 21) cannot be computed analytically in closed form. We will discuss efficient approximate inference algorithms in Section 5.
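In practice, (21) is usually evaluated after a Gaussian approximation N(η* | m, s²) to (20) has been computed by one of the approximate inference algorithms. As an illustrative sketch (our own example; the function names, the Gaussian approximation, and the choice of a Poisson likelihood with canonical link θ(η) = η are all assumptions), the remaining one-dimensional integral can then be handled by Gauss-Hermite quadrature:

```python
import math
import numpy as np

def poisson_loglik(y, eta):
    """log p(y | theta(eta)) for the Poisson-GGPM with canonical link theta = eta."""
    return y * eta - math.exp(eta) - math.lgamma(y + 1)

def predictive_pmf(y, m, s, n_nodes=40):
    """Approximate p(y* | X, x*, y) = integral of p(y*|theta(eta)) N(eta | m, s^2) d eta
    by Gauss-Hermite quadrature, given a Gaussian approximation to eq. (20)."""
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    etas = m + math.sqrt(2.0) * s * x            # change of variables for N(m, s^2)
    vals = np.exp([poisson_loglik(y, e) for e in etas])
    return float(np.dot(w, vals) / math.sqrt(math.pi))

# Sanity check: the predictive pmf sums to ~1 over the counts.
total = sum(predictive_pmf(y, m=1.0, s=0.3) for y in range(60))
assert abs(total - 1.0) < 1e-6
```

Because the integrand is a smooth likelihood weighted by a Gaussian, a few dozen quadrature nodes are typically sufficient; the same trick applies to any GGPM likelihood by swapping in its log-likelihood.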
3.5 Learning the hyperparameters

As in GPR (Rasmussen and Williams, 2006), the kernel hyperparameters α and the dispersion φ can be estimated from the data by maximizing the marginal likelihood in (18),

{α*, φ*} = argmax_{α, φ} ∫ p(y | η, φ) p(η | X, α) dη,   (22)

where we now note the dependence on the hyperparameters. The marginal likelihood measures the data fit, averaged over all probable latent functions. Hence, the criterion selects the kernel hyperparameters such that each probable latent function will model the data well.

4. Example GGPMs

In this section, using the GGPM framework, we propose several novel models for GP regression on different output domains. We obtain Bayesian regression for non-negative real outputs by adopting the Gamma and inverse-Gaussian distributions, and propose a Beta-GGPM that regresses to real numbers on the [0, 1] interval. We also consider existing GP models within the GGPM framework (e.g., Poisson- and binomial-GGPMs from Table 1), and discuss the role of the link function and its selection criteria.

¹ To reduce clutter, we will not write the dependency on the dispersion parameter φ or the kernel hyperparameters, unless we are explicitly optimizing them.

4.1 Summary

Table 2 (top) presents some examples of EFDs (distributions and parameters), while Table 2 (bottom) shows the expectations and link functions of their corresponding GGPMs. The table encompasses both existing and novel GP models. By changing the parameters of the EFD to form a specific observation likelihood (i.e., selecting the functions {a(φ), b(θ), c(φ, y), θ(η), T(y)}), we can easily obtain a wide range of GP models with different types of outputs, e.g., Gamma and inverse Gaussian for non-negative reals, Beta for [0, 1] interval reals, etc.
Table 2: (top) Exponential family distributions; (bottom) expectations and link functions for the GGPM.

(top) Each row lists the output domain, density p(y), parameters {θ, φ}, sufficient statistic T(y), and the functions a(φ), b(θ), c(φ, y):

- Gaussian: y ∈ ℝ; p(y) = (2πσ²)^{−1/2} exp(−(y−µ)²/(2σ²)); {θ, φ} = {µ, σ²}; T(y) = y; a(φ) = φ; b(θ) = θ²/2; c(φ, y) = −½ log(2πφ) − y²/(2φ).
- Gamma (shape): y ∈ ℝ₊; p(y) = (ν/µ)^ν Γ(ν)⁻¹ y^{ν−1} e^{−yν/µ}; {θ, φ} = {−1/µ, 1/ν}; T(y) = y; a(φ) = φ; b(θ) = −log(−θ); c(φ, y) = (φ⁻¹ − 1) log y + φ⁻¹ log φ⁻¹ − log Γ(φ⁻¹).
- Gamma (scale): y ∈ ℝ₊; p(y) = s^{−ν} Γ(ν)⁻¹ y^{ν−1} e^{−y/s}; {θ, φ} = {νs, s}; T(y) = log y; a(φ) = φ; b(θ) = θ log φ + φ log Γ(θ/φ); c(φ, y) = −y/φ − log y.
- Inverse Gaussian: y ∈ ℝ₊; p(y) = [λ/(2πy³)]^{1/2} exp(−λ(y−µ)²/(2µ²y)); {θ, φ} = {−1/(2µ²), 1/λ}; T(y) = y; a(φ) = φ; b(θ) = −√(−2θ); c(φ, y) = ½ log(1/(2πy³φ)) − 1/(2yφ).
- Negative binomial: y ∈ ℤ₊; p(y) = Γ(y+α⁻¹)/(Γ(y+1)Γ(α⁻¹)) · p^{α⁻¹}(1−p)^y; {θ, φ} = {α log(1−p), α}; T(y) = y; a(φ) = φ; b(θ) = −log(1 − e^{φ⁻¹θ}); c(φ, y) = log[Γ(y+φ⁻¹)/(Γ(y+1)Γ(φ⁻¹))].
- Poisson: y ∈ ℤ₊; p(y) = λ^y e^{−λ}/y!; {θ, φ} = {log λ, 1}; T(y) = y; a(φ) = φ; b(θ) = e^θ; c(φ, y) = −log(y!).
- COM-Poisson: y ∈ ℤ₊; p(y) = S(µ, ν)⁻¹ [µ^y/y!]^ν; {θ, φ} = {log µ, ν}; T(y) = y; a(φ) = 1/φ; b(θ) = φ⁻¹ log S(e^θ, φ); c(φ, y) = −φ log(y!).
- Binomial: y ∈ (1/N)ℤ_N⁺; p(y) = C(N, Ny) π^{Ny}(1−π)^{N−Ny}; {θ, φ} = {log(π/(1−π)), 1/N}; T(y) = y; a(φ) = φ; b(θ) = log(1+e^θ); c(φ, y) = log C(φ⁻¹, φ⁻¹y).
- Beta: y ∈ [0, 1]; p(y) = Γ(ν)/(Γ(µν)Γ((1−µ)ν)) · y^{µν−1}(1−y)^{(1−µ)ν−1}; {θ, φ} = {µ, 1/ν}; T(y) = log(y/(1−y)); a(φ) = φ; b(θ) = φ log[Γ(θ/φ)Γ((1−θ)/φ)]; c(φ, y) = log[Γ(1/φ)(1−y)^{1/φ−1}/y].

(bottom) Each row lists ḃ(θ) = E[T(y)], var[T(y)], the link g, and θ(η):

- Gaussian: ḃ(θ) = θ; var[T(y)] = φ; g(µ) = µ; θ(η) = η.
- Gamma (shape): ḃ(θ) = −θ⁻¹; var[T(y)] = θ⁻²φ; g(µ) = log µ; θ(η) = −e^{−η}.
- Gamma (scale): ḃ(θ) = log φ + ψ₀(θ/φ); var[T(y)] = ψ₁(θ/φ); g(µ) = log ψ₀⁻¹(µ − log φ) + log φ; θ(η) = e^η.
- Inverse Gaussian: ḃ(θ) = (−2θ)^{−1/2}; var[T(y)] = φ(−2θ)^{−3/2}; g(µ) = 2 log µ + log 2; θ(η) = −e^{−η}.
- Negative binomial: ḃ(θ) = φ⁻¹ e^{φ⁻¹θ}/(1 − e^{φ⁻¹θ}); var[T(y)] = φ⁻¹ e^{φ⁻¹θ}/(1 − e^{φ⁻¹θ})²; g(µ) = log(µφ); θ(η) = −φ log(1 + e^{−η}).
- Poisson: ḃ(θ) = e^θ; var[T(y)] = e^θ; g(µ) = log µ; θ(η) = η.
- Linear Poisson: ḃ(θ) = e^θ; var[T(y)] = e^θ; g(µ) = log(e^µ − 1); θ(η) = log(log(1 + e^η)).
- COM-Poisson: ḃ(θ) ≈ e^θ + 1/(2φ) − 1/2; g(µ) = log(µ − 1/(2φ) + 1/2); θ(η) = η.
- Linear COM-Poisson: ḃ(θ) ≈ e^θ + 1/(2φ) − 1/2; g(µ) = log(e^{µ − 1/(2φ) + 1/2} − 1); θ(η) = log(log(1 + e^η)).
- Binomial: ḃ(θ) = e^θ/(1 + e^θ); var[T(y)] = φ e^θ/(1 + e^θ)²; g(µ) = log(µ/(1 − µ)); θ(η) = η.
- Beta: ḃ(θ) = ψ₀(θ/φ) − ψ₀((1−θ)/φ); var[T(y)] = ψ₁(θ/φ) + ψ₁((1−θ)/φ); g(µ) = log(µ/(1 − µ)), where µ = E[y]; θ(η) = e^η/(1 + e^η).

4.2 Binomial distribution

The binomial distribution models the probability of a certain number of events occurring in N independent trials, where the event probability in an individual trial is π, and y ∈ {0/N, 1/N, ..., N/N} is the fraction of events. Assuming the canonical link function,

µ = E[y | θ] = g⁻¹(η) = e^η/(1 + e^η),   (23)

and hence the mean is related to the latent space through the logistic function. For N = 1, the binomial-GGPM (or Bernoulli-GGPM) is equivalent to the GPC model using the logistic function. For N > 1, the model can naturally accommodate uncertainty in the labels by using fractional y_i; e.g., for N = 2 there are three levels, y ∈ {0, ½, 1}. Furthermore, in the N = 1 case, by changing the link function to the probit function, we obtain GPC using the probit likelihood,

g(µ) = Φ⁻¹(µ)  ⇒  µ = g⁻¹(η) = Φ(η),

where Φ(η) is the cumulative distribution function of a Gaussian. Substituting into the GGPM, we have

θ(η) = log[Φ(η)/(1 − Φ(η))],   b(θ(η)) = −log(1 − Φ(η)).

A common interpretation of the probit GPC is that the class probability arises from a noisy Heaviside step function (Seeger, 2004).
Here, the GGPM framework provides further insight: the probit and logistic GPC models correspond to Bernoulli-GGPMs using different link functions.

4.3 Counting GP models

In this section, we consider counting regression models (Poisson, negative binomial, Conway-Maxwell Poisson), and show how the link function can be changed to better model the mean trend.

4.3.1 Poisson distribution

The Poisson distribution is a model for counting data, where the outputs y ∈ ℤ₊ = {0, 1, …} are counts, and λ is the arrival-rate (mean) parameter. For the canonical link function,

  E[y | θ] = g⁻¹(η) = e^η = λ,  g(µ) = log µ.    (24)

Hence, the mean of the Poisson is the exponential of the latent value. The Poisson-GGPM is a Bayesian regression model for predicting counts y from an input vector x, and has been previously studied (Chan and Vasconcelos, 2009; Diggle et al., 1998; Vanhatalo and Vehtari, 2007).

4.3.2 Linearized mean

The canonical link function assumes that the mean is the exponential of the latent function. This may cause problems when the actual mean trend is different, as illustrated in Figure 1a, where the count actually follows a linear trend. One way to address this problem is to use a non-linear kernel function to counteract the exponential link function. In this case, the ideal kernel function would be a logarithm; however, there is no such positive definite kernel, and hence changing the kernel cannot recover a linear trend exactly. Furthermore, the RBF kernel has poor extrapolation capabilities due to the limited extent of the RBF function, as illustrated in Figure 2.

Alternatively, the mean can be directly linearized by changing the link function of the Poisson-GGPM to represent a linear trend. For this purpose, we use the logistic error function,

  E[y | θ] = g⁻¹(η) = log(1 + e^η) ⇒ g(µ) = log(e^µ − 1),  µ > 0.
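A quick numerical sketch (ours) of this linearized link: the softplus g⁻¹(η) = log(1 + e^η) has unit slope for large η and decays to zero for very negative η, and the induced parameter function θ(η) = log(log(1 + e^η)) satisfies e^{θ(η)} = g⁻¹(η) by construction.

```python
import math

# Linearized-mean link of the Poisson-GGPM: g^{-1}(eta) = log(1 + e^eta) (softplus),
# with the induced parameter function theta(eta) = log(log(1 + e^eta)).
def ginv(eta):
    return math.log1p(math.exp(eta))

def theta(eta):
    return math.log(ginv(eta))

slope_at_10 = ginv(10.0) - ginv(9.0)   # near 1 in the linear regime
tail = ginv(-10.0)                     # near 0 for very negative eta
consistency = abs(math.exp(theta(2.0)) - ginv(2.0))
```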
For large values of η, the link function is linear, while for negative values of η, the function approaches zero. The parameter function and new partition function are

  θ(η) = log(log(1 + e^η)),  b(θ(η)) = log(1 + e^η).    (25)

Figures 1a and 1b illustrate the difference between the standard and linearized Poisson GGPMs. The standard Poisson-GGPM cannot correctly model the linear trend, resulting in a poor data fit at the extremes, while the linearized Poisson follows the linear trend.

Figure 1: Examples of count regression using GGPM with linear kernel and different likelihood functions: (a) Poisson; (b) linearized Poisson; (c) COM-Poisson; (d) linearized COM-Poisson. The data follows a linear trend and is underdispersed. The top row shows the learned latent function, and the bottom row shows the predictive distributions. The background color indicates the count probability (white most probable, black least probable).

Figure 2: Example of regressing a linear trend using a Poisson-GGPM with RBF kernel.

One limitation of the Poisson distribution is that it models an equidispersed random variable, i.e., the variance is equal to the mean.
However, in some cases, the actual random variable is overdispersed (with variance greater than the mean) or underdispersed (with variance less than the mean). We next consider two other counting distributions that can handle overdispersion and underdispersion.

4.3.3 Negative binomial

The negative binomial distribution is a model for counting data that is overdispersed (variance larger than the mean). The mean is given by µ = (1−p)/(αp), where p ∈ (0, 1) and α ≥ 0 is the scale parameter that adjusts the variance. The variance can be written as a function of the mean, and always exceeds the mean: var(y) = E[y] + α E[y]². When α → 0, the negative binomial reduces to the Poisson distribution.

Since θ = α log(1−p) < 0 for valid parameters, the choice of θ(η) must satisfy a non-positive constraint. Hence, we use a flipped log-loss function,

  θ(η) = −log(1 + e^{−η}),    (26)

with the corresponding link function,

  E[y | θ] = g⁻¹(η) = α⁻¹ (1 + e^{−η})^{−α⁻¹} / [1 − (1 + e^{−η})^{−α⁻¹}] ⇒ g(µ) = −log[ (µα/(1+µα))^{−α} − 1 ].    (27)

When α = 1, the link function reduces to log µ, and the negative binomial reduces to a geometric distribution.

4.3.4 Conway-Maxwell-Poisson distribution

Another alternative distribution for count data, which represents different dispersion levels, is the Conway-Maxwell-Poisson (COM-Poisson) distribution (Conway and Maxwell, 1962; Shmueli et al., 2005; Guikema and Goffelt, 2008),

  p(y | µ, ν) = (1/S(µ, ν)) [µ^y / y!]^ν,  S(µ, ν) = Σ_{n=0}^∞ [µ^n / n!]^ν,    (28)

where y ∈ ℤ₊, µ is (roughly) the mean parameter, and ν is the dispersion parameter. The COM-Poisson is a smooth interpolation between three distributions: geometric (ν = 0), Poisson (ν = 1), and Bernoulli (ν → ∞).
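As a numerical sketch (ours), the partition function S(µ, ν) of Eq. (28) can be approximated by truncating the series, since its terms decay super-exponentially (the truncation length below is our choice, cf. Shmueli et al., 2005). The ν = 1 case recovers the Poisson normalizer e^µ, and ν > 1 gives an underdispersed count distribution.

```python
import math

def S(mu, nu, n_max=200):
    # Truncated partition function: sum over (mu^n / n!)^nu.
    return sum(math.exp(nu * (n * math.log(mu) - math.lgamma(n + 1)))
               for n in range(n_max))

def moments(mu, nu, n_max=200):
    # Mean and variance of the truncated COM-Poisson pmf.
    Z = S(mu, nu, n_max)
    p = [math.exp(nu * (n * math.log(mu) - math.lgamma(n + 1))) / Z
         for n in range(n_max)]
    mean = sum(n * pn for n, pn in enumerate(p))
    var = sum(n * n * pn for n, pn in enumerate(p)) - mean ** 2
    return mean, var

poisson_Z = S(3.0, 1.0)       # should match e^3
m1, v1 = moments(3.0, 1.0)    # Poisson: variance = mean
m2, v2 = moments(3.0, 2.0)    # nu > 1: underdispersed, variance < mean
```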
The distribution is overdispersed for ν < 1, and underdispersed for ν > 1. The partition function S(µ, ν) has no closed-form expression, but can be estimated numerically up to any precision (Shmueli et al., 2005). Note that b_φ(θ) is now also a function of φ, which only affects optimization of the dispersion φ (details in Appendix A.3.3 and Chan (2013)). For the canonical link function,

  E[y] ≈ e^η + 1/(2ν) − 1/2 = g⁻¹(η) ⇒ g(µ) = log(µ − 1/(2ν) + 1/2).    (29)

Alternatively, the parameter function in (25) can be used to model a linear trend in the mean.

The COM-Poisson GGPM includes a dispersion hyperparameter that decouples the variance of the Poisson from the mean, thus allowing more control over the observation noise of the output. Figures 1c and 1d show examples of using the COM-Poisson-GGPM on underdispersed counting data with a linear trend. Note that the variance of the prediction is much lower for the COM-Poisson models than for the Poisson models (Figures 1a and 1b), illustrating that the COM-Poisson GGPM can effectively estimate the dispersion of the data. A COM-Poisson GLM (with canonical link) was proposed in Guikema and Goffelt (2008), and thus the COM-Poisson GGPM is a non-linear Bayesian extension using a GP prior on the latent function.

4.4 GP regression to non-negative real numbers

In this section, we consider GGPMs with non-negative real number outputs. Two are based on different parameterizations of the Gamma distribution, and the other is based on the inverse Gaussian. The main difference between the likelihood functions is the amount of observation noise. In particular, the variance of the Gamma distribution is φµ², whereas the inverse Gaussian is more dispersed, with variance φµ³.

4.4.1 Gamma distribution (mean parameter, shape hyperparameter)

The Gamma distribution is a model for non-negative real number observations.
The distribution is parameterized by the mean µ > 0 and the shape parameter ν > 0. The exponential distribution is a special case of the Gamma distribution when ν = 1. For the GGPM, we use µ as the distribution parameter and ν as the hyperparameter. Since θ must implicitly be negative, we set the parameter function θ(η) = −e^{−η}, and the link function is

  E[y | θ] = g⁻¹(η) = −1/θ(η) = e^η ⇒ g(µ) = log µ.    (30)

Hence the link function models η as the log-mean of the Gamma. We denote this likelihood as Gamma-shape (Gamma_sh), since it uses the shape as the hyperparameter.

4.4.2 Gamma distribution (shape parameter, scale hyperparameter)

Alternatively, the Gamma distribution can be parameterized by the shape parameter ν, and a scale hyperparameter s = µ/ν. Fixing the scale hyperparameter to s = 2, the Gamma distribution reduces to the Chi-square distribution. Since θ must be positive, one candidate for θ(η) is e^η, and the link function is

  E[T(y) | θ] = g⁻¹(η) = log φ + ψ₀(e^η / φ) ⇒ g(µ) = log ψ₀⁻¹(µ − log φ) + log φ,    (31)

where ψ_k(x) = d^{k+1}/dx^{k+1} log Γ(x) is the polygamma function, and ψ_k⁻¹(z) is its inverse. Noting that ψ₀(x) ≈ log x, we can approximate the link function as g(µ) ≈ µ = E[T(y)] = E[log y]. We denote this likelihood as Gamma-scale (Gamma_sc), since it uses the scale as the hyperparameter.

4.4.3 Inverse Gaussian distribution

The inverse Gaussian distribution is another model for non-negative real number outputs, where µ is the mean parameter and λ is the inverse dispersion. Since θ is implicitly negative, we set θ(η) = −e^{−η}, and the link function is

  E[y | θ] = g⁻¹(η) = 1/√(2e^{−η}) ⇒ g(µ) = 2 log µ + log 2.    (32)

The link function models η as the log-mean of the inverse Gaussian.
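A small consistency check (ours): both likelihoods above use θ(η) = −e^{−η}, and their links in Eqs. (30) and (32) should invert each other, i.e., g(g⁻¹(η)) = η.

```python
import math

def gamma_sh_ginv(eta):      # Eq. (30): E[y | theta] = -1/theta(eta) = e^eta
    return math.exp(eta)

def gamma_sh_g(mu):          # Eq. (30): g(mu) = log mu
    return math.log(mu)

def invgauss_ginv(eta):      # Eq. (32): E[y | theta] = (2 e^{-eta})^{-1/2}
    return 1.0 / math.sqrt(2.0 * math.exp(-eta))

def invgauss_g(mu):          # Eq. (32): g(mu) = 2 log mu + log 2
    return 2.0 * math.log(mu) + math.log(2.0)

eta = 1.3
roundtrip_gamma = gamma_sh_g(gamma_sh_ginv(eta))
roundtrip_ig = invgauss_g(invgauss_ginv(eta))
```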
4.5 Beta likelihood

The beta distribution is a model for outputs over a real interval, y ∈ [0, 1], where µ is the distribution mean and ν is the shape parameter. To enforce the restriction 0 < θ < 1, we apply the logistic function θ(η) = e^η/(1+e^η), resulting in the link function,

  E[T(y) | θ] = ψ₀(θ/φ) − ψ₀((1−θ)/φ) ≈ log[θ/(1−θ)] = η ⇒ g(µ) ≈ µ,    (33)

where we use the approximation to the digamma function, ψ₀(x) ≈ log x. The Beta-GGPM maps observations from the interval [0, 1] to real values in the latent space using the logit function. A similar idea has been explored in the study of estimating reflectance spectra from RGB values (Heikkinen et al., 2008), where the reflectance spectra output is regressed using a logit-transformed GP. Specifically, the output values in [0, 1] are first transformed to real values using the logit function², and a standard GP model is estimated on the transformed output. We show the relationship between this logit-transformed GP and the Beta-GGPM in Section 5.2.5.

4.6 Choosing the link function

In the previous examples, several strategies have been used in selecting the link and parameter functions, g(µ) and θ(η), to obtain a mapping from the latent space to the parameters. Using the canonical link function simplifies the calculations, but may make undesired assumptions about the mapping between the latent space and the distribution mean (as in the exponential mapping for the Poisson). Modifying the link function allows the desired mean trend to be modeled (e.g., the linearized Poisson). When selecting the mapping via the parameter function, the two main hurdles are: 1) to select a function that satisfies the implicit constraints on the parameter θ imposed by the log-partition function b(θ); 2) to select a function that is defined for all values of the input η ∈ ℝ, to accommodate the GP prior.
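The Beta link is a concrete example of such a mapping: Eq. (33) hinges on ψ₀(x) ≈ log x. The sketch below (ours) checks this numerically, computing ψ₀ as the derivative of log Γ by central differences, since the Python stdlib has no digamma; the dispersion value is an arbitrary choice.

```python
import math

def psi0(x, h=1e-5):
    # Digamma via central difference of lgamma (numerical approximation).
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def beta_mean_T(eta, phi):
    # E[T(y)] = psi_0(theta/phi) - psi_0((1-theta)/phi), theta = logistic(eta).
    th = 1.0 / (1.0 + math.exp(-eta))
    return psi0(th / phi) - psi0((1.0 - th) / phi)

# For small dispersion phi, the exact mean of T(y) is close to eta itself,
# i.e., g(mu) is approximately the identity, as in Eq. (33).
phi = 0.05
err = max(abs(beta_mean_T(e, phi) - e) for e in [-1.0, 0.0, 1.0])
sanity = psi0(1.0)  # psi_0(1) = -gamma (Euler-Mascheroni constant)
```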
Figure 3 plots the log-partition, parameter, and link functions for each model. For example, there is an implicit negative constraint on θ for the Gamma-shape likelihood, due to its log-partition function.

Figure 3: Log-partition, parameter, and link functions.

Finally, it is worth noting the difference between a warped GP (Snelson et al., 2004) and GGPMs. The warped GP learns a function to warp the output of GPR. The warping function plays a similar role to the link function in the GGPM, with one notable difference. With Snelson et al. (2004), the warping is applied directly to the output variable y, z = f(y), and GPR is applied to the resulting z. As a result, the predictive distribution of z is Gaussian, whereas that of y = f⁻¹(z) is arbitrary (and perhaps multimodal). On the other hand, with the GGPM, the link function applies warping between the latent value η and the mean parameter θ, and hence the form of the output distribution is preserved. In addition, the warping function is learned in Snelson et al. (2004), whereas in this paper we assume the link function is fixed for a given GGPM. Certainly, in principle, the link function could be learned. However, we note that the link function and the kernel function both control the mapping between the latent space and the parameter space. Hence, if a link function were to be learned, it would have to be over functions that are orthogonal to those learnable by the kernel function, in order to avoid a duplicate parameterization.

2. Equivalently, the inverse hyperbolic tangent function (arctanh) is applied to the data scaled to [−1, 1].

5. Approximate inference for GGPMs

In this section, we derive approximate inference algorithms for GGPMs based on the general form of the exponential family distribution in (11). One method of approximate inference is to use MCMC to draw samples from the posterior p(η | X, y), but this can be computationally intensive (Nickisch and Rasmussen, 2008). Instead, we consider methods that approximate the posterior with a Gaussian. We consider a closed-form approximation to inference based on a Taylor approximation, and we show that the Taylor approximation can justify common heuristics or pre-processing steps as principled inference: we prove that label regression (Nickisch and Rasmussen, 2008) is a Taylor approximation to GP classification (Bernoulli-GGPM), the closed-form Bayesian Poisson regression (Chan and Vasconcelos, 2009) is a Taylor approximation of a Poisson-GGPM, GP regression on log-transformed outputs is actually a Taylor approximation to the Gamma-GGPM, and GP regression on logit-transformed outputs (Heikkinen et al., 2008) is a Taylor approximation to the Beta-GGPM. Finally, we consider several other approximate inference algorithms that have been previously proposed for various GP models, and generalize them for GGPMs.

5.1 Gaussian approximation to the posterior

As noted in Nickisch and Rasmussen (2008), most inference approximations for GPC work by finding a Gaussian approximation to the true posterior.
Similarly, for GGPMs, approximate inference also finds a suitable Gaussian approximation q(η | X, y) to the true posterior, i.e.,

  p(η | X, y) ≈ q(η | X, y) = N(η | m̂, V̂),    (34)

where the parameters {m̂, V̂} are determined by the type of approximation. Substituting the approximation q(η | X, y) into (20), the approximate posterior for η* is

  p(η* | X, x*, y) ≈ q(η* | X, x*, y) = N(η* | µ̂_η, σ̂²_η),    (35)

where the mean and variance are

  µ̂_η = k_*^T K⁻¹ m̂,  σ̂²_η = k_** − k_*^T (K⁻¹ − K⁻¹ V̂ K⁻¹) k_*.    (36)

In many inference approximations, {m̂, V̂} take the form

  V̂ = (K⁻¹ + W⁻¹)⁻¹,  m̂ = V̂ W⁻¹ t,    (37)

where W is a diagonal matrix and t is a target vector. In these cases, (36) can be rewritten as

  µ̂_η = k_*^T (K + W)⁻¹ t,  σ̂²_η = k_** − k_*^T (K + W)⁻¹ k_*.    (38)

Note that these are equivalent to the standard equations for GPR, but with an "effective" observation noise W and target t determined by the particular approximate inference algorithm.

5.2 Taylor approximation

In this section, we present a novel closed-form approximation to inference, based on applying a Taylor approximation to the likelihood term (Chan and Dong, 2011). We first define the following derivative functions of the observation log-likelihood,

  u(η, y) = ∂/∂η log p(y | θ(η)) = (1/a(φ)) θ̇(η) [T(y) − ḃ(θ(η))],    (39)

  w(η, y) = [−∂²/∂η² log p(y | θ(η))]⁻¹ = a(φ) { b̈(θ(η)) θ̇(η)² − [T(y) − ḃ(θ(η))] θ̈(η) }⁻¹.    (40)

For the canonical link function, these derivatives simplify to

  u(η, y) = (1/a(φ)) [T(y) − ḃ(η)],  w(η, y) = a(φ) / b̈(η).    (41)
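The derivative functions of Eq. (41) can be validated against numerical differentiation; the sketch below (ours) does so for the Poisson likelihood, where a(φ) = 1, b(θ) = e^θ, and T(y) = y.

```python
import math

def log_lik(y, eta):
    # Poisson log-likelihood with canonical link: y*eta - e^eta - log(y!).
    return y * eta - math.exp(eta) - math.lgamma(y + 1)

def u(eta, y):   # Eq. (41): u = (T(y) - b-dot(eta)) / a(phi) = y - e^eta
    return y - math.exp(eta)

def w(eta, y):   # Eq. (41): w = a(phi) / b-ddot(eta) = e^{-eta}
    return math.exp(-eta)

y, eta = 3, 0.4
h = 1e-6
u_num = (log_lik(y, eta + h) - log_lik(y, eta - h)) / (2 * h)
h2 = 1e-4
curv = (log_lik(y, eta + h2) - 2 * log_lik(y, eta) + log_lik(y, eta - h2)) / h2 ** 2
w_num = -1.0 / curv   # w is the inverse of the negative curvature
```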
5.2.1 Joint likelihood approximation

We first consider approximating the joint likelihood of the data and latent values,

  log p(y, η | X) = log p(y | θ(η)) + log p(η | X).    (42)

The data likelihood term p(y_i | θ(η_i)) is the main hurdle for tractable integration over η. Hence, we approximate the data log-likelihood term with a 2nd-order Taylor expansion at the expansion point η̃_i,

  log p(y_i | θ(η_i)) ≈ log p(y_i | θ(η̃_i)) + ũ_i (η_i − η̃_i) − ½ w̃_i⁻¹ (η_i − η̃_i)²,    (43)

where ũ_i = u(η̃_i, y_i) and w̃_i = w(η̃_i, y_i) are the derivatives evaluated at η̃_i. Defining ũ = [ũ_1, …, ũ_n]^T and W̃ = diag(w̃_1, …, w̃_n), the joint likelihood in (42) can then be approximated as (see Appendix A.1 for the derivation)

  log q(y, η | X) = log p(y | θ(η̃)) − ½ log|K| − (n/2) log 2π − ½ (η − A⁻¹W̃⁻¹t̃)^T A (η − A⁻¹W̃⁻¹t̃) − ½ t̃^T (W̃ + K)⁻¹ t̃ + ½ ũ^T W̃ ũ,    (44)

where A = W̃⁻¹ + K⁻¹, t̃ = η̃ + W̃ũ is the target vector, and the individual targets are t̃_i = η̃_i + w̃_i ũ_i. The approximate joint log-likelihood in (44) will be used to find the approximate posterior and marginal likelihood.

5.2.2 Approximate posterior

Removing terms in (44) that do not depend on η, the posterior of η is approximately Gaussian,

  log q(η | X, y) ∝ log q(y, η | X) ∝ −½ (η − A⁻¹W̃⁻¹t̃)^T A (η − A⁻¹W̃⁻¹t̃) ⇒ q(η | X, y) = N(η | m̂, V̂),    (45)

where V̂ = (W̃⁻¹ + K⁻¹)⁻¹ and m̂ = V̂W̃⁻¹t̃. These are of the form in (37), and hence the approximate posterior of η* has parameters

  µ̂_η = k_*^T (K + W̃)⁻¹ t̃,  σ̂²_η = k_** − k_*^T (K + W̃)⁻¹ k_*.

The Taylor approximation is a closed-form (non-iterative) approximation that can be interpreted as performing GPR on a set of targets t̃ with target-specific, non-i.i.d. observation noise W̃.
The targets t̃ are a function of the expansion point η̃, which can be selected as a non-linear transformation of the observations y. W̃ can be interpreted as a Gaussian approximation of the observation noise in the transformation space of t̃, where the noise depends on the expansion points. For standard GPR, this is i.i.d. noise, i.e., W = σ²_n I. In other cases, the noise depends on the expansion points and the particular properties of the observation likelihood. Instances of the closed-form Taylor approximation for different GP models are further explored in Section 5.2.5. One advantage of the Taylor approximation is that it is an efficient non-iterative method with the same complexity as standard GPR. In Section 7, we exploit its efficiency to speed up other approximate inference methods (e.g., EP) during hyperparameter estimation.

5.2.3 Choice of expansion point

The targets t̃ are a function of the expansion point η̃, which can be chosen as a non-linear transformation of the observations y. Assuming that the output y_i occurs close to the mean of the distribution, a reasonable choice of the expansion point is η̃_i = g(T(y_i)), which we denote as the canonical expansion point. The derivatives and targets then simplify to (see Appendix A.2)

  ũ_i = u(g(T(y_i)), y_i) = 0 ⇒ t̃_i = η̃_i = g(T(y_i)),    (46)

  w̃_i = w(g(T(y_i)), y_i) = a(φ) / [ b̈(θ(g(T(y_i)))) θ̇(g(T(y_i)))² ].    (47)

Hence, the Taylor approximation becomes GPR on the transformed outputs g(T(y_i)), with an appropriate non-i.i.d. noise term given by w̃_i.

This formulation gives some further insight into common preprocessing transformations, such as log(y_i), used with GPR. Using the Taylor approximation, we can show that some forms of preprocessing are actually making specific assumptions about the output noise.
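The whole closed-form procedure can be sketched in a few lines. The example below (ours, with hypothetical data and an arbitrary RBF kernel) runs Taylor inference for a Poisson-GGPM at the canonical expansion point: targets t̃_i = log(y_i + c) with noise w̃_i = 1/(y_i + c), followed by the standard GPR prediction µ̂_η = k_*^T (K + W̃)⁻¹ t̃; a tiny Gaussian-elimination solver stands in for a proper Cholesky factorization.

```python
import math

def solve(A, rhs):
    # Gaussian elimination with partial pivoting for a small dense SPD system.
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def rbf(x1, x2, ell=1.0):
    return math.exp(-0.5 * ((x1 - x2) / ell) ** 2)

X = [0.0, 0.5, 1.0, 1.5, 2.0]          # hypothetical inputs
y = [1, 2, 4, 7, 12]                    # hypothetical counts
n = len(X)
c = 0.5                                 # offset to avoid log(0)
t = [math.log(yi + c) for yi in y]      # targets in the latent (log) space
w = [1.0 / (yi + c) for yi in y]        # target-specific observation noise

K = [[rbf(xi, xj) for xj in X] for xi in X]
KW = [[K[i][j] + (w[i] if i == j else 0.0) for j in range(n)] for i in range(n)]
alpha = solve(KW, t)                    # (K + W)^{-1} t

x_star = 1.0
k_star = [rbf(x_star, xi) for xi in X]
mu_eta = sum(ks * al for ks, al in zip(k_star, alpha))
mean_count = math.exp(mu_eta)           # predictive mean via the inverse link
```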
In addition, several heuristic methods (e.g., label regression) can be shown to be instances of approximate inference using the Taylor approximation. This is further explored for specific cases in Section 5.2.5.

Finally, it is worth noting the relationship between the GGPM Taylor approximation and warped GPs (Snelson et al., 2004). Warped GPs also apply a warping from y_i to t_i, and apply GPR on t_i. The main difference is that with warped GPs, the noise in the transformed space of t_i is modeled as i.i.d. Gaussian, resulting in an arbitrary predictive distribution for y_i. On the other hand, the Taylor approximation models the noise according to the warping function (i.e., as in w̃_i), and hence the predictive distribution of y_i is preserved.

5.2.4 Approximate marginal

The approximate marginal likelihood is obtained by integrating out η in (44), yielding (see Appendix A.3 for the derivation)

  log q(y | X) = −½ t̃^T (W̃ + K)⁻¹ t̃ − ½ log|W̃ + K| + r(φ),    (48)

where r(φ) = log p(y | θ(η̃)) + ½ ũ^T W̃ ũ + ½ log|W̃|. The approximate marginal is similar to that of standard GPR, but uses the modified targets and noise terms, as discussed earlier. There is one additional penalty term, r(φ), on the dispersion φ, which arises from the non-Gaussianity of the observation noise. The derivatives of (48) for use with conjugate gradient optimization are derived in Appendix A.3.

5.2.5 Transformed GPs as Taylor approximate inference

We now show that heuristic methods using GPs on transformed outputs can be explained as Taylor approximate inference on GGPMs with specific likelihoods. The derivations, including the derivative functions and targets used in Taylor approximate inference, are given in Appendix B.
Binomial/Bernoulli – An agnostic choice of expansion point is η̃_i = 0, which ignores the training classes, leading to

  t̃_i = 4(y_i − 0.5),  w̃_i = 4/N.    (49)

Hence, the Taylor approximation for the binomial-GGPM is equivalent to GPR in the latent space of the binomial model, with targets t̃_i scaled between [−2, +2] and an effective noise term w̃_i = 4/N. When y_i ∈ {0, 1}, the target values are {−2, +2}, which is equivalent to label regression (Rasmussen and Williams, 2006; Nickisch and Rasmussen, 2008; Kapoor et al., 2010), up to a scale factor. Hence, label regression can be interpreted as a Taylor approximation to GPC inference. The scaling of the targets (±2 or ±1) is irrelevant when the latent space is only used for classifying based on the sign of η*. However, this scaling is important when computing the actual label probabilities using the predictive distribution. The above interpretation explains why label regression tends to work well in practice, e.g., in Kapoor et al. (2010).

Poisson – Based on the canonical expansion point, we have η̃_i = log(y_i + c), where c ≥ 0 is a constant to prevent taking the logarithm of zero, and hence the target and effective noise for the Taylor approximation are

  t̃_i = log(y_i + c) − c/(y_i + c),  w̃_i = 1/(y_i + c).    (50)

For c = 0, the Taylor approximation is exactly the closed-form approximation proposed for Bayesian Poisson regression in Chan and Vasconcelos (2009), which was derived in a different way using an approximation to the log-gamma distribution.

Gamma (shape hyperparameter) – Using the canonical expansion point, η̃_i = log y_i, yields the target and effective noise,

  t̃_i = log y_i,  w̃_i = 1/ν = φ.    (51)

Note that this is equivalent to using a standard GP on the log of the outputs (Diggle et al., 1998; Snelson et al., 2004), which is standard practice in the statistics literature when the observations are only positive values.
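The Gamma-shape entries of Eq. (51) can be re-derived mechanically from the generic derivative functions (39)-(40); the sketch below (ours) evaluates them at the canonical expansion point η̃ = log y and confirms ũ = 0, w̃ = φ, and t̃ = log y.

```python
import math

# Gamma-shape building blocks: a(phi)=phi, b(theta)=-log(-theta), T(y)=y,
# parameter function theta(eta) = -e^{-eta}.
def u(eta, y, phi):
    theta_dot = math.exp(-eta)           # derivative of -e^{-eta}
    b_dot = math.exp(eta)                # -1/theta = e^eta
    return theta_dot * (y - b_dot) / phi  # Eq. (39)

def w(eta, y, phi):
    theta = -math.exp(-eta)
    b_ddot = 1.0 / theta ** 2
    theta_dot = math.exp(-eta)
    theta_ddot = -math.exp(-eta)
    b_dot = -1.0 / theta
    # Eq. (40): a(phi) / (b'' theta'^2 - (T(y) - b') theta'')
    return phi / (b_ddot * theta_dot ** 2 - (y - b_dot) * theta_ddot)

phi, yv = 0.3, 2.5
eta_tilde = math.log(yv)                 # canonical expansion point
u_t = u(eta_tilde, yv, phi)
w_t = w(eta_tilde, yv, phi)
t_t = eta_tilde + w_t * u_t              # target t = eta + w*u
```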
Hence, this practice of applying a GP to log-transformed outputs is equivalent to assuming a Gamma likelihood and using Taylor approximate inference.

Inverse Gaussian – Using the canonical expansion point, η̃_i = log(2y_i²), yields the target and effective noise,

  t̃_i = log(2y_i²) = 2 log y_i + log 2,  w̃_i = 4φy_i.    (52)

This is equivalent to using a standard GP on a linear transform of the log of the outputs, with a noise term w̃_i that is monotone in the value of the output.

Beta – An agnostic choice of the expansion point, η̃_i = 0, yields the targets and noise,

  t̃_i = [2φ/ψ₁(1/(2φ))] log[y_i/(1−y_i)],  w̃_i = 8φ²/ψ₁(1/(2φ)).    (53)

Using the approximation to the trigamma function, ψ₁(x) ≈ 1/x, the targets are approximately

  t̃_i ≈ log[y_i/(1−y_i)],  w̃_i ≈ 4φ.    (54)

Hence, the GP on logit-transformed outputs of Heikkinen et al. (2008) is equivalent to using Taylor approximate inference with a Beta likelihood, under the above approximation to the trigamma function. Alternatively, another choice is the canonical expansion point, η̃_i = log[y_i/(1−y_i)], which we used in our experiments.

5.3 Laplace approximation

The Laplace approximation is a Gaussian approximation of the posterior p(η | X, y) at its maximum (mode). Hence, the Laplace approximation is a specific case of the closed-form Taylor approximation in Section 5.2, where the expansion point η̃ is set to the maximum of the true posterior,

  η̂ = argmax_η log p(η | X, y).    (55)

The true posterior mode is obtained iteratively using the Newton-Raphson method, where in each iteration (Rasmussen and Williams, 2006),

  η̂^(new) = η̂ − [∂²/∂η∂η^T log p(η̂ | X, y)]⁻¹ ∂/∂η log p(η̂ | X, y) = (Ŵ⁻¹ + K⁻¹)⁻¹ Ŵ⁻¹ t̂,    (56)

where û and Ŵ are evaluated at η̂, and t̂ = Ŵû + η̂. In each iteration, the expansion point η̂ moves closer to the maximum, and the target vector t̂ is updated.
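The Newton iteration of Eq. (56) can be sketched in the simplest possible setting (ours): a Bernoulli-GGPM (logistic GPC) with a single training point, so that K reduces to a scalar k > 0 and the update becomes η_new = k t̂ / (k + ŵ), with û, ŵ, and t̂ recomputed at each expansion point. At convergence, the first-derivative condition K⁻¹η̂ = û holds.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def newton_mode(y, k, iters=50):
    # Scalar Newton-Raphson for the posterior mode of logistic GPC (one point).
    eta = 0.0                                  # initial expansion point
    for _ in range(iters):
        p = sigmoid(eta)
        u = y - p                              # gradient of the log-likelihood
        w = 1.0 / (p * (1.0 - p))              # inverse negative curvature
        t = eta + w * u                        # target
        eta = k * t / (k + w)                  # scalar form of (W^-1 + K^-1)^-1 W^-1 t
    return eta

k = 1.0
eta_hat = newton_mode(y=1, k=k)
# Stationarity check: K^-1 eta-hat = u-hat, i.e., eta_hat/k = 1 - sigmoid(eta_hat).
residual = abs(eta_hat / k - (1.0 - sigmoid(eta_hat)))
```

Running a single iteration from η = 0 reproduces the closed-form Taylor approximation, consistent with its interpretation as a one-iteration Laplace approximation.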
Note that the update for η̂ is of the same form as the mean m̂ in the closed-form Taylor approximation. Hence, the Taylor approximation can also be considered a one-iteration Laplace approximation, using the expansion point η̃ as the initial point. The parameters of the approximate posteriors of η and η* are

  m̂ = η̂,  V̂ = (Ŵ⁻¹ + K⁻¹)⁻¹,    (57)

  µ̂_η = k_*^T K⁻¹ η̂ = k_*^T û,  σ̂²_η = k_** − k_*^T (K + Ŵ)⁻¹ k_*.    (58)

The mode is unique when the log posterior is concave, or equivalently when W⁻¹ is positive definite, i.e., for all y, η,

  w(η, y)⁻¹ = (1/a(φ)) { b̈(θ(η)) θ̇(η)² − [T(y) − ḃ(θ(η))] θ̈(η) } > 0 ⇒ b̈(θ(η)) θ̇(η)² > [T(y) − ḃ(θ(η))] θ̈(η).

For a canonical link function, this simplifies to b̈(η) > 0, i.e., a unique maximum exists when b(η) is convex. Finally, the Laplace approximation of the marginal likelihood is

  log q(y | X) = log p(y | θ(η̂)) − ½ η̂^T K⁻¹ η̂ − ½ log|Ŵ⁻¹K + I|    (59)
              = log p(y | θ(η̂)) − ½ û^T K û − ½ log|Ŵ⁻¹K + I|,    (60)

where the last line follows from the first-derivative condition at the maximum. Note that η̂ depends on the kernel matrix K and on φ. Hence, at each iteration during optimization of q(y | X), we have to recompute its value. Derivatives of the marginal are presented in the supplemental (Chan, 2013).

5.4 Expectation propagation

Expectation propagation (EP) (Minka, 2001) is a general algorithm for approximate inference, which has been shown to be effective for GPC (Nickisch and Rasmussen, 2008).
EP approximates each likelihood term p(y_i | θ(η_i)) with an unnormalized Gaussian t_i = Z̃_i N(η_i | µ̃_i, σ̃²_i) (also called a site function), yielding the approximate data likelihood

  q(y | θ(η)) = ∏_{i=1}^n t_i(η_i | Z̃_i, µ̃_i, σ̃²_i) = N(η | µ̃, Σ̃) ∏_{i=1}^n Z̃_i,

where µ̃ = [µ̃_1, …, µ̃_n]^T and Σ̃ = diag([σ̃²_1, …, σ̃²_n]). Using the site functions, the posterior approximation is

  q(η | X, y) = (1/Z_EP) ∏_{i=1}^n t_i(η_i) p(η | X) = N(η | m̂, V̂),

where {m̂, V̂} are of the form in (37), and hence

  m̂ = V̂ Σ̃⁻¹ µ̃,  V̂ = (K⁻¹ + Σ̃⁻¹)⁻¹,    (61)

  µ̂_η = k_*^T (K + Σ̃)⁻¹ µ̃,  σ̂²_η = k_** − k_*^T (K + Σ̃)⁻¹ k_*.    (62)

The normalization constant is also the EP approximation to the marginal likelihood (Nickisch and Rasmussen, 2008; Rasmussen and Williams, 2006),

  log Z_EP = log q(y | X) = log ∫ q(y | θ(η)) p(η | X) dη    (63)
           = −½ µ̃^T (K + Σ̃)⁻¹ µ̃ − ½ log|K + Σ̃| + Σ_i log Z̃_i.    (64)

Derivatives of (64) are presented in the supplemental (Chan, 2013).

5.4.1 Computing the site parameters

Instead of computing the optimal site parameters all at once, EP works by iteratively updating each individual site using the other site approximations (Minka, 2001; Rasmussen and Williams, 2006). In particular, to update site t_i, we first compute the cavity distribution, which is the marginalization over all sites except t_i,

  q_¬i(η_i) = N(η_i | µ_¬i, σ²_¬i) ∝ ∫ p(η | X) ∏_{j≠i} t_j(η_j | Z̃_j, µ̃_j, σ̃²_j) dη_j,    (65)

where the notation ¬i indicates the sites without t_i, and q_¬i(η_i) is an approximation to the posterior distribution of η_i given all observations except y_i. Since both terms are Gaussian, this integral can be computed in closed form.
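Given the cavity, the remaining work in a site update is moment matching followed by "subtracting" the cavity. The sketch below (ours) performs one such update for a Bernoulli likelihood p(y_i = 1 | η) = σ(η) with a unit-Gaussian cavity; the moments are computed by naive grid integration, which is an assumption here; in practice one would use quadrature or closed forms when available.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

mu_cav, var_cav = 0.0, 1.0            # cavity distribution q_{-i}(eta)
dx = 0.001
grid = [-10.0 + dx * i for i in range(20001)]

def unnorm(eta):
    # p(y_i = 1 | eta) * N(eta | mu_cav, var_cav), unnormalized tilted density.
    gauss = math.exp(-0.5 * (eta - mu_cav) ** 2 / var_cav) / math.sqrt(2 * math.pi * var_cav)
    return sigmoid(eta) * gauss

Z_hat = sum(unnorm(e) for e in grid) * dx                   # normalizer of q-hat
mu_hat = sum(e * unnorm(e) for e in grid) * dx / Z_hat      # mean of q-hat
m2 = sum(e * e * unnorm(e) for e in grid) * dx / Z_hat
var_hat = m2 - mu_hat ** 2                                  # variance of q-hat

# "Subtract" the cavity to obtain the site parameters.
var_site = 1.0 / (1.0 / var_hat - 1.0 / var_cav)
mu_site = var_site * (mu_hat / var_hat - mu_cav / var_cav)
```

For y = 1 the tilted mean shifts positive and the tilted variance shrinks below the cavity variance, so the resulting site variance is positive, as required for a well-behaved update.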
Next, the site parameters of $t_i$ are selected to match the moments (mean, variance, and normalization) between $\hat{q}(\eta_i) = p(y_i|\theta(\eta_i))\, q_{\neg i}(\eta_i)$ and $t_i(\eta_i)\, q_{\neg i}(\eta_i)$. This requires first calculating the moments of $\hat{q}(\eta_i) = \frac{1}{\hat{Z}_i} p(y_i|\theta(\eta_i))\, \mathcal{N}(\eta_i|\mu_{\neg i}, \sigma^2_{\neg i})$,
$$\hat\mu_i = \mathbb{E}_{\hat{q}}[\eta_i], \qquad \hat\sigma^2_i = \mathrm{var}_{\hat{q}}(\eta_i), \qquad \hat{Z}_i = \int p(y_i|\theta(\eta_i))\, q_{\neg i}(\eta_i)\, d\eta_i, \qquad (66)$$
followed by "subtracting" the cavity distribution, which yields the site updates
$$\tilde\mu_i = \tilde\sigma^2_i (\hat\sigma_i^{-2} \hat\mu_i - \sigma_{\neg i}^{-2} \mu_{\neg i}), \qquad \tilde\sigma^2_i = (\hat\sigma_i^{-2} - \sigma_{\neg i}^{-2})^{-1}, \qquad (67)$$
$$\log \tilde{Z}_i = \log \hat{Z}_i + \tfrac{1}{2} \log\!\big[2\pi (\sigma^2_{\neg i} + \tilde\sigma^2_i)\big] + \frac{(\mu_{\neg i} - \tilde\mu_i)^2}{2 (\sigma^2_{\neg i} + \tilde\sigma^2_i)}. \qquad (68)$$
EP iterates over each site $t_i$, i.e., each observation $y_i$, until convergence. Note that, in general, EP is not guaranteed to converge, although it is usually well behaved when the data log-likelihood $\log p(y_i|\theta(\eta_i))$ is concave and the approximation is initialized to the prior (Nickisch and Rasmussen, 2008; Jylänki et al., 2011). Finally, the moments in (66) may not be analytically tractable (in fact, $\hat{q}(\eta_i)$ has the same form as the predictive distribution), so approximate integration is usually required. In general, the convergence of EP also depends on the accuracy of these moment approximations.

5.5 KL divergence minimization

In this section we discuss a variational approximation that maximizes a lower bound of the marginal likelihood by minimizing the KL divergence between the approximate posterior and the true posterior. This type of approximate inference was first applied to robust GP regression by Opper and Archambeau (2009), and later to GP classification by Nickisch and Rasmussen (2008). In this paper, we extend it to the GGPM. As with the other approximations, the approximate posterior is assumed to be Gaussian,
$$p(\boldsymbol\eta|X, \mathbf{y}) \approx q(\boldsymbol\eta|X, \mathbf{y}) = \mathcal{N}(\boldsymbol\eta|\mathbf{m}, \mathbf{V}) \qquad (69)$$
for some $\mathbf{m}$ and $\mathbf{V}$.
To obtain the best approximate posterior, the KL divergence is minimized between the approximation and the true posterior,
$$\{\mathbf{m}^*, \mathbf{V}^*\} = \mathop{\mathrm{argmin}}_{\mathbf{m}, \mathbf{V}}\; \mathrm{KL}\big( \mathcal{N}(\boldsymbol\eta|\mathbf{m}, \mathbf{V})\, \|\, p(\boldsymbol\eta|X, \mathbf{y}) \big). \qquad (70)$$
As shown in Nickisch and Rasmussen (2008), minimizing the KL in (70) is equivalent to maximizing a lower bound of $\log p(\mathbf{y}|X)$,
$$L = \log p(\mathbf{y}|X) - \mathrm{KL}\big( \mathcal{N}(\boldsymbol\eta|\mathbf{m}, \mathbf{V})\, \|\, p(\boldsymbol\eta|X, \mathbf{y}) \big) \qquad (71)$$
$$= f(\mathbf{m}, \mathbf{v}) + \tfrac{1}{2} \log |\mathbf{K}^{-1} \mathbf{V}| - \tfrac{1}{2} \mathrm{tr}(\mathbf{K}^{-1} \mathbf{V}) - \tfrac{1}{2} \mathbf{m}^T \mathbf{K}^{-1} \mathbf{m} + \tfrac{n}{2}. \qquad (72)$$
The function $f(\mathbf{m}, \mathbf{v})$ is the expectation of the observation log-likelihood,
$$f(\mathbf{m}, \mathbf{v}) \equiv \int \mathcal{N}(\boldsymbol\eta|\mathbf{m}, \mathbf{V}) \log p(\mathbf{y}|\theta(\boldsymbol\eta))\, d\boldsymbol\eta = \sum_{i=1}^n \mathbb{E}_{\eta_i|m_i, v_i}[\log p(y_i|\theta(\eta_i))], \qquad (73)$$
where $m_i = [\mathbf{m}]_i$, $v_i = [\mathbf{V}]_{ii}$, $\mathbf{v} = \mathrm{diag}(\mathbf{V})$, and $\mathbb{E}_{\eta_i|m_i, v_i}[\cdot]$ is the expectation with respect to $\mathcal{N}(\eta_i|m_i, v_i)$. The first term of (72) emphasizes the fit to the data (via the expectation), while the remaining terms (the KL divergence terms) penalize the posterior for deviating too far from the prior distribution. At a local maximum of (72), the first-derivative conditions yield the following constraints,
$$\hat{\mathbf{m}} = \mathbf{K} \hat{\boldsymbol\gamma}, \qquad \hat{\mathbf{V}} = (\mathbf{K}^{-1} + \hat{\boldsymbol\Lambda})^{-1}, \qquad (74)$$
where the optimal $\hat{\boldsymbol\gamma}$ and $\hat{\boldsymbol\Lambda}$ satisfy
$$\hat{\boldsymbol\gamma} = \frac{\partial f(\mathbf{m}, \mathbf{v})}{\partial \mathbf{m}}, \qquad \hat{\boldsymbol\Lambda} = -2 \frac{\partial f(\mathbf{m}, \mathbf{v})}{\partial \mathbf{V}}. \qquad (75)$$
Since the mean and covariance have the forms in (74), the optimization problem can be reformulated with the variational parameters $\{\boldsymbol\gamma, \boldsymbol\lambda\}$, such that
$$\mathbf{m} = \mathbf{K} \boldsymbol\gamma, \qquad \mathbf{V} = (\mathbf{K}^{-1} + \boldsymbol\Lambda)^{-1} = \mathbf{K} (\mathbf{I} + \boldsymbol\Lambda \mathbf{K})^{-1}, \qquad \boldsymbol\Lambda = \mathrm{diag}(\boldsymbol\lambda). \qquad (76)$$
This parameterization also avoids inverting the kernel matrix. Substituting into (72),
$$L = f(\mathbf{K} \boldsymbol\gamma, \mathbf{v}) + \tfrac{1}{2} \log |(\mathbf{I} + \boldsymbol\Lambda \mathbf{K})^{-1}| - \tfrac{1}{2} \mathrm{tr}\big((\mathbf{I} + \boldsymbol\Lambda \mathbf{K})^{-1}\big) - \tfrac{1}{2} \boldsymbol\gamma^T \mathbf{K} \boldsymbol\gamma + \tfrac{n}{2}.$$
(77)

Finally, the optimal approximate posterior can be obtained by maximizing (77) with respect to the variational parameters $\{\boldsymbol\gamma, \boldsymbol\lambda\}$ using standard optimization techniques (e.g., the conjugate gradient method). Note that $L$ is a lower bound on the marginal likelihood, as in (72). Hence, the model hyperparameters can also be estimated by maximizing (77). In practice, the model hyperparameters and the approximate posterior can be estimated at the same time, by jointly maximizing (77) with respect to all parameters (again using, e.g., conjugate gradient).

5.5.1 Expectation terms

The computation of (77) and its derivatives requires calculating $f(\mathbf{m}, \mathbf{v})$ and its derivatives,
$$f(m_i, v_i) = \mathbb{E}_{\eta_i|m_i, v_i}[\log p(y_i|\theta(\eta_i))], \qquad \frac{\partial f(m_i, v_i)}{\partial \phi} = \frac{\partial}{\partial \phi} \mathbb{E}_{\eta_i|m_i, v_i}[\log p(y_i|\theta(\eta_i))], \qquad (78)$$
$$\frac{\partial f(m_i, v_i)}{\partial m_i} = \frac{\partial}{\partial m_i} \mathbb{E}_{\eta_i|m_i, v_i}[\log p(y_i|\theta(\eta_i))], \qquad \frac{\partial f(m_i, v_i)}{\partial v_i} = \frac{\partial}{\partial v_i} \mathbb{E}_{\eta_i|m_i, v_i}[\log p(y_i|\theta(\eta_i))]. \qquad (79)$$
Plugging in the exponential family form, the first two terms can be rewritten as
$$f(m_i, v_i) = \frac{1}{a(\phi)} \Big\{ T(y_i)\, \mathbb{E}_{\eta_i|m_i, v_i}[\theta(\eta_i)] - \mathbb{E}_{\eta_i|m_i, v_i}[b(\theta(\eta_i))] \Big\} + c(\phi, y_i), \qquad (80)$$
$$\frac{\partial f(m_i, v_i)}{\partial \phi} = -\frac{\dot{a}(\phi)}{a(\phi)^2} \Big\{ T(y_i)\, \mathbb{E}_{\eta_i|m_i, v_i}[\theta(\eta_i)] - \mathbb{E}_{\eta_i|m_i, v_i}[b(\theta(\eta_i))] \Big\} + \dot{c}(\phi, y_i). \qquad (81)$$
Hence, the expectations $\mathbb{E}[\theta(\eta)]$ and $\mathbb{E}[b(\theta(\eta))]$ under a Gaussian distribution are required. Expressions for the last two terms can be obtained by directly taking the derivatives,
$$\frac{\partial f(m_i, v_i)}{\partial m_i} = \int \frac{\partial \mathcal{N}(\eta_i|m_i, v_i)}{\partial m_i} \log p(y_i|\theta(\eta_i))\, d\eta_i = \mathbb{E}_{\eta_i|m_i, v_i}\!\left[ \frac{\eta_i - m_i}{v_i} \log p(y_i|\theta(\eta_i)) \right], \qquad (82)$$
$$\frac{\partial f(m_i, v_i)}{\partial v_i} = \int \frac{\partial \mathcal{N}(\eta_i|m_i, v_i)}{\partial v_i} \log p(y_i|\theta(\eta_i))\, d\eta_i = \mathbb{E}_{\eta_i|m_i, v_i}\!\left[ \frac{(\eta_i - m_i)^2 - v_i}{2 v_i^2} \log p(y_i|\theta(\eta_i)) \right].$$
(83)

Hence, these two derivatives require the expectations $\mathbb{E}[\eta^k \theta(\eta)]$ and $\mathbb{E}[\eta^k b(\theta(\eta))]$, where $k \in \{1, 2\}$. For certain likelihood and link functions (e.g., Poisson with the canonical link), the above expectations have closed-form solutions; in other cases, they need to be approximated. Alternative expressions to (82, 83) can be obtained by performing the change of variable $\bar\eta = (\eta - m)/\sqrt{v}$ in the expectation,
$$\frac{\partial f(m_i, v_i)}{\partial m_i} = \frac{\partial}{\partial m_i} \mathbb{E}_{\bar\eta_i|0,1}\big[\log p(y_i|\theta(\sqrt{v_i}\, \bar\eta_i + m_i))\big] = \mathbb{E}_{\eta_i|m_i, v_i}[u(\eta_i, y_i)], \qquad (84)$$
$$\frac{\partial f(m_i, v_i)}{\partial v_i} = \frac{\partial}{\partial v_i} \mathbb{E}_{\bar\eta_i|0,1}\big[\log p(y_i|\theta(\sqrt{v_i}\, \bar\eta_i + m_i))\big] = \tfrac{1}{2}\, \mathbb{E}_{\eta_i|m_i, v_i}\!\left[ \frac{\eta_i - m_i}{v_i}\, u(\eta_i, y_i) \right]. \qquad (85)$$
Hence, alternatively, the expectations $\mathbb{E}[\eta^k \dot\theta(\eta)]$ and $\mathbb{E}[\eta^k \dot{b}(\theta(\eta)) \dot\theta(\eta)]$, $k \in \{0, 1\}$, are required under a Gaussian. The alternative forms in (84, 85) allow an intuitive comparison between the KLD method and the Laplace approximation, given in the next section.

5.5.2 Approximate posterior

After maximizing $L$, resulting in the optimal variational parameters $\{\hat{\boldsymbol\gamma}, \hat{\boldsymbol\lambda}\}$, the approximate posteriors have parameters
$$\hat{\mathbf{m}} = \mathbf{K} \hat{\boldsymbol\gamma}, \qquad \hat{\mathbf{V}} = (\mathbf{K}^{-1} + \hat{\boldsymbol\Lambda})^{-1}, \qquad (86)$$
$$\hat\mu_\eta = \mathbf{k}_*^T \hat{\boldsymbol\gamma}, \qquad \hat\sigma^2_\eta = k_{**} - \mathbf{k}_*^T (\mathbf{K} + \hat{\boldsymbol\Lambda}^{-1})^{-1} \mathbf{k}_*. \qquad (87)$$
An interesting comparison can be made with the predictive latent distribution under the Laplace approximation in (58). With the Laplace approximation, the latent mean depends on the first derivative $u$ of the observation log-likelihood evaluated at the mode, whereas with the variational method, the latent mean depends on the expectation of the first derivative, $\hat{\boldsymbol\gamma} = \mathbb{E}[u]$, using (84). Similarly, with the Laplace approximation, the effective noise term $\mathbf{W}$ is the inverse of the second derivative of the observation log-likelihood, while with KLD this term depends on an estimate of the second derivative obtained by differencing the first derivative around $\mathbf{m}$, given by (85).
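As a concrete check of the closed-form case mentioned above: for the Poisson likelihood with canonical link ($\theta(\eta)=\eta$, $b(\theta)=e^\theta$, $a(\phi)=1$, $T(y)=y$), we have $\mathbb{E}[b(\theta(\eta))] = \mathbb{E}[e^\eta] = e^{m+v/2}$ under $\mathcal{N}(\eta|m,v)$, so $f(m_i, v_i)$ has a closed form. The sketch below (plain trapezoidal integration standing in for the quadrature one would use in practice) verifies this:

```python
import math

def f_poisson_closed(y, m, v):
    # Closed form of f(m, v) for Poisson with canonical link:
    # E[y*eta - exp(eta) - log(y!)] under N(eta|m, v), using E[exp(eta)] = exp(m + v/2).
    return y * m - math.exp(m + 0.5 * v) - math.log(math.factorial(y))

def f_poisson_numeric(y, m, v, n=2000, width=8.0):
    # The same expectation by trapezoidal integration over [m - width*sd, m + width*sd].
    sd = math.sqrt(v)
    lo, hi = m - width * sd, m + width * sd
    h = (hi - lo) / n
    total = 0.0
    for j in range(n + 1):
        eta = lo + j * h
        w_j = 0.5 if j in (0, n) else 1.0   # trapezoid endpoint weights
        dens = math.exp(-(eta - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
        loglik = y * eta - math.exp(eta) - math.log(math.factorial(y))
        total += w_j * dens * loglik * h
    return total

y_i, m_i, v_i = 4, 1.0, 0.3   # illustrative values
closed = f_poisson_closed(y_i, m_i, v_i)
numeric = f_poisson_numeric(y_i, m_i, v_i)
```

For likelihoods without a closed form, only `f_poisson_numeric`-style integration (or Gauss-Hermite quadrature) is available, which is exactly the situation the summary in the next section makes explicit.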
5.6 Summary

In this section, we have studied the closed-form Taylor approximation and discussed its connections with output-transformed GPs. We also discussed other popular approximate inference methods in the context of GGPMs. Using the general EFD form for the likelihood, we can identify the specific quantities required by each algorithm in terms of the EFD parameters E, as summarized in Table 3. For example, EP requires the expectations of $\theta(\eta)$ and $b(\theta(\eta))$ under the approximate predictive distribution, whereas KL minimization requires these expectations under a Gaussian distribution. This result has two practical consequences: 1) the implementation of likelihood functions is simplified, since only the derivatives and expectations of the simple functions in E need to be implemented; 2) we elucidate which expectations may require numerical approximation. Furthermore, different likelihood and link functions can be combined in novel ways without much additional implementation effort.

Approximation  (quantity)             general likelihood                                                            GGPM likelihood
Taylor         posterior & marginal   $\frac{\partial^k}{\partial\eta^k}\log p(y|\eta)$, $k\in\{1,2\}$              $\dot b, \ddot b, \dot\theta, \ddot\theta$
               marginal derivatives   $\frac{\partial}{\partial\phi}\log p(y|\eta)$                                 $\dot a, \dot c$
Laplace        posterior & marginal   $\frac{\partial^k}{\partial\eta^k}\log p(y|\eta)$, $k\in\{1,2\}$              $\dot b, \ddot b, \dot\theta, \ddot\theta$
               marginal derivatives   $\frac{\partial^3}{\partial\eta^3}\log p(y|\eta)$; $\frac{\partial}{\partial\phi}\frac{\partial^k}{\partial\eta^k}\log p(y|\eta)$, $k\in\{0,2\}$   $\dot a, \dot c, \dddot b, \dddot\theta$
EP             posterior & marginal   $\log\hat Z_i$; $\frac{\partial^k}{\partial\eta_i^k}\log\hat Z_i$, $k\in\{1,2\}$   $\mathbb{E}_{\hat q}[\eta_i]$, $\mathrm{var}_{\hat q}(\eta_i)$
               marginal derivatives   $\frac{\partial}{\partial\phi}\log\hat Z_i$                                   $\dot c, \dot a, \mathbb{E}_{\hat q}[\theta(\eta)], \mathbb{E}_{\hat q}[b(\theta(\eta))]$
KLD            posterior & marginal   $\mathbb{E}[\eta^k\log p(y|\eta)]$, $k\in\{0,1,2\}$                           $\mathbb{E}[\eta^k\theta(\eta)]$, $\mathbb{E}[\eta^k b(\theta(\eta))]$, $k\in\{0,1,2\}$
               marginal derivatives   $\mathbb{E}[\frac{\partial}{\partial\phi}\log p(y|\eta)]$                     $\dot c, \dot a$

Table 3: The required calculations of the likelihood function for approximate inference.
The expectations $\mathbb{E}_{\hat q}$ and $\mathrm{var}_{\hat q}$ can be calculated from derivatives of $\log \hat{Z}_i$.

5.7 Implementation Details

The GGPM was implemented in MATLAB by extending the GPML toolbox (Rasmussen and Nickisch, 2010) to include implementations of: 1) the generic exponential family distribution using the parameters $\{a(\phi), b(\theta), c(y, \phi), \theta(\eta), T(y)\}$; 2) the closed-form Taylor approximation for inference; 3) the EP and KLD moments and the predictive distributions, approximated using numerical integration when necessary. Empirically, we found that EP was sensitive to the accuracy of the approximate integrals, and exhibited convergence problems when less accurate approximations were used (e.g., Gauss-Hermite quadrature). Hyperparameters (dispersion and kernel parameters) were optimized by maximizing the marginal likelihood, using the existing scaled conjugate gradient method in GPML. The code will be made available³.

3. http://visal.cs.cityu.edu.hk/downloads/

6. Comparison of approximate posteriors

In this section, we provide a theoretical and experimental comparison of the approximate posteriors from Section 5. In particular, we show that the efficacy of an approximate inference method is influenced by properties of the likelihood, the evaluation metrics, and the datasets.

6.1 Ordering of posterior means and predictive means

We first compare the latent posteriors of the Taylor, Laplace, and EP approximations for one latent variable. Consider an example using the Gamma-shape likelihood, and the corresponding true latent posterior $p(\eta|y) \propto p(y|\eta) p(\eta)$ in Figure 4 (top-left). The first derivative of the log-posterior, $f(\eta|y) = \frac{\partial}{\partial\eta} \log p(\eta|y)$, is plotted in Figure 4 (bottom-left). Note that the derivative of the Gamma-shape log-likelihood, $u(\eta, y) = \nu (y e^{-\eta} - 1)$, is convex and monotonically decreasing for all $y > 0$.
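As a side note to the implementation details in Section 5.7: a likelihood in the GGPM framework is fully specified by the EFD pieces $\{a(\phi), b(\theta), c(y,\phi), \theta(\eta), T(y)\}$ and their derivatives. A minimal sketch of such a likelihood interface (illustrative names, not the actual GPML extension), instantiated for the Poisson likelihood with canonical link:

```python
import math

class PoissonEFD:
    """Sketch of a GGPM likelihood interface: the likelihood is specified by
    the EFD pieces {a, b, c, theta, T} and their derivatives.
    Here: Poisson with canonical link, theta(eta) = eta, b(theta) = exp(theta)."""
    def a(self, phi):       return 1.0
    def b(self, th):        return math.exp(th)
    def db(self, th):       return math.exp(th)      # b-dot
    def d2b(self, th):      return math.exp(th)      # b-double-dot
    def theta(self, eta):   return eta               # canonical link
    def dtheta(self, eta):  return 1.0
    def d2theta(self, eta): return 0.0
    def T(self, y):         return y
    def c(self, phi, y):    return -math.log(math.factorial(y))

    def loglik(self, y, eta, phi=1.0):
        # log p(y | theta(eta)) = [T(y)*theta - b(theta)] / a(phi) + c(phi, y)
        th = self.theta(eta)
        return (self.T(y) * th - self.b(th)) / self.a(phi) + self.c(phi, y)

lik = PoissonEFD()
```

An inference routine written against this interface never sees the Poisson form directly; swapping in a different EFD instance changes the GP model without touching the inference code, which is the point made in the summary above.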
Claim 1: If the derivative of the observation log-likelihood, $\frac{\partial}{\partial\eta} \log p(y|\eta)$, is convex and monotonically decreasing, then the means of the 1-D approximate posteriors are ordered according to
$$\mu_{TA} < \mu_{LA} < \mu_{EP}, \qquad (88)$$
where $\mu_{TA}$, $\mu_{LA}$, and $\mu_{EP}$ are the latent means for the Taylor approximation, Laplace approximation, and EP, respectively. The ordering is reversed when $\frac{\partial}{\partial\eta} \log p(y|\eta)$ is concave and decreasing.

Proof: The log posterior is $\log p(\eta|y) \propto \log p(y|\eta) + \log p(\eta)$, and its derivative is
$$f(\eta|y) = \frac{\partial}{\partial\eta} \log p(y|\eta) + \frac{\partial}{\partial\eta} \log p(\eta). \qquad (89)$$
The first term on the RHS of (89) is assumed to be convex and monotonically decreasing, while the second term is a linear function with negative slope (the derivative of the Gaussian log-likelihood). Hence, $f(\eta|y)$ is also convex and monotonically decreasing. The mean of the Laplace approximation is the mode of the true posterior, i.e., the zero-crossing of $f(\eta|y)$, marked with a star (*) in Figure 4. The Taylor approximation is equivalent to one iteration of Newton's method on $f(\eta|y)$, starting at the expansion point $\tilde\eta$. Hence, geometrically, the mean of the Taylor approximation is the zero-crossing of the tangent line to $f(\eta|y)$ at the expansion point (marked with a circle (o) in Figure 4). Since $f(\eta|y)$ is convex and monotonically decreasing, the zero-crossing point of any tangent line is always less than the zero-crossing point of $f(\eta|y)$. Therefore, the Taylor mean is always smaller than the Laplace mean. Because $f(\eta|y)$ is monotonically decreasing and convex, the posterior $p(\eta|y)$ is skewed to the right, and its mean is larger than its mode. Since EP matches the mean of the approximation to the mean of the true posterior, the EP mean must be larger than the mode (i.e., the Laplace mean).
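The geometric argument above can be checked numerically in 1-D. The sketch below assumes a Gamma-shape-style log-likelihood with derivative $u(\eta, y) = \nu(y e^{-\eta} - 1)$ (convex and decreasing) and a standard normal prior; the illustrative values of $\nu$, $y$, and the expansion point $\tilde\eta = 0$ are assumptions of the sketch. The Taylor mean is one Newton step on $f(\eta|y)$, the Laplace mean is the converged Newton iterate, and the EP mean is the exact posterior mean by numerical integration:

```python
import math

nu, y = 2.0, 3.0                      # illustrative shape parameter and observation

def f(eta):                           # derivative of the log posterior (prior N(0, 1))
    return nu * (y * math.exp(-eta) - 1.0) - eta

def fprime(eta):                      # its derivative, for Newton's method
    return -nu * y * math.exp(-eta) - 1.0

def newton(eta0, iters):
    eta = eta0
    for _ in range(iters):
        eta -= f(eta) / fprime(eta)
    return eta

mu_ta = newton(0.0, 1)                # Taylor: one Newton step from the expansion point
mu_la = newton(0.0, 100)              # Laplace: converged to the posterior mode

# EP mean: the exact posterior mean, by trapezoidal integration, using
# log p(y|eta) = nu * (-y*exp(-eta) - eta) + const for the Gamma-shape likelihood.
lo, hi, n = -6.0, 8.0, 8000
h = (hi - lo) / n
Z = m1 = 0.0
for j in range(n + 1):
    eta = lo + j * h
    w = 0.5 if j in (0, n) else 1.0
    post = math.exp(nu * (-y * math.exp(-eta) - eta) - 0.5 * eta * eta)
    Z += w * post
    m1 += w * post * eta
mu_ep = m1 / Z
```

Running this yields the ordering of Claim 1, $\mu_{TA} < \mu_{LA} < \mu_{EP}$; replacing the likelihood derivative with a concave decreasing one reverses it.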
Claim 1 suggests that the posterior means follow a particular order for the 1-dimensional posterior. For the general multi-dimensional case, it is difficult to prove a similar result, since the dimensions of the latent posterior are correlated through the kernel matrix⁴. Nonetheless, we can empirically show that the ordering in (88) holds for multivariate posteriors on average. To this end, we ran a synthetic experiment using the Gamma-shape likelihood, with dispersion parameter $\phi \in [0.1, 5]$, and an RBF kernel function with scale $K_s = 2$ and bandwidth $K_w \in [0.1, 5]$. For each bandwidth $K_w$ and dispersion $\phi$ pair, 100 different functions are first randomly sampled from a Gamma-shape GGPM. For each function, 40 points are used for training, and the multivariate means of the approximate posteriors $\mu_{TA}$, $\mu_{LA}$, and $\mu_{EP}$ are calculated using the Taylor, Laplace, and EP methods, respectively. The differences between these means, averaged over all 100 trials, are plotted in Figure 5. For all parameter settings, the ordering holds for the multivariate means on average, i.e., the TA mean is always less than the LA mean, which is always less than the EP mean. Note that, as the dispersion level increases, the differences between the three means also increase. Increasing

4. When the kernel matrix is diagonal, i.e., the kernel function is a delta function, it is easy to show that the ordering in Claim 1 holds for each dimension.
Figure 4: Comparing the approximate posterior means of the Taylor, Laplace, and EP methods for the Gamma-shape (left) and Gamma-scale (right) likelihood functions. The first row shows the true latent posterior $p(\eta|y)$ and the EP approximation $q(\eta|y)$. The second row shows the first derivative of the log-posterior $f(\eta|y) = \frac{\partial}{\partial\eta} \log p(\eta|y)$. The mean of the Laplace approximation is the zero-crossing point of $f(\eta|y)$. The mean of the Taylor approximation is the zero-crossing of the tangent line at the expansion point (one iteration of Newton's method).

Figure 5: The difference between approximate posterior means for the Gamma-shape likelihood function. The top row compares Taylor ($\mu_{TA}$) vs.
Laplace ($\mu_{LA}$), while the bottom row compares Laplace ($\mu_{LA}$) vs. EP ($\mu_{EP}$). In each row, the two plots show the difference w.r.t. the dispersion parameter $\phi$ and the RBF kernel bandwidth $k_w$.

the dispersion level will "stretch" the observation log-likelihood term. This scales down its derivative, and as a result, the zero-crossing point of the tangent line ($\mu_{TA}$) moves further from the zero-crossing of $f(\eta|y)$ (i.e., $\mu_{LA}$). Similarly, "stretching" the observation log-likelihood also moves the mean further from the mode, thus increasing the difference between $\mu_{LA}$ and $\mu_{EP}$. For the kernel bandwidth, increasing the bandwidth also stretches the prior term. This scales down the derivatives of $f$, and as a result, the difference between TA and LA increases in a similar way. On the other hand, as the bandwidth increases, the difference between LA and EP decreases. Increasing the bandwidth squashes the prior, making it more uniform. As a result, the prior has less influence on the mean of the posterior, compared to the likelihood term, and the EP mean converges to the mean of the likelihood term.

The ordering of the posterior means also suggests that the predictive means will follow a similar order, i.e., the predictions using TA will be less than those of LA, and the predictions using EP will always be larger than those of LA. This is difficult to prove theoretically, since the variance of the posterior also affects the predictive mean, but it has been observed empirically in toy examples, as well as in the experiments on real data in Section 8. This is demonstrated in Figure 6, which uses a Gamma-shape GGPM with the four inference methods. The training samples are distributed in three distinct regions and follow a trend similar to an exponential function. All four inference methods capture the exponential trend well.
However, given an input $x_*$, the latent values for TA, LA, and EP exhibit the ordering of Claim 1, as seen in the second row of Figure 6(a). In addition, the predictive means also follow the ordering. The KLD and EP methods have almost overlapping latent means. EP and KLD both optimize the KL divergence between the true posterior $p$ and the approximation $q$; the difference is that EP optimizes $D(p\|q)$ while KLD optimizes $D(q\|p)$. For a unimodal distribution $p$, minimizing either $D(q\|p)$ or $D(p\|q)$ will correctly capture the mean of $p$.

6.2 Effect on prediction error

The systematic ordering of the approximate posterior and predictive means suggests that differences in prediction accuracy between approximation methods are influenced by the distribution of the test data. If the test data has more points "below the mean curve", then TA will have better accuracy because it systematically underpredicts. On the other hand, if there are more points "above the mean curve", then EP will have better accuracy. This effect is illustrated in Figure 6b. Because of the configuration of the data, the predictive mean function passes above the middle points and below the extremal points. Thus TA will have the smallest prediction error in the middle region and the largest error at the extremes. In contrast, EP will have the lowest error on points near the ends, and the largest error in the middle region. If the test data contains more points in the middle, as in Figure 6b, then the Taylor approximation will have lower average predictive error. In contrast, if the test data contains more points at the extremes, as in Figure 6a, then EP will have lower average error.

                Gamma-shape                          Gamma-scale
                First Ex. (a)    Second Ex. (b)      First Ex. (a)    Second Ex. (b)
Inference       MAE     NLP      MAE     NLP         MAE     NLP      MAE     NLP
Taylor          0.894   1.293    1.123   1.429       0.581   1.274    0.900   1.309
Laplace         0.818   1.282    1.144   1.424       0.633   1.243    0.715   1.241
EP              0.815   1.282    1.145   1.424       0.636   1.243    0.713   1.241
KLD             0.815   1.282    1.145   1.424       0.636   1.243    0.713   1.241

Table 4: Average errors for the two 1D examples.

Figure 6: Two 1D examples comparing the different inference methods: (a) the first example, where EP has the best performance; (b) the second example, where Taylor has the best performance. In each example, the top row shows the learned Gamma-shape-GGPM regression models with the four inference methods (Taylor, Laplace, EP, KLD), the middle row shows the differences between the predictive distributions, and the bottom row shows the differences between the latent functions.

To quantify this difference in predictive accuracy, a test set was generated by randomly sampling points in the neighborhood of each training point. Each inference method was evaluated using two measures: 1) the mean absolute error (MAE), which measures goodness-of-fit; and 2) the mean negative log predictive density evaluated at the test points (NLP), which measures how well the model predicts the entire density (Snelson et al., 2004). Evaluation results are presented in Table 4. For the first example (Figure 6a), there are more test points in the end regions, and EP achieves the lowest MAE of 0.815, versus 0.894 for TA. On the other hand, in the second example (Figure 6b), where most of the test data is in the middle region, TA is the most accurate of the four methods (MAE of 1.123, versus 1.145 for EP). Finally, if the likelihood is changed from Gamma-shape to Gamma-scale, the MAE results of the four inference methods are reversed for the two examples (see Table 4, right), since the derivative of the Gamma-scale log-likelihood is concave. If we evaluate the results by NLP, the Taylor method always performs worse than the other three methods.
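The two evaluation measures can be sketched as follows; for illustration, the NLP here assumes Gaussian predictive marginals, whereas in the GGPM experiments the predictive density is the model's (generally non-Gaussian) predictive distribution:

```python
import math

def mae(y_true, y_pred):
    # mean absolute error: goodness-of-fit of the predictive mean
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def nlp(y_true, pred_means, pred_vars):
    # mean negative log predictive density (Gaussian marginals, for illustration)
    total = 0.0
    for yt, m, v in zip(y_true, pred_means, pred_vars):
        total += 0.5 * math.log(2 * math.pi * v) + (yt - m) ** 2 / (2 * v)
    return total / len(y_true)
```

Unlike MAE, NLP penalizes overconfident predictive variances as well as biased means, which is why the Taylor method can win on MAE yet lose on NLP.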
From these examples and Claim 1, we conclude that the performance of the different inference methods is strongly affected by the data distribution, the likelihood function, and the evaluation metric. The systematic ordering of the latent posterior means can affect the prediction accuracy if the test set is unbalanced or skewed. Real-life datasets are often noisy and high-dimensional, and it is unlikely that a single inference method will dominate for all datasets and metrics.

7. Initializing hyperparameter estimation

As with GPR, the estimation of GGPM hyperparameters by maximizing the marginal likelihood often suffers from the problem of multiple local optima. Figure 7 shows the negative log-marginal likelihood (NML), as a function of the RBF kernel width and scale hyperparameters, for the four inference methods on the rainfall dataset (see Section 8.1 for a description). The NML surfaces of all four inference methods have at least two local optima, and the Laplace, EP, and KLD methods produce very similar surfaces.

A common approach to hyperparameter estimation is to initialize the log-hyperparameters to zero, and then optimize using the scaled conjugate gradient method (Rasmussen and Nickisch, 2010). However, in the example in Figure 7, this initialization strategy leads to different hyperparameter estimates for each inference method: TA, KLD, and EP converge to similar local optima on the left, whereas Laplace converges to a different local optimum on the right. Hence, a more robust initialization method is required to better explore the search space, and to ensure that a good optimum can be found for each inference method. One strategy is to run the optimization procedure many times from a large set of random initializations; however, for some inference algorithms, e.g., EP, the computational burden would be large.
Figure 7: Contour plots of the negative log marginal likelihood as a function of the RBF kernel width and scale hyperparameters, for the four approximate inference methods (Taylor, Laplace, EP, KLD). The iterations of scaled conjugate gradient minimization initialized at (0, 0) are plotted in green.

We propose an efficient initialization strategy that uses Taylor inference to quickly find a few good initializations for the other inference methods (Laplace, KLD, and EP). The procedure is illustrated in Figure 8. Taylor inference is used to optimize the hyperparameters from 50 random initializations (see Figure 8a), resulting in convergence to several local optima with different marginal likelihoods. The top 3 unique local optima are then used as initializations for the Laplace and EP methods, with the results presented in Figures 8b and 8c. In both cases, Taylor-initialized Laplace and EP recover the same local optima as the randomly-initialized versions, but with a significant reduction in computational cost (3 times faster for Laplace, and 13 times faster for EP). The hyperparameters yielding the largest marginal likelihood can then be selected as the estimate.
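The two-stage strategy can be sketched generically: optimize a cheap surrogate objective from many random starts, keep the few unique local optima, and refine the expensive objective only from those. The objectives below are toy 1-D stand-ins (not actual GGPM marginal likelihoods), and the optimizer is plain gradient descent rather than scaled conjugate gradient:

```python
import math, random

def cheap_obj(x):
    # stand-in for the cheap (Taylor-style) objective: two local minima
    return (x ** 2 + x - 2.0) ** 2 + 0.5 * x

def expensive_obj(x):
    # stand-in for an expensive (EP-style) objective: a perturbed version
    return cheap_obj(x) + 0.1 * math.sin(3.0 * x)

def minimize(f, x0, lr=0.01, iters=2000):
    # plain gradient descent with central-difference numerical gradients
    x = x0
    for _ in range(iters):
        g = (f(x + 1e-5) - f(x - 1e-5)) / 2e-5
        x -= lr * g
    return x

random.seed(0)
# Stage 1: many cheap runs to map out the local optima.
starts = [random.uniform(-4.0, 3.0) for _ in range(20)]
cheap_optima = sorted({round(minimize(cheap_obj, s), 2) for s in starts})
# Stage 2: refine the expensive objective only from the few unique cheap optima.
candidates = [minimize(expensive_obj, x) for x in cheap_optima]
best = min(candidates, key=expensive_obj)
```

The savings come from Stage 2: the expensive objective is evaluated from only a handful of starting points rather than all 50 random ones, mirroring the 3x and 13x speedups reported above.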
Figure 8: Illustration of using Taylor inference to initialize the other inference methods for hyperparameter estimation. (a) Candidate local optima found using the Taylor method with 50 random initializations; (b) local optima of the Laplace method initialized by Taylor, compared with 50 random initializations; (c) local optima of the EP method initialized by Taylor, compared with 50 random initializations.

Table 5 shows a quantitative comparison between random initialization and Taylor initialization on three datasets (servo, auto-mpg, and housing from Section 8.2). We compare the two initialization methods using the relative change in MAE,
$$\Delta\mathrm{MAE} = \frac{\mathrm{MAE}_T - \mathrm{MAE}_R}{\mathrm{MAE}_R},$$
where $\mathrm{MAE}_T$ and $\mathrm{MAE}_R$ are the MAEs when using Taylor initialization and random initialization, respectively. The relative change in NLP is calculated analogously. In all cases, the relative changes in MAE are small (within 0.002) and not statistically significant (paired $t$-test, $p > 0.15$). A similar conclusion holds for the relative change in NLP. Taylor initialization yields a significant reduction in computational cost; for example, Taylor-initialized EP was about 26 times faster than random-initialized EP.
Furthermore, with the help of Taylor initialization, we did not encounter any convergence problems with EP (similarly observed in the real-data experiments of Section 8). These experiments demonstrate that Taylor initialization can speed up hyperparameter estimation for the other inference methods, while maintaining the same quality as fully random initialization.

                         Laplace                                          EP
Likelihood     ∆MAE            ∆NLP            speedup       ∆MAE            ∆NLP            speedup
Gamma_sh        0.000±0.0001    0.000±0.0000   2.97±0.901     0.000±0.0001    0.000±0.0023   24.92±7.566
                (0.9987)        (0.4531)                      (0.1889)        (0.4090)
Gamma_sc       −0.002±0.0127    0.002±0.0077   3.17±1.109    −0.001±0.0026   −0.001±0.0019   26.86±6.487
                (0.3474)        (0.2597)                      (0.2056)        (0.1270)
Inv.Gauss      −0.000±0.0012    0.001±0.0023   3.44±1.590     0.002±0.0164    0.013±0.0534   28.78±9.861
                (0.3296)        (0.2709)                      (0.5010)        (0.2103)
all            −0.001±0.0074    0.001±0.0046   3.20±1.236     0.000±0.0095    0.004±0.0312   26.85±8.158
                (0.3005)        (0.1567)                      (0.6372)        (0.2158)

Table 5: Comparison between random initialization and Taylor initialization in terms of the relative changes in MAE and NLP, and the speedup factor. Parentheses denote $p$-values from a paired $t$-test; the differences are not statistically significant.

8. Experiments

In this section, we present experiments using GGPMs and the inference methods on a wide variety of real-world datasets. In Section 8.1, we consider finite counting data and the Binomial-GGPM. In Section 8.2, we experiment with regression to non-negative reals (Gamma-GGPM, Inverse-Gaussian GGPM). Finally, Section 8.3 presents results on range data (Beta-GGPM), and Section 8.4 considers counting data and Poisson-GGPMs⁵. In the following experiments, we use the Taylor initialization method from Section 7 to speed up hyperparameter estimation with the other inference methods (LA, EP, KLD).
For all inference methods, the best hyperparameter is selected as the one with the largest marginal likelihood on the training set. Our results indicate that the performances of the different inference methods are highly affected by likelihoods, evaluation metrics, and datasets. Each inference method, even the Taylor method, can have the best performance. Furthermore, our hypothesis testing results show that the difference between EP and KLD is not statistically significant; the Laplace approximation and EP often perform comparably.

8.1 Binomial Example

In this section we apply the binomial-GGPM to the Tokyo rainfall dataset 6. The dataset records the number of occurrences of rainfall over 1mm in Tokyo for every calendar day in 1983 and 1984. The rainfall occurrence for a given calendar day follows a binomial distribution. We assign the occurrence counts "0", "1", and "2" to the outputs y ∈ {0, 0.5, 1} of the binomial model. The input feature is the calendar day (0 to 365), and the RBF kernel was used 7. As presented earlier, Figure 7 shows the negative log marginal likelihood (NML) as a function of the RBF kernel width and scale, and Figure 8 shows the results of Taylor-initialization, which yielded 5 local minima in the NML that correspond to different interpretations of the data. The best two interpretations (largest marginal likelihoods) are presented in Figures 9a and 9c, using the four inference methods. Figure 9a uses a larger kernel width, resulting in a smoother function that shows the clear seasonal pattern in Tokyo, as described by Kitagawa (1987): dry winter (Dec., Jan., and Feb.), rainy season in late June to mid-July, stable hot summer in late July through Aug., and generally fine weather with an occasional typhoon in Sept. and Oct. It would be difficult to identify these trends by only looking at the original data. Finally, Figure 10 depicts the curves for the remaining

5. In this paper, we do not present results using GP regression and classification, which have been extensively studied in Rasmussen and Williams (2006); Kuss and Rasmussen (2005); Nickisch and Rasmussen (2008).
6. http://www.stat.uni-muenchen.de/service/datenarchiv/tokio/tokio_e.html
7. Since the input feature is the calendar day, a cyclic kernel which wraps around from 365 to 0 could be used to better model the correlation between days. In this paper, we do not adopt the cyclic kernel, in order to follow the same settings as the reference methods.

Figure 9: Two binomial-GGPM mean functions on Tokyo rainfall data, and comparisons with other methods: (a) binomial-GGPM mean function using large kernel width (smoother function), and (b) comparison to Biller (2000) and Fahrmeir et al. (1994); (c) binomial-GGPM using small kernel width (rougher function), with (d) comparison of Taylor inference to Biller (2000) and (e) comparison of Laplace inference to Fahrmeir et al. (1994).
Figure 10: Estimated binomial-GGPM mean functions on Tokyo rainfall, corresponding to the three bad local optima (too large or too small kernel widths) shown in Figure 8.

three bad local minima, where the kernel bandwidth is either too small or too large. The Laplace, EP, and KLD methods have almost overlapping estimates for all five local optima. The Taylor method also captures similar trends as the other three methods.

We compare the binomial-GGPMs with two spline-based regression models. The first model (Fahrmeir et al., 1994) is an extension of the GLM that replaces the linear function with a cubic spline. In this model a parameter λ_S is used to control the tradeoff between data fit and smoothness of the cubic function. In Fahrmeir et al. (1994), λ_S was estimated by cross-validation, resulting in two local minima, λ_S = 4064 and λ_S = 32. The larger λ_S = 4064 yields a relatively smoother curve, which is very close to the smoother estimate using the binomial-GGPM (see Figure 9b). The curve with λ_S = 32 is quite similar to the rougher estimate of the binomial-GGPM using Laplace inference. The difference between the two local optima can be attributed to different smoothness levels.

The second model (Biller, 2000) is a fully Bayesian approach to regression splines with automatic knot placement. A Poisson distribution with parameter λ_N is placed over the number of knots.
Experimental results (Biller, 2000) show that the rainfall dataset is very sensitive to the choice of λ_N. With a smaller λ_N the resulting function is very smooth, while increasing the value of λ_N allows a more flexible function. The estimate with λ_N = 100 is similar to the rougher curve estimated by Taylor inference (see Figure 9d). When the value is decreased to λ_N = 30, the curve becomes similar to the smoother binomial-GGPM estimate (see Figure 9b). From the two comparisons, we can see that the shapes of the resulting curves are highly affected by the chosen hyperparameters. If we use the negative marginal likelihood as a metric, the smoother curve will be favored by the Taylor method, while the rougher curve will be selected by the other three inference methods.

Dataset Name   X_dim   Y_min   Y_max   N_train   N_test
servo          4       0.13    7.10    70        97
auto-mpg       7       9.00    46.60   100       298
housing        12      5.00    50.00   200       306
abalone        8       1.00    29.00   1000      3177

Table 6: Datasets for non-negative real regression.

8.2 Non-negative Real Numbers Experiments

In this section, we perform four experiments on regression to non-negative real numbers using GGPMs with the Gamma and Inverse Gaussian likelihoods. We use the following four UCI datasets: 1) abalone 8 – predict the age of abalone from physical measurements; 2) housing 9 – predict housing values in suburbs of Boston; 3) auto-mpg 10 – estimate city-cycle fuel consumption in miles per gallon (mpg); 4) servo 11 – predict the rise time of a servo-mechanism in terms of two gain settings and two choices of mechanical linkages. The four datasets are summarized in Table 6, which lists the output range (Y_min, Y_max), the input dimension (X_dim), and the size of the training and test sets (N_train, N_test). In all experiments, we follow the same testing protocol: a given number of training samples are randomly selected from the dataset, and the remaining data is used for testing. This process is repeated 10 times, and the means and standard deviations of the MAE and NLP are reported.

                           abalone dataset                            housing dataset
Lik*     Inf.      MAE         MSE          NLP          MAE         MSE           NLP
GP       Exact     1.60±0.029  4.66±0.232   2.19±0.012   2.40±0.136  11.27±1.716   2.60±0.047
LGP      Exact     1.47±0.016  4.12±0.125   2.00±0.009   2.22±0.141  10.39±1.858   2.57±0.040
WGP      Exact     1.52±0.024  4.36±0.169   1.94±0.010   2.22±0.138  10.24±1.386   2.54±0.191
GA_sh    Taylor    1.49±0.021  4.26±0.238   1.99±0.007   2.21±0.140  10.46±1.841   2.59±0.044
GA_sh    Laplace   1.55±0.022  4.55±0.232   2.01±0.007   2.21±0.144  10.41±1.885   2.58±0.041
GA_sh    EP        1.55±0.022  4.57±0.237   2.01±0.007   2.22±0.146  10.46±1.897   2.58±0.040
GA_sh    KLD       1.55±0.022  4.53±0.185   2.01±0.007   2.22±0.146  10.46±1.891   2.58±0.040
GA_sc    Taylor    1.67±0.023  5.10±0.173   2.12±0.012   2.23±0.159  10.27±2.282   2.52±0.038
GA_sc    Laplace   1.53±0.024  4.41±0.268   2.06±0.010   2.23±0.146  10.35±2.104   2.52±0.036
GA_sc    EP        1.53±0.025  4.39±0.274   2.06±0.010   2.22±0.141  10.30±2.046   2.52±0.037
GA_sc    KLD       1.53±0.025  4.35±0.300   2.05±0.010   2.22±0.142  10.29±2.071   2.52±0.038
INV      Taylor    1.42±0.022  4.11±0.354   1.99±0.014   2.27±0.156  11.26±2.080   2.75±0.069
INV      Laplace   1.54±0.024  4.73±0.440   1.98±0.008   2.28±0.163  11.05±2.073   2.72±0.064
INV      EP        1.55±0.025  4.78±0.449   1.99±0.008   2.32±0.171  11.39±2.172   2.71±0.059
INV      KLD       1.56±0.019  5.04±0.576   1.99±0.007   2.32±0.170  11.36±2.149   2.71±0.059

                           auto-mpg dataset                           servo dataset
Lik*     Inf.      MAE         MSE          NLP          MAE         MSE          NLP
GP       Exact     2.11±0.053  8.69±0.349   2.48±0.038   0.43±0.047  0.33±0.058   1.06±0.033
LGP      Exact     2.10±0.090  8.71±0.655   2.36±0.044   0.27±0.035  0.18±0.042   0.32±0.044
WGP      Exact     2.08±0.061  8.45±0.672   2.39±0.055   0.25±0.031  0.18±0.043   0.13±0.113
GA_sh    Taylor    2.09±0.072  8.71±0.540   2.36±0.043   0.26±0.035  0.20±0.048   0.11±0.084
GA_sh    Laplace   2.09±0.073  8.64±0.506   2.36±0.040   0.27±0.037  0.20±0.047   0.09±0.072
GA_sh    EP        2.09±0.074  8.65±0.503   2.36±0.040   0.27±0.037  0.20±0.048   0.09±0.070
GA_sh    KLD       2.09±0.075  8.66±0.498   2.36±0.041   0.27±0.037  0.20±0.048   0.09±0.071
GA_sc    Taylor    2.10±0.076  8.73±0.536   2.40±0.036   0.28±0.024  0.21±0.041   0.19±0.040
GA_sc    Laplace   2.09±0.064  8.74±0.496   2.40±0.035   0.26±0.023  0.19±0.033   0.18±0.043
GA_sc    EP        2.09±0.064  8.75±0.503   2.40±0.036   0.25±0.023  0.18±0.036   0.17±0.048
GA_sc    KLD       2.09±0.065  8.74±0.495   2.40±0.036   0.25±0.024  0.18±0.036   0.17±0.048
INV      Taylor    2.11±0.092  8.88±0.890   2.39±0.055   0.33±0.045  0.34±0.123   0.41±0.141
INV      Laplace   2.08±0.071  8.58±0.624   2.36±0.043   0.35±0.046  0.41±0.139   0.41±0.155
INV      EP        2.08±0.069  8.59±0.589   2.36±0.041   0.37±0.050  0.45±0.139   0.38±0.146
INV      KLD       2.08±0.065  8.59±0.567   2.36±0.042   0.37±0.051  0.45±0.147   0.40±0.150

Table 7: Average errors for the servo, auto-mpg, housing and abalone datasets. (*likelihood abbreviations: LGP - GP on the log-transformed outputs; WGP - warped GP; GA_sh - Gamma_sh-GGPM; GA_sc - Gamma_sc-GGPM; INV - Inv. Gaussian-GGPM)
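The random-split testing protocol described above can be sketched in a few lines. This is an illustrative harness only; `fit_predict` is a hypothetical stand-in for any of the trained GP/GGPM predictors compared in Table 7.

```python
import numpy as np

def evaluate(X, y, n_train, fit_predict, n_trials=10, seed=0):
    """Randomly select n_train training samples, test on the remainder,
    repeat n_trials times, and report the mean and std of the test MAE."""
    rng = np.random.default_rng(seed)
    maes = []
    for _ in range(n_trials):
        perm = rng.permutation(len(y))
        tr, te = perm[:n_train], perm[n_train:]
        y_hat = fit_predict(X[tr], y[tr], X[te])
        maes.append(np.mean(np.abs(y_hat - y[te])))
    return np.mean(maes), np.std(maes)

# Toy usage with a baseline predictor that always outputs the training mean
# (standing in for a fitted GGPM); 167 points mimics servo's 70/97 split.
X = np.linspace(0.0, 1.0, 167).reshape(-1, 1)
y = 3.0 + X.ravel()
mean_mae, std_mae = evaluate(
    X, y, n_train=70,
    fit_predict=lambda Xtr, ytr, Xte: np.full(len(Xte), ytr.mean()))
```

The same per-trial loop also collects the NLP when the predictor returns a predictive density rather than a point estimate.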
To test statistical significance, we use the Friedman test (Howell, 2010), which is a non-parametric test on the differences of several related samples, based on ranking. To find candidate hyperparameters in each trial, we use the Taylor-initialization procedure described in Section 7. Then the candidate hyperparameters are used as the initialization for the other three inference methods. We also implemented three other typical GP models: standard GPR, GPR on the log-transformed data, and the warped GP (Snelson et al., 2004).

8. http://archive.ics.uci.edu/ml/datasets/Abalone
9. http://archive.ics.uci.edu/ml/datasets/Housing
10. http://archive.ics.uci.edu/ml/datasets/Auto+MPG
11. http://archive.ics.uci.edu/ml/datasets/Servo

Experimental results are presented in Table 7. The non-negative GGPMs typically perform better than standard GPR. As expected from Section 5.2.5, the log-transformed GP performs similarly to the Gamma_sh-GGPM with Taylor inference. However, there are small performance differences due to the different marginal likelihoods used to estimate the hyperparameters; the former is based on the standard GP marginal in (9), while the latter is based on the Taylor marginal in (48), which includes an extra penalty term on the dispersion hyperparameter. The learned warping functions of the WGP for the four datasets are also log-like, and hence the WGP also has similar performance.

The Inv.-Gaussian-GGPM with Taylor inference can also be viewed as using a standard GP on the log of the outputs, but with observation noise that is output dependent (see Section 5.2.5). This property is beneficial for the auto-mpg and abalone datasets, where the Inv.-Gaussian-GGPM with Taylor inference achieves the smallest fitting errors. For the servo and housing datasets, however, this likelihood is worse than the Gamma_sh likelihood.
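The Friedman statistic used throughout this section is simple to compute by hand: rank the methods within each trial, sum the ranks per method, and form a chi-square statistic from the rank sums. The sketch below uses hypothetical NLP values (not the paper's data) and assumes no ties within a trial.

```python
import numpy as np

def friedman_statistic(scores):
    """Friedman chi-square statistic for an (n_trials, k_methods) score matrix.
    Lower scores receive better (lower) ranks; the statistic tests whether the
    average ranks of the k methods differ. Assumes no ties within a trial."""
    n, k = scores.shape
    # Rank methods within each trial (1 = best, i.e. smallest score).
    ranks = scores.argsort(axis=1).argsort(axis=1) + 1
    R = ranks.sum(axis=0)  # rank sum per method
    return 12.0 / (n * k * (k + 1)) * np.sum(R**2) - 3.0 * n * (k + 1)

# Hypothetical per-trial NLP values for three likelihoods over 10 trials.
rng = np.random.default_rng(0)
nlp = np.column_stack([
    rng.normal(2.00, 0.01, 10),  # e.g. Gamma_sh
    rng.normal(2.05, 0.01, 10),  # e.g. Gamma_sc
    rng.normal(1.99, 0.01, 10),  # e.g. Inv. Gaussian
])
stat = friedman_statistic(nlp)
# Compare against the chi-square critical value with k - 1 = 2 dof (5.99 at 0.05).
```

Because one synthetic likelihood is clearly worse than the other two, the statistic here exceeds the critical value, so the null hypothesis of equal rankings would be rejected.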
Next, we rank each likelihood function based on the resulting performance in either NLP or MAE, and use a Friedman test to determine significance 12. Table 8 shows the average rankings and p-values for various likelihood combinations. Looking at NLP, there is a best-ranked likelihood function that is statistically significant in all cases except one (p < 0.001). For auto-mpg, the Gamma_sh has a slightly better ranking than Inv.-Gaussian, and the difference is marginally significant (0.05 < p < 0.06). Since each likelihood can have the smallest average ranking, all three GGPMs should be taken into consideration when a new dataset is given.

Looking at MAE performance, there are also significant differences in ranking between likelihoods on each dataset. Note though that in many cases the difference between the top two likelihoods is not significant (e.g., housing, auto-mpg, servo). This is mainly because the relative performance of the different inference methods changes between likelihood functions; e.g., on servo, Taylor inference performs better than the other inference methods when using the Gamma_sh, but the ranking is reversed when using the Gamma_sc. Again this indicates that the effects of the likelihood functions on MAE are dataset dependent, so the choice of likelihood function is important.
(a) Using NLP as measurement
dataset    GA_sh, GA_sc, INV (p)           GA_sc, INV (p)          GA_sh, GA_sc (p)        GA_sh, INV (p)
abalone    1.900, 3.000, 1.100 (0.0000)    2.000, 1.000 (0.0000)   1.000, 2.000 (0.0000)   1.900, 1.100 (0.0000)
housing    2.000, 1.000, 3.000 (0.0000)    1.000, 2.000 (0.0000)   2.000, 1.000 (0.0000)   1.000, 2.000 (0.0000)
auto-mpg   1.400, 2.800, 1.800 (0.0000)    1.850, 1.150 (0.0000)   1.050, 1.950 (0.0000)   1.350, 1.650 (0.0578)
servo      1.125, 1.875, 3.000 (0.0000)    1.000, 2.000 (0.0000)   1.125, 1.875 (0.0000)   1.000, 2.000 (0.0000)

(b) Using MAE as measurement
dataset    GA_sh, GA_sc, INV (p)           GA_sc, INV (p)          GA_sh, GA_sc (p)        GA_sh, INV (p)
abalone    2.375, 1.563, 2.063 (0.0012)    1.313, 1.688 (0.0163)   1.750, 1.250 (0.0016)   1.625, 1.375 (0.1138)
housing    1.750, 1.837, 2.413 (0.0053)    1.300, 1.700 (0.0114)   1.462, 1.538 (0.6310)   1.288, 1.712 (0.0065)
auto-mpg   1.816, 2.395, 1.789 (0.0117)    1.684, 1.316 (0.0231)   1.289, 1.711 (0.0094)   1.526, 1.474 (0.7456)
servo      1.528, 1.472, 3.000 (0.0000)    1.000, 2.000 (0.0000)   1.528, 1.472 (0.7389)   1.000, 2.000 (0.0000)

Table 8: Average rankings and p-values (in parentheses) for different likelihood combinations using the Friedman test. In each grouping, bolded rankings differ significantly from non-bold rankings (p < 0.05), but not from each other.

12. We pool trials over all inference methods.
(a) Using NLP as measurement
Dataset   Lik     TA, LA, EP, KLD (p)                   TA, LA (p)              LA, EP (p)              EP, KLD (p)
abalone   INV     3.050, 1.650, 2.500, 2.800 (0.0683)   1.700, 1.300 (0.2059)   1.200, 1.800 (0.0339)   1.400, 1.600 (0.4795)
housing   GA_sc   2.200, 2.300, 2.700, 2.800 (0.6537)   1.400, 1.600 (0.5271)   1.400, 1.600 (0.5271)   1.500, 1.500 (1.0000)
auto-mpg  GA_sh   3.400, 2.300, 2.050, 2.250 (0.0488)   1.800, 1.200 (0.0578)   1.600, 1.400 (0.4142)   1.450, 1.550 (0.5637)
servo     GA_sh   3.050, 1.900, 2.500, 2.550 (0.2318)   1.800, 1.200 (0.0578)   1.400, 1.600 (0.5271)   1.500, 1.500 (1.0000)
all       -       2.925, 2.038, 2.438, 2.600 (0.0137)   1.675, 1.325 (0.0269)   1.400, 1.600 (0.1701)   1.462, 1.538 (0.5127)

(b) Using MAE as measurement
Dataset   Lik     TA, LA, EP, KLD (p)                   TA, LA (p)              LA, EP (p)              EP, KLD (p)
abalone   GA_sc   4.000, 3.000, 1.750, 1.250 (0.0000)   2.000, 1.000 (0.0016)   2.000, 1.000 (0.0016)   1.750, 1.250 (0.0588)
housing   GA_sh   2.350, 2.050, 2.550, 3.050 (0.3411)   1.500, 1.500 (1.0000)   1.350, 1.650 (0.3173)   1.350, 1.650 (0.3173)
auto-mpg  INV     3.300, 1.800, 2.300, 2.600 (0.0694)   1.800, 1.200 (0.0578)   1.300, 1.700 (0.2059)   1.400, 1.600 (0.5271)
servo     GA_sc   4.000, 3.000, 1.700, 1.300 (0.0000)   2.000, 1.000 (0.0016)   2.000, 1.000 (0.0016)   1.700, 1.300 (0.1025)
all       -       3.413, 2.462, 2.075, 2.050 (0.0000)   1.825, 1.175 (0.0000)   1.663, 1.337 (0.0374)   1.550, 1.450 (0.4795)

Table 9: Average rankings and p-values (in parentheses) for different inference methods using the Friedman test. Only the best performing likelihood function is considered. In each grouping, bolded rankings differ significantly from non-bold rankings (p < 0.05), but not from each other.

We next compare the approximate inference methods for the best-performing likelihood functions (according to the average rankings in Table 8) on each dataset.
Table 9 shows the average rankings and the corresponding p-values using a Friedman test. Looking at the rankings based on NLP in Table 9a, EP and KLD have similar rankings (within 0.2) on each dataset, and any differences are not statistically significant. Similarly, LA and EP also have similar rankings (within 0.2) for each dataset except abalone. Over all datasets, LA has the best ranking (2.038), but this result is not statistically significant, suggesting that the rank orderings of LA, EP, and KLD are not consistent. Finally, LA has a statistically better ranking than TA when pooling over all datasets, although the difference is not large; LA outperforms TA about two-thirds of the time.

The results are similar when looking at the rankings based on MAE in Table 9b. First, looking at the rankings over all datasets, as before, EP and KLD have almost identical rankings in MAE, but now EP has better MAE than LA about two-thirds of the time (statistically significant). Finally, LA typically dominates TA in MAE ranking over all datasets. Interestingly, there are some datasets (e.g., housing, auto-mpg) where there is no significant difference between the inference algorithms. In other words, the best inference algorithm may change in each trial, based on the particular training and test set.

In summary, the performances of the inference methods are highly affected by the likelihoods, datasets, and evaluation metrics. For a given dataset, the choice of likelihood has a large impact on the predictive density (NLP), with usually only one likelihood dominating, and less so on the prediction error (MAE), where typically more than one likelihood can achieve low error. Given the "correct" likelihood, the performance of the inference methods tends to be similar; e.g., LA, EP, and KLD have similar rankings when evaluated with NLP, and EP and KLD have similar rankings for MAE.
However, given the "wrong" likelihood, the performance of the inference algorithms can be highly affected by the dataset and the evaluation metrics.

8.3 Range Data Experiments

In this experiment, we consider conversion of device-dependent RGB values to device- and illuminant-independent reflectance spectra. In Heikkinen et al. (2008), this conversion is cast as a regularized regression problem, where the input can be RGB or HSV color values, and the output is reflectance values at sampled wavelengths. In particular, the reflectance spectrum is first scaled from the [0, 1] interval to [-1, 1] and mapped to a real value via the inverse hyperbolic tangent (arctanh) function, and then a regularized regression framework is applied. The method in Heikkinen et al. (2008) is equivalent to applying standard GPR to the logit-transformed spectral values, as discussed in Section 5.2.5. Note that the hyperparameters are fixed in Heikkinen et al. (2008), whereas using the GP interpretation, the hyperparameters can be estimated automatically using maximum marginal likelihood.

Since the regression output is constrained to the unit interval, we consider the Beta-GGPM for spectral reflectance regression, and perform experiments on the Munsell dataset, consisting of 1269 RGB/spectral pairs. Following the protocol of Heikkinen et al. (2008), we used 669 pairs for training and 600 for testing, and results are averaged over 10 trials. We also used a smaller training set, consisting of 20% of the original training set. Hyperparameters are learned using maximum marginal likelihood.

(a) Full training dataset
Model        Inference   Avg. Error       Max. Error      Std. Error
GP           Exact       0.0090±0.00032   0.0919±0.0195   0.0102±0.00100
arctanh+GP   Exact       0.0087±0.00036   0.0946±0.0205   0.0101±0.00098
Beta-GGPM    Taylor      0.0088±0.00037   0.0961±0.0195   0.0103±0.00098
Beta-GGPM    Laplace     0.0088±0.00038   0.0965±0.0199   0.0103±0.00100
Beta-GGPM    EP          0.0087±0.00038   0.0946±0.0208   0.0101±0.00100

(b) Small training dataset
Model        Inference   Avg. Error       Max. Error      Std. Error
GP           Exact       0.0132±0.00115   0.0999±0.0179   0.0132±0.00221
arctanh+GP   Exact       0.0123±0.00074   0.0965±0.0186   0.0124±0.00141
Beta-GGPM    Taylor      0.0123±0.00080   0.0970±0.0170   0.0124±0.00147
Beta-GGPM    Laplace     0.0123±0.00079   0.0973±0.0172   0.0124±0.00143
Beta-GGPM    EP          0.0122±0.00077   0.0967±0.0183   0.0124±0.00141

Table 10: Average errors for the Munsell dataset.

The experimental results are presented in Table 10. First, the standard GP performs worse than the Beta-GGPM, due to the mismatch between the output domain and the actual outputs. This effect is more pronounced when less training data is available; when using the smaller training set, the average error drops around 8% for the Beta-GGPM versus the GP. Next, the Beta-GGPM with Taylor inference and the Gauss-GGPM+arctanh have almost the same values for the three error metrics; this is consistent with our conclusion that the Beta-GGPM with Taylor inference can be viewed as using a standard GP on the logit transformation of the outputs. Finally, the three inference methods (TA, LA, EP) perform similarly for the Beta-GGPM on this dataset, in terms of average error.

Table 11 shows the Friedman test results for the different inference combinations. For the full training set, LA and EP have similar average rankings, with TA ranked 0.6 worse. However, the differences are not statistically significant, due to the similar average error values. For the reduced training set, EP has the best ranking, followed by TA and then LA. Again, the rankings are not statistically significant, although EP is marginally better than LA (0.05 < p < 0.06).
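The transform chain is easy to state in code. Note the identity arctanh(2y − 1) = ½ log(y / (1 − y)), i.e. half the logit, which is why the arctanh approach of Heikkinen et al. (2008) and the logit view of the Beta-GGPM with Taylor inference coincide up to a scale factor. This is a sketch of the transforms only, not the paper's implementation.

```python
import numpy as np

def to_latent(y, eps=1e-6):
    """Map reflectance y in (0, 1) to the real line: scale to (-1, 1),
    then apply arctanh. Equals half the logit of y."""
    y = np.clip(y, eps, 1.0 - eps)  # guard the endpoints
    return np.arctanh(2.0 * y - 1.0)

def from_latent(f):
    """Inverse map: latent regression output back to the (0, 1) interval."""
    return 0.5 * (np.tanh(f) + 1.0)

y = np.array([0.10, 0.50, 0.90])
f = to_latent(y)                        # fit standard GPR on (x, f) pairs
assert np.allclose(from_latent(f), y)   # round trip recovers the reflectance
assert np.allclose(f, 0.5 * np.log(y / (1.0 - y)))  # half-logit identity
```

Because both maps are strictly monotone, point predictions (and quantiles) transform back to the unit interval without leaving it, which is the property the standard GP lacks.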
(a) Full training dataset
Metrics      TA, LA, EP (p)                 TA, LA (p)              TA, EP (p)              LA, EP (p)
Avg. Error   2.450, 1.850, 1.700 (0.0724)   1.750, 1.250 (0.0253)   1.700, 1.300 (0.1024)   1.600, 1.400 (0.4142)
Max. Error   1.900, 2.500, 1.600 (0.1225)   1.300, 1.700 (0.2059)   1.600, 1.400 (0.5271)   1.800, 1.200 (0.0578)
Std. Error   2.500, 2.050, 1.450 (0.0581)   1.700, 1.300 (0.2059)   1.800, 1.200 (0.0578)   1.750, 1.250 (0.0956)

(b) Small training dataset
Metrics      TA, LA, EP (p)                 TA, LA (p)              TA, EP (p)              LA, EP (p)
Avg. Error   2.050, 2.400, 1.550 (0.1316)   1.350, 1.650 (0.3173)   1.700, 1.300 (0.2059)   1.750, 1.250 (0.0588)
Max. Error   1.800, 2.000, 2.200 (0.6703)   1.400, 1.600 (0.5271)   1.400, 1.600 (0.5271)   1.400, 1.600 (0.5271)
Std. Error   1.700, 2.350, 1.950 (0.3225)   1.250, 1.750 (0.0956)   1.450, 1.550 (0.7389)   1.600, 1.400 (0.5271)

Table 11: Beta-GGPM: average rankings of different inference methods and p-values using the Friedman test. Bolded rankings differ significantly from non-bold rankings (p < 0.05).

8.4 Counting Experiments

We perform two counting experiments using GGPMs with Poisson-based likelihoods. In all cases, predictions are based on the mode of the predictive distribution for GGPMs, and the rounded, truncated mean for GPR.

In the first experiment, we perform crowd counting using the UCSD crowd counting dataset 13 from Chan et al. (2008). The dataset contains 30-dimensional features extracted from images and the corresponding number of people in each image, for two different motion directions (right and left). The goal is to predict the number of people using just the image features. The right crowd contains more people (an average of 14.69 per image) than the left crowd (an average of 9.98). The dataset consists of 2000 feature/count pairs for each direction, and following Chan et al. (2008), we use 800 for training and 1200 for testing.
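The two prediction rules above can be sketched directly (an illustrative rendering, not the paper's code): the GPR prediction rounds its real-valued predictive mean mu and truncates at zero, while the Poisson-GGPM predicts the mode of Poisson(lam), which is floor(lam) for non-integer lam.

```python
import numpy as np

def predict_count_gpr(mu):
    """Standard GPR on counts: round the predictive mean to the nearest
    integer and truncate negative values at zero."""
    return np.maximum(np.rint(mu), 0.0).astype(int)

def predict_count_poisson(lam):
    """Poisson-GGPM: predict the mode of Poisson(lam), i.e. floor(lam)
    for non-integer rates lam."""
    return np.floor(lam).astype(int)

mu = np.array([-0.3, 2.6, 14.2])   # GPR predictive means
lam = np.array([2.6, 14.2])        # Poisson rates from the GGPM
print(predict_count_gpr(mu))       # counts: 0, 3, 14
print(predict_count_poisson(lam))  # counts: 2, 14
```

The truncation step matters for GPR: nothing constrains its predictive mean to be non-negative, whereas the Poisson mode is a valid count by construction.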
We predict using the Poisson and COM-Poisson GGPMs with the exponential mean mapping (canonical link function), as well as the versions using the linear mean mapping (linearized link function) from Section 4.3.2. The compound linear plus RBF kernel was used for all models.

The crowd counting results are presented in Table 12. On the "right" crowd, the Poisson- and COM-Poisson-GGPMs perform better using the exponential mapping versus the linear mapping. This is due to the large number of people in the "right" crowd, which leads to a more non-linear (exponential) trend in the feature space. In contrast, the linearized link functions perform better on the "left" crowd, indicating a more linear trend in the data (due to smaller crowd sizes and fewer occlusions). Looking at the likelihood functions, the Poisson likelihood has higher accuracy on the "right" crowd, whereas the COM-Poisson is better on the "left" crowd. The main difference is that the COM-Poisson provides some flexibility to control the variance of the observation noise, which helps for the "left" crowd.

For hypothesis testing, we randomly selected 10 small training sets, consisting of 400 feature/count pairs, from the original training dataset. The learned GGPMs are evaluated on the original test dataset, and the experimental results are presented in Table 13. Table 14 shows the average rankings and p-values using the Friedman test for different likelihood combinations. We can see that the Poisson- and COM-Poisson-GGPMs perform significantly better than the corresponding linearized models on the "right" crowd, and vice versa, the linearized link functions perform better on the "left" crowd. This is consistent with our observations from the experiments on the original training and test datasets.

13. Data set at http://visal.cs.cityu.edu.hk/downloads/#ucsdpeds-feats

                            Right crowd                         Left crowd
                        exponential    linearized          exponential    linearized
Likelihood    Inference  MAE    NLP     MAE    NLP          MAE    NLP     MAE    NLP
GP            Exact      -      -       1.56   2.94         -      -       0.85   1.83
Poisson       Taylor     1.25   2.33    1.36   2.34         1.03   2.20    0.88   2.18
Poisson       Laplace    1.27   2.33    1.36   2.34         1.03   2.20    0.87   2.18
Poisson       EP         1.27   2.33    1.37   2.34         1.03   2.20    0.87   2.18
COM-Poisson   Taylor     1.39   2.50    1.48   2.50         0.94   1.76    0.91   1.63
COM-Poisson   Laplace    1.39   2.54    1.49   2.51         0.94   1.77    0.85   1.52
COM-Poisson   EP         1.43   2.54    1.49   2.52         0.94   1.76    0.83   1.55

Table 12: Mean absolute errors for crowd counting, with comparisons between likelihood functions, inference methods, and link functions (exponential vs. linearized mean mapping).

                            Right crowd                                               Left crowd
                   exponential mean         linearized mean          exponential mean         linearized mean
Lik.  Inf.   MAE          NLP          MAE          NLP          MAE          NLP          MAE          NLP
GP    Exact  -            -            1.56±0.027   2.56±0.059   -            -            0.88±0.022   1.69±0.055
PO    TA     1.31±0.038   2.34±0.038   1.46±0.036   2.35±0.034   1.04±0.023   2.19±0.036   0.95±0.024   2.17±0.027
PO    LA     1.33±0.033   2.34±0.038   1.43±0.030   2.35±0.035   1.02±0.019   2.19±0.030   0.93±0.019   2.17±0.019
PO    EP     1.33±0.034   2.34±0.038   1.42±0.032   2.35±0.035   1.02±0.018   2.19±0.030   0.93±0.018   2.17±0.019
COM   TA     1.54±0.076   2.61±0.106   1.50±0.046   2.43±0.079   0.96±0.038   1.83±0.055   0.92±0.045   1.68±0.052
COM   LA     1.55±0.098   2.22±0.069   1.58±0.037   2.43±0.072   0.95±0.021   1.81±0.047   0.86±0.023   1.65±0.043
COM   EP     1.40±0.033   2.20±0.057   1.54±0.048   2.35±0.034   0.92±0.016   1.75±0.051   0.86±0.027   1.62±0.049

Table 13: Average errors for the crowd counting dataset using the reduced training set.
(likelihood abbreviations: PO - Poisson; COM - COM-Poisson)

(a) Using NLP as measurement
dataset   PO, L-PO, COM, L-COM (p)              PO, L-PO (p)            COM, L-COM (p)          L-PO, L-COM (p)
Right     1.758, 2.788, 2.030, 3.424 (0.0000)   1.000, 2.000 (0.0000)   1.333, 1.667 (0.0555)   1.121, 1.879 (0.0000)
Left      4.000, 3.000, 2.000, 1.000 (0.0000)   2.000, 1.000 (0.0000)   2.000, 1.000 (0.0000)   2.000, 1.000 (0.0000)

(b) Using MAE as measurement
dataset   PO, L-PO, COM, L-COM (p)              PO, L-PO (p)            COM, L-COM (p)          L-PO, L-COM (p)
Right     1.061, 2.333, 3.061, 3.545 (0.0000)   1.000, 2.000 (0.0000)   1.394, 1.606 (0.2230)   1.061, 1.939 (0.0000)
Left      3.933, 2.333, 2.656, 1.078 (0.0000)   2.000, 1.000 (0.0000)   2.000, 1.000 (0.0000)   1.944, 1.056 (0.0000)

Table 14: Average rankings and p-values (in parentheses) for different likelihood combinations using the Friedman test. Bolded rankings differ significantly from non-bold rankings (p < 0.05). (likelihood abbreviations: PO - Poisson; L-PO - linearized Poisson; COM - COM-Poisson; L-COM - linearized COM-Poisson)

In the second experiment, the GGPM is used for age estimation from face images. We use the FG-NET dataset 14, which consists of face images of 82 people at different ages (an average of 12 images per person). The input vector to the GGPM is 150 facial features, extracted using active appearance models (Cootes et al., 2001), while the output is the age of the face. We used leave-one-person-out testing, as in Zhang and Yeung (2010), to evaluate the performance of the different GGPMs. For each fold, the images of one person are used for testing, and all the other people are used as the training dataset.

The experimental results for age estimation are presented in Table 15. The linearized Poisson-GGPM has the lowest MAE of 5.82, versus 6.12 for standard GPR.
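The leave-one-person-out splits can be sketched as follows (illustrative only; `person_ids` is an assumed array mapping each image to its subject).

```python
import numpy as np

def leave_one_person_out(person_ids):
    """Yield (train_idx, test_idx) pairs: each fold holds out every image
    of one person and trains on the images of all remaining people."""
    person_ids = np.asarray(person_ids)
    for pid in np.unique(person_ids):
        test = np.flatnonzero(person_ids == pid)
        train = np.flatnonzero(person_ids != pid)
        yield train, test

# Toy usage: 6 images of 3 people.
ids = [0, 0, 1, 1, 1, 2]
folds = list(leave_one_person_out(ids))
assert len(folds) == 3                 # one fold per person
assert folds[0][1].tolist() == [0, 1]  # person 0's images held out together
```

Splitting by person (rather than by image) prevents images of the same subject from appearing in both the training and test sets, which would otherwise inflate the accuracy estimate.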
Table 16a shows the Friedman test results for the different likelihood combinations; the results indicate that the linearized Poisson-GGPM has a significantly better ranking than the other two GGPMs. Table 16b shows the Friedman test for the different inference combinations. Although there are small differences in MAE between the inference algorithms for each likelihood, the rankings are very similar, which indicates that no inference method dominates in terms of MAE. In general, Laplace and EP perform similarly, and there is no statistically significant difference between them. Moreover, the Taylor approximation can outperform the other two methods for some GGPMs, e.g., the Neg. binomial-GGPM, but again the difference is not statistically significant. Finally, Figure 11 presents an example prediction on a test person.

Method                              Inference   MAE
GP                                  Exact       6.12
Warped GP (Zhang and Yeung, 2010)   Exact       6.11
Poisson GGPM                        Taylor      6.44
Poisson GGPM                        Laplace     6.41
Poisson GGPM                        EP          6.40
Linearized Poisson GGPM             Taylor      5.98
Linearized Poisson GGPM             Laplace     5.82
Linearized Poisson GGPM             EP          5.83
Neg. binomial GGPM                  Taylor      6.19
Neg. binomial GGPM                  Laplace     6.37
Neg. binomial GGPM                  EP          6.37

Table 15: MAEs for age estimation.

Figure 11: Examples of predicted distributions (probability over age for test faces aged 7, 14, 34, and 41 years).

(a) Different likelihoods
Inference       PO, L-PO, NB (p)               L-PO, NB (p)            PO, L-PO (p)            PO, NB (p)
all             2.193, 1.805, 2.002 (0.0001)   1.433, 1.567 (0.0328)   1.628, 1.372 (0.0000)   1.565, 1.435 (0.0364)

(b) Different inference methods
Likelihood      TA, LA, EP (p)                 TA, LA (p)              TA, EP (p)              LA, EP (p)
Poisson         2.128, 1.976, 1.896 (0.2550)   1.555, 1.445 (0.3113)   1.573, 1.427 (0.1687)   1.530, 1.470 (0.4233)
Lin. Poisson    2.152, 1.927, 1.921 (0.2094)   1.567, 1.433 (0.2216)   1.585, 1.415 (0.1175)   1.494, 1.506 (0.8946)
Neg. binomial   1.890, 2.073, 2.037 (0.4013)   1.433, 1.567 (0.2100)   1.457, 1.543 (0.4189)   1.506, 1.494 (0.8815)

Table 16: Average rankings and p-values using the Friedman test. Bolded rankings differ significantly from non-bold rankings (p < 0.05). (likelihood abbreviations: PO - Poisson; L-PO - Lin. Poisson; NB - Neg. binomial)

14. Data set at http://www.fgnet.rsunit.com

8.5 Summary

Our experiments have considered a variety of likelihood functions, inference methods, and datasets. In general, the choice of likelihood function is more important than the choice of approximate inference algorithm. Using a likelihood function that matches the output domain and observation noise will typically lead to better performance over the standard GP. The link function can also affect the final result, and the selection of the link function is dataset dependent, similar to the choice of the kernel function.

Looking at the inference methods, EP and KLD typically have similar performance, and Laplace inference also achieves comparable results. Taylor inference sometimes yields the smallest MAEs, while its NLP is often larger than that of the other three inference methods. This suggests that accurate estimation in terms of NLP does not always lead to smaller fitting error in terms of MAE. The performances of the different inference methods are also highly affected by the distribution of training and test data, due to the systematic ordering observed in the posterior means (Section 6). Finally, on many datasets there is no dominant approximate inference algorithm, yielding mixed results and average rankings with differences that are not statistically significant.

9. Conclusions

In this paper, we have studied approximate inference for generalized Gaussian process models. The GGPM is a unifying framework for existing GP models, where the observation likelihood of the GP model is itself parameterized using the exponential family distribution (EFD).
This allows approximate inference algorithms to be derived using the general form of the EFD. A particular GP model can then be formed by setting the parameters of the EFD, which instantiates a particular observation likelihood function for an output domain. In addition to the observation likelihood, the GGPM also has a link function that controls the mapping between the latent variable (GP prior) and the mean of the output distribution. By appropriately setting the link function, mean trends (e.g., logarithmic) can be learned that would otherwise not be possible with standard positive-definite kernel functions.

We also study an approximate inference method based on a Taylor approximation, which is non-iterative and computationally efficient. The Taylor approximation can justify many common heuristics in GP modeling (e.g., label regression, GPR on log-transformed outputs, and GPR on logit-transformed outputs) as principled inference with a particular likelihood function. We also present approximate inference for GGPMs using the Laplace approximation, expectation propagation, and KL divergence minimization. Furthermore, we demonstrate that the posterior means of the Taylor, Laplace, and EP approximations usually have a specific ordering, which is a result of the particular method of approximation. As a consequence, since the posterior means are biased in this way, the prediction error of a particular inference algorithm depends heavily on the distributions of the training and test data. Finally, we perform hyperparameter estimation using the Taylor approximation to initialize the other, less efficient, approximate inference methods. Our initialization procedure can greatly increase the speed of the learning phase, while not significantly affecting the quality of the hyperparameters. In addition, we did not notice any convergence issues of EP when using our initialization procedure.
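As a concrete illustration of this non-iterative inference: for the Poisson GGPM with log link, expanding at the canonical point eta_i = log(y_i) (cf. Appendix B, Eq. (131)) gives targets t_i = log(y_i) and effective noise w_i = 1/y_i, so the Taylor approximation reduces to GP regression on log-counts with heteroscedastic noise. A minimal sketch (function and variable names are ours, the kernel and toy data are illustrative):

```python
import numpy as np

def taylor_poisson_gp(K, Ks, kss, y):
    """Closed-form Taylor inference for a Poisson GGPM with log link.

    At the canonical expansion point eta_i = log(y_i), the targets and
    effective noise become t_i = log(y_i) and w_i = 1/y_i, so inference
    is just GP regression with a heteroscedastic noise term.
    K: n x n train kernel; Ks: n x m train-test kernel; kss: m prior variances.
    """
    t = np.log(y)                      # transformed targets (requires counts y > 0)
    A = K + np.diag(1.0 / y)           # K + W-tilde
    mu = Ks.T @ np.linalg.solve(A, t)  # posterior mean of the latent eta*
    var = kss - np.sum(Ks * np.linalg.solve(A, Ks), axis=0)
    return mu, var

# Toy illustration with an RBF kernel (hyperparameters are arbitrary).
rbf = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)
X = np.array([0.0, 1.0, 2.0])
Xs = np.array([0.5, 1.5])
y = np.array([2.0, 5.0, 3.0])          # observed counts
mu, var = taylor_poisson_gp(rbf(X, X), rbf(X, Xs), np.ones(2), y)
```

This makes explicit why the "GPR on log-transformed outputs" heuristic can be read as principled inference with a Poisson likelihood.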
We conduct a comprehensive set of experiments, using a variety of likelihood functions, approximate inference methods, and datasets. In our experiments, we found that the selection of the correct likelihood function has a larger impact on prediction accuracy than the approximate inference method. Indeed, in many cases there was no dominant inference method, and any differences in average ranking were not statistically significant, whereas the appropriate choice of likelihood function and link function improved accuracy significantly.

Finally, in this paper we have only considered univariate observations for the GGPM. Future work will use the multivariate exponential family distribution to form a multivariate GGPM. Such a model would encompass existing multivariate GPs, such as GP ordinal regression (Chu and Ghahramani, 2005), multi-class GP classification (Girolami and Rogers, 2006; Kim and Ghahramani, 2006; Williams and Barber, 1998), and semiparametric latent factor models (Teh et al., 2005).

Appendix A. Closed-form Taylor Approximation

This appendix contains the derivations for the closed-form Taylor approximation in Section 5.2.

A.1 Joint likelihood approximation

Summing over (43), we have the approximation to the output likelihood,

\[
\log p(\mathbf{y}\,|\,\theta(\boldsymbol{\eta})) = \sum_{i=1}^n \log p(y_i\,|\,\theta(\eta_i))
\approx \sum_{i=1}^n \log p(y_i\,|\,\theta(\tilde{\eta}_i)) + \tilde{u}_i(\eta_i-\tilde{\eta}_i) - \tfrac{1}{2}\tilde{w}_i^{-1}(\eta_i-\tilde{\eta}_i)^2 \tag{90}
\]
\[
= -\tfrac{1}{2}(\boldsymbol{\eta}-\tilde{\boldsymbol{\eta}})^T \tilde{W}^{-1}(\boldsymbol{\eta}-\tilde{\boldsymbol{\eta}}) + \tilde{\mathbf{u}}^T(\boldsymbol{\eta}-\tilde{\boldsymbol{\eta}}) + \log p(\mathbf{y}\,|\,\theta(\tilde{\boldsymbol{\eta}})).
\]
(91)

Substituting (91) into (42), we obtain an approximation to the joint posterior,

\[
\log q(\mathbf{y},\boldsymbol{\eta}\,|\,X) = \log p(\mathbf{y}\,|\,\theta(\tilde{\boldsymbol{\eta}})) - \tfrac{1}{2}\log|K| - \tfrac{n}{2}\log 2\pi - \tfrac{1}{2}(\boldsymbol{\eta}-\tilde{\boldsymbol{\eta}})^T\tilde{W}^{-1}(\boldsymbol{\eta}-\tilde{\boldsymbol{\eta}}) + \tilde{\mathbf{u}}^T(\boldsymbol{\eta}-\tilde{\boldsymbol{\eta}}) - \tfrac{1}{2}\boldsymbol{\eta}^T K^{-1}\boldsymbol{\eta} \tag{92}
\]
\[
= \log p(\mathbf{y}\,|\,\theta(\tilde{\boldsymbol{\eta}})) - \tfrac{1}{2}\log|K| - \tfrac{n}{2}\log 2\pi \underbrace{- \tfrac{1}{2}(\boldsymbol{\eta}-\tilde{\boldsymbol{\eta}}-\tilde{W}\tilde{\mathbf{u}})^T\tilde{W}^{-1}(\boldsymbol{\eta}-\tilde{\boldsymbol{\eta}}-\tilde{W}\tilde{\mathbf{u}}) - \tfrac{1}{2}\boldsymbol{\eta}^T K^{-1}\boldsymbol{\eta}}_{} + \tfrac{1}{2}\tilde{\mathbf{u}}^T\tilde{W}\tilde{\mathbf{u}}. \tag{93}
\]

Next, we note that the bracketed term in (93) is of the form

\[
(\mathbf{x}-\mathbf{a})^T B(\mathbf{x}-\mathbf{a}) + \mathbf{x}^T C\mathbf{x} = \mathbf{x}^T D\mathbf{x} - 2\mathbf{x}^T B\mathbf{a} + \mathbf{a}^T B\mathbf{a} \tag{94}
\]
\[
= \mathbf{x}^T D\mathbf{x} - 2\mathbf{x}^T D D^{-1}B\mathbf{a} + \mathbf{a}^T B^T D^{-1}B\mathbf{a} - \mathbf{a}^T B^T D^{-1}B\mathbf{a} + \mathbf{a}^T B\mathbf{a} \tag{95}
\]
\[
= \big\|\mathbf{x}-D^{-1}B\mathbf{a}\big\|^2_{D^{-1}} + \mathbf{a}^T(B - B^T D^{-1}B)\mathbf{a} = \big\|\mathbf{x}-D^{-1}B\mathbf{a}\big\|^2_{D^{-1}} + \|\mathbf{a}\|^2_{B^{-1}+C^{-1}}, \tag{96}
\]

where \(D = B + C\), and the last line uses the matrix inversion lemma. Using this property (with \(\mathbf{x}=\boldsymbol{\eta}\), \(\mathbf{a}=\tilde{\boldsymbol{\eta}}+\tilde{W}\tilde{\mathbf{u}}\), \(B=\tilde{W}^{-1}\), \(C=K^{-1}\)), the joint likelihood can be rewritten as (44).

A.2 Special expansion point

For the case when \(\eta_i = g(T(y_i))\), we note that

\[
T(y_i) - \dot{b}(\theta(g(T(y_i)))) = T(y_i) - \dot{b}(\dot{b}^{-1}(g^{-1}(g(T(y_i))))) = 0, \tag{97}
\]

which yields (46).

A.3 Approximate Marginal

The approximate marginal is obtained by substituting the approximate joint (44),

\[
\log p(\mathbf{y}\,|\,X) = \log\int \exp(\log p(\mathbf{y},\boldsymbol{\eta}\,|\,X))\,d\boldsymbol{\eta} \approx \log\int \exp(\log q(\mathbf{y},\boldsymbol{\eta}\,|\,X))\,d\boldsymbol{\eta} \tag{98}
\]
\[
= \log p(\mathbf{y}\,|\,\theta(\tilde{\boldsymbol{\eta}})) - \tfrac{1}{2}\log|K| - \tfrac{n}{2}\log 2\pi - \tfrac{1}{2}\|\tilde{\mathbf{t}}\|^2_{\tilde{W}+K} + \tfrac{1}{2}\tilde{\mathbf{u}}^T\tilde{W}\tilde{\mathbf{u}} + \log\int e^{-\frac{1}{2}\|\boldsymbol{\eta}-A^{-1}\tilde{W}^{-1}\tilde{\mathbf{t}}\|^2_{A^{-1}}}\,d\boldsymbol{\eta} \tag{99}
\]
\[
= \log p(\mathbf{y}\,|\,\theta(\tilde{\boldsymbol{\eta}})) - \tfrac{1}{2}\log|K| - \tfrac{n}{2}\log 2\pi - \tfrac{1}{2}\|\tilde{\mathbf{t}}\|^2_{\tilde{W}+K} + \tfrac{1}{2}\tilde{\mathbf{u}}^T\tilde{W}\tilde{\mathbf{u}} + \log\,(2\pi)^{\frac{n}{2}}\,\big|A^{-1}\big|^{\frac{1}{2}} \tag{100}
\]
\[
= \log p(\mathbf{y}\,|\,\theta(\tilde{\boldsymbol{\eta}})) - \tfrac{1}{2}\log|K||A| - \tfrac{1}{2}\|\tilde{\mathbf{t}}\|^2_{\tilde{W}+K} + \tfrac{1}{2}\tilde{\mathbf{u}}^T\tilde{W}\tilde{\mathbf{u}}. \tag{101}
\]

Looking at the determinant term,

\[
\log|A||K| = \log\big|(\tilde{W}^{-1}+K^{-1})K\big| = \log\big|I+\tilde{W}^{-1}K\big| = \log|\tilde{W}+K|\,|\tilde{W}^{-1}|. \tag{102}
\]

Hence, the approximate marginal is (48).
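The completion-of-squares identity in (94)-(96), with \(D=B+C\) and \(\|\mathbf{v}\|^2_M = \mathbf{v}^T M^{-1}\mathbf{v}\), can be checked numerically; a quick sketch with random symmetric positive-definite matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
x = rng.normal(size=n)
a = rng.normal(size=n)

def random_spd(n):
    """Random symmetric positive-definite matrix."""
    M = rng.normal(size=(n, n))
    return M @ M.T + n * np.eye(n)

B, C = random_spd(n), random_spd(n)
D = B + C

# Left side: (x - a)^T B (x - a) + x^T C x
lhs = (x - a) @ B @ (x - a) + x @ C @ x

# Right side: ||x - D^{-1}Ba||^2_{D^{-1}} + ||a||^2_{B^{-1}+C^{-1}},
# with ||v||^2_M = v^T M^{-1} v
m = np.linalg.solve(D, B @ a)
rhs = (x - m) @ D @ (x - m) + a @ np.linalg.solve(np.linalg.inv(B) + np.linalg.inv(C), a)

assert np.isclose(lhs, rhs)
```

The second right-hand term relies on the matrix inversion lemma, \((B^{-1}+C^{-1})^{-1} = B - B(B+C)^{-1}B\), exactly as used in the last line of (96).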
The dispersion penalty can be further rewritten as \(r(\phi)=\sum_{i=1}^n r_i(\phi)\), where for an individual data point we have

\[
r_i(\phi) = \log p(y_i\,|\,\theta(\tilde{\eta}_i)) + \tfrac{1}{2}\tilde{w}_i\tilde{u}_i^2 + \tfrac{1}{2}\log|\tilde{w}_i|. \tag{103}
\]

A.3.1 Derivatives wrt Hyperparameters

The derivative of (48) with respect to the kernel hyperparameter \(\alpha_j\) is

\[
\frac{\partial}{\partial\alpha_j}\log q(\mathbf{y}\,|\,X) = \tfrac{1}{2}\tilde{\mathbf{t}}^T(K+\tilde{W})^{-1}\frac{\partial K}{\partial\alpha_j}(K+\tilde{W})^{-1}\tilde{\mathbf{t}} - \tfrac{1}{2}\mathrm{tr}\Big((K+\tilde{W})^{-1}\frac{\partial K}{\partial\alpha_j}\Big) \tag{104}
\]
\[
= \tfrac{1}{2}\mathrm{tr}\Big(\big(\mathbf{z}\mathbf{z}^T-(K+\tilde{W})^{-1}\big)\frac{\partial K}{\partial\alpha_j}\Big), \qquad \mathbf{z} = (K+\tilde{W})^{-1}\tilde{\mathbf{t}}, \tag{105}
\]

where \(\frac{\partial K}{\partial\alpha_j}\) is the element-wise derivative of the kernel matrix with respect to the kernel hyperparameter \(\alpha_j\), and we use the derivative properties

\[
\frac{\partial}{\partial\alpha}A^{-1} = -A^{-1}\frac{\partial A}{\partial\alpha}A^{-1}, \qquad \frac{\partial}{\partial\alpha}\log|A| = \mathrm{tr}\Big(A^{-1}\frac{\partial A}{\partial\alpha}\Big). \tag{106}
\]

A.3.2 Derivative wrt Dispersion

For the derivative with respect to the dispersion parameter, we first note that

\[
\frac{\partial}{\partial\phi}\tilde{w}_i = \dot{a}(\phi)\Big\{\ddot{b}(\theta(\tilde{\eta}_i))\dot{\theta}(\tilde{\eta}_i)^2 - \big[T(y_i)-\dot{b}(\theta(\tilde{\eta}_i))\big]\ddot{\theta}(\tilde{\eta}_i)\Big\}^{-1} = \frac{\dot{a}(\phi)}{a(\phi)}\tilde{w}_i, \tag{107}
\]
\[
\frac{\partial}{\partial\phi}\tilde{W} = \frac{\dot{a}(\phi)}{a(\phi)}\tilde{W}, \tag{108}
\]
\[
\frac{\partial}{\partial\phi}\tilde{u}_i = -\frac{\dot{a}(\phi)}{a(\phi)^2}\dot{\theta}(\tilde{\eta}_i)\big[T(y_i)-\dot{b}(\theta(\tilde{\eta}_i))\big] = -\frac{\dot{a}(\phi)}{a(\phi)}\tilde{u}_i, \tag{109}
\]
\[
\frac{\partial}{\partial\phi}\big(\tilde{w}_i\tilde{u}_i^2\big) = \frac{\dot{a}(\phi)}{a(\phi)}\tilde{w}_i\tilde{u}_i^2 + 2\tilde{w}_i\tilde{u}_i\Big(-\frac{\dot{a}(\phi)}{a(\phi)}\tilde{u}_i\Big) = -\frac{\dot{a}(\phi)}{a(\phi)}\tilde{w}_i\tilde{u}_i^2, \tag{110}
\]
\[
\frac{\partial}{\partial\phi}\log p(y_i\,|\,\theta(\tilde{\eta}_i)) = -\frac{\dot{a}(\phi)}{a(\phi)^2}\big[T(y_i)\theta(\tilde{\eta}_i)-b(\theta(\tilde{\eta}_i))\big] + \dot{c}(\phi,y_i) = -\frac{\dot{a}(\phi)}{a(\phi)}\tilde{v}_i + \dot{c}(\phi,y_i), \tag{111}
\]

where \(\tilde{v}_i = \frac{1}{a(\phi)}\big[T(y_i)\theta(\tilde{\eta}_i)-b(\theta(\tilde{\eta}_i))\big]\) and \(\dot{c}(\phi,y_i)=\frac{\partial}{\partial\phi}c(\phi,y_i)\). Thus,

\[
\frac{\partial}{\partial\phi}r_i(\phi) = -\frac{\dot{a}(\phi)}{a(\phi)}\tilde{v}_i + \dot{c}(\phi,y_i) - \tfrac{1}{2}\frac{\dot{a}(\phi)}{a(\phi)}\tilde{w}_i\tilde{u}_i^2 + \tfrac{1}{2}\frac{1}{\tilde{w}_i}\frac{\dot{a}(\phi)}{a(\phi)}\tilde{w}_i \tag{112}
\]
\[
= \frac{\dot{a}(\phi)}{a(\phi)}\Big(\tfrac{1}{2}-\tilde{v}_i-\tfrac{1}{2}\tilde{w}_i\tilde{u}_i^2\Big) + \dot{c}(\phi,y_i).
\]
(113)

Summing over \(i\),

\[
\frac{\partial}{\partial\phi}r(\phi) = \sum_i \frac{\dot{a}(\phi)}{a(\phi)}\Big(\tfrac{1}{2}-\tilde{v}_i-\tfrac{1}{2}\tilde{w}_i\tilde{u}_i^2\Big) + \dot{c}(\phi,y_i) \tag{114}
\]
\[
= \frac{\dot{a}(\phi)}{a(\phi)}\Big(\tfrac{n}{2}-\mathbf{1}^T\tilde{\mathbf{v}}-\tfrac{1}{2}\tilde{\mathbf{u}}^T\tilde{W}\tilde{\mathbf{u}}\Big) + \sum_{i=1}^n\dot{c}(\phi,y_i). \tag{115}
\]

Also note that \(\tilde{\mathbf{t}}\) is not a function of \(\phi\), since the dispersion term cancels out in \(\tilde{W}\tilde{\mathbf{u}}\). Finally,

\[
\frac{\partial}{\partial\phi}\log q(\mathbf{y}\,|\,X) = \tfrac{1}{2}\tilde{\mathbf{t}}^T(K+\tilde{W})^{-1}\frac{\partial\tilde{W}}{\partial\phi}(K+\tilde{W})^{-1}\tilde{\mathbf{t}} - \tfrac{1}{2}\mathrm{tr}\Big((K+\tilde{W})^{-1}\frac{\partial\tilde{W}}{\partial\phi}\Big) + \frac{\partial}{\partial\phi}r(\phi) \tag{116}
\]
\[
= \tfrac{1}{2}\mathrm{tr}\Big[\big(\mathbf{z}\mathbf{z}^T-(K+\tilde{W})^{-1}\big)\frac{\partial\tilde{W}}{\partial\phi}\Big] + \frac{\dot{a}(\phi)}{a(\phi)}\Big(\tfrac{n}{2}-\mathbf{1}^T\tilde{\mathbf{v}}-\tfrac{1}{2}\tilde{\mathbf{u}}^T\tilde{W}\tilde{\mathbf{u}}\Big) + \sum_{i=1}^n\dot{c}(\phi,y_i) \tag{117}
\]
\[
= \frac{\dot{a}(\phi)}{a(\phi)}\Big(\tfrac{1}{2}\mathrm{tr}\big[\big(\mathbf{z}\mathbf{z}^T-(K+\tilde{W})^{-1}\big)\tilde{W}\big] + \tfrac{n}{2}-\mathbf{1}^T\tilde{\mathbf{v}}-\tfrac{1}{2}\tilde{\mathbf{u}}^T\tilde{W}\tilde{\mathbf{u}}\Big) + \sum_{i=1}^n\dot{c}(\phi,y_i) \tag{118}
\]
\[
= \frac{\dot{a}(\phi)}{a(\phi)}\Big(\tfrac{1}{2}\mathbf{z}^T\tilde{W}\mathbf{z} - \tfrac{1}{2}\mathrm{tr}\big[(K+\tilde{W})^{-1}\tilde{W}\big] + \tfrac{n}{2}-\mathbf{1}^T\tilde{\mathbf{v}}-\tfrac{1}{2}\tilde{\mathbf{u}}^T\tilde{W}\tilde{\mathbf{u}}\Big) + \sum_{i=1}^n\dot{c}(\phi,y_i). \tag{119}
\]

A.3.3 Derivative wrt Dispersion when \(b_\phi(\theta)\) Depends on \(\phi\)

We next look at the special case where the term \(b_\phi(\theta)\) is also a function of \(\phi\).
We first note that

\[
\frac{\partial}{\partial\phi}\tilde{w}_i = \dot{a}(\phi)\Big\{\ddot{b}_\phi(\theta(\tilde{\eta}_i))\dot{\theta}(\tilde{\eta}_i)^2 - \big[T(y_i)-\dot{b}_\phi(\theta(\tilde{\eta}_i))\big]\ddot{\theta}(\tilde{\eta}_i)\Big\}^{-1} - a(\phi)\,\frac{\dot{\theta}(\tilde{\eta}_i)^2\frac{\partial}{\partial\phi}\ddot{b}_\phi(\theta(\tilde{\eta}_i)) + \ddot{\theta}(\tilde{\eta}_i)\frac{\partial}{\partial\phi}\dot{b}_\phi(\theta(\tilde{\eta}_i))}{\Big\{\ddot{b}_\phi(\theta(\tilde{\eta}_i))\dot{\theta}(\tilde{\eta}_i)^2 - \big[T(y_i)-\dot{b}_\phi(\theta(\tilde{\eta}_i))\big]\ddot{\theta}(\tilde{\eta}_i)\Big\}^2} \tag{120}
\]
\[
= \frac{\dot{a}(\phi)}{a(\phi)}\tilde{w}_i - \frac{1}{a(\phi)}\tilde{w}_i^2\Big(\dot{\theta}(\tilde{\eta}_i)^2\frac{\partial}{\partial\phi}\ddot{b}_\phi(\theta(\tilde{\eta}_i)) + \ddot{\theta}(\tilde{\eta}_i)\frac{\partial}{\partial\phi}\dot{b}_\phi(\theta(\tilde{\eta}_i))\Big), \tag{121}
\]
\[
\frac{\partial}{\partial\phi}\tilde{u}_i = -\frac{\dot{a}(\phi)}{a(\phi)^2}\dot{\theta}(\tilde{\eta}_i)\big[T(y_i)-\dot{b}_\phi(\theta(\tilde{\eta}_i))\big] - \frac{\dot{\theta}(\tilde{\eta}_i)}{a(\phi)}\frac{\partial}{\partial\phi}\dot{b}_\phi(\theta(\tilde{\eta}_i)) = -\frac{\dot{a}(\phi)}{a(\phi)}\tilde{u}_i - \frac{\dot{\theta}(\tilde{\eta}_i)}{a(\phi)}\frac{\partial}{\partial\phi}\dot{b}_\phi(\theta(\tilde{\eta}_i)), \tag{122}
\]
\[
\frac{\partial}{\partial\phi}\big(\tilde{w}_i\tilde{u}_i^2\big) = \frac{\partial\tilde{w}_i}{\partial\phi}\tilde{u}_i^2 + 2\tilde{w}_i\tilde{u}_i\frac{\partial\tilde{u}_i}{\partial\phi}, \qquad \frac{\partial}{\partial\phi}\tilde{t}_i = \frac{\partial}{\partial\phi}\big(\tilde{\eta}_i+\tilde{w}_i\tilde{u}_i\big) = \tilde{w}_i\frac{\partial\tilde{u}_i}{\partial\phi} + \tilde{u}_i\frac{\partial\tilde{w}_i}{\partial\phi}, \tag{123}
\]
\[
\frac{\partial}{\partial\phi}\log p(y_i\,|\,\theta(\tilde{\eta}_i)) = -\frac{\dot{a}(\phi)}{a(\phi)^2}\big[T(y_i)\theta(\tilde{\eta}_i)-b_\phi(\theta(\tilde{\eta}_i))\big] - \frac{1}{a(\phi)}\frac{\partial}{\partial\phi}b_\phi(\theta(\tilde{\eta}_i)) + \dot{c}(\phi,y_i) \tag{124}
\]
\[
= -\frac{\dot{a}(\phi)}{a(\phi)}\tilde{v}_i + \dot{c}(\phi,y_i) - \frac{1}{a(\phi)}\frac{\partial}{\partial\phi}b_\phi(\theta(\tilde{\eta}_i)). \tag{125}
\]

Thus,

\[
\frac{\partial}{\partial\phi}r_i(\phi) = \frac{\partial}{\partial\phi}\log p(y_i\,|\,\theta(\tilde{\eta}_i)) + \tfrac{1}{2}\Big(\frac{\partial\tilde{w}_i}{\partial\phi}\tilde{u}_i^2 + 2\tilde{w}_i\tilde{u}_i\frac{\partial\tilde{u}_i}{\partial\phi}\Big) + \tfrac{1}{2}\frac{1}{\tilde{w}_i}\frac{\partial\tilde{w}_i}{\partial\phi}. \tag{126}
\]

For the first term in (48),

\[
\frac{\partial}{\partial\phi}\Big[\tilde{\mathbf{t}}^T(\tilde{W}+K)^{-1}\tilde{\mathbf{t}}\Big] = \tilde{\mathbf{t}}^T\frac{\partial(\tilde{W}+K)^{-1}}{\partial\phi}\tilde{\mathbf{t}} + 2\tilde{\mathbf{t}}^T(\tilde{W}+K)^{-1}\frac{\partial\tilde{\mathbf{t}}}{\partial\phi} \tag{127}
\]
\[
= -\tilde{\mathbf{t}}^T(K+\tilde{W})^{-1}\frac{\partial\tilde{W}}{\partial\phi}(K+\tilde{W})^{-1}\tilde{\mathbf{t}} + 2\tilde{\mathbf{t}}^T(\tilde{W}+K)^{-1}\frac{\partial\tilde{\mathbf{t}}}{\partial\phi}. \tag{128}
\]

For the second term in (48),

\[
\frac{\partial}{\partial\phi}\log|K+\tilde{W}| = \mathrm{tr}\Big((K+\tilde{W})^{-1}\frac{\partial\tilde{W}}{\partial\phi}\Big). \tag{129}
\]

Finally, we have

\[
\frac{\partial}{\partial\phi}\log q(\mathbf{y}\,|\,X) = \tfrac{1}{2}\tilde{\mathbf{t}}^T(K+\tilde{W})^{-1}\frac{\partial\tilde{W}}{\partial\phi}(K+\tilde{W})^{-1}\tilde{\mathbf{t}} - \tilde{\mathbf{t}}^T(\tilde{W}+K)^{-1}\frac{\partial\tilde{\mathbf{t}}}{\partial\phi} - \tfrac{1}{2}\mathrm{tr}\Big((K+\tilde{W})^{-1}\frac{\partial\tilde{W}}{\partial\phi}\Big) + \sum_i\frac{\partial}{\partial\phi}r_i(\phi). \tag{130}
\]

Appendix B.
Taylor Approximation for Specific GGPMs

This appendix contains the derivatives and target functions used to derive the special cases of the Taylor approximation in Section 5.2.5.

Binomial/Bernoulli – The derivative functions are

\[
u(\eta,y) = N\Big(y-\frac{e^\eta}{1+e^\eta}\Big), \qquad w(\eta,y) = \frac{(1+e^\eta)^2}{N e^\eta}.
\]

Thus, for a given expansion point \(\tilde{\eta}_i\), the target and effective noise are

\[
\tilde{t}_i = \tilde{\eta}_i + \frac{(1+e^{\tilde{\eta}_i})^2}{e^{\tilde{\eta}_i}}\Big(y_i-\frac{e^{\tilde{\eta}_i}}{1+e^{\tilde{\eta}_i}}\Big), \qquad \tilde{w}_i = \frac{(1+e^{\tilde{\eta}_i})^2}{N e^{\tilde{\eta}_i}}.
\]

Poisson – The derivative functions are \(u(\eta,y)=y-e^\eta\) and \(w(\eta,y)=e^{-\eta}\). Thus, given an expansion point \(\tilde{\eta}_i\), the target and effective noise are

\[
\tilde{t}_i = \tilde{\eta}_i + (y_i e^{-\tilde{\eta}_i}-1), \qquad \tilde{w}_i = e^{-\tilde{\eta}_i}. \tag{131}
\]

Gamma – The derivatives of the Gamma likelihood (mean parameterization, with shape hyperparameter \(\nu\)) are

\[
u(\eta,y) = \nu e^{-\eta}(y-e^\eta) = \nu(y e^{-\eta}-1), \qquad w(\eta,y) = \frac{1}{\nu}\big(1+y e^{-\eta}-1\big)^{-1} = \frac{1}{\nu y e^{-\eta}}.
\]

Thus, given an expansion point \(\tilde{\eta}_i\), the target and effective noise are

\[
\tilde{t}_i = \tilde{\eta}_i + \frac{1}{\nu y_i e^{-\tilde{\eta}_i}}\,\nu\big(y_i e^{-\tilde{\eta}_i}-1\big) = \tilde{\eta}_i + 1 - \frac{1}{y_i e^{-\tilde{\eta}_i}}, \qquad \tilde{w}_i = \frac{1}{\nu y_i e^{-\tilde{\eta}_i}}. \tag{132}
\]

Inverse Gaussian – The derivatives of the inverse Gaussian likelihood are

\[
u(\eta,y) = \phi e^{-\eta}\big[y-(2e^{-\eta})^{-1/2}\big], \qquad w(\eta,y) = \phi\Big\{(8e^\eta)^{-1/2} + \big[y-(2e^{-\eta})^{-1/2}\big]e^{-\eta}\Big\}^{-1}.
\]

Using the canonical expansion point \(\tilde{\eta}_i=\log(2y_i^2)\) yields

\[
u(\tilde{\eta}_i,y_i) = 0, \qquad w(\tilde{\eta}_i,y_i) = 4\phi y_i. \tag{133}
\]

Beta – Consider an agnostic choice of the expansion point, \(\tilde{\eta}_i=0\), and hence \(\tilde{\theta}_i=\theta(\tilde{\eta}_i)=\tfrac{1}{2}\). We also have

\[
\dot{\theta}(\tilde{\eta}_i) = \frac{e^{\tilde{\eta}_i}}{(1+e^{\tilde{\eta}_i})^2} = \tfrac{1}{4}, \qquad \ddot{\theta}(\tilde{\eta}_i) = \frac{e^{\tilde{\eta}_i}(e^{\tilde{\eta}_i}-1)}{(1+e^{\tilde{\eta}_i})^3} = 0. \tag{134}
\]

Looking at the first and second derivatives of \(b(\theta)\) at \(\tilde{\theta}_i\), we have

\[
\dot{b}(\tilde{\theta}_i) = \psi_0\Big(\frac{\tilde{\theta}_i}{\phi}\Big) - \psi_0\Big(\frac{1-\tilde{\theta}_i}{\phi}\Big) = \psi_0\Big(\frac{1}{2\phi}\Big) - \psi_0\Big(\frac{1}{2\phi}\Big) = 0, \tag{135}
\]
\[
\ddot{b}(\tilde{\theta}_i) = \frac{1}{\phi}\psi_1\Big(\frac{\tilde{\theta}_i}{\phi}\Big) + \frac{1}{\phi}\psi_1\Big(\frac{1-\tilde{\theta}_i}{\phi}\Big) = \frac{2}{\phi}\psi_1\Big(\frac{1}{2\phi}\Big).
\]
(136)

Using the above results, we can now calculate the derivative functions at \(\tilde{\eta}_i=0\),

\[
\tilde{u}_i = u(\tilde{\eta}_i,y_i) = \frac{1}{\phi}\dot{\theta}(\tilde{\eta}_i)\big[T(y_i)-\dot{b}(\tilde{\theta}_i)\big] = \frac{1}{4\phi}T(y_i) = \frac{1}{4\phi}\log\frac{y_i}{1-y_i}, \tag{137}
\]
\[
\tilde{w}_i = w(\tilde{\eta}_i,y_i) = \frac{\phi}{\ddot{b}(\tilde{\theta}_i)\dot{\theta}(\tilde{\eta}_i)^2 - 0} = \frac{\phi}{\frac{2}{\phi}\psi_1\big(\frac{1}{2\phi}\big)\big(\frac{1}{4}\big)^2} = \frac{8\phi^2}{\psi_1\big(\frac{1}{2\phi}\big)}, \tag{138}
\]

which yields the targets,

\[
\tilde{t}_i = \tilde{\eta}_i + \tilde{w}_i\tilde{u}_i = \frac{2\phi}{\psi_1\big(\frac{1}{2\phi}\big)}\log\frac{y_i}{1-y_i}. \tag{139}
\]

Acknowledgements

The authors thank C. E. Rasmussen and C. K. I. Williams for the GPML code (Rasmussen and Nickisch, 2010). This work was supported by City University of Hong Kong (internal grant 7200187), and by the Research Grants Council of the Hong Kong Special Administrative Region, China (CityU 110610 and CityU 123212).

References

J. H. Albert. Computational methods using a Bayesian hierarchical generalized linear model. Journal of the American Statistical Association, 83(404):1037–1044, Dec. 1988.

Edward J. Bedrick, Ronald Christensen, and Wesley Johnson. A new perspective on priors for generalized linear models. Journal of the American Statistical Association, 91(436):1450–1460, December 1996.

Clemens Biller. Adaptive Bayesian regression splines in semiparametric generalized linear models. Journal of Computational and Graphical Statistics, 9(1):122–140, 2000.

Liefeng Bo and Cristian Sminchisescu. Twin Gaussian processes for structured prediction. International Journal of Computer Vision, 87(1):28–52, 2010.

Gavin C. Cawley, Gareth J. Janacek, and Nicola L. C. Talbot. Generalised kernel machines. In International Joint Conference on Neural Networks, pages 1720–1725. IEEE, 2007.

A. B. Chan. Supplemental material of "On approximate inference for generalized Gaussian process models". Technical report, City University of Hong Kong, July 2013.

Antoni B. Chan and Daxiang Dong. Generalized Gaussian process models. In IEEE Conf.
Computer Vision and Pattern Recognition, pages 2681–2688, 2011.

Antoni B. Chan and Nuno Vasconcelos. Bayesian Poisson regression for crowd counting. In IEEE Intl. Conf. Computer Vision, pages 545–551, 2009.

Antoni B. Chan, Z. S. J. Liang, and Nuno Vasconcelos. Privacy preserving crowd monitoring: Counting people without people models or tracking. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–7, 2008.

Jixu Chen, Minyoung Kim, Yu Wang, and Qiang Ji. Switching Gaussian process dynamic models for simultaneous composite motion tracking and recognition. In IEEE Conf. Computer Vision and Pattern Recognition, pages 2655–2662, Los Alamitos, CA, USA, 2009.

Wei Chu and Zoubin Ghahramani. Gaussian processes for ordinal regression. Journal of Machine Learning Research, 6:1019–1041, 2005.

R. W. Conway and W. L. Maxwell. A queuing model with state dependent service rates. Journal of Industrial Engineering, 12:132–136, 1962.

T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. IEEE TPAMI, 23(6):681–685, 2001.

Sourish Das and Dipak K. Dey. On Bayesian analysis of generalized linear models: A new perspective. Technical report, Statistical and Applied Mathematical Sciences Institute, 2007.

Peter J. Diggle, J. A. Tawn, and R. A. Moyeed. Model-based geostatistics. Journal of the Royal Statistical Society: Series C (Applied Statistics), 47(3):299–350, 1998.

Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. John Wiley and Sons, 2001.

D. Ellis, E. Sommerlade, and I. Reid. Modelling pedestrian trajectory patterns with Gaussian processes. In International Conference Computer Vision Workshops (ICCV Workshops), pages 1229–1234, 2009.

Ludwig Fahrmeir, Gerhard Tutz, and Wolfgang Hennevogl.
Multivariate Statistical Modelling Based on Generalized Linear Models, volume 2. Springer, New York, 1994.

Martin Fergie and Aphrodite Galata. Local Gaussian processes for pose recognition from noisy inputs. In Proceedings of the British Machine Vision Conference, pages 98.1–98.11. BMVA Press, 2010. ISBN 1-901725-38-3.

Mark Gibbs and David J. C. MacKay. Variational Gaussian process classifiers. IEEE Transactions on Neural Networks, 11:1458–1464, 2000.

Mark Girolami and Simon Rogers. Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Computation, 18:1790–1817, 2006.

Seth D. Guikema and Jeremy P. Goffelt. A flexible count data regression model for risk analysis. Risk Analysis, 28(1):213–223, 2008.

Dong Han, Liefeng Bo, and Cristian Sminchisescu. Selection and context for action recognition. In International Conference on Computer Vision, pages 1933–1940, Sep. 2009.

Lauren A. Hannah, David M. Blei, and Warren B. Powell. Dirichlet process mixtures of generalized linear models. Journal of Machine Learning Research, 12:1923–1953, 2011.

Ville Heikkinen, Reiner Lenz, Tuija Jetsu, Jussi Parkkinen, Markku Hauta-Kasari, and Timo Jääskeläinen. Evaluation and unification of some methods for estimating reflectance spectra from RGB images. J. Opt. Soc. Am. A, 25(10):2444–2458, Oct. 2008.

David C. Howell. Fundamental Statistics for the Behavioral Sciences. PSY 200 (300) Quantitative Methods in Psychology Series. Wadsworth Cengage Learning, 2010. ISBN 9780495811251.

Pasi Jylänki, Jarno Vanhatalo, and Aki Vehtari. Robust Gaussian process regression with a Student-t likelihood. J. Machine Learning Research, 12:3227–3257, 2011.

Ashish Kapoor, Kristen Grauman, Raquel Urtasun, and Trevor Darrell. Gaussian processes for object categorization. International Journal of Computer Vision, 88:169–188, 2010.

Hyun-Chul Kim and Zoubin Ghahramani.
Bayesian Gaussian process classification with the EM-EP algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):1948–1959, 2006.

Genshiro Kitagawa. Non-Gaussian state-space modeling of nonstationary time series. Journal of the American Statistical Association, 82(400):1032–1041, 1987.

Malte Kuss. Gaussian Process Models for Robust Regression, Classification, and Reinforcement Learning. PhD thesis, Technische Universität Darmstadt, 2006.

Malte Kuss and Carl Edward Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679–1704, 2005.

Chen Change Loy, Tao Xiang, and Shaogang Gong. Modelling multi-object activity by Gaussian processes. In British Machine Vision Conference, 2009.

Manfred Opper and Cédric Archambeau. The variational Gaussian approximation revisited. Neural Computation, 21(3):786–792, March 2009.

P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman & Hall/CRC, 1989. ISBN 9780412317606.

Thomas P. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Massachusetts Institute of Technology, 2001.

Radford M. Neal. Monte Carlo implementation of Gaussian process models for Bayesian regression and classification. Technical report, Dept. of Statistics, University of Toronto, 1997. URL arXiv:physics/9701026v2.

Hannes Nickisch and Carl Edward Rasmussen. Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9:2035–2078, 2008.

Hannes Nickisch and Matthias W. Seeger. Convex variational Bayesian inference for large scale generalized linear models. In International Conference on Machine Learning, pages 761–768, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-516-1.

Basilio Noris, Karim Benmachiche, and A. Billard. Calibration-free eye gaze direction detection with Gaussian processes.
In International Conference on Computer Vision Theory and Applications, 2008.

Christopher J. Paciorek and Mark J. Schervish. Nonstationary covariance functions for Gaussian process regression. In Advances in Neural Information Processing Systems, pages 273–280, 2004.

Christian Plagemann, Kristian Kersting, Patrick Pfaff, and Wolfram Burgard. Gaussian beam processes: A nonparametric Bayesian measurement model for range finders. In Proc. of Robotics: Science and Systems (RSS), 2007.

Leonid Raskin, Michael Rudzsky, and Ehud Rivlin. Tracking and classifying of human motions with Gaussian process annealed particle filter. In Asian Conference on Computer Vision, pages 442–451, Berlin, Heidelberg, 2007. Springer-Verlag. ISBN 3-540-76385-6, 978-3-540-76385-7.

Carl Edward Rasmussen and Hannes Nickisch. Gaussian processes for machine learning (GPML) toolbox. Journal of Machine Learning Research, 11:3011–3015, Nov. 2010.

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

Terrance Savitsky and Marina Vannucci. Spiked Dirichlet process priors for Gaussian process models. Journal of Probability and Statistics, 2010, 2010.

Terrance Savitsky, Marina Vannucci, and Naijun Sha. Variable selection for nonparametric Gaussian process priors: Models and computational strategies. Statistical Science, 26(1):130–149, 2011.

Matthias Seeger. Gaussian processes for machine learning. International Journal of Neural Systems, 14(2):69–106, 2004.

Matthias Seeger, Sebastian Gerwinn, and Matthias Bethge. Bayesian inference for sparse generalized linear models. In Machine Learning: ECML 2007, pages 298–309, 2007.

Jian Qing Shi and Taeryon Choi. Gaussian Process Regression Analysis for Functional Data. CRC Press, Taylor & Francis Group, 2011.
Galit Shmueli, Thomas P. Minka, Joseph B. Kadane, Sharad Borle, and Peter Boatwright. A useful distribution for fitting discrete data: revival of the Conway-Maxwell-Poisson distribution. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(1):127–142, 2005.

F. Sinz, J. Quiñonero Candela, G. H. Bakir, C. E. Rasmussen, and M. Franz. Learning depth from stereo. In Pattern Recognition, Proc. 26th DAGM Symposium, pages 245–252. Springer, 2004.

Edward Snelson, Carl Edward Rasmussen, and Zoubin Ghahramani. Warped Gaussian processes. In Advances in Neural Information Processing Systems, pages 337–344. MIT Press, 2004.

Yee Whye Teh, Matthias Seeger, and M. I. Jordan. Semiparametric latent factor models. In Artificial Intelligence and Statistics, 2005.

Volker Tresp. The generalized Bayesian committee machine. In ACM International Conference on Knowledge Discovery and Data Mining, pages 130–139, New York, NY, USA, 2000. ACM.

Raquel Urtasun and Trevor Darrell. Sparse probabilistic regression for activity-independent human pose inference. In IEEE Conf. Computer Vision and Pattern Recognition, pages 1–8, 2008.

Raquel Urtasun, David J. Fleet, Aaron Hertzmann, and Pascal Fua. Priors for people tracking from small training sets. In IEEE International Conference on Computer Vision, pages 403–410, 2005.

Jarno Vanhatalo and Aki Vehtari. Sparse log Gaussian processes via MCMC for spatial epidemiology. In Workshop on Gaussian Processes in Practice, 2007.

Jarno Vanhatalo, Pasi Jylänki, and Aki Vehtari. Gaussian process regression with Student-t likelihood. In Neural Information Processing Systems, 2009.

Jarno Vanhatalo, Ville Pietiläinen, and Aki Vehtari. Approximate inference for disease mapping with sparse Gaussian processes. Statistics in Medicine, 29(15):1580–1607, 2010.

Jarno Vanhatalo, Jaakko Riihimäki, Jouni Hartikainen, and Aki Vehtari.
Bayesian modeling with Gaussian processes using the MATLAB toolbox GPstuff. Submitted, 2011.

Jack M. Wang, David J. Fleet, and Aaron Hertzmann. Gaussian process dynamical models for human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):283–298, Feb. 2008.

Christopher K. I. Williams and David Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, Dec. 1998.

Oliver Williams. A switched Gaussian process for estimating disparity and segmentation in binocular stereo. In Advances in Neural Information Processing Systems, 2006.

Yu Zhang and Dit-Yan Yeung. Multi-task warped Gaussian process for personalized age estimation. In Conference on Computer Vision and Pattern Recognition, pages 2622–2629, 2010.

Zhihua Zhang, G. Dai, D. Wang, and M. I. Jordan. Bayesian generalized kernel models. In Conference on Artificial Intelligence and Statistics, volume 9, pages 972–979, 2010.

Xu Zhao, Huazhong Ning, Yuncai Liu, and T. Huang. Discriminative estimation of 3D human pose using Gaussian processes. In Intl. Conf. Pattern Recognition, pages 1–4, Dec. 2008.

Jianke Zhu, Steven C. H. Hoi, and Michael R. Lyu. Nonrigid shape recovery by Gaussian process regression. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1319–1326, Jun. 2009.