Published as a conference paper at ICLR 2016

THE VARIATIONAL GAUSSIAN PROCESS

Dustin Tran, Harvard University, dtran@g.harvard.edu
Rajesh Ranganath, Princeton University, rajeshr@cs.princeton.edu
David M. Blei, Columbia University, david.blei@columbia.edu

ABSTRACT

Variational inference is a powerful tool for approximate inference, and it has been recently applied for representation learning with deep generative models. We develop the variational Gaussian process (VGP), a Bayesian nonparametric variational family, which adapts its shape to match complex posterior distributions. The VGP generates approximate posterior samples by generating latent inputs and warping them through random non-linear mappings; the distribution over random mappings is learned during inference, enabling the transformed outputs to adapt to varying complexity. We prove a universal approximation theorem for the VGP, demonstrating its representative power for learning any model. For inference we present a variational objective inspired by auto-encoders and perform black box inference over a wide class of models. The VGP achieves new state-of-the-art results for unsupervised learning, inferring models such as the deep latent Gaussian model and the recently proposed DRAW.

1 INTRODUCTION

Variational inference is a powerful tool for approximate posterior inference. The idea is to posit a family of distributions over the latent variables and then find the member of that family closest to the posterior. Originally developed in the 1990s (Hinton & Van Camp, 1993; Waterhouse et al., 1996; Jordan et al., 1999), variational inference has enjoyed renewed interest around developing scalable optimization for large datasets (Hoffman et al., 2013), deriving generic strategies for easily fitting many models (Ranganath et al., 2014), and applying neural networks as a flexible parametric family of approximations (Kingma & Welling, 2014; Rezende et al., 2014). This research has been particularly successful for computing with deep Bayesian models (Neal, 1990; Ranganath et al., 2015a), which require inference of a complex posterior distribution (Hinton et al., 2006).

Classical variational inference typically uses the mean-field family, where each latent variable is independent and governed by its own variational distribution. While convenient, the strong independence limits learning deep representations of data. Newer research aims toward richer families that allow dependencies among the latent variables. One way to introduce dependence is to consider the variational family itself as a model of the latent variables (Lawrence, 2000; Ranganath et al., 2015b). These variational models naturally extend to Bayesian hierarchies, which retain the mean-field "likelihood" but introduce dependence through variational latent variables.

In this paper we develop a powerful new variational model: the variational Gaussian process (VGP). The VGP is a Bayesian nonparametric variational model; its complexity grows efficiently and towards any distribution, adapting to the inference problem at hand. We highlight three main contributions of this work:

1. We prove a universal approximation theorem: under certain conditions, the VGP can capture any continuous posterior distribution; it is a variational family that can be specified to be as expressive as needed.

2. We derive an efficient stochastic optimization algorithm for variational inference with the VGP. Our algorithm can be used in a wide class of models. Inference with the VGP is a black box variational method (Ranganath et al., 2014).

3.
We study the VGP on standard benchmarks for unsupervised learning, applying it to perform inference in deep latent Gaussian models (Rezende et al., 2014) and DRAW (Gregor et al., 2015), a latent attention model. For both models, we report the best results to date.

Technical summary. Generative models hypothesize a distribution of observations x and latent variables z, p(x, z). Variational inference posits a family of the latent variables q(z; λ) and tries to find the variational parameters λ that are closest in KL divergence to the posterior. When we use a variational model, q(z; λ) itself might contain variational latent variables; these are implicitly marginalized out in the variational family (Ranganath et al., 2015b).

The VGP is a flexible variational model. It draws inputs from a simple distribution, warps those inputs through a non-linear mapping, and then uses the output of the mapping to govern the distribution of the latent variables z. The non-linear mapping is itself a random variable, constructed from a Gaussian process. The VGP is inspired by ideas from both the Gaussian process latent variable model (Lawrence, 2005) and Gaussian process regression (Rasmussen & Williams, 2006).

The variational parameters of the VGP are the kernel parameters for the Gaussian process and a set of variational data, which are input-output pairs. The variational data are crucial: they anchor the non-linear mappings at given inputs and outputs. It is through these parameters that the VGP learns complex representations. Finally, given data x, we use stochastic optimization to find the variational parameters that minimize the KL divergence to the model posterior.

2 VARIATIONAL GAUSSIAN PROCESS

Variational models introduce latent variables to the variational family, providing a rich construction for posterior approximation (Ranganath et al., 2015b).
Here we introduce the variational Gaussian process (VGP), a Bayesian nonparametric variational model that is based on the Gaussian process. The Gaussian process (GP) provides a class of latent variables that lets us capture downstream distributions with varying complexity. We first review variational models and Gaussian processes. We then outline the mechanics of the VGP and prove that it is a universal approximator.

2.1 VARIATIONAL MODELS

Let p(z | x) denote a posterior distribution over d latent variables z = (z_1, ..., z_d) conditioned on a data set x. For a family of distributions q(z; λ) parameterized by λ, variational inference seeks to minimize the divergence KL(q(z; λ) ‖ p(z | x)). This is equivalent to maximizing the evidence lower bound (ELBO) (Wainwright & Jordan, 2008). The ELBO can be written as a sum of the expected log likelihood of the data and the KL divergence between the variational distribution and the prior,

    L = E_{q(z;λ)}[log p(x | z)] − KL(q(z; λ) ‖ p(z)).   (1)

Traditionally, variational inference considers a tractable family of distributions with analytic forms for its density. A common specification is a fully factorized distribution ∏_i q(z_i; λ_i), also known as the mean-field family. While mean-field families lead to efficient computation, they limit the expressiveness of the approximation.

The variational family of distributions can be interpreted as a model of the latent variables z, and it can be made richer by introducing new latent variables. Hierarchical variational models consider distributions specified by a variational prior of the mean-field parameters q(λ; θ) and a factorized "likelihood" ∏_i q(z_i | λ_i).
This specifies the variational model,

    q(z; θ) = ∫ [ ∏_i q(z_i | λ_i) ] q(λ; θ) dλ,   (2)

which is governed by prior hyperparameters θ. Hierarchical variational models are richer than classical variational families; their expressiveness is determined by the complexity of the prior q(λ). Many expressive variational approximations can be viewed under this construct (Saul & Jordan, 1996; Jaakkola & Jordan, 1998; Rezende & Mohamed, 2015; Tran et al., 2015).

Figure 1: (a) Graphical model of the variational Gaussian process. The VGP generates samples of latent variables z by evaluating random non-linear mappings of latent inputs ξ, and then drawing mean-field samples parameterized by the mapping. These latent variables aim to follow the posterior distribution for a generative model (b), conditioned on data x.

2.2 GAUSSIAN PROCESSES

We now review the Gaussian process (GP) (Rasmussen & Williams, 2006). Consider a data set of m source-target pairs D = {(s_n, t_n)}_{n=1}^m, where each source s_n has c covariates paired with a multidimensional target t_n ∈ R^d. We aim to learn a function over all source-target pairs, t_n = f(s_n), where f : R^c → R^d is unknown. Let the function f decouple as f = (f_1, ..., f_d), where each f_i : R^c → R. GP regression estimates the functional form of f by placing a prior,

    p(f) = ∏_{i=1}^d GP(f_i; 0, K_ss),

where K_ss denotes a covariance function k(s, s′) evaluated over pairs of inputs s, s′ ∈ R^c. In this paper, we consider automatic relevance determination (ARD) kernels

    k(s, s′) = σ²_ARD exp( −(1/2) ∑_{j=1}^c ω_j (s_j − s′_j)² ),   (3)

with parameters θ = (σ²_ARD, ω_1, ..., ω_c). The weights ω_j tune the importance of each dimension.
They can be driven to zero during inference, leading to automatic dimensionality reduction.

Given data D, the conditional distribution of the GP forms a distribution over mappings which interpolate between input-output pairs,

    p(f | D) = ∏_{i=1}^d GP(f_i; K_{ξs} K_ss^{−1} t_i, K_{ξξ} − K_{ξs} K_ss^{−1} K_{ξs}^⊤).   (4)

Here, K_{ξs} denotes the covariance function k(ξ, s) for an input ξ and over all data inputs s_n, and t_i represents the i-th output dimension.

2.3 VARIATIONAL GAUSSIAN PROCESSES

We describe the variational Gaussian process (VGP), a Bayesian nonparametric variational model that admits arbitrary structures to match posterior distributions. The VGP generates z by generating latent inputs, warping them with random non-linear mappings, and using the warped inputs as parameters to a mean-field distribution. The random mappings are drawn conditional on "variational data," which are variational parameters. We will show that the VGP enables samples from the mean-field to follow arbitrarily complex posteriors.

The VGP specifies the following generative process for posterior latent variables z:

1. Draw latent input ξ ∈ R^c: ξ ∼ N(0, I).
2. Draw non-linear mapping f : R^c → R^d conditioned on D: f ∼ ∏_{i=1}^d GP(0, K_{ξξ}) | D.
3. Draw approximate posterior samples z ∈ supp(p): z = (z_1, ..., z_d) ∼ ∏_{i=1}^d q(f_i(ξ)).

Figure 1 displays a graphical model for the VGP. Here, D = {(s_n, t_n)}_{n=1}^m represents variational data, comprising input-output pairs that are parameters to the variational distribution. Marginalizing over all latent inputs and non-linear mappings, the VGP is

    q_VGP(z; θ, D) = ∫∫ [ ∏_{i=1}^d q(z_i | f_i(ξ)) ] [ ∏_{i=1}^d GP(f_i; 0, K_{ξξ}) | D ] N(ξ; 0, I) df dξ.   (5)

The VGP is parameterized by kernel hyperparameters θ and variational data.
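The three-step generative process can be sketched directly from Eqs. 3–5. In the sketch below, the sizes (c, d, m), the randomly drawn variational data, and the unit-variance Gaussian mean-field are illustrative assumptions, not choices made in the paper:

```python
import numpy as np

def ard_kernel(S1, S2, sigma2, omega):
    """ARD kernel of Eq. 3: k(s, s') = sigma2 * exp(-0.5 * sum_j omega_j (s_j - s'_j)^2)."""
    diff = S1[:, None, :] - S2[None, :, :]           # (n1, n2, c)
    return sigma2 * np.exp(-0.5 * np.sum(omega * diff**2, axis=-1))

rng = np.random.default_rng(0)
c, d, m = 2, 3, 5                                    # latent-input dim, latent-var dim, # variational data
S = rng.normal(size=(m, c))                          # variational inputs s_n (toy values)
T = rng.normal(size=(m, d))                          # variational outputs t_n (toy values)
sigma2, omega = 1.0, np.ones(c)                      # kernel hyperparameters theta

# Step 1: draw latent input xi ~ N(0, I).
xi = rng.normal(size=(1, c))

# Step 2: draw f(xi) from the GP conditioned on D = {(s_n, t_n)} (Eq. 4),
# independently for each of the d output dimensions.
K_ss = ard_kernel(S, S, sigma2, omega) + 1e-8 * np.eye(m)
K_xs = ard_kernel(xi, S, sigma2, omega)              # (1, m)
K_xx = ard_kernel(xi, xi, sigma2, omega)             # (1, 1)
A = K_xs @ np.linalg.inv(K_ss)
mean = (A @ T).ravel()                               # (d,) conditional means, one per output dim
var = (K_xx - A @ K_xs.T).item()                     # conditional variance, shared across dims
f_xi = mean + np.sqrt(max(var, 0.0)) * rng.normal(size=d)

# Step 3: use f(xi) as mean-field parameters; here an (assumed) Gaussian
# mean-field q(z_i | f_i(xi)) = N(z_i; f_i(xi), 1).
z = f_xi + rng.normal(size=d)
print(z.shape)  # (3,)
```

Note how step 2 evaluates all d GP draws at the same input ξ; this shared input is what later induces correlation between the outputs.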
As a variational model, the VGP forms an infinite ensemble of mean-field distributions. A mean-field distribution is given in the first term of the integrand above. It is conditional on a fixed function f(·) and input ξ; the d outputs f_i(ξ) = λ_i are the mean-field's parameters. The VGP is a form of a hierarchical variational model (Eq. 2) (Ranganath et al., 2015b). It places a continuous Bayesian nonparametric prior over mean-field parameters.

Unlike the mean-field, the VGP can capture correlation between the latent variables. The reason is that it evaluates the d independent GP draws at the same latent input ξ. This induces correlation between their outputs, the mean-field parameters, and thus also correlation between the latent variables. Further, the VGP is flexible. The complex non-linear mappings drawn from the GP allow it to capture complex discrete and continuous posteriors.

We emphasize that the VGP needs variational data. Unlike typical GP regression, there are no observed data available to learn a distribution over non-linear mappings of the latent variables z. Thus the "data" are variational parameters that appear in the conditional distribution of f in Eq. 4. They anchor the random non-linear mappings at certain input-output pairs. When optimizing the VGP, the learned variational data enable it to find a distribution of the latent variables that closely follows the posterior.

2.4 UNIVERSAL APPROXIMATION THEOREM

To understand the capacity of the VGP for representing complex posterior distributions, we analyze the role of the Gaussian process. For simplicity, suppose the latent variables z are real-valued, and the VGP treats the output of the function draws from the GP as posterior samples.
Consider the optimal function f*, which is the transformation such that when we draw ξ ∼ N(0, I) and calculate z = f*(ξ), the resulting distribution of z is the posterior distribution. An explicit construction of f* exists if the dimension of the latent input ξ is equal to the number of latent variables. Let P^{−1} denote the inverse posterior CDF and Φ the standard normal CDF. Using techniques common in the copula literature (Nelsen, 2006), the optimal function is

    f*(ξ) = P^{−1}(Φ(ξ_1), ..., Φ(ξ_d)).

Imagine generating samples z using this function. For latent input ξ ∼ N(0, I), the standard normal CDF Φ applies the probability integral transform: it squashes ξ_i such that its output u_i = Φ(ξ_i) is uniformly distributed on [0, 1]. The inverse posterior CDF then transforms the uniform random variables P^{−1}(u_1, ..., u_d) = z to follow the posterior. The function produces exact posterior samples.

In the VGP, the random function interpolates the values in the variational data, which are optimized to minimize the KL divergence. Thus, during inference, the distribution of the GP learns to concentrate around this optimal function. This perspective provides intuition behind the following result.

Theorem 1 (Universal approximation). Let q(z; θ, D) denote the variational Gaussian process. Consider a posterior distribution p(z | x) with a finite number of latent variables and continuous quantile function (inverse CDF). There exists a sequence of parameters (θ_k, D_k) such that

    lim_{k→∞} KL(q(z; θ_k, D_k) ‖ p(z | x)) = 0.

Figure 2: Sequence of domain mappings during inference, from variational latent variable space R to posterior latent variable space Q to data space P. We perform variational inference in the posterior space and auxiliary inference in the variational space.

See Appendix B for a proof.
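The construction f*(ξ) = P^{−1}(Φ(ξ)) can be verified numerically in one dimension. Below, an Exponential(1) distribution stands in for the posterior (an illustrative choice, made only because its inverse CDF is available in closed form):

```python
import math
import numpy as np

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def P_inv(u, rate=1.0):
    """Inverse CDF of the (assumed) Exponential(rate) target."""
    return -math.log1p(-u) / rate

rng = np.random.default_rng(1)
xi = rng.normal(size=100_000)                        # xi ~ N(0, 1)
u = np.array([Phi(x) for x in xi])                   # probability integral transform: u ~ Uniform[0, 1]
z = np.array([P_inv(ui) for ui in u])                # z follows the target exactly

# The transformed samples match the Exponential(1) moments (mean 1, variance 1).
print(round(z.mean(), 1), round(z.var(), 1))  # 1.0 1.0
```

The same two-step squash-then-invert logic extends coordinate-wise to the d-dimensional construction in the text.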
Theorem 1 states that any posterior distribution with strictly positive density can be represented by a VGP. Thus the VGP is a flexible model for learning posterior distributions.

3 BLACK BOX INFERENCE

We derive an algorithm for black box inference over a wide class of generative models.

3.1 VARIATIONAL OBJECTIVE

The original ELBO (Eq. 1) is analytically intractable due to the log density, log q_VGP(z) (Eq. 5). To address this, we present a tractable variational objective inspired by auto-encoders (Kingma & Welling, 2014). A tractable lower bound to the model evidence log p(x) can be derived by subtracting an expected KL divergence term from the ELBO,

    log p(x) ≥ E_{q_VGP}[log p(x | z)] − KL(q_VGP(z) ‖ p(z)) − E_{q_VGP}[ KL(q(ξ, f | z) ‖ r(ξ, f | z)) ],

where r(ξ, f | z) is an auxiliary model (we describe r in the next subsection). Various versions of this objective have been considered in the literature (Jaakkola & Jordan, 1998; Agakov & Barber, 2004), and it has been recently revisited by Salimans et al. (2015) and Ranganath et al. (2015b). We perform variational inference in the posterior latent variable space, minimizing KL(q ‖ p) to learn the variational model; for this to occur we perform auxiliary inference in the variational latent variable space, minimizing KL(q ‖ r) to learn an auxiliary model. See Figure 2.

Unlike previous approaches, we rewrite this variational objective to connect to auto-encoders:

    L̃(θ, φ) = E_{q_VGP}[log p(x | z)] − E_{q_VGP}[ KL(q(z | f(ξ)) ‖ p(z)) ]
               − E_{q_VGP}[ KL(q(f | ξ; θ) ‖ r(f | ξ, z; φ)) + log q(ξ) − log r(ξ | z) ],   (6)

where the KL divergences are now taken over tractable distributions (see Appendix C).
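A single-sample Monte Carlo estimate of the three groups of terms in this objective can be sketched as follows. All densities here are toy Gaussian stand-ins, and the linear maps 0.5ξ, 0.4z, and 0.2z are arbitrary illustrative parameterizations, not the paper's models:

```python
import numpy as np

def gauss_kl(mu_q, var_q, mu_p, var_p):
    """Analytic KL(N(mu_q, var_q) || N(mu_p, var_p)) for factorized Gaussians."""
    return 0.5 * np.sum(np.log(var_p / var_q) + (var_q + (mu_q - mu_p)**2) / var_p - 1.0)

def gauss_logpdf(x, mu, var):
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu)**2 / var)

rng = np.random.default_rng(2)
d, c = 3, 3
x = rng.normal(size=d)                               # one (toy) data point

# One Monte Carlo sample through the model: xi -> f(xi) -> z.
xi = rng.normal(size=c)
mu_f, var_f = 0.5 * xi, np.full(c, 0.3)              # stand-in for the GP conditional q(f | xi; theta)
f = mu_f + np.sqrt(var_f) * rng.normal(size=c)
z = f + rng.normal(size=d)                           # mean-field q(z | f(xi)) = N(f, 1)

# The three groups of terms: reconstruction, model-space KL, variational-space KL.
log_lik = gauss_logpdf(x, z, np.ones(d))                              # log p(x | z), toy likelihood
kl_model = gauss_kl(f, np.ones(d), np.zeros(d), np.ones(d))           # KL(q(z | f) || p(z))
mu_r, var_r = 0.4 * z, np.full(c, 0.5)               # stand-in auxiliary model r(f | xi, z; phi)
kl_aux = (gauss_kl(mu_f, var_f, mu_r, var_r)
          + gauss_logpdf(xi, np.zeros(c), np.ones(c))                 # log q(xi)
          - gauss_logpdf(xi, 0.2 * z, np.ones(c)))                    # -log r(xi | z), stand-in
elbo_hat = log_lik - kl_model - kl_aux
print(np.isfinite(elbo_hat))  # True
```

Averaging this estimate over many (ξ, f, z) samples approximates the bound; the stochastic optimization of Section 3.3 instead uses single-sample gradients.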
In auto-encoder parlance, we maximize the expected negative reconstruction error, regularized by two terms: an expected divergence between the variational model and the original model's prior, and an expected divergence between the auxiliary model and the variational model's prior. This is simply a nested instantiation of the variational auto-encoder bound (Kingma & Welling, 2014): a divergence between the inference model and a prior is taken as a regularizer on both the posterior and variational spaces. This interpretation justifies the previously proposed bound for variational models; as we shall see, it also enables lower variance gradients during stochastic optimization.

3.2 AUTO-ENCODING VARIATIONAL MODELS

An inference network provides a flexible parameterization of approximating distributions, as used in Helmholtz machines (Hinton & Zemel, 1994), deep Boltzmann machines (Salakhutdinov & Larochelle, 2010), and variational auto-encoders (Kingma & Welling, 2014; Rezende et al., 2014). It replaces local variational parameters with global parameters coming from a neural network. For latent variables z_n (which correspond to a data point x_n), an inference network specifies a neural network which takes x_n as input and its local variational parameters λ_n as output. This amortizes inference by only defining a set of global parameters.

To auto-encode the VGP we specify inference networks to parameterize both the variational and auxiliary models:

    x_n ↦ q(z_n | x_n; θ_n),    x_n, z_n ↦ r(ξ_n, f_n | x_n, z_n; φ_n).

Formally, the output of these mappings are the parameters θ_n and φ_n respectively. We write the output as distributions above to emphasize that these mappings are a (global) parameterization of the variational model q and auxiliary model r. The local variational parameters θ_n for q are the variational data D_n.
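The amortization can be sketched with a deliberately minimal "inference network": a single tanh layer mapping a data point x_n to its local variational data D_n. The layer sizes and architecture here are illustrative assumptions, not those used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(3)
n_pixels, m, c, d = 784, 5, 2, 3

# Minimal (assumed) inference network: one linear layer plus tanh, mapping a
# data point x_n to m input-output pairs D_n = {(s_k, t_k)}.
W = rng.normal(scale=0.01, size=(n_pixels, m * (c + d)))
b = np.zeros(m * (c + d))

def infer_variational_data(x_n):
    out = np.tanh(x_n @ W + b).reshape(m, c + d)
    return out[:, :c], out[:, c:]                    # S_n: (m, c) inputs, T_n: (m, d) outputs

x_n = rng.random(n_pixels)                           # one binarized-MNIST-sized input
S_n, T_n = infer_variational_data(x_n)
print(S_n.shape, T_n.shape)  # (5, 2) (5, 3)
```

The global parameters (W, b) are shared across all data points, which is what makes the per-datapoint variational data cheap to produce.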
The auxiliary model r is specified as a fully factorized Gaussian with local variational parameters φ_n = (μ_n ∈ R^{c+d}, σ²_n ∈ R^{c+d}).¹

3.3 STOCHASTIC OPTIMIZATION

We maximize the variational objective L̃(θ, φ) over both θ and φ, where θ newly denotes both the kernel hyperparameters and the inference network's parameters for the VGP, and φ denotes the inference network's parameters for the auxiliary model. Following black box methods, we write the gradient as an expectation and apply stochastic approximations (Robbins & Monro, 1951), sampling from the variational model and evaluating noisy gradients.

First, we reduce the variance of the stochastic gradients by analytically deriving any tractable expectations. The KL divergence between q(z | f(ξ)) and p(z) is commonly used to reduce variance in traditional variational auto-encoders: it is analytic for deep generative models such as the deep latent Gaussian model (Rezende et al., 2014) and the deep recurrent attentive writer (Gregor et al., 2015). The KL divergence between r(f | ξ, z) and q(f | ξ) is analytic as the distributions are both Gaussian. The difference log q(ξ) − log r(ξ | z) is simply a difference of Gaussian log densities. See Appendix C for more details.

To derive black box gradients, we can first reparameterize the VGP, separating noise generation of samples from the parameters in its generative process (Kingma & Welling, 2014; Rezende et al., 2014). The GP easily enables reparameterization: for latent inputs ξ ∼ N(0, I), the transformation f_i(ξ; θ) = Lξ + K_{ξs} K_ss^{−1} t_i is a location-scale transform, where LL^⊤ = K_{ξξ} − K_{ξs} K_ss^{−1} K_{ξs}^⊤. This is equivalent to evaluating ξ with a random mapping from the GP. Suppose the mean-field q(z | f(ξ)) is also reparameterizable, and let ε ∼ w such that z(ε; f) is a function of ε whose output z ∼ q(z | f(ξ)).
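The location-scale transform can be checked numerically on a one-dimensional toy problem. The kernel, the variational data, and the use of fresh standard-normal noise eps for the scale term are illustrative simplifications:

```python
import numpy as np

rng = np.random.default_rng(4)

def rbf(a, b, omega=1.0):
    # Simplified one-dimensional ARD kernel with sigma2 = 1 (an illustrative choice).
    return np.exp(-0.5 * omega * (a[:, None] - b[None, :])**2)

s = np.linspace(-1.0, 1.0, 4)                        # variational inputs (c = 1)
t = np.array([0.0, 0.5, -0.5, 1.0])                  # variational outputs for one dimension
xi = np.array([0.3, 0.7])                            # two query inputs

K_ss = rbf(s, s) + 1e-8 * np.eye(4)
K_xs = rbf(xi, s)
K_xx = rbf(xi, xi)
mean = K_xs @ np.linalg.solve(K_ss, t)               # K_xs K_ss^{-1} t
cov = K_xx - K_xs @ np.linalg.solve(K_ss, K_xs.T)    # conditional covariance of Eq. 4
L = np.linalg.cholesky(cov + 1e-8 * np.eye(2))       # L L^T = conditional covariance

# Location-scale transform: f = L eps + mean is a deterministic, differentiable
# function of the noise, so gradients can flow through it to the kernel
# hyperparameters and the variational data.
eps = rng.normal(size=(2, 100_000))
f = mean[:, None] + L @ eps
print(np.allclose(f.mean(axis=1), mean, atol=0.02))  # True
```

The empirical mean of the transformed noise matches the analytic conditional mean, confirming that the transform reproduces the GP conditional while keeping the randomness separate from the parameters.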
This two-level reparameterization is equivalent to the generative process for z outlined in Section 2.3. We now rewrite the variational objective as

    L̃(θ, φ) = E_{N(ξ)}[ E_{w(ε)}[ log p(x | z(ε; f)) ] − KL(q(z | f) ‖ p(z)) ]
               − E_{N(ξ)}[ E_{w(ε)}[ KL(q(f | ξ; θ) ‖ r(f | ξ, z(ε; f); φ)) + log q(ξ) − log r(ξ | z(ε; f)) ] ].   (7)

Eq. 7 enables gradients to move inside the expectations and backpropagate over the nested reparameterization. Thus we can take unbiased stochastic gradients, which exhibit low variance due to both the analytic KL terms and the reparameterization. The gradients are derived in Appendix D, including the case when the first KL is analytically intractable. We outline the method in Algorithm 1.

For massive data, we apply subsampling on x (Hoffman et al., 2013). For gradients of the model log-likelihood, we employ convenient differentiation tools such as those in Stan and Theano (Carpenter et al., 2015; Bergstra et al., 2010). For non-differentiable latent variables z, or mean-field distributions without efficient reparameterizations, we apply the black box gradient estimator from Ranganath et al. (2014) to take gradients of the inner expectation.

¹ We let the kernel hyperparameters of the VGP be fixed across data points. Note also that, unlike other auto-encoder approaches, we let r's inference network take both x_n and z_n as input: this avoids an explicit specification of the conditional distribution r(ξ, f | z), which may be difficult to model. This idea was first suggested (but not implemented) in Ranganath et al. (2015b).

Algorithm 1: Black box inference with a variational Gaussian process
Input: Model p(x, z), mean-field family ∏_i q(z_i | f_i(ξ)).
Output: Variational and auxiliary parameters (θ, φ).
Initialize (θ, φ) randomly.
while not converged do
    Draw noise samples ξ ∼ N(0, I), ε ∼ w.
    Parameterize variational samples z = z(ε; f(ξ)), f(ξ) = f(ξ; θ).
    Update (θ, φ) with stochastic gradients ∇_θ L̃, ∇_φ L̃.
end

3.4 COMPUTATIONAL AND STORAGE COMPLEXITY

The algorithm has O(d + m³ + LH²) complexity, where d is the number of latent variables, m is the size of the variational data, and L is the number of layers of the neural networks with H the average hidden layer size. In particular, the algorithm is linear in the number of latent variables, which is competitive with other variational inference methods. The number of variational and auxiliary parameters has O(c + LH) complexity; this complexity comes from storing the kernel hyperparameters and the neural network parameters. Unlike most GP literature, we require no low rank constraints, such as the use of inducing variables for scalable computation (Quiñonero-Candela & Rasmussen, 2005). The variational data serve a similar purpose, but inducing variables reduce the rank of a (fixed) kernel matrix; the variational data directly determine the kernel matrix and thus the kernel matrix is not fixed. Although we have not found it necessary in practice, see Appendix E for scaling the size of variational data.

4 RELATED WORK

Recently, there has been interest in applying parametric transformations for approximate inference. Parametric transformations of random variables induce a density in the transformed space, with a Jacobian determinant that accounts for how the transformation warps unit volumes. Kucukelbir et al. (2016) consider this viewpoint for automating inference, in which they posit a transformation from the standard normal to a possibly constrained latent variable space. In general, however, calculating the Jacobian determinant incurs a costly O(d³) complexity, cubic in the number of latent variables. Dinh et al.
(2015) consider volume-preserving transformations which avoid calculating Jacobian determinants. Salimans et al. (2015) consider volume-preserving transformations defined by Markov transition operators. Rezende & Mohamed (2015) consider a slightly broader class of parametric transformations, with Jacobian determinants having at most O(d) complexity.

Instead of specifying a parametric class of mappings, the VGP posits a Bayesian nonparametric prior over all continuous mappings. The VGP can recover a certain class of parametric transformations by using kernels which induce a prior over that class. In the context of the VGP, the GP is an infinitely wide feedforward network which warps latent inputs to mean-field parameters. Thus, the VGP offers complete flexibility on the space of mappings: there are no restrictions such as invertibility or linear complexity, and it is fully Bayesian. Further, it is a hierarchical variational model, using the GP as a variational prior over mean-field parameters (Ranganath et al., 2015b). This enables inference over both discrete and continuous latent variable models.

In addition to its flexibility over parametric methods, the VGP is more computationally efficient. Parametric methods must consider transformations with Jacobian determinants of at most O(d) complexity. This restricts the flexibility of the mapping and therefore the flexibility of the variational model (Rezende & Mohamed, 2015). In comparison, the distribution of outputs using a GP prior does not require any Jacobian determinants (following Eq. 4); instead it requires auxiliary inference for inferring variational latent variables (which is fast). Further, unlike discrete Bayesian nonparametric priors such as an infinite mixture of mean-field distributions, the GP enables black box inference with lower variance gradients: it applies a location-scale transform for reparameterization and has analytically tractable KL terms.

Model                                 −log p(x)   ≤
DLGM + VAE [1]                                    86.76
DLGM + HVI (8 leapfrog steps) [2]     85.51       88.30
DLGM + NF (k = 80) [3]                            85.10
EoNADE-5 2hl (128 orderings) [4]      84.68
DBN 2hl [5]                           84.55
DARN 1hl [6]                          84.13
Convolutional VAE + HVI [2]           81.94       83.49
DLGM 2hl + IWAE (k = 50) [1]          82.90
DRAW [7]                                          80.97
DLGM 1hl + VGP                                    84.79
DLGM 2hl + VGP                                    81.32
DRAW + VGP                                        79.88

Table 1: Negative predictive log-likelihood for binarized MNIST. Previous best results are [1] (Burda et al., 2016), [2] (Salimans et al., 2015), [3] (Rezende & Mohamed, 2015), [4] (Raiko et al., 2014), [5] (Murray & Salakhutdinov, 2009), [6] (Gregor et al., 2014), [7] (Gregor et al., 2015).

Transformations, which convert samples from a tractable distribution to the posterior, are a classic technique in Bayesian inference. They were first studied in Monte Carlo methods, where they are core to the development of methods such as path sampling, annealed importance sampling, and sequential Monte Carlo (Gelman & Meng, 1998; Neal, 1998; Chopin, 2002). These methods can be recast as specifying a discretized mapping f_t for times t_0 < ... < t_k, such that for draws ξ from the tractable distribution, f_{t_0}(ξ) outputs the same samples and f_{t_k}(ξ) outputs exact samples following the posterior. By applying the sequence in various forms, the transformation bridges the tractable distribution to the posterior. Specifying a good transformation, termed a "schedule" in the literature, is crucial to the efficiency of these methods. Rather than specify it explicitly, the VGP adaptively learns this transformation and avoids discretization.

Limiting the VGP in various ways recovers well-known probability models as variational approximations. Specifically, we recover the discrete mixture of mean-field distributions (Bishop et al., 1998; Jaakkola & Jordan, 1998).
We also recover a form of factor analysis (Tipping & Bishop, 1999) in the variational space. Mathematical details are in Appendix A.

5 EXPERIMENTS

Following standard benchmarks for variational inference in deep learning, we learn generative models of images. In particular, we learn the deep latent Gaussian model (DLGM) (Rezende et al., 2014), a layered hierarchy of Gaussian random variables following neural network architectures, and the recently proposed Deep Recurrent Attentive Writer (DRAW) (Gregor et al., 2015), a latent attention model that iteratively constructs complex images using a recurrent architecture and a sequence of variational auto-encoders (Kingma & Welling, 2014).

For the learning rate we apply a version of RMSProp (Tieleman & Hinton, 2012), in which we scale the value with a decaying schedule 1/t^{1/2+ε} for ε > 0. We fix the size of variational data to be 500 across all experiments and set the latent input dimension equal to the number of latent variables.

5.1 BINARIZED MNIST

The binarized MNIST data set (Salakhutdinov & Murray, 2008) consists of 28x28 pixel images with binary-valued outcomes. Training a DLGM, we apply two stochastic layers of 100 random variables and 50 random variables respectively, and in between each stochastic layer is a deterministic layer with 100 units using tanh nonlinearities. We apply mean-field Gaussian distributions for the stochastic layers and a Bernoulli likelihood.

Model         Epochs   ≤ −log p(x)
DRAW          100      526.8
              200      479.1
              300      464.5
DRAW + VGP    100      460.1
              200      444.0
              300      423.9

Table 2: Negative predictive log-likelihood for Sketch, learned over hundreds of epochs over all 18,000 training examples.

Figure 3: Generated images from DRAW with a VGP (top), and DRAW with the original variational auto-encoder (bottom). The VGP learns texture and sharpness, able to sketch more complex shapes.
We train the VGP to learn the DLGM for the cases of one stochastic layer and two stochastic layers. For DRAW (Gregor et al., 2015), we augment the mean-field Gaussian distribution originally used to generate the latent samples at each time step with the VGP, as it places a complex variational prior over its parameters. The encoding recurrent neural network now outputs variational data (used for the variational model) as well as mean-field Gaussian parameters (used for the auxiliary model). We use the same architecture hyperparameters as in Gregor et al. (2015).

After training we evaluate test set log likelihoods, which are lower bounds on the true value. See Table 1, which reports both approximations and lower bounds of log p(x) for various methods. The VGP achieves the highest known results on log-likelihood using DRAW, reporting a value of −79.88 compared to the original highest of −80.97. The VGP also achieves the highest known results among the class of non-structure-exploiting models using the DLGM, with a value of −81.32 compared to the previous best of −82.90 reported by Burda et al. (2016).

5.2 SKETCH

As a demonstration of the VGP's complexity for learning representations, we also examine the Sketch data set (Eitz et al., 2012). It consists of 20,000 human sketches equally distributed over 250 object categories. We partition it into 18,000 training examples and 2,000 test examples. We fix the architecture of DRAW to have a 2x2 read window, 5x5 write attention window, and 64 glimpses; these values were selected using a coarse grid search, choosing the set which leads to the best training log likelihood. For inference we use the original auto-encoder version as well as the augmented version with the VGP. See Table 2.
DRAW with the VGP achieves a significantly better lower bound, outperforming the original version, which has seen state-of-the-art success in many computer vision tasks. (Until the results presented here, the results from the original DRAW were the best reported performance for this data set.) Moreover, the model inferred using the VGP is able to generate more complex images than the original version: it not only performs better but maintains higher visual fidelity.

6 DISCUSSION

We present the variational Gaussian process (VGP), a variational model which adapts its shape to match complex posterior distributions. The VGP draws samples from a tractable distribution, and posits a Bayesian nonparametric prior over transformations from the tractable distribution to mean-field parameters. The VGP learns the transformations from the space of all continuous mappings; it is a universal approximator and finds good posterior approximations via optimization.

In future work the VGP will be explored for application in Monte Carlo methods, where it may be an efficient proposal distribution for importance sampling and sequential Monte Carlo. An important avenue of research is also to characterize local optima inherent to the objective function. Such analysis will improve our understanding of the limits of the optimization procedure and thus the limits of variational inference.

ACKNOWLEDGEMENTS

We thank David Duvenaud, Alp Kucukelbir, Ryan Giordano, and the anonymous reviewers for their helpful comments. This work is supported by NSF IIS-0745520, IIS-1247664, IIS-1009542, ONR N00014-11-1-0651, DARPA FA8750-14-2-0009, N66001-15-C-4032, Facebook, Adobe, Amazon, and the Seibel and John Templeton Foundations.

REFERENCES

Agakov, Felix V and Barber, David. An auxiliary variational method. In Neural Information Processing, pp. 561–566. Springer, 2004.
Bergstra, James, Breuleux, Olivier, Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Desjardins, Guillaume, Turian, Joseph, Warde-Farley, David, and Bengio, Yoshua. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), 2010.

Bishop, Christopher M., Lawrence, Neil D., Jordan, Michael I., and Jaakkola, Tommi. Approximating posterior distributions in belief networks using mixtures. In Neural Information Processing Systems, 1998.

Burda, Yuri, Grosse, Roger, and Salakhutdinov, Ruslan. Importance weighted autoencoders. In International Conference on Learning Representations, 2016.

Carpenter, Bob, Hoffman, Matthew D., Brubaker, Marcus, Lee, Daniel, Li, Peter, and Betancourt, Michael. The Stan Math Library: Reverse-mode automatic differentiation in C++. arXiv preprint arXiv:1509.07164, 2015.

Chopin, Nicolas. A sequential particle filter method for static models. Biometrika, 89(3):539–552, 2002.

Cunningham, John P, Shenoy, Krishna V, and Sahani, Maneesh. Fast Gaussian process methods for point process intensity estimation. In International Conference on Machine Learning. ACM, 2008.

Dinh, Laurent, Krueger, David, and Bengio, Yoshua. NICE: Non-linear independent components estimation. In International Conference on Learning Representations Workshop, 2015.

Eitz, Mathias, Hays, James, and Alexa, Marc. How do humans sketch objects? ACM Trans. Graph. (Proc. SIGGRAPH), 31(4):44:1–44:10, 2012.

Gelman, Andrew and Meng, Xiao-Li. Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statistical Science, 1998.

Gregor, Karol, Danihelka, Ivo, Mnih, Andriy, Blundell, Charles, and Wierstra, Daan. Deep autoregressive networks. In International Conference on Machine Learning, 2014.

Gregor, Karol, Danihelka, Ivo, Graves, Alex, Rezende, Danilo Jimenez, and Wierstra, Daan.
DRAW: A recurrent neural network for image generation. In International Conference on Machine Learning, 2015.

Hinton, G. and Van Camp, D. Keeping the neural networks simple by minimizing the description length of the weights. In Computational Learning Theory, pp. 5–13. ACM, 1993.

Hinton, Geoffrey E and Zemel, Richard S. Autoencoders, minimum description length, and Helmholtz free energy. In Neural Information Processing Systems, 1994.

Hinton, Geoffrey E, Osindero, Simon, and Teh, Yee-Whye. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

Hoffman, Matthew D, Blei, David M, Wang, Chong, and Paisley, John. Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347, 2013.

Jaakkola, Tommi S and Jordan, Michael I. Improving the mean field approximation via the use of mixture distributions. In Learning in Graphical Models, pp. 163–173. Springer Netherlands, Dordrecht, 1998.

Jordan, Michael I, Ghahramani, Zoubin, Jaakkola, Tommi S, and Saul, Lawrence K. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

Kingma, Diederik P and Welling, Max. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

Kucukelbir, Alp, Tran, Dustin, Ranganath, Rajesh, Gelman, Andrew, and Blei, David M. Automatic differentiation variational inference. arXiv preprint arXiv:1603.00788, 2016.

Lawrence, Neil. Variational Inference in Probabilistic Models. PhD thesis, 2000.

Lawrence, Neil. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. The Journal of Machine Learning Research, 6:1783–1816, 2005.

Murray, Iain and Salakhutdinov, Ruslan R. Evaluating probabilities under high-dimensional latent variable models. In Advances in Neural Information Processing Systems, pp. 1137–1144, 2009.

Neal, Radford M.
Learning stochastic feedforward networks. Department of Computer Science, University of Toronto, 1990.

Neal, Radford M. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 1998.

Nelsen, Roger B. An Introduction to Copulas (Springer Series in Statistics). Springer-Verlag New York, Inc., 2006.

Osborne, Michael. Bayesian Gaussian processes for sequential prediction, optimisation and quadrature. PhD thesis, Oxford University New College, 2010.

Quiñonero-Candela, Joaquin and Rasmussen, Carl Edward. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005.

Raiko, Tapani, Li, Yao, Cho, Kyunghyun, and Bengio, Yoshua. Iterative neural autoregressive distribution estimator NADE-k. In Advances in Neural Information Processing Systems, pp. 325–333, 2014.

Ranganath, Rajesh, Gerrish, Sean, and Blei, David M. Black box variational inference. In Artificial Intelligence and Statistics, 2014.

Ranganath, Rajesh, Tang, Linpeng, Charlin, Laurent, and Blei, David M. Deep exponential families. In Artificial Intelligence and Statistics, 2015a.

Ranganath, Rajesh, Tran, Dustin, and Blei, David M. Hierarchical variational models. arXiv preprint arXiv:1511.02386, 2015b.

Rasmussen, Carl Edward and Williams, Christopher K I. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, 2006.

Rezende, Danilo Jimenez and Mohamed, Shakir. Variational inference with normalizing flows. In International Conference on Machine Learning, 2015.

Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.

Robbins, Herbert and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, 1951.

Salakhutdinov, Ruslan and Larochelle, Hugo.
Efficient learning of deep Boltzmann machines. In International Conference on Artificial Intelligence and Statistics, pp. 693–700, 2010.

Salakhutdinov, Ruslan and Murray, Iain. On the quantitative analysis of deep belief networks. In International Conference on Machine Learning, 2008.

Salimans, Tim, Kingma, Diederik P, and Welling, Max. Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, 2015.

Saul, Lawrence K and Jordan, Michael I. Exploiting tractable substructures in intractable networks. In Neural Information Processing Systems, 1996.

Tieleman, T. and Hinton, G. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

Tipping, Michael E and Bishop, Christopher M. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999.

Tran, Dustin, Blei, David M., and Airoldi, Edoardo M. Copula variational inference. In Neural Information Processing Systems, 2015.

Van Der Vaart, Aad and Van Zanten, Harry. Information rates of nonparametric Gaussian process methods. The Journal of Machine Learning Research, 12:2095–2119, 2011.

Wainwright, Martin J and Jordan, Michael I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.

Waterhouse, S., MacKay, D., and Robinson, T. Bayesian methods for mixtures of experts. In Neural Information Processing Systems, 1996.

A SPECIAL CASES OF THE VARIATIONAL GAUSSIAN PROCESS

We now analyze two special cases of the VGP: by limiting its generative process in various ways, we recover well-known models. This provides intuition behind the VGP's complexity.
In Section 4 we show many recently proposed models can also be viewed as special cases of the VGP.

Special Case 1. A mixture of mean-field distributions is a VGP without a kernel.

A discrete mixture of mean-field distributions (Bishop et al., 1998; Jaakkola & Jordan, 1998; Lawrence, 2000) is a classically studied variational model with dependencies between latent variables. Instead of a mapping which interpolates between inputs of the variational data, suppose the VGP simply performs nearest-neighbors for a latent input ξ, selecting the output t_n tied to the nearest variational input s_n. This induces a multinomial distribution over outputs, which samples one of the variational outputs' mean-field parameters.² Thus, with a GP prior that interpolates between inputs, the VGP can be seen as a kernel density smoothing of the nearest-neighbor function.

Special Case 2. Variational factor analysis is a VGP with a linear kernel and no variational data.

Consider factor analysis (Tipping & Bishop, 1999) in the variational space:³

ξ ∼ N(0, I),    z_i ∼ N(w⊤ξ, I).

Marginalizing over the latent inputs induces linear dependence in z, q(z; w) = N(z; 0, ww⊤). Consider the dual interpretation

ξ ∼ N(0, I),    f_i ∼ GP(0, k(·, ·)),    k(s, s′) = s⊤s′,    z_i = f_i(ξ),

with q(z | ξ) = N(z; 0, ξξ⊤). The maximum likelihood estimate of w in factor analysis is the maximum a posteriori estimate of ξ in the GP formulation. More generally, use of a non-linear kernel induces non-linear dependence in z. Learning the set of kernel hyperparameters θ thus learns the set capturing the most variation in its latent embedding of z (Lawrence, 2005).

B PROOF OF THEOREM 1

Theorem 1. Let q(z; θ, D) denote the variational Gaussian process. Consider a posterior distribution p(z | x) with a finite number of latent variables and continuous quantile function (inverse CDF).
There exists a sequence of parameters (θ_k, D_k) such that lim_{k→∞} KL(q(z; θ_k, D_k) ∥ p(z | x)) = 0.

² Formally, given variational input-output pairs {(s_n, t_n)}, the nearest-neighbor function is defined as f(ξ) = t_j such that ∥ξ − s_j∥ < ∥ξ − s_k∥ for all k ≠ j. Then the output's distribution is multinomial with probabilities P(f(ξ) = t_j) proportional to the areas of the partitioned nearest-neighbor space.

³ For simplicity, we avoid discussion of the VGP's underlying mean-field distribution, i.e., we specify each mean-field factor to be a degenerate point mass at its parameter value.

Proof. Let the mean-field distribution be given by degenerate delta distributions q(z_i | f_i) = δ_{f_i}(z_i). Let the size of the latent input be equivalent to the number of latent variables, c = d, and fix σ²_ARD = 1 and ω_j = 1. Furthermore, for simplicity, we assume that ξ is drawn uniformly on the d-dimensional hypercube. Then as explained in Section 2.4, if we let P⁻¹ denote the inverse posterior cumulative distribution function, the optimal f, denoted f*, such that KL(q(z; θ) ∥ p(z | x)) = 0 is

f*(ξ) = P⁻¹(ξ₁, ..., ξ_d).

Define O_k to be the set of points j/2^k for j = 0 to 2^k, and define S_k to be the d-dimensional product of O_k. Let D_k be the set containing the pairs (s_i, f*(s_i)) for each element s_i in S_k. Denote f_k as the GP mapping conditioned on the dataset D_k; this random mapping satisfies f_k(s_i) = f*(s_i) for all s_i ∈ S_k by the noise-free prediction property of Gaussian processes (Rasmussen & Williams, 2006). Then by continuity, as k → ∞, f_k converges to f*.

A broad condition under which the quantile function of a distribution is continuous is that the distribution has positive density with respect to the Lebesgue measure.
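A one-dimensional sketch of this construction: condition a noise-free GP on the grid j/2^k paired with quantile values f*(j/2^k), and watch the posterior mean converge to f* as k grows. The Matérn-1/2 (Ornstein-Uhlenbeck) kernel, its lengthscale, and the toy target f*(u) = √u (the quantile function of the density p(x) = 2x on [0, 1]) are all assumed choices for the illustration, not the paper's setup.

```python
import numpy as np

def ou_kernel(a, b, ell=0.2):
    """Matern-1/2 (Ornstein-Uhlenbeck) covariance; chosen here because it
    conditions stably without noise, unlike a squared-exponential kernel."""
    return np.exp(-np.abs(a[:, None] - b[None, :]) / ell)

f_star = np.sqrt                       # quantile function of p(x) = 2x on [0, 1]
x_test = np.linspace(0.0, 1.0, 257)

errors = []
for k in (2, 4, 6):
    s = np.arange(2 ** k + 1) / 2 ** k                # variational inputs j / 2^k
    K = ou_kernel(s, s) + 1e-10 * np.eye(len(s))      # tiny jitter for stability
    # Noise-free GP posterior mean k(x*, S) K^{-1} f*(S); it passes through
    # the conditioning pairs exactly, as in the proof.
    mean = ou_kernel(x_test, s) @ np.linalg.solve(K, f_star(s))
    errors.append(np.max(np.abs(mean - f_star(x_test))))
```

The maximum error over [0, 1] shrinks as the grid refines, mirroring the convergence of f_k to f*.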
The rate of convergence for finite sizes of the variational data can be studied via posterior contraction rates for GPs under random covariates (Van Der Vaart & Van Zanten, 2011). Only an additional assumption using stronger continuity conditions for the posterior quantile and the use of Matérn covariance functions is required for the theory to be applicable in the variational setting.

C VARIATIONAL OBJECTIVE

We derive the tractable lower bound to the model evidence log p(x) presented in Eq. 6. To do this, we first penalize the ELBO with an expected KL term,

log p(x) ≥ L = E_{q_VGP}[log p(x | z)] − KL(q_VGP(z) ∥ p(z))
            ≥ E_{q_VGP}[log p(x | z)] − KL(q_VGP(z) ∥ p(z)) − E_{q_VGP}[KL(q(ξ, f | z) ∥ r(ξ, f | z))].

We can combine all terms into the expectations as follows:

L̃ = E_{q(z,ξ,f)}[log p(x | z) − log q(z) + log p(z) − log q(ξ, f | z) + log r(ξ, f | z)]
  = E_{q(z,ξ,f)}[log p(x | z) − log q(z | f(ξ)) + log p(z) − log q(ξ, f) + log r(ξ, f | z)],

where we apply the product rule q(z) q(ξ, f | z) = q(z | f(ξ)) q(ξ, f). Recombining terms as KL divergences, and written with parameters (θ, φ), this recovers the auto-encoded variational objective in Section 3:

L̃(θ, φ) = E_{q_VGP}[log p(x | z)] − E_{q_VGP}[KL(q(z | f(ξ)) ∥ p(z))]
         − E_{q_VGP}[KL(q(f | ξ; θ) ∥ r(f | ξ, z; φ)) + log q(ξ) − log r(ξ | z)].

The KL divergence between the mean-field q(z | f(ξ)) and the model prior p(z) is analytically tractable for certain popular models. For example, in the deep latent Gaussian model (Rezende et al., 2014) and DRAW (Gregor et al.
, 2015), both the mean-field distribution and model prior are Gaussian, leading to an analytic KL term: for Gaussian random variables of dimension d,

KL(N(x; m₁, Σ₁) ∥ N(x; m₂, Σ₂)) = ½ [(m₁ − m₂)⊤ Σ₂⁻¹ (m₁ − m₂) + tr(Σ₂⁻¹ Σ₁) + log det Σ₂ − log det Σ₁ − d].

In general, when the KL is intractable, we combine the KL term with the reconstruction term, and maximize the variational objective

L̃(θ, φ) = E_{q_VGP}[log p(x, z) − log q(z | f(ξ))]
         − E_{q_VGP}[KL(q(f | ξ; θ) ∥ r(f | ξ, z; φ)) + log q(ξ) − log r(ξ | z)].   (8)

We expect that this experiences slightly higher variance in the stochastic gradients during optimization.

We now consider the second term. Recall that we specify the auxiliary model to be a fully factorized Gaussian, r(ξ, f | z) = N((ξ, f(ξ))⊤ | z; m, S), where m ∈ R^{c+d}, S ∈ R^{c+d}. Further, the variational priors q(ξ) and q(f | ξ) are both defined to be Gaussian. Therefore the second term is also a KL divergence between Gaussian distributed random variables. Similarly, log q(ξ) − log r(ξ | z) is simply a difference of Gaussian log densities. The second expression is simple to compute and to backpropagate gradients through.

D GRADIENTS OF THE VARIATIONAL OBJECTIVE

We derive gradients for the variational objective (Eq. 7). This follows by backpropagation:

∇_θ L̃(θ, φ) = E_{N(ξ)}[E_{w(ε)}[∇_θ f(ξ) ∇_f z(ε) ∇_z log p(x | z)]]
             − E_{N(ξ)}[E_{w(ε)}[∇_θ KL(q(z | f(ξ; θ)) ∥ p(z))]]
             − E_{N(ξ)}[E_{w(ε)}[∇_θ KL(q(f | ξ; θ) ∥ r(f | ξ, z; φ))]],

∇_φ L̃(θ, φ) = −E_{N(ξ)}[E_{w(ε)}[∇_φ KL(q(f | ξ; θ) ∥ r(f | ξ, z; φ)) − ∇_φ log r(ξ | z; φ)]],

where we assume the KL terms are analytically written from Appendix C and gradients are propagated similarly through their computational graph.
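The analytic Gaussian KL term used throughout these objectives is straightforward to compute directly; a minimal numpy sketch of the full-covariance case, with `slogdet` used for numerically stable log-determinants:

```python
import numpy as np

def gaussian_kl(m1, S1, m2, S2):
    """KL(N(m1, S1) || N(m2, S2)) between d-dimensional Gaussians."""
    d = len(m1)
    S2_inv = np.linalg.inv(S2)
    diff = m1 - m2
    _, logdet1 = np.linalg.slogdet(S1)   # stable log-determinants
    _, logdet2 = np.linalg.slogdet(S2)
    return 0.5 * (diff @ S2_inv @ diff + np.trace(S2_inv @ S1)
                  + logdet2 - logdet1 - d)

kl_zero = gaussian_kl(np.zeros(2), np.eye(2), np.zeros(2), np.eye(2))   # → 0.0
kl_shift = gaussian_kl(np.ones(2), np.eye(2), np.zeros(2), np.eye(2))   # → 1.0 (= ||m||^2 / 2)
```

For the mean-field and diagonal-Gaussian cases in the paper, the covariances are diagonal and the same formula reduces to elementwise operations.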
In practice, we need only be careful about the expectations; the gradients of the functions written above are handled by automatic differentiation tools.

We also derive gradients for the general variational bound of Eq. 8; it assumes that the first KL term, measuring the divergence between q and the prior for p, is not necessarily tractable. Following the reparameterizations described in Section 3.3, this variational objective can be rewritten as

L̃(θ, φ) = E_{N(ξ)}[E_{w(ε)}[log p(x, z(ε; f)) − log q(z(ε; f) | f)]]
         − E_{N(ξ)}[E_{w(ε)}[KL(q(f | ξ; θ) ∥ r(f | ξ, z(ε; f); φ)) + log q(ξ) − log r(ξ | z(ε; f))]].

We calculate gradients by backpropagating over the nested reparameterizations:

∇_θ L̃(θ, φ) = E_{N(ξ)}[E_{w(ε)}[∇_θ f(ξ) ∇_f z(ε)[∇_z log p(x, z) − ∇_z log q(z | f)]]]
             − E_{N(ξ)}[E_{w(ε)}[∇_θ KL(q(f | ξ; θ) ∥ r(f | ξ, z; φ))]],

∇_φ L̃(θ, φ) = −E_{N(ξ)}[E_{w(ε)}[∇_φ KL(q(f | ξ; θ) ∥ r(f | ξ, z; φ)) − ∇_φ log r(ξ | z; φ)]].

E SCALING THE SIZE OF VARIATIONAL DATA

If massive sizes of variational data are required, e.g., when the cubic complexity due to inversion of an m × m matrix becomes the bottleneck during computation, we can scale further. Consider fixing the variational inputs to lie on a grid. For stationary kernels, this allows us to exploit Toeplitz structure for fast m × m matrix inversion. In particular, one can embed the Toeplitz matrix into a circulant matrix and apply conjugate gradient combined with fast Fourier transforms in order to compute inverse-matrix vector products in O(m log m) computation and O(m) storage (Cunningham et al., 2008). For product kernels, we can further exploit Kronecker structure to allow fast m × m matrix inversion in O(P m^{1+1/P}) operations and O(P m^{2/P}) storage, where P > 1 is the number of kernel products (Osborne, 2010).
The ARD kernel specifically leads to O(c m^{1+1/c}) complexity, which is nearly linear in m.
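The circulant-embedding trick above can be sketched for its core operation: a fast Toeplitz matrix-vector product, which is exactly what a conjugate-gradient solver would call repeatedly to compute inverse-matrix vector products. This is a minimal sketch; the kernel values in the example are arbitrary.

```python
import numpy as np

def toeplitz_matvec(t, v):
    """Multiply the symmetric Toeplitz matrix with first column t by v in
    O(m log m), by embedding it in a circulant matrix of size 2m - 1 and
    using the FFT (circulant matrices are diagonalized by the DFT)."""
    m = len(t)
    c = np.concatenate([t, t[-1:0:-1]])                 # circulant first column
    w = np.fft.ifft(np.fft.fft(c) * np.fft.fft(v, n=2 * m - 1))
    return w[:m].real                                   # first m rows give T @ v

# Check against the dense product for a small stationary-kernel column.
t = np.array([4.0, 1.0, 0.5, 0.25])                     # k(0), k(h), k(2h), k(3h)
v = np.array([1.0, -2.0, 3.0, 0.5])
T = np.array([[t[abs(i - j)] for j in range(4)] for i in range(4)])
fast = toeplitz_matvec(t, v)
dense = T @ v
```

Pairing this matvec with a standard conjugate-gradient loop yields the O(m log m) inverse-matrix vector products cited from Cunningham et al. (2008), using only O(m) storage since the dense matrix is never formed.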