Adaptive Gaussian Predictive Process Approximation


Author: Surya T. Tokdar, Duke University
Abstract

We address the issue of knot selection for Gaussian predictive process methodology. Predictive process approximation provides an effective solution to the cubic order computational complexity of Gaussian process models. This approximation crucially depends on a set of points, called knots, at which the original process is retained, while the rest is approximated via a deterministic extrapolation. Knots should be few in number to keep the computational complexity low, but should provide good coverage of the process domain to limit the approximation error. We present theoretical calculations to show that coverage must be judged by the canonical metric of the Gaussian process. This necessitates having in place a knot selection algorithm that automatically adapts to the changes in the canonical metric caused by changes in the parameter values controlling the Gaussian process covariance function. We present an algorithm toward this by employing an incomplete Cholesky factorization with pivoting and dynamic stopping. Although these concepts already exist in the literature, our contribution lies in unifying them into a fast algorithm and in using computable error bounds to finesse the implementation of the predictive process approximation. The resulting adaptive predictive process offers a substantial automatization of Gaussian process model fitting, especially for Bayesian applications where thousands of values of the covariance parameters are to be explored.

Keywords: Gaussian predictive process, knot selection, Cholesky factorization, pivoting, Bayesian model fitting, Markov chain sampling.

1 Introduction

Bayesian nonparametric methodology is driven by the construction of prior distributions on function spaces. Toward this, Gaussian process distributions have proved extremely useful due to their mathematical and computational tractability and their ability to incorporate a wide range of smoothness assumptions. Gaussian process models have been widely used in spatio-temporal modeling (Handcock and Stein, 1993; Kim et al., 2005; Banerjee et al., 2008), computer emulation (Kennedy and O'Hagan, 2001; Oakley and O'Hagan, 2002), nonparametric regression and classification (Neal, 1998; Csató et al., 2000; Rasmussen and Williams, 2006; Short et al., 2007), density and quantile regression (Tokdar et al., 2010; Tokdar and Kadane, 2011), functional data analysis (Shi and Wang, 2008; Petrone et al., 2009), image analysis (Sudderth and Jordan, 2009), etc. Rasmussen and Williams (2006) give a thorough overview of likelihood based exploration of Gaussian process models, including Bayesian treatments. For theoretical details on common Bayesian models based on Gaussian processes, see Tokdar and Ghosh (2007), Choi and Schervish (2007), Ghosal and Roy (2006), van der Vaart and van Zanten (2008, 2009) and the references therein.

For our purpose, a Gaussian process can be viewed as a random, real valued function ω = (ω(t), t ∈ T) on a compact Euclidean domain T, such that for any finitely many points t_1, ..., t_k ∈ T the random vector (ω(t_1), ..., ω(t_k)) is a k-dimensional Gaussian vector. A Gaussian process ω is completely characterized by its mean and covariance functions µ(t) = E[ω(t)] and ψ(s, t) = Cov[ω(s), ω(t)].
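This finite-dimensional view is easy to make concrete in code. The following minimal sketch is ours, not part of the paper: it assumes a squared-exponential covariance ψ(s, t) = exp(−β²‖s − t‖²), draws the Gaussian vector W = (ω(s_1), ..., ω(s_N)) via a full Cholesky factorization, and evaluates the canonical metric ρ(s, t) = [E{ω(s) − ω(t)}²]^{1/2} that plays a central role in what follows. All function names are our own.

```python
import numpy as np

def sq_exp_cov(s, t, beta=3.0):
    # Assumed covariance for this sketch: psi(s, t) = exp(-beta^2 ||s - t||^2).
    return np.exp(-beta**2 * np.sum((s - t)**2))

def cov_matrix(points, beta=3.0):
    # N x N covariance matrix of W = (omega(s_1), ..., omega(s_N)).
    N = len(points)
    return np.array([[sq_exp_cov(points[i], points[j], beta)
                      for j in range(N)] for i in range(N)])

def canonical_metric(s, t, beta=3.0):
    # rho(s, t) = sqrt(E{omega(s) - omega(t)}^2)
    #           = sqrt(psi(s,s) - 2 psi(s,t) + psi(t,t)) for a zero-mean process.
    return np.sqrt(sq_exp_cov(s, s, beta) - 2 * sq_exp_cov(s, t, beta)
                   + sq_exp_cov(t, t, beta))

rng = np.random.default_rng(0)
pts = rng.uniform(size=(200, 2))            # 200 points in T = [0, 1]^2
Psi = cov_matrix(pts)
L = np.linalg.cholesky(Psi + 1e-10 * np.eye(len(pts)))  # O(N^3) factorization
W = L @ rng.standard_normal(len(pts))       # one draw of the Gaussian vector W
print(canonical_metric(pts[0], pts[1]))
```

The np.linalg.cholesky call is precisely the O(N³) step that the predictive process approximation, introduced next, is designed to avoid.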
For a Gaussian process model, where a function valued parameter ω is assigned a Gaussian process prior distribution, the data likelihood typically involves ω through a vector W = (ω(s_1), ..., ω(s_N)) of ω values at a finite set of points s_1, ..., s_N ∈ T. These points could possibly depend on other model parameters. The fact that W is a Gaussian vector makes computation conceptually straightforward. However, a well known bottleneck in implementing Gaussian process models is the O(N³) complexity of inverting or factorizing the N × N covariance matrix of W. Various reduced rank approximations to covariance matrices have been proposed to overcome this problem (Smola and Bartlett, 2001; Seeger et al., 2003; Schwaighofer and Tresp, 2003; Quiñonero-Candela and Rasmussen, 2005; Snelson and Ghahramani, 2006), mostly reported in the machine learning literature. Among these, a special method of approximation, known as predictive process approximation (Tokdar, 2007; Banerjee et al., 2008), has been independently discovered and successfully used in the Bayesian literature. The appeal of this method lies in a stochastic process representation of the approximation that obtains from tracking ω at a small number of points, called knots, and extrapolating the rest by using properties of Gaussian process laws (Section 2.1).

For predictive process approximations, choosing the number and locations of the knots remains a difficult problem. This problem is only going to escalate as more complex Gaussian process models are used in hierarchical Bayesian modeling, with rich parametric and nonparametric formulations of the Gaussian process covariance function becoming commonplace. To understand this difficulty, we first lay out (Section 2.2) the basic probability theory behind the approximation accuracy of the predictive process and demonstrate how the choice of knots determines an accuracy bound. The key concept here is that the knots must provide a good coverage of the domain T of the Gaussian process. While this is intuitive, what needs emphasis is that the geometry of T is to be viewed through the topology induced by the Gaussian process canonical metric ρ(s, t) = [E{ω(s) − ω(t)}²]^{1/2}, which could be quite different from the Euclidean geometry of T. This theory helps us understand (Section 2.3) why existing approaches to choosing knots, based on space filling design concepts (Zhu and Stein, 2005; Zimmerman, 2006; Finley et al., 2009) or on model extensions where knots are learned from data as additional model parameters (Snelson and Ghahramani, 2006; Tokdar, 2007; Guhaniyogi et al., 2010), are likely to offer poor approximation and to face severe computational difficulties. A fundamental weakness of these approaches is their inability to automatically adapt to the changes in the geometry of T caused by changes in the values of the covariance parameters.

In Section 3 we present a simple extension of the predictive process approximation that enables it to automatically adapt to the geometry of the Gaussian process covariance function. This extension, called adaptive predictive process approximation, works with an equivalent representation of the predictive process through reduced rank Cholesky factorization (Section 3.1) and adds to it two adaptive features, pivoting and dynamic stopping.
Pivoting determines the order in which knots are selected from an initial set of candidate points, while dynamic stopping determines how many knots to select. The resulting approximation meets a pre-specified accuracy bound (Section 3.2). The connection between predictive process approximation and reduced rank Cholesky factorization is well known (Quiñonero-Candela and Rasmussen, 2005), and pivoting has recently been investigated in this context from the point of view of stable computation (Foster et al., 2009). The novelty of our work lies in unifying these ideas to define an adaptive version of the predictive process and in proposing accuracy bounds as the driving force in finessing the implementation of such approximation techniques. The end product is a substantial automatization of fitting Gaussian process models that can broaden the scope of such models without the additional burden of having to model or learn the knots. This is illustrated with two examples in Section 4.

2 Predictive process approximation

2.1 Definition

Fix a set of points, referred to as knots hereafter, {t_1, t_2, ..., t_m} ⊂ T and write ω(t) = ν(t) + ξ(t), where

ν(t) = E{ω(t) | ω(t_1), ..., ω(t_m)} and ξ(t) = ω(t) − E{ω(t) | ω(t_1), ..., ω(t_m)}.

By the properties of Gaussian process laws, ν and ξ are independent Gaussian processes. The process ν, called a Gaussian predictive process, has rank m, because it can be written as ν(t) = Σ_{i=1}^m A_i ψ(t_i, t) with the coefficient vector A = (A_1, ..., A_m) being a Gaussian vector. By replacing ω with ν in the statistical model, one now deals with the covariance matrix of V = (ν(s_1), ..., ν(s_N)), which, due to the rank-m property of ν, can be factorized in O(Nm²) time.

2.2 Accuracy bound

Replacing ω with ν can be justified as follows. Let δ = sup_{t∈T} min_{1≤i≤m} ρ(t, t_i) denote the mesh size of the knots, where ρ(t, s) = [E{ω(t) − ω(s)}²]^{1/2} is the canonical metric on T induced by ω (Adler, 1990, page 2). For a smooth ψ(t, s), δ can be made arbitrarily small by packing T with sufficiently many, well placed knots. But, as δ tends to 0, so does κ² = sup_{t∈T} Var{ξ(t)}. This is because for any t ∈ T and any i ∈ {1, ..., m}, by the independence of ν and ξ,

Var{ξ(t)} = Var{ω(t)} − Var{ν(t)} = E[Var{ω(t) | ω(t_1), ..., ω(t_m)}] = E[Var{ω(t) − ω(t_i) | ω(t_1), ..., ω(t_m)}] ≤ Var{ω(t) − ω(t_i)} = ρ²(t, t_i),

and hence κ ≤ δ. That κ can be made arbitrarily small is good news, because κ plays a key role in providing probabilistic bounds on the residual process ξ.

Theorem 2.1. Let ω be a zero mean Gaussian process on a compact subset T ⊂ R^p. Let ν be a finite rank predictive process approximation of ω with residual process ξ = ω − ν.

(i) If T ⊂ [a, b]^p and there is a finite constant c > 0 such that Var{ω(s) − ω(t)} ≤ c²‖s − t‖², s, t ∈ T, then

P( sup_{t∈T} |ξ(t)| > ε ) ≤ 3 exp( −ε² / (B²κ) ), for all ε > 0, (1)

with B = 27√{2pc(b − a)} and κ² = sup_{t∈T} Var{ξ(t)}.

(ii) For any finite subset S ⊂ T,

P( sup_{t∈S} |ξ(t)| > ε ) ≤ 3 exp( −ε² / {9κ_S²(2 + log |S|)} ), for all ε > 0, (2)

where |S| denotes the size of S and κ_S² = sup_{t∈S} Var{ξ(t)}.

A proof is given in Appendix A.
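The inequality κ ≤ δ and the quantities entering Theorem 2.1 are easy to check numerically. The sketch below is our illustration, not part of the paper: it assumes a one-dimensional squared-exponential covariance and an arbitrary equispaced set of knots, computes Var{ξ(t)} on a fine grid via the Gaussian conditional variance formula, and compares κ with the mesh size δ.

```python
import numpy as np

def psi(s, t, beta=3.0):
    # Assumed squared-exponential covariance, unit variance on the diagonal.
    return np.exp(-beta**2 * (s - t)**2)

grid = np.linspace(0.0, 1.0, 501)    # fine grid standing in for T = [0, 1]
knots = np.linspace(0.0, 1.0, 8)     # an arbitrary set of knots t_1, ..., t_m

K_tt = psi(knots[:, None], knots[None, :]) + 1e-10 * np.eye(len(knots))
K_gt = psi(grid[:, None], knots[None, :])

# Var{xi(t)} = psi(t, t) - psi(t, knots) K_tt^{-1} psi(knots, t).
var_xi = psi(grid, grid) - np.einsum(
    'ij,ij->i', K_gt, np.linalg.solve(K_tt, K_gt.T).T)
kappa = np.sqrt(max(var_xi.max(), 0.0))

# delta = sup_t min_i rho(t, t_i), rho^2(s, t) = psi(s,s) - 2 psi(s,t) + psi(t,t),
# which here reduces to 2 - 2 psi(s, t) since psi(t, t) = 1.
rho2 = np.maximum(2.0 - 2.0 * K_gt, 0.0)
delta = np.sqrt(rho2.min(axis=1).max())

print(kappa, delta, kappa <= delta)  # kappa <= delta, as derived above
```

Tightening the knot set shrinks both δ and κ, with κ staying below δ throughout.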
Note that the constant B does not depend on the number or locations of the knots; it depends only on the dimensionality and size of T and on the smoothness properties of the covariance function of ω. It is possible to replace κ in (1) with κ^{2(1−η)} for any arbitrary η ∈ (0, 1), but with a different constant B. While (1) provides an accuracy bound over the entire domain T, the bound in (2) over a finite subset may be of more practical value.

For Gaussian process regression models with additive Gaussian noise, a common modification (Finley et al., 2009) of the predictive process approximation is to replace ω with the process ν̃ = ν + ξ*, where ξ* is a zero mean Gaussian process, independent of ν and ξ, satisfying

Cov{ξ*(t), ξ*(s)} = Var{ξ(t)} if t = s, and 0 if t ≠ s.

The addition of ξ* gives ν̃ the same pointwise mean and variance as those of ω, without adding to the computational cost. The residual process is now given by ξ̃ = ω − ν̃ = ξ − ξ*, whose variance equals 2Var{ξ(t)} because of the independence between ξ and ξ*. Because ξ* is almost surely discontinuous, the bound in (1) does not apply to ξ̃. But (2) continues to hold with κ_S² replaced by κ̃_S² = 2κ_S².

2.3 Need for adaptation

For predictive process approximations, choosing the number and the locations of the knots remains a difficult problem. Ideally this choice should adapt to the canonical metric ρ, so that a small δ obtains with as few knots as possible. However, it is not a single ρ that we need to adapt to. In modern Gaussian process applications, the covariance function ψ, and consequently the canonical metric, depend on additional model parameters. A typical example is ω of the form

ω(t) = ω_0(t) + x_1(t)ω_1(t) + ··· + x_p(t)ω_p(t),

where the x_j(t)'s are known, fixed functions and the ω_j's are independent mean zero Gaussian processes with covariances ψ_j(t, s) = τ_j² exp(−β_j²‖s − t‖²), with θ = (τ_0, τ_1, ..., τ_p, β_0, β_1, ..., β_p) serving as a model parameter. Because a likelihood based model fitting will loop through hundreds or even thousands of values of θ, it is important to have a low-cost algorithm for choosing the knots that automatically adapts to the geometry of any arbitrary canonical metric.

Such an adaptive feature is lacking from existing knot selection approaches, which primarily treat knots as additional model parameters. The knots are then learned along with other model parameters, either via optimization (Snelson and Ghahramani, 2006) or by Markov chain sampling (Tokdar, 2007; Guhaniyogi et al., 2010). Another popular approach is to work with a fixed set of knots based on space-filling design concepts (Zhu and Stein, 2005; Zimmerman, 2006; Finley et al., 2009). Among these, the proposal in Finley et al. (2009) overlaps with ours. But while we pursue a low-cost adaptation at every value of θ at which likelihood evaluation is needed, Finley et al. (2009) consider one fixed set of knots adapted to a representative value of θ. Their knot selection algorithm has O(N²m) computing time, which makes it infeasible to run within an optimization or a Markov chain sampler loop.

3 Adaptive predictive process

3.1 Predictive process approximation via Cholesky factorization

Let ω = (ω(t), t ∈ T) be a zero-mean Gaussian process with covariance ψ(s, t).
Suppose the finite set S = {s_1, s_2, ..., s_N} ⊂ T contains all points in T where ω needs to be evaluated for the purpose of model fitting. Let Ψ = ((ψ_ij)) denote the N × N covariance matrix of the Gaussian vector W = (ω(s_1), ..., ω(s_N)). A Cholesky factor Λ of Ψ, with Λ being an N × N upper triangular matrix with non-negative diagonal elements and satisfying Ψ = Λ′Λ, obtains from the following recursive calculations, run over i = 1, ..., N:

λ_ii = ( ψ_ii − Σ_{ℓ<i} λ_{ℓi}² )^{1/2}, λ_ij = ( ψ_ij − Σ_{ℓ<i} λ_{ℓi}λ_{ℓj} ) / λ_ii for j > i, λ_ij = 0 for j < i. (3)

An incomplete Cholesky factorization stops these calculations at i = m for some m < N and retains only the top m rows, producing an m × N matrix Λ̂ with entries

λ̂_ij = ( ψ_ij − Σ_{ℓ<i} λ̂_{ℓi}λ̂_{ℓj} ) / λ̂_ii, 1 ≤ i ≤ m, j ≥ i. (4)

The matrix Λ̂′Λ̂ is precisely the covariance matrix of V = (ν(s_1), ..., ν(s_N)), where ν is the Gaussian predictive process with knots {s_1, ..., s_m}, and the residual variances Var{ξ(s_i)} appear as the tailing squared diagonal terms, giving

κ_S² = max_{m<i≤N} λ̂_ii², where λ̂_ii² = ψ_ii − Σ_{ℓ=1}^m λ̂_{ℓi}² for i > m. (5)

From (2), κ_S² controls the error bounds P(max_{s∈S} |ω(s) − ν(s)| > ε) over the set of interest S. Therefore the above incomplete Cholesky factorization produces a Gaussian predictive process approximation, with readily available error bounds, provided we are happy to choose the knots from the set S. The restriction to S appears reasonable for most applications, with the additional burden on the modeler to identify S carefully. For example, in a Gaussian process regression model with additive noise, it is sufficient to take S to be the training set of covariate values, if only posterior predictive means and variances are needed at test cases. But if the posterior predictive covariance between two test cases, or between a test and a training case, is required, then S should include these test cases as well.

3.2 Pivoting and dynamic stopping

Our quest for an adaptive version of ν stays within this restriction, but employs a dynamic choice of the stopping time m and of the order in which the elements of S are processed. To decide whether the current stopping time is acceptable, we check the current κ_S against a given tolerance κ_tol. If κ_S exceeds the tolerance, we increment m to m + 1 and repeat (4) only for i equal to the new value of m, followed by (5), producing an update of κ_S. The top row elements λ̂_ij, i < m, need no changes. The increment of m and the subsequent alterations to Λ̂ clearly reduce κ_S, as the tailing λ̂_ii's in (5) are reduced. This reduction can be expected to improve if, after incrementing m and before proceeding with the new calculations, one swaps the current m-th and k-th rows of Ψ, where i = k gives the maximum of the tailing λ̂_ii values, i = m, ..., N. A sequence of such swaps, from the start to the terminating m, gives a greedy, dynamic approximation to finding the optimal ordering of the elements of S that achieves κ_S ≤ κ_tol with a minimum stopping time m.

The dynamic swapping is a common feature, known as pivoting, of all leading software packages for Cholesky factorization. If run until m = N, pivoting produces a permutation π = (π_1, ..., π_N) of (1, ..., N) and an upper triangular matrix Λ with non-negative diagonals such that P_π Ψ P_π′ = Λ′Λ, where P_π is the N × N permutation matrix associated with π. Our proposal above simply adds to this pivoted Cholesky factorization a dynamic, tolerance based stopping. The resulting Λ̂ gives the Cholesky factor of the covariance matrix of V = (ν(s_1), ..., ν(s_N)), where ν is the Gaussian predictive process associated with the knots s_{π_1}, ..., s_{π_m}. The Gaussian predictive process ν comes with the error bound (2), with κ_S replaced by κ_tol. The additional computing time needed for pivoting is only O(Nm), a small fraction of the computing time O(Nm²) needed to get the elements of Λ̂ if π were precomputed.
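Algorithm 1 below records the full pseudocode. As a concrete illustration, here is a minimal runnable sketch of the same idea in Python. This is our code, not the paper's; the vectorized covariance interface and all names are our own choices. It returns the knot ordering π, the stopping time m and the top m rows of the pivoted factor, so that R.T @ R approximates the permuted covariance matrix with every residual variance at most κ_tol².

```python
import numpy as np

def pivoted_ichol(points, psi, kappa_tol, m_max=None):
    # points: (N, p) candidate knots; psi(A, B): cross-covariance matrix;
    # kappa_tol: absolute tolerance on kappa_S (compare with Algorithm 1).
    N = len(points)
    m_max = N if m_max is None else m_max
    perm = np.arange(N)
    # Residual variances d_i = Var{xi(s_i)}; initially Var{omega(s_i)}.
    d = np.array([psi(points[i:i + 1], points[i:i + 1])[0, 0]
                  for i in range(N)])
    R = np.zeros((m_max, N))
    k = 0
    while k < m_max and d[perm[k:]].max() > kappa_tol ** 2:
        # Pivot: bring the point with the largest residual variance forward.
        j = k + int(np.argmax(d[perm[k:]]))
        perm[[k, j]] = perm[[j, k]]
        R[k, k] = np.sqrt(d[perm[k]])
        # Row k, as in (4): r_kj = (psi - sum_{l<k} r_lk r_lj) / r_kk.
        cross = psi(points[perm[k:k + 1]], points[perm[k + 1:]]).ravel()
        R[k, k + 1:] = (cross - R[:k, k] @ R[:k, k + 1:]) / R[k, k]
        # Update the tailing residual variances, as in (5).
        d[perm[k + 1:]] -= R[k, k + 1:] ** 2
        k += 1
    return perm, k, R[:k]

def psi_se(A, B, beta=3.0):
    # An assumed squared-exponential covariance for the demo.
    return np.exp(-beta ** 2 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

rng = np.random.default_rng(1)
S = rng.uniform(size=(500, 2))              # candidate knots in [0, 1]^2
perm, m, R = pivoted_ichol(S, psi_se, kappa_tol=1e-2)
print(m)  # knots selected; grows as kappa_tol shrinks or beta grows
```

Each pass costs O(N) for the pivot search and O(Nk) for the new row, giving the O(Nm²) total noted above. The relative-tolerance rescaling of Algorithm 1 amounts to multiplying kappa_tol by max_i Var{ω(s_i)}^{1/2} before the call.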
Algorithm 1 gives pseudocode for performing the incomplete Cholesky factorization with one additional improvisation: the user specified tolerance κ_tol is taken to be a relative tolerance level instead of an absolute one. The absolute tolerance is fixed as κ_tol times the maximum of Var{ω(s_i)}^{1/2}, i = 1, ..., N. This makes sense for Gaussian process approximation, as the maximum variance of the process can be viewed as a scaling parameter.

Algorithm 1 Pivoted, incomplete Cholesky factorization with dynamic stopping.
Require: a covariance function ψ(·, ·), positive integers N and m_max, and a tolerance κ_tol > 0.
  R ← 0_{m_max × N}
  π_1 ← 1, π_2 ← 2, ..., π_N ← N
  k ← 1
  d_l ← ψ(s_{π_l}, s_{π_l}), l = 1, ..., N
  d_max ← max_{1≤l≤N} d_l; l_max ← arg max_{1≤l≤N} d_l
  κ_tol ← √(d_max) κ_tol
  while d_max > κ_tol² and k ≤ m_max do
    swap π_k and π_{l_max}
    r_kk ← d_max^{1/2}
    for j = k + 1 to N do
      r_kj ← { ψ(s_{π_k}, s_{π_j}) − Σ_{l<k} r_{lk} r_{lj} } / r_kk
      d_j ← d_j − r_kj²
    end for
    d_max ← max_{k<l≤N} d_l; l_max ← arg max_{k<l≤N} d_l
    k ← k + 1
  end while
  m ← k − 1; return π, m and the first m rows of R

3.3 Geometric illustration

We illustrate the adaptive choice of knots on the Bartlett experimental forest dataset (Finley et al., 2009). This dataset contains n = 437 well identified forest locations s_1, ..., s_n. At these locations we consider covariance functions of the form

ψ(s, t) = x(s) x(t) exp{ −β² ‖Q(s − t)‖² } (6)

for some β > 0, an orthogonal projection matrix Q and some fixed function x(t) that relates to the slope of the forest landscape at location t. The variance of ω(t) is x(t)², and the correlation between ω(s) and ω(t) depends on the Euclidean distance between the Q-projections of these two location vectors. The parameter β > 0 encodes the spatial range of the covariance, i.e., it controls how rapidly ψ(s, t) decays with the distance between s and t. We demonstrate how the choice of knots according to Algorithm 1 adapts to variations in each of β, x(t) and Q.

Figure 1(a) shows nine choices of (6) that vary in Q and β, while x(t) and κ_tol are held fixed at x(t) ≡ 1 and κ_tol² = 10⁻⁴. The top row has β = 10⁻³, the middle row has β = 5 × 10⁻⁴ and the bottom row has β = 10⁻⁴. The left column has Q equal to the identity matrix, the middle column has Q that projects along the horizontal axis and the right column has Q that projects along the vertical axis (indicated by arrows). It is clear that the algorithm picks fewer knots as the spatial range parameter decreases. A smaller spatial range gives a flatter topology in the canonical metric; consequently fewer points are required to capture the variation in the random surface ω(t). It is also clear that the algorithm adapts to the directional element of this topology. It lines up the knots along the horizontal axis when Q is the horizontal projection and lines up the knots along the vertical axis when Q projects along that axis.

Figure 1(b) shows nine other choices of (6) that vary in Q and x(t), while β and κ_tol are held fixed at β = 10⁻³ and κ_tol² = 10⁻⁴. Variation in Q is as in Figure 1(a). We take x(t) = 1 − F(c · slope(t) | a, b), where F(x | a, b) denotes the gamma distribution function with shape parameter a and rate parameter b, and slope(t) is the slope of the landscape at point t. We fix a and b so that the mean and variance of the gamma distribution match the mean and variance of the recorded slope values at the 437 locations. The top row of Figure 1(b) has c = 0, the middle row has c = 1 and the bottom row has c = 6.
Larger values of c make x(t) closer to zero at regions with a high slope, while x(t) always equals 1 at regions that are flat (shown in lighter color, mostly along a narrow valley running from south-east to north-west). With a larger c, most of the variation in ω(t) is confined to locations t with a flat slope. It is clear that our algorithm adapts to this feature by picking knots from such areas.

Figure 1(c) shows (6) with β = 10⁻⁴, x(t) ≡ 1 and Q the identity matrix, but with four choices of κ_tol² = 10⁻¹, 10⁻², 10⁻⁴ and 10⁻⁸ (clockwise from top left). For smaller tolerance levels, more knots are picked to meet a tighter accuracy condition. A good coverage of the entire region is maintained throughout, but additional knots are chosen to give a denser representation. Note the higher concentration of knots at the boundary than in the interior. This is a consequence of the greedy nature of the algorithm, as it tries to pick the next knot as the point that is least correlated with the ones already selected. Although a better algorithm could correct for such a boundary bias, the linear additional computing cost of the greedy search offers a highly attractive trade-off against a few extra knots. Figure 1(d) indicates that the number of knots m required to meet the tolerance criterion grows at a logarithmic rate with 1/κ_tol². However, theoretical results are not yet available on the relationship between m and κ_tol.

Figure 1: Geometry of knot selection illustrated on Bartlett experimental forest data. (a) Adaptation of knots to changes in Q (columns) and β (rows). (b) The same for changes in Q (columns) and x(s) (rows). (c) The same for changes in κ_tol. (d) Number of knots needed to meet specified accuracy bounds for a particular choice of the covariance.

4 Application to Bayesian computation

4.1 Sparse nonparametric regression

A low cost, adaptive choice of knots is extremely beneficial in Bayesian Markov chain Monte Carlo computations (Robert and Casella, 2004; Gilks et al., 1995), where likelihood evaluation is needed at thousands of different values of the covariance parameters. We illustrate this with a sparse nonparametric regression model, where a good exploration of the space of covariance parameters is critical to model fitting. We show that to efficiently explore the covariance parameter space, it is important to adapt to the changes in the canonical metric caused by the changes in these parameter values. In this regard, the adaptive predictive process approximation proposed here has a clear advantage over the existing approaches to handling knots.

We consider a toy data set consisting of (x_i, y_i), i = 1, ..., n = 10,000, where the x_i = (x_{i1}, ..., x_{ip})′ are drawn independently from the uniform distribution over [0, 1]^p, with p = 10, and the y_i are generated independently as y_i ~ N(2 sin{2πx_{i1}}, 0.1²). We consider a regression model

y_i = µ + τω(x_i) + τε_i, ε_i ~iid N(0, σ²), (7)

with ω modeled as a zero-mean Gaussian process over T = [0, 1]^p. The covariance function of ω is taken to be

ψ(s, t) = ψ(s, t | β) = exp{ −Σ_{j=1}^p β_j² (s_j − t_j)² }, t, s ∈ T.
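This covariance can be coded directly in the same vectorized form assumed by the pivoted_ichol sketch above. The snippet below is our illustration (the data are generated following the recipe above; the function name is ours): each β_j stretches the canonical metric along one coordinate, which is exactly the geometry that the knot selection must adapt to.

```python
import numpy as np

def psi_ard(A, B, beta):
    # psi(s, t | beta) = exp(-sum_j beta_j^2 (s_j - t_j)^2); a large beta_j
    # means omega varies fast along coordinate j and needs more knots there.
    d2 = (((A[:, None, :] - B[None, :, :]) * beta) ** 2).sum(-1)
    return np.exp(-d2)

rng = np.random.default_rng(2)
p, n = 10, 10_000
X = rng.uniform(size=(n, p))                    # x_i ~ Uniform([0, 1]^p)
y = rng.normal(2.0 * np.sin(2.0 * np.pi * X[:, 0]), 0.1)

# With beta_1 large and the rest near zero, the canonical metric is
# essentially one-dimensional, so few knots meet a given tolerance
# (assuming the pivoted_ichol sketch from Section 3.2):
beta = np.array([1.0] + [0.004] * (p - 1))
# perm, m, R = pivoted_ichol(X, lambda A, B: psi_ard(A, B, beta), 1e-2)
```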
For simplicity of exposition, we set µ and τ² at the mean and variance of the observed y_i values, and focus on learning the parameters β_1, ..., β_p and σ². The β_j parameters are assigned standard normal prior distributions, folded onto the positive real line, and log σ² is assigned a standard normal prior distribution. Prior independence across parameters is assumed.

A fixed set of knots over T is clearly infeasible for this application due to the dimensionality of T. Placing only 3 knots along each axis takes the total count to 3¹⁰ = 59,049. While the alternative approach of placing an auxiliary model on the knots and learning them jointly with the covariance parameters via Markov chain sampling can drastically reduce the total number of knots, it is likely to lead to a poor exploration of the covariance parameters for the following reason.

Consider a Gibbs update for β_1 given the remaining parameters and the knots. Suppose the current parameter values are β_1 = β_2 = 0.1, β_3 = ··· = β_p = 0.004 and σ² = exp(−3.5). For these parameter values, a "well learned" choice of the knots is found by applying Algorithm 1 to ψ(s, t | β) with β set at the vector of current values (we use κ_tol² = 10⁻⁴). The corresponding posterior conditional density p(β_1 | β_2, ..., β_p, σ², knots, data) is shown by the solid curve in Figure 2. If, instead, the current values had been β_1 = 1, β_2 = 0.1, β_3 = ··· = β_p = 0.004, σ² = exp(−3.5), and the knots were chosen by applying Algorithm 1 to the corresponding ψ(s, t | β), then the resulting posterior conditional density p(β_1 | β_2, ..., β_p, σ², knots, data) would look like the dashed curve in Figure 2. The two curves are very different even though the values of the conditioning parameters β_2, ..., β_p and σ² are the same. This difference shows that the conditional distribution of the covariance parameters strongly depends on the current set of knots, which, if well learned, should depend on the current values of the covariance parameters. Consequently, there will be additional stickiness in the chain of sampled values of the covariance parameters.

Figure 2: Conditional posterior density of β_1, in the nonparametric regression model, given the other parameters and the knots, for two different choices of the knots.

The proposed adaptive predictive process gets rid of this additional stickiness by automatically adapting the knots to the covariance parameter values. As discussed before, for this application it is reasonable to restrict the search for knots to the set of observed covariate vectors S = {x_1, ..., x_n}. Then one only needs to specify a tolerance level κ_tol to obtain an approximation p̂(y | β, σ²) of the marginal likelihood p(y | β, σ²) = ∫ p(y | ω, σ²) p(dω | β) of (β, σ²). The approximation obtains by replacing ω in the integral with its predictive process approximation ν adapted to the corresponding ψ(s, t | β). The approximate marginal likelihood can be computed in O(nm²) time, where m is the number of knots needed for ψ(s, t | β) at the given tolerance level; detailed formulas are available in Snelson and Ghahramani (2006).
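The O(nm²) cost comes from applying the Woodbury identity and the matrix determinant lemma to the rank-m covariance. The sketch below is our code under our own notation, not the paper's exact formulas (for those, see Snelson and Ghahramani, 2006): it assumes R is the m × n incomplete Cholesky factor of the covariance of (ν(x_1), ..., ν(x_n)) and that the response has been centered and scaled as r = (y − µ)/τ, and it evaluates log N(r | 0, R′R + σ²I) without ever forming an n × n matrix.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def lowrank_gauss_loglik(r, R, sigma2):
    # log N(r | 0, R.T @ R + sigma2 * I_n) in O(n m^2) time.
    m, n = R.shape
    A = sigma2 * np.eye(m) + R @ R.T          # m x m capacitance matrix
    cA = cho_factor(A)
    Rr = R @ r
    # Woodbury: r'(R'R + s2 I)^{-1} r = (r'r - (Rr)' A^{-1} (Rr)) / s2.
    quad = (r @ r - Rr @ cho_solve(cA, Rr)) / sigma2
    # Determinant lemma: det(R'R + s2 I_n) = s2^(n-m) det(A).
    logdet = (n - m) * np.log(sigma2) + 2.0 * np.log(np.diag(cA[0])).sum()
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + quad)
```

Inside a Metropolis-Hastings sweep, each proposed (β, σ²) then triggers one knot selection pass followed by one call to this function, so the per-iteration cost stays O(nm²) with m adapted to the proposed β.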
The approximate posterior density p̂(β, σ² | data) ∝ p̂(y | β, σ²) p(β, σ²) can be explored with common Metropolis-Hastings samplers. Figure 3 reports summaries from a random walk Metropolis sampler exploration with a tolerance level κ_tol² = 10⁻⁴.

Figure 3: A random walk Metropolis sampler exploration of the posterior for the nonparametric regression problem, with associated posterior summaries. (a) Trace plots of the sampled values of β_1, ..., β_10, σ. (b) Histograms of the same. (c) Posterior median (solid line) and 95% credible intervals (dashed lines) for f(x) at x = (i/100, 0, ..., 0), i = 0, ..., 100, overlaid on the data scatter along x_1 and y. (d) Values of m along the run of the sampler, shown at every 100th sweep.

4.2 Varying coefficient regression for spatial data

Pace and Barry (1997) use county level data from the 1980 United States presidential election to relate voter turnout to education, income and homeownership standards. They use a spatial autoregressive model to allow this relation to vary geographically. We pursue an alternative formulation with a Gaussian process spatial regression model.

For county i = 1, ..., n, let V_i, P_i, E_i, I_i and H_i denote, respectively, the voter count, the population size of age 18 years or more, the population size with at least high school education, the aggregate income and the number of owner occupied housing units. We take the log-percentage voter count y_i = log(V_i/P_i) as the response variable. Three predictor variables are defined as x_{1i} = log(E_i/P_i), x_{2i} = log(I_i/P_i) and x_{3i} = log(H_i/P_i). We relate the response to the predictors through a spatially varying regression model

y_i = ω_{0i} + x_{1i}ω_{1i} + x_{2i}ω_{2i} + x_{3i}ω_{3i} + ε_i, ε_i ~iid N(0, σ²). (8)

To ensure that geographically proximate counties have similar coefficients, the vector of coefficients from all counties (ω_{j1}, ..., ω_{jn}), for each j = 0, 1, 2, 3, is taken to be a zero mean multivariate normal with Cov(ω_{ji}, ω_{jk}) = τ_j² exp{−β_j²‖t_i − t_k‖²}, where t_i is the spatial location vector of county i, given by the latitude-longitude pair of the county centroid. These four vectors are taken to be mutually independent. The above formulation is equivalent to

y_i = ω(t_i) + ε_i, ε_i ~iid N(0, σ²), (9)

where ω(t) is a zero-mean Gaussian process on T = {t_1, ..., t_n} with covariance function ψ(s, t | θ) = Σ_{j=0}^3 x_j(t) x_j(s) τ_j² exp{−β_j²‖s − t‖²}, s, t ∈ T, with x_0(t_i) ≡ 1, x_j(t_i) = x_{ji}, etc., which depends upon the parameter vector θ = (τ_0, τ_1, τ_2, τ_3, β_0, β_1, β_2, β_3). These covariance parameters are each assigned a standard normal prior distribution, folded onto the positive half of the real line, independently of each other. We also assign log σ² a standard normal prior distribution, taken to be a priori independent of θ.

We fit this model using 1040 randomly chosen counties, roughly one third of the total count. Figure 4 shows a random walk Metropolis sampler exploration of the approximate posterior density p̂(θ, σ² | y) ∝ p̂(y | θ, σ²) p(θ) p(σ²), with κ_tol² = 10⁻⁴ and S taken to be the set of locations of the counties included in model fitting. Trace plots in Figure 4(a) indicate good convergence of the sampler.
The marginal posterior distributions of the model parameters appear unimodal (Figure 4(b)). Figure 4(c) shows knot locations from two different sweeps of the sampler. It is interesting that the knots are not uniformly distributed over the whole country. The sampled values of m, shown in Figure 4(d), indicate that, unlike in the regression example of the previous section, a substantial fraction (∼15%-35%) of the points from S are needed to meet the accuracy bound.

Figure 4: A random walk Metropolis exploration of the posterior for the election turnout analysis. (a) Trace plots of model parameters. (b) Histograms of the same. (c) Knot locations at two distant sweeps of the sampler. (d) Values of m along the sampler.

Figure 5 shows summaries of our model fit. Figure 5(a) shows the percentage voter turnout (bottom) and the predicted percentage turnout (top) for all 3106 counties. The predicted value for county i is calculated as the Monte Carlo approximation to exp[E{ω(t_i) | data}]. Figure 5(b) combines the values from Figure 5(a) into a scatter plot, split by counties included in model fitting (green dots) and the remaining ones (red dots). Figure 5(c) shows the spatially varying regression coefficients E{ω_{ji} | data} on the predictors x_{ji}, j = 1, 2, 3, for all counties i. Figure 5(d) shows the effect sizes of these coefficients, computed as E{ω_{ji} | data} / √Var{ω_{ji} | data}.

Figure 5: Posterior summary for the election turnout analysis. (a) Observed (bottom) and predicted (top) percentage turnout shown by counties. (b) Scatter plot of the same; green dots in the background mark training data while red dots in the foreground are for held out data. (c) Estimated coefficients for the three predictors shown by counties. (d) Effect sizes of the three predictors shown by counties.

From Figure 5 we see that predictor x_3, which relates to home ownership, has a strong, spatially varying influence on voter turnout. This predictor has a positive coefficient for all counties, with larger values and effect sizes for counties in the southeast. Predictor x_1, which relates to education, has a moderate, spatially varying influence, with about 1/3 of the counties having an effect size larger than 2. It is interesting that the coefficients of x_1 and x_3 appear to be inversely related to each other. Predictor x_2, which relates to income, has a weak influence, with effect size less than 2 in absolute value for more than 95% of the counties. However, the spatial variation of the coefficient of x_2 is more intricate and wavy than that of the other two predictors. Although we do not try to interpret these variations, we note that the spatial regression model indeed provides a better fit than an ordinary linear regression that relates y to x_1, x_2 and x_3. The root mean square prediction error from the ordinary linear regression, calculated over the counties not included in model fitting, equals 0.14. The same statistic for the spatial regression model is 0.11.

5 Discussion

We have addressed the question of knot selection within the predictive process methodology and have offered proposals that can substantially automatize the implementation of Gaussian process models in Bayesian analysis.
A key conceptual contribution of our work is the emphasis on error bounds to derive a finite rank approximation of an infinite dimensional Gaussian process. It must be noted that the accuracy bounds we provide are all a priori. That is, we cannot provide an accuracy bound on how well the posterior distribution under the predictive process model approximates the posterior distribution under the original Gaussian process model. However, Tokdar (2007) provides some useful theoretical calculations toward this.

We note that approaches that use an auxiliary model on knots are fundamentally different from our deterministic choice of knots driven by accuracy bounds. Our approach clearly stays within the limits of an approximating method: the smaller the specified tolerance, the closer we are to the original Gaussian process. Approaches with a model on the knots essentially define a different stochastic process. The new stochastic process could indeed provide a better model for the given task, as discussed in Snelson and Ghahramani (2006). However, it would be erroneous to assume that the theoretical properties of a Gaussian process model would also apply to this new stochastic process model.

It is important to realize that at the crux of a predictive process approximation is the rank deficiency of the covariance matrix of the Gaussian vector W = (ω(s_1), ..., ω(s_N)), which is a manifestation of the underlying smoothness of the process ω. For Gaussian process models that employ a relatively un-smooth ω, such as spatial autoregressive models for lattice data (Besag, 1974; Rue et al., 2009), the predictive process may not be the ideal approximating tool. Indeed, in such cases, the covariance matrix of W need not be rank deficient, but can have special banded structures that allow other approximation techniques to yield O(Nm²) computing.

For a smooth ω, the rank deficiency of the covariance matrix of W poses an additional problem. The covariance matrix, irrespective of its size, can be ill conditioned, making numerical computations unstable. The dynamic, tolerance based stopping of the adaptive predictive process can solve this problem to a large extent, because the knot finding algorithm is likely to terminate before the covariance matrix of (ω(s_{π_1}), ..., ω(s_{π_m})) becomes ill conditioned.

A Technical details

Proof of Theorem 2.1. Let Ψ(x) = (1/3)e^{x²}, x ≥ 0. Ψ(x) is increasing and convex with Ψ(0) ∈ (0, 1). For any random variable Z, its Ψ-Orlicz norm (Pollard, 1990, page 3) is defined as
Therefore, for any s, t ∈ T , k ξ ( s ) − ξ ( t ) k Ψ = 1 . 5[V ar { ξ ( s ) − ξ ( t ) } ] 1 / 2 . Therefore (10) holds if we tak e d ( s, t ) = 1 . 5[V ar { ξ ( s ) − ξ ( t ) } ] 1 / 2 . T o calculate the right hand side of (10), fix t 0 to b e any of the knots used in defining ν . Then ξ ( t 0 ) = 0 and consequen tly the first term k ξ ( t 0 ) k Ψ = 0. T o calculate the integral in the second term, furst note that d ( s, t ) 2 = 2 . 25V ar { ξ ( s ) − ξ ( t ) } ≤ 2 . 25V ar { ω ( s ) − ω ( t ) } ≤ 2 . 25 c 2 k s − t k 2 where the first inequalit y holds b ecause ω = ν + ξ with ν and ξ indep enden t, and the second inequality follo ws from our assumption on ω . Therefore D ( , T , d ) ≤ { 1 + 1 . 5 c ( b − a ) / } d . Next, b ound the diameter ∆ as follo ws ∆ 2 = 2 . 25 sup s,t ∈ T V ar { ξ ( s ) − ξ ( t ) } ≤ 2 . 25 sup s,t ∈ T 2[V ar { ξ ( s ) } + V ar { ξ ( t ) } ] ≤ 9 κ 2 . (12) No w use log(1 + x ) ≤ x to b ound the right hand side of (10) by 9 Z 3 κ 0 p p log { 1 + 1 . 5 c ( b − a ) / } d ≤ 9 p 1 . 5 pc ( b − a ) Z 3 κ 0  − 1 / 2 d = B κ 1 / 2 as desired. (ii) No w, to calculate k sup t ∈ S | ξ ( t ) k Ψ , note that the condition of Lemma 3.4 of Pollard (1990) is trivially satisfied with d ( s, t ) = k ξ ( s ) − ξ ( t ) k Ψ = 1 . 5[V ar { ξ ( s ) − ξ ( t ) } ] 1 / 2 due to discreteness of S and D ( , T , d ) ≤ | S | for all  > 0. Therefore w e can apply (10) with S instead of T to conclude k sup t ∈ S | ξ ( t ) |k Ψ ≤ ∆ p 2 + log | S | where ∆ 2 = sup s,t ∈ S d ( s, t ) 2 ≤ 9 κ 2 S as in (12). 17 References Adler, R. J. (1990). An intr o duction to c ontinuity, extr ema, and r elate d topics for gener al Gaussian pr o c esses , V olume 12. Hayw ard, CA: Institute of Mathematical Statistics. Banerjee, S., A. E. Gelfand, A. O. Finley , and H. Sang (2008). Gaussian predictiv e pro cess mo dels for large spatial data sets. Journal of the R oyal Statistic al So ciety Series B 70 (4), 825–848. Besag, J. E. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal of the R oyal Statistic al So ciety, Series B 36 , 192–209. Choi, T. and M. Sc hervish (2007). On posterior consistency in nonparametric regression problems. Journal of Multivariate Analysis 98 (10), 1969–1987. Csat´ o, L., E. F ok ou´ e, M. Opp er, B. Sc hottky , and O. Winther (2000). Efficient approac hes to gaussian pro cess classification. In S. A. Solla, T. K. Leen, and K.-R. M ¨ uller (Eds.), A dvanc es in Neur al Information Pr o c essing Systems , V olume 12, Cambr idge, MA. The MIT Press. Finley , A. O., H. Sang, S. Banerjee, and A. E. Gelfand (2009). Impro ving the p erformance of predictiv e pro cess mo deling for large datasets. Computational Statistics & Data Analy- sis 53 (8), 2873–2884. F oster, L., A. W aagen, N. Aijaz, M. Hurley , A. Luis, J. Rinsky , C. Saty a volu, M. J. W ay , P . Gazis, and A. Sriv asta v a (2009). Stable and efficient gaussian pro cess calculations. The Journal of Machine L e arning R ese ar ch 10 , 857–882. Ghosal, S. and A. Roy (2006). P osterior consistency of gaussian pro cess prior for nonpara- metric binary regression. The A nnals of Statistics 34 , 2413–2429. Gilks, W., S. Richardson, and D. Spiegelhalter (1995). Markov Chain Monte Carlo in Pr ac- tic e: Inter disciplinary Statistics . Chapman and Hall/CR C. Guhaniy ogi, R., A. O. Finley , S. Banerjee, and A. E. Gelfand (2010). Adaptiv e gaussian predictiv e pro cess mo del for large spatial data sets. American Statistical Association Joint Statistical Meeting. August 2, 2010. 
Vancouver, British Columbia.

Handcock, M. S. and M. L. Stein (1993). A Bayesian analysis of kriging. Technometrics 35, 403–410.

Kennedy, M. and A. O'Hagan (2001). Bayesian calibration of computer models (with discussion). Journal of the Royal Statistical Society, Series B 63, 425–464.

Kim, H.-M., B. K. Mallick, and C. C. Holmes (2005). Analyzing nonstationary spatial data using piecewise Gaussian processes. Journal of the American Statistical Association 100, 653–668.

Neal, R. M. (1998). Regression and classification using Gaussian process priors. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith (Eds.), Bayesian Statistics, Volume 6, pp. 475–501. Oxford University Press.

Oakley, J. and A. O'Hagan (2002). Bayesian inference for the uncertainty distribution of computer model outputs. Biometrika 89, 769–784.

Pace, R. K. and R. Barry (1997). Quick computation of spatial autoregressive estimators. Geographical Analysis 29, 232–247.

Petrone, S., M. Guindani, and A. E. Gelfand (2009). Hybrid Dirichlet mixture models for functional data. Journal of the Royal Statistical Society, Series B 71(4), 755–782.

Pollard, D. (1990). Empirical Processes: Theory and Applications, Volume 2. Institute of Mathematical Statistics and American Statistical Association.

Quiñonero-Candela, J. and C. E. Rasmussen (2005). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research 6, 1939–1959.

Rasmussen, C. E. and C. K. I. Williams (2006). Gaussian Processes for Machine Learning. The MIT Press.

Robert, C. and G. Casella (2004). Monte Carlo Statistical Methods (2nd ed.). Springer.

Rue, H., S. Martino, and N. Chopin (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society, Series B 71, 319–392.

Schwaighofer, A. and V. Tresp (2003). Transductive and inductive methods for approximate Gaussian process regression. In S. Becker, S. Thrun, and K. Obermayer (Eds.), Advances in Neural Information Processing Systems, Volume 15, pp. 953–960. Cambridge, MA: The MIT Press.

Seeger, M., C. K. I. Williams, and N. Lawrence (2003). Fast forward selection to speed up sparse Gaussian process regression. In C. M. Bishop and B. J. Frey (Eds.), Ninth International Workshop on Artificial Intelligence and Statistics. Society for Artificial Intelligence and Statistics.

Shi, J. Q. and B. Wang (2008). Curve prediction and clustering with mixtures of Gaussian process functional regression models. Statistics and Computing 18, 267–283.

Short, M. B., D. M. Higdon, and P. P. Kronberg (2007). Estimation of Faraday rotation measures of the near Galactic sky using Gaussian process models. Bayesian Analysis 2, 665–680.

Smola, A. J. and P. L. Bartlett (2001). Sparse greedy Gaussian process regression. In Advances in Neural Information Processing Systems, Volume 13, pp. 619–625. Cambridge, MA: The MIT Press.

Snelson, E. and Z. Ghahramani (2006). Sparse Gaussian processes using pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. Platt (Eds.), Advances in Neural Information Processing Systems, Volume 18. Cambridge, MA: The MIT Press.

Sudderth, E. and M. Jordan (2009). Shared segmentation of natural scenes using dependent Pitman-Yor processes. In D. Koller, D. Schuurmans, Y.
Bengio, and L. Bottou (Eds.), Advances in Neural Information Processing Systems 21. The MIT Press.

Tokdar, S. T. (2007). Towards a faster implementation of density estimation with logistic Gaussian process priors. Journal of Computational and Graphical Statistics 16, 633–655.

Tokdar, S. T. and J. K. Ghosh (2007). Posterior consistency of logistic Gaussian process priors in density estimation. Journal of Statistical Planning and Inference 137, 34–42.

Tokdar, S. T. and J. B. Kadane (2011). Simultaneous linear quantile regression: A semiparametric Bayesian approach. Duke Statistical Science Discussion Paper #12.

Tokdar, S. T., Y. M. Zhu, and J. K. Ghosh (2010). Density regression with logistic Gaussian process priors and subspace projection. Bayesian Analysis 5(2), 316–344.

van der Vaart, A. W. and J. H. van Zanten (2008). Rates of contraction of posterior distributions based on Gaussian process priors. The Annals of Statistics 36, 1435–1463.

van der Vaart, A. W. and J. H. van Zanten (2009). Adaptive Bayesian estimation using a Gaussian random field with inverse gamma bandwidth. The Annals of Statistics 37(5B), 2655–2675.

Zhu, Z. and M. Stein (2005). Spatial sampling design for parameter estimation of the covariance function. Journal of Statistical Planning and Inference 134, 583–603.

Zimmerman, D. (2006). Optimal network design for spatial prediction, covariance parameter estimation, and empirical prediction. Environmetrics 17, 635–652.
