Generalized and Scalable Deep Gaussian Process Emulation
Authors: Deyu Ming, Daniel Williamson
Generalized and Scalable Deep Gaussian Process Emulation

PREPRINT

Deyu Ming∗
School of Management, University College London, UK

Daniel Williamson
Land, Environment, Economics and Policy Institute, University of Exeter, UK

March 26, 2026

ABSTRACT

Gaussian process (GP) emulators have become essential tools for approximating complex simulators, significantly reducing computational demands in optimization, sensitivity analysis, and model calibration. While traditional GP emulators effectively model continuous and Gaussian-distributed simulator outputs with homogeneous variability, they typically struggle with discrete, heteroskedastic Gaussian, or non-Gaussian data, limiting their applicability to increasingly common stochastic simulators. In this work, we introduce a scalable Generalized Deep Gaussian Process (GDGP) emulation framework designed to accommodate simulators with heteroskedastic Gaussian outputs and a wide range of non-Gaussian response distributions, including Poisson, negative binomial, and categorical distributions. The GDGP framework leverages the expressiveness of DGPs and extends them to latent GP structures, enabling it to capture the complex, non-stationary behavior inherent in many simulators while also modeling non-Gaussian simulator outputs. We make GDGP scalable by incorporating the Vecchia approximation for settings with a large number of input locations, while also developing efficient inference procedures for handling large numbers of replicates. In particular, we present methodological developments that further enhance the computation of the approach for heteroskedastic Gaussian responses. We demonstrate through a series of synthetic and empirical examples that these extensions enable the practical application of GDGP emulators and provide a unified methodology capable of addressing diverse modeling challenges. The proposed GDGP framework is implemented in the open-source R package dgpsi.
Keywords: surrogate modeling, stochastic simulator, heteroskedasticity, non-Gaussian responses

1 Introduction

Gaussian process (GP) emulators are widely used statistical surrogates for computationally expensive computer simulators that enable tasks such as optimization, sensitivity analysis, and calibration to be performed without embedding the simulator directly. Although many simulators are deterministic, GPs express uncertainty in their outputs by assigning any finite collection of outputs a multivariate Gaussian distribution, with correlation controlled by a kernel function.

∗Corresponding author: deyu.ming.16@ucl.ac.uk.

However, many simulators have discrete outputs or other forms for which Gaussian models are not appropriate, and the standard approach is to use a latent GP surface with a link function that transforms it to the outcomes in the likelihood. Examples from the literature have included binary responses (Chang et al. 2016), count modeling (Salter et al. 2025), and strictly non-negative functions (Spiller et al. 2023). Given the increasing use of stochastic simulators, such as agent-based models, Baker et al. (2022) reviewed state-of-the-art GP-based methods for analyzing such simulators. However, most existing methods (Goldberg et al. 1997, Binois et al. 2018), together with recent developments and applications (Cole et al. 2022, Murph et al. 2024, Yi & Taflanidis 2024, Patil et al. 2025), focus on heteroskedastic GP emulation, in which simulator outputs are treated as continuous observations with heteroskedastic Gaussian noise. Non-stationarity in GP emulation has been addressed separately in the literature through more flexible constructions such as deep Gaussian processes (DGPs; Salimbeni & Deisenroth 2017, Sauer et al. 2020, Ming et al. 2023), treed GPs (Gramacy & Lee 2008), and related extensions.
However, these developments have largely been pursued independently of the "non-Gaussian" simulator literature and have mostly focused on deterministic or continuous-response settings. To the best of our knowledge, only very recent work (Cooper et al. 2026) has started to examine the combination of DGPs with binary outputs in a fully Bayesian framework. More generally, the existing literature has tended to produce model-specific methods designed for particular response types, rather than a unified approach applicable across a broad range of output distributions. This leaves an important gap in the emulation literature: there is currently no general framework that combines the flexibility of DGP emulators with the ability to handle diverse non-Gaussian and heteroskedastic Gaussian outputs.

In this work, we propose a Generalized Deep Gaussian Process (GDGP) emulation framework for stochastic simulators whose outputs may follow a wide class of non-Gaussian distributions, including Poisson, negative binomial, and categorical distributions, among others. By exploiting the flexibility of DGP emulators, the proposed framework not only accommodates different response distributions, but also captures non-stationary behaviors when present.

From a computational perspective, our work further develops GDGP by enhancing its scalability. Two widely used approaches for scalable GP inference are the sparse inducing-point approximation (Titsias 2009) and the Vecchia approximation (Vecchia 1988, Katzfuss et al. 2020, Katzfuss & Guinness 2021). The former was extended to DGPs in the variational framework of Salimbeni & Deisenroth (2017), while the latter was shown by Sauer et al. (2023) to achieve strong performance for DGP emulation in a fully Bayesian setting. Motivated by the computational advantages of Stochastic Imputation (SI; Ming et al.
2023) as an approximation to fully Bayesian DGP inference, we derive Vecchia-based approximations for SI to enable efficient inference of GDGP for problems with large numbers of input locations, and further show that SI can efficiently accommodate large numbers of replicates. In addition, for the heteroskedastic Gaussian case, we derive a suite of closed-form expressions that provide further computational savings.

The remainder of the manuscript is organized as follows. Section 2 reviews DGPs, followed by Section 3, which introduces GDGPs and describes SI-based inference for prediction and training. Section 4 presents scalability extensions of GDGP for settings with large numbers of input locations and replicates, together with further methodological developments for the heteroskedastic Gaussian case. Section 5 reports a series of experiments covering heteroskedastic Gaussian, categorical, and count distributions. Finally, Section 6 concludes the manuscript.

2 Review of Deep Gaussian Processes

In this section, we review the DGP formulation of Ming et al. (2023), since its inference extends naturally to the GDGP framework presented in Section 3. A DGP, whose hierarchical architecture is shown in Figure 1, is defined as an $L$-layer feed-forward network of stationary GPs. The model takes $\mathbf{x} \in \mathbb{R}^{N \times D}$, consisting of $N$ input points in $D$ dimensions, and produces $\mathbf{F} \in \mathbb{R}^{N \times Q}$ as the output, where the $p$-th column $\mathbf{F}^{(p)}$ denotes the $p$-th output of the network for $p = 1, \dots, Q$. Let $\{\mathcal{GP}^{(p)}_l\}_{p=1,\dots,P_l,\, l=1,\dots,L}$ with $P_L = Q$ be the stationary GPs that form the DGP, and let $\mathbf{W}^{(p)}_l \in \mathbb{R}^{N \times 1}$ be the output of $\mathcal{GP}^{(p)}_l$ (with $\mathbf{W}^{(p)}_L = \mathbf{F}^{(p)}$ for $p = 1, \dots, Q$). Then the $\{\mathbf{W}^{(p)}_l\}_{p=1,\dots,P_l}$ at a given layer $l$ are assumed independent and multivariate normally distributed:
$$\mathbf{W}^{(p)}_l \,|\, \mathbf{W}_{l-1} \overset{\text{ind}}{\sim} \mathcal{N}\big(\mathbf{0},\, \boldsymbol{\Sigma}^{(p)}_l(\mathbf{W}_{l-1})\big), \qquad (1)$$
for all $p = 1, \dots, P_l$, where $\mathbf{W}_{l-1} = (\mathbf{W}^{(1)}_{l-1}, \dots, \mathbf{W}^{(P_{l-1})}_{l-1}) \in \mathbb{R}^{N \times P_{l-1}}$ with $\mathbf{W}_0 = \mathbf{x}$, and $\boldsymbol{\Sigma}^{(p)}_l(\mathbf{W}_{l-1}) = (\sigma^{(p)}_l)^2 \mathbf{R}^{(p)}_l(\mathbf{W}_{l-1}) \in \mathbb{R}^{N \times N}$ is the covariance matrix, with $\mathbf{R}^{(p)}_l(\mathbf{W}_{l-1})$ the correlation matrix. The $ij$-th element of $\mathbf{R}^{(p)}_l(\mathbf{W}_{l-1})$ is specified by $k^{(p)}_l(\mathbf{W}_{l-1,i*}, \mathbf{W}_{l-1,j*}) + \eta \mathbb{1}\{i=j\}$, where $k^{(p)}_l(\cdot,\cdot)$ is a known kernel function, $\eta$ is the nugget term, and $\mathbb{1}\{\cdot\}$ is the indicator function. The kernel function $k^{(p)}_l(\cdot,\cdot)$ is specified in the following multiplicative form:
$$k^{(p)}_l(\mathbf{W}_{l-1,i*}, \mathbf{W}_{l-1,j*}) = \prod_{d=1}^{P_{l-1}} k^{(p)}_{l,d}(W_{l-1,id}, W_{l-1,jd}),$$
where $k^{(p)}_{l,d}(\cdot,\cdot)$ is a one-dimensional kernel function for the $d$-th dimension of the input $\mathbf{W}_{l-1}$ to $\mathcal{GP}^{(p)}_l$. Two common choices of $k^{(p)}_{l,d}(\cdot,\cdot)$ are the squared exponential and Matérn kernels. For notational convenience, in the remainder of the manuscript we write $\{\mathcal{GP}^{(p)}_l\}$ and $\{\mathbf{W}^{(p)}_l\}$ as shorthand for $\{\mathcal{GP}^{(p)}_l\}_{p=1,\dots,P_l,\, l=1,\dots,L}$ and $\{\mathbf{W}^{(p)}_l\}_{p=1,\dots,P_l,\, l=1,\dots,L}$, respectively.

Figure 1: The hierarchy of the DGP model that produces $\mathbf{F}$ given the input $\mathbf{x}$.

3 Generalized Deep Gaussian Processes

The DGP formulation reviewed in Section 2 is capable of emulating non-stationary computer models with deterministic responses, as well as stochastic responses with homogeneous Gaussian behavior.
To generalize this framework to stochastic computer models with heteroskedastic or non-Gaussian responses, we draw on the idea underlying generalized linear models (GLMs) by introducing a likelihood layer at the top of the DGP hierarchy (see Figure 2), thereby explicitly modeling the stochasticity and distributional form of the outputs. Let $\mathbf{Y} = (Y_1, \dots, Y_N)^\top$ denote a vector of $N$ outputs with corresponding inputs $\mathbf{x}$. We assume the $Y_i$, for $i = 1, \dots, N$, are conditionally independent with probability density function (PDF) $p(y_i \,|\, \boldsymbol{\phi}_i)$, given a vector of $Q$ distributional parameters $\boldsymbol{\phi}_i = (\phi_{i1}, \dots, \phi_{iQ})$. Let $\boldsymbol{\Phi} = (\boldsymbol{\phi}_1^\top, \dots, \boldsymbol{\phi}_N^\top)^\top$, and let $g_q$, for $q = 1, \dots, Q$, denote known monotonic link functions. Then the $q$-th column of $\boldsymbol{\Phi}$, denoted by $\boldsymbol{\Phi}_{*q}$, is linked to the DGP output through
$$g_q(\boldsymbol{\Phi}_{*q}) = \mathbf{F}^{(q)}, \qquad (2)$$
for $q = 1, \dots, Q$, where $\mathbf{F}^{(q)}$ is the $q$-th column of $\mathbf{F}$, the output of the DGP network introduced in Section 2.

Figure 2: The hierarchy of the GDGP. The $\mathcal{L}$ node is the likelihood layer that represents the distributional relation between $\mathbf{F}$ and $\mathbf{Y}$.

Since the GDGP may be viewed as a DGP with a non-GP node in the final layer, inference can be implemented naturally within the SI framework developed for DGPs. The SI framework was introduced as a natural approximation to fully Bayesian inference for DGPs, analogous to the common practice in GP modeling of first estimating the correlation lengthscales and then conditioning on them for posterior inference and prediction. A comparison of SI with fully Bayesian inference (Sauer et al. 2020) and with variational inference (Salimbeni & Deisenroth 2017) for DGPs is provided in Ming et al. (2023). In the remainder of this section, we present the details of the three components of SI inference for GDGP, namely prediction (Section 3.1), imputation (Section 3.2), and training (Section 3.3).
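To make the role of the likelihood layer in (2) concrete, the following minimal Python sketch shows two illustrative choices of link: a Poisson response with a log link ($Q = 1$), and a heteroskedastic Gaussian response with an identity link on the mean and a log link on the variance ($Q = 2$). The function names and setup are hypothetical illustrations, not part of the dgpsi package.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative likelihood layers in the sense of (2): each column of the latent
# DGP output F parameterizes the response distribution through a monotonic link.
def sample_poisson_layer(f):
    """Q = 1, log link: phi = g^{-1}(f) = exp(f) is the Poisson rate."""
    return rng.poisson(np.exp(f))

def sample_hetgp_layer(f1, f2):
    """Q = 2: identity link on the mean, log link on the variance."""
    return rng.normal(loc=f1, scale=np.exp(0.5 * f2))

f = np.array([-1.0, 0.0, 2.0])          # latent DGP outputs at three inputs
y_counts = sample_poisson_layer(f)      # non-negative integer responses
y_hetero = sample_hetgp_layer(f, f)     # heteroskedastic Gaussian responses
```

Swapping the sampling line for a negative binomial or categorical draw changes only the likelihood node, leaving the latent DGP untouched.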
3.1 Prediction

The SI framework leverages the linked Gaussian process (LGP; Kyzyurova et al. 2018, Ming & Guillas 2021) to give a closed-form expression for the posterior predictive distribution. Let $\boldsymbol{\theta}^{(p)}_l = \{(\sigma^{(p)}_l)^2, \boldsymbol{\gamma}^{(p)}_l\}$ with $\boldsymbol{\gamma}^{(p)}_l = (\gamma^{(p)}_{l,1}, \dots, \gamma^{(p)}_{l,P_{l-1}})^\top$ be the model parameters in $\mathcal{GP}^{(p)}_l$, and assume that the $\boldsymbol{\theta}^{(p)}_l$ are known and distinct for all $p = 1, \dots, P_l$ and $l = 1, \dots, L$. Then, given a realization $\mathbf{y}$ of the output $\mathbf{Y}$, the posterior predictive distribution of the GDGP output $Y_0(\mathbf{x}_0)$ at a new input $\mathbf{x}_0$ can be approximated as follows:
$$
\begin{aligned}
p(y_0 \,|\, \mathbf{x}_0; \mathbf{y}, \mathbf{x}) &= \int p(y_0 \,|\, \mathbf{x}_0; \mathbf{y}, \mathbf{f}, \{\mathbf{w}^{(p)}_l\}, \mathbf{x})\, p(\mathbf{f}, \{\mathbf{w}^{(p)}_l\} \,|\, \mathbf{y}, \mathbf{x})\, \mathrm{d}\mathbf{f}\, \mathrm{d}\{\mathbf{w}^{(p)}_l\} \\
&= \int \left[\int p(y_0 \,|\, \mathbf{f}_0)\, p(\mathbf{f}_0 \,|\, \mathbf{x}_0; \mathbf{f}, \{\mathbf{w}^{(p)}_l\}, \mathbf{x})\, \mathrm{d}\mathbf{f}_0\right] p(\mathbf{f}, \{\mathbf{w}^{(p)}_l\} \,|\, \mathbf{y}, \mathbf{x})\, \mathrm{d}\mathbf{f}\, \mathrm{d}\{\mathbf{w}^{(p)}_l\} \\
&= \mathbb{E}_{\mathbf{F}, \{\mathbf{W}^{(p)}_l\} \,|\, \mathbf{y}, \mathbf{x}}\left[\int p(y_0 \,|\, \mathbf{f}_0)\, p(\mathbf{f}_0 \,|\, \mathbf{x}_0; \mathbf{F}, \{\mathbf{W}^{(p)}_l\}, \mathbf{x})\, \mathrm{d}\mathbf{f}_0\right] \\
&\doteq \frac{1}{K}\sum_{k=1}^{K} \int p(y_0 \,|\, \mathbf{f}_0)\, p(\mathbf{f}_0 \,|\, \mathbf{x}_0; \mathbf{f}_k, \{\mathbf{w}^{(p)}_l\}_k, \mathbf{x})\, \mathrm{d}\mathbf{f}_0 \\
&\doteq \frac{1}{K}\sum_{k=1}^{K} \int p(y_0 \,|\, \mathbf{f}_0) \prod_{q=1}^{Q} \widehat{p}\big(f^{(q)}_0 \,|\, \mathbf{x}_0; \mathbf{f}_k, \{\mathbf{w}^{(p)}_l\}_k, \mathbf{x}\big)\, \mathrm{d}\mathbf{f}_0 \\
&= \frac{1}{K}\sum_{k=1}^{K} \widehat{p}\big(y_0 \,|\, \mathbf{x}_0; \mathbf{f}_k, \{\mathbf{w}^{(p)}_l\}_k, \mathbf{x}\big),
\end{aligned}
$$
where $\prod_{q=1}^{Q} \widehat{p}(f^{(q)}_0 \,|\, \mathbf{x}_0; \mathbf{f}_k, \{\mathbf{w}^{(p)}_l\}_k, \mathbf{x})$ is the LGP approximation to $p(\mathbf{f}_0 \,|\, \mathbf{x}_0; \mathbf{f}_k, \{\mathbf{w}^{(p)}_l\}_k, \mathbf{x})$. Given realizations $\mathbf{f}_k$ and $\{\mathbf{w}^{(p)}_l\}_k$ of $\mathbf{F}$ and $\{\mathbf{W}^{(p)}_l\}$, respectively, $\widehat{p}(f^{(q)}_0 \,|\, \mathbf{x}_0; \mathbf{f}_k, \{\mathbf{w}^{(p)}_l\}_k, \mathbf{x})$ for $q = 1, \dots, Q$ defines a univariate normal distribution with analytically tractable mean $\mu^{(q)}_{1\to L,k}(\mathbf{x}_0)$ and variance $\sigma^{2(q)}_{1\to L,k}(\mathbf{x}_0)$ that can be obtained by iterating the following formulae:
$$\mu^{(q)}_{1\to l,k}(\mathbf{x}_0) = \mathbf{I}^{(q)}_{l,k}(\mathbf{x}_0)^\top \big(\mathbf{R}^{(q)}_{l,k}\big)^{-1} \mathbf{w}^{(q)}_{l,k}, \qquad (3)$$
$$\sigma^{2(q)}_{1\to l,k}(\mathbf{x}_0) = \mathbf{w}^{(q)\top}_{l,k} \big(\mathbf{R}^{(q)}_{l,k}\big)^{-1} \mathbf{J}^{(q)}_{l,k}(\mathbf{x}_0) \big(\mathbf{R}^{(q)}_{l,k}\big)^{-1} \mathbf{w}^{(q)}_{l,k} - \Big(\mathbf{I}^{(q)}_{l,k}(\mathbf{x}_0)^\top \big(\mathbf{R}^{(q)}_{l,k}\big)^{-1} \mathbf{w}^{(q)}_{l,k}\Big)^2 + \big(\sigma^{(q)}_l\big)^2\Big(1+\eta^{(q)}_l - \mathrm{tr}\Big(\big(\mathbf{R}^{(q)}_{l,k}\big)^{-1} \mathbf{J}^{(q)}_{l,k}(\mathbf{x}_0)\Big)\Big) \qquad (4)$$
for $l = 2, \dots, L$ and $q = 1, \dots, P_l$, where the $i$-th element of $\mathbf{I}^{(q)}_{l,k}(\mathbf{x}_0) \in \mathbb{R}^{N\times 1}$ is given by
$$\prod_{d=1}^{P_{l-1}} \xi^{(q)}_l\Big(\mu^{(d)}_{1\to(l-1),k}(\mathbf{x}_0),\, \sigma^{2(d)}_{1\to(l-1),k}(\mathbf{x}_0),\, (\mathbf{w}^{(d)}_{l-1,k})_i\Big),$$
the $ij$-th element of $\mathbf{J}^{(q)}_{l,k}(\mathbf{x}_0) \in \mathbb{R}^{N\times N}$ is given by
$$\prod_{d=1}^{P_{l-1}} \zeta^{(q)}_l\Big(\mu^{(d)}_{1\to(l-1),k}(\mathbf{x}_0),\, \sigma^{2(d)}_{1\to(l-1),k}(\mathbf{x}_0),\, (\mathbf{w}^{(d)}_{l-1,k})_i,\, (\mathbf{w}^{(d)}_{l-1,k})_j\Big),$$
and $\mu^{(q)}_{1\to 1,k}(\mathbf{x}_0)$ and $\sigma^{2(q)}_{1\to 1,k}(\mathbf{x}_0)$ are given by
$$\mu^{(q)}_{1\to 1,k}(\mathbf{x}_0) = \mathbf{r}^{(q)}(\mathbf{x}_0)^\top \big(\mathbf{R}^{(q)}_1\big)^{-1} \mathbf{w}^{(q)}_{1,k}, \qquad (5)$$
$$\sigma^{2(q)}_{1\to 1,k}(\mathbf{x}_0) = \big(\sigma^{(q)}_1\big)^2\Big(1+\eta^{(q)}_1 - \mathbf{r}^{(q)}(\mathbf{x}_0)^\top \big(\mathbf{R}^{(q)}_1\big)^{-1} \mathbf{r}^{(q)}(\mathbf{x}_0)\Big), \qquad (6)$$
respectively, for $q = 1, \dots, P_1$, where $\mathbf{r}^{(q)}(\mathbf{x}_0) = [k^{(q)}_1(\mathbf{x}_0, \mathbf{x}_{1*}), \dots, k^{(q)}_1(\mathbf{x}_0, \mathbf{x}_{N*})]^\top$, and $\xi^{(q)}_l(\cdot,\cdot,\cdot)$ and $\zeta^{(q)}_l(\cdot,\cdot,\cdot,\cdot)$ are analytically tractable functions given in Ming & Guillas (2021, Appendix A) for $\mathcal{GP}^{(q)}_l$ with squared exponential or Matérn kernels. Let $\tilde\mu_{Y_0,k}$ and $(\tilde\sigma_{Y_0,k})^2$ denote the mean and variance of $\widehat{p}(y_0 \,|\, \mathbf{x}_0; \mathbf{f}_k, \{\mathbf{w}^{(p)}_l\}_k, \mathbf{x})$, respectively.
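The first-layer recursion in (5)-(6) is an ordinary GP conditional. The following hedged numpy sketch implements just these two formulae for a single GP with a squared-exponential kernel; the function and parameter names (lengthscales gamma, nugget eta) are illustrative and not taken from the dgpsi implementation.

```python
import numpy as np

# Hedged sketch of the first-layer moments (5)-(6) for one GP with a
# squared-exponential kernel; names are illustrative, not from dgpsi.
def sq_exp(a, b, gamma):
    """Squared-exponential correlation between row sets a (n,D) and b (m,D)."""
    d = a[:, None, :] - b[None, :, :]
    return np.exp(-np.sum((d / gamma) ** 2, axis=-1))

def layer1_moments(x0, x, w1, sigma2, gamma, eta):
    """Return mu (5) and variance (6) at a new input x0 of shape (D,)."""
    R = sq_exp(x, x, gamma) + eta * np.eye(len(x))   # correlation matrix with nugget
    r = sq_exp(x0[None, :], x, gamma).ravel()        # r(x0): correlations to training inputs
    mu = r @ np.linalg.solve(R, w1)                  # equation (5)
    var = sigma2 * (1.0 + eta - r @ np.linalg.solve(R, r))  # equation (6)
    return mu, var
```

At a training input with a tiny nugget, the mean interpolates the latent value and the variance collapses toward zero, as expected from (5)-(6).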
For many parametric distributions in the likelihood layer, $p(y_k \,|\, \boldsymbol{\phi}_k = g^{-1}(\mathbf{f}_k))$, including the heteroskedastic Gaussian, Poisson, negative binomial, and zero-inflated Poisson and negative binomial distributions considered in this work, the LGP approximation yields closed-form expressions for $\tilde\mu_{Y_0,k}$ and $(\tilde\sigma_{Y_0,k})^2$; see Appendix A for a non-exhaustive list of distributions admitting such expressions. The mean and variance of $Y_0(\mathbf{x}_0)$ can therefore be approximated by
$$\tilde\mu_{Y_0} \doteq \frac{1}{K}\sum_{k=1}^{K} \tilde\mu_{Y_0,k} \quad\text{and}\quad (\tilde\sigma_{Y_0})^2 \doteq \frac{1}{K}\sum_{k=1}^{K}\Big((\tilde\mu_{Y_0,k})^2 + (\tilde\sigma_{Y_0,k})^2\Big) - (\tilde\mu_{Y_0})^2, \qquad (7)$$
respectively. When closed-form expressions for the relevant summaries are unavailable, or when other quantities of interest (QoIs) are needed to characterize $Y_0(\mathbf{x}_0)$, such as category probabilities in the categorical case, fast sample-based approximations can be obtained via the method of composition (Tanner 1993). Specifically, noting that
$$p(y_0, \mathbf{f}_0, \mathbf{f}, \{\mathbf{w}^{(p)}_l\} \,|\, \mathbf{x}_0; \mathbf{y}, \mathbf{x}) = p(y_0 \,|\, \mathbf{f}_0)\, p(\mathbf{f}_0 \,|\, \mathbf{x}_0; \mathbf{f}, \{\mathbf{w}^{(p)}_l\}, \mathbf{x})\, p(\mathbf{f}, \{\mathbf{w}^{(p)}_l\} \,|\, \mathbf{y}, \mathbf{x}) \doteq p(y_0 \,|\, \mathbf{f}_0) \prod_{q=1}^{Q} \widehat{p}\big(f^{(q)}_0 \,|\, \mathbf{x}_0; \mathbf{f}, \{\mathbf{w}^{(p)}_l\}, \mathbf{x}\big)\, p(\mathbf{f}, \{\mathbf{w}^{(p)}_l\} \,|\, \mathbf{y}, \mathbf{x}),$$
we can generate $K$ approximate samples from $p(y_0 \,|\, \mathbf{x}_0; \mathbf{y}, \mathbf{x})$ by repeatedly applying Algorithm 1. These samples can then be used to approximate the mean, variance, or other QoIs of $Y_0(\mathbf{x}_0)$. The overall prediction procedure for GDGP is summarized in Algorithm 2.

Algorithm 1 Method of Composition
1: Draw $\mathbf{f}_k$ and $\{\mathbf{w}^{(p)}_l\}_k$ from $p(\mathbf{f}, \{\mathbf{w}^{(p)}_l\} \,|\, \mathbf{y}, \mathbf{x})$;
2: Draw $\mathbf{f}_{0,k}$ from the LGPs $\widehat{p}(f^{(q)}_0 \,|\, \mathbf{x}_0; \mathbf{f}_k, \{\mathbf{w}^{(p)}_l\}_k, \mathbf{x})$ for all $q = 1, \dots, Q$;
3: Draw $y_{0,k}$ from $p(y_0 \,|\, \mathbf{f}_{0,k})$.

Algorithm 2 Prediction from the GDGP
Input: (i) Observations $\mathbf{x}$ and $\mathbf{y}$; (ii) trained $\{\mathcal{GP}^{(p)}_l\}_{p=1,\dots,P_l,\, l=1,\dots,L}$ in the GDGP hierarchy; (iii) a new input position $\mathbf{x}_0$.
Output: mean $\tilde\mu_{Y_0}$ and variance $(\tilde\sigma_{Y_0})^2$ of $Y_0(\mathbf{x}_0)$.
1: Draw $K$ realizations $\mathbf{f}_1, \dots, \mathbf{f}_K$ and $\{\mathbf{w}^{(p)}_l\}_1, \dots, \{\mathbf{w}^{(p)}_l\}_K$ from $p(\mathbf{f}, \{\mathbf{w}^{(p)}_l\} \,|\, \mathbf{y}, \mathbf{x})$;
2: Construct $K$ LGPs $\widehat{p}(f^{(q)}_0 \,|\, \mathbf{x}_0; \mathbf{f}_1, \{\mathbf{w}^{(p)}_l\}_1, \mathbf{x}), \dots, \widehat{p}(f^{(q)}_0 \,|\, \mathbf{x}_0; \mathbf{f}_K, \{\mathbf{w}^{(p)}_l\}_K, \mathbf{x})$ for all $q = 1, \dots, Q$;
3: Compute the mean, variance, or other QoIs of $Y_0(\mathbf{x}_0)$ using (i) equations (7), when the relevant closed-form expressions are available; or (ii) the method of composition, by executing Steps 2-3 of Algorithm 1 for all $K$ LGPs.

3.2 Imputation

Simulations or imputations of $\mathbf{F}$ and $\{\mathbf{W}^{(p)}_l\}$ from $p(\mathbf{f}, \{\mathbf{w}^{(p)}_l\} \,|\, \mathbf{y}, \mathbf{x})$ in Step 1 of Algorithm 2 cannot be achieved exactly due to the complexity introduced by the GDGP hierarchy. However, the conditional posteriors $p(\mathbf{w}^{(p)}_l \,|\, \{\mathbf{w}^{(p)}_l\} \setminus \mathbf{w}^{(p)}_l, \mathbf{f}, \mathbf{y}, \mathbf{x})$ can be expressed as
$$p(\mathbf{w}^{(p)}_l \,|\, \{\mathbf{w}^{(p)}_l\} \setminus \mathbf{w}^{(p)}_l, \mathbf{f}, \mathbf{y}, \mathbf{x}) \propto \prod_{q=1}^{P_{l+1}} p\big(\mathbf{w}^{(q)}_{l+1} \,|\, \mathbf{w}^{(1)}_l, \dots, \mathbf{w}^{(p)}_l, \dots, \mathbf{w}^{(P_l)}_l\big)\; p\big(\mathbf{w}^{(p)}_l \,|\, \mathbf{w}^{(1)}_{l-1}, \dots, \mathbf{w}^{(P_{l-1})}_{l-1}\big) \qquad (8)$$
for $p = 1, \dots, P_l$ and $l = 1, \dots, L-1$, where $\mathbf{w}^{(q)}_L = \mathbf{f}^{(q)}$ and $P_L = Q$. Analogously,
$$p(\mathbf{f}^{(q)} \,|\, \mathbf{f} \setminus \mathbf{f}^{(q)}, \{\mathbf{w}^{(p)}_l\}, \mathbf{y}, \mathbf{x}) \propto p\big(\mathbf{y} \,|\, \mathbf{f}^{(1)}, \dots, \mathbf{f}^{(q)}, \dots, \mathbf{f}^{(Q)}\big)\; p\big(\mathbf{f}^{(q)} \,|\, \mathbf{w}^{(1)}_{L-1}, \dots, \mathbf{w}^{(P_{L-1})}_{L-1}\big) = \prod_{i=1}^{N} p\big(y_i \,|\, g_1^{-1}(f^{(1)}_i), \dots, g_q^{-1}(f^{(q)}_i), \dots, g_Q^{-1}(f^{(Q)}_i)\big)\; p\big(\mathbf{f}^{(q)} \,|\, \mathbf{w}^{(1)}_{L-1}, \dots, \mathbf{w}^{(P_{L-1})}_{L-1}\big) \qquad (9)$$
for $q = 1, \dots, Q$. Since $p(\mathbf{w}^{(p)}_l \,|\, \mathbf{w}^{(1)}_{l-1}, \dots, \mathbf{w}^{(P_{l-1})}_{l-1})$ in Equation (8) and $p(\mathbf{f}^{(q)} \,|\, \mathbf{w}^{(1)}_{L-1}, \dots, \mathbf{w}^{(P_{L-1})}_{L-1})$ in Equation (9) are multivariate normal, one can impute the latent layers $\mathbf{F}$ and $\{\mathbf{W}^{(p)}_l\}$ using an Elliptical Slice Sampling (Murray et al. 2010) within Gibbs (ESS-within-Gibbs) sampler.
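For readers unfamiliar with elliptical slice sampling, the single ESS update used inside each Gibbs sweep can be sketched as follows. This is the generic ESS step of Murray et al. (2010) for a latent vector with a zero-mean Gaussian prior and an arbitrary log-likelihood, written as a hedged illustration rather than the dgpsi implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Generic elliptical slice sampling step (Murray et al. 2010): given the current
# latent f with prior N(0, Sigma) (Cholesky factor chol_sigma) and a
# log-likelihood, return a new state on the ellipse through f and a prior draw.
def ess_step(f, chol_sigma, log_lik):
    nu = chol_sigma @ rng.standard_normal(len(f))    # prior draw defining the ellipse
    log_y = log_lik(f) + np.log(rng.uniform())       # slice threshold
    theta = rng.uniform(0.0, 2.0 * np.pi)
    lo, hi = theta - 2.0 * np.pi, theta              # initial angle bracket
    while True:
        f_new = f * np.cos(theta) + nu * np.sin(theta)
        if log_lik(f_new) > log_y:
            return f_new                             # accepted; loop always terminates
        if theta < 0.0:                              # shrink the bracket toward theta = 0,
            lo = theta                               # where f_new = f is always acceptable
        else:
            hi = theta
        theta = rng.uniform(lo, hi)
```

The update needs no step-size tuning and leaves the target distribution invariant, which is why it suits the Gaussian conditionals in (8) and (9).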
Algorithm 3 illustrates a single imputation of $\mathbf{F}$ and $\{\mathbf{W}^{(p)}_l\}$ from the ESS-within-Gibbs sampler.

Algorithm 3 One-step ESS-within-Gibbs sampler to impute $\mathbf{F}$ and $\{\mathbf{W}^{(p)}_l\}$
Input: a current imputation $\mathbf{f}_i$ and $\{\mathbf{w}^{(p)}_l\}_i$.
Output: a new imputation $\mathbf{f}_{i+1}$ and $\{\mathbf{w}^{(p)}_l\}_{i+1}$.
1: for $l = 1, \dots, L$ do
2:   if $l = L$ then
3:     for $q = 1, \dots, Q$ do
4:       Impute $\mathbf{F}^{(q)}$ by drawing $\mathbf{f}^{(q)}_{i+1}$ from $p(\mathbf{f}^{(q)} \,|\, \mathbf{f} \setminus \mathbf{f}^{(q)}, \{\mathbf{w}^{(p)}_l\}, \mathbf{y}, \mathbf{x})$ in the form of (9) via an ESS update;
5:     end for
6:   else
7:     for $p = 1, \dots, P_l$ do
8:       Impute $\mathbf{W}^{(p)}_l$ by drawing $\mathbf{w}^{(p)}_{l,i+1}$ from $p(\mathbf{w}^{(p)}_l \,|\, \{\mathbf{w}^{(p)}_l\} \setminus \mathbf{w}^{(p)}_l, \mathbf{f}, \mathbf{y}, \mathbf{x})$ in the form of (8) via an ESS update;
9:     end for
10:  end if
11: end for

3.3 Training

Following the SI framework introduced by Ming et al. (2023), the unknown GP parameters $\boldsymbol{\theta}^{(p)}_l$ in the GDGP structure are estimated using the Stochastic Expectation-Maximization (SEM) algorithm (Celeux & Diebolt 1985), as summarized in Algorithm 4. Once the GDGP hyperparameters have been estimated, the imputation module (see Section 3.2) generates a set of GDGP realizations, thereby enabling fast posterior prediction via the LGPs constructed according to Algorithm 2.

4 Scalability

The computational complexity of performing inference for GDGP emulators using SI can become prohibitive as the training data size increases, primarily due to the well-known cubic complexity of covariance matrix inversion. In the remainder of this section, we show how to incorporate the Vecchia approximation into the SI framework, enabling scalable GDGP emulation that can efficiently handle large training datasets. The main computational bottleneck of SI arises from the evaluation of likelihood functions and sampling from $\{\mathcal{GP}^{(p)}_l\}$, which require repeated inversion of covariance matrices during the ESS-within-Gibbs sampling in the Imputation step and the log-likelihood optimization in the Maximization step of Algorithm 4.
To address this with the Vecchia approximation, consider an elementary GP, $\mathcal{GP}^{(p)}_l$, in the GDGP hierarchy.

Algorithm 4 Training for the GDGP via SEM
Input: (i) Observations $\mathbf{x}$ and $\mathbf{y}$; (ii) initial values of the model parameters $\{\widehat{\boldsymbol{\theta}}^{(p,1)}_l\}$; (iii) total number of iterations $T$ and burn-in period $B$ for SEM; (iv) burn-in period $C$ for ESS.
Output: point estimates of the model parameters.
1: for $t = 1, \dots, T$ do
2:   Imputation step: draw an imputation of $\mathbf{F}$ and $\{\mathbf{W}^{(p)}_l\}$ from $p(\mathbf{f}, \{\mathbf{w}^{(p)}_l\} \,|\, \mathbf{y}, \mathbf{x}; \{\widehat{\boldsymbol{\theta}}^{(p,t)}_l\})$ by evaluating $C$ steps of the ESS-within-Gibbs sampler in Algorithm 3;
3:   Maximization step: update the model parameters by maximizing the log-likelihoods of the individual GPs:
$$\widehat{\boldsymbol{\theta}}^{(p,t+1)}_l = \operatorname*{argmax}\; \log p\big(\mathbf{w}^{(p)}_l \,|\, \mathbf{w}^{(1)}_{l-1}, \dots, \mathbf{w}^{(P_{l-1})}_{l-1}; \boldsymbol{\theta}^{(p)}_l\big),$$
for all $p = 1, \dots, P_l$ and $l = 1, \dots, L$, where $\mathbf{w}^{(p)}_L = \mathbf{f}^{(p)}$ and $P_L = Q$.
4: end for
5: Compute point estimates $\widehat{\boldsymbol{\theta}}^{(p)}_l$ of the model parameters by
$$\widehat{\boldsymbol{\theta}}^{(p)}_l = \frac{1}{T-B}\sum_{t=B+1}^{T} \widehat{\boldsymbol{\theta}}^{(p,t)}_l \quad \forall\, p, l.$$

Vecchia's log-likelihood function (Guinness 2021) of $\mathcal{GP}^{(p)}_l$ can be written as:
$$
\begin{aligned}
\log \widehat{L}(\boldsymbol{\theta}^{(p)}_l) = \log \widehat{p}(\mathbf{w}^{(p)}_l \,|\, \mathbf{w}_{l-1}) &= \sum_{i=1}^{N} \log p\big(w^{(p)}_{l,i} \,|\, \mathbf{w}^{(p)}_{l,g(-i)}, \mathbf{w}_{l-1}\big) = \sum_{i=1}^{N} \Big[\log p\big(\mathbf{w}^{(p)}_{l,g(i)} \,|\, \mathbf{w}_{l-1,g(i)*}\big) - \log p\big(\mathbf{w}^{(p)}_{l,g(-i)} \,|\, \mathbf{w}_{l-1,g(-i)*}\big)\Big] \\
&= -\frac{N}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{N}\Big[\log\det \boldsymbol{\Sigma}^{(p)}_{l,g(i)} - \log\det \boldsymbol{\Sigma}^{(p)}_{l,g(-i)}\Big] - \frac{1}{2}\sum_{i=1}^{N}\Big[\mathbf{w}^{(p)\top}_{l,g(i)} \big(\boldsymbol{\Sigma}^{(p)}_{l,g(i)}\big)^{-1} \mathbf{w}^{(p)}_{l,g(i)} - \mathbf{w}^{(p)\top}_{l,g(-i)} \big(\boldsymbol{\Sigma}^{(p)}_{l,g(-i)}\big)^{-1} \mathbf{w}^{(p)}_{l,g(-i)}\Big],
\end{aligned} \qquad (10)
$$
where $g(-i) \subseteq \{1, 2, \dots, i-1\}$ is a conditioning index set of size $|g(-i)| = \min(M, i-1)$ for all $i = 2, \dots, N$, with $g(-1) = \emptyset$ and $g(i) = g(-i) \cup \{i\}$; $\boldsymbol{\Sigma}^{(p)}_{l,g(i)} \in \mathbb{R}^{(|g(-i)|+1)\times(|g(-i)|+1)}$ and $\boldsymbol{\Sigma}^{(p)}_{l,g(-i)} \in \mathbb{R}^{|g(-i)|\times|g(-i)|}$ are the covariance matrices of $\mathbf{W}^{(p)}_{l,g(i)}$ and $\mathbf{W}^{(p)}_{l,g(-i)}$, respectively. The complexity of evaluating Vecchia's log-likelihood function (10) is $\mathcal{O}(NM^3)$, which is significantly lower than the $\mathcal{O}(N^3)$ complexity of evaluating the original log-likelihood function of $\mathcal{GP}^{(p)}_l$ when $M \ll N$. Notably, the summands in (10) are independent, allowing for parallel processing across different summands, which can effectively reduce the $N$ factor in $\mathcal{O}(NM^3)$.

The Vecchia approximation implies that the probability distribution of $\mathbf{W}^{(p)}_l$, defined by the probability density function $\widehat{p}(\mathbf{w}^{(p)}_l \,|\, \mathbf{w}_{l-1}) = \prod_{i=1}^{N} p(w^{(p)}_{l,i} \,|\, \mathbf{w}^{(p)}_{l,g(-i)}, \mathbf{w}_{l-1})$, is itself a multivariate normal distribution:
$$\mathbf{W}^{(p)}_l \,|\, \mathbf{W}_{l-1} \sim \mathcal{N}\Big(\mathbf{0},\, \big(\mathbf{P}^{(p)}_l\big)^{-1}\Big), \qquad (11)$$
where $\mathbf{P}^{(p)}_l = \mathbf{U}^{(p)}_l \mathbf{U}^{(p)\top}_l$ is the precision matrix of $\mathbf{W}^{(p)}_l \,|\, \mathbf{W}_{l-1}$, with $\mathbf{U}^{(p)}_l$ a sparse upper triangular matrix whose $ji$-th element is given by (Katzfuss & Guinness 2021):
$$U^{(p)}_{l,ji} = \begin{cases} \big(d^{(p)}_{l,i}\big)^{-1/2}, & j = i, \\ -\,b^{(p)}_{l,ik}\big(d^{(p)}_{l,i}\big)^{-1/2}, & j \in g(-i), \\ 0, & \text{otherwise}, \end{cases} \qquad (12)$$
where
• $d^{(p)}_{l,i} = (\sigma^{(p)}_l)^2\big(1+\eta - \mathbf{r}^{(p)\top}_{l,i} \big(\mathbf{R}^{(p)}_{l,g(-i)}\big)^{-1} \mathbf{r}^{(p)}_{l,i}\big)$;
• $\mathbf{b}^{(p)}_{l,i} = \mathbf{r}^{(p)\top}_{l,i} \big(\mathbf{R}^{(p)}_{l,g(-i)}\big)^{-1}$, with $\mathbf{r}^{(p)}_{l,i} = [k^{(p)}_l(\mathbf{w}_{l-1,i*}, \mathbf{w}_{l-1,j*})]^\top_{j\in g(-i)} \in \mathbb{R}^{|g(-i)|\times 1}$ and $\mathbf{R}^{(p)}_{l,g(-i)}$ the correlation matrix of $\mathbf{W}^{(p)}_{l,g(-i)}$; and $b^{(p)}_{l,ik}$ is the $k$-th element of $\mathbf{b}^{(p)}_{l,i}$, with $k = \mathrm{pos}_{g(-i)}(j)$ denoting the position of $j$ in the conditioning set $g(-i)$.
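The column-by-column construction in (12) can be illustrated with a small, hedged numpy sketch for a single zero-mean GP with a squared-exponential correlation on 1-d inputs. For simplicity, the conditioning set $g(-i)$ here is just the (up to) $M$ immediately preceding indices rather than scaled nearest neighbours, and the names are illustrative rather than drawn from dgpsi.

```python
import numpy as np

# Hedged sketch of the Vecchia factor U in (12), one column per point: each
# column stores the conditional precision contributions d_i and b_i computed
# from the small |g(-i)| x |g(-i)| block of the covariance matrix.
def vecchia_U(x, M, sigma2=1.0, eta=1e-4):
    N = len(x)
    R = sigma2 * (np.exp(-np.subtract.outer(x, x) ** 2) + eta * np.eye(N))
    U = np.zeros((N, N))
    for i in range(N):
        g = list(range(max(0, i - M), i))              # conditioning set g(-i)
        r = R[g, i]                                    # covariances to point i
        b = np.linalg.solve(R[np.ix_(g, g)], r) if g else np.zeros(0)
        d = R[i, i] - r @ b                            # conditional variance d_i
        U[i, i] = d ** -0.5
        U[g, i] = -b * d ** -0.5                       # at most M off-diagonal entries
    return U, R

# With full conditioning (M = N - 1), U U^T recovers the exact precision matrix.
U, R = vecchia_U(np.linspace(0.0, 1.0, 6), M=5)
```

With $M < N - 1$ the same code produces the sparse, approximate factor used in (10), (11), and (13).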
Note that $\mathbf{U}^{(p)}_l$ is highly sparse, and its construction has a complexity of at most $\mathcal{O}(NM^3)$, since the construction of each column is independent (allowing for parallel computation), each column contains at most $M$ off-diagonal non-zero elements, and computing each column involves inverting the correlation matrix $\mathbf{R}^{(p)}_{l,g(-i)} \in \mathbb{R}^{|g(-i)|\times|g(-i)|}$. Given that $\mathbf{W}^{(p)}_l \,|\, \mathbf{W}_{l-1} = \big(\mathbf{U}^{(p)\top}_l\big)^{-1}\mathbf{Z}$, where $\mathbf{Z} \in \mathbb{R}^{N\times 1}$ is a vector of i.i.d. univariate standard normal random variables, a realization $\mathbf{w}^{(p)}_l$ of $\mathbf{W}^{(p)}_l \,|\, \mathbf{W}_{l-1}$ can be generated from the multivariate normal distribution (11) by computing
$$\mathbf{w}^{(p)}_l = \big(\mathbf{U}^{(p)\top}_l\big)^{-1}\mathbf{z}, \qquad (13)$$
where $\mathbf{z}$ is a realization of $\mathbf{Z}$. Since $\mathbf{U}^{(p)}_l$ is a sparse upper triangular matrix, (13) can be computed via forward substitution with a complexity of $\mathcal{O}(NM)$ instead of $\mathcal{O}(N^2)$, leveraging the sparsity of $\mathbf{U}^{(p)}_l$. Replacing the log-likelihood function in the ESS update in Algorithm 3 and in the Maximization step of Algorithm 4 with Vecchia's form (10), and replacing the sampling from $\{\mathcal{GP}^{(p)}_l\}$ in the ESS update in Algorithm 3 with (13), then enables scalable training of GDGP via SEM.

It is worth noting that Vecchia's log-likelihood function (10) can equivalently be expressed in terms of $\mathbf{U}^{(p)}_l$ via (11) as:
$$\log \widehat{L}(\boldsymbol{\theta}^{(p)}_l) = -\frac{N}{2}\log(2\pi) - \frac{1}{2}\log\det\Big(\big(\mathbf{U}^{(p)}_l \mathbf{U}^{(p)\top}_l\big)^{-1}\Big) - \frac{1}{2}\mathbf{w}^{(p)\top}_l \mathbf{U}^{(p)}_l \mathbf{U}^{(p)\top}_l \mathbf{w}^{(p)}_l. \qquad (14)$$
However, there are two main computational reasons why (10) is preferred over (14) for SI. First, computing (14) requires storing $\mathbf{U}^{(p)}_l$; although $\mathbf{U}^{(p)}_l$ is sparse, it can still be memory-intensive when $N$ is large.
In contrast, (10) is more memory-efficient, as each summand only requires storing a small covariance matrix $\boldsymbol{\Sigma}^{(p)}_{l,g(i)}$, and the evaluation of $\log\widehat{L}(\boldsymbol{\theta}^{(p)}_l)$ can be performed by incrementally accumulating the contributions of individual summands, or batches of summands, with minimal memory overhead. Second, the form in (10) admits analytically tractable gradients with respect to $\boldsymbol{\theta}^{(p)}_l$, facilitating faster optimization of $\{\mathcal{GP}^{(p)}_l\}$ in the Maximization step of SEM in Algorithm 4.

The ordering of $w^{(p)}_{l,1}, \dots, w^{(p)}_{l,N}$ and the corresponding conditioning index sets $g(-i)$ for $i = 2, \dots, N$ affect the accuracy of Vecchia's log-likelihood function (10) and of the multivariate normal sampling in (13). Different ordering approaches are discussed in Katzfuss & Guinness (2021). In this study, we adopt a simple random ordering following Sauer et al. (2023). Regarding the selection of the conditioning index set $g(-i)$, we employ nearest-neighbor (NN) conditioning, where $g(-i)$ consists of the indices of up to $M$ nearest variables that precede $w^{(p)}_{l,i}$, based on the Euclidean distance between their respective inputs and the input $\mathbf{w}_{l-1,i*}$ of $w^{(p)}_{l,i}$, with each input dimension scaled by the corresponding lengthscale $\gamma^{(p)}_{l,d}$ for $d \in \{1, \dots, P_{l-1}\}$ (Katzfuss et al. 2022).
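The lengthscale-scaled nearest-neighbor selection just described can be sketched in a few lines of numpy; this is a brute-force illustration with hypothetical names, not the dgpsi implementation (which would typically use a k-d tree or similar for large $N$).

```python
import numpy as np

# Hedged sketch of lengthscale-scaled NN conditioning sets: for each index i,
# g(-i) holds up to M previous indices whose scaled inputs lie closest to the
# scaled input of point i (brute force; names illustrative).
def nn_conditioning_sets(W_prev, gamma, M):
    """W_prev: (N, P) inputs to the GP layer; gamma: (P,) lengthscales."""
    Z = W_prev / gamma                                   # scale each input dimension
    sets = [np.array([], dtype=int)]                     # g(-1) is empty
    for i in range(1, len(Z)):
        d = np.linalg.norm(Z[:i] - Z[i], axis=1)         # distances to predecessors only
        sets.append(np.argsort(d)[: min(M, i)])          # up to M nearest predecessors
    return sets
```

Because only predecessors under the chosen ordering are eligible, combining this with a random ordering (as adopted above) yields conditioning sets that mix near and moderately far points early in the sequence.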
To achieve scalable prediction with the GDGP, we can replace the log-likelihood evaluation and the sampling of $\{\mathcal{GP}^{(p)}_l\}$ in the imputation step on Line 1 of Algorithm 2 by (10) and (13), respectively, and the following proposition gives the Vecchia expressions of (3) and (4) for scalable LGP construction on Line 2 of Algorithm 2:

Proposition 4.1 Under the Vecchia approximation, the mean $\mu^{(q)}_{1\to L,k}(\mathbf{x}_0)$ and variance $\sigma^{2(q)}_{1\to L,k}(\mathbf{x}_0)$ of the univariate normal distribution defined by $\widehat{p}(f^{(q)}_0 \,|\, \mathbf{x}_0; \mathbf{f}_k, \{\mathbf{w}^{(p)}_l\}_k, \mathbf{x})$ can be obtained by iterating the following formulae:
$$\mu^{(q)}_{1\to l,k}(\mathbf{x}_0) = \mathbf{I}^{(q)}_{l,k,\mathcal{C}}(\mathbf{x}_0)^\top \big(\mathbf{R}^{(q)}_{l,k,\mathcal{C}}\big)^{-1} \mathbf{w}^{(q)}_{l,k,\mathcal{C}}, \qquad (15)$$
$$\sigma^{2(q)}_{1\to l,k}(\mathbf{x}_0) = \mathbf{w}^{(q)\top}_{l,k,\mathcal{C}} \big(\mathbf{R}^{(q)}_{l,k,\mathcal{C}}\big)^{-1} \mathbf{J}^{(q)}_{l,k,\mathcal{C}}(\mathbf{x}_0) \big(\mathbf{R}^{(q)}_{l,k,\mathcal{C}}\big)^{-1} \mathbf{w}^{(q)}_{l,k,\mathcal{C}} - \Big(\mathbf{I}^{(q)}_{l,k,\mathcal{C}}(\mathbf{x}_0)^\top \big(\mathbf{R}^{(q)}_{l,k,\mathcal{C}}\big)^{-1} \mathbf{w}^{(q)}_{l,k,\mathcal{C}}\Big)^2 + \big(\sigma^{(q)}_l\big)^2\Big(1+\eta^{(q)}_l - \mathrm{tr}\Big(\big(\mathbf{R}^{(q)}_{l,k,\mathcal{C}}\big)^{-1} \mathbf{J}^{(q)}_{l,k,\mathcal{C}}(\mathbf{x}_0)\Big)\Big) \qquad (16)$$
for $l = 2, \dots, L$ and $q = 1, \dots, P_l$, with $\mu^{(q)}_{1\to 1,k}(\mathbf{x}_0)$ and $\sigma^{2(q)}_{1\to 1,k}(\mathbf{x}_0)$ given by
$$\mu^{(q)}_{1\to 1,k}(\mathbf{x}_0) = \mathbf{r}^{(q)}_{\mathcal{C}}(\mathbf{x}_0)^\top \big(\mathbf{R}^{(q)}_{1,\mathcal{C}}\big)^{-1} \mathbf{w}^{(q)}_{1,k,\mathcal{C}}, \qquad (17)$$
$$\sigma^{2(q)}_{1\to 1,k}(\mathbf{x}_0) = \big(\sigma^{(q)}_1\big)^2\Big(1+\eta^{(q)}_1 - \mathbf{r}^{(q)}_{\mathcal{C}}(\mathbf{x}_0)^\top \big(\mathbf{R}^{(q)}_{1,\mathcal{C}}\big)^{-1} \mathbf{r}^{(q)}_{\mathcal{C}}(\mathbf{x}_0)\Big), \qquad (18)$$
respectively, for $q = 1, \dots, P_1$, where
• $\mathcal{C} \subseteq \{1, 2, \dots, N\}$ is a conditioning index set of size $|\mathcal{C}|$;
• $\mathbf{R}^{(q)}_{l,k,\mathcal{C}} \in \mathbb{R}^{|\mathcal{C}|\times|\mathcal{C}|}$ is the correlation matrix of $\mathbf{W}^{(q)}_{l,k,\mathcal{C}} = [(\mathbf{W}^{(q)}_{l,k})_i]^\top_{i\in\mathcal{C}}$;
• $\mathbf{I}^{(q)}_{l,k,\mathcal{C}}(\mathbf{x}_0) \in \mathbb{R}^{|\mathcal{C}|\times 1}$ is the sub-vector of $\mathbf{I}^{(q)}_{l,k}(\mathbf{x}_0)$ whose $i$-th element is defined as $\prod_{d=1}^{P_{l-1}} \xi^{(q)}_l\big(\mu^{(d)}_{1\to(l-1),k}(\mathbf{x}_0),\, \sigma^{2(d)}_{1\to(l-1),k}(\mathbf{x}_0),\, (\mathbf{w}^{(d)}_{l-1,k})_{\mathcal{C}_i}\big)$;
• $\mathbf{J}^{(q)}_{l,k,\mathcal{C}}(\mathbf{x}_0) \in \mathbb{R}^{|\mathcal{C}|\times|\mathcal{C}|}$ is the sub-matrix of $\mathbf{J}^{(q)}_{l,k}(\mathbf{x}_0)$ whose $ij$-th element is defined as $\prod_{d=1}^{P_{l-1}} \zeta^{(q)}_l\big(\mu^{(d)}_{1\to(l-1),k}(\mathbf{x}_0),\, \sigma^{2(d)}_{1\to(l-1),k}(\mathbf{x}_0),\, (\mathbf{w}^{(d)}_{l-1,k})_{\mathcal{C}_i},\, (\mathbf{w}^{(d)}_{l-1,k})_{\mathcal{C}_j}\big)$;
• $\mathbf{R}^{(q)}_{1,\mathcal{C}} \in \mathbb{R}^{|\mathcal{C}|\times|\mathcal{C}|}$ is the correlation matrix of $\mathbf{W}^{(q)}_{1,\mathcal{C}} = [(\mathbf{W}^{(q)}_1)_i]^\top_{i\in\mathcal{C}}$;
• $\mathbf{r}^{(q)}_{\mathcal{C}}(\mathbf{x}_0) = [k^{(q)}_1(\mathbf{x}_0, \mathbf{x}_{i*})]^\top_{i\in\mathcal{C}}$.

Proof A sketch of the proof is provided in Section S.1 of the supplementary materials. □

Note that Proposition 4.1 yields a substantially cheaper construction of LGPs, with complexity $\mathcal{O}(|\mathcal{C}|^3)$ when $|\mathcal{C}| \ll N$. In this construction, NN conditioning is applied to select the conditioning index set $\mathcal{C}$. Specifically, $\mathcal{C}$ contains the row indices of the input positions in $\big[\mathbf{x}_{*1}/\gamma^{(q)}_{1,1}, \dots, \mathbf{x}_{*D}/\gamma^{(q)}_{1,D}\big]$ closest to $\big[x_{0,1}/\gamma^{(q)}_{1,1}, \dots, x_{0,D}/\gamma^{(q)}_{1,D}\big]$ when $l = 1$, and the row indices of the input positions in $\big[\mathbf{w}^{(1)}_{l-1,k}/\gamma^{(q)}_{l,1}, \dots, \mathbf{w}^{(P_{l-1})}_{l-1,k}/\gamma^{(q)}_{l,P_{l-1}}\big]$ closest to $\big[\mu^{(1)}_{1\to(l-1),k}/\gamma^{(q)}_{l,1}, \dots, \mu^{(P_{l-1})}_{1\to(l-1),k}/\gamma^{(q)}_{l,P_{l-1}}\big]$ for $l = 2, \dots, L$.

4.1 Replicates

A distinctive feature of stochastic simulators is that multiple outputs may be generated at the same input setting.
Such replicates are often used to characterize the intrinsic stochasticity of the simulator more accurately, but they can also impose substantial additional computational cost on GDGP inference when the total number of observations becomes large relative to the number of unique input locations. In this subsection, we show that replicates can be incorporated into GDGP in a way that preserves the original computational complexity.

Let $\mathbf{Y} = \{\mathbf{Y}_1, \dots, \mathbf{Y}_N\}$ be the collection of outputs, where $\mathbf{Y}_i \in \mathbb{R}^{S_i}$ contains $S_i$ observations repeatedly generated at the $i$-th input $\mathbf{x}_{i*}$. Assume that $Y_{i,1}, \dots, Y_{i,S_i}$ are conditionally independent given $\boldsymbol{\phi}_i$. In a direct formulation, increasing the number of replicates $S_i$ enlarges the row dimension of the imputed latent variables $\mathbf{w}^{(p)}_l$ to $\sum_{i=1}^{N} S_i$, which in turn yields correlation matrices $\mathbf{R}^{(p)}_l$ of size $\big(\sum_{i=1}^{N} S_i\big) \times \big(\sum_{i=1}^{N} S_i\big)$ for all $p = 1, \dots, P_l$ and $l = 1, \dots, L$. As a result, the likelihood evaluations and the sampling of the multivariate normal distributions specified in Equations (8) and (9) make Algorithm 3, and consequently both training and prediction in GDGP, computationally burdensome when the numbers of replicates are large. Note, however, that these replicates arise from repeated observations at the same input locations rather than from distinct inputs. Consequently, Equation (9) can be rewritten as
$$p(\mathbf{f}^{(q)} \,|\, \mathbf{f}\setminus\mathbf{f}^{(q)}, \{\mathbf{w}^{(p)}_l\}, \mathbf{y}, \mathbf{x}) \propto \prod_{i=1}^{N}\prod_{s=1}^{S_i} p\big(y_{i,s} \,|\, g_1^{-1}(f^{(1)}_i), \dots, g_q^{-1}(f^{(q)}_i), \dots, g_Q^{-1}(f^{(Q)}_i)\big)\; p\big(\mathbf{f}^{(q)} \,|\, \mathbf{w}^{(1)}_{L-1}, \dots, \mathbf{w}^{(P_{L-1})}_{L-1}\big), \qquad (19)$$
thereby avoiding the computational inefficiency that would otherwise be induced by treating replicates as independent observations over an expanded set of inputs. In particular, Equation (19) implies that the correlation matrices arising in the DGP layers remain of size $N \times N$, defined only over the $N$ unique input locations.
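The point of the factorization in (19) can be illustrated numerically for a heteroskedastic Gaussian likelihood: the latent surfaces keep one value per unique input, and each replicate group contributes only a cheap sum, so the likelihood layer cost grows linearly in the total number of replicated observations. A hedged numpy sketch (names illustrative, not from dgpsi):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hedged illustration of (19) for a Gaussian likelihood: latent mu and logvar
# stay of length N (one value per unique input), while the replicate product
# becomes a per-location sum over the S_i observations.
def loglik_with_replicates(mu, logvar, y_groups):
    total = 0.0
    for m, lv, y in zip(mu, logvar, y_groups):
        var = np.exp(lv)
        total += (-0.5 * len(y) * np.log(2.0 * np.pi * var)
                  - 0.5 * np.sum((y - m) ** 2) / var)
    return total

N = 4
mu = rng.normal(size=N)                       # latent mean at the N unique inputs
logvar = rng.normal(size=N)                   # latent log-variance at the N unique inputs
y_groups = [rng.normal(mu[i], np.exp(0.5 * logvar[i]), size=10) for i in range(N)]
ll = loglik_with_replicates(mu, logvar, y_groups)
```

Only the likelihood layer ever touches the $\sum_i S_i$ observations; the GP layers above see just the $N$ unique inputs.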
Consequently, the computational order for multivariate normal likelihood evaluation and sampling in the imputation step is unchanged from the no-replicate case: $N^3$ in the standard setting and $N M^3$ under the Vecchia approximation. At the likelihood layer, the presence of replicates only introduces additional product terms in Equation (19), which increases the cost linearly in the total number of replicated observations.

4.2 The Heteroskedastic Gaussian Case

In addition to the general scalability improvements introduced above through the Vecchia approximation and the explicit treatment of replicates, the heteroskedastic Gaussian case permits further computational simplifications. This setting is of particular interest because heteroskedastic Gaussian emulation remains one of the most widely used approaches for stochastic simulators, while also illustrating how distribution-specific structure can be exploited within GDGP. In this subsection, we show that, under the heteroskedastic Gaussian likelihood, the conditional posterior distribution $p(\boldsymbol{\mu} \mid \log \boldsymbol{\sigma}^2, \{\mathbf{w}^{(p)}_l\}, \mathbf{y}, \mathbf{x})$ of the mean parameter admits closed-form expressions in the standard setting, as well as under the Vecchia approximation and in the presence of replicates. As a result, $\boldsymbol{\mu}$ can be sampled directly in the imputation step described in Section 3.2, avoiding ESS and providing additional computational gains.

Proposition 4.2 Given $\mathbf{Y} \in \mathbb{R}^N$, where $Y_i$ for $i = 1, \ldots, N$ are conditionally independent and distributed as $\mathcal{N}(\mu_i, \sigma^2_i)$ under the GDGP, the posterior distribution $p(\boldsymbol{\mu} \mid \log \boldsymbol{\sigma}^2, \{\mathbf{w}^{(p)}_l\}, \mathbf{y}, \mathbf{x})$ of $\boldsymbol{\mu} = \mathbf{F}^{(1)}$, given $\boldsymbol{\sigma}^2 = \exp\{\mathbf{f}^{(2)}\}$, the imputed latent variables $\{\mathbf{w}^{(p)}_l\}$, and the observed inputs $\mathbf{x}$ and outputs $\mathbf{y}$, is a multivariate normal distribution given by:
$$\mathcal{N}\left(\boldsymbol{\Sigma}^{(1)}_L(\mathbf{w}_{L-1})\left[\boldsymbol{\Sigma}^{(1)}_L(\mathbf{w}_{L-1}) + \boldsymbol{\Gamma}\right]^{-1}\mathbf{y},\; \boldsymbol{\Sigma}^{(1)}_L(\mathbf{w}_{L-1})\left[\boldsymbol{\Sigma}^{(1)}_L(\mathbf{w}_{L-1}) + \boldsymbol{\Gamma}\right]^{-1}\boldsymbol{\Gamma}\right), \quad (20)$$
where $\boldsymbol{\Gamma} = \mathrm{diag}(\sigma^2_1, \ldots, \sigma^2_N) = \mathrm{diag}(e^{f^{(2)}_1}, \ldots, e^{f^{(2)}_N})$.

Proof The proof is straightforward by noting that
$$p\big(\boldsymbol{\mu} \mid \log \boldsymbol{\sigma}^2, \{\mathbf{w}^{(p)}_l\}, \mathbf{y}, \mathbf{x}\big) \propto \prod_{i=1}^N p\big(y_i \mid \mu_i = f^{(1)}_i, \sigma^2_i = \exp\{f^{(2)}_i\}\big)\, p\big(\mathbf{f}^{(1)} \mid \mathbf{w}^{(1)}_{L-1}, \ldots, \mathbf{w}^{(P_{L-1})}_{L-1}\big),$$
where $p(y_i \mid \mu_i = f^{(1)}_i, \sigma^2_i = \exp\{f^{(2)}_i\})$ is a normal density, and $p(\mathbf{f}^{(1)} \mid \mathbf{w}^{(1)}_{L-1}, \ldots, \mathbf{w}^{(P_{L-1})}_{L-1})$ is a multivariate normal density with zero mean and covariance matrix $\boldsymbol{\Sigma}^{(1)}_L(\mathbf{w}_{L-1})$. □

Applying Proposition 4.2 works well when $N$ is relatively small. However, for large $N$, the inversion of $\boldsymbol{\Sigma}^{(1)}_L(\mathbf{w}_{L-1}) + \boldsymbol{\Gamma}$ involved in (20) can become computationally prohibitive. There are three scenarios in which the effective data size may be large: (i) a large number of replicates, leading to $\sum_{i=1}^N S_i \gg N$ even when $N$ is moderate; (ii) a naturally large number of observations, i.e., large $N$ without replicates; and (iii) a large number of observations with varying degrees of replication, from small to large. To address this scalability challenge while retaining the desired sampling efficiency during imputation, Propositions 4.3, 4.4 and 4.5 present analytical forms of the posterior distribution $p(\boldsymbol{\mu} \mid \log \boldsymbol{\sigma}^2, \{\mathbf{w}^{(p)}_l\}, \mathbf{y}, \mathbf{x})$ for each of these three cases, leveraging linear algebra techniques and the Vecchia approximation.

Proposition 4.3 (Small $N$, large $S_i$) Let $\mathbf{Y}$ be a permutation of $[\mathbf{Y}_1, \ldots, \mathbf{Y}_N]$, in which $\mathbf{Y}_i \in \mathbb{R}^{S_i}$ has $S_i$ observations that are repeatedly generated at the $i$-th input $\mathbf{x}_{i*}$, and that $Y_{i,1}, \ldots, Y_{i,S_i}$ are conditionally independent and distributed as $\mathcal{N}(\mu_i, \sigma^2_i)$ under the GDGP.
The posterior distribution $p(\boldsymbol{\mu} \mid \log \boldsymbol{\sigma}^2, \{\mathbf{w}^{(p)}_l\}, \mathbf{y}, \mathbf{x})$ of $\boldsymbol{\mu} = \mathbf{F}^{(1)}$, given $\boldsymbol{\sigma}^2 = \exp\{\mathbf{f}^{(2)}\}$, the imputed latent variables $\{\mathbf{w}^{(p)}_l\}$, and the observed inputs $\mathbf{x}$ and outputs $\mathbf{y}$, is a multivariate normal distribution given by:
$$\mathcal{N}\left(\boldsymbol{\Sigma}^{(1)}_L(\mathbf{w}_{L-1})\left[\mathbf{I} + \mathbf{M}^\top \boldsymbol{\Lambda}^{-1} \mathbf{M} \boldsymbol{\Sigma}^{(1)}_L(\mathbf{w}_{L-1})\right]^{-1} \mathbf{M}^\top \boldsymbol{\Lambda}^{-1} \mathbf{y},\; \boldsymbol{\Sigma}^{(1)}_L(\mathbf{w}_{L-1})\left[\mathbf{I} + \mathbf{M}^\top \boldsymbol{\Lambda}^{-1} \mathbf{M} \boldsymbol{\Sigma}^{(1)}_L(\mathbf{w}_{L-1})\right]^{-1}\right),$$
where $\mathbf{I} \in \mathbb{R}^{N \times N}$ is the identity matrix; $\mathbf{M} \in \mathbb{R}^{(\sum_{i=1}^N S_i) \times N}$ is a replication matrix whose rows are standard basis vectors (i.e., each row contains a single 1 and zeros elsewhere), so that $\mathbf{M}\mathbf{x}$ reorders and replicates the entries of $\mathbf{x}$ to align with $\mathbf{Y}$; and $\boldsymbol{\Lambda} = \mathbf{M}\boldsymbol{\Gamma}\mathbf{M}^\top$ with $\boldsymbol{\Gamma} = \mathrm{diag}(\sigma^2_1, \ldots, \sigma^2_N) = \mathrm{diag}(e^{f^{(2)}_1}, \ldots, e^{f^{(2)}_N})$.

Proposition 4.4 (Large $N$, $S_i = 1$) Given $\mathbf{Y} \in \mathbb{R}^N$, where $Y_i$ for $i = 1, \ldots, N$ are conditionally independent and distributed as $\mathcal{N}(\mu_i, \sigma^2_i)$, the posterior distribution $p(\boldsymbol{\mu} \mid \log \boldsymbol{\sigma}^2, \{\mathbf{w}^{(p)}_l\}, \mathbf{y}, \mathbf{x})$ of $\boldsymbol{\mu} = \mathbf{F}^{(1)}$ under the Vecchia approximation, given $\boldsymbol{\sigma}^2 = \exp\{\mathbf{f}^{(2)}\}$, the imputed latent variables $\{\mathbf{w}^{(p)}_l\}$, and the observed inputs $\mathbf{x}$ and outputs $\mathbf{y}$, is a multivariate normal distribution given by:
$$\mathcal{N}\left(-\left[\mathbf{U}^\top_{\mathbf{F}^{(1)}\mathbf{F}^{(1)}}\right]^{-1} \mathbf{U}^\top_{\mathbf{Y}\mathbf{F}^{(1)}} \mathbf{y},\; \left[\mathbf{U}_{\mathbf{F}^{(1)}\mathbf{F}^{(1)}} \mathbf{U}^\top_{\mathbf{F}^{(1)}\mathbf{F}^{(1)}}\right]^{-1}\right),$$
where $\mathbf{U}_{\mathbf{F}^{(1)}\mathbf{F}^{(1)}}$ and $\mathbf{U}_{\mathbf{Y}\mathbf{F}^{(1)}}$ are block components of a sparse upper-triangular matrix
$$\mathbf{U} = \begin{bmatrix} \mathbf{U}_{\mathbf{Y}\mathbf{Y}} & \mathbf{U}_{\mathbf{Y}\mathbf{F}^{(1)}} \\ \mathbf{0} & \mathbf{U}_{\mathbf{F}^{(1)}\mathbf{F}^{(1)}} \end{bmatrix},$$
which is the upper-lower decomposition of the precision matrix $\mathbf{P}$ for the Vecchia-approximated joint distribution of
$$\mathbf{Z} = \begin{bmatrix} \mathbf{Y} \mid \mathbf{W}_{L-1}, \log \boldsymbol{\sigma}^2 \\ \mathbf{F}^{(1)} \mid \mathbf{W}_{L-1} \end{bmatrix},$$
such that $\mathbf{P} = \mathbf{U}\mathbf{U}^\top$. Specifically, the $ji$-th element of $\mathbf{U}_{*\mathbf{F}^{(1)}} = [\mathbf{U}^\top_{\mathbf{Y}\mathbf{F}^{(1)}}, \mathbf{U}^\top_{\mathbf{F}^{(1)}\mathbf{F}^{(1)}}]^\top \in \mathbb{R}^{2N \times N}$ is given by:
$$U_{*\mathbf{F}^{(1)},\,ji} = \begin{cases} (d_i)^{-\frac{1}{2}}, & j = i + N, \\ -b_{ik}\,(d_i)^{-\frac{1}{2}}, & j \in g(-(i+N)), \\ 0, & \text{otherwise}, \end{cases} \quad (21)$$
with
• $d_i = (\sigma^{(1)}_L)^2(1 + \eta) - \mathbf{r}^\top_i \boldsymbol{\Sigma}^{-1}_{g(-(i+N))} \mathbf{r}_i$;
• $\mathbf{b}_i = \mathbf{r}^\top_i \boldsymbol{\Sigma}^{-1}_{g(-(i+N))}$,
where $g(-(i+N)) \subseteq \{1, 2, \ldots, i + N - 1\}$ is a conditioning index set of size $\min(M, i + N - 1)$ for all $i = 1, \ldots, N$; $\mathbf{r}_i = (\sigma^{(1)}_L)^2 \big[k^{(1)}_L\big(\mathbf{w}_{L-1,i*}, [\mathbf{w}^\top_{L-1}, \mathbf{w}^\top_{L-1}]^\top_{j*}\big)\big]^\top_{j \in g(-(i+N))} \in \mathbb{R}^{|g(-(i+N))| \times 1}$; $\boldsymbol{\Sigma}_{g(-(i+N))}$ is the covariance matrix of $\mathbf{Z}_{g(-(i+N))}$; and $b_{ik}$ is the $k$-th element of $\mathbf{b}_i$ with $k = \mathrm{pos}_{g(-(i+N))}(j)$ denoting the position of $j$ in the conditioning set $g(-(i+N))$.

Proposition 4.5 (Large $N$, $S_i > 1$) Let $\mathbf{Y}$ be a permutation of $[\mathbf{Y}_1, \ldots, \mathbf{Y}_N]$, in which $\mathbf{Y}_i \in \mathbb{R}^{S_i}$ has $S_i$ observations that are repeatedly generated at the $i$-th input $\mathbf{x}_{i*}$, and that $Y_{i,1}, \ldots, Y_{i,S_i}$ are conditionally independent and distributed as $\mathcal{N}(\mu_i, \sigma^2_i)$ under the GDGP. Define $\tilde{\mathbf{Y}} = (\mathbf{M}^\top \boldsymbol{\Lambda}^{-1} \mathbf{M})^{-1} \mathbf{M}^\top \boldsymbol{\Lambda}^{-1} \mathbf{Y}$, where $\mathbf{M} \in \mathbb{R}^{(\sum_{i=1}^N S_i) \times N}$ is a replication matrix whose rows are standard basis vectors, so that $\mathbf{M}\mathbf{x}$ reorders and replicates the entries of $\mathbf{x}$ to align with $\mathbf{Y}$; and $\boldsymbol{\Lambda} = \mathbf{M}\boldsymbol{\Gamma}\mathbf{M}^\top$ with $\boldsymbol{\Gamma} = \mathrm{diag}(\sigma^2_1, \ldots, \sigma^2_N) = \mathrm{diag}(e^{f^{(2)}_1}, \ldots, e^{f^{(2)}_N})$. Then, the posterior distribution $p(\boldsymbol{\mu} \mid \log \boldsymbol{\sigma}^2, \{\mathbf{w}^{(p)}_l\}, \mathbf{y}, \mathbf{x})$ of $\boldsymbol{\mu} = \mathbf{F}^{(1)}$ under the Vecchia approximation, given $\boldsymbol{\sigma}^2 = \exp\{\mathbf{f}^{(2)}\}$, the imputed latent variables $\{\mathbf{w}^{(p)}_l\}$, and the observed inputs $\mathbf{x}$ and outputs $\mathbf{y}$, is a multivariate normal distribution given by:
$$\mathcal{N}\left(-\left[\mathbf{U}^\top_{\mathbf{F}^{(1)}\mathbf{F}^{(1)}}\right]^{-1} \mathbf{U}^\top_{\tilde{\mathbf{Y}}\mathbf{F}^{(1)}} \tilde{\mathbf{y}},\; \left[\mathbf{U}_{\mathbf{F}^{(1)}\mathbf{F}^{(1)}} \mathbf{U}^\top_{\mathbf{F}^{(1)}\mathbf{F}^{(1)}}\right]^{-1}\right),$$
where $\tilde{\mathbf{y}} = (\mathbf{M}^\top \boldsymbol{\Lambda}^{-1} \mathbf{M})^{-1} \mathbf{M}^\top \boldsymbol{\Lambda}^{-1} \mathbf{y}$; $\mathbf{U}_{\mathbf{F}^{(1)}\mathbf{F}^{(1)}}$ and $\mathbf{U}_{\tilde{\mathbf{Y}}\mathbf{F}^{(1)}}$ are block components of a sparse upper-triangular matrix
$$\mathbf{U} = \begin{bmatrix} \mathbf{U}_{\tilde{\mathbf{Y}}\tilde{\mathbf{Y}}} & \mathbf{U}_{\tilde{\mathbf{Y}}\mathbf{F}^{(1)}} \\ \mathbf{0} & \mathbf{U}_{\mathbf{F}^{(1)}\mathbf{F}^{(1)}} \end{bmatrix},$$
which is the upper-lower decomposition of the precision matrix $\mathbf{P}$ for the Vecchia-approximated joint distribution of
$$\mathbf{Z} = \begin{bmatrix} \tilde{\mathbf{Y}} \mid \mathbf{W}_{L-1}, \log \boldsymbol{\sigma}^2 \\ \mathbf{F}^{(1)} \mid \mathbf{W}_{L-1} \end{bmatrix},$$
such that $\mathbf{P} = \mathbf{U}\mathbf{U}^\top$. The $ji$-th element of $\mathbf{U}_{*\mathbf{F}^{(1)}} = [\mathbf{U}^\top_{\tilde{\mathbf{Y}}\mathbf{F}^{(1)}}, \mathbf{U}^\top_{\mathbf{F}^{(1)}\mathbf{F}^{(1)}}]^\top \in \mathbb{R}^{2N \times N}$ is obtained following Equation (21).
Proof The proofs of Propositions 4.3, 4.4 and 4.5 are given in Sections S.2, S.3, and S.4 of the supplementary materials, respectively. □

Note that Proposition 4.3 enables analytical sampling from $p(\boldsymbol{\mu} \mid \log \boldsymbol{\sigma}^2, \{\mathbf{w}^{(p)}_l\}, \mathbf{y}, \mathbf{x})$ with computational complexity $\mathcal{O}(N^3)$. This is computationally feasible when $N$ is small, even in the presence of a large number of replicates. When $N$ is large, in contrast, Propositions 4.4 and 4.5 admit analytical sampling with complexity $\mathcal{O}(N M^3 + N M)$, regardless of the number of replicates. If Proposition 4.2 were applied naively in these settings, the sampling complexity would scale as $\mathcal{O}((\sum_{i=1}^N S_i)^3)$, which is computationally prohibitive, particularly when both $N$ and the numbers of replicates $S_i$ are large.

The Vecchia approximations underlying Propositions 4.4 and 4.5 are constructed using the response-first ordering of Katzfuss et al. (2020), in which $\mathbf{Y}$ and $\tilde{\mathbf{Y}}$, respectively, are always ordered before $\mathbf{F}^{(1)}$. Following Katzfuss et al. (2020), in implementation the conditioning index set $g(-(i+N))$ is formed by selecting up to $M$ variables in $\mathbf{Z}$ that precede $Z_{i+N}$ and whose corresponding inputs in $\mathbf{w}_{L-1}$ are nearest neighbors of $\mathbf{w}_{L-1,i*}$. When a nearest neighbor corresponds to two candidate conditioning variables in $\mathbf{Z}$ (which occurs because, conditional on $\mathbf{W}_{L-1}$, $\mathbf{Y}$ and $\tilde{\mathbf{Y}}$ have the same input locations as $\mathbf{F}^{(1)}$), preference is given to the conditioning variable in $\mathbf{F}^{(1)}$.

5 Experiments

In this section, we compare the emulation performance of GDGP under a selected set of likelihoods with several competing approaches across a series of experiments. Specifically, for GDGP we employ a two-layer DGP architecture (i.e., $L = 2$), in which the first layer contains a number of GP nodes equal to the input dimension, and the second layer contains a number of GP nodes equal to the number of parameters of the associated likelihood. Each GP node uses an isotropic Matérn-2.5 kernel.
This two-layer DGP architecture has been shown (Sauer et al. 2020, Ming et al. 2023) to provide a good balance between computational efficiency and the modeling flexibility of DGPs, and is used in all subsequent experiments. For consistency, all GP-based competitors are also specified using the Matérn-2.5 kernel. All GDGP emulators are implemented using the R package dgpsi, available at https://github.com/mingdeyu/dgpsi-R, and all experiments are conducted on a Mac Studio with a 24-core Apple M2 Ultra processor and 128 GB of RAM.

5.1 Heteroskedastic Gaussian Likelihood

In this section, we demonstrate the capabilities of GDGP under a heteroskedastic Gaussian likelihood. For benchmarking, we compare GDGP with specialized heteroskedastic Gaussian process models, including the maximum-likelihood-based approach of Binois et al. (2018), hereinafter denoted as hetGP and implemented in the R package hetGP, and the fully Bayesian framework of Patil et al. (2025), hereinafter denoted as bhetGP and implemented in the R package bhetGP. We have chosen functions that exhibit non-stationarity to demonstrate the efficacy of the GDGP framework, which is designed for these settings.

5.1.1 Heteroskedastic step function

Consider a synthetic simulator whose output $y$ at a given location $x$ follows a Gaussian distribution $\mathcal{N}(\mu(x), \sigma^2(x))$, where
$$\mu(x) = \begin{cases} -1, & x < 0.5, \\ 1, & x \geq 0.5, \end{cases} \qquad \sigma^2(x) = \frac{1}{600}\sin(4x - 2) + 10\exp\{-1200(2x - 1)^2\} + 1.$$
In this experiment, we evaluate the performance of different models by assessing their ability to reconstruct the true mean function $\mu(x)$ and log-variance function $\log \sigma^2(x)$ for $x \in [0, 1]$. Performance is quantified using the normalized root mean squared error (NRMSE; Ming et al. 2023, Section 4) and the negative continuous ranked probability score (NCRPS; Gneiting & Raftery 2007, Section 4.2).
The former measures the deterministic accuracy of model predictions, while the latter evaluates the quality of uncertainty quantification. Specifically, we select 100 input locations uniformly spaced over $[0, 1]$ as the training inputs. At each training input location, we generate $R \in \{20, 40, 60, 80, 100\}$ replicates from the synthetic simulator as training outputs. For each value of $R$, we train each of the three model candidates and repeat the training procedure 100 times. NRMSE and NCRPS are then evaluated on a test set obtained by evaluating $\mu(x)$ and $\log \sigma^2(x)$ at 1,000 evenly spaced input locations over $[0, 1]$.

Figure 3 compares the performance of hetGP, bhetGP, and GDGP. The results show that GDGP achieves the best overall performance in terms of both NRMSE and NCRPS. In addition, for GDGP, both NRMSE and NCRPS decrease steadily as the number of replicates increases. In particular, for the mean function $\mu(x)$, GDGP substantially outperforms both hetGP and bhetGP, owing to its ability to capture non-stationary structures, such as the step function in this example. For the log-variance function $\log \sigma^2(x)$, both hetGP and GDGP achieve lower NRMSE and NCRPS than bhetGP across all replicate settings. Although hetGP attains lower NRMSE than GDGP when the number of replicates is small (i.e., $R = 20$ and $40$, likely because SI struggles to identify the weaker non-stationarity in the log-variance process with few replicates), GDGP yields lower NRMSE on average as the number of replicates increases, and consistently achieves lower NCRPS across different numbers of replicates.

5.1.2 Susceptible, Infective, and Recovered (SIR) model

In this experiment, we consider a stochastic Susceptible-Infective-Recovered (SIR) simulator obtained by extending the benchmark model in Binois & Gramacy (2021) distributed with the hetGP package.
Relative to the original two-input formulation, which varies only the initial susceptible and infected populations, the modified simulator is defined on a five-dimensional unit hypercube, with inputs determining the initial susceptible population ($S_0$), the initial infected population ($I_0$), the transmission rate ($\beta$), the recovery rate ($\gamma$), and the external connectivity index ($c_{\mathrm{ext}}$), which controls the external infectious pressure. When $c_{\mathrm{ext}} < 0.5$, the simulator yields an effectively isolated population and hence a closed-population epidemic regime with no importation, whereas when $c_{\mathrm{ext}} \geq 0.5$, it yields an externally seeded epidemic regime in which higher values of $c_{\mathrm{ext}}$ correspond to greater exogenous infection pressure. Consequently, the simulator combines smooth variation in epidemic parameters with a regime change associated with the onset of external seeding, and outputs the cumulative attack proportion ($I_{\mathrm{pop}}$) at the time horizon of 100, thereby providing a useful test problem for evaluating emulator performance under non-stationarity.

Figure 3: NRMSEs (lower is better) and NCRPSs (lower is better) for hetGP, bhetGP, and GDGP, trained on 100 unique input locations with $R \in \{20, 40, 60, 80, 100\}$ replicates per location, for emulating the mean function $\mu(x)$ (panels (a), (b)) and log-variance function $\log \sigma^2(x)$ (panels (c), (d)) of the synthetic simulator described in Section 5.1.1. Results are evaluated using 1,000 validation points and summarized over 100 independent training trials.
We use increasing numbers of Latin hypercube design points, $n \in \{100, 200, 300, 400, 500\}$, as the unique training input locations. For each given $n$, the SIR simulator is evaluated with a randomly selected number of replicates between 1 and 100 at each input location. The resulting emulators are then assessed on a test set of size $N = 60{,}000$, consisting of 2,000 unique space-filling input locations with 30 replicates per location. Performance is measured using the average scoring rule (Score; Gneiting & Raftery 2007, Equation (27)):
$$\mathrm{Score} = -\frac{1}{N} \sum_{i=1}^N \left[\frac{\big(y_i - \tilde{\mu}(\mathbf{x}_i)\big)^2}{\tilde{\sigma}^2(\mathbf{x}_i)} + \log \tilde{\sigma}^2(\mathbf{x}_i)\right],$$
where $y_i$, for $i = 1, \ldots, N$, denote the test outputs, and $\tilde{\mu}(\mathbf{x}_i)$ and $\tilde{\sigma}^2(\mathbf{x}_i)$ are the predictive mean and variance produced by an emulator at the test input $\mathbf{x}_i$. For each $n$, we repeat the above procedure 20 times.

Figure 4 compares the performance of the three candidate models across different numbers of unique training locations. As the training size increases, the performance of all models improves. Notably, GDGP substantially outperforms both hetGP and bhetGP: with only 200 training locations, GDGP achieves scores comparable to those obtained by hetGP and bhetGP using 500 locations. Moreover, GDGP exhibits smaller variability in scores across repeated trials.

Figure 5 illustrates the computational benefits and predictive performance of our Vecchia-based scalability development for GDGP, utilizing in particular Proposition 4.5 in Section 4.2.
Figure 4: Scores (higher is better) for hetGP, bhetGP, and GDGP, trained on $n \in \{100, 200, 300, 400, 500\}$ unique input locations with a random number of replicates between 1 and 100 at each location, for emulating the cumulative attack proportion of the SIR simulator in Section 5.1.2 at a time horizon of 100. Scores are evaluated on a test set of size $N = 60{,}000$, consisting of 2,000 space-filling input locations with 30 replicates at each location, and summarized over 20 independent training trials for each $n$.

As shown in Figure 5(a), increasing the conditioning set size for training (while keeping the conditioning set size for prediction sufficiently large at 200) improves predictive performance, and the Vecchia-approximated GDGP approaches the full GDGP (i.e., the non-Vecchia version, in which all unique training inputs are included in the conditioning set and inference is implemented using Proposition 4.3 in Section 4.2). Notably, conditioning set sizes of 25 and 50 already yield satisfactory performance close to that of the full GDGP across different training sizes. Together with Figure 5(b), we observe that the Vecchia-approximated GDGP with small conditioning set sizes can substantially reduce computation while achieving accuracy comparable to that of the full GDGP as the training size increases.
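For reference, the average scoring rule used in this section is straightforward to compute from an emulator's predictive means and variances; a minimal sketch (an illustrative helper, not part of the dgpsi package):

```python
import numpy as np

def gaussian_score(y, mu, var):
    """Average scoring rule of Gneiting & Raftery (2007, Eq. (27)); higher is
    better. y holds test outputs; mu and var hold the emulator's predictive
    means and variances at the corresponding test inputs."""
    y, mu, var = (np.asarray(a, dtype=float) for a in (y, mu, var))
    return -np.mean((y - mu) ** 2 / var + np.log(var))
```

For example, a perfect mean prediction with unit predictive variance contributes $-\log 1 = 0$ to the score, while inflating the variance to $e$ costs an extra $\log e = 1$ per point.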
Figure 5: Scores (a) and computation times in minutes (b), averaged over 20 repeated training trials, for the full (i.e., non-Vecchia-approximated) GDGP and Vecchia-approximated GDGPs with training conditioning set sizes in $\{5, 15, 25, 50, 75\}$ and prediction conditioning set size fixed at 200. Models are trained on datasets with $n = 100, 200, \ldots, 1000$ unique input locations, where the outputs at each location are generated a random number of times (between 1 and 100) by the SIR simulator of Section 5.1.2. Scores are evaluated on a test set of size $N = 60{,}000$, consisting of 2,000 space-filling input locations with 30 replicates at each location.

5.2 Categorical Likelihood

In this section, we assess GDGP emulation in classification settings using a categorical likelihood and a softmax inverse link. We compare several approaches, including GDGP, the Vecchia-approximated GDGP (vGDGP) with conditioning set sizes of 50 for training and 200 for prediction, the Bayesian GP classifier (bGPC; Williams & Barber 1998) implemented in the R package kernlab, and the sparse variational GP (SVGP; Hensman et al. 2013) implemented in the Python package gpflow. To provide a broader reference point beyond GP-based methods, we also report results for a neural network (NNet; implemented in the R package nnet) and a random forest (RF; implemented in the R package randomForest) classifier, which are widely used in practical classification tasks. For comparison, we use a selection of widely adopted open benchmark datasets (summarized in Table 1), which we view as observations generated by black-box simulators.
To evaluate the performance of different models, we remove duplicated data points from the raw data extracted from the corresponding R packages and create 20 random 90%/10% training/testing partitions, with class distributions approximately balanced within the splits. For each partition, we train the models on the training set and evaluate performance on the test set using the accuracy and logloss metrics, defined as follows:
$$\mathrm{Accuracy} = \frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} \frac{1}{n_c} \sum_{i : y_i = c} \mathbb{1}\{\hat{y}_i = c\}, \quad (22)$$
$$\mathrm{Logloss} = -\frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} \frac{1}{n_c} \sum_{i : y_i = c} \log \tilde{\mu}_{p_c}(\mathbf{x}_i), \quad (23)$$
where $\mathcal{C}$ denotes the set of all classes; $y_i$ is the true class of the $i$-th test observation; $n_c = |\{i : y_i = c\}|$ is the number of test observations whose true class is $c$; $\hat{y}_i = \mathrm{argmax}_{k \in \mathcal{C}}\, \tilde{\mu}_{p_k}(\mathbf{x}_i)$ is the predicted class for the $i$-th test observation; and $\tilde{\mu}_{p_k}(\mathbf{x}_i)$ is the predicted mean probability of class $k$ at input location $\mathbf{x}_i$.

Table 1: Characteristics of the datasets used in the experiments of Section 5.2. The Source column indicates the R packages from which the datasets are obtained.

Dataset | # Instances | # Dimensions | # Classes | Source
iris | 149 | 4 | 3 | datasets
pima | 768 | 8 | 2 | mlbench
thyroid | 215 | 5 | 3 | mclust
vehicle | 846 | 18 | 4 | mlbench

Table 2 reports the means and standard deviations (across the 20 random partitions) of accuracy and logloss for all models. GDGP achieves the best overall performance on iris, pima, thyroid, and vehicle, with consistently higher accuracy and lower logloss than the competing methods. As expected, vGDGP exhibits some degradation due to the Vecchia approximation, but it remains competitive across all datasets. We also note that, although bGPC and NNet attain comparable accuracy on some datasets, their logloss values are substantially larger, indicating poorer probabilistic calibration.
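The class-balanced accuracy and logloss in Equations (22) and (23) can be sketched as follows (an illustrative implementation; the rows of `prob` are assumed to hold the predicted mean class probabilities $\tilde{\mu}_{p_k}(\mathbf{x}_i)$ for each test observation):

```python
import numpy as np

def balanced_accuracy(y_true, prob, classes):
    """Equation (22): per-class accuracy, averaged over the classes."""
    y_true = np.asarray(y_true)
    y_hat = np.asarray(classes)[np.argmax(prob, axis=1)]
    return float(np.mean([np.mean(y_hat[y_true == c] == c) for c in classes]))

def balanced_logloss(y_true, prob, classes):
    """Equation (23): negative mean log predicted probability of the true
    class, computed per class and then averaged over the classes."""
    y_true = np.asarray(y_true)
    per_class = [-np.mean(np.log(prob[y_true == c, j]))
                 for j, c in enumerate(classes)]
    return float(np.mean(per_class))
```

Averaging per class first means each class contributes equally regardless of its test-set frequency, which matches the approximately class-balanced partitions used above.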
5.3 Count Likelihoods

In this section, we examine the performance of GDGP with count likelihoods for simulators whose outputs are non-negative integer-valued counts, and consider the Rosenzweig-MacArthur predator-prey model, which incorporates density-dependent prey growth and a nonlinear Type-II functional response for the predator (Rosenzweig & MacArthur 1963, Pineda-Krch 2008). In particular, the simulator is evaluated over four inputs: the predator death rate ($d_F \in [0.1, 2.0]$), the prey death rate ($d_R \in [0.1, 1.8]$), the predation efficiency ($\alpha \in [0.01, 0.02]$), and the degree of predator saturation ($w \in [0, 0.04]$), and returns the prey population count ($R$) at time 100, given fixed initial populations of $R_0 = 50$ prey and $F_0 = 5$ predators. We generate training data by evaluating the simulator at 300 unique input design points, with a random number of replicates (i.e., runs) between 1 and 30 assigned to each design point.

Table 2: Means and standard deviations (in parentheses), computed across the 20 random partitions of the datasets listed in Table 1, for accuracy (higher is better; see Equation (22)) and logloss (lower is better; see Equation (23)) for GDGP, Vecchia-approximated GDGP (vGDGP), Bayesian GP classifier (bGPC), sparse variational GP (SVGP), neural network (NNet), and random forest (RF). The highest accuracy and lowest logloss for each dataset are highlighted in bold.

Model | Metric | iris | pima | thyroid | vehicle
GDGP | Accuracy | 96.00 (4.63) | 73.27 (4.55) | 98.22 (3.72) | 80.14 (3.19)
GDGP | Logloss | 0.1153 (.0712) | 0.5262 (.0560) | 0.0717 (.0469) | 0.4353 (.0554)
vGDGP | Accuracy | 95.00 (5.82) | 71.38 (4.53) | 95.00 (5.67) | 78.07 (3.65)
vGDGP | Logloss | 0.1489 (.1028) | 0.5438 (.0489) | 0.1268 (.0638) | 0.4830 (.0596)
bGPC | Accuracy | 94.50 (5.25) | 72.50 (4.30) | 94.89 (6.65) | 72.49 (4.39)
bGPC | Logloss | 0.3844 (.0624) | 0.5299 (.0455) | 0.6176 (.0429) | 0.8389 (.0317)
SVGP | Accuracy | 94.42 (5.28) | 72.39 (4.58) | 93.67 (7.64) | 75.51 (5.72)
SVGP | Logloss | 0.1529 (.0539) | 0.5429 (.0381) | 0.2334 (.1239) | 0.6232 (.0839)
NNet | Accuracy | 95.42 (4.77) | 67.55 (6.17) | 95.89 (5.11) | 79.31 (3.32)
NNet | Logloss | 1.0027 (1.3361) | 2.0777 (1.2169) | 1.0545 (1.5007) | 5.2156 (1.1464)
RF | Accuracy | 94.17 (5.09) | 73.15 (4.37) | 96.33 (5.06) | 74.74 (4.32)
RF | Logloss | 0.1516 (.1059) | 0.5394 (.0601) | 0.1184 (.0658) | 0.4971 (.0404)

The resulting training set is used to fit GDGP, SVGP, and a generalized linear model (GLM) with a Poisson likelihood. We evaluate the three models using the negative log-likelihood (NLL) on a test set consisting of simulator evaluations at 1,000 unique Latin hypercube test sites, with 50 replicates per site. Figure 6(a) compares the performance of the three models across 50 training trials and shows that GDGP substantially outperforms both the GLM and SVGP in emulating the simulator, achieving noticeably lower NLL overall and smaller variation across trials.

Figure 6: Negative log-likelihood (NLL in log-scale; lower is better) for (a) the generalized linear model (GLM), SVGP, and GDGP with a Poisson likelihood, and (b) GDGPs with negative binomial (NB), zero-inflated Poisson (ZIP), and zero-inflated negative binomial (ZINB) likelihoods, trained on 300 unique input locations with up to 30 replicates per location, for emulating the prey population count of the predator-prey simulator in Section 5.3.
NLLs are evaluated on a test set of size $N = 50{,}000$, consisting of 1,000 space-filling input locations with 50 replicates per location, and summarized over 50 independent training trials.

To demonstrate the flexibility of GDGP and assess sensitivity to the choice of count likelihood, we extend the experiment to include GDGP with negative binomial (NB), zero-inflated Poisson (ZIP), and zero-inflated negative binomial (ZINB) likelihoods. This is motivated by the fact that the discrete birth-death and predation reactions in the simulator can induce substantial replicate-to-replicate variability in the terminal counts (e.g., some trajectories may reach absorbing extinction states by the end of the simulation, yielding zero predator or prey counts), resulting in over-dispersion and occasional excess zeros in the simulated outputs. As shown in Figure 6(b), GDGPs with NB, ZIP, and ZINB likelihoods consistently achieve significantly lower NLL than GDGP with a Poisson likelihood. Among the three, GDGP with ZIP attains slightly lower NLL than GDGP with NB, suggesting that departures from the Poisson assumption are driven primarily by an excess of zeros. GDGP with ZINB achieves the lowest NLL, implying that it provides the best probabilistic fit by jointly modeling zero inflation and residual over-dispersion beyond the extra zeros present in the simulated counts.

6 Conclusion

In this work, we proposed a Generalized Deep Gaussian Process (GDGP) framework for emulating non-stationary computer models with non-Gaussian outputs. By introducing a likelihood layer at the top of the DGP hierarchy, the proposed framework extends DGP emulation beyond deterministic and homogeneous Gaussian settings to accommodate a broad class of response distributions.
This construction preserves the flexibility of DGPs for modeling non-stationary simulator behavior, while providing a unified approach to handling diverse simulator output types, including heteroskedastic, categorical, count, and zero-inflated outputs.

Inference for GDGP was developed within the Stochastic Imputation (SI) framework. To improve practicality in large-scale settings, we incorporated the Vecchia approximation to enable scalable inference for large numbers of input locations, and showed that SI can accommodate large numbers of replicates (within stochastic simulators) without increasing the computational complexity of inference. We also demonstrated that, in the heteroskedastic Gaussian case, additional analytical structure can be exploited to obtain closed-form conditional updates in the imputation step, yielding further computational gains. Through a series of synthetic and empirical examples covering heteroskedastic Gaussian, categorical, and count responses, we showed that these developments make GDGP a practically useful and flexible emulation framework. Although GDGP was developed in the context of computer model emulation, it may also be applied more broadly to general regression problems as a flexible alternative to traditional GLMs, for which the linear predictor may be too restrictive to capture complex nonlinear and non-stationary relationships. As part of this work, we also provide a careful implementation of the full GDGP framework in the open-source R package dgpsi.

Several directions for future work remain. On the methodological side, it would be of interest to extend the framework to more complex output structures, such as multivariate or mixed-type responses. From a computational perspective, further gains may be possible by identifying additional likelihoods that admit analytical conditional updates, in the same spirit as the heteroskedastic Gaussian case.
Another promising direction is to develop hybrid inference schemes at the likelihood layer, in which only parameters exhibiting clear input dependence are modeled through DGP outputs, while others are treated as global parameters and estimated by maximum likelihood. Such a strategy could reduce the number of latent GPs and further improve computational efficiency for multi-parameter likelihoods. Finally, although our package dgpsi already implements sequential design and Bayesian optimization for (D)GP emulators, developing these methods tailored specifically to the GDGP setting would be a natural and interesting next step.

Acknowledgments

The authors acknowledge support from the ADD-TREES project, funded by the Engineering and Physical Sciences Research Council (EPSRC) under grant EP/Y005597/1.

References

Baker, E., Barbillon, P., Fadikar, A., Gramacy, R. B., Herbei, R., Higdon, D., Huang, J., Johnson, L. R., Ma, P., Mondal, A. et al. (2022), 'Analyzing stochastic computer models: A review with opportunities', Statistical Science 37(1), 64–89.

Binois, M. & Gramacy, R. B. (2021), 'hetGP: Heteroskedastic Gaussian process modeling and sequential design in R', Journal of Statistical Software 98, 1–44.

Binois, M., Gramacy, R. B. & Ludkovski, M. (2018), 'Practical heteroscedastic Gaussian process modeling for large simulation experiments', Journal of Computational and Graphical Statistics 27(4), 808–821.

Celeux, G. & Diebolt, J. (1985), 'The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem', Computational Statistics Quarterly 2, 73–82.

Chang, W., Haran, M., Applegate, P. & Pollard, D. (2016), 'Calibrating an ice sheet model using high-dimensional binary spatial data', Journal of the American Statistical Association 111(513), 57–72.

Cole, D. A., Gramacy, R. B. & Ludkovski, M.
(2022), 'Large-scale local surrogate modeling of stochastic simulation experiments', Computational Statistics & Data Analysis 174, 107537.

Cooper, A., Booth, A. S. & Gramacy, R. B. (2026), 'Modernizing full posterior inference for surrogate modeling of categorical-output simulation experiments', Quality Engineering 38(1), 91–110.

Gelfand, A. E., Diggle, P., Guttorp, P. & Fuentes, M., eds (2010), Handbook of Spatial Statistics, 1st edn, CRC Press.

Gneiting, T. & Raftery, A. E. (2007), 'Strictly proper scoring rules, prediction, and estimation', Journal of the American Statistical Association 102(477), 359–378.

Goldberg, P. W., Williams, C. K. & Bishop, C. M. (1997), 'Regression with input-dependent noise: a Gaussian process treatment', Advances in Neural Information Processing Systems 10, 493–499.

Gramacy, R. B. & Lee, H. K. H. (2008), 'Bayesian treed Gaussian process models with an application to computer modeling', Journal of the American Statistical Association 103(483), 1119–1130.

Guinness, J. (2021), 'Gaussian process learning via Fisher scoring of Vecchia's approximation', Statistics and Computing 31(3), 25.

Hensman, J., Fusi, N. & Lawrence, N. D. (2013), Gaussian processes for big data, in 'Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence', UAI'13, AUAI Press, Arlington, Virginia, USA, pp. 282–290.

Katzfuss, M. & Guinness, J. (2021), 'A general framework for Vecchia approximations of Gaussian processes', Statistical Science 36(1), 124–141.

Katzfuss, M., Guinness, J., Gong, W. & Zilber, D. (2020), 'Vecchia approximations of Gaussian-process predictions', Journal of Agricultural, Biological and Environmental Statistics 25, 383–414.

Katzfuss, M., Guinness, J. & Lawrence, E. (2022), 'Scaled Vecchia approximation for fast computer-model emulation', SIAM/ASA Journal on Uncertainty Quantification 10(2), 537–554.

Kyzyurova, K. N., Berger, J. O. & Wolpert, R. L.
(2018), ‘Coupling computer models through linking their statistical emulators’, SIAM/ASA Journal on Uncertainty Quantification 6 (3), 1151–1171. MacKay , D. J. (1992), ‘The evidence framew ork applied to classification networks’, Neural computation 4 (5), 720–736. Ming, D. & Guillas, S. (2021), ‘Linked Gaussian process emulation for systems of computer models using Matérn kernels and adapti ve design’, SIAM/ASA J ournal on Uncertainty Quantification . arXiv:1912.09468. In press. Ming, D., Williamson, D. & Guillas, S. (2023), ‘Deep Gaussian process emulation using stochastic imputation’, T echnometrics 65 (2), 150–161. Murph, A. C., Strait, J. D., Moran, K. R., Hyman, J. D., V iswanathan, H. S. & Stauffer , P . H. (2024), ‘Sensitivity analysis in the presence of intrinsic stochasticity for discrete fracture network simulations’, Journal of Geophysical Resear ch: Machine Learning and Computation 1 (3), e2023JH000113. 19 M I N G & W I L L I A M S O N P R E P R I N T Murray , I., Adams, R. & MacKay , D. (2010), Elliptical slice sampling, in ‘Proceedings of the thirteenth international conference on artificial intelligence and statistics’, JMLR W orkshop and Conference Proceedings, pp. 541–548. Patil, P . V ., Gramacy , R. B., Carey , C. C. & Thomas, R. Q. (2025), ‘V ecchia approximated bayesian heteroskedastic Gaussian processes’, arXiv pr eprint arXiv:2507.07815 . Pineda-Krch, M. (2008), ‘GillespieSSA: implementing the Gillespie stochastic simulation algorithm in R’, Journal of Statistical Softwar e 25 , 1–18. Rosenzweig, M. L. & MacArthur , R. H. (1963), ‘Graphical representation and stability conditions of predator-pre y interactions’, The American Naturalist 97 (895), 209–223. Salimbeni, H. & Deisenroth, M. (2017), Doubly stochastic variational inference for deep Gaussian processes, in ‘ Advances in Neural Information Processing Systems’, pp. 4588–4599. Salter , J. M., McKinley , T . J., Xiong, X. & W illiamson, D. B. 
(2025), ‘Emulating computer models with high- dimensional count output’, Philosophical T ransactions of the Royal Society A: Mathematical, Physical and Engineer - ing Sciences 383 (2292). Sauer , A., Cooper, A. & Gramacy , R. B. (2023), ‘V ecchia-approximated deep Gaussian processes for computer experiments’, J ournal of Computational and Graphical Statistics 32 (3), 824–837. Sauer , A., Gramacy , R. B. & Higdon, D. (2020), ‘ Activ e learning for deep Gaussian process surrogates’, arXiv:2012.08015 . Spiller , E. T ., W olpert, R. L., Tierz, P . & Asher, T . G. (2023), ‘The zero problem: Gaussian process emulators for range-constrained computer models’, SIAM/ASA Journal on Uncertainty Quantification 11 (2), 540–566. T anner , M. A. (1993), T ools for Statistical Infer ence: Methods for the Exploration of P osterior Distributions and Likelihood Functions , 2nd edn, Springer . T itsias, M. (2009), V ariational learning of inducing v ariables in sparse Gaussian processes, in ‘ Artificial Intelligence and Statistics’, pp. 567–574. V ecchia, A. V . (1988), ‘Estimation and model identification for continuous spatial processes’, Journal of the Royal Statistical Society Series B: Statistical Methodology 50 (2), 297–312. W illiams, C. K. & Barber , D. (1998), ‘Bayesian classification with Gaussian processes’, IEEE T ransactions on pattern analysis and machine intelligence 20 (12), 1342–1351. Y i, S.-r . & T aflanidis, A. A. (2024), ‘Stochastic emulation with enhanced partial-and no-replication strategies for seismic response distribution estimation’, Earthquak e Engineering & Structural Dynamics 53 (7), 2354–2381. A ppendix A Closed-f orm Predicti ve Means and V ariances Let µ ( q ) 0 ,k and σ 2 ( q ) 0 ,k denote the mean and variance of the LGP approximation, b p ( f ( q ) 0 | x 0 ; f k , { w ( p ) l } k , x ) , at input location x 0 , for q = 1 , . . . , Q . 
Then, under a range of parametric choices for $p(y \mid \boldsymbol{\phi})$, the predictive mean $\tilde{\mu}^Y_{0,k}$ and variance $(\tilde{\sigma}^Y_{0,k})^2$ admit closed-form expressions. Table A.1 lists several representative distributions for which such expressions are available. The corresponding closed-form results are given below.

A.1 Poisson

$$\tilde{\mu}^Y_{0,k} = \exp\Big\{\mu^{(1)}_{0,k} + \tfrac{\sigma^{2(1)}_{0,k}}{2}\Big\}, \qquad \big(\tilde{\sigma}^Y_{0,k}\big)^2 = \exp\Big\{\mu^{(1)}_{0,k} + \tfrac{\sigma^{2(1)}_{0,k}}{2}\Big\} + \Big(e^{\sigma^{2(1)}_{0,k}} - 1\Big)\exp\Big\{2\mu^{(1)}_{0,k} + \sigma^{2(1)}_{0,k}\Big\}.$$

A.2 Exponential

$$\tilde{\mu}^Y_{0,k} = \exp\Big\{\mu^{(1)}_{0,k} + \tfrac{\sigma^{2(1)}_{0,k}}{2}\Big\}, \qquad \big(\tilde{\sigma}^Y_{0,k}\big)^2 = \Big(2e^{\sigma^{2(1)}_{0,k}} - 1\Big)\exp\Big\{2\mu^{(1)}_{0,k} + \sigma^{2(1)}_{0,k}\Big\}.$$

A.3 Gamma

$$\tilde{\mu}^Y_{0,k} = \exp\Big\{\mu^{(1)}_{0,k} + \tfrac{\sigma^{2(1)}_{0,k}}{2}\Big\}, \qquad \big(\tilde{\sigma}^Y_{0,k}\big)^2 = \Big(e^{\sigma^{2(1)}_{0,k}} + e^{\sigma^{2(1)}_{0,k} + 2\mu^{(2)}_{0,k} + 2\sigma^{2(2)}_{0,k}} - 1\Big)\exp\Big\{2\mu^{(1)}_{0,k} + \sigma^{2(1)}_{0,k}\Big\}.$$

A.4 Heteroskedastic Gaussian

$$\tilde{\mu}^Y_{0,k} = \mu^{(1)}_{0,k}, \qquad \big(\tilde{\sigma}^Y_{0,k}\big)^2 = \exp\Big\{\mu^{(2)}_{0,k} + \tfrac{\sigma^{2(2)}_{0,k}}{2}\Big\} + \sigma^{2(1)}_{0,k}.$$

A.5 Negative Binomial

$$\tilde{\mu}^Y_{0,k} = \exp\Big\{\mu^{(1)}_{0,k} + \tfrac{\sigma^{2(1)}_{0,k}}{2}\Big\},$$
$$\big(\tilde{\sigma}^Y_{0,k}\big)^2 = \Big(e^{\sigma^{2(1)}_{0,k}} - 1\Big)\exp\Big\{2\mu^{(1)}_{0,k} + \sigma^{2(1)}_{0,k}\Big\} + \exp\Big\{\mu^{(1)}_{0,k} + \tfrac{\sigma^{2(1)}_{0,k}}{2}\Big\} + \exp\Big\{\mu^{(2)}_{0,k} + \tfrac{\sigma^{2(2)}_{0,k}}{2} + 2\mu^{(1)}_{0,k} + 2\sigma^{2(1)}_{0,k}\Big\}.$$

A.6 Zero-Inflated Poisson

$$\tilde{\mu}^Y_{0,k} = (1 - \bar{\pi}_{0,k})\exp\Big\{\mu^{(1)}_{0,k} + \tfrac{\sigma^{2(1)}_{0,k}}{2}\Big\},$$
$$\big(\tilde{\sigma}^Y_{0,k}\big)^2 = (1 - \bar{\pi}_{0,k})\bigg[\exp\Big\{\mu^{(1)}_{0,k} + \tfrac{\sigma^{2(1)}_{0,k}}{2}\Big\} + \Big(e^{\sigma^{2(1)}_{0,k}} - 1\Big)\exp\Big\{2\mu^{(1)}_{0,k} + \sigma^{2(1)}_{0,k}\Big\}\bigg] + \bar{\pi}_{0,k}(1 - \bar{\pi}_{0,k})\exp\Big\{2\mu^{(1)}_{0,k} + \sigma^{2(1)}_{0,k}\Big\},$$

where

$$\bar{\pi}_{0,k} = \mathrm{logit}^{-1}\!\left(\frac{\mu^{(2)}_{0,k}}{\sqrt{1 + \frac{\pi}{8}\sigma^{2(2)}_{0,k}}}\right),$$

obtained using the analytical approximation, suggested by MacKay (1992), to the expectation of the logistic function of a Gaussian random variable.
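The Poisson expressions above are lognormal-moment identities, and $\bar{\pi}_{0,k}$ uses MacKay's logistic-Gaussian approximation. As a quick numerical sanity check (not part of the original derivation; the latent mean and variance values below are illustrative), both can be verified by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, s2 = 0.5, 0.3  # illustrative latent LGP mean and variance

# Monte Carlo: f ~ N(mu, s2), Y | f ~ Poisson(exp(f))
f = rng.normal(mu, np.sqrt(s2), size=2_000_000)
y = rng.poisson(np.exp(f))

# Closed forms from Section A.1
mean_cf = np.exp(mu + s2 / 2)
var_cf = mean_cf + (np.exp(s2) - 1) * np.exp(2 * mu + s2)
assert abs(y.mean() - mean_cf) < 0.01
assert abs(y.var() - var_cf) < 0.05

# MacKay (1992) approximation to E[logit^{-1}(X)] for X ~ N(mu, s2)
x = rng.normal(mu, np.sqrt(s2), size=2_000_000)
mc = (1.0 / (1.0 + np.exp(-x))).mean()
approx = 1.0 / (1.0 + np.exp(-mu / np.sqrt(1 + np.pi * s2 / 8)))
assert abs(mc - approx) < 0.02
```

The same style of check applies to the other rows of Table A.1, replacing the Poisson sampler with the corresponding likelihood.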
A.7 Zero-Inflated Negative Binomial

$$\tilde{\mu}^Y_{0,k} = (1 - \bar{\pi}_{0,k})\exp\Big\{\mu^{(1)}_{0,k} + \tfrac{\sigma^{2(1)}_{0,k}}{2}\Big\},$$
$$\big(\tilde{\sigma}^Y_{0,k}\big)^2 = (1 - \bar{\pi}_{0,k})\bigg[\exp\Big\{\mu^{(1)}_{0,k} + \tfrac{\sigma^{2(1)}_{0,k}}{2}\Big\} + \exp\Big\{2\mu^{(1)}_{0,k} + 2\sigma^{2(1)}_{0,k} + \mu^{(2)}_{0,k} + \tfrac{\sigma^{2(2)}_{0,k}}{2}\Big\} + \Big(e^{\sigma^{2(1)}_{0,k}} - 1\Big)\exp\Big\{2\mu^{(1)}_{0,k} + \sigma^{2(1)}_{0,k}\Big\}\bigg] + \bar{\pi}_{0,k}(1 - \bar{\pi}_{0,k})\exp\Big\{2\mu^{(1)}_{0,k} + \sigma^{2(1)}_{0,k}\Big\},$$

where

$$\bar{\pi}_{0,k} = \mathrm{logit}^{-1}\!\left(\frac{\mu^{(3)}_{0,k}}{\sqrt{1 + \frac{\pi}{8}\sigma^{2(3)}_{0,k}}}\right).$$

Table A.1: Different choices of $p(y \mid \boldsymbol{\phi})$ for which the predictive mean $\tilde{\mu}^Y_{0,k}$ and variance $(\tilde{\sigma}^Y_{0,k})^2$ admit closed-form expressions.

| Distribution | Parameter(s) $\boldsymbol{\phi}$ | Link function(s) $g$ | Probability function $p(y \mid \boldsymbol{\phi})$ |
|---|---|---|---|
| Poisson | $\phi_1 = \lambda \in (0,\infty)$ | log | $\frac{\lambda^y}{y!}e^{-\lambda}$ |
| Exponential | $\phi_1 = \mu \in (0,\infty)$ | log | $\frac{1}{\mu}e^{-y/\mu}$ |
| Gamma | $\phi_1 = \mu \in (0,\infty)$, $\phi_2 = \sigma \in (0,\infty)$ | log, log | $\frac{y^{1/\sigma^2 - 1}\, e^{-y/(\mu\sigma^2)}}{(\mu\sigma^2)^{1/\sigma^2}\,\Gamma(1/\sigma^2)}$ |
| Heteroskedastic Gaussian | $\phi_1 = \mu \in \mathbb{R}$, $\phi_2 = \sigma^2 \in (0,\infty)$ | identity, log | $\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}}$ |
| Negative Binomial | $\phi_1 = \mu \in (0,\infty)$, $\phi_2 = \sigma \in (0,\infty)$ | log, log | $\frac{\Gamma(y + 1/\sigma)}{\Gamma(1/\sigma)\Gamma(y+1)}\left(\frac{\sigma\mu}{1+\sigma\mu}\right)^y\left(\frac{1}{1+\sigma\mu}\right)^{1/\sigma}$ |
| Zero-Inflated Poisson | $\phi_1 = \lambda \in (0,\infty)$, $\phi_2 = \pi \in (0,1)$ | log, logit | $\pi + (1-\pi)e^{-\lambda}$ if $y = 0$; $(1-\pi)\frac{\lambda^y}{y!}e^{-\lambda}$ if $y > 0$ |
| Zero-Inflated Negative Binomial | $\phi_1 = \mu \in (0,\infty)$, $\phi_2 = \sigma \in (0,\infty)$, $\phi_3 = \pi \in (0,1)$ | log, log, logit | $\pi + (1-\pi)\left(\frac{1}{1+\sigma\mu}\right)^{1/\sigma}$ if $y = 0$; $(1-\pi)\frac{\Gamma(y+1/\sigma)}{\Gamma(1/\sigma)\Gamma(y+1)}\left(\frac{\sigma\mu}{1+\sigma\mu}\right)^y\left(\frac{1}{1+\sigma\mu}\right)^{1/\sigma}$ if $y > 0$ |

Supplementary Materials

S.1 Proof of Proposition 4.1

Since, when $l = 1$, the LGP reduces to a GP and, under the Vecchia approximation, the predictive mean and variance of a GP at a single input location $\mathbf{x}_0$ coincide with those of the nearest-neighbor GP (Katzfuss et al. 2020), it follows that, under the Vecchia approximation, the LGP mean and variance for a GP node in the first layer ($l = 1$) are given by

$$\mu^{(q)}_{1\to 1,k}(\mathbf{x}_0) = \mathbf{r}^{(q)}_{\mathcal{C}}(\mathbf{x}_0)^{\top}\big(\mathbf{R}^{(q)}_{1,\mathcal{C}}\big)^{-1}\mathbf{w}^{(q)}_{1,k,\mathcal{C}},$$
$$\sigma^{2(q)}_{1\to 1,k}(\mathbf{x}_0) = \big(\sigma^{(q)}_{1}\big)^{2}\Big[1 + \eta^{(q)}_{1} - \mathbf{r}^{(q)}_{\mathcal{C}}(\mathbf{x}_0)^{\top}\big(\mathbf{R}^{(q)}_{1,\mathcal{C}}\big)^{-1}\mathbf{r}^{(q)}_{\mathcal{C}}(\mathbf{x}_0)\Big],$$

where $\mathcal{C} \subseteq \{1, 2, \dots, N\}$ is a conditioning index set of size $|\mathcal{C}|$. This establishes equations (17) and (18). Analogously, for a particular GP node $\mathcal{GP}^{(q)}_l$ with $l \geq 2$, its predictive mean $\mu^{(q)}_{l,k}$ and variance $\sigma^{2(q)}_{l,k}$ under the Vecchia approximation for imputation $k$ are given by

$$\mu^{(q)}_{l,k} = \mathbf{r}^{(q)}_{l,k,\mathcal{C}}(\mathbf{W}_{0,l-1,k})^{\top}\big(\mathbf{R}^{(q)}_{l,k,\mathcal{C}}\big)^{-1}\mathbf{w}^{(q)}_{l,k,\mathcal{C}},$$
$$\sigma^{2(q)}_{l,k} = \big(\sigma^{(q)}_{l}\big)^{2}\Big[1 + \eta^{(q)}_{l} - \mathbf{r}^{(q)}_{l,k,\mathcal{C}}(\mathbf{W}_{0,l-1,k})^{\top}\big(\mathbf{R}^{(q)}_{l,k,\mathcal{C}}\big)^{-1}\mathbf{r}^{(q)}_{l,k,\mathcal{C}}(\mathbf{W}_{0,l-1,k})\Big],$$

where $\mathbf{W}_{0,l-1,k} = \big(W^{(1)}_{0,l-1,k}, \dots, W^{(P_{l-1})}_{0,l-1,k}\big)$ with $W^{(q)}_{0,l-1,k} \sim \mathcal{N}\big(\mu^{(q)}_{1\to(l-1),k}(\mathbf{x}_0),\, \sigma^{2(q)}_{1\to(l-1),k}(\mathbf{x}_0)\big)$. Applying the laws of total expectation and variance to $W^{(q)}_{0,l,k}$ then gives

$$\mu^{(q)}_{1\to l,k}(\mathbf{x}_0) = \mathbb{E}\big[\mathbf{r}^{(q)}_{l,k,\mathcal{C}}(\mathbf{W}_{0,l-1,k})\big]^{\top}\big(\mathbf{R}^{(q)}_{l,k,\mathcal{C}}\big)^{-1}\mathbf{w}^{(q)}_{l,k,\mathcal{C}},$$
$$\sigma^{2(q)}_{1\to l,k}(\mathbf{x}_0) = \mathbf{w}^{(q)\top}_{l,k,\mathcal{C}}\big(\mathbf{R}^{(q)}_{l,k,\mathcal{C}}\big)^{-1}\mathbb{E}\big[\mathbf{r}^{(q)}_{l,k,\mathcal{C}}(\mathbf{W}_{0,l-1,k})\,\mathbf{r}^{(q)}_{l,k,\mathcal{C}}(\mathbf{W}_{0,l-1,k})^{\top}\big]\big(\mathbf{R}^{(q)}_{l,k,\mathcal{C}}\big)^{-1}\mathbf{w}^{(q)}_{l,k,\mathcal{C}} - \Big(\mathbb{E}\big[\mathbf{r}^{(q)}_{l,k,\mathcal{C}}(\mathbf{W}_{0,l-1,k})\big]^{\top}\big(\mathbf{R}^{(q)}_{l,k,\mathcal{C}}\big)^{-1}\mathbf{w}^{(q)}_{l,k,\mathcal{C}}\Big)^{2} + \big(\sigma^{(q)}_{l}\big)^{2}\Big[1 + \eta^{(q)}_{l} - \mathrm{tr}\Big(\big(\mathbf{R}^{(q)}_{l,k,\mathcal{C}}\big)^{-1}\mathbb{E}\big[\mathbf{r}^{(q)}_{l,k,\mathcal{C}}(\mathbf{W}_{0,l-1,k})\,\mathbf{r}^{(q)}_{l,k,\mathcal{C}}(\mathbf{W}_{0,l-1,k})^{\top}\big]\Big)\Big],$$

where the expectations are taken with respect to $\mathbf{W}_{0,l-1,k}$.
Defining $\mathbf{I}^{(q)}_{l,k,\mathcal{C}}(\mathbf{x}_0) = \mathbb{E}\big[\mathbf{r}^{(q)}_{l,k,\mathcal{C}}(\mathbf{W}_{0,l-1,k})\big]$ and $\mathbf{J}^{(q)}_{l,k,\mathcal{C}}(\mathbf{x}_0) = \mathbb{E}\big[\mathbf{r}^{(q)}_{l,k,\mathcal{C}}(\mathbf{W}_{0,l-1,k})\,\mathbf{r}^{(q)}_{l,k,\mathcal{C}}(\mathbf{W}_{0,l-1,k})^{\top}\big]$, we then establish equations (15) and (16), with the $i$-th element of $\mathbf{I}^{(q)}_{l,k,\mathcal{C}}(\mathbf{x}_0)$ given by

$$\prod_{d=1}^{P_{l-1}} \mathbb{E}\Big[k^{(q)}_{l,d}\big(W^{(d)}_{0,l-1,k},\, (\mathbf{w}^{(d)}_{l-1,k})_{\mathcal{C}_i}\big)\Big]$$

and the $ij$-th element of $\mathbf{J}^{(q)}_{l,k,\mathcal{C}}(\mathbf{x}_0)$ given by

$$\prod_{d=1}^{P_{l-1}} \mathbb{E}\Big[k^{(q)}_{l,d}\big(W^{(d)}_{0,l-1,k},\, (\mathbf{w}^{(d)}_{l-1,k})_{\mathcal{C}_i}\big)\, k^{(q)}_{l,d}\big(W^{(d)}_{0,l-1,k},\, (\mathbf{w}^{(d)}_{l-1,k})_{\mathcal{C}_j}\big)\Big],$$

which can be expressed as

$$\prod_{d=1}^{P_{l-1}} \xi^{(q)}_{l}\Big(\mu^{(d)}_{1\to(l-1),k}(\mathbf{x}_0),\, \sigma^{2(d)}_{1\to(l-1),k}(\mathbf{x}_0),\, (\mathbf{w}^{(d)}_{l-1,k})_{\mathcal{C}_i}\Big)$$

and

$$\prod_{d=1}^{P_{l-1}} \zeta^{(q)}_{l}\Big(\mu^{(d)}_{1\to(l-1),k}(\mathbf{x}_0),\, \sigma^{2(d)}_{1\to(l-1),k}(\mathbf{x}_0),\, (\mathbf{w}^{(d)}_{l-1,k})_{\mathcal{C}_i},\, (\mathbf{w}^{(d)}_{l-1,k})_{\mathcal{C}_j}\Big),$$

respectively. These quantities can consequently be computed analytically, since $\xi^{(q)}_{l}(\cdot,\cdot,\cdot)$ and $\zeta^{(q)}_{l}(\cdot,\cdot,\cdot,\cdot)$ admit closed-form expressions (Ming & Guillas 2021, Appendix A) when the kernel functions $k^{(q)}_{l,d}(\cdot,\cdot)$ are squared exponential or Matérn.
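For intuition, the one-dimensional instance of the $\xi$ quantity under a squared-exponential kernel $k(x, x') = \exp\{-(x - x')^2 / (2\gamma^2)\}$ has the well-known Gaussian-integral closed form $\mathbb{E}[k(W, w)] = \frac{\gamma}{\sqrt{\gamma^2 + s^2}} \exp\{-(m - w)^2 / (2(\gamma^2 + s^2))\}$ for $W \sim \mathcal{N}(m, s^2)$. The sketch below checks this by Monte Carlo; all names and parameter values are illustrative and this is not the dgpsi implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def sq_exp(x, xp, gamma):
    """Squared-exponential kernel with lengthscale gamma."""
    return np.exp(-((x - xp) ** 2) / (2 * gamma ** 2))

def xi_closed_form(m, s2, w, gamma):
    """E[k(W, w)] for W ~ N(m, s2): Gaussian integral of a Gaussian kernel."""
    return gamma / np.sqrt(gamma ** 2 + s2) * np.exp(
        -((m - w) ** 2) / (2 * (gamma ** 2 + s2))
    )

m, s2, w, gamma = 0.2, 0.4, 1.0, 0.7
samples = rng.normal(m, np.sqrt(s2), size=1_000_000)
mc = sq_exp(samples, w, gamma).mean()
assert abs(mc - xi_closed_form(m, s2, w, gamma)) < 5e-3
```

The $\zeta$ quantity is the analogous expectation of a product of two kernel evaluations and admits a similar closed form (Ming & Guillas 2021, Appendix A).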
By completing the square, we can express $p\big(\boldsymbol{\mu} \mid \log \boldsymbol{\sigma}^2, \{\mathbf{w}^{(p)}_l\}, \mathbf{y}, \mathbf{x}\big)$ in the standard form of a multivariate normal, giving

$$\boldsymbol{\mu} \mid \log \boldsymbol{\sigma}^2, \{\mathbf{W}^{(p)}_l\}, \mathbf{Y}, \mathbf{x} \sim \mathcal{N}\Big(\big[\mathbf{M}^{\top}\boldsymbol{\Lambda}^{-1}\mathbf{M} + \boldsymbol{\Sigma}^{(1)}_{L}(\mathbf{w}_{L-1})^{-1}\big]^{-1}\mathbf{M}^{\top}\boldsymbol{\Lambda}^{-1}\mathbf{y},\; \big[\mathbf{M}^{\top}\boldsymbol{\Lambda}^{-1}\mathbf{M} + \boldsymbol{\Sigma}^{(1)}_{L}(\mathbf{w}_{L-1})^{-1}\big]^{-1}\Big),$$

which is equivalent to

$$\mathcal{N}\Big(\boldsymbol{\Sigma}^{(1)}_{L}(\mathbf{w}_{L-1})\big[\mathbf{I} + \mathbf{M}^{\top}\boldsymbol{\Lambda}^{-1}\mathbf{M}\,\boldsymbol{\Sigma}^{(1)}_{L}(\mathbf{w}_{L-1})\big]^{-1}\mathbf{M}^{\top}\boldsymbol{\Lambda}^{-1}\mathbf{y},\; \boldsymbol{\Sigma}^{(1)}_{L}(\mathbf{w}_{L-1})\big[\mathbf{I} + \mathbf{M}^{\top}\boldsymbol{\Lambda}^{-1}\mathbf{M}\,\boldsymbol{\Sigma}^{(1)}_{L}(\mathbf{w}_{L-1})\big]^{-1}\Big)$$

by factoring out $\boldsymbol{\Sigma}^{(1)}_{L}(\mathbf{w}_{L-1})^{-1}$ from the bracketed expression.

S.3 Proof of Proposition 4.4

Let $\mathbf{D} = \big(\mathbf{Y} \mid \log \boldsymbol{\sigma}^2, \{\mathbf{W}^{(p)}_l\}, \mathbf{x}\big) - \big(\boldsymbol{\mu} \mid \{\mathbf{W}^{(p)}_l\}, \mathbf{x}\big)$. Since $\mathbf{Y} \mid \log \boldsymbol{\sigma}^2, \{\mathbf{W}^{(p)}_l\}, \mathbf{x}, \boldsymbol{\mu} \stackrel{d}{=} \mathbf{Y} \mid \log \boldsymbol{\sigma}^2, \boldsymbol{\mu} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Gamma})$, where $\boldsymbol{\Gamma} = \mathrm{diag}(\sigma^2_1, \dots, \sigma^2_N)$, and $\boldsymbol{\mu} \mid \{\mathbf{W}^{(p)}_l\}, \mathbf{x} \stackrel{d}{=} \mathbf{F}^{(1)} \mid \mathbf{W}_{L-1} \sim \mathcal{N}\big(\mathbf{0}, \boldsymbol{\Sigma}^{(1)}_{L}(\mathbf{w}_{L-1})\big)$, we have that $\mathbf{D} \mid \boldsymbol{\mu} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Gamma})$. Since the conditional distribution of $\mathbf{D}$ given $\boldsymbol{\mu}$ is $\mathcal{N}(\mathbf{0}, \boldsymbol{\Gamma})$ for every $\boldsymbol{\mu}$, it follows that $\mathbf{D}$ and $\boldsymbol{\mu}$ are independent, with $\mathbf{D} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Gamma})$. Therefore,

$$\begin{pmatrix} \boldsymbol{\mu} \mid \{\mathbf{W}^{(p)}_l\}, \mathbf{x} \\ \mathbf{D} \end{pmatrix} = \begin{pmatrix} \mathbf{F}^{(1)} \mid \mathbf{W}_{L-1} \\ \mathbf{D} \end{pmatrix}$$

follows a joint multivariate normal distribution:

$$\mathcal{N}\left(\mathbf{0}, \begin{pmatrix} \boldsymbol{\Sigma}^{(1)}_{L}(\mathbf{w}_{L-1}) & \mathbf{0} \\ \mathbf{0} & \boldsymbol{\Gamma} \end{pmatrix}\right).$$

Define

$$\mathbf{Z} = \begin{pmatrix} \mathbf{Y} \mid \mathbf{W}_{L-1}, \log \boldsymbol{\sigma}^2 \\ \mathbf{F}^{(1)} \mid \mathbf{W}_{L-1} \end{pmatrix}.$$

Since $\mathbf{Y} \mid \mathbf{W}_{L-1} = \mathbf{F}^{(1)} \mid \mathbf{W}_{L-1} + \mathbf{D}$, $\mathbf{Z}$ is an affine transformation of $\big(\mathbf{F}^{(1)} \mid \mathbf{w}_{L-1},\; \mathbf{D}\big)$:

$$\mathbf{Z} = \begin{pmatrix} \mathbf{I} & \mathbf{I} \\ \mathbf{I} & \mathbf{0} \end{pmatrix}\begin{pmatrix} \mathbf{F}^{(1)} \mid \mathbf{W}_{L-1} \\ \mathbf{D} \end{pmatrix},$$

where $\mathbf{I}$ is the identity matrix. Thus, $\mathbf{Z}$ is multivariate normal: $\mathbf{Z} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}_{\mathbf{Z}})$, where

$$\boldsymbol{\Sigma}_{\mathbf{Z}} = \begin{pmatrix} \boldsymbol{\Sigma}^{(1)}_{L}(\mathbf{w}_{L-1}) + \boldsymbol{\Gamma} & \boldsymbol{\Sigma}^{(1)}_{L}(\mathbf{w}_{L-1}) \\ \boldsymbol{\Sigma}^{(1)}_{L}(\mathbf{w}_{L-1}) & \boldsymbol{\Sigma}^{(1)}_{L}(\mathbf{w}_{L-1}) \end{pmatrix} \in \mathbb{R}^{2N \times 2N}.$$

The covariance matrix $\boldsymbol{\Sigma}_{\mathbf{Z}}$ of $\mathbf{Z}$ can then be constructed as $\big(\sigma^{(1)}_{L}\big)^{2}\mathbf{R}_{\mathbf{Z}}\big(\mathbf{w}^{\mathrm{stack}}_{L-1}\big) + \boldsymbol{\Gamma}_0$, where $\mathbf{w}^{\mathrm{stack}}_{L-1} = \begin{pmatrix} \mathbf{w}_{L-1} \\ \mathbf{w}_{L-1} \end{pmatrix}$, $\boldsymbol{\Gamma}_0 = \mathrm{diag}(\sigma^2_1, \dots, \sigma^2_N, 0, \dots, 0)$, and the $ij$-th element of $\mathbf{R}_{\mathbf{Z}}\big(\mathbf{w}^{\mathrm{stack}}_{L-1}\big)$ is specified by $k^{(1)}_{L}\big(\mathbf{w}^{\mathrm{stack}}_{L-1,i*},\, \mathbf{w}^{\mathrm{stack}}_{L-1,j*}\big) + \eta\,\mathbb{1}\{i = j\}$ for $i, j = 1, \dots, 2N$.
Since $\mathbf{Z}$ is multivariate normal, its distribution under the Vecchia approximation remains multivariate normal: $\mathbf{Z}_{\mathrm{Vec}} \sim \mathcal{N}(\mathbf{0}, \mathbf{P}^{-1})$, where $\mathbf{P}$ is the precision matrix, which admits the upper-lower Cholesky decomposition $\mathbf{P} = \mathbf{U}\mathbf{U}^{\top}$ with $\mathbf{U}$ a sparse upper-triangular matrix. Conformably with the partition of $\mathbf{Z}$ into $\begin{pmatrix} \mathbf{Y} \mid \mathbf{W}_{L-1}, \log \boldsymbol{\sigma}^2 \\ \mathbf{F}^{(1)} \mid \mathbf{W}_{L-1} \end{pmatrix}$, $\mathbf{U}$ can be written as

$$\mathbf{U} = \begin{pmatrix} \mathbf{U}_{\mathbf{Y}\mathbf{Y}} & \mathbf{U}_{\mathbf{Y}\mathbf{F}^{(1)}} \\ \mathbf{0} & \mathbf{U}_{\mathbf{F}^{(1)}\mathbf{F}^{(1)}} \end{pmatrix},$$

where $\mathbf{U}_{\mathbf{Y}\mathbf{Y}} \in \mathbb{R}^{N \times N}$, $\mathbf{U}_{\mathbf{Y}\mathbf{F}^{(1)}} \in \mathbb{R}^{N \times N}$, and $\mathbf{U}_{\mathbf{F}^{(1)}\mathbf{F}^{(1)}} \in \mathbb{R}^{N \times N}$ denote the sub-matrices of $\mathbf{U}$ under this partition. Thus, we have

$$\mathbf{P} = \begin{pmatrix} \mathbf{U}_{\mathbf{Y}\mathbf{Y}}\mathbf{U}_{\mathbf{Y}\mathbf{Y}}^{\top} + \mathbf{U}_{\mathbf{Y}\mathbf{F}^{(1)}}\mathbf{U}_{\mathbf{Y}\mathbf{F}^{(1)}}^{\top} & \mathbf{U}_{\mathbf{Y}\mathbf{F}^{(1)}}\mathbf{U}_{\mathbf{F}^{(1)}\mathbf{F}^{(1)}}^{\top} \\ \mathbf{U}_{\mathbf{F}^{(1)}\mathbf{F}^{(1)}}\mathbf{U}_{\mathbf{Y}\mathbf{F}^{(1)}}^{\top} & \mathbf{U}_{\mathbf{F}^{(1)}\mathbf{F}^{(1)}}\mathbf{U}_{\mathbf{F}^{(1)}\mathbf{F}^{(1)}}^{\top} \end{pmatrix}.$$

According to Gelfand et al. (2010, Theorem 12.2), we then have

$$\mathbf{F}^{(1)} \mid \mathbf{W}_{L-1}, \log \boldsymbol{\sigma}^2, \mathbf{Y} \sim \mathcal{N}\Big(-\big(\mathbf{U}_{\mathbf{F}^{(1)}\mathbf{F}^{(1)}}^{\top}\big)^{-1}\mathbf{U}_{\mathbf{Y}\mathbf{F}^{(1)}}^{\top}\mathbf{y},\; \big(\mathbf{U}_{\mathbf{F}^{(1)}\mathbf{F}^{(1)}}\mathbf{U}_{\mathbf{F}^{(1)}\mathbf{F}^{(1)}}^{\top}\big)^{-1}\Big)$$

under the Vecchia approximation. Since $\boldsymbol{\mu} \mid \log \boldsymbol{\sigma}^2, \{\mathbf{W}^{(p)}_l\}, \mathbf{Y}, \mathbf{x} \stackrel{d}{=} \mathbf{F}^{(1)} \mid \log \boldsymbol{\sigma}^2, \mathbf{W}_{L-1}, \mathbf{Y}$, the proposition follows.

S.4 Proof of Proposition 4.5

Given $\boldsymbol{\mu}$, we have that

$$\mathbf{Y} \mid \boldsymbol{\mu}, \log \boldsymbol{\sigma}^2 \sim \mathcal{N}(\mathbf{M}\boldsymbol{\mu}, \boldsymbol{\Lambda}), \tag{S1}$$

where $\boldsymbol{\Lambda} = \mathbf{M}\boldsymbol{\Gamma}\mathbf{M}^{\top}$ with $\boldsymbol{\Gamma} = \mathrm{diag}(\sigma^2_1, \dots, \sigma^2_N)$. The likelihood is then given by

$$p(\mathbf{y} \mid \boldsymbol{\mu}, \log \boldsymbol{\sigma}^2) = (2\pi)^{-\sum_{i=1}^{N} S_i / 2}\,|\boldsymbol{\Lambda}|^{-1/2}\exp\Big\{-\tfrac{1}{2}(\mathbf{y} - \mathbf{M}\boldsymbol{\mu})^{\top}\boldsymbol{\Lambda}^{-1}(\mathbf{y} - \mathbf{M}\boldsymbol{\mu})\Big\}.$$

Expanding the quadratic form, we have

$$(\mathbf{y} - \mathbf{M}\boldsymbol{\mu})^{\top}\boldsymbol{\Lambda}^{-1}(\mathbf{y} - \mathbf{M}\boldsymbol{\mu}) = \mathbf{y}^{\top}\boldsymbol{\Lambda}^{-1}\mathbf{y} - 2\boldsymbol{\mu}^{\top}\mathbf{M}^{\top}\boldsymbol{\Lambda}^{-1}\mathbf{y} + \boldsymbol{\mu}^{\top}\mathbf{M}^{\top}\boldsymbol{\Lambda}^{-1}\mathbf{M}\boldsymbol{\mu}.$$

Define $\mathbf{s}(\mathbf{y}) \stackrel{\mathrm{def}}{=} \mathbf{M}^{\top}\boldsymbol{\Lambda}^{-1}\mathbf{y} \in \mathbb{R}^{N}$ and $\mathbf{D} \stackrel{\mathrm{def}}{=} \mathbf{M}^{\top}\boldsymbol{\Lambda}^{-1}\mathbf{M} \in \mathbb{R}^{N \times N}$; then,

$$p(\mathbf{y} \mid \boldsymbol{\mu}, \log \boldsymbol{\sigma}^2) = (2\pi)^{-\sum_{i=1}^{N} S_i / 2}\,|\boldsymbol{\Lambda}|^{-1/2}\exp\Big\{-\tfrac{1}{2}\mathbf{y}^{\top}\boldsymbol{\Lambda}^{-1}\mathbf{y}\Big\}\exp\Big\{\boldsymbol{\mu}^{\top}\mathbf{s}(\mathbf{y}) - \tfrac{1}{2}\boldsymbol{\mu}^{\top}\mathbf{D}\boldsymbol{\mu}\Big\}.$$

Letting $h(\mathbf{y}) = (2\pi)^{-\sum_{i=1}^{N} S_i / 2}\,|\boldsymbol{\Lambda}|^{-1/2}\exp\{-\tfrac{1}{2}\mathbf{y}^{\top}\boldsymbol{\Lambda}^{-1}\mathbf{y}\}$ and $g(\mathbf{s}(\mathbf{y}), \boldsymbol{\mu}) = \exp\{\boldsymbol{\mu}^{\top}\mathbf{s}(\mathbf{y}) - \tfrac{1}{2}\boldsymbol{\mu}^{\top}\mathbf{D}\boldsymbol{\mu}\}$, we have $p(\mathbf{y} \mid \boldsymbol{\mu}, \log \boldsymbol{\sigma}^2) = h(\mathbf{y})\, g(\mathbf{s}(\mathbf{y}), \boldsymbol{\mu})$.
Thus,

$$p(\boldsymbol{\mu} \mid \log \boldsymbol{\sigma}^2, \mathbf{w}_{L-1}, \mathbf{y}) \propto p(\mathbf{y} \mid \boldsymbol{\mu}, \log \boldsymbol{\sigma}^2)\, p(\boldsymbol{\mu} \mid \mathbf{w}_{L-1}) = h(\mathbf{y})\, g(\mathbf{s}(\mathbf{y}), \boldsymbol{\mu})\, p(\boldsymbol{\mu} \mid \mathbf{w}_{L-1}) \propto g(\mathbf{s}(\mathbf{y}), \boldsymbol{\mu})\, p(\boldsymbol{\mu} \mid \mathbf{w}_{L-1}),$$

which depends on $\mathbf{y}$ only through $\mathbf{s}(\mathbf{y})$. Therefore, $p(\boldsymbol{\mu} \mid \log \boldsymbol{\sigma}^2, \mathbf{w}_{L-1}, \mathbf{y}) = p(\boldsymbol{\mu} \mid \log \boldsymbol{\sigma}^2, \mathbf{w}_{L-1}, \mathbf{s}(\mathbf{y}))$. Define

$$\widetilde{\mathbf{Y}} = \big(\mathbf{M}^{\top}\boldsymbol{\Lambda}^{-1}\mathbf{M}\big)^{-1}\mathbf{M}^{\top}\boldsymbol{\Lambda}^{-1}\mathbf{Y} \in \mathbb{R}^{N}. \tag{S2}$$

Since $\widetilde{\mathbf{Y}} = \mathbf{D}^{-1}\mathbf{s}(\mathbf{Y})$ is a deterministic transformation of $\mathbf{s}(\mathbf{Y})$, conditioning on $\mathbf{s}(\mathbf{Y})$ is equivalent to conditioning on $\widetilde{\mathbf{Y}}$:

$$p(\boldsymbol{\mu} \mid \log \boldsymbol{\sigma}^2, \mathbf{w}_{L-1}, \mathbf{y}) = p(\boldsymbol{\mu} \mid \log \boldsymbol{\sigma}^2, \mathbf{w}_{L-1}, \mathbf{s}(\mathbf{y})) = p(\boldsymbol{\mu} \mid \log \boldsymbol{\sigma}^2, \mathbf{w}_{L-1}, \widetilde{\mathbf{y}}).$$

As a result, finding $p(\boldsymbol{\mu} \mid \log \boldsymbol{\sigma}^2, \mathbf{w}_{L-1}, \mathbf{y})$ is equivalent to finding $p(\boldsymbol{\mu} \mid \log \boldsymbol{\sigma}^2, \mathbf{w}_{L-1}, \widetilde{\mathbf{y}})$. Using Equation (S1) and definition (S2), we have that $\widetilde{\mathbf{Y}} \mid \boldsymbol{\mu}, \log \boldsymbol{\sigma}^2$ is multivariate normal:

$$\widetilde{\mathbf{Y}} \mid \boldsymbol{\mu}, \log \boldsymbol{\sigma}^2 \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{D}^{-1}).$$

Define $\mathbf{Z} = \begin{pmatrix} \widetilde{\mathbf{Y}} \mid \mathbf{W}_{L-1}, \log \boldsymbol{\sigma}^2 \\ \mathbf{F}^{(1)} \mid \mathbf{W}_{L-1} \end{pmatrix}$; then, following the same arguments as in Section S.3, we can obtain that $\mathbf{Z}$ is multivariate normal: $\mathbf{Z} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}_{\mathbf{Z}})$, where

$$\boldsymbol{\Sigma}_{\mathbf{Z}} = \begin{pmatrix} \boldsymbol{\Sigma}^{(1)}_{L}(\mathbf{w}_{L-1}) + \mathbf{D}^{-1} & \boldsymbol{\Sigma}^{(1)}_{L}(\mathbf{w}_{L-1}) \\ \boldsymbol{\Sigma}^{(1)}_{L}(\mathbf{w}_{L-1}) & \boldsymbol{\Sigma}^{(1)}_{L}(\mathbf{w}_{L-1}) \end{pmatrix} \in \mathbb{R}^{2N \times 2N}.$$

Using the same arguments as in Section S.3 again, we can obtain that

$$\mathbf{F}^{(1)} \mid \mathbf{W}_{L-1}, \log \boldsymbol{\sigma}^2, \widetilde{\mathbf{Y}} \sim \mathcal{N}\Big(-\big(\mathbf{U}_{\mathbf{F}^{(1)}\mathbf{F}^{(1)}}^{\top}\big)^{-1}\mathbf{U}_{\widetilde{\mathbf{Y}}\mathbf{F}^{(1)}}^{\top}\widetilde{\mathbf{y}},\; \big(\mathbf{U}_{\mathbf{F}^{(1)}\mathbf{F}^{(1)}}\mathbf{U}_{\mathbf{F}^{(1)}\mathbf{F}^{(1)}}^{\top}\big)^{-1}\Big)$$

under the Vecchia approximation. Since $p(\boldsymbol{\mu} \mid \log \boldsymbol{\sigma}^2, \{\mathbf{w}^{(p)}_l\}, \mathbf{y}, \mathbf{x}) = p(\boldsymbol{\mu} \mid \log \boldsymbol{\sigma}^2, \mathbf{w}_{L-1}, \mathbf{y})$ and $p(\boldsymbol{\mu} \mid \log \boldsymbol{\sigma}^2, \mathbf{w}_{L-1}, \mathbf{y}) = p(\boldsymbol{\mu} \mid \log \boldsymbol{\sigma}^2, \mathbf{w}_{L-1}, \widetilde{\mathbf{y}}) = p(\mathbf{f}^{(1)} \mid \log \boldsymbol{\sigma}^2, \mathbf{w}_{L-1}, \widetilde{\mathbf{y}})$, the proposition follows.
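The two posterior forms in Proposition 4.3 and the sufficient-statistic reduction above can be spot-checked numerically. The sketch below is illustrative only: small random matrices stand in for $\mathbf{M}$, $\boldsymbol{\Sigma}^{(1)}_{L}(\mathbf{w}_{L-1})$, and the noise covariance, with $\boldsymbol{\Lambda}$ taken to be diagonal (independent per-observation noise) for the purposes of the check; none of these names come from the dgpsi implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
N, S = 4, 3  # illustrative: N input locations, S replicates each
M = np.kron(np.eye(N), np.ones((S, 1)))      # replication matrix, (N*S) x N
Lam = np.diag(rng.uniform(0.5, 1.5, N * S))  # diagonal per-observation noise (assumption)
A = rng.normal(size=(N, N))
Sigma = A @ A.T + N * np.eye(N)              # random SPD stand-in for the prior covariance
y = rng.normal(size=N * S)

Lam_inv = np.linalg.inv(Lam)
Dmat = M.T @ Lam_inv @ M                     # D = M' Lam^-1 M
s = M.T @ Lam_inv @ y                        # s(y) = M' Lam^-1 y

# Form 1: (M' Lam^-1 M + Sigma^-1)^-1, as obtained by completing the square
cov1 = np.linalg.inv(Dmat + np.linalg.inv(Sigma))
mean1 = cov1 @ s

# Form 2: Sigma (I + M' Lam^-1 M Sigma)^-1, with Sigma^-1 factored out
cov2 = Sigma @ np.linalg.inv(np.eye(N) + Dmat @ Sigma)
mean2 = cov2 @ s
assert np.allclose(cov1, cov2)
assert np.allclose(mean1, mean2)

# Sufficient-statistic reduction: ytilde | mu ~ N(mu, D^-1) yields the same posterior
ytilde = np.linalg.solve(Dmat, s)
mean3 = cov1 @ (Dmat @ ytilde)
assert np.allclose(mean1, mean3)
```

The check confirms that conditioning on the $N$-dimensional summary $\widetilde{\mathbf{y}}$ recovers the same posterior as conditioning on all $\sum_i S_i$ replicates, which is what makes the reduction computationally attractive.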