Uncertainty-Aware Neural Multivariate Geostatistics
Yeseul Jeon (1,2), Aaron Scheffler (2), Rajarshi Guhaniyogi (1)
(1) Department of Statistics, Texas A&M University
(2) Department of Epidemiology & Biostatistics, University of California San Francisco

Abstract

We propose Deep Neural Coregionalization (DNC), a scalable and uncertainty-aware framework for multivariate geostatistics. DNC represents multivariate spatial effects via spatially varying factors and loadings, placing deep Gaussian process (DGP) priors on both the factors and the entries of the loading matrix. This joint specification learns shared latent spatial factors together with response-specific, location-dependent mixing weights of the factors, yielding flexible, nonlinear, and space-dependent association within and across variables. A central contribution is to exploit a variational formulation that makes the DGP-deep neural network (DNN) correspondence explicit: maximizing the DGP evidence lower bound (ELBO) is equivalent to training DNNs with weight decay and a Monte Carlo (MC) dropout architecture. In practice, this enables (i) fast inference via mini-batch stochastic optimization, without costly Markov chain Monte Carlo (MCMC), and (ii) principled uncertainty quantification, as MC-dropout forward passes act as approximate posterior draws, producing calibrated credible surfaces for prediction and spatial effect estimation. Across simulations, DNC is competitive with, or outperforms, existing spatial factor models, especially under strong nonstationarity and complex cross-dependencies, while offering substantial computational gains. In a multivariate environmental case study, DNC captures intricate spatially dependent cross-variable interactions, produces interpretable maps of multivariate spatial outcomes, and scales uncertainty quantification to large datasets, reducing estimation time from hours to minutes relative to alternative approaches. DNC provides a modern, explainable, and computationally efficient solution for large-scale multivariate spatial analysis, aligning with emerging explainable artificial intelligence (AI) strategies in multivariate geostatistics.

Keywords: Bayesian Deep Learning, Deep Gaussian Process, Linear Model of Coregionalization, Monte Carlo Dropout, Multivariate Gaussian Process.

1 Introduction

Technological and computational advances have created data-rich settings that offer unprecedented opportunities to probe the complexity of large, spatially indexed datasets. Over the past decade, spatial models have played a central role in addressing increasingly complex questions in the environmental sciences. A particularly active frontier is the analysis of multivariate spatial data, where multiple outcomes are recorded at each location. In such settings, it is natural to posit dependence both within locations (among the co-measured variables) and across locations (spatial autocorrelation within each variable).
As a motivating example in this article, consider a suite of spatially indexed spectral measures of vegetation activity. Each measure typically exhibits strong spatial structure, readily seen in empirical variograms and exploratory maps, while the variables themselves are mutually associated through shared biophysical processes. Analyzing each variable in isolation may recover its marginal spatial pattern yet discards cross-variable information, often degrading interpolation and prediction performance (see, e.g., [Wackernagel, 2003, Chiles and Delfiner, 2012, Cressie and Wikle, 2011]). These considerations strongly motivate joint modeling of multivariate spatial processes. Moreover, in many applications, including ours, the strength and form of the associations among variables vary over space. Addressing this reality requires models that allow spatially varying association structure between variables and deliver computationally efficient inference at scale, suitable for large numbers of locations.

A common entry point to multivariate spatial analysis is to posit a vector-valued latent spatial process, typically a multivariate Gaussian process, equipped with a matrix-valued cross-covariance that encodes both within-variable and cross-variable dependence across space. The literature on such cross-covariances is vast, so we highlight the principal construction paradigms that balance interpretability and flexibility. The most widely used is the linear model of coregionalization (LMC) [Schmidt and Gelfand, 2003, Wackernagel, 2003], which represents a multivariate field as a linear combination of independent univariate latent processes, producing cross-covariances as sums of separable components. A second family comprises convolution methods, which build cross-covariances by convolving shared and variable-specific kernels with latent white-noise or parent processes [Gaspari and Cohn, 1999, Majumdar and Gelfand, 2007]. A third idea is the latent-dimension approach, which assigns each variable a coordinate in an auxiliary space and then applies a univariate stationary kernel in the augmented domain, yielding valid stationary cross-covariances by construction [Apanasovich and Genton, 2010]. Finally, multivariate Matérn families provide interpretable, practice-friendly specifications wherein each margin enjoys a Matérn form and cross-parameters are constrained to ensure positive definiteness [Gneiting et al., 2010, Genton and Kleiber, 2015].

While much of this work targets stationary dependence, where cross-covariances depend only on inter-location distances, modern applications frequently exhibit spatially varying associations tied to geography, ecology, or physics. Empirical evidence from ecology and environmental science (e.g., [Diez and Pulliam, 2007, Ovaskainen et al., 2010, Waddle et al., 2010, Coombes et al., 2015, Guha et al., 2024]) underscores that nonstationary cross-correlation can reveal latent drivers, sharpen scientific understanding, and materially improve prediction at new locations.
Methodologically, non-stationarity has been introduced by allowing Matérn parameters (variance, range, smoothness) to vary over space [Kleiber and Nychka, 2012], by extending kernel convolutions to yield spatially varying matrix-valued covariances [Calder, 2008, Majumdar and Gelfand, 2007], and by adapting the latent-dimension framework to nonstationary settings [Bornn et al., 2012]. Related hierarchical formulations, such as the matrix-variate Wishart process [Gelfand et al., 2004] and Lagrangian frameworks that induce directionally stronger cross-covariances [Salvaña et al., 2023], offer additional flexibility for modeling spatially varying covariance structures.

Scaling these multivariate spatial models to large $n$ (locations) remains challenging. Recent advances introduce computationally efficient Gaussian-process surrogates (low-rank/predictive-process, sparse/nearest-neighbor, and multi-resolution schemes) to deliver nonstationary cross-covariances at scale [Guhaniyogi et al., 2013a, Guhaniyogi, 2017, Zhang and Banerjee, 2022a]. These approaches substantially broaden the reach of multivariate spatial modeling, yet truly massive sample sizes can still strain computation, especially when flexible spatially dependent association is required across many variables and locations.

Expanding upon the nonstationary linear model of coregionalization (LMC) framework introduced by Guhaniyogi et al. [2013a], we model multivariate outcomes by allowing both the latent factor processes and each entry of the coregionalization matrix to vary explicitly with spatial location. Unlike Guhaniyogi et al. [2013a], who utilized low-rank Gaussian processes (GPs) for these functions and encountered scalability challenges, we employ a specific deep Gaussian process (DGP) architecture with connections to deep neural networks (DNNs). Specifically, we construct a tailored variational approximation of the DGP's posterior distribution and focus on optimizing the associated evidence lower bound (ELBO) with respect to the variational parameters. Importantly, maximizing this ELBO is mathematically equivalent to minimizing a loss function for the weights and biases in deep neural networks, structured around the latent factors and their loadings. Consequently, the variational posterior parameters of the DGP can be efficiently estimated using DNN optimization techniques via stochastic backpropagation. This computational efficiency enables fast estimation of variational parameters and facilitates sampling from the variational posterior, allowing for both accurate point estimation and uncertainty quantification in inference on multivariate spatial effects and prediction of multivariate outcomes.
The proposed formulation yields several advantages: (i) Flexibility: rich, spatially dependent cross-variable associations modeled via data-driven nonlinear mappings; (ii) Scalability: efficient training for massive sets of locations through stochastic optimization and mini-batching, without estimating the DGP through expensive Markov chain Monte Carlo (MCMC) sampling; (iii) Uncertainty quantification: variational inference leveraging connections between DNN optimization and approximate inference for deep Gaussian processes, providing uncertainty for both model components and predictions; (iv) Statistical-AI integration: embedding DNN structure within a spatially varying LMC preserves interpretability from geostatistics while tapping advances from deep learning for multivariate spatial analysis. We refer to this framework as Deep Neural Coregionalization (DNC).

Owing to their expressive capacity, neural networks have recently been applied to spatial and spatio-temporal modeling, including CNN-based spatio-temporal forecasting [Wikle and Zammit-Mangion, 2023], deep probabilistic warping for nonstationary spatial processes [Zammit-Mangion et al., 2022], variational autoencoders for spatio-temporal point processes [Mateu and Jalilian, 2022], and Bayesian neural networks for spatial interpolation [Kirkwood et al., 2022]. While these studies show strong empirical performance, they often lack clear probabilistic interpretation and rigorous uncertainty quantification. More recent work integrates DNNs within statistical models to obtain interpretable inference via connections to deep Gaussian processes [Jeon et al., 2025a,b], but these developments remain largely confined to univariate function-on-scalar and function-on-function regression. DNC closes this gap for multivariate spatial outcomes by jointly learning spatially varying factors and loadings with DGPs while facilitating rapid computation through DNN-based optimization, delivering flexible nonstationary cross-covariances together with scalable, principled uncertainty quantification.

The remainder of the paper is organized as follows. Section 2 briefly reviews multivariate spatial processes with a focus on the LMC. Section 3 introduces the proposed methodology, including the predictive distributions used to quantify uncertainty. Section 4 evaluates the model's performance through simulation studies, followed by a real-world application in Section 5. Finally, Section 6 concludes the paper with a discussion and future research directions.

2 Multivariate Spatial Process Models

Let $\mathcal{S} \subset \mathbb{R}^d$ be a connected subset of $d$-dimensional Euclidean space, and let $s \in \mathcal{S}$ denote a generic point in $\mathcal{S}$. In our subsequent application, $d = 2$. The multivariate spatial setting typically envisions, at each location $s \in \mathcal{S}$, a $J \times 1$ multivariate response vector $y(s) = (y_1(s), \ldots, y_J(s))^\top$, together with a $J \times p$ design matrix $X(s)$ whose $j$th row is the $1 \times p$ regressor vector $x_j(s)^\top$. Let $\{s_1, \ldots, s_n\}$ denote the set of locations in the domain $\mathcal{S}$ where responses and predictors are observed.
A multivariate spatial regression model posits, for each $s \in \mathcal{S}$,
\[
y(s) = X(s)\beta + w(s) + \epsilon(s), \tag{1}
\]
where $\beta \in \mathbb{R}^p$ is a vector of regression coefficients, $w(s) = (w_1(s), \ldots, w_J(s))^\top$ is a $J \times 1$ vector of spatial random effects, and $\epsilon(s) = (\epsilon_1(s), \ldots, \epsilon_J(s))^\top$ is a $J \times 1$ vector of measurement errors. We assume $\epsilon(s) \overset{\text{i.i.d.}}{\sim} N(0, \Sigma_\varepsilon)$ across locations and independent of $w(\cdot)$, with $\Sigma_\varepsilon = \sigma^2 I$ typically taken to be diagonal.

2.1 Multivariate Spatial Processes and the Linear Model of Coregionalization

A powerful route to modeling multivariate spatial dependence is to represent the $J$-variate spatial effect $w(s) = (w_1(s), \ldots, w_J(s))^\top$ as space-varying linear transformations of a collection of latent spatial processes. This echoes latent-variable modeling [Gamerman et al., 2008, Titsias and Lawrence, 2010], but replaces latent scalars with latent functions that propagate structured dependence across outcomes. The resulting construction is the linear model of coregionalization (LMC) [Gelfand et al., 2004], extended to allow spatially varying mixing of the latent functions to explicitly encode space-varying association among outcomes [Guhaniyogi et al., 2013b].

Let $\{h_j(s) : s \in \mathcal{S}\}$, $j = 1, \ldots, J$, denote $J$ latent spatial factors, and let $h(s) = (h_1(s), \ldots, h_J(s))^\top$ be the collection of spatial factors at location $s$. Let $\Psi(s)$ be a $J \times J$ location-specific loading (coregionalization) matrix with $j$th column $\psi_j(s) = (\psi_{1j}(s), \ldots, \psi_{Jj}(s))^\top \in \mathbb{R}^J$. The space-varying LMC specifies
\[
w(s) = \Psi(s) h(s) = \sum_{j=1}^{J} \psi_j(s)\, h_j(s), \qquad s \in \mathcal{S}. \tag{2}
\]
Instead of modeling the multivariate spatial effect directly, which requires assuming a specific form of cross-covariance function, the LMC formulation decomposes the multivariate spatial effect into shared latent spatial factors and location-specific loadings, enabling a natural formulation of space-varying cross-outcome associations. This construction permits an expansion for each individual component of the multivariate spatial effect $w(s)$, given by $w_k(s) = \sum_{j=1}^{J} \psi_{kj}(s) h_j(s)$, $k = 1, \ldots, J$. For $\Psi(s)$, we follow a widely accepted approach that takes $\Psi(s)$ to be upper triangular, setting $\psi_{kj}(s) = 0$ for $j < k$, and modeling its upper-triangular entries as smooth functions of spatial location. Gaussian processes (GPs) are a natural choice for estimating these unknown latent functions; a typical specification assigns $h_j(\cdot) \sim \mathrm{GP}(0, \kappa^h_j(\cdot,\cdot))$ with a typically stationary covariance function $\kappa^h_j(s, s') = \mathrm{Cov}(h_j(s), h_j(s'))$. A similar stationary GP framework can be applied to model each upper-triangular entry of $\Psi(s)$.

Naively assigning GPs to all unknown latent factors and all upper-triangular loading functions is inadequate both statistically and computationally. A standard stationary GP on the latent factors often fails to capture the complex, local behavior of spatially varying latent functions [Sauer et al., 2023], and suffers from a debilitating computational cost of order $\sim \frac{J(J+3)}{2} n^3$. Such costs quickly become prohibitive for even moderately sized datasets.
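To make the construction in (2) concrete, the following minimal Python sketch assembles $w(s) = \Psi(s)h(s)$ for $J = 2$ with an upper-triangular, location-specific loading matrix. The deterministic surfaces standing in for the latent functions are purely illustrative (in the model they receive DGP priors), and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
J, n = 2, 4                      # J outcomes at n locations (toy sizes)
s = rng.uniform(size=(n, 2))     # spatial coordinates in [0, 1]^2

# Simple deterministic surfaces standing in for the latent factors h_j(s)
h = np.stack([np.sin(2 * np.pi * s[:, 0]),
              np.cos(2 * np.pi * s[:, 1])], axis=1)          # n x J

def loading_matrix(loc):
    """Upper-triangular Psi(s): psi_kj(s) = 0 for j < k."""
    psi11 = 1.0 + 0.5 * loc[0]
    psi12 = loc[0] - loc[1]
    psi22 = 1.0 + 0.5 * loc[1]
    return np.array([[psi11, psi12],
                     [0.0,   psi22]])

# Space-varying LMC: w(s) = Psi(s) h(s), equation (2)
w = np.stack([loading_matrix(s[i]) @ h[i] for i in range(n)])  # n x J
print(w.shape)  # (4, 2)
```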
Scalable GP surrogates, including low-rank GPs [Guhaniyogi et al., 2013b, Guhaniyogi and Sanso, 2020], nearest-neighbor GPs [Zhang and Banerjee, 2022b], and distributed frameworks [Guhaniyogi et al., 2022], mitigate this computational burden, yet most implementations employ such surrogates only for the spatial factors, while enforcing that $\Psi(s) = \Psi$ remains spatially constant. This results in a space-invariant cross-covariance among outcomes, i.e., $\mathrm{Cov}(w(s), w(s')) = \Psi\, \mathrm{Cov}(h(s), h(s'))\, \Psi^\top$, an assumption that is often far too restrictive for practical applications (see, e.g., [Diez and Pulliam, 2007] and [Ovaskainen et al., 2010]).

Our approach breaks decisively from these limitations. We explicitly model both $h(s)$ and $\Psi(s)$ using flexible deep Gaussian process (DGP) architectures, uniquely capable of capturing intricate, local spatial features that stationary GPs often miss. Furthermore, by leveraging the computational parallels between DGPs and deep neural networks (DNNs), we enable fast posterior inference through scalable DNN-based optimization. We detail this specification and its computational strategy in the next section.

3 Modeling Spatial Factors and Factor Loadings: Exploiting Connections Between Deep Gaussian Processes and Deep Neural Networks

We begin with a brief overview of deep Gaussian processes, which will later serve as the foundation for modeling both the spatial factor loadings and the latent factors.

3.1 Deep Gaussian Process

A deep Gaussian process (DGP) is a hierarchical composition of Gaussian processes in which each layer is conditionally multivariate normal given the layer beneath it. This umbrella definition admits several concrete constructions with different computational and interpretive trade-offs (see, e.g., [Dunlop et al., 2018]). In this work, we adopt the functional composition perspective, where functions are composed and propagated layer by layer, which aligns naturally with deep neural network (DNN) implementations and enables fast, scalable inference. At a high level, intermediate GP layers perform stochastic input warping, transforming inputs through latent representations before the final GP produces the observed responses; this induces rich, nonstationary, and often non-Gaussian behavior while retaining a probabilistic model, allowing uncertainty to be propagated through the hierarchy.

Two-layer DGP. Let the inputs be spatial locations $S = \{s_1, \ldots, s_n\} \subset \mathbb{R}^d$, and let $z = (z(s_1), \ldots, z(s_n))^\top$ denote the corresponding scalar responses. A two-layer DGP introduces a latent feature map $F = [f_1 : \cdots : f_{K_1}] \in \mathbb{R}^{n \times K_1}$, obtained by evaluating $K_1$ latent GP "nodes" at the locations $S$ and represented by the columns of $F$. The model specifies
\[
z \mid F \sim N\big(0, \Sigma_2(F)\big), \qquad F \mid S \sim N\big(0, \Sigma_1(S)\big), \tag{3}
\]
where $\Sigma_1(S)$ and $\Sigma_2(F)$ are covariance matrices obtained by evaluating chosen GP covariance functions on $S$ and on the latent inputs $F$, respectively. The marginal likelihood integrates out the latent inputs:
\[
\mathcal{L}(z \mid S) \propto \int \mathcal{L}(z \mid F)\, \mathcal{L}(F \mid S)\, dF. \tag{4}
\]
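To illustrate (3) and (4), the following sketch draws one realization from a two-layer DGP prior, using a squared-exponential kernel purely for illustration: layer 1 warps the inputs into a latent feature map $F$, and layer 2 places a GP on the rows of $F$. All sizes and length-scales are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf_cov(X, lengthscale=0.3, var=1.0, jitter=1e-8):
    """Squared-exponential covariance matrix on the rows of X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * d2 / lengthscale**2) + jitter * np.eye(len(X))

n, K1 = 200, 3
S = rng.uniform(size=(n, 1))              # inputs (1-d here, for simplicity)

# Layer 1: K1 independent GP nodes evaluated at S, F | S ~ N(0, Sigma_1(S))
L1 = np.linalg.cholesky(rbf_cov(S))
F = L1 @ rng.standard_normal((n, K1))     # n x K1 latent feature map

# Layer 2: scalar-output GP on the warped inputs, z | F ~ N(0, Sigma_2(F))
L2 = np.linalg.cholesky(rbf_cov(F, lengthscale=1.0))
z = L2 @ rng.standard_normal(n)           # one draw from the DGP prior
```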
Following deep learning nomenclature, we refer to the $K_1$ latent GP nodes of the intermediate layer as nodes of layer 1; the outer (data) layer, which outputs the scalar process $z(\cdot)$ evaluated at $S$, is referred to as layer 2 in a two-layer deep GP. Multi-output extensions replace $z$ by a stacked vector (or matrix) of responses.

Multi-layer DGP. We generalize to an $L$-layer deep GP in which layer $l$ has $K_l$ latent GP nodes. Let $S = \{s_1, \ldots, s_n\}$ denote the inputs (e.g., spatial locations) and define the layer-$l$ latent matrix $F_l = [f_{1,l} : \cdots : f_{K_l,l}] \in \mathbb{R}^{n \times K_l}$, where the column $f_{k,l} \in \mathbb{R}^n$ is the $k$th latent GP node at the $l$th layer, evaluated across the $n$ sampled locations. We set $F_0 \equiv S$ (the original inputs). Conditional on the previous layer, nodes within a layer are independent Gaussian processes with a covariance that depends on the latent inputs from the layer below. Writing $\Sigma_l(F_{l-1})$ for the $n \times n$ covariance matrix obtained by evaluating a chosen covariance function on the rows of $F_{l-1}$, a finite-dimensional DGP prior is specified by
\[
f_{k,l} \mid F_{l-1} \sim N\big(0, \Sigma_l(F_{l-1})\big), \quad k = 1, \ldots, K_l,\; l = 1, \ldots, L; \qquad K_L = 1, \quad F_L = z \in \mathbb{R}^n. \tag{5}
\]
As the uniformly distributed inputs $S$ are propagated through the intermediate layers, they are nonlinearly warped in latent space; the resulting process $z(\cdot)$ is no longer stationary and, in general, not Gaussian. More precisely, $\mathrm{Cov}\{z(s), z(t)\} = \mathrm{E}[\mathrm{Cov}\{z(s), z(t)\} \mid F_1, \ldots, F_{L-1}]$ is an average of a GP covariance function with respect to the distribution induced by the inner layers. Consequently, $z(\cdot)$ need not be marginally Gaussian and cannot be fully characterized by a single cross-covariance function.

The choice of depth (number of layers $L$) of the DGP is problem-specific. While adding depth increases expressiveness, it eventually yields diminishing returns. Damianou and Lawrence [2013] provided some support for using up to five layers in classification settings, whereas two- and three-layer DGPs have generally sufficed for real-valued outputs typical of computer simulations and environmental applications [Sauer et al., 2023]. Consistent with this literature, we prefer shallow architectures and use no more than four layers in all our subsequent DGP formulations for unknown functions.

3.2 Deep Gaussian Processes for Spatial Factors and Factor Loadings

We stack all upper-triangular entries of the factor loading (coregionalization) matrix into a single vector $\psi^{(u)}(s) = (\psi^{(u)}_1(s), \ldots, \psi^{(u)}_O(s))^\top$, where $O = \frac{J(J+1)}{2}$. At each spatial location $s$, we learn both the latent spatial factors $h_j(s)$, $j = 1, \ldots, J$, and the upper-triangular entries $\psi^{(u)}_o(s)$, $o = 1, \ldots, O$, with independent DGPs, adopting the functional composition view of DGPs outlined in Section 3.1. Specifically, for each factor $h_j(\cdot)$ we specify an $L_h$-layer DGP, with layer $l$ containing $K^{(h)}_l$ latent nodes whose evaluations across all $n$ locations are collected into a matrix $F^{(h)}_{l,j} \in \mathbb{R}^{n \times K^{(h)}_l}$. The DGP formulation is given by
\[
f^{(h)}_{k,l,j} \mid F^{(h)}_{l-1,j} \sim N\big(0, \Sigma_l(F^{(h)}_{l-1,j})\big), \quad k = 1, \ldots, K^{(h)}_l,\; l = 1, \ldots, L_h;
\]
\[
F^{(h)}_{0,j} = [s_1 : \cdots : s_n]^\top, \quad F^{(h)}_{l-1,j} = \big[f^{(h)}_{1,l-1,j} : \cdots : f^{(h)}_{K^{(h)}_{l-1},l-1,j}\big], \quad K^{(h)}_{L_h} = 1, \quad F^{(h)}_{L_h,j} = (h_j(s_1), \ldots, h_j(s_n))^\top \in \mathbb{R}^n. \tag{6}
\]
The top layer produces a single output per location (so $K^{(h)}_{L_h} = 1$), yielding the vector $(h_j(s_1), \ldots, h_j(s_n))^\top$ for $j = 1, \ldots, J$. Analogously, for each upper-triangular loading entry $\psi^{(u)}_o(\cdot)$ we employ an $L_\psi$-layer DGP. Layer $l$ comprises $K^{(\psi)}_l$ latent nodes; stacking their evaluations across the $n$ locations yields $F^{(\psi)}_{l,o} \in \mathbb{R}^{n \times K^{(\psi)}_l}$. The hierarchical specification is given by
\[
f^{(\psi)}_{k,l,o} \mid F^{(\psi)}_{l-1,o} \sim N\big(0, \Sigma_l(F^{(\psi)}_{l-1,o})\big), \quad k = 1, \ldots, K^{(\psi)}_l,\; l = 1, \ldots, L_\psi;
\]
\[
F^{(\psi)}_{0,o} = [s_1 : \cdots : s_n]^\top, \quad F^{(\psi)}_{l-1,o} = \big[f^{(\psi)}_{1,l-1,o} : \cdots : f^{(\psi)}_{K^{(\psi)}_{l-1},l-1,o}\big], \quad K^{(\psi)}_{L_\psi} = 1, \quad F^{(\psi)}_{L_\psi,o} = (\psi^{(u)}_o(s_1), \ldots, \psi^{(u)}_o(s_n))^\top \in \mathbb{R}^n. \tag{7}
\]
Thus, the top layer produces a single output per location, yielding the vector $(\psi^{(u)}_o(s_1), \ldots, \psi^{(u)}_o(s_n))^\top$ for $o = 1, \ldots, O$.

To construct the covariance matrices $\Sigma_l(F^{(h)}_{l-1,j})$ and $\Sigma_l(F^{(\psi)}_{l-1,o})$, we mirror the forward pass of a deep neural network. Specifically, we apply pointwise nonlinear functions (e.g., ReLU, sigmoid) $\sigma^{(h)}_{l-1}(\cdot)$ and $\sigma^{(\psi)}_{l-1}(\cdot)$ to the $(l-1)$th layer's node matrices of the processes (6) and (7), respectively, to obtain "activation" matrices $\Phi^{(h)}_{l-1,j} = \sigma^{(h)}_{l-1}(F^{(h)}_{l-1,j})$ and $\Phi^{(\psi)}_{l-1,o} = \sigma^{(\psi)}_{l-1}(F^{(\psi)}_{l-1,o})$. In analogy with DNNs, we introduce layer-specific weight matrices $W^{(h)}_{l,j} \in \mathbb{R}^{K^{(h)}_l \times K^{(h)}_{l-1}}$, $W^{(\psi)}_{l,o} \in \mathbb{R}^{K^{(\psi)}_l \times K^{(\psi)}_{l-1}}$ and bias vectors $b^{(h)}_{l,j} \in \mathbb{R}^{K^{(h)}_l}$, $b^{(\psi)}_{l,o} \in \mathbb{R}^{K^{(\psi)}_l}$. We then define each layer's GP covariance as a data-driven kernel on the rows (locations) of the previous layer's activations, with learnable scales encoded by the weight matrices and bias vectors:
\[
\Sigma_l\big(F^{(h)}_{l-1,j}\big) = \frac{1}{K^{(h)}_l}\, \sigma^{(h)}_l\big(\Phi^{(h)}_{l-1,j} W^{(h)\top}_{l,j} + 1_n \otimes b^{(h)\top}_{l,j}\big)\, \Big[\sigma^{(h)}_l\big(\Phi^{(h)}_{l-1,j} W^{(h)\top}_{l,j} + 1_n \otimes b^{(h)\top}_{l,j}\big)\Big]^{\top},
\]
\[
\Sigma_l\big(F^{(\psi)}_{l-1,o}\big) = \frac{1}{K^{(\psi)}_l}\, \sigma^{(\psi)}_l\big(\Phi^{(\psi)}_{l-1,o} W^{(\psi)\top}_{l,o} + 1_n \otimes b^{(\psi)\top}_{l,o}\big)\, \Big[\sigma^{(\psi)}_l\big(\Phi^{(\psi)}_{l-1,o} W^{(\psi)\top}_{l,o} + 1_n \otimes b^{(\psi)\top}_{l,o}\big)\Big]^{\top}, \tag{8}
\]
where $1_n$ is an $n$-vector of ones used to broadcast the bias across locations. This construction yields a specific DGP prior for both $h_j(\cdot)$ and $\psi^{(u)}_o(\cdot)$ whose layers mimic a Bayesian deep network: $\sigma_l(\cdot)$ acts as the activation, $W_l$ governs inter-layer mixing, $b_l$ provides affine shifts, and the induced covariance at layer $l$ is a kernel of the transformed features from layer $l-1$. The result is a highly flexible, nonstationary, and possibly non-Gaussian representation obtained via stochastic input warping, simultaneously for the spatial factors and the factor loadings (a minimal sketch of one layer's kernel computation is given below). However, direct sampling-based Bayesian inference is often computationally infeasible, as high-dimensional latent variables typically mix slowly and MCMC methods do not scale well with increasing model depth and dataset size.
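Returning to the kernel in (8): it is simply a scaled outer product of DNN-style activations. The following minimal sketch computes one generic layer's covariance (random weights, ReLU activations; all sizes are arbitrary). The resulting $\Sigma_l$ is positive semidefinite by construction, so a small jitter suffices to sample from it.

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda x: np.maximum(x, 0.0)

n, K_prev, K_l = 100, 4, 8
F_prev = rng.standard_normal((n, K_prev))   # layer l-1 node matrix, n x K_{l-1}

W_l = rng.standard_normal((K_l, K_prev))    # layer-l weight matrix
b_l = rng.standard_normal(K_l)              # layer-l bias vector

# Activation of the previous layer, Phi_{l-1} = sigma_{l-1}(F_{l-1})
Phi = relu(F_prev)

# Equation (8): Sigma_l = (1/K_l) A A^T with A = sigma_l(Phi W_l^T + 1_n b_l^T)
A = relu(Phi @ W_l.T + b_l)                 # broadcasting adds b_l to every row
Sigma_l = (A @ A.T) / K_l                   # n x n covariance, PSD by construction

# One draw of a layer-l node given the layer below, f_{k,l} ~ N(0, Sigma_l)
jitter = 1e-6 * np.eye(n)
f_kl = np.linalg.cholesky(Sigma_l + jitter) @ rng.standard_normal(n)
```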
Fortunately, the specific structure of the DGP described above allows us to exploit the relationship between deep neural network optimization and approximate Bayesian inference for the DGP prior via a tailored variational approximation. This approach enables scalable inference while still providing robust uncertainty quantification through the DGP. The following section details this variational approximation applied to the factor model incorporating the DGP prior described above.

3.3 Variational Inference on the Deep Gaussian Process Prior

Let $\theta^{(h)} = \{(\mathrm{vec}(W^{(h)}_{l,j})^\top, (b^{(h)}_{l,j})^\top)^\top : l = 1, \ldots, L_h;\ j = 1, \ldots, J\}$ and $\theta^{(\psi)} = \{(\mathrm{vec}(W^{(\psi)}_{l,o})^\top, (b^{(\psi)}_{l,o})^\top)^\top : l = 1, \ldots, L_\psi;\ o = 1, \ldots, O\}$ collect the kernel parameters (weight matrices and bias vectors) for the DGPs defining the covariance functions of $h_j(\cdot)$ and $\psi^{(u)}_o(\cdot)$, respectively. Let $\theta = ((\theta^{(h)})^\top, (\theta^{(\psi)})^\top)^\top$ denote the collection of all covariance parameters, and let $\mathcal{D} = \{s_i, y(s_i), X(s_i)\}_{i=1}^{n}$ denote the observed data. Placing standard normal priors on all entries of $\theta$, i.e., $p(\theta) = N(0, I)$, yields an intractable posterior $\pi(\theta \mid \mathcal{D})$. Closed-form conditionals are unavailable, and generic Metropolis-Hastings samplers mix poorly for this high-dimensional parameter $\theta$ under the deep GP setting. We therefore adopt a variational approximation that factorizes over layers and parameters. Specifically, we define the variational distribution as $q(\theta) = \prod_{l=1}^{L_h} \prod_{j=1}^{J} q(W^{(h)}_{l,j}, b^{(h)}_{l,j}) \prod_{l=1}^{L_\psi} \prod_{o=1}^{O} q(W^{(\psi)}_{l,o}, b^{(\psi)}_{l,o})$, where each term represents the variational distribution of the corresponding weight matrix and bias vector. Let $w^{(h)}_{l,j,k} = (w^{(h)}_{l,j,kk'} : k' = 1, \ldots, K^{(h)}_{l-1})$ and $w^{(\psi)}_{l,o,k} = (w^{(\psi)}_{l,o,kk'} : k' = 1, \ldots, K^{(\psi)}_{l-1})$ be the $k$th rows of the matrices $W^{(h)}_{l,j}$ and $W^{(\psi)}_{l,o}$, respectively. The variational distribution is constructed as
\[
q(W^{(h)}_{l,j}, b^{(h)}_{l,j}) = \prod_{k=1}^{K^{(h)}_l} q(w^{(h)}_{l,j,k}, b^{(h)}_{l,j,k}), \qquad
q(W^{(\psi)}_{l,o}, b^{(\psi)}_{l,o}) = \prod_{k=1}^{K^{(\psi)}_l} q(w^{(\psi)}_{l,o,k}, b^{(\psi)}_{l,o,k}),
\]
\[
q(w^{(h)}_{l,j,k}, b^{(h)}_{l,j,k}) = p^{(h)}_{l,j}\, N\big((\mu^{(w,h)\top}_{l,j,k}, \mu^{(b,h)}_{l,j,k})^\top, \delta^2 I\big) + (1 - p^{(h)}_{l,j})\, N(0, \delta^2 I),
\]
\[
q(w^{(\psi)}_{l,o,k}, b^{(\psi)}_{l,o,k}) = p^{(\psi)}_{l,o}\, N\big((\mu^{(w,\psi)\top}_{l,o,k}, \mu^{(b,\psi)}_{l,o,k})^\top, \delta^2 I\big) + (1 - p^{(\psi)}_{l,o})\, N(0, \delta^2 I), \tag{9}
\]
where $w^{(h)}_{l,j,kk'}$ and $b^{(h)}_{l,j,k}$ are the $(k,k')$th element of the weight matrix $W^{(h)}_{l,j}$ and the $k$th element of the bias vector $b^{(h)}_{l,j}$, respectively; likewise, $w^{(\psi)}_{l,o,kk'}$ and $b^{(\psi)}_{l,o,k}$ are the corresponding entries of $W^{(\psi)}_{l,o}$ and $b^{(\psi)}_{l,o}$. Each scalar weight or bias thus has a two-component Gaussian-mixture variational distribution, with each mixture component having a small variance $\delta^2$ (a small sketch of sampling from (9) is given below). Let $\mu^{(w,h)}_{l,j,k} = (\mu^{(w,h)}_{l,j,kk'} : k' = 1, \ldots, K^{(h)}_{l-1})$ and $\mu^{(w,\psi)}_{l,o,k} = (\mu^{(w,\psi)}_{l,o,kk'} : k' = 1, \ldots, K^{(\psi)}_{l-1})$ be the variational mean parameters corresponding to $w^{(h)}_{l,j,k}$ and $w^{(\psi)}_{l,o,k}$, respectively.
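For small $\delta$, sampling from the mixture in (9) amounts to dropout on the variational means: each row of a weight matrix (together with its bias entry) is kept with probability $p$ and otherwise collapses to (approximately) zero. A minimal sketch, with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(3)

K_l, K_prev = 8, 4
p_keep = 0.8          # inclusion probability p_{l,j} for this layer
delta = 1e-3          # small component standard deviation

mu_W = rng.standard_normal((K_l, K_prev))   # variational means for W
mu_b = rng.standard_normal(K_l)             # variational means for b

# One draw from q: each row (w_k, b_k) keeps its mean with prob p_keep,
# otherwise collapses to (approximately) zero -- i.e., the node is dropped.
keep = rng.random(K_l) < p_keep             # row-wise Bernoulli mask
W = keep[:, None] * mu_W + delta * rng.standard_normal((K_l, K_prev))
b = keep * mu_b + delta * rng.standard_normal(K_l)
```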
Denote by $\eta^{(h)}_{W,j}$ and $\eta^{(h)}_{b,j}$ the collections of all variational mean parameters $\{(\mu^{(w,h)\top}_{l,j,1}, \ldots, \mu^{(w,h)\top}_{l,j,K^{(h)}_l})^\top : l = 1, \ldots, L_h\}$ and $\{\mu^{(b,h)}_{l,j,k} : l = 1, \ldots, L_h;\ k = 1, \ldots, K^{(h)}_l\}$, respectively, corresponding to $h_j(\cdot)$. Collect these parameters for the kernel construction of $\{h_j(\cdot) : j = 1, \ldots, J\}$ in $\eta^{(h)} = \{((\eta^{(h)}_{W,j})^\top, (\eta^{(h)}_{b,j})^\top) : j = 1, \ldots, J\}$. Similarly, collect all weight and bias parameters for the kernel construction of $\{\psi^{(u)}_o(\cdot) : o = 1, \ldots, O\}$ in $\eta^{(\psi)}$. Denote by $\eta = ((\eta^{(h)})^\top, (\eta^{(\psi)})^\top)^\top$ the collection of all variational mean parameters. Since the marginal variational distribution of each weight and bias has its own mean parameter, there is a one-to-one correspondence between the vector of all kernel parameters $\theta$ of the DGP and the collection $\eta$.

The full mean-field variational family factorizes over all entries of the weights and biases, with layer-specific inclusion probabilities $p^{(h)}_{l,j}, p^{(\psi)}_{l,o} \in [0, 1]$. As these probabilities approach 0 (with small $\delta^2$), each marginal collapses to $N(0, \delta^2)$, effectively "dropping" that parameter. We denote the variational distribution of $\theta$ by $q(\theta \mid \eta)$ to show its dependence on the variational parameters $\eta$. The optimal variational parameters $\eta$ are set by minimizing the Kullback-Leibler (KL) divergence between the variational distribution $q(\theta \mid \eta)$ and the full posterior distribution $\pi(\theta \mid \mathcal{D})$. This is equivalent to maximizing the evidence lower bound (ELBO),
\[
\mathcal{L}_{\text{GP-VI}}(\beta, \eta, \sigma^2) = \mathrm{E}_q[\log \pi(\theta, \mathcal{D})] - \mathrm{E}_q[\log q(\theta \mid \eta)], \tag{10}
\]
where $\pi(\theta, \mathcal{D})$ is the joint distribution of data and parameters. With the mean-field variational distribution $q(\theta \mid \eta)$, the ELBO is given by
\[
\mathcal{L}_{\text{GP-VI}}(\beta, \eta, \sigma^2) = \sum_{i=1}^{n} \int \cdots \int q(\theta \mid \eta)\, \log p\big(y(s_i) \mid \theta, \beta, \{F^{(h)}_{l,j}\}_{j,l=1}^{J,L_h}, \{F^{(\psi)}_{l,o}\}_{o,l=1}^{O,L_\psi}, \sigma^2\big)\, \prod_{j=1}^{J} \prod_{l=1}^{L_h} p\big(F^{(h)}_{l,j} \mid F^{(h)}_{l-1,j}, W^{(h)}_{l,j}, b^{(h)}_{l,j}\big)\, \prod_{o=1}^{O} \prod_{l=1}^{L_\psi} p\big(F^{(\psi)}_{l,o} \mid F^{(\psi)}_{l-1,o}, W^{(\psi)}_{l,o}, b^{(\psi)}_{l,o}\big)\, dW^{(h)}_{l,j}\, db^{(h)}_{l,j}\, dW^{(\psi)}_{l,o}\, db^{(\psi)}_{l,o} \;-\; \mathrm{KL}\big(q(\theta \mid \eta)\,\|\,p(\theta)\big),
\]
where $p(y(s_i) \mid \theta, \beta, \{F^{(h)}_{l,j}\}, \{F^{(\psi)}_{l,o}\}, \sigma^2) = N(y(s_i) \mid X(s_i)\beta + \Psi(s_i)h(s_i), \sigma^2 I_J)$ denotes the data likelihood at the outer layer. Note that $F^{(h)}_{L_h,j} = (h_j(s_1), \ldots, h_j(s_n))^\top$ and $F^{(\psi)}_{L_\psi,o} = (\psi^{(u)}_o(s_1), \ldots, \psi^{(u)}_o(s_n))^\top$ are functions of the weight and bias parameters $\theta^{(h)}$ and $\theta^{(\psi)}$, respectively, as described in Section 3.2. Hence, the likelihood term simplifies to $p(y(s_i) \mid \beta, \Psi(s_i), h(s_i), \sigma^2)$. Since direct maximization of this objective is challenging due to the intractable integration, we replace it with a Monte Carlo (MC) approximation,
\[
\mathcal{L}_{\text{GP-MC}}(\beta, \eta, \sigma^2) = \frac{1}{M} \sum_{m=1}^{M} \sum_{i=1}^{n} \log p\big(y(s_i) \mid X(s_i), \beta, \Psi(s_i), h(s_i), \sigma^2, \theta^{(m)}\big) - \mathrm{KL}\big(q(\theta^{(\psi)} \mid \eta^{(\psi)})\,\|\,p(\theta^{(\psi)})\big) - \mathrm{KL}\big(q(\theta^{(h)} \mid \eta^{(h)})\,\|\,p(\theta^{(h)})\big), \tag{11}
\]
where the $\theta^{(m)}$ are MC samples of all weight and bias kernel parameters, drawn collectively from the variational distribution in (9).
$\mathcal{L}_{\text{GP-MC}}$ offers an accurate approximation of the variational objective $\mathcal{L}_{\text{GP-VI}}$ as the number of MC samples $M \to \infty$ [Paisley et al., 2012, Rezende et al., 2014]. $\mathcal{L}_{\text{GP-MC}}$ admits a further simplification, as stated in the following lemma.

Lemma 3.1. Assume that $K^{(h)}_l$ and $K^{(\psi)}_l$ are both large and $\delta \approx 0$. Then
\[
\mathcal{L}_{\text{GP-MC}}(\beta, \sigma^2, \eta) \approx \frac{1}{M} \sum_{m=1}^{M} \sum_{i=1}^{n} \log p\big(y(s_i) \mid X(s_i), \beta, \Psi(s_i), h(s_i), \sigma^2, \theta^{(m)}\big) - \sum_{j=1}^{J} \sum_{l=1}^{L_h} \frac{p^{(h)}_{l,j}}{2} \|\mu^{(w,h)}_{l,j}\|^2 - \sum_{j=1}^{J} \sum_{l=1}^{L_h} \frac{p^{(h)}_{l,j}}{2} \|\mu^{(b,h)}_{l,j}\|^2 - \sum_{o=1}^{O} \sum_{l=1}^{L_\psi} \frac{p^{(\psi)}_{l,o}}{2} \|\mu^{(w,\psi)}_{l,o}\|^2 - \sum_{o=1}^{O} \sum_{l=1}^{L_\psi} \frac{p^{(\psi)}_{l,o}}{2} \|\mu^{(b,\psi)}_{l,o}\|^2. \tag{12}
\]

The proof of Lemma 3.1 can be found in Section 1 of the supplementary file. Objective (12) is optimized exactly like training a deep network with stochastic forward passes through the latent layers and structured weight decay whose strengths are determined by the inclusion probabilities. The next section establishes this critical bridge between stochastic DNN training and variational inference under our DGP prior, enabling scalable optimization via mini-batching and backpropagation while retaining principled uncertainty quantification through the variational distribution $q(\theta \mid \eta)$. It is important to note that the regression coefficients $\beta$ and the noise variance $\sigma^2$ are estimated via point optimization of the Monte Carlo objective. In this work, our emphasis is on capturing uncertainty arising from the estimation of spatial effects, and we do not explicitly account for posterior uncertainty in $\beta$ or $\sigma^2$. As shown in the simulation studies, when spatial effects are dominant relative to fixed effects, this approach results in only a negligible loss of predictive uncertainty at large sample sizes.

3.4 Scalable Computation: Leveraging the Connection Between Deep Gaussian Processes and Deep Neural Networks

This section introduces a dual perspective on the model by presenting an alternative representation of the spatial factors and factor loadings using deep neural networks (DNNs). While inference is carried out using the primary DGP-based modeling framework for the latent factors and factor loadings, the dual view facilitates efficient computation through DNN-based optimization and is analogous to employing a variational approximation of the DGP prior, as outlined below. Specifically, we envision modeling each $h_j(s)$ with a DNN with $L_h$ layers, the $l$th layer containing $K^{(h)}_l$ nodes, given by
\[
h_j(s) = \sigma^{(h)}_{L_h}\Big( W^{(h)}_{L_h,j} \Big[ \sigma^{(h)}_{L_h-1}\big( \cdots \sigma^{(h)}_1\big(W^{(h)}_{1,j}\, s + b^{(h)}_{1,j}\big) \odot z^{(h)}_{1,j} \cdots \big) \odot z^{(h)}_{L_h-1,j} \Big] + b^{(h)}_{L_h,j} \Big), \tag{13}
\]
where the weight and bias parameters are as defined in Section 3.2, and $\odot$ denotes the element-wise product between two vectors/matrices of the same dimensions. The vectors $z^{(h)}_{1,j} \in \mathbb{R}^{K^{(h)}_1}, \ldots, z^{(h)}_{L_h-1,j} \in \mathbb{R}^{K^{(h)}_{L_h-1}}$ are dropout vectors with binary entries in $\{0, 1\}$. Following Section 3.2, $\theta^{(h)}$ represents the collection of all weight and bias parameters for the construction of $h(s) = (h_1(s), \ldots, h_J(s))^\top$; a minimal sketch of one such stochastic forward pass follows.
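The sketch below traces one stochastic forward pass of (13) for a single factor network (hypothetical sizes, ReLU activations, dropout applied to the hidden layers only). Repeated calls give different realizations, which is exactly the stochastic input warping used during training.

```python
import numpy as np

rng = np.random.default_rng(4)
relu = lambda x: np.maximum(x, 0.0)

# Hypothetical 3-layer network for one factor h_j(s): 2 -> 8 -> 8 -> 1
sizes = [2, 8, 8, 1]
W = [rng.standard_normal((sizes[l + 1], sizes[l])) for l in range(3)]
b = [rng.standard_normal(sizes[l + 1]) for l in range(3)]

def h_j(s, p_keep=0.8):
    """One stochastic forward pass of equation (13) with dropout masks z."""
    a = s
    for l in range(3):
        a = relu(W[l] @ a + b[l])                # activation sigma_l is ReLU here
        if l < 2:                                # dropout on hidden layers only,
            a = a * (rng.random(a.shape) < p_keep)  # z_l ~ Bernoulli(p_keep)
    return a

print(h_j(np.array([0.3, 0.7])))  # differs across calls: stochastic warping
```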
Likewise, under the dual view, we envision modeling each $\psi^{(u)}_o(s)$ with a DNN with $L_\psi$ layers, the $l$th layer containing $K^{(\psi)}_l$ nodes, as follows:
\[
\psi^{(u)}_o(s) = \sigma^{(\psi)}_{L_\psi}\Big( W^{(\psi)}_{L_\psi,o} \Big[ \sigma^{(\psi)}_{L_\psi-1}\big( \cdots \sigma^{(\psi)}_1\big(W^{(\psi)}_{1,o}\, s + b^{(\psi)}_{1,o}\big) \odot z^{(\psi)}_{1,o} \cdots \big) \odot z^{(\psi)}_{L_\psi-1,o} \Big] + b^{(\psi)}_{L_\psi,o} \Big), \tag{14}
\]
where the weight and bias parameters are as defined in Section 3.2, $\odot$ again denotes the element-wise product, and the vectors $z^{(\psi)}_{1,o} \in \mathbb{R}^{K^{(\psi)}_1}, \ldots, z^{(\psi)}_{L_\psi-1,o} \in \mathbb{R}^{K^{(\psi)}_{L_\psi-1}}$ are dropout vectors with binary entries in $\{0, 1\}$. Following Section 3.3, $\theta^{(\psi)}$ represents the collection of the weight and bias parameters corresponding to the construction of $\psi^{(u)}(s)$. Figure 1 offers a pictorial representation of this dual view of the model using DNNs.

Figure 1: Illustration of the dual view of the proposed framework using deep neural networks (DNNs).

Therefore, it is possible to fit Equation (1) with the DNN structure on the spatial factor loadings and factors, given data $\mathcal{D}$, by minimizing
\[
\mathcal{L}_{\text{DNN}}(\beta, \sigma^2, \theta^{(\psi)}, \theta^{(h)}) = \frac{1}{2n\sigma^2} \sum_{i=1}^{n} \big\| y(s_i) - \hat{y}(s_i) \big\|_2^2 + \sum_{j=1}^{J} \sum_{l=1}^{L_h} \Big( \lambda^{(w,h)}_{l,j} \|W^{(h)}_{l,j}\|_2^2 + \lambda^{(b,h)}_{l,j} \|b^{(h)}_{l,j}\|_2^2 \Big) + \sum_{o=1}^{O} \sum_{l=1}^{L_\psi} \Big( \lambda^{(w,\psi)}_{l,o} \|W^{(\psi)}_{l,o}\|_2^2 + \lambda^{(b,\psi)}_{l,o} \|b^{(\psi)}_{l,o}\|_2^2 \Big). \tag{15}
\]
Here $\hat{y}(s_i) = X(s_i)\beta + \Psi(s_i)h(s_i)$, and $\{(\lambda^{(w,h)}_{l,j}, \lambda^{(b,h)}_{l,j})\}$ and $\{(\lambda^{(w,\psi)}_{l,o}, \lambda^{(b,\psi)}_{l,o})\}$ are penalty parameters that control shrinkage on the weights and biases of the latent processes. These regularization terms mitigate overfitting by discouraging overly large network parameters and effectively act as weight decay. In practical applications, regularization is primarily achieved through stochastic node dropout, with the dropout rate commonly selected via empirical validation, typically in the range of 0.1 to 0.3, as recommended in the literature [Arora et al., 2021]. For L2 regularization, the penalty parameters are generally set in advance, usually between $10^{-5}$ and $10^{-4}$.

Lemma 3.2. Minimizing the loss in Equation (15) is equivalent to maximizing the ELBO in Equation (12) under the deep-GP framework. In this equivalence, the parameters $\theta^{(\psi)}$ and $\theta^{(h)}$ of Equation (15) correspond, respectively, to the parameters $\eta^{(\psi)}$ and $\eta^{(h)}$ in Equation (12).

A detailed argument proving Lemma 3.2 can be found in Section 2 of the supplementary file. From an implementation standpoint, Lemma 3.2 allows one to estimate $\theta^{(\psi)}$ and $\theta^{(h)}$ using standard mini-batch stochastic backpropagation, yielding estimates $\hat{\theta}^{(\psi)}$ and $\hat{\theta}^{(h)}$. More specifically, the stochastic backpropagation algorithm minimizes the loss in (15) via mini-batch stochastic gradient descent. At iteration $t$, a mini-batch $\mathcal{B}_t \subset \{1, \ldots, n\}$ is sampled uniformly at random, and dropout is applied to obtain a stochastic realization of the network parameters, yielding a sampled predictor $\hat{y}^{(t)}(s)$; a Keras-style sketch of this training setup appears below.
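For intuition, here is a TensorFlow/Keras sketch of one latent network under the loss (15): the squared-error data term plus the L2 penalties collected by the regularizers reproduces the structure of $\mathcal{L}_{\text{DNN}}$, with dropout supplying the stochastic forward passes. The layer sizes, dropout rate, and penalty are illustrative rather than the paper's settings, and the full model combines $J$ factor networks and $O$ loading networks through $\hat{y}(s) = X(s)\beta + \Psi(s)h(s)$.

```python
import tensorflow as tf

l2 = tf.keras.regularizers.l2(1e-4)   # L2 penalty lambda, set in advance

def latent_net(name):
    """One latent surface (a factor h_j or a loading entry psi_o) as a DNN."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(2,),
                              kernel_regularizer=l2, bias_regularizer=l2),
        tf.keras.layers.Dropout(0.2),   # node dropout, rate in the 0.1-0.3 range
        tf.keras.layers.Dense(64, activation="relu",
                              kernel_regularizer=l2, bias_regularizer=l2),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(1),       # one output per location
    ], name=name)

# Squared-error data term plus the L2 penalties (added via the regularizers)
# mirrors the structure of (15); Adam performs the mini-batch updates.
net = latent_net("h_1")
net.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-2), loss="mse")
```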
The parameters are then updated according to the stochastic gradient rules
\[
\theta^{(\psi)}_{t+1} = \theta^{(\psi)}_t - \alpha_t \Big[ \tfrac{1}{\sigma^2} \sum_{i \in \mathcal{B}_t} \big(\hat{y}^{(t)}(s_i) - y(s_i)\big)^{\top} \nabla_{\theta^{(\psi)}}\, \hat{y}^{(t)}(s_i) + \nabla_{\theta^{(\psi)}} R\big(\theta^{(\psi)}_t, \theta^{(h)}_t\big) \Big],
\]
\[
\theta^{(h)}_{t+1} = \theta^{(h)}_t - \alpha_t \Big[ \tfrac{1}{\sigma^2} \sum_{i \in \mathcal{B}_t} \big(\hat{y}^{(t)}(s_i) - y(s_i)\big)^{\top} \nabla_{\theta^{(h)}}\, \hat{y}^{(t)}(s_i) + \nabla_{\theta^{(h)}} R\big(\theta^{(\psi)}_t, \theta^{(h)}_t\big) \Big], \tag{16}
\]
where $R(\theta^{(\psi)}, \theta^{(h)}) = \sum_{j=1}^{J} \sum_{l=1}^{L_h} \big( \lambda^{(w,h)}_{l,j} \|W^{(h)}_{l,j}\|_2^2 + \lambda^{(b,h)}_{l,j} \|b^{(h)}_{l,j}\|_2^2 \big) + \sum_{o=1}^{O} \sum_{l=1}^{L_\psi} \big( \lambda^{(w,\psi)}_{l,o} \|W^{(\psi)}_{l,o}\|_2^2 + \lambda^{(b,\psi)}_{l,o} \|b^{(\psi)}_{l,o}\|_2^2 \big)$ and $\alpha_t > 0$ denotes the learning rate at iteration $t$. At the $(t+1)$th iteration, the updated values $\beta_{t+1}$ and $\sigma^2_{t+1}$ are computed as
\[
\beta_{t+1} = \arg\min_{\beta} \sum_{i=1}^{n} \big\| y(s_i) - X(s_i)\beta - \Psi_{t+1}(s_i)\, h_{t+1}(s_i) \big\|^2, \qquad
\sigma^2_{t+1} = \frac{1}{n} \sum_{i=1}^{n} \big\| y(s_i) - X(s_i)\beta_{t+1} - \Psi_{t+1}(s_i)\, h_{t+1}(s_i) \big\|^2,
\]
where $\Psi_{t+1}(s_i)$ and $h_{t+1}(s_i)$ are the loading matrix and factors evaluated at the $(t+1)$th iterates $\theta^{(\psi)}_{t+1}$ and $\theta^{(h)}_{t+1}$. By exploiting the equivalence between the ELBO and the DNN loss in Lemma 3.2, these estimates coincide with the optimal variational parameters $\hat{\eta}^{(\psi)}$ and $\hat{\eta}^{(h)}$, respectively. This enables a strategy for drawing approximate posterior samples of $\theta^{(\psi)}$ and $\theta^{(h)}$, as outlined below.

3.4.1 Node Dropout-Based Posterior Sampling

We obtain approximate posterior draws of the weights and biases in $\theta^{(\psi)}$ and $\theta^{(h)}$ via Monte Carlo (MC) dropout applied to the trained parameters $\hat{\theta}^{(\psi)}$ and $\hat{\theta}^{(h)}$, respectively. Dropout is incorporated at both the training and testing stages, but serves different purposes: during training, it is part of the variational optimization, whereas at test time it enables stochastic forward passes for approximate posterior inference, as detailed below. For each $j = 1, \ldots, J$ and each layer $l$, let $z^{(m,h)}_{l,j} \in \{0,1\}^{K^{(h)}_l}$ be the $m$th post-convergence dropout vector obtained during model training. Construct the $K^{(h)}_l \times K^{(h)}_{l-1}$ binary matrix $Z^{(m,w,h)}_{l,j} = [z^{(m,h)}_{l,j} : \cdots : z^{(m,h)}_{l,j}] = z^{(m,h)}_{l,j} 1^\top_{K^{(h)}_{l-1}}$, and form the masked parameters $W^{(m,h)}_{l,j} = \hat{W}^{(h)}_{l,j} \odot Z^{(m,w,h)}_{l,j}$ and $b^{(m,h)}_{l,j} = \hat{b}^{(h)}_{l,j} \odot z^{(m,h)}_{l,j}$. Run a forward pass through the neural network describing the function $h_j(\cdot)$ to produce one stochastic realization. Analogously, for each $o = 1, \ldots, O$ and each layer $l$ of the neural networks describing $\psi^{(u)}_o(\cdot)$, retain the $m$th post-convergence node dropout vector $z^{(m,\psi)}_{l,o} \in \{0,1\}^{K^{(\psi)}_l}$ from model training, construct the $K^{(\psi)}_l \times K^{(\psi)}_{l-1}$ binary matrix $Z^{(m,w,\psi)}_{l,o} = z^{(m,\psi)}_{l,o} 1^\top_{K^{(\psi)}_{l-1}}$, and form the masked parameters $W^{(m,\psi)}_{l,o} = \hat{W}^{(\psi)}_{l,o} \odot Z^{(m,w,\psi)}_{l,o}$ and $b^{(m,\psi)}_{l,o} = \hat{b}^{(\psi)}_{l,o} \odot z^{(m,\psi)}_{l,o}$. Given small $\delta^2$ and Lemma 3.2, applying these dropout masks to the estimated weight and bias parameters effectively simulates draws from the variational posterior $q(\theta \mid \eta)$ for $\theta$ in (9). Joint forward passes of both networks yield a stochastic draw $\theta^{(m)} = (\theta^{(m,h)}, \theta^{(m,\psi)})$; a minimal sketch of one such draw follows.
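A minimal sketch of one MC-dropout draw from the trained parameters (hypothetical two-layer network; masks are applied to hidden layers only, matching the algorithm below):

```python
import numpy as np

rng = np.random.default_rng(5)

def mc_dropout_draw(W_hat, b_hat, p_keep=0.8):
    """One masked draw (W^(m), b^(m)) from trained parameters, Section 3.4.1."""
    draws = []
    last = len(W_hat) - 1
    for l, (W, b) in enumerate(zip(W_hat, b_hat)):
        if l == last:                     # no mask on the output layer
            draws.append((W, b))
            continue
        z = (rng.random(W.shape[0]) < p_keep).astype(float)  # node mask z^(m)_l
        draws.append((W * z[:, None], b * z))                # Z^(m) = z 1^T masking
    return draws

# Hypothetical trained parameters for a 2 -> 8 -> 1 network
W_hat = [rng.standard_normal((8, 2)), rng.standard_normal((1, 8))]
b_hat = [rng.standard_normal(8), rng.standard_normal(1)]
theta_m = mc_dropout_draw(W_hat, b_hat)   # repeat M times for M posterior draws
```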
These dropout draws are used to compute posterior summaries of model components and posterior predictive distributions at the cost of standard dropout inference. The procedure is illustrated in Figure 2, and pseudocode is provided in Algorithm 1.

Figure 2: Illustration of the MC dropout procedure as an approximation to variational Bayesian inference. Each forward pass samples a random dropout mask, generating stochastic realizations of the network weights. This corresponds to drawing samples from the variational posterior distribution, enabling approximate posterior inference and uncertainty quantification without explicit parameterization.

Posterior and posterior predictive distributions. At a test location $s^*$, we perform $M$ stochastic forward passes with dropout activated to obtain samples of the multivariate output $\{h^{(m)}(s^*)\}_{m=1}^{M}$ and $\{\Psi^{(m)}(s^*)\}_{m=1}^{M}$, and construct samples $w^{(m)}(s^*) = \Psi^{(m)}(s^*)\, h^{(m)}(s^*)$. While posterior inference on $w(s^*)$ can be obtained directly from these MC samples, our empirical evaluation indicates moderate under-coverage, possibly due to the variational approximation. As an alternative, the posterior distribution of the multivariate residual spatial effects is constructed through an additional normal approximation that adds a regularizing effect for uncertainty quantification, $p(w(s^*) \mid \mathcal{D}) \approx N\big(\hat{\mu}_w(s^*), \hat{\Sigma}_w(s^*)\big)$, where the mean and covariance are estimated by
\[
\hat{\mu}_w(s^*) = \frac{1}{M} \sum_{m=1}^{M} w^{(m)}(s^*), \qquad
\hat{\Sigma}_w(s^*) = \frac{1}{M} \sum_{m=1}^{M} \big(w^{(m)}(s^*) - \hat{\mu}_w(s^*)\big)\big(w^{(m)}(s^*) - \hat{\mu}_w(s^*)\big)^{\top}. \tag{17}
\]
Similarly, the posterior predictive distribution for the response $y(s^*)$ is approximated by a multivariate normal distribution:
\[
p\big(y(s^*) \mid \mathcal{D}\big) \approx N\big(\hat{\mu}_y(s^*), \hat{\Sigma}_y(s^*)\big), \qquad
\hat{\mu}_y(s^*) = X(s^*)\hat{\beta} + \hat{\mu}_w(s^*), \qquad
\hat{\Sigma}_y(s^*) = \hat{\Sigma}_w(s^*) + \hat{\sigma}^2 I_J. \tag{18}
\]
The predictive mean and associated uncertainty are extracted directly from (18). Furthermore, the $J \times J$ estimated cross-covariance matrix $\hat{\Sigma}_y(s^*) = ((\hat{\Sigma}_{y,jj'}(s^*)))_{j,j'=1}^{J}$ provides insight into the location-specific cross-dependencies at the test location $s^*$. Specifically, the estimated cross-correlation between the $j$th and $j'$th outcomes at $s^*$ is $\hat{\rho}_{jj'}(s^*) = \hat{\Sigma}_{y,jj'}(s^*) / \sqrt{\hat{\Sigma}_{y,jj}(s^*)\, \hat{\Sigma}_{y,j'j'}(s^*)}$. Note that the nonlinear dependence structure is captured through the stochastic construction of $w^{(m)}(s^*)$, while the multivariate normal approximation is used only for posterior summarization and uncertainty calibration. A sketch of these predictive summaries is given below; subsequent sections then empirically assess the proposed framework with respect to predictive inference and the estimation of spatially varying cross-correlations.
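A minimal sketch of the predictive summaries (17)-(18) at a single test location, starting from hypothetical MC draws $w^{(m)}(s^*)$. The fixed-effect term $X(s^*)\hat{\beta}$ is omitted for brevity, and np.cov's $1/(M-1)$ scaling differs immaterially from the $1/M$ above.

```python
import numpy as np

rng = np.random.default_rng(6)
M, J = 500, 2

# Hypothetical MC-dropout draws w^(m)(s*) at one test location (M x J);
# in practice these come from forward passes of the trained networks.
w_draws = rng.standard_normal((M, J)) @ np.array([[1.0, 0.6], [0.0, 0.8]])

sigma2_hat = 0.5                            # plug-in noise variance estimate
mu_w = w_draws.mean(axis=0)                 # equation (17): predictive mean
Sigma_w = np.cov(w_draws, rowvar=False)     # equation (17): MC covariance
Sigma_y = Sigma_w + sigma2_hat * np.eye(J)  # equation (18)

# Location-specific cross-correlation rho_12(s*) between the two outcomes
rho_12 = Sigma_y[0, 1] / np.sqrt(Sigma_y[0, 0] * Sigma_y[1, 1])

# 95% predictive intervals for each outcome (normal approximation)
lower = mu_w - 1.96 * np.sqrt(np.diag(Sigma_y))
upper = mu_w + 1.96 * np.sqrt(np.diag(Sigma_y))
```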
Algorithm 1: Fitting DNC.
Input: training data $\mathcal{D} = \{(s_i, y(s_i))\}_{i=1}^{n}$. Output: posterior samples of $\hat{y}(s)$.
Step 1 (initialization). Initialize $\beta$, $\sigma^2$, $\theta^{(h)}$, and $\theta^{(\psi)}$ at $\beta^{(0)}$, $\sigma^{2(0)}$, $\theta^{(0,h)}$, and $\theta^{(0,\psi)}$, respectively.
Step 2 (point estimation). For iter $= 1, \ldots, n_{\text{iter}}$: update $\theta^{(h)}$ and $\theta^{(\psi)}$ using stochastic gradient descent (SGD) on the loss function in Equation (15). Return the optimized estimates $\hat{\theta}^{(h)} = \{\hat{W}^{(h)}_{l,j}, \hat{b}^{(h)}_{l,j}\}$ and $\hat{\theta}^{(\psi)} = \{\hat{W}^{(\psi)}_{l,o}, \hat{b}^{(\psi)}_{l,o}\}$.
Step 3 (dropout sampling). For $m = 1, \ldots, M$:
(1) Sample node-wise dropout masks $z^{(m,h)}_{l,j} \in \{0,1\}^{K^{(h)}_l}$ and $z^{(m,\psi)}_{l,o} \in \{0,1\}^{K^{(\psi)}_l}$, and construct $Z^{(m,w,h)}_{l,j} = z^{(m,h)}_{l,j} 1^\top_{K^{(h)}_{l-1}}$ and $Z^{(m,w,\psi)}_{l,o} = z^{(m,\psi)}_{l,o} 1^\top_{K^{(\psi)}_{l-1}}$ as in Section 3.4.1.
(2) Apply the masks to obtain sparse parameters $W^{(m,h)}_{l,j} = \hat{W}^{(h)}_{l,j} \odot Z^{(m,w,h)}_{l,j}$, $b^{(m,h)}_{l,j} = \hat{b}^{(h)}_{l,j} \odot z^{(m,h)}_{l,j}$, and analogously $W^{(m,\psi)}_{l,o}$, $b^{(m,\psi)}_{l,o}$; set $\theta^{(m,h)} = \{W^{(m,h)}_{l,j}, b^{(m,h)}_{l,j}\}$ and $\theta^{(m,\psi)} = \{W^{(m,\psi)}_{l,o}, b^{(m,\psi)}_{l,o}\}$.
(3) Forward propagate to obtain the latent outputs $h^{(m)}(s_i)$ and $\Psi^{(m)}(s_i)$ for all $i$, and generate the multivariate spatial effect $w^{(m)}(s_i) = \Psi^{(m)}(s_i)\, h^{(m)}(s_i)$.
(4) Accumulate the samples $\{w^{(m)}(s_i)\}$ and form the predictive mean/covariance using Equation (18).
Step 4 (output). Return the approximate posterior samples $\{\theta^{(1)}, \ldots, \theta^{(M)}\}$.

4 Simulation Study

Given the DNN-based computation strategy for the spatial factor model with the LMC, we refer to our approach as Deep Neural Coregionalization (DNC). We present two simulation studies evaluating the performance of the proposed DNC framework against other multivariate Bayesian spatial models. In both simulations, data are generated from model (1) with $J = 2$, yielding bivariate outcomes, and incorporate the spatially varying linear model of coregionalization structure (2). In the first scenario, spatial factors and factor loadings are simulated from stationary Gaussian processes. The second scenario explores a more challenging setting where the spatial factor loadings are generated with complex nonlinear structures that are difficult to estimate using stationary Gaussian processes. This latter simulation is designed to highlight the flexibility of the DNC framework when dealing with complex spatial nonlinearities in the factor loadings.

As competitors, we consider three representative spatial models: (i) a Bayesian Gaussian process model fitted independently to each outcome, referred to as spLM and implemented with the spLM function in the spBayes package in R [Finley et al., 2007]; (ii) a Bayesian linear model of coregionalization (BLMC) [Zhang and Banerjee, 2022b]; and (iii) a Bayesian multivariate Gaussian process (MVGP) under the LMC framework, referred to as spMvLM and fitted using the spMvLM function from the spBayes package in R [Finley et al., 2007]. The spLM model is fitted separately to each outcome and does not capture the cross-covariance between outcomes. In contrast, both spMvLM and BLMC fit model (1) while employing the LMC structure (2) to model the cross-covariance between outcomes. However, both assume that the loading matrix is spatially invariant, resulting in a spatially constant cross-covariance structure. Notably, spMvLM fits stationary Gaussian processes for the spatial factors, whereas BLMC uses Vecchia-approximated GPs, which permit nonstationarity in the spatial factors. The proposed DNC framework is implemented in Python using TensorFlow, with ReLU activation functions in all hidden layers, and trained using the Adam optimizer with a batch size of 64 and early stopping; the learning rate and maximum number of epochs were set to $10^{-2}$ and 1000 for the simulation study in Section 4.1.1, and to $10^{-3}$ and 2000 for the simulation study in Section 4.1.2. All computational experiments were conducted on a single machine with 10 CPU cores and 64 GB of memory.

Model performance was assessed using root mean squared prediction error (RMSPE), coverage of 95% prediction intervals (CVG; the percentage of intervals containing the true value), average length of 95% prediction intervals (LEN), and computational runtime. In addition, we calculate and report the empirically estimated spatially varying correlation between the two outcomes, $\hat{\rho}_{12}(s)$, across the domain, and display the corresponding true correlation surface, which is analytically derived from the simulation design. All simulations are replicated 100 times; each metric is reported as an average over the 100 replications, together with its standard deviation.

4.1 Simulation Design

4.1.1 Synthetic Data from Stationary Factor Loadings

We generated bivariate spatial data based on a spatially varying LMC following Guhaniyogi et al. [2013b]. Specifically, we considered $n = 2500$ locations $s_1, \ldots, s_n \subset [0,1]^2$ and randomly split them into $n_{\text{train}} = 1500$ for training, $n_{\text{val}} = 500$ for validation, and $n_{\text{test}} = 500$ for testing. At each location, we simulated two outcomes using the multivariate spatial regression model in Equation (1), with the spatially varying LMC defined as
\[
w(s) = \Psi(s) h(s), \qquad
\Psi(s) = \begin{pmatrix} \psi_{11}(s) & \psi_{12}(s) \\ 0 & \psi_{22}(s) \end{pmatrix}, \qquad
h_j(s) \overset{\text{ind.}}{\sim} \mathrm{GP}\big(0, C(\cdot,\cdot,\phi)\big),
\]
where $C(s_i, s_{i'}, \phi) = \exp(-\|s_i - s_{i'}\|/\phi)$ denotes the exponential correlation function with a common range parameter $\phi$ for both spatial processes. The sample size is kept moderate so as to fit the competitor spMvLM, which is critical for understanding the advantages of the space-varying factor loading matrix in DNC. The regression coefficients were set to $(\beta_1, \beta_2) = (1, 1)$ and the error covariance matrix to $\Sigma = 0.5\, I$. Covariates were generated independently at each location according to $x(s) \sim N(0, I)$. The factor loading surfaces were specified as independent Gaussian processes with the same range $\phi$, such that $\psi_{11}(s) = 1.0 + \eta_{11}(s)$, $\psi_{22}(s) = 1.0 + \eta_{22}(s)$, $\psi_{12}(s) = \eta_{12}(s)$,
In con trast, b oth spMvLM and BLMC fit mo del (1) while employing the linear mo del of coregionalization structure (2) to mo del cr oss -cov ariance b etw een outcomes. How ev er, b oth assume that the loading matrix is spatially in v arian t, resulting i n a spatially constan t cross-co v ariance structure. Notably , spMvLM fits stationary Gaussian pro cesses for the spatial factors , whereas BLMC uses V ecchia-appro ximated GPs, whic h p ermits nonstationarity in the spatial fa ct or s. The prop o sed DNC framework is implemented in Python using TensorFlow , with ReLU activ ation functions in all hidden lay ers, and trained using the Adam optimizer with a batch size of 64 and early stopping; the learning rate and maximum num b er of ep o c hs were set to 10 − 2 and 1000 for the sim ulation stu d y in Section 4.1.1, and to 10 − 3 and 2000 for the 21 sim ulation study in Section 4.1.2. All computational exp erim ents were conducted on a single mac hine with 10 CPU cores and 64 GB of memory . Mo del p erformance was assessed using ro ot mean s qu a r ed predicti on error (RMSPE), cov erage of 95% prediction interv als ( CVG; the p ercen tage of interv als con tainin g the true v alue), a verage length o f 95% prediction in terv als (LEN), and comput a t i on a l run time. In addition , we calculate and rep ort th e empirically estimated spatiall y -v arying correlation b et ween the tw o ou t co m es, b ρ 12 ( s ), across the domain, and dis p l ay the corresp onding true correlation surface, which is analytically deriv ed from the simulation design . All sim ulations are repl i c at ed 100 t i m es and eac h metric a veraged ov er the 100 sim ulations is presen ted. W e also presen t standard deviation for each metric ov er these 100 simulations. 4.1 Sim ulation Design 4.1.1 Syn thetic Data from Stationary F actor Loadings W e generated biv ariate spatial data based on a spatially-v arying LMC following Guhaniyogi et al. [2013b]. Sp ecifically , we considered n = 2 50 0 lo cati o n s in s 1 , ..., s n ⊂ [0 , 1] 2 and ran- domly split them in to n train = 1500 for trainin g , n v al = 500 for v alidation, and n test = 500 for testing. A t each lo cation, w e sim ulated tw o outc om es using the multiv ariate spatial regression mo del giv en in Equation (1), with the spatially-v arying LMC defined as w ( s ) = Ψ ( s ) h ( s ) , Ψ ( s ) = ψ 11 ( s ) ψ 12 ( s ) 0 ψ 22 ( s ) , h j ( s ) ind. ∼ GP 0 , C ( · , · , ϕ ) , where C ( s i , s i ′ , ϕ ) = exp( −|| s i − s i ′ || /ϕ ) d en o t es the exp o n ential correlati on functio n with a common range parameter ϕ for b oth spatial pro cesses. The sample size is kept mo derate so as to fit the comp etitor spMvLM which is cr i t i cal in understanding the ad v an tages due to the space-v aryi n g factor loadin g matrix in DNC. The regression co efficien ts w ere set as ( β 1 , β 2 ) = (1 , 1) and t h e error cov ariance matrix as Σ = 0 . 5 I . Co v ariates w ere generated indep endently at eac h location according to x ( s ) ∼ N ( 0 , I ) . The factor loading surfaces w ere sp ecified as indep enden t Gaussi a n pro cesses with the same range ϕ , suc h that ψ 11 ( s ) = 1 . 0 + η 11 ( s ) , ψ 22 ( s ) = 1 . 0 + η 22 ( s ) , ψ 21 ( s ) = η 21 ( s ) , 22 where eac h η ij ( s ) ∼ GP 0 , C ( · , · , ϕ ) . The p o si t i ve offsets in ψ 11 ( s ) and ψ 22 ( s ) ensure these v al u es are p ositive, while the mean-zero ψ 21 ( s ) induces bo t h p o si t i ve and negati ve cross- co v ariances across the domai n . F or thi s sim ulation, w e set ϕ = 0 . 5. 
Table 1 summarizes the predictive performance of all models on the nonstationary bivariate dataset generated using the spatially varying LMC. The proposed DNC framework attains the lowest RMSPE for both outcomes, indicating superior predictive accuracy over the baseline approaches. Coverage probabilities for DNC are close to the nominal 95% level, with relatively short credible intervals, reflecting well-calibrated uncertainty quantification. Figure 3 further supports this by showing that the predicted outcomes from DNC closely match the true spatial outcomes, with the 95% predictive intervals placed tightly around the truth. This improvement stems from DNC's flexible deep latent representation, which effectively captures spatially varying cross-covariances beyond the stationarity assumption. In contrast, spLM and spMvLM operate under the restrictive assumption $\Psi(s) = \Psi$, which does not accommodate localized changes in cross-covariance structure, resulting in excessively wide intervals. Moreover, spLM fits each outcome independently, which contributes to its noticeably inferior performance relative to the other methods. BLMC emerges as the closest competitor owing to its factor model representation and its ability to model complex nonstationarity in latent factors. Nevertheless, its assumption of spatially invariant loading matrices leads to moderately lower performance compared to DNC. Notably, DNC stands out for its computational efficiency, achieving a substantial reduction in runtime relative to its most relevant competitor, BLMC. Furthermore, the spatially varying cross-correlation estimated by DNC effectively captures the essential pattern of the true cross-correlation observed in the simulated data, as illustrated in Figure 4.

Figure 3: Spatial predictive surfaces for the two outcomes under the proposed DNC model for data simulated under Section 4.1.1. The top row displays the true surface, predicted surface, and lower and upper ends of the 95% predictive intervals for $y_1(s)$; the bottom row shows the same for $y_2(s)$. The point estimates capture the spatial variability across the domain, with 95% predictive intervals tightly around the truth.

Figure 4: Spatial patterns of the location-specific cross-correlation $\mathrm{Corr}(y_1(s), y_2(s))$ from one representative simulation dataset for the simulation design in Section 4.1.1. The left panel shows the true cross-correlation surface, while the right panel presents the corresponding posterior predictive estimate obtained from the proposed DNC model.
Table 1: Root mean squared prediction error (RMSPE), coverage (CVG), and average length (LEN) of 95% prediction intervals for each model on the two-outcome test set comprising $n_{\text{test}} = 500$ locations, averaged over 100 replications. The model with the lowest RMSPE is highlighted in bold. Standard deviations across the 100 replications are shown in parentheses. The results demonstrate that DNC achieves the lowest RMSPE for both outcomes, maintains nominal coverage, and provides narrower predictive intervals than the competing methods. Additionally, DNC exhibits the shortest computation time, with a marked reduction relative to its closest competitor, BLMC.

                          y_1(s)                                   y_2(s)
Competitors   RMSPE        CVG          LEN          RMSPE        CVG          LEN          Time (min)
DNC           0.69 (0.04)  0.95 (0.02)  2.65 (0.04)  0.81 (0.07)  0.92 (0.03)  2.78 (0.14)  0.45 (0.07)
spLM          0.88 (0.16)  0.94 (0.01)  3.25 (0.65)  1.90 (0.28)  0.95 (0.01)  6.90 (1.10)  8.50 (0.20)
BLMC          0.75 (0.01)  0.95 (0.01)  2.95 (0.03)  0.89 (0.12)  0.95 (0.03)  3.52 (0.06)  48.0 (16.0)
spMvLM        0.77 (0.05)  0.97 (0.02)  6.90 (1.70)  0.84 (0.05)  0.97 (0.03)  6.85 (1.70)  25.0 (0.50)

4.1.2 Synthetic Data from Nonstationary Deep GP-Derived Factor Loadings

In contrast to the first simulation example, which generates entries of the factor loading matrix from a stationary Gaussian process, the current example employs a deep Gaussian process to model the factor loadings. Additionally, the spatial factors in this simulation are constructed to be smoother than those in the previous example. As before, spatial locations are chosen as $s_1, \ldots, s_n \subset [0,1]^2$, resulting in $n = 2{,}500$ spatial locations. These locations are randomly divided into training ($n_{\text{train}} = 1{,}500$), validation ($n_{\text{val}} = 500$), and testing ($n_{\text{test}} = 500$) sets for model fitting and assessment.

To begin, smooth latent spatial processes $\mathbf{h}(s) = (h_1(s), h_2(s))^\top$ are simulated independently from Gaussian process priors: $h_j(s) \stackrel{\text{ind.}}{\sim} \mathrm{GP}\big(0, \tau_h^2 C_h(\cdot,\cdot;\phi_h)\big)$ for $j = 1, 2$, where $C_h$ is a Matérn correlation function with smoothness $\nu_h = 3/2$, variance $\tau_h^2 = 1$, and length-scale $\phi_h = 0.2$.

To introduce spatially varying cross-covariances, the loading matrix $\boldsymbol{\Psi}(s) \in \mathbb{R}^{2\times 2}$ is specified using a nested, two-layer deep Gaussian process structure. Each entry $\psi_{jj'}(s)$ is generated by first mapping each spatial location $s$ to a $Q = 5$-dimensional latent representation,

$$\mathbf{u}(s) = (u_1(s), \ldots, u_Q(s))^\top, \qquad u_q(s) \sim \mathrm{GP}\big(0, \tau_U^2 C_U(\cdot,\cdot;\phi_U)\big), \quad q = 1, \ldots, Q,$$

where $C_U$ is a Matérn covariance function with smoothness $\nu_U = 3/2$, variance $\tau_U^2 = 1$, and length-scale $\phi_U = 0.4$. Conditional on $\mathbf{u}(s)$, each element of the loading matrix is constructed as

$$\psi_{jj'}(s) = g_{jj'}\big(\mathbf{u}(s)\big), \qquad g_{jj'}(\cdot) \sim \mathrm{GP}\big(0, \tau_\Psi^2 C_\Psi(\cdot,\cdot;\phi_\Psi)\big),$$

independently for $j, j' = 1, 2$, where $C_\Psi$ is a Matérn covariance function on the latent space, with smoothness $\nu_\Psi = 3/2$, variance $\tau_\Psi^2 = 1$, and length-scale $\phi_\Psi = 0.3$.

This hierarchical, nested process induces a rich and nonlinear dependency structure in $\boldsymbol{\Psi}(s)$, which would be challenging to model with a stationary GP prior on the factor loadings alone. The bivariate outcome $\mathbf{y}(s) = (y_1(s), y_2(s))^\top$ is then simulated according to model (1), applying the spatially varying linear model of coregionalization (2) with the factors and factor loadings specified as above. For the regression component, a single covariate with coefficient $\beta = 0.25$ is included, and the error term is simulated with standard deviation $\sigma_\epsilon = 0.1$.
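A minimal sketch of this nested construction follows, again illustrative rather than the study's code; it uses the closed-form Matérn-3/2 kernel and a jittered Cholesky factor as implementation conveniences, and the helper names are assumptions.

```python
# Sketch of the Section 4.1.2 nested (two-layer) deep-GP loading
# construction; Matern-3/2 kernels throughout (illustrative code).
import numpy as np

rng = np.random.default_rng(1)

def matern32(A, B, ls, var=1.0):
    """Matern covariance (nu = 3/2) between the rows of A and B."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    r = np.sqrt(3.0) * d / ls
    return var * (1.0 + r) * np.exp(-r)

def chol(K):
    """Jittered Cholesky factor for sampling GP realizations."""
    return np.linalg.cholesky(K + 1e-8 * np.eye(K.shape[0]))

n, Q = 2500, 5
S = rng.uniform(size=(n, 2))                       # locations on [0, 1]^2

# Layer 0: smooth latent factors h_1, h_2 (length-scale 0.2)
h = chol(matern32(S, S, ls=0.2)) @ rng.standard_normal((n, 2))

# Layer 1: Q = 5-dimensional latent representation u(s) (length-scale 0.4)
U = chol(matern32(S, S, ls=0.4)) @ rng.standard_normal((n, Q))

# Layer 2: GPs g_jj' indexed by u(s) yield the loading surfaces psi_jj'(s)
Psi = (chol(matern32(U, U, ls=0.3)) @ rng.standard_normal((n, 4))).reshape(n, 2, 2)

# Outcomes: y(s) = x(s) beta + Psi(s) h(s) + eps, beta = 0.25, sd(eps) = 0.1
x = rng.standard_normal(n)
w = np.einsum('nij,nj->ni', Psi, h)                # spatially varying LMC
y = 0.25 * x[:, None] + w + rng.normal(scale=0.1, size=(n, 2))
```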
Table 2 summarizes the predictive performance for the nested nonstationary data generated using the deep GP-inspired hierarchical coregionalization scheme. Consistent with the results of Section 4.1.1, the proposed DNC exhibits strong predictive accuracy alongside well-calibrated coverage and moderate interval widths, underscoring its efficacy in capturing intricate hierarchical spatial dependencies through its deep latent architecture. Figure 5 illustrates that the predicted $y_1(s)$ and $y_2(s)$ closely mirror the local features of the true outcomes, although some degree of oversmoothing is present. The 95% predictive intervals for both outcomes are tightly concentrated around the ground truth, indicating reliable uncertainty quantification. In contrast, while BLMC and spMvLM achieve comparable uncertainty estimates, their point predictions are modestly inferior, likely due to their inability to account for space-varying loading matrices. SpLM performs notably worse with respect to point predictions, which can be attributed to its neglect of both the inter-outcome correlation and the nonstationarity in the factor loadings.

One significant advantage of DNC is its unique ability to recover space-varying correlations between outcomes, as depicted in Figure 6. While the spatial cross-correlation estimated by DNC captures the underlying trend, some oversmoothing persists, possibly stemming from large-sample approximations inherent to variational inference when applied to smaller datasets. This methodological nuance may also explain the limited distinction in performance between DNC and BLMC under these simulation settings. A particularly striking aspect is the computational efficiency of DNC: it runs nearly 100 times faster than BLMC, a substantial improvement in scalability. Collectively, these findings highlight that DNC offers a robust balance between model flexibility and computational practicality for state-of-the-art multivariate geospatial analysis.

Figure 5: Spatial predictive surfaces for the two outcomes under the proposed DNC model for data simulated under Section 4.1.2. The top row displays the true surface, the predicted surface, and the lower and upper ends of the 95% predictive intervals for $y_1(s)$; the bottom row shows the same for $y_2(s)$. The point estimates capture the spatial variability across the domain, with the 95% predictive intervals tightly around the truth.

Figure 6: Spatial patterns of the location-specific cross-correlation $\mathrm{Corr}(y_1(s), y_2(s))$ from one representative simulation dataset for the simulation design in Section 4.1.2. The left panel shows the true cross-correlation surface, while the right panel presents the corresponding posterior predictive estimate obtained from the proposed DNC model.

Table 2: Root mean squared prediction error (RMSPE), coverage (CVG), and average length (LEN) of 95% prediction intervals for each model on the two-outcome test set comprising $n_{\text{test}} = 500$ locations, averaged over 100 replications.
The model with the lowest RMSPE is highlighted in bold. Standard deviations across the 100 replications are shown in parentheses. The results demonstrate that DNC achieves the lowest RMSPE for both outcomes, maintains nominal coverage, and provides narrower predictive intervals than the competing methods. Additionally, DNC exhibits the shortest computation time, with a marked reduction relative to its closest competitor, BLMC.

                          y_1(s)                                   y_2(s)
Competitors   RMSPE        CVG          LEN          RMSPE        CVG          LEN          Time (min)
DNC           1.38 (0.07)  0.94 (0.01)  5.59 (0.25)  1.38 (0.06)  0.94 (0.01)  5.65 (0.25)  0.31 (0.75)
spLM          1.56 (0.11)  0.94 (0.01)  5.71 (0.28)  1.58 (0.09)  0.94 (0.01)  5.76 (0.26)  9.12 (6.25)
BLMC          1.42 (0.06)  0.94 (0.05)  5.54 (0.02)  1.43 (0.05)  0.94 (0.03)  5.60 (0.02)  28.66 (9.19)
spMvLM        1.44 (0.11)  0.94 (0.01)  5.80 (0.25)  1.45 (0.09)  0.94 (0.25)  5.82 (0.25)  25.22 (11.17)

5 Remote Sensing Vegetation Data Analysis

The western United States is experiencing rapid and unprecedented ecological transformations, with vegetation increasingly exposed to severe drought conditions and heightened wildfire risk [Abatzoglou and Williams, 2016, Williams et al., 2019, Westerling et al., 2006]. These dynamics result in significant economic costs: wildfires in this region are estimated to burden the U.S. economy by $394 to $893 billion per year, with damages surpassing $90 billion between 2017 and 2021 alone. Thus, accurately identifying the environmental drivers of vegetation health and productivity is essential for forecasting ecosystem responses and guiding adaptive management strategies [Running et al., 2004, Pettorelli et al., 2005]. This challenge is further complicated by the region's pronounced environmental heterogeneity, spanning from coastal rainforests that receive over 3,000 mm of yearly precipitation to interior deserts with less than 200 mm [Diffenbaugh et al., 2015]. As a result, ecological constraints differ greatly, with temperature often limiting productivity at higher elevations and moisture availability serving as the primary constraint in arid lowland areas [Mukhtar et al., 2025].

Remote sensing via satellites offers a powerful means of large-scale vegetation monitoring, using indices such as the normalized difference vegetation index (NDVI) and red reflectance. NDVI serves as a reliable proxy for vegetation health and productivity, while red reflectance captures physiological and structural changes in plant canopies. To better understand these ecosystem dynamics and stressors, it is crucial to model these indicators jointly, as this allows us to account for their correlations and shared environmental influences. To this end, we apply our proposed DNC approach to the joint modeling of log(NDVI+1) and red reflectance, allowing for space-varying correlations between these variables. Our analysis utilizes MODIS Enhanced Vegetation Index data from the western United States, focusing on sinusoidal grid tile h08v05, which spans approximately 30°N–40°N latitude and 104°W–130°W longitude, covering around 1,020,000 observed locations to demonstrate the method. We divide the data into training (60%), validation (20%), and testing (20%) sets to evaluate out-of-sample predictive performance. Due to the large data volume, methods such as spMvLM and spLM are computationally infeasible.
Therefore, we limit direct comparisons to the BLMC approach, even though it is computationally expensive, acknowledging that it was the closest competitor in the simulation studies.

Table 3: Predictive performance and computation time (in minutes) for DNC and BLMC on log(NDVI+1) and red reflectance. For the 204,000 out-of-sample observations, we report the root mean squared prediction error (RMSPE), the coverage, and the length of the 95% predictive intervals. Results indicate that DNC offers slightly improved point prediction and produces much narrower predictive intervals, while maintaining similar coverage to BLMC. Additionally, DNC achieves a roughly 500-fold reduction in computation time relative to BLMC.

                    log(NDVI+1)              Red reflectance
Competitors   RMSPE   CVG    LEN      RMSPE   CVG    LEN      Time (min)
DNC           0.02    0.95   0.08     0.01    0.95   0.05     5.67
BLMC          0.03    0.95   0.19     0.02    0.94   0.10     2498.18

Table 3 demonstrates that both DNC and BLMC provide closely similar predictive point estimates for each outcome, with DNC offering a slight advantage in point prediction. Both methods yield nominal or near-nominal coverage across outcomes; however, DNC achieves a two-fold reduction in the length of the 95% predictive intervals. Most notably, DNC stands out for its computational efficiency, exhibiting about a 500-fold decrease in computation time compared to BLMC. Figures 7 and 8 visualize the model reconstructions, illustrating that DNC delivers accurate recovery of both response surfaces, with the 95% predictive intervals closely enveloping the true observations. The tight predictive surfaces indicate both robustness and reliability of prediction.

Figure 7: Spatial patterns of the first outcome, log(NDVI+1), on the test set. Panels display (a) the observed log(NDVI+1), (b) the predicted log(NDVI+1), (c) the 95% predictive interval lower bound, and (d) the 95% predictive interval upper bound obtained from the DNC model. The color scale represents the magnitude of log(NDVI+1), with darker shades indicating lower values. The figure shows accurate prediction, with the 95% predictive intervals tightly around the truth. Here the x and y axes represent longitude and latitude, both normalized to the [0,1] interval.

Figure 8: Spatial patterns of the second outcome, red reflectance, on the test set. Panels display (a) the observed red reflectance, (b) the predicted red reflectance, (c) the 95% predictive interval lower bound, and (d) the 95% predictive interval upper bound obtained from the DNC model. The color scale represents the magnitude of red reflectance, with darker shades indicating lower values. The figure shows accurate prediction, with the 95% predictive intervals tightly around the truth. Here the x and y axes represent longitude and latitude, both normalized to the [0,1] interval.
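The location-wise cross-correlation $\hat{\rho}_{12}(s)$ shown in Figure 9 (discussed below) can be obtained from posterior predictive draws along the following lines; this is a hedged sketch with hypothetical array shapes, not the code used in the study.

```python
# Sketch: location-wise cross-correlation between two outcomes from
# posterior predictive draws (illustrative shapes and names).
import numpy as np

def crosscorr(draws1, draws2):
    """Pearson correlation across draws, separately at each location.

    draws1, draws2: arrays of shape (M, n) holding M posterior
    predictive draws at n spatial locations; returns an (n,) array.
    """
    d1 = draws1 - draws1.mean(axis=0)
    d2 = draws2 - draws2.mean(axis=0)
    num = (d1 * d2).mean(axis=0)
    den = d1.std(axis=0) * d2.std(axis=0)
    return num / den
```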
Figure 9 displays the predicted spatial cross-correlation, $\hat{\rho}_{12}(s)$, for the test set. The color scale reflects the magnitude and direction of the estimated local spatial dependence between the two outcomes. The two outcomes show weak dependence over most of the domain, explaining the modest difference in performance between DNC and BLMC. However, in certain subdomains, noticeable shifts in cross-covariance emerge: for instance, areas shaded in green and purple indicate zones of pronounced negative association. These spatial variations in dependence underscore the capacity of the proposed DNC model to detect and characterize nonstationary cross-dependence, revealing spatially heterogeneous interactions between the response surfaces that would otherwise be obscured under stationary multivariate Gaussian process models on spatial factors with spatially-invariant loading matrices.

Figure 9: Predicted spatial cross-correlation patterns between the two outcomes (log(NDVI+1) and red reflectance) on the test set. Each point represents a spatial location, colored by the cross-correlation estimated from the posterior predictive samples of the DNC model. The color gradient reveals spatial heterogeneity in the dependence structure between the two outcomes. Here the x and y axes represent longitude and latitude, both scaled to the [0,1] interval.

6 Conclusion and Future Work

With the rapid growth of multivariate spatial datasets across diverse scientific domains, there is a pressing need for flexible statistical models that can fully account for spatial variation specific to each outcome, while also capturing spatially-varying cross-covariance structures among outcomes. Classical and contemporary spatial factor models, especially those modeling latent factors and their loadings using Gaussian processes or related low-rank approximations, encounter severe computational bottlenecks as the number of spatial locations scales into the thousands. As a result, many scalable approaches to multivariate spatial modeling make the simplifying assumption of spatially-invariant cross-covariance structures, limiting their ability to represent real-world complexity.

To overcome these challenges, this paper introduces the deep neural coregionalization (DNC) framework. This approach employs deep Gaussian processes (DGPs) to flexibly model both the spatial factors and their loadings, specifying a structured cross-covariance. A novel variational approximation is developed for the DGP, which is then shown to be equivalent to fitting deep neural networks (DNNs) to the spatial factors and factor loadings. Framing the model as a DGP enables the capture of complex spatial nonstationarity in factors and factor loadings, while the DNN perspective provides scalable, MCMC-free posterior inference, making the method feasible for extremely large spatial datasets. Empirical results using both simulated and real-world remote sensing data demonstrate that DNC achieves superior point prediction, accurate estimation of spatially-varying cross-correlation between outcomes, reliable uncertainty quantification, and many-fold computational gains over existing approaches.

Several promising directions exist for future research. One natural extension is to adapt the framework for spatio-temporal data, which would facilitate the analysis of dynamic multivariate processes in areas such as climate modeling, ecology, and time-series remote sensing.
Incorporating domain-specific constraints or accommodating non-Gaussian data distributions could further enhance both the interpretability and utility of the approach. Additionally, applying DNC to domains such as epidemiology or biomedical imaging would be valuable for assessing its generalizability and inspiring further methodological advancements.

7 Acknowledgment

Dr. Rajarshi Guhaniyogi is supported by National Science Foundation grant DMS-2210672 and National Institutes of Health grants R01NS131604 and R01GM163238.

References

John T. Abatzoglou and A. Park Williams. Impact of anthropogenic climate change on wildfire across western US forests. Proceedings of the National Academy of Sciences, 113(42):11770–11775, 2016. doi: 10.1073/pnas.1607171113.

Tatiana V. Apanasovich and Marc G. Genton. Cross-covariance functions for multivariate random fields based on latent dimensions. Biometrika, 97(1):15–30, 2010. doi: 10.1093/biomet/asp078.

Raman Arora, Peter Bartlett, Poorya Mianjy, and Nathan Srebro. Dropout: Explicit forms and capacity control. In International Conference on Machine Learning, pages 351–361. PMLR, 2021.

Luke Bornn, Gavin Shaddick, and James V. Zidek. Modeling nonstationary processes through dimension expansion. Journal of the American Statistical Association, 107(497):281–289, 2012.

Catherine A. Calder. A dynamic process convolution approach to modeling ambient particulate matter concentrations. Environmetrics, 19(1):39–48, 2008.

Jean-Paul Chilès and Pierre Delfiner. Geostatistics: Modeling Spatial Uncertainty. John Wiley & Sons, 2012.

Brandon Coombes, Saonli Basu, Sharmistha Guha, and Nicholas Schork. Weighted score tests implementing model-averaging schemes in detection of rare variants in case-control studies. PLoS ONE, 10(10):e0139355, 2015.

Noel Cressie and Christopher K. Wikle. Statistics for Spatio-Temporal Data. John Wiley & Sons, 2011.

Andreas Damianou and Neil D. Lawrence. Deep Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pages 207–215. PMLR, 2013.

Jeffrey M. Diez and H. Ronald Pulliam. Hierarchical analysis of species distributions and abundance across environmental gradients. Ecology, 88(12):3144–3152, 2007. doi: 10.1890/07-0048.1.

Noah S. Diffenbaugh, Daniel L. Swain, and Danielle Touma. Anthropogenic warming has increased drought risk in California. Proceedings of the National Academy of Sciences, 112(13):3931–3936, 2015. doi: 10.1073/pnas.1422385112.

Matthew M. Dunlop, Mark A. Girolami, Andrew M. Stuart, and Aretha L. Teckentrup. How deep are deep Gaussian processes? Journal of Machine Learning Research, 19(54):1–46, 2018.

Andrew O. Finley, Sudipto Banerjee, and Bradley P. Carlin. spBayes: An R package for univariate and multivariate hierarchical point-referenced spatial models. Journal of Statistical Software, 19:1–24, 2007.

Dani Gamerman, Hedibert Freitas Lopes, and Esther Salazar. Spatial dynamic factor analysis. Bayesian Analysis, 3(4):759–792, 2008. doi: 10.1214/08-BA329.

Gregory Gaspari and Stephen E. Cohn. Construction of correlation functions in two and three dimensions. Quarterly Journal of the Royal Meteorological Society, 125(554):723–757, 1999. doi: 10.1002/qj.49712555417.
Alan E. Gelfand, Alexandra M. Schmidt, Sudipto Banerjee, and C. F. Sirmans. Nonstationary multivariate process modeling through spatially varying coregionalization. Test, 13(2):263–312, 2004.

Marc G. Genton and William Kleiber. Cross-covariance functions for multivariate geostatistics. Statistical Science, 30(2):147–163, 2015. doi: 10.1214/14-STS487.

Tilmann Gneiting, William Kleiber, and Martin Schlather. Matérn cross-covariance functions for multivariate random fields. Journal of the American Statistical Association, 105(491):1167–1177, 2010.

Sharmistha Guha, Jose Rodriguez-Acosta, and Ivo D. Dinov. A Bayesian multiplex graph classifier of functional brain connectivity across diverse tasks of cognitive control. Neuroinformatics, 22(4):457–472, 2024.

Rajarshi Guhaniyogi. Multivariate bias adjusted tapered predictive process models. Spatial Statistics, 21:42–65, 2017.

Rajarshi Guhaniyogi and Bruno Sanso. Large multi-scale spatial modeling using tree shrinkage priors. Statistica Sinica, 30(4):2023–2050, 2020.

Rajarshi Guhaniyogi, Andrew O. Finley, Sudipto Banerjee, and Richard K. Kobe. Modeling complex spatial dependencies: Low-rank spatially varying cross-covariances with application to soil nutrient data. Journal of Agricultural, Biological, and Environmental Statistics, 18(3):274–298, 2013a.

Rajarshi Guhaniyogi, Andrew O. Finley, Sudipto Banerjee, and Richard K. Kobe. Modeling complex spatial dependencies: Low-rank spatially varying cross-covariances with application to soil nutrient data. Journal of Agricultural, Biological, and Environmental Statistics, 18(3):274–298, 2013b.

Rajarshi Guhaniyogi, Cheng Li, Terrance D. Savitsky, and Sanvesh Srivastava. Distributed Bayesian varying coefficient modeling using a Gaussian process prior. Journal of Machine Learning Research, 23(84):1–59, 2022.

Yeseul Jeon, Rajarshi Guhaniyogi, and Aaron Scheffler. Deep generative modeling with spatial and network images: An explainable AI (XAI) approach. arXiv preprint arXiv:2505.12743, 2025a.

Yeseul Jeon, Rajarshi Guhaniyogi, Aaron Scheffler, Devin Francom, and Donatella Pasqualini. Interpretable deep neural network for modeling functional surrogates. arXiv preprint arXiv:2503.20528, 2025b.

Charlie Kirkwood, Theo Economou, Nicolas Pugeault, and Henry Odbert. Bayesian deep learning for spatial interpolation in the presence of auxiliary information. Mathematical Geosciences, 54(3):507–531, 2022.

William Kleiber and Douglas Nychka. Nonstationary modeling for multivariate spatial processes. Journal of Multivariate Analysis, 112:76–91, 2012.

Anindya Majumdar and Alan E. Gelfand. Multivariate spatial modeling for geostatistical data using convolution. Journal of the American Statistical Association, 102(477):110–124, 2007. doi: 10.1198/016214506000001254.

Jorge Mateu and Abdollah Jalilian. Spatial point processes and neural networks: A convenient couple. Spatial Statistics, 50:100644, 2022.

Hamza Mukhtar, Yujia Yang, Mengjiao Xu, Jiujiang Wu, Sawaid Abbas, Da Wei, and Wei Zhao. Elevation-dependent vegetation greening and its responses to climate changes in the south slope of the Himalayas. Geophysical Research Letters, 52(4):e2024GL113276, 2025.

Otso Ovaskainen, Jenni Hottola, and Juha Siitonen.
Modeling species co-occurrence by multivariate logistic regression generates new hypotheses on fungal interactions. Ecology, 91(9):2514–2521, 2010.

John Paisley, David Blei, and Michael Jordan. Variational Bayesian inference with stochastic search, 2012.

Nathalie Pettorelli, Jon Olav Vik, Atle Mysterud, Jean-Michel Gaillard, Compton J. Tucker, and Nils C. Stenseth. Using the satellite-derived NDVI to assess ecological responses to environmental change. Trends in Ecology & Evolution, 20(9):503–510, 2005.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286. PMLR, 2014.

Steven W. Running, Ramakrishna R. Nemani, et al. A continuous satellite-derived measure of global terrestrial primary production. Bioscience, 54(6):547–560, 2004.

Mary Lai O. Salvaña, Amanda Lenzi, and Marc G. Genton. Spatio-temporal cross-covariance functions under the Lagrangian framework with multiple advections. Journal of the American Statistical Association, 118(544):2746–2761, 2023.

Annie Sauer, Robert B. Gramacy, and David Higdon. Active learning for deep Gaussian process surrogates. Technometrics, 65(1):4–18, 2023.

Alexandra M. Schmidt and Alan E. Gelfand. A Bayesian coregionalization approach for multivariate pollutant data. Journal of Geophysical Research: Atmospheres, 108(D24):8783, 2003. doi: 10.1029/2002JD002844.

Michalis Titsias and Neil D. Lawrence. Bayesian Gaussian process latent variable model. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 844–851. JMLR Workshop and Conference Proceedings, 2010.

Hans Wackernagel. Multivariate Geostatistics: An Introduction with Applications. Springer Science & Business Media, New York, NY, USA, 2003. ISBN 9783540441427.

J. Hardin Waddle, Robert M. Dorazio, Susan C. Walls, Ken G. Rice, Jay Beauchamp, Michael J. Schuman, and Frank J. Mazzotti. A new parameterization for estimating co-occurrence of interacting species. Ecological Applications, 20(5):1467–1475, 2010. doi: 10.1890/09-0394.1.

Anthony L. Westerling, Hugo G. Hidalgo, Daniel R. Cayan, and Thomas W. Swetnam. Warming and earlier spring increase western US forest wildfire activity. Science, 313(5789):940–943, 2006. doi: 10.1126/science.1128834.

Christopher K. Wikle and Andrew Zammit-Mangion. Statistical deep learning for spatial and spatiotemporal data. Annual Review of Statistics and Its Application, 10(1):247–270, 2023.

A. Park Williams, John T. Abatzoglou, Alexander Gershunov, Jorge Guzman-Morales, Daniel A. Bishop, Jennifer K. Balch, and Dennis P. Lettenmaier. Observed impacts of anthropogenic climate change on wildfire in California. Earth's Future, 7(8):892–910, 2019. doi: 10.1029/2019EF001210.

Andrew Zammit-Mangion, Tin Lok James Ng, Quan Vu, and Maurizio Filippone. Deep compositional spatial models. Journal of the American Statistical Association, 117(540):1787–1808, 2022.

Lu Zhang and Sudipto Banerjee. Spatial factor modeling: A Bayesian matrix-normal approach for misaligned data. Biometrics, 78(2):560–573, 2022a.

Lu Zhang and Sudipto Banerjee. Spatial factor modeling: A Bayesian matrix-normal approach for misaligned data. Biometrics, 78(2):560–573, 2022b.
Supplementary File: Uncertainty-Aware Neural Multivariate Geostatistics

Abstract

This file contains proofs of Lemmas 3.1 and 3.2 of the main article.

1 Proof of Lemma 3.1

Let

$$\boldsymbol{\mu}^{(w,h)}_{l,j} = \big(\boldsymbol{\mu}^{(w,h)\top}_{l,j,1}, \ldots, \boldsymbol{\mu}^{(w,h)\top}_{l,j,K^{(h)}_l}\big)^\top, \qquad \boldsymbol{\mu}^{(b,h)}_{l,j} = \big(\mu^{(b,h)}_{l,j,1}, \ldots, \mu^{(b,h)}_{l,j,K^{(h)}_l}\big)^\top,$$
$$\boldsymbol{\mu}^{(w,\psi)}_{l,o} = \big(\boldsymbol{\mu}^{(w,\psi)\top}_{l,o,1}, \ldots, \boldsymbol{\mu}^{(w,\psi)\top}_{l,o,K^{(\psi)}_l}\big)^\top, \qquad \boldsymbol{\mu}^{(b,\psi)}_{l,o} = \big(\mu^{(b,\psi)}_{l,o,1}, \ldots, \mu^{(b,\psi)}_{l,o,K^{(\psi)}_l}\big)^\top.$$

Using Proposition 1 of Gal and Ghahramani [2016], assuming that the numbers of latent nodes defining the shared latent factors, $K^{(h)}_l$, and the outcome-specific loading functions, $K^{(\psi)}_l$, are both large and that the variance $\delta^2$ in the variational distribution is small, we obtain

$$\mathrm{KL}\big(q(\boldsymbol{\theta}^{(\psi)}\,|\,\boldsymbol{\eta}^{(\psi)})\,\big\|\,p(\boldsymbol{\theta}^{(\psi)})\big) \approx \sum_{o=1}^{O}\sum_{l=1}^{L_\psi} \frac{p^{(\psi)}_{l,o}}{2}\,\|\boldsymbol{\mu}^{(w,\psi)}_{l,o}\|^2 + \sum_{o=1}^{O}\sum_{l=1}^{L_\psi} \frac{p^{(\psi)}_{l,o}}{2}\,\|\boldsymbol{\mu}^{(b,\psi)}_{l,o}\|^2,$$
$$\mathrm{KL}\big(q(\boldsymbol{\theta}^{(h)}\,|\,\boldsymbol{\eta}^{(h)})\,\big\|\,p(\boldsymbol{\theta}^{(h)})\big) \approx \sum_{j=1}^{J}\sum_{l=1}^{L_h} \frac{p^{(h)}_{l,j}}{2}\,\|\boldsymbol{\mu}^{(w,h)}_{l,j}\|^2 + \sum_{j=1}^{J}\sum_{l=1}^{L_h} \frac{p^{(h)}_{l,j}}{2}\,\|\boldsymbol{\mu}^{(b,h)}_{l,j}\|^2,$$

up to a constant depending on the numbers of latent nodes $K^{(h)}_l$ ($l = 1, \ldots, L_h$) and $K^{(\psi)}_l$ ($l = 1, \ldots, L_\psi$), as well as on the variational variance parameter $\delta^2$. This implies that

$$\mathrm{KL}\big(q(\boldsymbol{\theta}\,|\,\boldsymbol{\eta})\,\big\|\,p(\boldsymbol{\theta})\big) = \mathrm{KL}\big(q(\boldsymbol{\theta}^{(\psi)}\,|\,\boldsymbol{\eta}^{(\psi)})\,\big\|\,p(\boldsymbol{\theta}^{(\psi)})\big) + \mathrm{KL}\big(q(\boldsymbol{\theta}^{(h)}\,|\,\boldsymbol{\eta}^{(h)})\,\big\|\,p(\boldsymbol{\theta}^{(h)})\big)$$

can be approximated as

$$\mathrm{KL}\big(q(\boldsymbol{\theta}\,|\,\boldsymbol{\eta})\,\big\|\,p(\boldsymbol{\theta})\big) \approx \sum_{j=1}^{J}\sum_{l=1}^{L_h} \frac{p^{(h)}_{l,j}}{2}\,\|\boldsymbol{\mu}^{(w,h)}_{l,j}\|^2 + \sum_{j=1}^{J}\sum_{l=1}^{L_h} \frac{p^{(h)}_{l,j}}{2}\,\|\boldsymbol{\mu}^{(b,h)}_{l,j}\|^2 + \sum_{o=1}^{O}\sum_{l=1}^{L_\psi} \frac{p^{(\psi)}_{l,o}}{2}\,\|\boldsymbol{\mu}^{(w,\psi)}_{l,o}\|^2 + \sum_{o=1}^{O}\sum_{l=1}^{L_\psi} \frac{p^{(\psi)}_{l,o}}{2}\,\|\boldsymbol{\mu}^{(b,\psi)}_{l,o}\|^2,$$

up to a constant depending on the numbers of latent nodes $K^{(h)}_l$ ($l = 1, \ldots, L_h$) and $K^{(\psi)}_l$ ($l = 1, \ldots, L_\psi$), as well as on the variational variance parameter $\delta^2$. Plugging this approximation of $\mathrm{KL}\big(q(\boldsymbol{\theta}\,|\,\boldsymbol{\eta})\,\big\|\,p(\boldsymbol{\theta})\big)$ into Equation (11) of the main article yields

$$\mathcal{L}_{\text{GP-MC}}(\boldsymbol{\beta}, \sigma^2, \boldsymbol{\eta}) \approx \frac{1}{M}\sum_{m=1}^{M}\sum_{i=1}^{n} \log p\big(\mathbf{y}(s_i)\,|\,\mathbf{X}(s_i), \boldsymbol{\beta}, \boldsymbol{\Psi}(s_i), \mathbf{h}(s_i), \sigma^2, \boldsymbol{\theta}^{(m)}\big) - \sum_{j=1}^{J}\sum_{l=1}^{L_h} \frac{p^{(h)}_{l,j}}{2}\,\|\boldsymbol{\mu}^{(w,h)}_{l,j}\|^2 - \sum_{j=1}^{J}\sum_{l=1}^{L_h} \frac{p^{(h)}_{l,j}}{2}\,\|\boldsymbol{\mu}^{(b,h)}_{l,j}\|^2 - \sum_{o=1}^{O}\sum_{l=1}^{L_\psi} \frac{p^{(\psi)}_{l,o}}{2}\,\|\boldsymbol{\mu}^{(w,\psi)}_{l,o}\|^2 - \sum_{o=1}^{O}\sum_{l=1}^{L_\psi} \frac{p^{(\psi)}_{l,o}}{2}\,\|\boldsymbol{\mu}^{(b,\psi)}_{l,o}\|^2. \tag{1}$$

2 Proof of Lemma 3.2

Without loss of generality, we prove the lemma assuming $\boldsymbol{\beta} = \mathbf{0}$, purely to reduce notational complications. For the $l$th layer ($l = 1, \ldots, L_h - 1$), outcome $j = 1, \ldots, J$, and iteration $m = 1, \ldots, M$, let $\mathbf{z}^{(m,h)}_{l,j}$ denote the dropout vector of dimension $K^{(h)}_l$ corresponding to the modeling of the function $h_j(\cdot)$. We define a matrix $\mathbf{Z}^{(m,w,h)}_{l,j} = \mathbf{z}^{(m,h)}_{l,j}\,\mathbf{1}^{\top}_{K^{(h)}_{l-1}}$, where $\mathbf{1}_{K^{(h)}_{l-1}}$ is a vector of ones. Analogous constructions hold for the modeling of the $\psi^{(u)}_o(\cdot)$ functions, with $\mathbf{Z}^{(m,w,\psi)}_{l,o} = \mathbf{z}^{(m,\psi)}_{l,o}\,\mathbf{1}^{\top}_{K^{(\psi)}_{l-1}}$. Collecting the dropout vectors across all layers and outcomes for $h_j(\cdot)$, let

$$\mathbf{z}^{(m,h)} = \Big(\big(\mathrm{vec}(\mathbf{Z}^{(m,w,h)}_{l,j})^\top, (\mathbf{z}^{(m,h)}_{l,j})^\top\big)^\top : l = 1, \ldots, L_h - 1;\; j = 1, \ldots, J\Big),$$

and similarly for the $\psi^{(u)}_o(\cdot)$ functions,

$$\mathbf{z}^{(m,\psi)} = \Big(\big(\mathrm{vec}(\mathbf{Z}^{(m,w,\psi)}_{l,o})^\top, (\mathbf{z}^{(m,\psi)}_{l,o})^\top\big)^\top : l = 1, \ldots, L_\psi - 1;\; o = 1, \ldots, O\Big).$$

Define the full dropout collection for iteration $m$ as $\mathbf{z}^{(m)} = \big((\mathbf{z}^{(m,h)})^\top, (\mathbf{z}^{(m,\psi)})^\top\big)^\top$.
Let $\boldsymbol{\epsilon}^{(m)}$ be a vector of Gaussian perturbations, independent of the dropout, with $\boldsymbol{\epsilon}^{(m)} \sim N(\mathbf{0}, \delta^2 \mathbf{I})$ and of the same dimension as the parameter vector $\boldsymbol{\theta}$. A sample from the variational distribution (Equation (9) of the main article) for iteration $m$ is generated as

$$\boldsymbol{\theta}^{(m)} = \boldsymbol{\eta} \odot \mathbf{z}^{(m)} + \boldsymbol{\epsilon}^{(m)},$$

where $\boldsymbol{\eta}$ denotes the variational mean vector and $\odot$ denotes the element-wise product. Hence, the variational samples $\boldsymbol{\theta}^{(m)}$ can be seen as generated from i.i.d. draws $\big((\mathbf{z}^{(m)})^\top, (\boldsymbol{\epsilon}^{(m)})^\top\big)^\top$ from their joint distribution.

The variational ELBO can be written as

$$\mathcal{L}_{\text{GP-VI}}(\boldsymbol{\beta}, \boldsymbol{\eta}, \sigma^2) = \mathbb{E}_{q(\boldsymbol{\theta}|\boldsymbol{\eta})}[\log p(\boldsymbol{\theta}, \mathcal{D})] - \mathbb{E}_{q(\boldsymbol{\theta}|\boldsymbol{\eta})}[\log q(\boldsymbol{\theta}|\boldsymbol{\eta})].$$

This can be rewritten explicitly in terms of the auxiliary variables:

$$\begin{aligned}
\mathcal{L}_{\text{GP-VI}}(\boldsymbol{\beta}, \boldsymbol{\eta}, \sigma^2) &= \mathbb{E}_{q(\boldsymbol{\theta}|\boldsymbol{\eta})}[\log p(\mathcal{D}\,|\,\boldsymbol{\theta})] - \mathrm{KL}\big(q(\boldsymbol{\theta}\,|\,\boldsymbol{\eta})\,\big\|\,p(\boldsymbol{\theta})\big) \\
&= \mathbb{E}_{\mathbf{z},\boldsymbol{\epsilon}}[\log p(\mathcal{D}\,|\,\boldsymbol{\eta}, \mathbf{z}, \boldsymbol{\epsilon})] - \mathrm{KL}\big(q(\boldsymbol{\theta}\,|\,\boldsymbol{\eta})\,\big\|\,p(\boldsymbol{\theta})\big) \\
&= \int \log p(\mathcal{D}\,|\,\boldsymbol{\eta}, \mathbf{z}, \boldsymbol{\epsilon})\, dP_{\mathbf{z},\boldsymbol{\epsilon}}(\mathbf{z}, \boldsymbol{\epsilon}) - \mathrm{KL}\big(q(\boldsymbol{\theta}\,|\,\boldsymbol{\eta})\,\big\|\,p(\boldsymbol{\theta})\big).
\end{aligned} \tag{2}$$

Monte Carlo Approximation. To numerically approximate the expectation $\int \log p(\mathcal{D}\,|\,\boldsymbol{\eta}, \mathbf{z}, \boldsymbol{\epsilon})\, dP_{\mathbf{z},\boldsymbol{\epsilon}}$, we draw $M$ i.i.d. samples $(\mathbf{z}^{(m)}, \boldsymbol{\epsilon}^{(m)})$ and estimate

$$\int \log p(\mathcal{D}\,|\,\boldsymbol{\eta}, \mathbf{z}, \boldsymbol{\epsilon})\, dP_{\mathbf{z},\boldsymbol{\epsilon}}(\mathbf{z}, \boldsymbol{\epsilon}) \approx \frac{1}{M}\sum_{m=1}^{M} \log p(\mathcal{D}\,|\,\boldsymbol{\theta}^{(m)}), \qquad \boldsymbol{\theta}^{(m)} = \boldsymbol{\eta} \odot \mathbf{z}^{(m)} + \boldsymbol{\epsilon}^{(m)}.$$

When the Gaussian perturbation variance is small ($\delta^2 \approx 0$), we have $\boldsymbol{\theta}^{(m)} \approx \boldsymbol{\eta} \odot \mathbf{z}^{(m)}$ for practical purposes. As shown in Sections 3 and 4 of the supplementary material of Gal and Ghahramani [2016], for a Gaussian observation likelihood the log-likelihood can be written as

$$\log p(\mathcal{D}\,|\,\boldsymbol{\theta}^{(m)}) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n} \big\|\mathbf{y}(s_i) - \hat{\mathbf{y}}^{(m)}(s_i)\big\|_2^2 + C_1,$$

where $C_1$ is a constant independent of $\boldsymbol{\eta}$, and $\hat{\mathbf{y}}^{(m)}(s_i) = \boldsymbol{\Psi}(s_i)^{(m)} \mathbf{h}(s_i)^{(m)}$, with all entries of $\boldsymbol{\Psi}(s_i)$ and $\mathbf{h}(s_i)$ defined as in Equations (13) and (14) of the main article, with $\boldsymbol{\theta}$ replaced by $\boldsymbol{\theta}^{(m)}$.

ELBO Expression. Combining these results and applying Lemma 3.1, we obtain the following Monte Carlo variational objective:

$$\mathcal{L}_{\text{GP-MC}}(\boldsymbol{\beta}, \boldsymbol{\eta}, \sigma^2) \approx -\frac{1}{2M\sigma^2}\sum_{m=1}^{M}\sum_{i=1}^{n} \big\|\mathbf{y}(s_i) - \hat{\mathbf{y}}^{(m)}(s_i)\big\|_2^2 - \sum_{j=1}^{J}\sum_{l=1}^{L_h} \frac{p^{(h)}_{l,j}}{2}\,\|\boldsymbol{\mu}^{(w,h)}_{l,j}\|^2 - \sum_{j=1}^{J}\sum_{l=1}^{L_h} \frac{p^{(h)}_{l,j}}{2}\,\|\boldsymbol{\mu}^{(b,h)}_{l,j}\|^2 - \sum_{o=1}^{O}\sum_{l=1}^{L_\psi} \frac{p^{(\psi)}_{l,o}}{2}\,\|\boldsymbol{\mu}^{(w,\psi)}_{l,o}\|^2 - \sum_{o=1}^{O}\sum_{l=1}^{L_\psi} \frac{p^{(\psi)}_{l,o}}{2}\,\|\boldsymbol{\mu}^{(b,\psi)}_{l,o}\|^2, \tag{3}$$

up to a constant independent of $\boldsymbol{\eta}$. Setting $\lambda^{(w,h)}_{l,j} = \lambda^{(b,h)}_{l,j} = p^{(h)}_{l,j}/2$ and $\lambda^{(w,\psi)}_{l,o} = \lambda^{(b,\psi)}_{l,o} = p^{(\psi)}_{l,o}/2$ proves the equivalence between objective function (15) as a function of $\boldsymbol{\theta}$ and objective function (12) as a function of $\boldsymbol{\eta}$. Consequently, Algorithm 1, performing stochastic gradient descent on Equation (15), yields the estimated network parameters $\hat{\boldsymbol{\theta}}$, which can be identified with the estimated variational mean parameters $\hat{\boldsymbol{\eta}}$.

References

Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059. PMLR, 2016.
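To make the sampling scheme used in the proof of Lemma 3.2 concrete, the following is an illustrative numerical sketch (ours, not part of the article) for a single flattened weight vector with a common retention probability $p$: it draws $\boldsymbol{\theta}^{(m)} = \boldsymbol{\eta} \odot \mathbf{z}^{(m)} + \boldsymbol{\epsilon}^{(m)}$ and evaluates a penalized Monte Carlo objective of the form of Equation (3). The toy log-likelihood, dimension, and parameter values are assumptions.

```python
# Illustrative sketch of the Lemma 3.2 sampling scheme (not the article's
# code): theta^(m) = eta * z^(m) + eps^(m), plus the weight-decay penalty
# (p/2) * ||eta||^2 from Lemma 3.1, for one generic flattened weight vector.
import numpy as np

rng = np.random.default_rng(2)

p, delta2, M = 0.9, 1e-6, 50          # retention prob., perturbation var., MC draws
eta = rng.standard_normal(128)        # variational mean vector (assumed size)

def sample_theta():
    """One variational draw theta^(m) = eta * z^(m) + eps^(m)."""
    z = rng.binomial(1, p, size=eta.shape)                 # Bernoulli dropout mask
    eps = rng.normal(scale=np.sqrt(delta2), size=eta.shape)
    return eta * z + eps

def mc_objective(log_lik):
    """Monte Carlo objective: averaged log-likelihood over dropout draws
    minus the (p/2) * ||eta||^2 penalty, mirroring Equation (3)."""
    ll = np.mean([log_lik(sample_theta()) for _ in range(M)])
    return ll - 0.5 * p * np.sum(eta**2)

# Toy Gaussian log-likelihood standing in for log p(D | theta^(m))
toy = lambda th: -0.5 * np.sum((th - 1.0) ** 2)
print(mc_objective(toy))
```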