Generative Bayesian Computation as a Scalable Alternative to Gaussian Process Surrogates

Nick Polson* and Vadim Sokolov†

February 2026

Abstract

Gaussian process (GP) surrogates are the default tool for emulating expensive computer experiments, but cubic cost, stationarity assumptions, and Gaussian predictive distributions limit their reach. We propose Generative Bayesian Computation (GBC) via Implicit Quantile Networks (IQNs) as a surrogate framework that targets all three limitations. GBC learns the full conditional quantile function from input–output pairs; at test time, a single forward pass per quantile level produces draws from the predictive distribution. Across fourteen benchmarks we compare GBC to four GP-based methods. GBC improves CRPS by 11–26% on piecewise jump-process benchmarks, by 14% on a ten-dimensional Friedman function, and scales linearly to 90,000 training points where dense-covariance GPs are infeasible. A boundary-augmented variant matches or outperforms Modular Jump GPs on two-dimensional jump datasets (up to 46% CRPS improvement). In active learning, a randomized-prior IQN ensemble achieves nearly three times lower RMSE than deep GP active learning on Rocket LGBB. Overall, GBC records a favorable point estimate in 12 of 14 comparisons. GPs retain an edge on smooth surfaces where their smoothness prior provides effective regularization.

Keywords: Implicit quantile network, Uncertainty quantification, Computer experiments, Non-stationarity, Active learning.

*Booth School of Business, University of Chicago. ngp@chicagobooth.edu.
†Department of Systems Engineering and Operations Research, George Mason University. vsokolov@gmu.edu.
1 Introduction

A surrogate model, also called an emulator or meta-model, approximates an expensive computer simulation y = f(x), x ∈ R^d, from a modest training campaign of input–output pairs [Sacks et al., 1989, Santner et al., 2018, Gramacy, 2020]. Such emulators are central to calibration [Kennedy and O'Hagan, 2001, Schultz et al., 2022], sensitivity analysis [Oakley and O'Hagan, 2004], Bayesian optimization [Jones et al., 1998, Schultz and Sokolov, 2018a] (for which deep reinforcement learning offers a model-free alternative [Schultz and Sokolov, 2018b]), sequential design [Morris and Mitchell, 1995, Gramacy and Lee, 2009], and active learning [MacKay, 1992, Cohn, 1994]. Large-scale agent-based simulators [Sokolov et al., 2012] exemplify the setting: a single evaluation can take minutes to hours, making GP-based Bayesian optimization or calibration the dominant approach.

Gaussian processes (GPs) have become the default surrogate choice because they furnish analytic predictive means and variances and support coherent Bayesian uncertainty quantification [Rasmussen and Williams, 2006, Gramacy, 2020]. The GP framework continues to be extended in important directions. Sauer et al. [2023] combine deep GPs with elliptical slice sampling and active learning via the ALC criterion to handle non-stationary response surfaces in small-to-medium simulation campaigns, building on the deep GP formulation of Damianou and Lawrence [2013]; Schultz and Sokolov [2022] extend deep GPs to heteroskedastic, high-dimensional outputs. Holthuijzen et al. [2025] apply bias-corrected GP surrogates with Vecchia approximations [Katzfuss and Guinness, 2021] to large ecological datasets (n ≈ 8.4 × 10^6), demonstrating that elaborate multi-stage pipelines are needed to scale GPs beyond n ≈ 10,000. For response surfaces with sudden discontinuities, Flowers et al.
[2026] propose Modular Jump GPs (MJGP) that EM-cluster the marginal responses to identify regimes, train a classifier to estimate regime probabilities, and fit a single GP on augmented inputs. Park et al. [2026] develop active learning criteria for piecewise Jump GP surrogates, demonstrating that bias near regime boundaries must enter the acquisition criterion. Local GP methods, including the LAGP approach of Gramacy and Apley [2015] and its large-scale extension by Cole et al. [2022], address the cubic cost barrier by fitting separate GP neighborhoods, while Cooper et al. [2026] extend fully Bayesian GP classification to large n via Vecchia approximation combined with ESS.

Each advance addresses a genuine limitation, but each solution is problem-specific: deep GPs require MCMC; Vecchia approximations impose a conditional independence graph; jump modifications need cluster assignment followed by local GP inference; and active learning criteria must be re-derived for each new surrogate class. A practitioner facing a new computer experiment must diagnose which limitation applies and select the corresponding specialist extension, or else accept the limitations of a standard stationary GP.

Quantile neural network surrogates offer a complementary approach [Polson and Sokolov, 2020]; deep learning methods for dimensionality reduction [Polson et al., 2021] further enable scaling to high-dimensional inputs. Quantile regression [Koenker, 2005, Gasthaus et al., 2019] provides a direct route to conditional distributions, and recent work in simulation-based inference (SBI) has shown that neural networks trained on simulated data can perform amortized posterior inference [Cranmer et al., 2020, Papamakarios et al., 2021, Lueckmann et al., 2021].
The Implicit Quantile Network [Dabney et al., 2018] embeds the quantile level as a continuous input, enabling a single network to represent the entire conditional quantile function, a property we exploit for surrogate modeling. GBC was introduced by Polson and Sokolov [2025] for amortized Bayesian inference.

Here we develop GBC with IQNs as a surrogate framework that targets all three GP limitations simultaneously. Training scales as O(N) via stochastic gradient descent; the IQN delivers the full conditional quantile function (not just mean and variance); and the same architecture handles non-stationarity, heteroskedasticity, and jump discontinuities without custom kernel or likelihood specifications. Adaptation requires only loss-weight selection and, for jump boundaries, an optional input augmentation (Section 4.2). Once trained, predictive samples cost O(1) per new input, versus O(n) per GP predictive mean or O(n^2) per predictive variance, in addition to the O(n^3) GP training cost.

We compare GBC to four GP-based methods (hetGP [Binois et al., 2018], a stationary GP, MJGP [Flowers et al., 2026], and DGP+ALC [Sauer et al., 2023]) across fourteen benchmarks spanning d = 1 to d = 10 and n = 133 to n = 90,000. The benchmarks are designed so that each GP method is tested under conditions favorable to it: hetGP on heteroskedastic data, MJGP on jump boundaries, DGP+ALC on active learning. The pattern that emerges is that GBC's advantage grows with problem dimension, training-set size, and boundary sharpness, while GPs retain a clear edge on smooth, moderate-n surfaces where their kernel prior provides effective regularization. Table 7 consolidates the fourteen head-to-head comparisons.

GBC does have clear limitations.
It requires a generative forward model from which joint parameter–data samples (θ, y) can be drawn; when only a small fixed dataset is available, augmentation is needed. On smooth, stationary problems at small sample sizes, a well-tuned GP with analytic posteriors can match or exceed GBC accuracy. On heteroskedastic data, purpose-built models such as hetGP achieve better predictive calibration (coverage) when the Gaussian heteroskedastic assumption holds. The advantage of GBC grows with dimensionality, scale (n), and the sharpness of any regime boundary: precisely the settings where standard GP surrogates struggle most.

Our contributions are as follows. First, we formulate GBC with IQNs as a unified surrogate framework (Definition 2) that addresses the three main limitations of GP surrogates (cubic scaling, stationarity, and Gaussian predictive distributions) within a single architecture. The theoretical foundation rests on the noise outsourcing theorem (Theorem 1), which guarantees that any conditional distribution can be represented as a deterministic function of the conditioning variable and a uniform random variable; we prove that the IQN architecture is a universal approximator of this map (Proposition 3) and establish CRPS convergence rates (Propositions 4–5). Second, we introduce a three-term training loss that combines an L1 location anchor, a quantile-ordering surrogate, and the quantile loss, with weights that can be adapted to the problem structure: a single configuration change (no architectural modification) yields a 28% CRPS improvement on jump boundaries. Third, we develop GBC-Aug, a boundary-augmented variant that couples EM-based regime detection with the IQN backend, matching or surpassing Modular Jump GPs on their own benchmarks while training 26× faster.
Fourth, we propose a randomized-prior IQN ensemble for active learning that provides reliable acquisition signals where standard NN uncertainty methods (bootstrap, SNGP, anchored ensembles, last-layer Laplace) fail. Fifth, we provide a systematic empirical comparison across fourteen benchmarks and four GP-based competitors, with 10–50 Monte Carlo replicates per benchmark, establishing when GBC outperforms GPs and when it does not.

Section 2 reviews GP surrogates and their structural limitations. Section 3 introduces GBC: the noise outsourcing theorem, IQN architecture, and training loss. Section 4 develops boundary augmentation for jump processes and active learning within the GBC framework. Section 5 describes the implementation and presents synthetic benchmarks. Section 6 evaluates GBC on real-data and applied examples. Section 7 discusses the results and open problems.

2 Gaussian Process Surrogates

This section reviews GP surrogates and their limitations, setting up the three challenges (cubic scaling, stationarity, and Gaussian predictive distributions) that GBC addresses.

2.1 GP regression

Consider a black-box simulator y = f(x) + ϵ, x ∈ X ⊆ R^d, ϵ ∼ N(0, σ^2). A GP surrogate places a GP prior on f: for any finite collection X_n = (x_1, ..., x_n)^⊤,

    Y_n | X_n ∼ N_n(μ, Σ(X_n)),    Σ_ij(X_n) = k(x_i, x_j) + g I{i = j},

where k(·, ·) is a covariance (kernel) function, typically a function of the pairwise distance ∥x_i − x_j∥, and g > 0 is a nugget parameter [Rasmussen and Williams, 2006]. A common choice is the squared-exponential kernel k(x_i, x_j) = σ_f^2 exp(−∥x_i − x_j∥^2 / ℓ); the GP baselines in Sections 5–6 use a Matérn-5/2 kernel with separable (per-dimension) length scales, which is less smooth than the SE kernel and standard in computer experiments [Gramacy, 2020].
Posterior prediction at test locations X_* has closed form (kriging equations), with predictive mean μ_Y(X_*) and covariance Σ_Y(X_*); see Gramacy [2020] for a detailed exposition. Inference for (σ_f^2, ℓ, g) requires O(n^3) flops per likelihood evaluation.

2.2 Limitations and extensions

Three structural limitations constrain the applicability of GP surrogates. The most immediate is computational: dense covariance matrices scale as O(n^2) in storage and O(n^3) in Cholesky decomposition, limiting plain GPs to n ≪ 10,000 [Cooper et al., 2026]. Holthuijzen et al. [2025] require Vecchia approximations [Katzfuss and Guinness, 2021] to handle n ≈ 8.4 × 10^6, while local GP methods [Gramacy and Apley, 2015, Cole et al., 2022] mitigate the cost by fitting separate neighborhoods. Sparse variational GPs [Titsias, 2009, Hensman et al., 2013] reduce cost to O(nm^2) via m ≪ n inducing points, at the cost of an approximate posterior. All of these approaches introduce structural assumptions (conditional independence graphs, neighborhood sizes, or inducing-point placement) that add tuning burden.

The second limitation is stationarity. The kernel Σ(x_i, x_j) depends only on ∥x_i − x_j∥, encoding the assumption that correlation structure is spatially uniform. Non-stationary dynamics (regime changes, heteroskedasticity, discontinuities) require bespoke solutions: DGP warping [Sauer et al., 2023, Damianou and Lawrence, 2013], tree-based partitions [Gramacy and Lee, 2008], or explicit jump modeling [Flowers et al., 2026, Park et al., 2026]. Each modification introduces additional inference complexity.

The third limitation is distributional. Standard GP prediction delivers a Gaussian N(μ_Y, Σ_Y). Downstream tasks requiring quantile estimates, tail probabilities, or samples from a non-Gaussian posterior receive only a Gaussian approximation.
Extending GPs to classification requires a latent layer and MCMC or variational integration [Cooper et al., 2026], specifically because the Bernoulli likelihood is not conjugate to the GP prior.

Figure 1 provides a simple illustration. On a 1D function with a jump discontinuity, the stationary GP must choose a single length scale that trades off smoothness within each regime against accuracy near the boundary. The IQN-based GBC, by contrast, adapts its conditional quantile bands locally, widening them at the jump and tightening them away from it. The full piecewise benchmarks in Sections 5.3–6.2 demonstrate this advantage in noisy, higher-dimensional settings.

[Figure 1, two panels: left, "Stationary GP (Matérn 5/2)"; right, "GBC (IQN)". Each panel plots y against x and shows the true f(x), posterior mean, 90% PI, and training data.]

Figure 1: GP vs. GBC on a 1D jump function (n = 100, noiseless). The stationary GP (left) adapts its length scale near the discontinuity but blurs the jump; GBC (right) widens its quantile bands at the boundary. Details on the IQN architecture are in Section 3.

These extensions solve individual limitations but fragment the toolkit. A practitioner using Vecchia for scalability cannot easily combine it with jump detection for discontinuities; a deep GP designed for non-stationarity does not automatically produce calibrated quantile predictions. GBC targets all three limitations within a single architecture, as the next section describes.

3 Generative Bayesian Computation

GBC replaces the GP with a neural network that maps inputs and a uniform random variable to predictive samples. We first present the theoretical foundation (noise outsourcing), then the IQN architecture and training loss, and conclude with approximation guarantees.
3.1 Noise outsourcing

The key insight behind GBC is that any posterior distribution can be represented as a deterministic function of the data and a single uniform random variable, with no likelihood evaluation required. This follows from a classical result in probability theory. Let θ ∈ Θ ⊆ R^k denote unknown parameters (or any quantities of interest), y ∈ Y ⊆ R^p observed data, and τ ∈ [0,1]^k a uniform base variable. The goal is to draw samples from the posterior π(θ | y), whether or not a likelihood p(y | θ) is available.

Theorem 1 (Noise Outsourcing; Kallenberg, 1997, Thm. 5.10). If (X, Z) are random variables in a Borel space X × Z, there exist a random variable U ⊥⊥ X and a measurable function G⋆: [0,1] × X → Z such that

    (X, Z) =_a.s. (X, G⋆(U, X)).

In words: the dependence of Z on X can be captured entirely by a deterministic function that takes X and an independent uniform draw as inputs. The theorem applies to any pair of jointly distributed random variables. In the Bayesian inference setting, take X = y (data) and Z = θ (parameters): the generator G⋆(τ, y) is a measurable transport map from the uniform base to the posterior π(θ | y). When θ is scalar (k = 1), this reduces to the inverse CDF F^{-1}_{θ|y}(τ); for k > 1, G⋆ is a general measurable function, not a component-wise quantile inversion. In the surrogate emulation setting, the primary focus of this paper, take X = x (inputs) and Z = y (scalar response): the generator G⋆(τ, x) = Q_τ(Y | x) is the conditional quantile function of Y given x. In both cases, the IQN learns an approximation G_ϕ ≈ G⋆ from training pairs. Throughout Sections 5–6, x denotes the simulator input and y the response; the (x, y) surrogate notation is used consistently. Further details on the proof and convergence theory appear in Supplement A.
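The scalar case of Theorem 1 (where the generator is the inverse conditional CDF) can be demonstrated in a few lines. The sketch below assumes a toy conditional model Z | X = x ∼ N(x, 1), chosen by us purely for illustration:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
std_normal = NormalDist()

def G_star(tau, x):
    # Scalar case of Theorem 1: for Z | X = x ~ N(x, 1), the generator is the
    # inverse conditional CDF, G_star(tau, x) = x + Phi^{-1}(tau).
    return x + np.array([std_normal.inv_cdf(t) for t in tau])

x = 2.0
tau = rng.uniform(size=100_000)       # the "outsourced" uniform noise U
samples = G_star(tau, x)              # deterministic map of (U, x)

print(samples.mean(), samples.std())  # approx 2.0 and 1.0, matching N(x, 1)
```

The IQN plays the role of G⋆ when no closed-form inverse CDF is available: it learns the map (τ, x) ↦ Q_τ from training pairs instead.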
When the conditioning variable is high-dimensional (y ∈ R^p with large p), a sufficient summary statistic S(y) can replace y as input; Supplement B discusses this dimension reduction.

3.2 Implicit Quantile Network

We follow Dabney et al. [2018] in embedding the quantile level τ via a cosine transform before merging with the covariate representation:

    ϕ(τ) = (cos(jπτ))_{j=0}^{n_h − 1} ∈ R^{n_h},

where n_h = 32 is the embedding dimension. The IQN computes

    G_ϕ(τ, x) = f_out(f_1(f_x(x) ⊙ f_τ(ϕ(τ)))),

where f_x, f_τ, f_1 are fully-connected layers with ReLU activations, ⊙ denotes elementwise product, and f_out outputs two values: a location estimate μ̂ (used only during training as an L1 anchor targeting the conditional median) and a quantile estimate q̂_τ (the predictive output at test time).

The architecture works as follows. The input x passes through a feature network f_x that produces a representation vector; independently, the quantile level τ is embedded via ϕ(τ) and projected through f_τ. The elementwise product f_x(x) ⊙ f_τ(ϕ(τ)) allows the quantile level to modulate which features of x are emphasized, enabling the network to represent different quantile surfaces as smooth functions of τ. Because τ enters as a continuous input rather than a discrete index, the IQN can evaluate any quantile in [0,1] at test time without retraining.

Figure 2 provides a diagram of the IQN architecture. The left branch processes the input x through a feature network f_x; the right branch embeds τ via the cosine transform and projects through f_τ. Their elementwise product merges covariate information with quantile-level modulation, and a final layer produces both a location estimate μ̂ and the quantile prediction q̂_τ.

[Figure 2, diagram: x ∈ R^d → f_x (FC+ReLU); τ ∼ U[0,1] → ϕ(τ) (cosine) → f_τ (FC+ReLU); the branches merge via ⊙ (elementwise), pass through f_1 (FC+ReLU), and output μ̂ and q̂_τ.]

Figure 2: IQN architecture.
The input x and quantile level τ are processed through separate branches (f_x and f_τ ∘ ϕ), merged via elementwise product, and decoded into a location estimate μ̂ and a quantile prediction q̂_τ. The cosine embedding ϕ(τ) enables smooth interpolation over quantile levels.

3.3 Training loss

The IQN is trained with a three-term loss that addresses three potential failure modes: mode collapse, quantile mis-ordering, and miscalibration. Let D_N = {(x^(i), y^(i))}_{i=1}^N be training data from the forward model (in the benchmarks of Sections 5–6, N = n, the same training set used by the GP). For each mini-batch and draw τ ∼ Uniform[0,1], the loss is

    ℓ(τ) = w_1 E|y − μ̂|  +  w_2 |τ − 0.5| E[m_τ]  +  w_3 E[max(τe, (τ − 1)e)],    (1)

with the three terms being the L1 location anchor, the ordering surrogate, and the quantile loss, where e = y − q̂_τ and the ordering surrogate is

    m_τ = max(0, q̂_τ − y)  if τ < 0.5,
    m_τ = max(0, y − q̂_τ)  if τ ≥ 0.5.

Here y is the response (prediction target) and x the input (conditioning variable); for the Bayesian inference instantiation (Section 3, noise outsourcing), replace y → θ and x → y.

The first term (location anchor) prevents mode collapse by encouraging the network's location output to track the conditional median. The second term is a quantile-ordering surrogate: for τ < 0.5, it penalizes when q̂_τ exceeds y; for τ ≥ 0.5, it penalizes when q̂_τ falls below y. This is not a direct non-crossing constraint between quantile levels τ_1 < τ_2 (as in Cannon, 2018), but rather a soft non-crossing penalty: if lower quantiles stay below y and upper quantiles stay above y, they are less likely to cross. This soft penalty works well in practice. The |τ − 0.5| weighting strengthens the penalty at extreme quantile levels.
Quantile crossings are not formally prevented but were not observed to degrade CRPS in any benchmark; a strict non-crossing guarantee would require isotonic post-processing or the constrained architecture of Cannon [2018]. The third term is the standard quantile loss [Koenker, 2005], which is minimized at the true conditional quantile.

The default weights (w_1, w_2, w_3) = (0.3, 0.3, 0.4) work well for smooth and heteroskedastic responses. For jump-process benchmarks with sharp discontinuities, reducing the L1 anchor to (0.1, 0.2, 0.7) ("quantile-dominant") improves CRPS by freeing the network from smoothing across boundaries; these two configurations are fixed a priori (not tuned on test data) and reported in all tables. In practice, a simple diagnostic guides the choice: if the marginal response distribution is bimodal or the domain contains known regime boundaries, quantile-dominant weights are preferred; otherwise the default weights are used. This diagnostic inspects the training responses and is therefore data-dependent, though it uses only the marginal distribution (not test data). Table 5 below shows that default weights on the jump benchmarks (Phantom, Star) yield CRPS 3–4× worse than quantile-dominant, confirming that the weight choice materially affects performance on these problems. We train with Adam [Kingma and Ba, 2015] and cosine annealing [Loshchilov and Hutter, 2017].

3.4 GBC algorithm

Algorithm 1 Generative Bayesian Computation (GBC) for Surrogate Modeling
1: Train:
2:   Obtain training pairs (x^(i), y^(i)) from the forward model, i = 1, ..., N.
3:   Train IQN to obtain G_ϕ̂ (the fitted network) by minimizing (1) on {(x^(i), y^(i), τ^(i))} via Adam + cosine annealing.
4: Test: Given new input x_*:
5:   Draw τ^(b) ∼ Uniform[0,1] for b = 1, ..., B.
6:   Return ŷ^(b) = q̂_{τ^(b)} from G_ϕ̂(τ^(b), x_*) as predictive samples (the μ̂ head is not used at test time).

Training GBC costs O(N · C_fwd), where C_fwd is the cost of one network forward pass. For a GP with n training points, hyperparameter estimation requires O(n^3) flops per likelihood evaluation (Cholesky factorization of the n × n covariance matrix). The predictive mean at a new point costs O(n^2) from scratch, or O(n) if the Cholesky factor is cached; the full predictive variance requires an additional O(n^2) solve. On the Friedman d = 10 benchmark at n = 2,000, GP training requires 47.6 s (CPU) versus 17.7 s for GBC (GPU), with GBC also achieving lower RMSE (Table 3); absolute timings reflect hardware differences (Section 5.4 separates asymptotic scaling). Beyond n = 2,000 (the largest size tested for the GP baseline on this hardware), the dense-covariance GP becomes infeasible while GBC scales linearly. Once trained, generating B predictive samples at any new input x_* costs O(B · C_fwd): no matrix factorization, no MCMC chain.

3.5 Unified GBC framework

The following definition unifies the variants used throughout this paper.

Definition 2 (GBC Framework). Given training pairs {(x_i, y_i)}_{i=1}^N from a forward model, GBC proceeds in three steps: (i) compute features z = T(x) via a (possibly identity) transformation T: R^d → R^p; (ii) train an IQN G_ϕ(τ, z) to approximate Q_τ(Y | Z = z) via the three-term loss (1); and (iii) at a new input x_* with z_* = T(x_*), sample τ_1, ..., τ_B iid ∼ U(0,1) and return {G_ϕ(τ_b, z_*)}_{b=1}^B as predictive samples.

The two variants in Section 5 are special cases: GBC (default) sets T(x) = x; GBC-Aug sets T(x) = [x, ĉ_ψ(x)], where ĉ_ψ: X → [0,1] is a learned boundary classifier (Section 4.2).
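A compact PyTorch sketch of the pipeline (cosine embedding, elementwise merge, two-headed output, three-term loss, and Algorithm 1's train/sample phases) may help fix ideas. The cosine embedding, n_h = 32, the loss terms, and the default weights follow the text; the hidden width, epoch count, and learning rate are our illustrative choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

class IQN(nn.Module):
    # Sketch of the IQN of Section 3.2 (hidden width is our choice).
    def __init__(self, d, hidden=64, n_h=32):
        super().__init__()
        self.register_buffer("j", torch.arange(n_h).float())
        self.f_x = nn.Sequential(nn.Linear(d, hidden), nn.ReLU())
        self.f_tau = nn.Sequential(nn.Linear(n_h, hidden), nn.ReLU())
        self.f_1 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.f_out = nn.Linear(hidden, 2)              # (mu_hat, q_hat) heads

    def forward(self, tau, x):                         # tau: (B, 1), x: (B, d)
        phi = torch.cos(torch.pi * self.j * tau)       # cosine embedding phi(tau)
        h = self.f_x(x) * self.f_tau(phi)              # elementwise merge
        out = self.f_out(self.f_1(h))
        return out[:, 0], out[:, 1]                    # mu_hat, q_hat

def gbc_loss(y, mu_hat, q_hat, tau, w=(0.3, 0.3, 0.4)):
    # Three-term loss (1): L1 anchor + ordering surrogate + quantile loss.
    e = y - q_hat
    anchor = (y - mu_hat).abs()
    m_tau = torch.where(tau < 0.5, (q_hat - y).clamp(min=0), (y - q_hat).clamp(min=0))
    quantile = torch.maximum(tau * e, (tau - 1) * e)
    return (w[0] * anchor + w[1] * (tau - 0.5).abs() * m_tau + w[2] * quantile).mean()

def train(model, X, Y, epochs=200, lr=1e-3):
    # Algorithm 1, train phase: Adam + cosine annealing, fresh tau draws each step.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    for _ in range(epochs):
        tau = torch.rand(len(X), 1)
        mu_hat, q_hat = model(tau, X)
        loss = gbc_loss(Y, mu_hat, q_hat, tau.squeeze(1))
        opt.zero_grad(); loss.backward(); opt.step(); sched.step()
    return model

@torch.no_grad()
def predict_samples(model, x_star, B=256):
    # Algorithm 1, test phase: one forward pass per quantile draw; mu_hat unused.
    tau = torch.rand(B, 1)
    _, q = model(tau, x_star.expand(B, -1))
    return q
```

The test phase makes the O(B · C_fwd) sampling cost explicit: B uniform draws, B forward passes, no factorization.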
The next three propositions establish approximation, accuracy, and consistency guarantees for IQN quantile estimation under the quantile (check) loss. In practice, GBC minimizes the three-term loss (1) rather than the quantile loss alone; Remark 6 below discusses the relationship between the theoretical and practical objectives.

Proposition 3 (Universal approximation of conditional quantiles). Assume X ⊂ R^d is compact and the conditional quantile function (τ, x) ↦ Q_τ(Y | x) is continuous on [0,1] × X. Then for any ε > 0, there exists a ReLU network G_ϕ: [0,1] × X → R with finitely many hidden units such that

    sup_{τ ∈ [0,1], x ∈ X} |G_ϕ(τ, x) − Q_τ(Y | x)| < ε.

Proof idea. The conditional quantile function is continuous on the compact set [0,1] × X; the standard UAT [Hornik et al., 1989] gives a ReLU approximator on this joint input. The IQN's elementwise merge can realize the required joint-input network via bias-term padding. Full proof in Supplement A.

Proposition 4 (CRPS approximation bound). Let Q_τ(Y | x) denote the true conditional quantile function and G_ϕ(τ, x) an approximation satisfying sup_{τ ∈ [0,1]} |G_ϕ(τ, x) − Q_τ(Y | x)| ≤ ε for a given x. Then for any observation y,

    |CRPS(G_ϕ, y) − CRPS(Q, y)| ≤ 2ε.

Proof idea. The CRPS equals 2 ∫_0^1 ρ_τ(y − F^{-1}(τ)) dτ [Gneiting and Raftery, 2007], and the check loss ρ_τ is Lipschitz-1 in its quantile argument. Integrating the pointwise bound gives 2ε. Full proof in Supplement A.

Proposition 5 (Consistency). Assume X ⊂ R^d is compact, the response Y is bounded, the conditional quantile function Q_τ(Y | x) is β-Hölder continuous in (x, τ) for some β > 0, and the network class G_n consists of ReLU networks whose depth L_n and width W_n satisfy L_n W_n log(L_n W_n) = o(n / log n).
Then the empirical quantile loss minimizer Ĝ_n ∈ G_n satisfies

    E ∫_0^1 ρ_τ(Y − Ĝ_n(τ, X)) dτ  →  inf_G E ∫_0^1 ρ_τ(Y − G(τ, X)) dτ    as n → ∞.

In particular, the expected CRPS of Ĝ_n converges to the Bayes-optimal CRPS.

Proof idea. Propositions 3–4 control approximation error. Estimation error is bounded by the Rademacher complexity of G_n composed with the Lipschitz check loss: O(√(L_n W_n log(L_n W_n) / n)) [Padilla et al., 2022]. The network-size condition ensures both terms vanish; minimax rates follow from Schmidt-Hieber [2020]. Full proof in Supplement A.

Remark 6 (Theory vs. practice). Propositions 3–5 analyze the quantile loss (the w_3 term in (1)). The L1 anchor and ordering surrogate serve as finite-sample regularizers, preventing mode collapse and quantile mis-ordering, but introduce a bias controlled by w_1 and w_2. As w_3 → 1, the training objective approaches the pure quantile risk and the theoretical guarantees apply directly; the quantile-dominant configuration (0.1, 0.2, 0.7) used for jump benchmarks is closer to this regime. The empirical results in Sections 5–6 demonstrate practical performance of the composite-loss estimator; characterizing the bias–variance tradeoff as a function of (w_1, w_2, w_3) remains open. These results guarantee existence and consistency but not finite-sample calibration: IQN coverage can deviate from nominal in either direction (Section 7).

4 Boundary Augmentation and Active Learning

GBC's IQN core handles non-stationarity by default, but two extensions improve performance in specific regimes: an input-augmentation strategy for sharp jump boundaries, and an ensemble-based acquisition criterion for active learning.
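Before turning to the extensions, note that the CRPS central to Propositions 4–5 and to the benchmarks can be computed from any quantile function by discretizing its integral representation, CRPS(F, y) = 2 ∫_0^1 ρ_τ(y − F^{-1}(τ)) dτ [Gneiting and Raftery, 2007]. A minimal sketch (the grid size and function names are our choices):

```python
import numpy as np

def crps_from_quantiles(quantile_fn, y, n_grid=999):
    # CRPS(F, y) = 2 * integral over tau of the check loss rho_tau(y - F^{-1}(tau)),
    # approximated on a uniform grid of quantile levels in (0, 1).
    tau = np.arange(1, n_grid + 1) / (n_grid + 1)
    e = y - quantile_fn(tau)
    rho = np.maximum(tau * e, (tau - 1) * e)   # check (pinball) loss
    return 2 * rho.mean()
```

For an IQN surrogate, `quantile_fn` would be τ ↦ G_ϕ(τ, x) at a fixed test input; the estimator applies equally to the GP's Gaussian predictive quantiles.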
4.1 Non-stationarity and heteroskedasticity

Because the IQN G_ϕ is composed of feedforward networks that are universal approximators [Hornik et al., 1989], it imposes no stationarity assumption: the map (τ, x) ↦ G_ϕ(τ, x) can learn arbitrary response surfaces, including those with heteroskedastic variance and discontinuous structure. In contrast, a stationary GP kernel assigns positive covariance to any two points whose inputs are close, encoding a smooth, homoskedastic prior over the response surface. For a heteroskedastic process y = f(x) + σ(x)ϵ, ϵ ∼ N(0,1), a stationary GP effectively estimates an average noise level σ̄, producing overly wide intervals in low-variance regions and under-coverage in high-variance regions. The IQN adapts its conditional quantile estimates locally to σ(x), producing interval widths that vary with the local noise level. Whether the resulting coverage matches the nominal level depends on sample size and signal-to-noise ratio; Section 5.2 illustrates both the adaptation and the residual coverage gap on the motorcycle benchmark.

4.2 Boundary augmentation for jump processes

When the response surface contains sharp jump discontinuities, the IQN must simultaneously learn (i) the boundary geometry and (ii) the conditional quantile function on each side. Learning the boundary from data alone is feasible (Section 5.3) but can be improved by providing a boundary indicator as an additional input feature. An alternative implicit approach, a mixture-of-experts (MoE) IQN with a learned gating network that routes inputs to K expert heads, was tested and failed. With soft routing (weighted average of experts), the gate never commits to a single expert per region (mean specialization 0.64–0.71 out of 1.0), and the mixture output is no better than a single IQN (CRPS 0.036–0.039 vs. 0.031); with hard routing (argmax at inference), predictions are catastrophic (CRPS > 0.59) because individual experts, trained only to contribute partial signals under soft mixing, cannot produce valid standalone predictions. The softmax gate is inherently smooth and cannot represent a step function in input space, making implicit partitioning fundamentally unsuitable for sharp discontinuities.

Inspired by the modular pipeline of MJGP [Flowers et al., 2026], we propose GBC-Aug, a three-step augmentation. First, a two-component Gaussian mixture model (EM) is fit to the marginal training responses {y_i}, producing hard binary labels c_i ∈ {0, 1} that separate the two regimes. Second, an MLP classifier f̂: X → [0,1] is trained via binary cross-entropy loss to predict P(c = 1 | x) from the inputs alone. Third, the classifier output is appended as an extra input feature and a standard IQN is trained on the augmented data {([x_i, f̂(x_i)], y_i)}. The classifier provides a soft boundary indicator that the IQN can exploit to localize its quantile estimates without learning the boundary geometry from scratch. The same IQN architecture and loss function are used as in plain GBC; the only change is the additional input dimension.

GBC-Aug applies when the response has binary jump structure (two distinct regimes). It is not needed for smooth boundaries (e.g., AMHV) or when no specialist competitor exists (e.g., Michalewicz at n = 90,000). The binary assumption is limiting: if the response has more than two regimes or if the marginal modes overlap substantially, EM clustering may misassign labels, propagating errors to the classifier and ultimately to the IQN. The Michalewicz benchmark already illustrates a partial failure of this kind (its marginal is not bimodal), which is why plain GBC is used there instead.
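The first two GBC-Aug steps and the input augmentation can be sketched with scikit-learn (a minimal illustration, assuming scikit-learn is available; the classifier layer sizes are our choices, and the final IQN training step is the same as plain GBC on the augmented matrix):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neural_network import MLPClassifier

def augment_inputs(X, y, random_state=0):
    # Step 1: EM-cluster the marginal responses {y_i} into two regimes.
    gmm = GaussianMixture(n_components=2, random_state=random_state)
    c = gmm.fit_predict(y.reshape(-1, 1))            # hard labels c_i in {0, 1}
    # Step 2: train a classifier for P(c = 1 | x) from the inputs alone
    # (binary cross-entropy is MLPClassifier's default objective).
    clf = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000,
                        random_state=random_state).fit(X, c)
    p = clf.predict_proba(X)[:, 1]                   # soft boundary indicator
    # Step 3: append the indicator as an extra IQN input feature.
    return np.column_stack([X, p]), clf
```

A standard IQN is then trained on the returned augmented matrix exactly as in plain GBC, with one extra input dimension.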
Classifier accuracy is the key bottleneck: on Phantom, replacing the final three-layer MLP (99.98% accuracy, CRPS = 0.009) with a single-hidden-layer classifier (82% accuracy) degrades GBC-Aug CRPS to 0.024 (Section 6.2).

4.3 Active learning with GBC

Active learning (AL) for surrogates sequentially selects the next input x_{n+1} to minimize a global prediction error criterion [MacKay, 1992, Cohn, 1994, Sauer et al., 2023]. At any test location x, GBC generates B predicted outputs ŷ^(b) = G_φ(τ^(b), x), τ^(b) ~ U(0, 1), b = 1, ..., B. The Monte Carlo estimate of predictive variance is

    σ̂²_GBC(x) = (1 / (B − 1)) Σ_{b=1}^{B} (ŷ^(b) − ȳ)²,   ȳ = B⁻¹ Σ_{b=1}^{B} ŷ^(b).

More generally, GBC provides the full predictive distribution P(ŷ | x), enabling a family of acquisition criteria. One natural choice is the width of a predictive interval:

    x*_{n+1} = argmax_{x ∈ X} [ F⁻¹_{ŷ|x}(0.95) − F⁻¹_{ŷ|x}(0.05) ],    (2)

which generalizes the GP's variance-based ALC criterion to non-Gaussian distributions. In practice, however, we found that ensemble disagreement is more robust than (2) in sequential settings, because the single-model quantile spread can collapse as the IQN overfits the growing training set. Ensemble disagreement is defined as

    a(x) = sqrt( (1 / (K − 1)) Σ_{k=1}^{K} (ỹ_k(x) − ȳ(x))² ),   ỹ_k(x) = median_b G_{φ_k}(τ^(b), x),

where ỹ_k(x) is the median prediction from ensemble member k and ȳ(x) = K⁻¹ Σ_k ỹ_k(x) is the mean of the member medians. All active learning experiments in Sections 6.3–6.4 use ensemble disagreement as the acquisition criterion.

The ensemble uses a K = 3 randomized-prior architecture [Osband et al., 2018]: each IQN member k computes G_{φ_k}(τ, x) = g_{φ_k}(τ, x) + α g_{0,k}(τ, x), where g_{0,k} is a frozen random-weight network (scale α = 0.5) whose parameters are drawn once at initialization and never updated.
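The variance estimate, the interval-width criterion (2), and ensemble disagreement are all straightforward to compute from quantile draws. The sketch below is ours: each trained IQN member is assumed to be available as a callable G(tau, x), here stubbed with toy Gaussian quantile functions of N(k·max(x, 0), 1), so the members agree for x < 0 and diverge for x > 0.

```python
import numpy as np
from statistics import NormalDist

norm_ppf = NormalDist().inv_cdf   # standard-normal quantile function

def predictive_variance(G, x, B=200, seed=0):
    """Monte Carlo predictive variance from B quantile-level draws of one model."""
    tau = np.random.default_rng(seed).uniform(0.001, 0.999, size=B)
    draws = np.array([G(t, x) for t in tau])
    return draws.var(ddof=1)

def interval_width(G, x, lo=0.05, hi=0.95):
    """Criterion (2): width of the central 90% predictive interval at x."""
    return G(hi, x) - G(lo, x)

def ensemble_disagreement(members, x, B=200, seed=0):
    """a(x): standard deviation across the members' median predictions."""
    tau = np.random.default_rng(seed).uniform(0.001, 0.999, size=B)
    medians = [np.median([G(t, x) for t in tau]) for G in members]
    return np.std(medians, ddof=1)

# Toy stand-ins for trained IQN members G_{phi_k}.
members = [lambda t, x, k=k: k * max(x, 0.0) + norm_ppf(t) for k in range(3)]
```

For this toy ensemble, `ensemble_disagreement(members, -1.0)` is exactly zero (the members coincide for x < 0) while `ensemble_disagreement(members, 1.0)` is 1.0 (medians differ by 1), so acquisition would target the region x > 0 where the members disagree.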
This design addresses a fundamental difference between NN ensembles and GP posteriors. GP posterior samples are draws from a stochastic process whose diversity is guaranteed by the posterior covariance: even as n → ∞, the samples remain distinct wherever posterior uncertainty is nonzero. By contrast, NN ensemble members trained on the same data with different random seeds converge to nearly identical predictions as n grows, because SGD drives all members toward the same empirical risk minimizer. Bootstrap resampling partially restores diversity by varying the training set, but in our experiments bootstrap ensembles were consistently the worst acquisition strategy (17% worse than random selection on the 7D Proxy and 30% worse on SatMini; Supplement D) because the resampled training sets at small n produce degenerate fits. Randomized priors inject structural diversity that persists regardless of n: in data-sparse regions, g_{φ_k} cannot learn to cancel the random g_{0,k}, so different members make different predictions and disagreement is high; in data-rich regions, g_{φ_k} dominates and the prior perturbation becomes negligible. Among the alternatives tested (bootstrap ensembles, SNGP [Liu et al., 2020], anchored ensembles [Pearce et al., 2020], and last-layer Laplace), randomized priors produced the best acquisition signal on every benchmark (Supplement D).

5 Implementation and Synthetic Benchmarks

We first describe the implementation shared by all benchmarks, then evaluate GBC on three synthetic problems of increasing complexity: a heteroskedastic 1D dataset (motorcycle), piecewise-stationary processes in d = 2–4 (BGP), and a high-dimensional scaling study (d = 10, Friedman).

5.1 Implementation details

All GBC models are implemented in PyTorch [Paszke et al., 2019]. Code to reproduce every experiment in this paper is available at https://github.com/vadimsokolov/gbc-surrogate.
The IQN uses the same architecture throughout: the feature network f_x and quantile projection f_τ each contain one hidden layer of 256 units (ReLU), followed by a shared hidden layer f_1 of 256 units after the elementwise merge, with quantile embedding dimension n_h = 32 and the three-term loss (1) with default weights (0.3, 0.3, 0.4) unless otherwise noted. Training uses Adam [Kingma and Ba, 2015] with learning rate 10⁻³ and cosine annealing [Loshchilov and Hutter, 2017]. Epochs range from 3,000 to 8,000 depending on dataset size and are specified per benchmark below; all were fixed before seeing test-set results. The architecture (depth, width, n_h) was not tuned per benchmark; the same configuration is used for all fourteen datasets, which range from d = 1, n = 133 (motorcycle) up to d = 10 (Friedman) and n = 90,000 (Michalewicz).

Two variants of the GBC framework appear in this paper, differing only in the input representation. GBC (default) feeds the raw inputs x to the IQN with the three-term loss (1) and is used for all synthetic benchmarks and active learning experiments. GBC-Aug appends a learned boundary-classifier feature (Section 4.2) and is used when the response exhibits binary jump structure (Phantom, Star). Both share the same IQN architecture and training loss; Definition 2 provides the unified formulation.

The comparator methods are:

• hetGP: mleHetGP from the R package hetGP [Binois et al., 2018], Matérn-5/2 kernel with jointly estimated latent mean and spatially-varying noise variance.
• MJGP: R package jumpgp [Flowers et al., 2026], EM clustering + classifier + local GP on augmented inputs.
• Stationary GP: Matérn-5/2 via maximum marginal likelihood.
• DGP+ALC: R package deepgp [Sauer et al., 2023], 2-layer deep GP with elliptical slice sampling, active learning via ALC.
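The IQN forward pass implied by the layer sizes above can be sketched in a few lines. The paper specifies the widths but not the merge or embedding in detail; the sketch below is our reading, assuming the cosine quantile embedding and multiplicative merge common in the IQN literature. Weights are random here (they are trained with Adam in the paper), so this illustrates shapes and data flow only.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

d, H, n_h = 4, 256, 32                        # input dim, hidden width, embedding dim

# Weights (random for illustration; trained in practice)
W_x = rng.normal(0, d ** -0.5, (d, H))        # feature network f_x
W_t = rng.normal(0, n_h ** -0.5, (n_h, H))    # quantile projection f_tau
W_1 = rng.normal(0, H ** -0.5, (H, H))        # shared hidden layer f_1
w_out = rng.normal(0, H ** -0.5, H)

def iqn_forward(tau, x):
    """One forward pass: G_phi(tau, x) -> one draw at quantile level tau."""
    emb = np.cos(np.pi * np.arange(1, n_h + 1) * tau)  # cosine embedding (assumed)
    h_x = relu(x @ W_x)                                 # f_x(x)
    h_t = relu(emb @ W_t)                               # f_tau(tau)
    h = relu((h_x * h_t) @ W_1)                         # elementwise merge, then f_1
    return h @ w_out

x = rng.uniform(0.0, 1.0, d)
draws = np.array([iqn_forward(t, x) for t in rng.uniform(0.01, 0.99, 200)])
```

One forward pass per quantile level τ yields one predictive draw, which is the property that makes sampling O(B) at test time.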
Throughout we report three metrics: root mean-squared error (RMSE) for point prediction (or root mean-squared percentage error, RMSPE, when the baseline reports relative errors); continuous ranked probability score (CRPS) for distributional accuracy; and empirical 90% coverage of predictive intervals (nominal = 0.90). CRPS simultaneously rewards calibration and sharpness and applies to any predictive distribution [Gneiting and Raftery, 2007], making it our primary metric. We estimate CRPS from M posterior quantile samples via

    CRPS^(F, y) = (1/M) Σ_{m=1}^{M} |q_m − y| − (1/(2M²)) Σ_{m=1}^{M} Σ_{m'=1}^{M} |q_m − q_{m'}|,

where q_m = F⁻¹(τ_m) are quantiles at a regular grid τ_m = m/(M + 1).

GBC models train on a single NVIDIA A100 GPU; GP baselines run on a 16-core CPU (Intel Xeon). Wall-clock comparisons therefore reflect both algorithmic complexity (O(n) SGD vs. O(n³) Cholesky) and hardware differences (GPU vs. CPU). To separate these effects, the Friedman scaling experiment in Section 5.4 reports asymptotic growth rates alongside absolute times.

5.2 Motorcycle data (hetGP comparison)

The motorcycle helmet dataset (MASS::mcycle, n = 133) is a canonical benchmark for heteroskedastic non-stationary emulation: head acceleration as a function of time after a simulated crash, with variance that increases sharply during the main deceleration phase (times 15–35 ms). We compare GBC to hetGP [Binois et al., 2018], the purpose-built heteroskedastic GP for this setting, rather than a stationary GP.

We use 50 random 80:20 train/test splits (identical for both methods). GBC trains an ensemble of K = 5 IQNs (each 5,000 epochs, different random seeds) on the same splits; predictions pool B = 200 quantile samples per model (1,000 total). No architecture modification is made relative to any other benchmark.

Table 1: Motorcycle benchmark (MASS::mcycle, n = 133): mean ± SE over 50 replicates.
hetGP is purpose-built for heteroskedastic data; GBC-IQN, with default loss weights and no custom kernel or noise model, achieves marginally lower CRPS.

Method                  RMSE            CRPS            90% Coverage
hetGP (Matérn-5/2)      23.81 ± 0.49    12.56 ± 0.24    0.875
GBC (IQN, single)       23.92 ± 0.57    12.56 ± 0.29    0.803
GBC (IQN, K=5 ens.)     23.88 ± 0.57    12.49 ± 0.29    0.830

Figure 3 illustrates three approaches to predictive uncertainty. The stationary GP assigns near-constant interval width across the domain. The hetGP adapts via a latent noise process, producing wider intervals during the high-variance deceleration phase and narrower intervals elsewhere. GBC adapts its interval width to local variance without a parametric noise model: the IQN's conditional quantile function inherently captures heteroskedasticity. Ensembling smooths the posterior mean and widens the intervals, partially closing the coverage gap with hetGP (from 0.803 to 0.830, though still below hetGP's 0.875).

The central finding is that GBC, with default loss weights and no custom noise model, is comparable to a purpose-built heteroskedastic GP on CRPS (12.49 vs. 12.56; the difference is within standard-error overlap and should not be interpreted as a significant advantage). The hetGP achieves better coverage because its explicit noise model σ²_δ(x) provides a well-calibrated Gaussian predictive distribution when the heteroskedastic Gaussian assumption holds, a structural advantage at this small sample size (n_train = 106) that GBC's nonparametric quantile estimates cannot fully match.
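The quantile-grid CRPS estimator from Section 5.1 takes a few lines of code. As a sanity check, the sketch below (our own, using only the standard library's NormalDist) compares it against the closed-form CRPS of a Gaussian predictive distribution.

```python
import numpy as np
from statistics import NormalDist

def crps_from_quantiles(q, y):
    """Sample-based CRPS: mean |q_m - y| minus half the mean pairwise spread."""
    q = np.asarray(q, dtype=float)
    term1 = np.abs(q - y).mean()
    term2 = np.abs(q[:, None] - q[None, :]).mean() / 2.0
    return term1 - term2

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS for N(mu, sigma^2) [Gneiting and Raftery, 2007]."""
    nd = NormalDist()
    z = (y - mu) / sigma
    return sigma * (z * (2 * nd.cdf(z) - 1) + 2 * nd.pdf(z) - 1 / np.sqrt(np.pi))

# Quantiles of N(0, 1) on the regular grid tau_m = m / (M + 1)
M = 2000
tau = np.arange(1, M + 1) / (M + 1)
q = np.array([NormalDist().inv_cdf(t) for t in tau])

y = 0.3
approx, exact = crps_from_quantiles(q, y), crps_gaussian(0.0, 1.0, y)
```

With M = 2000 grid quantiles the estimator agrees with the closed form to within about 0.01; the residual bias comes from truncating the quantile grid at τ = 1/(M+1) and M/(M+1).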
Figure 3: Motorcycle benchmark: stationary GP (top left), hetGP (top right), single IQN (bottom left), and K=5 IQN ensemble (bottom right). Shaded region: 90% predictive interval. The stationary GP assigns near-constant uncertainty; hetGP adapts via a latent noise process; GBC adapts without a parametric noise model.

5.3 Piecewise and jump processes (BGP benchmarks)

Piecewise-continuous response surfaces arise whenever a physical system undergoes a regime change: a phase transition, a structural failure threshold, or the capacity saturation seen in the AMHV data (Section 6.1). Park et al. [2026] define Bi-mixture GP (BGP) datasets that isolate this difficulty in a controlled setting:

    f(x) = f_1(x) 1_{X_1}(x) + f_2(x) 1_{X∖X_1}(x),   x ∈ [−0.5, 0.5]^d,

where X_1 = {a^T x ≥ 0} with a drawn uniformly from {−1, +1}^d; f_1 ~ GP(0, 9 k_SE) and f_2 ~ GP(13, 9 k_SE) with squared-exponential kernel of length scale 0.1 d; and observations are noisy: y_i = f(x_i) + ε_i, ε_i ~ N(0, 4). The two regimes are smooth GP surfaces separated by a hyperplane whose orientation varies randomly across replicates. The jump-to-noise ratio is 13/2 = 6.5: large enough that the discontinuity is clearly visible, but the noise prevents trivial boundary detection. A stationary GP must choose a single length scale that trades off smoothness within each regime against fidelity at the boundary: exactly the tension that motivates non-stationary methods.
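A BGP realization is easy to simulate from the definition above. The sketch below is ours (the function name is not from the paper); it takes the length scale as 0.1 d and the noise variance as 4, exactly as stated in the text, and draws the two GP surfaces by Cholesky factorization of the shared kernel matrix.

```python
import numpy as np

def sample_bgp(n=300, d=2, seed=0):
    """One BGP realization: two smooth GP surfaces split by a random hyperplane."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-0.5, 0.5, size=(n, d))
    a = rng.choice([-1.0, 1.0], size=d)          # partition normal, a in {-1,+1}^d
    in_X1 = X @ a >= 0.0                         # regime indicator 1_{X_1}

    # Squared-exponential kernel: variance 9, length scale 0.1 d (per the text)
    ell = 0.1 * d
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = 9.0 * np.exp(-sq / (2.0 * ell ** 2)) + 1e-8 * np.eye(n)
    L = np.linalg.cholesky(K)

    f1 = L @ rng.normal(size=n)                  # f1 ~ GP(0, 9 k_SE)
    f2 = 13.0 + L @ rng.normal(size=n)           # f2 ~ GP(13, 9 k_SE)
    f = np.where(in_X1, f1, f2)
    y = f + rng.normal(0.0, 2.0, size=n)         # noise sd 2 (variance 4)
    return X, y, in_X1

X, y, in_X1 = sample_bgp()
```

The mean level shift between the two regimes is about 13 on any realization, against a noise standard deviation of 2, matching the 6.5 jump-to-noise ratio quoted above.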
We test d ∈ {2, 3, 4}. For each dimension d, we generate 20 independent BGP realizations (new partition vector a and new GP draws each time). Each realization provides N = 2,000 points in [−0.5, 0.5]^d with an 80:20 train/test split (n_train = 1,600). Both methods use the same training data. The reported variance across replicates reflects both estimation error and variation in problem difficulty (different a and GP draws).

Table 2: BGP benchmarks (Park et al. 2026): mean ± SE over 20 replicates. GBC-IQN achieves lower CRPS at all dimensions; the advantage grows with d as the GP's global length-scale estimate becomes less effective.

        GP (Matérn-5/2)                   GBC (IQN)
d       RMSE            CRPS              RMSE            CRPS            90% Cov.
2       2.514 ± 0.053   1.379 ± 0.024     2.168 ± 0.028   1.224 ± 0.015   0.894
3       2.762 ± 0.045   1.515 ± 0.022     2.197 ± 0.018   1.236 ± 0.011   0.885
4       3.122 ± 0.060   1.716 ± 0.030     2.245 ± 0.017   1.267 ± 0.012   0.845

GBC achieves lower CRPS on every replicate and dimension, with the average improvement growing with d: from 14% RMSE / 11% CRPS at d = 2 to 28% / 26% at d = 4 (standard errors in Table 2). The stationary GP faces a fundamental trade-off: its single length-scale parameter must simultaneously smooth within regions and accommodate the sharp jump; the IQN adapts its conditional quantile function to each side of the partition without a kernel assumption. Figure 4 illustrates the d = 2 case. The stationary GP assigns a single length scale to the entire domain, blurring the predicted mean near the boundary. The IQN captures the level shift cleanly, with narrower uncertainty intervals away from the boundary and broader intervals near it.
Figure 4: BGP benchmark (d = 2): predicted mean surface from stationary GP (left) vs. GBC/IQN (right). Dashed line: true partition boundary a^T x = 0. The GP smooths across the jump; GBC captures the regime change.

5.4 Friedman high-dimensional scaling

The Friedman function [Friedman, 1991, Surjanovic and Bingham, 2013]

    f(x) = 10 sin(π x_1 x_2) + 20 (x_3 − 0.5)² + 10 x_4 + 5 x_5,   x ∈ [0, 1]^{10},

is a standard benchmark for high-dimensional regression: five of the ten inputs are active (with a nonlinear interaction between x_1 and x_2), and the remaining five are inert, testing a method's ability to concentrate on the relevant subspace. The smooth but nonlinear structure makes this a natural GP problem, and it is widely used to benchmark scalability because GP training cost grows as O(n³) while the function itself is cheap to evaluate at any n.

We evaluate both prediction accuracy (at n = 2,000 in 10 replicates) and scalability (single runs at n ∈ {500, 1,000, 2,000, 5,000, 10,000, 20,000}). Training data: y^(i) = f(x^(i)) + ε^(i), ε^(i) ~ N(0, 1). The GP baseline uses Matérn-5/2 with noise, optimized via maximum marginal likelihood. GBC trains for 3,000 epochs; both methods are evaluated on 500 held-out test points.

Table 3: Friedman d = 10, n = 2,000: mean ± SE over 10 replicates. GBC-IQN achieves lower RMSE and CRPS. Both methods produce coverage well above the 90% nominal level, indicating conservative (overly wide) intervals.

Method              RMSE            CRPS            90% Coverage
GP (Matérn-5/2)     0.590 ± 0.016   0.378 ± 0.005   0.995
GBC (IQN)           0.520 ± 0.009   0.323 ± 0.004   0.991

GBC achieves RMSE = 0.520 ± 0.009 and CRPS = 0.323 ± 0.004, versus the GP's RMSE = 0.590 ± 0.016 and CRPS = 0.378 ± 0.005: a 12% improvement on RMSE and 14% on CRPS, with GBC achieving lower CRPS on every individual replicate (10/10).

Table 4 and Figure 5 report the scaling behavior. At n = 2,000, GBC is both faster (17.7 s GPU vs. 47.6 s CPU) and more accurate. Beyond n = 2,000, the dense-covariance GP cannot be evaluated on a standard workstation, while GBC continues to improve: RMSE = 0.40 at n = 5,000, 0.36 at n = 10,000, and 0.29 at n = 20,000 (training time: 152.5 s). On this benchmark and hardware configuration, GBC becomes both faster and more accurate at approximately n = 1,500; the crossover point will vary with problem structure and hardware.

Table 4: Friedman d = 10 scaling study: training time (seconds) and test RMSE at each sample size. GP is infeasible beyond n = 2,000. Each row is a single run; the n = 2,000 RMSE values differ from Table 3's 10-replicate means because of sampling variability in the training set.

        GP (Matérn-5/2)       GBC (IQN)
n       Time (s)   RMSE       Time (s)   RMSE
500     2.9        1.26       9.6        0.91
1,000   9.7        0.76       10.4       0.71
2,000   47.6       0.64       17.7       0.53
5,000   infeasible            41.2       0.40
10,000  infeasible            70.6       0.36
20,000  infeasible            152.5      0.29

Figure 5: Friedman d = 10: training time (left) and test RMSE (right) vs. sample size. GP time grows as O(n³); GBC grows approximately linearly. GBC achieves lower RMSE at all n and remains feasible as n grows beyond the GP's reach.
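For reference, the Friedman test function and noisy training data used in this section can be generated in a few lines; this is the standard construction, with a helper name of our choosing.

```python
import numpy as np

def friedman(X):
    """Friedman (1991) function on [0, 1]^10; inputs x6..x10 are inert."""
    return (10.0 * np.sin(np.pi * X[:, 0] * X[:, 1])
            + 20.0 * (X[:, 2] - 0.5) ** 2
            + 10.0 * X[:, 3]
            + 5.0 * X[:, 4])

def make_data(n, seed=0):
    """Noisy training pairs: y = f(x) + eps, eps ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, 10))
    return X, friedman(X) + rng.normal(size=n)

X, y = make_data(2000)
```

Because only the first five coordinates enter f, a surrogate's error on this benchmark directly measures how well it ignores the five inert dimensions.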
6 Real-Data and Applied Examples

This section evaluates GBC on six applied benchmarks from the computer experiments literature: semiconductor transport (AMHV), spatial jump processes (Phantom, Star, Michalewicz), and active learning for rocket aerodynamics and satellite drag.

6.1 AMHV semiconductor transport

The AMHV (Automated Material Handling Vehicle) transport dataset [Kang et al., 2024] records average transport time (seconds) for autonomous vehicles in a semiconductor wafer fabrication facility (N = 9,503, d = 4). The four inputs (vehicle speed, acceleration, minimum inter-vehicle distance, and empty-vehicle search range) control logistics parameters whose combined effect on transport time is smooth at low utilization but jumps sharply when the system approaches capacity saturation. This real-world dataset is the hardest of the four Flowers benchmarks: the two response regimes overlap in the marginal distribution (unlike the well-separated modes of Phantom and Star), and d = 4 makes the data sparse relative to input volume. Inputs and response are normalized to [0, 1] following Flowers et al. [2026]. We use the same 90:10 train/test protocol (20 replicates).

On this smooth-boundary dataset, GBC without augmentation achieves RMSE = 0.024 and CRPS = 0.011 (SE < 0.001 for both) with 91% coverage. MJGP achieves CRPS = 0.012, marginally worse, and its RMSE (0.027) is actually worse than GBC's (0.024): the jump-detection step introduces overhead when the boundary is smooth rather than sharp. This illustrates a recurring pattern: a general-purpose architecture avoids the performance penalty that specialized methods may incur when their structural assumptions are not well-matched to the data.

6.2 Flowers jump datasets (Phantom and Star)

Flowers et al. [2026] benchmark their MJGP on four datasets representing diverse piecewise structures.
The Phantom dataset (d = 2, N = 10,201) simulates a bivariate response with a sinusoidal jump boundary (sin(x_1) = x_2) that creates nested curvilinear regions, a shape that defeats axis-aligned partitions and simple classifiers. The Star dataset (d = 2, N = 10,201) replaces the smooth sine boundary with an angular, star-shaped partition in which every point in the interior regime is geographically close to the exterior regime, posing a challenge for neighborhood-based methods. Both datasets generate the response from two Gaussian components conditioned on regime membership, yielding a bimodal marginal distribution with a sharp level shift across the boundary. The difficulty is not in predicting the response within each regime (a simple GP suffices there) but in learning the complex boundary geometry: misclassifying even a thin strip of the input space near the boundary produces large errors in both the mean prediction and the predictive interval placement. We compare GBC directly to MJGP using the same data, the same 90:10 random train/test protocol (20 replicates), and the full training set for both methods.

Table 5: Flowers et al. (2026) benchmark datasets: mean ± SE over 20 replicates, full training sets. MJGP uses the authors' code [Flowers et al., 2026]. GBC-Aug augments the IQN input with a boundary-classifier probability (Section 4.2). Bold = best value across methods.

                            MJGP                          GBC                            GBC-Aug
Dataset (n_train)           RMSE          CRPS            RMSE          CRPS             RMSE          CRPS
Phantom (n=9k, d=2)         0.053±0.001   0.010±0.000     0.095±0.001   0.031±0.001†     0.044±0.002   0.009±0.000†
Star (n=9k, d=2)            0.085±0.001   0.024±0.001     0.124±0.001   0.045±0.001†     0.059±0.002   0.013±0.000†
AMHV (n=8.5k, d=4)          0.027±0.000   0.012±0.000     0.024±0.000   0.011±0.000      —             —
Michalewicz (n=90k, d=4)    infeasible (≥ 48 h)           0.064±0.001   0.031±0.000†     —             —

† Pinball-dominant loss (w_1 = 0.10, w_2 = 0.20, w_3 = 0.70); AMHV uses the default loss. GBC 90% coverage: Phantom 0.91, Star 0.89, AMHV 0.91, Michalewicz 0.91. GBC-Aug is applied only to datasets with binary jump structure (Phantom, Star). MJGP metrics are our evaluations using the authors' code; Flowers et al. [2026] report RMSPE only. "±0.000" indicates SE < 0.0005.

On Phantom (d = 2), plain GBC with quantile-dominant loss achieves CRPS = 0.031, approximately 3× worse than MJGP's 0.010. The L1 location-anchor term encourages the shared hidden layers to predict the conditional median, which near a jump boundary is an average of the two regimes rather than a sharp transition. Reducing the anchor weight to w_1 = 0.10 and increasing the quantile weight to w_3 = 0.70 frees the network to learn sharper boundary representations. A loss-weight ablation on Phantom (20 replicates, 6,000 epochs) confirms this: (0.05, 0.25, 0.70) gives CRPS = 0.033; (0.00, 0.30, 0.70) gives 0.035; and (0.10, 0.20, 0.70) gives 0.031 (Table 5). Removing the anchor entirely (w_1 = 0) degrades performance, suggesting that a small L1 term stabilizes optimization without biasing the quantile estimates toward the boundary average. The largest gain, 28% on Phantom, comes from this loss reweighting alone, with no architectural change; the gain decreases with boundary sharpness (15% on Star, 9% on Michalewicz, 5% on the smooth AMHV).

Applying GBC-Aug (Section 4.2) closes the remaining gap entirely. With an MLP classifier achieving 99.98% accuracy on Phantom, GBC-Aug achieves RMSE = 0.044 ± 0.002 (vs. MJGP's 0.053) and CRPS = 0.009 (SE < 0.001) with 91% coverage, matching MJGP on CRPS and improving RMSE by 17%. On Star, GBC-Aug achieves CRPS = 0.013 (SE < 0.001; 89% coverage), outperforming MJGP's 0.024 by 46%.
Total training time (classifier + IQN) is ≈ 23 s per replicate on GPU (A100), compared to > 600 s for MJGP on 16 CPU threads. This 26× wall-clock ratio reflects both the O(n) vs. O(n³) algorithmic difference and the GPU-vs-CPU hardware asymmetry.

A notable finding is that ensembling hurts on jump data. On smooth data (motorcycle), pooling K = 5 IQNs improves CRPS via variance reduction (Table 1). On jump data, ensembling increases CRPS, the opposite effect. The practical recommendation: for jump data, use a single model with quantile-dominant loss; for smooth data, ensemble for variance reduction. We discuss a candidate mechanism (boundary-location blur across members) in Section 7.

The Michalewicz benchmark [Surjanovic and Bingham, 2013] (d = 4, N = 100,000 via Latin hypercube; 90:10 split, n_train = 90,000) modifies the standard Michalewicz function by adding a 0.5 offset to the flat region, introducing a jump manifold in four-dimensional input space. Unlike Phantom and Star, the marginal response is not bimodal: it is a point mass near 0.5 plus a scattered distribution in [−4, 0], violating the Gaussian-mixture clustering assumption. At n_train = 90,000, MJGP is computationally infeasible within practical time limits (≥ 48 h). GBC trains in under 30 min on GPU, achieving RMSE = 0.064 ± 0.001 and CRPS = 0.031 (SE < 0.001), a regime accessible only to O(n) methods.

6.3 Rocket LGBB active learning

The Rocket LGBB (Langley Glide-Back Booster) benchmark [Pamadi et al., 2004, Gramacy, 2020, Sauer et al., 2023] models the aerodynamics of a reusable NASA rocket booster gliding back to Earth after launch. The three inputs are Mach number, angle of attack, and side-slip angle; the response is the side-force coefficient (one of six aerodynamic coefficients produced by the simulator).
The sound barrier at Mach = 1 imparts a sharp regime change on the response surface, creating a nonlinear ridge in the Mach–angle-of-attack plane that stationary GP kernels systematically oversmooth. This makes the LGBB a standard benchmark for non-stationary surrogates in the computer experiments literature [Gramacy, 2020]. Following Sauer et al. [2023] exactly, we use n_0 = 50 initial Latin hypercube points, acquire 250 additional points via the ensemble disagreement criterion, and evaluate on a 1,000-point test set after each acquisition (30 independent replicates).

Table 6: Active learning benchmarks: GBC-AL vs. DGP+ALC. Metric ± SE over independent replicates. Ratio > 1 favors GBC (bold).

Dataset            n     Metric   GBC-AL            DGP+ALC           Ratio
2D Exponential     100   RMSE     0.087 ± 0.003     0.072*            0.83×
Rocket LGBB        300   RMSE     0.0072 ± 0.0002   0.0213 ± 0.001*   2.95×
Satellite GRACE    500   RMSPE    1.19 ± 0.01       0.967*            0.81×
SatMini (d=7)      60    RMSPE    5.30 ± 0.15       7.91 ± 0.28       1.49×
7D Proxy           70    RMSE     0.344 ± 0.012     0.558 ± 0.007     1.62×

* DGP values from the published results of Sauer et al. [2023] (mean only; SE not available from published output); SatMini and 7D Proxy DGP values are from our DGP+ALC runs on the same data splits. Replicates: 30 (2D, Rocket GBC), 50 (Rocket DGP, from published code), 10 (Satellite, SatMini, 7D Proxy). Supplement D provides details on 2D Exponential, SatMini, and 7D Proxy. Coverage is not reported for AL benchmarks because the evaluation metric is point-prediction RMSE/RMSPE, following Sauer et al. [2023].

Figure 6 shows the learning curves. At the final budget, GBC-AL achieves 2.95× lower RMSE than DGP+ALC (0.0072 vs. 0.0213), with non-overlapping replicate distributions (every GBC run yields lower RMSE than every DGP run); the 90% bands separate by n ≈ 150.
The Rocket response surface contains sharp nonlinearities in the side-force coefficient that are better captured by the IQN than by the DGP's smooth squared-exponential kernel.

Figure 6: Rocket LGBB active learning: RMSE vs. training set size. Lines show the mean over 30 replicates (GBC-AL) and 50 replicates (DGP+ALC); shaded regions are 90% bands (5th to 95th percentile). The DGP+ALC replicate count (50) reflects the published experimental design of Sauer et al. [2023]; GBC-AL uses 30 replicates. The unequal counts do not alter the qualitative conclusion: the bands separate completely by n ≈ 150, and every GBC replicate achieves lower RMSE than every DGP replicate at n = 300.

On the 2D Exponential benchmark (d = 2, smooth, n = 100), DGP achieves 21% lower RMSE. This is consistent with the GP's inductive bias being well-matched to smooth, low-dimensional functions: the squared-exponential kernel's smoothness assumption provides effective regularization from very few points, while the neural network must estimate its implicit smoothness from limited data. Two supplementary proxy experiments (Supplement D) corroborate these findings at smaller budgets: SatMini, a reduced-budget version of the Satellite GRACE problem (n = 60), and 7D Proxy, a synthetic smooth function in [0, 1]^7 designed to mimic the satellite drag surface at low cost. GBC-AL outperforms DGP+ALC on both (+33% and +38%, Table 6).

6.4 Satellite GRACE active learning

The Satellite GRACE benchmark [Sun et al., 2019, Mehta et al., 2014, Sauer et al., 2023] uses a Test Particle Monte Carlo (TPMC) simulator developed at Los Alamos National Laboratory to compute drag coefficients for satellites in low-Earth orbit, a quantity needed for positioning and collision avoidance.
The seven inputs describe the thermospheric conditions (atmospheric composition, velocity, temperature) encountered by the GRACE satellite. Unlike the Rocket LGBB, the response surface is smooth and the simulator is stochastic (≈ 86 s per evaluation), making this a high-dimensional, expensive-to-evaluate problem where the GP's smoothness prior is well-matched. Mehta et al. [2014] show that a GP trained on a static 1,000-point Latin hypercube achieves < 1% RMSPE; the active learning question is whether sequential design can reach this accuracy with fewer runs. Following Sauer et al. [2023], we fix n_0 = 110 initial points (including 10 replicate pairs to estimate noise) and acquire 390 additional points (10 independent replicates).

Figure 7 reveals a nuanced comparison. GBC-AL leads throughout the early acquisition phase: at n = 200, GBC achieves RMSPE = 1.89 ± 0.02 (mean ± SE over 10 replicates) versus DGP's 2.26 (from the published learning curve), a 16% advantage. However, after approximately n = 300 points, DGP overtakes GBC, ultimately reaching RMSPE = 0.967 at n = 500 compared to GBC's 1.19. This crossover is consistent with the pattern observed across all benchmarks: GP surrogates benefit from an informative smoothness prior that the neural network lacks, and this advantage grows with n on smooth surfaces (Section 7).

Figure 7: Satellite GRACE active learning: RMSPE vs. training set size. GBC-AL leads for n < 300; DGP+ALC overtakes after the crossover (dashed vertical line). For expensive simulators where the acquisition budget is limited, GBC provides better accuracy during the early phase.
For expensive simulators where the acquisition budget is limited, the most practically relevant regime, GBC-AL delivers better accuracy during the early phase when each new point matters most. The SatMini and 7D Proxy benchmarks in Table 6 confirm this small-n advantage on independent datasets: GBC improves by 33% (n = 60) and 38% (n = 70), respectively.

These results suggest that GBC-AL tends to be more effective when the acquisition budget is moderate relative to the input dimension, while DGP+ALC benefits from its informative smoothness prior when enough data accumulates to identify the length-scale parameters of a smooth kernel.

Table 7 consolidates GBC's performance across the benchmarks studied. In the emulation comparisons (Table 7, top), GBC matches or improves CRPS in every setting where a direct GP comparison is available, with the advantage growing with problem dimension and sample size.

Table 7: Summary of GBC performance across all benchmarks. "GP baseline" denotes the method-of-choice GP variant for each problem type. ∆ is the percentage improvement of GBC over the GP baseline (positive = GBC better; CRPS for emulation, RMSE/RMSPE for AL); k× denotes a ratio when the improvement exceeds 100%. Bold indicates GBC is better.

Benchmark                GP Baseline       GP Metric   GBC Metric   ∆         Note
Emulation (CRPS, lower is better)
BGP d=2 (n=1.6k)         Stationary GP     1.379       1.224        +11%      20/20 reps
BGP d=3 (n=1.6k)         Stationary GP     1.515       1.236        +18%      20/20 reps
BGP d=4 (n=1.6k)         Stationary GP     1.716       1.267        +26%      20/20 reps
Motorcycle (n=133)       hetGP             12.56       12.49        +0.5%§    50 reps
Friedman d=10 (n=2k)     Stationary GP     0.378       0.323        +14%      10/10 reps
Friedman (n=20k)         GP infeasible     —           0.29         —         scales to n=90k
Michalewicz (n=90k)      MJGP infeasible   —           0.031        —         GBC only
Phantom d=2 (n=9k)       MJGP              0.010       0.009‡       ≈ +10%    GBC-Aug; marginal
Star d=2 (n=9k)          MJGP              0.024       0.013‡       +46%      GBC-Aug
AMHV d=4 (n=8.5k)        MJGP              0.012       0.011        +8%       smooth
Active learning (RMSE/RMSPE, lower is better)
AL 2D Exp (n=100)        DGP+ALC           0.072       0.087        −21%      smooth, small n
AL Rocket (n=300)        DGP+ALC           0.0213      0.0072       2.95×     30/30 reps
AL Satellite (n<300)     DGP+ALC           2.26†       1.89†        +16%      GBC leads
AL Satellite (n=500)     DGP+ALC           0.967       1.19         −23%      DGP leads
AL SatMini (n=60)        DGP+ALC           7.91        5.30         +33%      small n
AL 7D Proxy (n=70)       DGP+ALC           0.558       0.344        +38%      small n

Note: ‡ GBC-Aug (boundary-augmented); see Section 4.2. † At n = 200; GBC leads for n < 300, DGP overtakes after. § Within standard-error overlap; not a significant difference. GBC records a favorable point estimate in 12 of 14 direct comparisons; 2 additional rows are feasibility-only (GP infeasible). Comparisons are descriptive: protocols, replicate counts (10–50), and baseline sources (re-run vs. published) vary across benchmarks (see Table 6 footnote). DGP is better on smooth surfaces at small n (2D Exp) and large n (Satellite).

7 Discussion

The CRPS advantage grows with problem dimension (11% to 26% on the BGP benchmarks) and with deviation from stationarity. Against specialist methods, GBC-Aug matches or surpasses MJGP by adopting its boundary-detection pipeline with an IQN backend; GBC-AL outperforms DGP+ALC on Rocket (2.95×) but loses on smooth surfaces at large n (Satellite GRACE). The two cases where GBC consistently underperforms, 2D Exponential and Satellite at n = 500, both involve smooth functions where the GP's squared-exponential kernel provides effective regularization that the neural network does not replicate from limited data.

Ensembling helps on smooth data but hurts on jump data (Section 6.2). A candidate mechanism is that ensemble members localize the discontinuity boundary at slightly different locations, and pooling their quantile samples blurs the conditionals near the boundary.
We have not isolated this effect experimentally (e.g., via a boundary-location ablation), so it remains a hypothesis.

GBC also has a natural connection to likelihood-free inference. Because the IQN is trained on simulated pairs (θ, y) drawn from the joint distribution, it functions as an amortized posterior sampler in the sense of simulation-based inference [Cranmer et al., 2020, Papamakarios et al., 2021]. Unlike ABC [Beaumont et al., 2002], which requires per-observation forward-model evaluations, GBC generates posterior samples via a single forward pass after training. Unlike normalizing flows, the IQN provides direct access to arbitrary quantiles without invertibility constraints. Supplement C demonstrates this connection on a bimodal posterior inference task.

GBC training is O(N · L · w²), where L is the number of layers and w the layer width; this is linear in N via mini-batch SGD. At inference, the cost is O(B · L · w²) per new observation, independent of training-set size. Because GBC and GP baselines run on different hardware (GPU vs. CPU), wall-clock comparisons reflect both algorithmic and hardware differences; Figure 5 isolates the asymptotic growth rates.

IQN calibration varies across problem regimes. Coverage is near-nominal (0.85–0.91) on piecewise and heteroskedastic benchmarks (BGP, AMHV), under-nominal on motorcycle (0.83, small n), and over-conservative on Friedman (>0.99). No existing theory predicts which regime applies; the answer depends on the interaction between signal-to-noise ratio, boundary sharpness, and sample size. The CRPS improvements we report mix calibration and sharpness; a PIT decomposition of these components remains an open question.
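To make the mechanics concrete, the sketch below walks through the three steps discussed above: the pinball training objective, predictive sampling with one quantile-function evaluation per level τ, and sample-based CRPS scoring via the energy form CRPS = E|Y − y| − ½ E|Y − Y′| [Gneiting and Raftery, 2007]. The "network" q_net is a hypothetical stand-in (the exact conditional quantile function of a Gaussian toy model), not a trained IQN:

```python
import numpy as np
from statistics import NormalDist

def pinball_loss(y, q, tau):
    # Quantile (pinball) loss rho_tau(y - q), averaged over a batch; IQN
    # training minimizes this with tau drawn uniformly per example.
    u = y - q
    return float(np.mean(np.maximum(tau * u, (tau - 1.0) * u)))

def q_net(x, tau):
    # Stand-in for a trained quantile network q_theta(x, tau): the exact
    # conditional quantile of y | x in the toy model y = x + N(0, 0.5^2).
    return x + NormalDist(0.0, 0.5).inv_cdf(tau)

# Step 1: the true 0.9-quantile incurs a lower pinball loss than a
# misspecified value, which is what drives training toward the quantiles.
rng = np.random.default_rng(0)
y_samp = 1.0 + 0.5 * rng.standard_normal(20_000)
print(pinball_loss(y_samp, q_net(1.0, 0.9), 0.9) <
      pinball_loss(y_samp, q_net(1.0, 0.5), 0.9))        # True

# Step 2: predictive sampling, one forward pass per quantile level.
taus = (np.arange(2000) + 0.5) / 2000                    # stratified levels in (0, 1)
draws = np.array([q_net(1.0, t) for t in taus])

# Step 3: score the predictive distribution with sample-based CRPS.
def crps_from_samples(d, y):
    d = np.asarray(d, dtype=float)
    return float(np.mean(np.abs(d - y))
                 - 0.5 * np.mean(np.abs(d[:, None] - d[None, :])))

print(round(float(draws.mean()), 2))                     # 1.0, the predictive mean
print(round(crps_from_samples(draws, 1.0), 2))           # about 0.12
```

The same crps_from_samples scorer applies unchanged to GP predictive draws, which is what makes CRPS a like-for-like comparison between the two surrogate families.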
More broadly, neural network variance estimates, whether from heteroskedastic NLL, ensemble spread, or last-layer Laplace approximations, are systematically overconfident on out-of-sample data: the predicted variance is calibrated on training inputs but does not extrapolate to regions where the network has seen little data. When guaranteed coverage is required, conformal prediction [Vovk et al., 2005, Romano et al., 2019] provides a distribution-free post-processing step that recalibrates any black-box predictor using held-out residuals. The IQN's quantile-based intervals avoid the variance-estimation problem but, as shown above, still exhibit regime-dependent miscalibration.

Several caveats apply. The benchmarks do not cover all domains (d > 10, functional outputs, stochastic simulators). Some gains are within standard-error overlap (motorcycle +0.5%, Phantom 0.009 vs. 0.010). The "12 of 14" count is descriptive, not a formal test; replicate counts vary (10–50) and some baselines lack standard errors. Wall-clock ratios reflect GPU-vs-CPU hardware differences in addition to algorithmic scaling; Figure 5 isolates the latter. Two baseline classes are absent: sparse variational GPs [Titsias, 2009, Hensman et al., 2013], which address cubic cost but retain stationarity/Gaussian assumptions, and non-GP surrogates (e.g., gradient-boosted trees), which lack principled distributional output for CRPS comparison.

Several open problems remain: formal AL optimality theory for IQNs analogous to IMSE/ALC for GPs [Sacks et al., 1989, Gramacy, 2020]; adaptive architecture selection; extension to vector-valued outputs; and improved epistemic uncertainty for small-n active learning, where GP posteriors provide a structural advantage.

For practitioners, the key guideline is that GBC's strengths and limitations are predictable from problem structure.
GBC is the stronger choice when the response has discontinuities, when d or n is large enough that cubic GP cost becomes prohibitive, or when the predictive distribution is non-Gaussian. A well-tuned GP remains preferable on smooth, moderate-n surfaces (n < 5,000, d ≤ 4) where the kernel prior provides effective regularization. The two methods can also be combined: GBC-Aug already borrows the GP literature's EM clustering idea, and conformal calibration [Romano et al., 2019] can post-process GBC intervals when formal coverage guarantees are needed.

GBC via IQNs provides a practical surrogate framework that delivers full predictive distributions at O(n) training cost. On the benchmarks studied, it matches or improves upon GP surrogates for non-stationary, high-dimensional, and large-n problems, while GPs retain an advantage on smooth, moderate-n surfaces where their kernel prior provides effective regularization. GBC is not a universal replacement for GPs but a complementary tool whose strengths and weaknesses are predictable from problem structure: discontinuities, high dimensionality, and large training sets favor GBC; smooth functions at small n favor GPs.

References

Mark A. Beaumont, Wenyang Zhang, and David J. Balding. Approximate Bayesian computation in population genetics. Genetics, 162(4):2025–2035, 2002.

Mickaël Binois, Robert B. Gramacy, and Michael Ludkovski. Practical heteroscedastic Gaussian process modeling for large simulation experiments. Journal of Computational and Graphical Statistics, 27(4):808–821, 2018. doi: 10.1080/10618600.2018.1458625.

Alex J. Cannon. Non-crossing nonlinear regression quantiles by monotone composite quantile regression neural network, with application to rainfall extremes. Stochastic Environmental Research and Risk Assessment, 32:3207–3225, 2018. doi: 10.1007/s00477-018-1573-6.

David A. Cohn.
Neural network exploration using optimal experiment design. Advances in Neural Information Processing Systems, 6, 1994.

D. Austin Cole, Ryan Christianson, and Robert B. Gramacy. Locally induced Gaussian processes for large-scale simulation experiments. Statistics and Computing, 32(3):33, 2022.

Andrew Cooper, Annie S. Booth, and Robert B. Gramacy. Modernizing full posterior inference for surrogate modeling of categorical-output simulation experiments. Quality Engineering, 38(1):91–110, 2026. doi: 10.1080/08982112.2025.2552420.

Kyle Cranmer, Johann Brehmer, and Gilles Louppe. The frontier of simulation-based inference. Proceedings of the National Academy of Sciences, 117(30):17199–17207, 2020. doi: 10.1073/pnas.1912789117.

Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. Proceedings of the International Conference on Machine Learning, 2018.

Andreas Damianou and Neil D. Lawrence. Deep Gaussian processes. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 207–215, 2013.

Anna R. Flowers, Christopher T. Franck, Mickaël Binois, Chiwoo Park, and Robert B. Gramacy. Modular jump Gaussian processes. Data Science in Science, 5(1):2612651, 2026. doi: 10.1080/26941899.2025.2612651.

Jerome H. Friedman. Multivariate adaptive regression splines. Annals of Statistics, 19(1):1–67, 1991.

Jan Gasthaus, Konstantinos Benidis, Yuyang Wang, Syama Sundar Rangapuram, David Salinas, Valentin Flunkert, and Tim Januschowski. Probabilistic forecasting with spline quantile function RNNs. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1901–1910, 2019.

Tilmann Gneiting and Adrian E. Raftery. Strictly proper scoring rules, prediction, and estimation.
Journal of the American Statistical Association, 102(477):359–378, 2007.

Robert B. Gramacy. Surrogates: Gaussian Process Modeling, Design, and Optimization for the Applied Sciences. Chapman and Hall/CRC, 2020.

Robert B. Gramacy and Daniel W. Apley. Local Gaussian process approximation for large computer experiments. Journal of Computational and Graphical Statistics, 24(2):561–578, 2015.

Robert B. Gramacy and Herbert K. H. Lee. Bayesian treed Gaussian process models with an application to computer modeling. Journal of the American Statistical Association, 103(483):1119–1130, 2008.

Robert B. Gramacy and Herbert K. H. Lee. Adaptive design and analysis of supercomputer experiments. Technometrics, 51(2):130–145, 2009.

James Hensman, Nicolò Fusi, and Neil D. Lawrence. Gaussian processes for big data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, pages 282–290, 2013.

Maike F. Holthuijzen, Robert B. Gramacy, Cayelan C. Carey, David M. Higdon, and R. Quinn Thomas. Synthesizing data products, mathematical models, and observations for lake temperature forecasting. Annals of Applied Statistics, 19(2), 2025. doi: 10.1214/24-AOAS1999.

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.

Olav Kallenberg. Foundations of Modern Probability. Springer, 1997.

Bonggwon Kang, Chiwoo Park, Hyunwoo Kim, and Soondo Hong. Bayesian optimization for the vehicle dwelling policy in a semiconductor wafer fab. IEEE Transactions on Automation Science and Engineering, 21:5942–5952, 2024. doi: 10.1109/TASE.2024.3354694.

Matthias Katzfuss and Joseph Guinness.
A general framework for Vecchia approximations of Gaussian processes. Statistical Science, 36(1):124–141, 2021.

Marc C. Kennedy and Anthony O'Hagan. Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B, 63(3):425–464, 2001.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

Roger Koenker. Quantile Regression. Cambridge University Press, 2005.

Jeremiah Zhe Liu, Zi Lin, Shreyas Padhy, Dustin Tran, Tania Bedrax-Weiss, and Balaji Lakshminarayanan. Simple and principled uncertainty estimation with deterministic deep learning via distance awareness. In Advances in Neural Information Processing Systems, volume 33, pages 7498–7512, 2020.

Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR), 2017.

Jan-Matthis Lueckmann, Jan Boelts, David S. Greenberg, Pedro J. Gonçalves, and Jakob H. Macke. Benchmarking simulation-based inference. Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 343–351, 2021.

David J. C. MacKay. Information-based objective functions for active data selection. Neural Computation, 4(4):590–604, 1992.

Piyush M. Mehta, Andrew Walker, Earl Lawrence, Richard Linares, David Higdon, and Josef Koller. Modeling satellite drag coefficients with response surfaces. Advances in Space Research, 54(8):1590–1607, 2014.

Max D. Morris and Toby J. Mitchell. Exploratory designs for computational experiments. Journal of Statistical Planning and Inference, 43(3):381–402, 1995.

Jeremy E. Oakley and Anthony O'Hagan. Probabilistic sensitivity analysis of complex models: a Bayesian approach. Journal of the Royal Statistical Society: Series B, 66(3):751–769, 2004.
Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems, volume 31, 2018.

Oscar Hernan Madrid Padilla, Wesley Tansey, and Yanzhen Chen. Quantile regression with ReLU networks: Estimators and minimax rates. Journal of Machine Learning Research, 23(1), 2022.

Bandu N. Pamadi, Gregory J. Brauckmann, Michael J. Ruth, and Henry D. Fuhrmann. Aerodynamic characteristics, database development, and flight simulation of the X-33 vehicle. Technical Report TM-2004-212900, NASA, 2004.

George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1–64, 2021.

Chiwoo Park, Robert Waelder, Bonggwon Kang, Benji Maruyama, Soondo Hong, and Robert B. Gramacy. Active learning of piecewise Gaussian process surrogates. Technometrics, 68(1):186–201, 2026. doi: 10.1080/00401706.2025.2561746.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, volume 32, 2019.

Tim Pearce, Felix Leibfried, Alexandra Brintrup, Mohamed Zaki, and Andy Neely. Uncertainty in neural networks: approximately Bayesian ensembling. In International Conference on Artificial Intelligence and Statistics, pages 234–244. PMLR, 2020.

Nicholas Polson and Vadim Sokolov. Deep learning: Computational aspects. Wiley Interdisciplinary Reviews: Computational Statistics, 12(5):e1500, 2020.
Nicholas Polson, Vadim Sokolov, and Jianeng Xu. Deep learning partial least squares. arXiv preprint arXiv:2106.14085, 2021.

Nick Polson and Vadim Sokolov. Generative AI for Bayesian computation. Entropy, 27(7):683, 2025.

Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

Yaniv Romano, Evan Patterson, and Emmanuel Candès. Conformalized quantile regression. In Advances in Neural Information Processing Systems, volume 32, 2019.

Jerome Sacks, William J. Welch, Toby J. Mitchell, and Henry P. Wynn. Design and analysis of computer experiments. Statistical Science, 4(4):409–423, 1989.

Thomas J. Santner, Brian J. Williams, and William I. Notz. The Design and Analysis of Computer Experiments. Springer, 2nd edition, 2018.

Annie Sauer, Robert B. Gramacy, and David Higdon. Active learning for deep Gaussian process surrogates. Technometrics, 65(1):4–18, 2023. doi: 10.1080/00401706.2021.2008505.

Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with ReLU activation function. Annals of Statistics, 48(4):1875–1897, 2020.

Laura Schultz and Vadim Sokolov. Bayesian optimization for transportation simulators. Procedia Computer Science, 130:973–978, 2018a.

Laura Schultz and Vadim Sokolov. Deep reinforcement learning for dynamic urban transportation problems. arXiv preprint arXiv:1806.05310, 2018b.

Laura Schultz and Vadim Sokolov. Deep learning Gaussian processes for computer models with heteroskedastic and high-dimensional outputs. arXiv preprint arXiv:2209.02163, 2022.

Laura Schultz, Joshua Auld, and Vadim Sokolov. Bayesian calibration for activity based models. arXiv preprint arXiv:2203.04414, 2022.

Vadim Sokolov, Joshua Auld, and Michael Hope. A flexible framework for developing integrated models of transportation systems using an agent-based approach.
Procedia Computer Science, 10:854–859, 2012.

Furong Sun, Robert B. Gramacy, Benjamin Haaland, Earl Lawrence, and Andrew Walker. Emulating satellite drag from large simulation experiments. SIAM/ASA Journal on Uncertainty Quantification, 7(2):720–759, 2019.

Simon Surjanovic and Derek Bingham. Virtual library of simulation experiments: Test functions and datasets. https://www.sfu.ca/~ssurjano, 2013.

Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, pages 567–574, 2009.

Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, 2005.