Conditional neural control variates for variance reduction in Bayesian inverse problems


Authors: Ali Siahkoohi, Hyunwoo Oh

Ali Siahkoohi 1, Hyunwoo Oh 2
1 Department of Computer Science, University of Central Florida
2 Department of Physics and Maryland Center for Fundamental Physics, University of Maryland

Abstract

Bayesian inference for inverse problems involves computing expectations under posterior distributions, e.g., posterior means, variances, or predictive quantities, typically via Monte Carlo (MC) estimation. When the quantity of interest varies significantly under the posterior, accurate estimates demand many samples, a cost often prohibitive for partial differential equation-constrained problems. To address this challenge, we introduce conditional neural control variates, a modular method that learns amortized control variates from joint model–data samples to reduce the variance of MC estimators. To scale to high-dimensional problems, we leverage Stein's identity to design an architecture based on an ensemble of hierarchical coupling layers with tractable Jacobian trace computation. Training requires: (i) samples from the joint distribution of unknown parameters and observed data; and (ii) the posterior score function, which can be computed from physics-based likelihood evaluations, neural operator surrogates, or learned generative models such as conditional normalizing flows. Once trained, the control variates generalize across observations without retraining. We validate our approach on stylized and partial differential equation-constrained Darcy flow inverse problems, demonstrating substantial variance reduction, even when the analytical score is replaced by a learned surrogate.

1 INTRODUCTION

Quantifying uncertainty in the solution of inverse problems is of fundamental importance in many areas of science and engineering [Tarantola, 2005, Stuart, 2010].
Due to noise and ill-posedness, a single point estimate fails to capture the non-uniqueness of the solution. The Bayesian framework addresses this by characterizing the full posterior distribution over solutions consistent with the observed data and prior knowledge. Posterior samples then provide expectations of quantities of interest, e.g., posterior means, variances, and credible intervals. Computing these expectations requires evaluating integrals of the form $\mathbb{E}_{p_{\mathrm{post}}(x \mid y)}[h(x)]$, which are typically intractable in closed form. While Monte Carlo (MC) estimation replaces the integral with a sample average [Robert and Casella, 2004], its variance, $\mathrm{Var}(h(x))/N$ for $N$ i.i.d. samples, can be prohibitively large when $h(x)$ varies significantly under the posterior. In partial differential equation (PDE)-constrained inverse problems, where each posterior sample requires a computationally expensive forward solve, this high variance translates into high computational cost.

Control variates offer a classical remedy [Fieller and Hartley, 1954, Robert and Casella, 2004]: given an auxiliary function $g(x)$ with zero expectation under the target distribution and positive correlation with $h(x)$, the controlled estimator of $\mathbb{E}_{p_{\mathrm{post}}(x \mid y)}[h(x) - g(x)]$ has lower variance while remaining unbiased. Stein's identity [Stein, 1972] provides a systematic way to construct such zero-mean functions from the score function, i.e., the gradient of the log-density, of the target distribution.
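The control-variate mechanism described above can be illustrated with a classic toy example (not the method of this paper): estimating $\mathbb{E}[e^U]$ for $U \sim \mathrm{Uniform}(0,1)$ using the zero-mean control $g(U) = U - 1/2$ and the variance-optimal coefficient $\beta = \mathrm{Cov}(h, g)/\mathrm{Var}(g)$. All quantities here are illustrative, a minimal sketch rather than part of the proposed method.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=200_000)

h = np.exp(u)        # estimand: E[e^U] = e - 1
g = u - 0.5          # control variate with known mean E[g] = 0

# beta = Cov(h, g) / Var(g) minimizes Var(h - beta * g)
beta = np.cov(h, g)[0, 1] / g.var()

plain_est = h.mean()
cv_est = (h - beta * g).mean()
vrf = (h - beta * g).var() / h.var()   # variance reduction factor; here roughly 0.016
```

Both estimators target the same expectation; the controlled one simply has a much smaller per-sample variance because $h$ and $g$ are strongly correlated.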
Building on this principle, control functionals [Oates et al., 2017] and the zero-variance principle [Assaraf and Caffarel, 1999] have demonstrated the effectiveness of score-based control variates for MC integration, and neural networks have been used to parameterize them in gradient estimation [Grathwohl et al., 2018], rendering [Müller et al., 2020], general MC integration [Wan et al., 2020], and lattice field theory [Bedaque and Oh, 2024, Oh, 2025]. While these methods achieve impressive variance reduction, they all require retraining for each new observation $y$, precluding amortization, a crucial limitation when training control variates for every new observation is computationally prohibitive.

To address this limitation, we introduce conditional neural control variates (CNCV), a modular variance reduction method for MC estimation of posterior expectations in Bayesian inverse problems. Our method requires: (i) joint samples $(x, y)$ drawn from the prior predictive distribution, obtained by sampling $x$ from the prior and computing $y = F(x) + \epsilon$; such samples are already available in simulation-based inference pipelines [Cranmer et al., 2020]; and (ii) the posterior score function $\nabla_x \log p(x \mid y)$, which can come from physics-based evaluations, neural operator surrogates [Li et al., 2021], or learned generative models [Rezende and Mohamed, 2015, Song and Ermon, 2019, Ho et al., 2020, Papamakarios et al., 2021, Baldassari et al., 2023]. Once trained, the control variates apply to any new observation $y$ without retraining, integrating seamlessly with simulation-based inference techniques.

A further challenge arises when scaling Stein-based control variates to high dimensions: computing the divergence of a general neural network, as required by the construction in Section 3, involves $O(d)$ backward passes or stochastic trace estimation [Hutchinson, 1990].
To address this, we introduce an ensemble of hierarchical affine coupling layers inspired by HINT [Kruse et al., 2021], each operating on a random permutation of the input dimensions. The triangular Jacobian structure enables exact divergence computation in a single forward pass. Each ensemble member captures different cross-dimension interactions through its permutation, and averaging yields control variates with high correlation across all components of the quantity of interest.

Our main contributions are: (i) we extend neural control variates to the conditional setting, yielding amortized control variates that integrate with any posterior sampler and score source; (ii) we introduce an ensemble of hierarchical coupling layers with random input permutations that enables exact divergence computation; and (iii) we validate CNCV on four inverse problems of increasing complexity (Gaussian, Rosenbrock, nonlinear, and PDE-constrained Darcy flow), demonstrating substantial variance reduction that generalizes across observations, even with learned scores.

2 BAYESIAN INVERSE PROBLEMS

We are concerned with estimating an unknown parameter $x \in \mathcal{X} \subset \mathbb{R}^d$ from noisy observations $y \in \mathcal{Y} \subset \mathbb{R}^m$ related through the forward model $y = F(x) + \epsilon$, where $F : \mathcal{X} \to \mathcal{Y}$ is the forward operator and $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ is additive Gaussian noise. In the Bayesian framework [Tarantola, 2005, Kaipio and Somersalo, 2006], the posterior distribution is given by Bayes' rule:

$$p_{\mathrm{post}}(x \mid y) \propto p_{\mathrm{like}}(y \mid x)\, p_{\mathrm{prior}}(x), \quad (1)$$

where $p_{\mathrm{like}}(y \mid x)$ is the likelihood, $p_{\mathrm{prior}}(x)$ encodes prior beliefs on the unknown, and the normalizing constant is the marginal likelihood $p(y)$. For brevity, we write $p(x \mid y) \equiv p_{\mathrm{post}}(x \mid y)$ throughout. Since the posterior is generally intractable, we must resort to sampling-based methods to extract information from it.
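To make the score decomposition concrete, the following is a minimal sketch for a linear forward operator with Gaussian prior and noise; the operator `A`, the dimensions, and the prior covariance are illustrative assumptions, and the conjugate (Gaussian) posterior of this model provides an exact check.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, sigma = 3, 5, 0.3
A = rng.standard_normal((m, d))    # illustrative linear forward operator F(x) = A x
Sigma_prior = np.eye(d)            # prior x ~ N(0, Sigma_prior)

x = rng.standard_normal(d)
y = A @ x + sigma * rng.standard_normal(m)

def posterior_score(x, y):
    """grad_x log p(x|y) = grad_x log p_like(y|x) + grad_x log p_prior(x)."""
    lik = A.T @ (y - A @ x) / sigma**2
    prior = -np.linalg.solve(Sigma_prior, x)
    return lik + prior

# closed-form check: this posterior is N(mu, P^-1) with precision P
P = A.T @ A / sigma**2 + np.linalg.inv(Sigma_prior)
mu = np.linalg.solve(P, A.T @ y / sigma**2)
assert np.allclose(posterior_score(x, y), -P @ (x - mu))
```

For nonlinear or expensive $F$, the same decomposition holds with the likelihood gradient supplied by adjoints, surrogates, or learned generative models, as discussed in Section 2.1.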
2.1 SAMPLING THE POSTERIOR

Extracting information from the posterior requires access to samples, and many modern posterior sampling methods rely on the score function $\nabla_x \log p(x \mid y)$, either explicitly or implicitly. By Bayes' rule, the score decomposes as $\nabla_x \log p(x \mid y) = \nabla_x \log p_{\mathrm{like}}(y \mid x) + \nabla_x \log p_{\mathrm{prior}}(x)$. Gradient-based MCMC methods, such as MALA and HMC [Roberts and Rosenthal, 1998, Neal, 2011], use the score to construct Markov chains that converge to the posterior, but produce correlated samples and require burn-in. Conditional generative models offer an alternative: conditional normalizing flows (CNFs; Rezende and Mohamed, 2015, Papamakarios et al., 2021) learn an invertible map from a base distribution to the posterior, providing both i.i.d. samples and an explicit density from which the score is obtained by automatic differentiation. Diffusion models [Song and Ermon, 2019, Ho et al., 2020, Baldassari et al., 2023] directly parameterize the score via denoising score matching. When the forward operator $F$ is differentiable, or when its adjoint is available, the likelihood score can also be evaluated analytically [Louboutin et al., 2023]; when $F$ is expensive, a differentiable surrogate [Li et al., 2021] can approximate it. In all cases, the score of the sampling distribution is available as a byproduct. Our control variate construction (Section 3) exploits this observation, requiring only that the score corresponds to the sampling distribution.

2.2 MONTE CARLO ESTIMATION AND CONTROL VARIATES

Once posterior samples are available, the primary computational task is estimating expectations of quantities of interest $h : \mathcal{X} \to \mathbb{R}$:

$$\mathbb{E}_{p_{\mathrm{post}}(x \mid y)}[h(x)] = \int_{\mathcal{X}} h(x)\, p_{\mathrm{post}}(x \mid y)\, \mathrm{d}x. \quad (2)$$

These expectations provide statistical summaries, e.g., the posterior mean ($h(x) = x$), posterior variance ($h(x) = (x - \mu_{\mathrm{post}})^2$), or predictive quantities.
Given $N$ samples $\{x_i\}_{i=1}^N \sim p(x \mid y)$, the MC estimator $\hat{\mathbb{E}}[h(x)] = \frac{1}{N}\sum_{i=1}^N h(x_i)$ is unbiased with variance $\mathrm{Var}(h(x))/N$. When $\mathrm{Var}(h(x))$ is large, accurate estimates demand many samples. A control variate $g : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ satisfying $\mathbb{E}_{p(x \mid y)}[g(x, y)] = 0$ yields the controlled estimator

$$\tilde{\mathbb{E}}[h(x)] = \frac{1}{N}\sum_{i=1}^N \left[ h(x_i) - g(x_i, y) \right], \quad (3)$$

which is also unbiased, with per-sample variance $\mathrm{Var}(h - g) = \mathrm{Var}(h) + \mathrm{Var}(g) - 2\,\mathrm{Cov}(h, g)$. When $g$ is positively correlated with $h$ and has moderate variance, $\mathrm{Var}(h - g) < \mathrm{Var}(h)$, i.e., the control variate reduces the per-sample variance. When the samples $\{x_i\}_{i=1}^N$ are i.i.d. draws from the posterior, e.g., from a conditional generative model or independent importance sampling, the terms $h(x_i) - g(x_i, y)$ are themselves i.i.d., and the estimator variance is $\mathrm{Var}(h - g)/N$. The control variate therefore reduces the constant factor from $\mathrm{Var}(h)$ to $\mathrm{Var}(h - g)$ while preserving the $1/N$ convergence rate. The challenge is constructing such zero-mean functions $g$ systematically; our approach (Section 3) uses the score from the sampling distribution to build $g$ via Stein's identity.

3 CONDITIONAL NEURAL CONTROL VARIATES

The goal of this section is to present the CNCV construction. The key insight is to combine Stein's identity, which guarantees zero expectation by construction, with a hierarchical coupling layer architecture that makes the required divergence computation tractable.

3.1 CONTROL VARIATE CONSTRUCTION VIA STEIN'S IDENTITY

Stein's identity [Stein, 1972] states that for any smooth $\phi : \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}^d$ satisfying appropriate boundary conditions (see Appendix A),

$$\mathbb{E}_{p(x \mid y)}\left[ \nabla_x \cdot \phi(x, y) + \phi(x, y) \cdot \nabla_x \log p(x \mid y) \right] = 0. \quad (4)$$

The above expression immediately provides a family of control variates parameterized by $\phi$:

$$g(x, y) = \nabla_x \cdot \phi(x, y) + \phi(x, y) \cdot \nabla_x \log p(x \mid y), \quad (5)$$

where $\nabla_x \cdot \phi = \sum_{i=1}^d \partial \phi_i / \partial x_i$ denotes the divergence. By construction, $\mathbb{E}_{p(x \mid y)}[g(x, y)] = 0$ for all $y$, ensuring that $g$ can be used as a control variate.

We parameterize $\phi$ by a neural network $\phi_\theta : \mathbb{R}^{d+m} \to \mathbb{R}^d$; substituting into equation (5) yields $g_\theta$. Unfortunately, computing $\nabla_x \cdot \phi_\theta = \sum_{i=1}^d \partial \phi_{\theta,i} / \partial x_i$ for a general neural network requires $d$ backward passes or stochastic trace estimation [Hutchinson, 1990], a cost that becomes prohibitive in high dimensions, motivating the architecture below.

3.2 HIERARCHICAL COUPLING LAYER ARCHITECTURE

To address the computational challenge of the divergence, we parameterize $\phi_\theta$ using the hierarchical invertible neural transport (HINT) architecture of Kruse et al. [2021], which organizes affine coupling layers [Dinh et al., 2017] in a recursive binary tree. At each node, the input is split into two halves $x = [x_1, x_2]$ with $x_1 \in \mathbb{R}^{d_1}$, $x_2 \in \mathbb{R}^{d_2}$, and transformed via an affine coupling:

$$\phi_\theta(x, y) = \begin{bmatrix} x_1 \\ s_\theta(x_1, y) \odot x_2 + t_\theta(x_1, y) \end{bmatrix}. \quad (6)$$

Here, $s_\theta$ and $t_\theta$ are neural networks conditioned on $y$, and $\odot$ denotes elementwise multiplication. In the recursive tree, the upper half is first processed by its subtree, and the resulting output, rather than the raw $x_1$, conditions the coupling of the lower half; the coupled lower half is then processed by its own subtree. This continues until leaf nodes are reached at a prescribed depth. The recursive subdivision ensures that every dimension participates in at least one coupling, producing a lower-triangular Jacobian with all but one diagonal entry learnable.
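As a numerical sanity check on equations (4)–(5), consider a one-dimensional sketch with a standard normal distribution standing in for the posterior (no conditioning on $y$). The choice $\phi(x) = -x$ below is hand-picked for illustration; in CNCV, $\phi_\theta$ is a learned network.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)   # samples from a stand-in 1-D "posterior" N(0, 1)
score = -x                         # d/dx log N(x; 0, 1) = -x

h = x**2                           # quantity of interest, E[h] = 1

# Stein control variate, eq. (5): g = phi'(x) + phi(x) * score(x)
phi = -x                           # hand-picked phi(x) = -x
dphi = -np.ones_like(x)
g = dphi + phi * score             # g = x**2 - 1, so E[g] = 0 under N(0, 1)

plain = h.mean()
controlled = (h - g).mean()        # h - g = 1 exactly: a zero-variance estimator
vrf = (h - g).var() / h.var()
```

For this hand-picked $\phi$ the controlled integrand is constant, so the variance reduction is perfect; for general targets and quantities of interest, CNCV instead learns $\phi_\theta$ to minimize the residual variance.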
At a single coupling node, the triangular structure yields diagonal elements $\partial \phi_{\theta,i} / \partial x_i = 1$ for $i$ in the upper partition and $\partial \phi_{\theta,i} / \partial x_i = s_{\theta,i}(x_1, y)$ for $i$ in the lower partition. The divergence contribution from one node is thus

$$\sum_{i \in x_2} s_{\theta,i}(x_1, y) + d_1, \quad (7)$$

where $d_1 = \dim(x_1)$. The full-tree Jacobian diagonal is computed recursively: $d = [d_{\mathrm{upper}}; s \odot d_{\mathrm{lower}}]$, with $d = 1$ at leaf nodes (see Appendix A). The divergence $\nabla_x \cdot \phi_\theta = \mathbf{1}^\top d$ is exact and obtained in a single forward pass.

Each hierarchical coupling tree $\phi_\theta$ maps $x \in \mathbb{R}^d$ to $\mathbb{R}^d$. Crucially, the full Jacobian diagonal $\mathrm{diag}(\nabla_x \phi_\theta)$ is available from the recursive forward pass, enabling a vector-valued control variate for $h(x) \in \mathbb{R}^k$ using a single network:

$$g_\theta(x, y) = \mathrm{diag}\left( \nabla_x \phi_\theta \right) + \phi_\theta(x, y) \odot \nabla_x \log p(x \mid y), \quad (8)$$

where $\odot$ denotes elementwise multiplication and $k$ denotes the dimension of the quantity of interest. Since the $j$-th component $g_{\theta,j} = \partial \phi_{\theta,j} / \partial x_j + \phi_{\theta,j} \cdot \nabla_{x_j} \log p(x \mid y)$ is exactly the $j$-th term of the product-rule expansion before summing over components (see Appendix A), each component independently has zero expectation: $\mathbb{E}_{p(x \mid y)}[g_\theta] = 0$. A single tree thus produces control variates for all $k$ components simultaneously, without requiring a separate network per component.

While a single tree produces valid control variates, it is limited by its fixed binary subdivision pattern. To improve coverage, we use $L$ independent ensemble members, each with its own parameters and a random permutation $P_\ell$ applied to the input dimensions (outputs are mapped back before computing $g_j$). Since the trace is invariant under conjugation, $\mathrm{tr}(P_\ell J P_\ell^{-1}) = \mathrm{tr}(J)$, each permuted tree still satisfies Stein's identity (Appendix A), and different permutations capture complementary cross-dimension interactions.
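The single-node divergence formula in equation (7) and the permutation invariance of the trace can both be checked numerically. The scale and shift networks below are illustrative stand-ins for $s_\theta$ and $t_\theta$ (with no conditioning on $y$), not the paper's trained networks.

```python
import numpy as np

d1, d2 = 2, 2
rng = np.random.default_rng(1)
W_s = rng.standard_normal((d2, d1))
W_t = rng.standard_normal((d2, d1))

def s(x1):                      # stand-in scale network (smooth, positive)
    return np.exp(0.1 * np.tanh(W_s @ x1))

def t(x1):                      # stand-in shift network
    return W_t @ x1

def coupling(x):                # one affine coupling node, eq. (6) without y
    x1, x2 = x[:d1], x[d1:]
    return np.concatenate([x1, s(x1) * x2 + t(x1)])

def exact_divergence(x):
    # triangular Jacobian: diagonal is [1, ..., 1, s(x1)],
    # so div = d1 + sum_i s_i(x1), matching eq. (7)
    return d1 + s(x[:d1]).sum()

# finite-difference Jacobian as a reference
x = rng.standard_normal(d1 + d2)
eps = 1e-6
J = np.stack([(coupling(x + eps * e) - coupling(x - eps * e)) / (2 * eps)
              for e in np.eye(d1 + d2)], axis=1)

assert abs(np.trace(J) - exact_divergence(x)) < 1e-6
# trace is invariant under conjugation by a permutation (Section 3.2)
perm = rng.permutation(d1 + d2)
assert abs(np.trace(J[np.ix_(perm, perm)]) - np.trace(J)) < 1e-9
```

The exact divergence costs one forward pass regardless of dimension, whereas the finite-difference (or backward-pass) reference scales with $d$.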
The averaged control variate

$$g_{\mathrm{ens}}(x, y) = \frac{1}{L}\sum_{\ell=1}^L g_{\theta_\ell}(x, y) \quad (9)$$

preserves $\mathbb{E}_{p(x \mid y)}[g_{\mathrm{ens}}(x, y)] = 0$; we verify this empirically in Appendix A (Figure 12). We ablate the effect of ensemble size in Section 4.1.4.

3.3 TRAINING

Our purpose is to find parameters $\theta$ that minimize the total variance of the controlled estimator $h - g$. Since $\mathbb{E}_{p(x \mid y)}[g_\theta(x, y)] = 0$ by Stein's identity, we have

$$\mathbb{E}\left[ \| h - g \|^2 \right] = \sum_{j=1}^d \mathrm{Var}(h_j - g_j) + \| \mathbb{E}[h] \|^2, \quad (10)$$

where the second term is constant in $\theta$. Expanding each component gives $\mathrm{Var}(h_j - g_j) = \mathrm{Var}(h_j) + \mathrm{Var}(g_j) - 2\,\mathrm{Cov}(h_j, g_j)$, so minimizing the MSE simultaneously penalizes large $\mathrm{Var}(g_j)$ and rewards high $\mathrm{Cov}(h_j, g_j)$, automatically balancing the two contributions to variance reduction. This yields the training objective:

$$\{\theta_\ell^*\} = \arg\min_{\{\theta_\ell\}} \mathbb{E}\left[ \| h - g_{\mathrm{ens}} \|^2 \right], \quad (11)$$

where the expectation is over $(x, y) \sim p(x, y)$. In practice, the expectation is approximated using $N$ training samples $\{(x_i, y_i)\}_{i=1}^N$ generated by sampling $x_i \sim p_{\mathrm{prior}}(x)$ and computing $y_i = F(x_i) + \epsilon_i$. The score function $\nabla_x \log p(x_i \mid y_i)$ is evaluated at each training pair using any of the sources described in Section 2.1.

3.4 INFERENCE

For a new observation $y_{\mathrm{obs}}$, we draw $M$ posterior samples using any available sampler (cf. line 9 of Algorithm 1) and compute the variance-reduced estimate (line 15):

$$\hat{\mathbb{E}}[h(x)] = \frac{1}{M}\sum_{j=1}^M \left[ h(x_j) - g_{\mathrm{ens}}(x_j, y_{\mathrm{obs}}) \right]. \quad (12)$$

The entire inference procedure involves only neural network forward passes and score evaluations: $L$ coupling-layer evaluations and a score evaluation per sample. When the dominant cost is the forward model, e.g., PDE solves, this overhead is negligible. Crucially, no retraining is required for new observations, maintaining full amortization.

4 NUMERICAL EXPERIMENTS

We validate the proposed CNCV framework on four inverse problems of increasing complexity.
We first consider three stylized problems (Gaussian, Rosenbrock, and nonlinear) with tractable posterior scores, then scale to a PDE-constrained Darcy flow inverse problem where the score must be learned from data. Throughout, we report the variance reduction factor (VRF) $= \mathrm{Var}(h - g)/\mathrm{Var}(h)$, where lower values indicate stronger variance reduction (VRF $= 0$ is perfect, VRF $= 1$ is no improvement). In all experiments, $k = d$, i.e., the quantity of interest has the same dimension as the unknown parameter.

Algorithm 1 Conditional neural control variates (CNCV)
Require: Prior $p_{\mathrm{prior}}$, forward model $F$, score source, posterior sampler, ensemble size $L$, quantity of interest $h$
1: // Offline training
2: for $i = 1, \ldots, N_{\mathrm{train}}$ do
3:   Sample $x_i \sim p_{\mathrm{prior}}(x)$
4:   Compute $y_i = F(x_i) + \epsilon_i$
5:   Evaluate $\nabla_x \log p(x_i \mid y_i)$ from the chosen score source
6: end for
7: Initialize $L$ hierarchical coupling layers with random input permutations
8: Optimize $\{\theta_\ell\}_{\ell=1}^L$ via equation (11)
9: // Online inference (for any new $y_{\mathrm{obs}}$)
10: Draw $M$ posterior samples $\{x_j\}_{j=1}^M \sim p(x \mid y_{\mathrm{obs}})$
11: for $j = 1, \ldots, M$ do
12:   Evaluate the score $\nabla_x \log p(x_j \mid y_{\mathrm{obs}})$
13:   Evaluate $g_{\mathrm{ens}}(x_j, y_{\mathrm{obs}})$ via equation (9)
14: end for
15: return $\hat{\mathbb{E}}[h] = \frac{1}{M}\sum_{j=1}^M \left[ h(x_j) - g_{\mathrm{ens}}(x_j, y_{\mathrm{obs}}) \right]$

Table 1: Variance reduction for Gaussian posteriors. VRF $= \mathrm{Var}(h - g)/\mathrm{Var}(h)$; lower is better. Values are mean ± std across $d$ components, each averaged over 100 observations.

d    Mean Est. VRF     Var. Est. VRF
2    0.030 ± 0.005     0.022 ± 0.008
4    0.040 ± 0.011     0.020 ± 0.012
8    0.076 ± 0.023     0.073 ± 0.043
16   0.150 ± 0.086     0.267 ± 0.258

4.1 STYLIZED INVERSE PROBLEMS

These experiments quantify variance reduction and its effect on estimation accuracy, study how performance scales with dimension and sample size, ablate the ensemble size, and assess sensitivity to learned scores. Full hyperparameters are in Appendix E.
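The VRF reported throughout can be computed from paired evaluations of $h$ and $g$ on the same posterior samples. Below is a small helper with a synthetic, strongly correlated control; the coefficients and sample sizes are illustrative only.

```python
import numpy as np

def vrf(h_vals, g_vals):
    """Variance reduction factor Var(h - g) / Var(h).
    0 is perfect, 1 is no improvement."""
    h_vals = np.asarray(h_vals)
    g_vals = np.asarray(g_vals)
    return np.var(h_vals - g_vals) / np.var(h_vals)

rng = np.random.default_rng(0)
h_vals = rng.standard_normal(50_000)
g_vals = 0.9 * h_vals + 0.1 * rng.standard_normal(50_000)  # strongly correlated control

strong = vrf(h_vals, g_vals)                   # about 0.02
useless = vrf(h_vals, np.zeros_like(h_vals))   # exactly 1: g = 0 changes nothing
```

A VRF of 0.02 means the controlled estimator needs roughly 50 times fewer samples for the same variance.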
4.1.1 Gaussian posterior

Consider the inverse problem $y = x + \epsilon$ with $x \in \mathbb{R}^d$, $x \sim \mathcal{N}(0, \Sigma_{\mathrm{prior}})$, and $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ with $\sigma = 0.3$. The prior covariance matrix is a randomly generated positive definite matrix. The analytical posterior is Gaussian with known mean and covariance. We evaluate (i) posterior mean estimation with $h(x) = x$, and (ii) posterior variance estimation with $h(x) = (x - \mu_{\mathrm{post}})^2$.

[Figure 1: Dimension scaling of CNCV. (a) VRF for mean and variance estimation across $d \in \{2, 4, 8, 16\}$ (lower is better; VRF $= 1$ means no improvement). (b) Correlation between $h$ and $g$ across dimensions.]

[Figure 2: Sample efficiency for $d = 4$ mean estimation. (a) VRF is sample-size invariant at $\sim 0.04$. (b) MSE follows the $1/N$ rate; the CV estimator provides a $\sim 25\times$ effective sample increase. Index $i$ denotes the component.]

Table 1 summarizes results across dimensions. We observe that for mean estimation, CNCV achieves VRF between 0.03 and 0.15 across $d \in \{2, 4, 8, 16\}$, while for variance estimation, VRF remains below 0.08 at $d \le 8$ and reaches 0.27 at $d = 16$ (each entry is averaged over 100 test observations). Figure 1 shows that VRF grows mildly with $d$, while the correlation between $h$ and $g$ remains above 0.96 for mean estimation. The learned control variates achieve per-component correlation $\rho > 0.98$ with the target. Figure 2 shows the sample efficiency for $d = 4$. We observe from panel (a) that the VRF is sample-size invariant at $\sim 0.04$ across sample sizes from 10 to 5k.
Panel (b) shows MSE vs. sample size: while both estimators follow the $1/N$ rate, the CV estimator has a substantially lower constant factor.

4.1.2 Rosenbrock posterior

To assess CNCV on non-Gaussian posteriors, we consider the Rosenbrock distribution, a benchmark with a strongly curved, banana-shaped density, with prior $p(x) \propto \exp(-a(x_1 - \mu)^2 - b(x_2 - x_1^2)^2)$, where $\mu = 0$, $a = 0.5$, $b = 1$. Combined with the Gaussian likelihood $p(y \mid x) = \mathcal{N}(x, \sigma^2 I)$ with $\sigma = 0.3$, the posterior is non-Gaussian and exhibits strong nonlinear correlations between components.

Figure 3 demonstrates amortization across three test observations from diverse regions: $y^{(1)}$ from the left tail, $y^{(2)}$ from the right tail at high $x_2$, and $y^{(3)}$ near the ridge. Each observation (marked by stars) induces a differently positioned banana-shaped posterior. Despite this diversity, the same trained model achieves VRF between 0.08 and 0.23 across all cases (panel d).

[Figure 3: Amortized CNCV on the Rosenbrock posterior. (a–c) Test observations from the left tail, right tail, and ridge (stars) yield diverse posteriors; VRF values are shown in each panel. (d) Per-component VRF for each observation; the same trained model achieves VRF $\in [0.08, 0.23]$ across all cases (lower is better).]

[Figure 4: Posterior variance estimation on the Rosenbrock problem. (a) Per-component VRF: $x_1$ achieves strong reduction while $x_2$ (the banana direction) is harder. (b) Per-observation VRF across 10 test observations (mean 0.44).]
This confirms that the control variate generalizes to any new observation.

Aside from posterior mean estimation, CNCV also reduces variance for posterior variance estimation. Figure 4 uses $h(x) = (x - \mu_{\mathrm{est}})^2$, where $\mu_{\mathrm{est}}$ and the score are obtained from a pretrained CNF (Section 2.1). The first component ($x_1$) achieves VRF $= 0.08$, while the banana direction ($x_2$), where the posterior exhibits the strongest nonlinear correlations, is harder (VRF $= 0.79$), yielding a mean VRF of 0.44 across 10 test observations.

4.1.3 Nonlinear forward model

We next consider the nonlinear forward model $F(x) = Ax + \sin(x)$ in $d = 4$, where $A$ has condition number 2. Figure 5 shows the non-Gaussian posteriors for three test observations. At test time, posterior samples come from a denoising diffusion probabilistic model [Ho et al., 2020] trained on the same joint distribution, providing fast amortized sampling. CNCV achieves VRF $= 0.57 \pm 0.29$ across 10 test observations. Figure 6 shows the per-component and per-observation VRFs.

[Figure 5: Corner plot of the nonlinear posterior ($d = 4$) for three test observations (blue, red, green). Contours show 50% and 90% HDRs; stars mark the true parameters.]

[Figure 6: CNCV variance reduction on the nonlinear forward model ($d = 4$). (a) Per-component VRF averaged over 10 test observations. (b) Per-observation VRF (mean 0.57).]

4.1.4 Training efficiency and ensemble size

Figure 7(a) shows VRF versus total training samples seen for both the Gaussian ($d = 4$) and Rosenbrock ($d = 2$) problems.
We observe that both converge rapidly: the Gaussian VRF drops from $\sim 19$ to below 0.03, and the Rosenbrock from $\sim 277$ to below 0.14, within the first few passes over the dataset. Figure 7(b) shows VRF as a function of the ensemble size $L$. A single member ($L = 1$) yields almost no variance reduction; however, two members already reduce VRF by two orders of magnitude on the Gaussian problem, and performance saturates beyond $L \approx 8$. This motivates the default $L = 16$ used throughout our experiments.

[Figure 7: (a) VRF versus total training samples seen. (b) VRF versus ensemble size $L$. Shading shows $\pm 1$ standard deviation across test observations. Both panels show Gaussian $d = 4$ and Rosenbrock $d = 2$ mean estimation.]

4.1.5 Sensitivity to learned scores

All preceding experiments used posterior scores computed directly from the known likelihood and prior. In practice, however, the score must often be learned from data, particularly for inverse problems governed by expensive PDEs, where amortized inference is often the only computationally viable option. To assess robustness to score approximation error, we train a CNF to approximate $p(x \mid y)$ via negative log-likelihood on joint samples drawn from $p(x, y)$. The score is then obtained by automatic differentiation of the change-of-variables log-density (CNF architecture details are in Appendix E). As discussed in Section 2.1, using the CNF score together with CNF samples yields an estimator unbiased for $\mathbb{E}_{p_\varphi}[h(x)]$, i.e., for expectations under the learned posterior rather than the true posterior. MSE in Figure 8 is accordingly computed against the CNF sample mean.

Table 2: Analytical vs. CNF-learned score comparison. VRF $= \mathrm{Var}(h - g)/\mathrm{Var}(h)$; lower is better. Gaussian and nonlinear: mean ± std across components; Rosenbrock: range over three test observations. Each ensemble is trained and evaluated with the same score source.

Problem         Score    VRF              Red. (%)   ρ
Gauss. d = 4    Analyt.  0.040 ± 0.011    96         > 0.99
Gauss. d = 4    CNF      0.058 ± 0.002    94         0.98
Rosenb. d = 2   Analyt.  0.08 – 0.23      77–92      —
Rosenb. d = 2   CNF      0.010 ± 0.002    99         > 0.99
Nonlin. d = 4   Analyt.  0.57 ± 0.29      43         —
Nonlin. d = 4   CNF      0.57 ± 0.26      43         —

[Figure 8: Sample efficiency with CNF-learned scores. (a) VRF is sample-size invariant. (b) The CV estimator (solid) provides a constant-factor MSE improvement over raw MC (dashed). MSE plateaus at large $N$ due to residual bias from the CNF score approximation.]

[Figure 9: Sample efficiency for Darcy flow. (a) VRF is sample-size invariant at $\sim 0.18$. (b) MSE of the posterior mean estimator follows $1/N$ for both raw MC and CNCV, with the CV estimator achieving a $\sim 5.5\times$ constant-factor improvement. At large $N$, both curves plateau because the reference mean is estimated from a finite CNF sample pool.]

Table 2 compares analytical and CNF-learned scores on all three benchmarks. On the Gaussian problem ($d = 4$), the CNF score increases VRF from 0.040 to 0.058, a modest degradation consistent with score approximation error. On the Rosenbrock problem ($d = 2$), the CNF score achieves a VRF of $0.010 \pm 0.002$ with $\rho > 0.99$, lower than the analytical-score VRF (0.08–0.23), likely because the CNF provides i.i.d. samples, whereas the analytical-score experiment relies on correlated MCMC draws. On the nonlinear problem ($d = 4$), the CNF score achieves VRF $= 0.57 \pm 0.26$, virtually identical to the analytical score ($0.57 \pm 0.29$), despite 17.5% relative score error. These results suggest that the coupling layer architecture is robust to moderate score approximation errors. Figure 8 confirms that VRF remains sample-size invariant with CNF-learned scores, and the CV estimator maintains a constant-factor MSE improvement over raw MC across all sample sizes.

4.2 PDE-CONSTRAINED INVERSE PROBLEM

We now apply CNCV to a higher-dimensional, PDE-constrained problem: estimating the log-permeability field in a groundwater flow model governed by Darcy's equation [Darcy, 1856], $-\nabla \cdot (e^{u(s)} \nabla p(s)) = 0$ on $D = [0, 1]^2$, with mixed boundary conditions [Beskos et al., 2017]. The unknown log-permeability $u(s)$ is observed through $m = 33$ pressure sensors with Gaussian noise ($\sigma_y = 0.01$). The PDE forward model costs $\sim 0.5$ s per solve, the posterior is high-dimensional, and the score must be learned from data. Following Beskos et al. [2017], the prior on $u$ is a Gaussian random field (GRF) represented via a cosine Karhunen–Loève (KL) expansion with truncation $K = 10$ per dimension, giving $u(s) = \sum_{j=1}^{100} z_j \sqrt{\lambda_j}\, \psi_j(s)$, where $\lambda_j$ are the GRF eigenvalues and $\psi_j$ are cosine eigenfunctions. The training data consist of $d = K^2 = 100$ KL coefficients $z \in \mathbb{R}^{100}$ obtained by expanding GRF samples, paired with the corresponding observations.
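The cosine KL reconstruction $u(s) = \sum_j z_j \sqrt{\lambda_j}\, \psi_j(s)$ is linear in the coefficients $z$. The sketch below uses illustrative tensor-product cosine eigenfunctions and a hypothetical power-law eigenvalue decay, not the paper's exact GRF spectrum.

```python
import numpy as np

K, n = 10, 40                      # K x K = 100 KL modes, n x n grid
s = (np.arange(n) + 0.5) / n       # cell-centred grid on [0, 1]

# illustrative cosine eigenfunctions and an assumed power-law eigenvalue decay
j = np.arange(K)
psi_1d = np.cos(np.pi * np.outer(j, s))          # shape (K, n)
lam = 1.0 / (1.0 + np.add.outer(j, j)**2)        # hypothetical lambda_{jk}, shape (K, K)

def reconstruct(z):
    """u(s) = sum_{jk} z_{jk} sqrt(lambda_{jk}) psi_j(s1) psi_k(s2)."""
    coeff = z.reshape(K, K) * np.sqrt(lam)
    return psi_1d.T @ coeff @ psi_1d              # (n, n) field on the grid

rng = np.random.default_rng(0)
z = rng.standard_normal(K * K)     # KL coefficients: the unknowns in the inverse problem
u = reconstruct(z)
```

Because the reconstruction is linear in $z$, pointwise posterior statistics of the field follow directly from those of the coefficients, which is also why the posterior standard deviation in Section 4.2 is governed by sensor geometry rather than the specific observation.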
The forward model reconstructs the field from $z$, solves the PDE on a $40 \times 40$ grid using Devito [Louboutin et al., 2019, Luporini et al., 2020], and extracts pressures at the sensor locations. Since the posterior score is not available in closed form for this nonlinear PDE, we train a CNF on 120k joint $(z, y)$ samples to serve as both posterior sampler and score source (Appendix C verifies the quality of the learned score). The CNCV ensemble uses $L = 16$ depth-3 trees; full hyperparameters are in Appendix E.

Figure 9 confirms that the VRF is sample-size invariant at $\sim 0.18$, with the CV estimator achieving $\sim 5.5\times$ lower MSE, an effective $5.5\times$ sample size increase on this 100-dimensional problem. Figure 10 shows the posterior summary for a held-out observation: the CNCV mean closely matches the truth, with posterior uncertainty concentrated in the domain interior, away from sensors and Dirichlet boundaries. Since the reconstruction is linear in the KL coefficients, the posterior standard deviation is governed by the sensor geometry rather than the specific observation values, and is therefore consistent across test cases (Appendix D). As observed from Figure 11, CNCV reduces estimator standard deviation most strongly along large-scale structures; per-observation VRF is tightly concentrated around 0.18 across 20 test observations.

5 RELATED WORK

Stein-based control variates have a rich history in MC variance reduction. The zero-variance principle of Assaraf and Caffarel [1999] inspired kernel-based control functionals [Oates et al., 2017, 2019, South et al., 2022] and polynomial/regularized MCMC estimators [Mira et al., 2013, South et al., 2023], while Si et al. [2022] unified polynomial, kernel, and neural function classes under a scalable Stein objective.
Neural networks have also been used to parameterize control variates outside the Stein framework—for gradient estimation [Grathwohl et al., 2018], rendering [Müller et al., 2020], general MC integration [Wan et al., 2020], and vector-valued [Sun et al., 2023a] and meta-learning [Sun et al., 2023b] settings—and within it, for lattice field theory [Bedaque and Oh, 2024, Oh, 2025]. However, all of these methods are non-amortized: each must be fitted for a specific target distribution, requiring MCMC samples and an O(n³) kernel solve [Oates et al., 2017], or least-squares regression on MCMC output [Mira et al., 2013].

In contrast, amortized methods train a single model that generalizes across observations [Blei et al., 2017]. Conditional normalizing flows [Rezende and Mohamed, 2015, Papamakarios et al., 2021] and simulation-based inference [Cranmer et al., 2020, Radev et al., 2020] learn density estimators p_φ(x|y) from joint samples, with extensions to inverse problems [Siahkoohi et al., 2022, 2023, Orozco et al., 2025]; importance sampling can further reduce variance through better proposals [Dax et al., 2023], complementary to control variates. However, to our knowledge, no prior work has proposed amortized control variates—observation-conditioned functions g(x, y) with zero posterior expectation that generalize to unseen y without retraining.

Figure 10: Darcy flow posterior summary. (a) Forward PDE pressure p† with sensor locations (open circles); (b) true log-permeability u†; (c) CNCV posterior mean from N = 5,000 CNF samples; (d) pointwise posterior standard deviation. All fields reconstructed at 128 × 128 from KL coefficients.

Figure 11: Variance of the mean estimator on the Darcy flow problem. (a, b) Pixelwise standard deviation of the posterior mean estimator ˆμ = (1/N) Σ_i x_i for vanilla MC and CNCV; (c) std ratio (MC / CNCV), where values > 1 indicate that CNCV reduces estimator variance; (d) per-observation VRF across 20 held-out test observations (mean = 0.18).

6 DISCUSSION

The proposed method integrates seamlessly with simulation-based inference techniques [Cranmer et al., 2020], which already produce joint samples (x_i, y_i) and learned posterior approximations from which the score is readily available. CNCV training consumes exactly these existing artifacts, adding no extra forward-model evaluations, and the inference-time cost of evaluating the ensemble of coupling layers is comparable to a single shallow network evaluation. At the same time, CNCV also accommodates physics-based scores computed directly from the likelihood and prior, making it applicable to problems where high-fidelity forward evaluations are available and accurate score computation is paramount. This flexibility makes CNCV attractive across a wide range of settings—from problems constrained by partial differential equations, where each forward solve dominates the computational budget and any variance reduction translates directly into proportional savings, to lower-cost problems where the analytical score is readily accessible.
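To make the savings concrete, the sketch below compares the plain MC mean with a Stein control-variate estimator on a one-dimensional Gaussian toy target with analytical score; a least-squares-fitted constant ϕ stands in for the paper's trained conditional network:

```python
import numpy as np

rng = np.random.default_rng(1)
m, s, N, reps = 2.0, 1.5, 200, 500   # toy target N(m, s^2); N samples per repeat

def score(x):                        # d/dx log N(x | m, s^2)
    return -(x - m) / s**2

mc, cv = [], []
for _ in range(reps):
    x = rng.normal(m, s, N)
    # Stein control variate with constant phi(x) = c: g = c * score(x) has
    # zero mean by Stein's identity. Here c is fitted in-sample by least
    # squares; in the paper, phi is a conditional network trained on (x, y).
    g0 = score(x)
    c = np.dot(x - x.mean(), g0) / np.dot(g0, g0)
    mc.append(x.mean())
    cv.append(np.mean(x - c * g0))

mc, cv = np.array(mc), np.array(cv)
vrf = cv.var() / mc.var()            # variance reduction factor; VRF < 1 is a win
```

For this toy target the optimal constant is c = −s², which makes x − c·g(x) exactly constant, so the VRF is far below 1; in general the achievable VRF depends on how well ϕ can be fit.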
It is worth noting that CNCV reduces the variance of posterior expectations computed from a given set of samples, but does not alter the samples themselves or improve the underlying posterior approximation. While the quality of the control variate depends on score accuracy, our normalizing flow-based experiments demonstrate that learned scores yield variance reduction comparable to that of analytical scores. Since Stein's identity requires the score to correspond to the sampling distribution, it is crucial that the score and posterior samples originate from the same source—e.g., both from a CNF—to preserve the zero-mean property of the control variate. For simple posteriors, polynomial control variates [Mira et al., 2013] can match CNCV for a single observation; however, the value of CNCV lies in non-Gaussian settings where amortization across observations is needed. As with any amortized method, we expect a gap relative to observation-specific optimization [Orozco et al., 2025]; CNCV is complementary to such refinement strategies and can be combined with them to further reduce estimator variance.

Aside from the inverse problems considered above, the principle of learning amortized, Stein-based control variates is applicable to other domains where Monte Carlo estimation is a computational bottleneck—e.g., variance reduction of policy gradient estimators in reinforcement learning [Greensmith et al., 2004], scaling to higher-dimensional fields by operating in latent spaces learned by functional autoencoders [Bunker et al., 2025], and amortized uncertainty quantification for digital twins [Herrmann, 2023].

Related material

The Darcy flow forward solver relies on the Devito [Louboutin et al., 2019, Luporini et al., 2020] finite-difference compiler via a lightweight groundwater solver available at github.com/luqigroup/groundwater.
Prior samples from the corrected Gaussian random field are generated using the code available at github.com/luqigroup/grf. Code to partially reproduce the results presented in this work is available at github.com/luqigroup/cncv.

Acknowledgements

Ali Siahkoohi acknowledges support from the Institute for Artificial Intelligence at the University of Central Florida.

References

Roland Assaraf and Michel Caffarel. Zero-variance principle for Monte Carlo algorithms. Physical Review Letters, 83(23):4682, 1999.

Lorenzo Baldassari, Ali Siahkoohi, Josselin Garnier, Knut Sølna, and Maarten V. de Hoop. Conditional score-based diffusion models for Bayesian inference in infinite dimensions. In Advances in Neural Information Processing Systems, volume 36, pages 24262–24290, 2023.

Paulo F. Bedaque and Hyunwoo Oh. Leveraging neural control variates for enhanced precision in lattice field theory. Physical Review D, 109:094519, 2024.

Alexandros Beskos, Mark Girolami, Shiwei Lan, Patrick E. Farrell, and Andrew M. Stuart. Geometric MCMC for infinite-dimensional inverse problems. Journal of Computational Physics, 335:327–351, 2017.

David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: a review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.

Justin Bunker, Mark Girolami, Hefin Lambley, Andrew M. Stuart, and T. J. Sullivan. Autoencoders in function space. Journal of Machine Learning Research, 26(165):1–54, 2025.

Kyle Cranmer, Johann Brehmer, and Gilles Louppe. The frontier of simulation-based inference. Proceedings of the National Academy of Sciences, 117(48):30055–30062, 2020.

Henry Darcy. Les fontaines publiques de la ville de Dijon. Victor Dalmont, Paris, 1856.

Maximilian Dax, Stephen R. Green, Jonathan Gair, Michael Pürrer, Jonas Wildberger, Jakob H. Macke, Alessandra Buonanno, and Bernhard Schölkopf.
Neural importance sampling for rapid and reliable gravitational-wave inference. Physical Review Letters, 130:171403, 2023.

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. In International Conference on Learning Representations, 2017.

E. C. Fieller and H. O. Hartley. Sampling with control variables. Biometrika, 41:494–501, 1954.

Will Grathwohl, Dami Choi, Yuhuai Wu, Geoffrey Roeder, and David Duvenaud. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. In International Conference on Learning Representations, 2018.

Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5:1471–1530, 2004.

Felix J. Herrmann. Digital twins in the era of generative AI. The Leading Edge, 42(11):730–732, 2023.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020.

Michael F. Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics – Simulation and Computation, 19(2):433–450, 1990.

Jari Kaipio and Erkki Somersalo. Statistical and Computational Inverse Problems. Springer, 2006.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Jakob Kruse, Gianluca Detommaso, Robert Scheichl, and Ullrich Köthe. HINT: Hierarchical invertible neural transport for density estimation and Bayesian inference. In AAAI Conference on Artificial Intelligence, volume 35, pages 8191–8199, 2021.

Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations.
In International Conference on Learning Representations, 2021.

M. Louboutin, M. Lange, F. Luporini, N. Kukreja, P. A. Witte, F. J. Herrmann, P. Velesko, and G. J. Gorman. Devito (v3.1.0): an embedded domain-specific language for finite differences and geophysical exploration. Geoscientific Model Development, 12(3):1165–1187, 2019.

Mathias Louboutin, Ziyi Yin, Rafael Orozco, Thomas J. Grady II, Ali Siahkoohi, Gabrio Rizzuti, Philipp A. Witte, Olav Møyner, Gerard J. Gorman, and Felix J. Herrmann. Learned multiphysics inversion with differentiable programming and machine learning. The Leading Edge, 42(7):474–486, 2023.

Fabio Luporini, Mathias Louboutin, Michael Lange, Navjot Kukreja, Philipp Witte, Jan Hückelheim, Charles Yount, Paul H. J. Kelly, Felix J. Herrmann, and Gerard J. Gorman. Architecture and performance of Devito, a system for automated stencil computation. ACM Transactions on Mathematical Software, 46(1):6:1–6:28, 2020.

Antonietta Mira, Reza Solgi, and Daniele Imparato. Zero variance Markov chain Monte Carlo for Bayesian estimators. Statistics and Computing, 23(5):653–662, 2013.

Thomas Müller, Fabrice Rousselle, Alexander Keller, and Jan Novák. Neural control variates. ACM Transactions on Graphics, 39(6):243:1–243:19, 2020.

Radford M. Neal. MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo. Chapman and Hall/CRC, 2011.

Chris J. Oates, Mark Girolami, and Nicolas Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B, 79(3):695–718, 2017.

Chris J. Oates, Jon Cockayne, François-Xavier Briol, and Mark Girolami. Convergence rates for a class of estimators based on Stein's method. Bernoulli, 25(2):1141–1159, 2019.

Hyunwoo Oh. Training neural control variates using correlated configurations. Physical Review D, 112(7):074501, 2025.

Rafael Orozco, Ali Siahkoohi, Mathias Louboutin, and Felix J. Herrmann.
ASPIRE: iterative amortized posterior inference for Bayesian inverse problems. Inverse Problems, 41(4):045001, 2025.

George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1–64, 2021.

Stefan T. Radev, Ulf K. Mertens, Andreas Voss, Lynton Ardizzone, and Ullrich Köthe. BayesFlow: Learning complex stochastic models with invertible neural networks. IEEE Transactions on Neural Networks and Learning Systems, 33(4):1452–1466, 2020.

Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, 2015.

Christian P. Robert and George Casella. Monte Carlo Statistical Methods. Springer, 2nd edition, 2004.

Gareth O. Roberts and Jeffrey S. Rosenthal. Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B, 60(1):255–268, 1998.

Shijing Si, Chris J. Oates, Andrew B. Duncan, Lawrence Carin, and François-Xavier Briol. Scalable control variates for Monte Carlo methods via stochastic optimization. In Monte Carlo and Quasi-Monte Carlo Methods, pages 205–221, 2022.

Ali Siahkoohi, Gabrio Rizzuti, and Felix J. Herrmann. Deep Bayesian inference for seismic imaging with tasks. Geophysics, 87(5):S281–S302, 2022.

Ali Siahkoohi, Gabrio Rizzuti, Rafael Orozco, and Felix J. Herrmann. Reliable amortized variational inference with physics-based latent distribution correction. Geophysics, 88(3):R297–R322, 2023.

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, 2019.

Leah F. South, Toni Karvonen, Christopher Nemeth, Mark Girolami, and Chris J. Oates. Semi-exact control functionals from Sard's method. Biometrika, 109(2):351–367, 2022.

Leah F.
South, Chris J. Oates, Antonietta Mira, and Christopher Drovandi. Regularized zero-variance control variates. Bayesian Analysis, 18(3):865–888, 2023.

Charles Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, volume 2, pages 583–602, 1972.

Andrew M. Stuart. Inverse problems: a Bayesian perspective. Acta Numerica, 19:451–559, 2010.

Zhuo Sun, Alessandro Barp, and François-Xavier Briol. Vector-valued control variates. In International Conference on Machine Learning, 2023a.

Zhuo Sun, Chris J. Oates, and François-Xavier Briol. Meta-learning control variates: variance reduction with limited data. In Conference on Uncertainty in Artificial Intelligence, 2023b.

Albert Tarantola. Inverse Problem Theory and Methods for Model Parameter Estimation. SIAM, 2005.

Ruosi Wan, Mingjun Zhong, Haoyi Xiong, and Zhanxing Zhu. Neural control variates for Monte Carlo variance reduction. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2020.

A DERIVATION OF STEIN'S IDENTITY FOR POSTERIOR DISTRIBUTIONS

We provide a self-contained derivation of Stein's identity (equation (4)) in the context of posterior distributions arising from Bayesian inverse problems. The derivation relies on the divergence theorem in R^d and the product rule for divergences.

A.1 SETUP AND REGULARITY CONDITIONS

Let p(x|y) denote the posterior density over x ∈ R^d for a fixed observation y ∈ R^m, and let ϕ(x, y) : R^d × R^m → R^d be a smooth vector-valued function. We require:

1. p(x|y) is continuously differentiable and strictly positive on R^d;
2. ϕ(·, y) is continuously differentiable for each y;
3. the boundary condition lim_{∥x∥→∞} ϕ(x, y) p(x|y) = 0 holds, ensuring that the surface integral at infinity vanishes.
The boundary condition is satisfied whenever p(x|y) decays sufficiently fast—e.g., when the prior is Gaussian or has sub-Gaussian tails—and ϕ grows at most polynomially.

A.2 DERIVATION VIA INTEGRATION BY PARTS

Consider the integral

    I = ∫_{R^d} ∇_x · ( ϕ(x, y) p(x|y) ) dx,    (13)

where ∇_x · v = Σ_{i=1}^d ∂v_i/∂x_i denotes the divergence. By the divergence theorem applied to the ball B_R = {x : ∥x∥ ≤ R},

    ∫_{B_R} ∇_x · ( ϕ p ) dx = ∮_{∂B_R} ( ϕ p ) · n̂ dS,

where n̂ is the outward normal to the sphere ∂B_R. By the boundary condition (3), the surface integral vanishes as R → ∞, giving

    I = ∫_{R^d} ∇_x · ( ϕ(x, y) p(x|y) ) dx = 0.    (14)

A.3 EXPANDING VIA THE PRODUCT RULE

We now expand the divergence in the above expression using the product rule. For each component i:

    ∂/∂x_i ( ϕ_i(x, y) p(x|y) ) = (∂ϕ_i/∂x_i) p(x|y) + ϕ_i(x, y) ∂p(x|y)/∂x_i.

Summing over i = 1, …, d and using the identity ∇p/p = ∇ log p:

    ∇_x · ( ϕ p ) = ( Σ_{i=1}^d ∂ϕ_i/∂x_i ) p + ( Σ_{i=1}^d ϕ_i ∂p/∂x_i )
                  = ( ∇_x · ϕ ) p + ( ϕ · ∇_x log p ) p.

Substituting into equation (14) and dividing by p inside the expectation:

    E_{p(x|y)}[ ∇_x · ϕ(x, y) + ϕ(x, y) · ∇_x log p(x|y) ] = 0,    (15)

which is exactly equation (4). This establishes that g(x, y) = ∇_x · ϕ + ϕ · ∇_x log p(x|y) has zero expectation under the posterior for any ϕ satisfying the regularity conditions.

Component-wise identity. Since the divergence theorem applies to each component of ϕp independently, the identity also holds per component before summing:

    E_{p(x|y)}[ ∂ϕ_j/∂x_j + ϕ_j ∂ log p(x|y)/∂x_j ] = 0,    j = 1, …, d.

This justifies the vector-valued control variate in equation (8): each component g_{θ,j} independently has zero expectation, so E[g_θ] = 0 without requiring the components to sum.
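As a sanity check, identity (15) can be verified numerically; the sketch below uses a Gaussian posterior with analytical score and an arbitrary smooth ϕ (a sine plus a linear term, chosen only to satisfy the polynomial-growth condition):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 4, 200_000
mu = np.array([0.5, -1.0, 2.0, 0.0])
x = rng.normal(mu, 1.0, size=(N, d))        # samples from p(x|y) = N(mu, I)

def phi(x):                                  # smooth test function R^d -> R^d
    return np.sin(x) + 0.1 * x               # bounded + linear: boundary cond. holds

def div_phi(x):                              # exact divergence of phi
    return (np.cos(x) + 0.1).sum(axis=1)

score = -(x - mu)                            # grad_x log N(x | mu, I)
g = div_phi(x) + (phi(x) * score).sum(axis=1)
print(abs(g.mean()))                         # zero up to Monte Carlo error
```

The sample mean of g shrinks at the usual O(N^{-1/2}) rate, mirroring the empirical check in Appendix A.6.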
A.4 APPLICATION TO COUPLING LAYER ARCHITECTURES

For our coupling layer parameterization (equation (6)), the divergence takes the explicit form (7). The triangular structure means ∇_x · ϕ_θ involves only the diagonal of the Jacobian, which equals 1 for untransformed components and s_{θ,i}(x_1, y) for transformed components. The resulting control variate is

    g_θ(x, y) = Σ_{i ∈ x_2} s_{θ,i}(x_1, y) + d_1  +  ϕ_θ(x, y) · ∇_x log p(x|y),

where the first two terms form the (exact) divergence, the last term is the score term, and d_1 = dim(x_1). This control variate satisfies E_{p(x|y)}[g_θ(x, y)] = 0 exactly, with no approximation in the divergence computation. This unbiasedness guarantee holds regardless of the neural network parameters θ (provided the boundary conditions are met), and is preserved under ensemble averaging (equation (9)).

For the hierarchical coupling layer, the recursive binary tree structure computes the Jacobian diagonal as

    diag(∇_x ϕ) = [ d_upper ; s ⊙ d_lower ],

where d_upper and d_lower are the diagonals from the upper and lower subtrees, and s is the scale vector from the coupling at the current node. At leaf nodes, d = 1. The divergence is then tr(∇_x ϕ) = 1^⊤ d, which remains analytically computable and exact.

A.5 BOUNDARY CONDITIONS FOR POSTERIOR DISTRIBUTIONS IN INVERSE PROBLEMS

We verify that condition (3) holds in typical inverse problem settings. For a Gaussian prior p_prior(x) = N(μ_0, Σ_0) and Gaussian likelihood p(y|x) = N(x, σ²I), the posterior satisfies

    p(x|y) ≤ C exp( −½ (x − μ_0)^⊤ Σ_0^{−1} (x − μ_0) )

for some C > 0, since the likelihood is bounded above by a constant. This Gaussian tail decay ensures ϕ(x, y) p(x|y) → 0 as ∥x∥ → ∞ whenever ϕ grows at most polynomially—which holds for neural networks with bounded (tanh, sigmoid) or polynomially growing (ReLU) activations.
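A minimal sketch of the recursive Jacobian-diagonal computation from Section A.4, with scalar tanh-based scale and shift functions standing in for the conditional networks s_θ and t_θ (illustrative assumptions, not the paper's architecture):

```python
import numpy as np

def tree_phi(x, depth, w=0.5):
    """Hierarchical coupling phi(x) with exact Jacobian diagonal.
    Returns (phi(x), diag of d phi / d x). The tanh scale and linear shift,
    functions of the upper half x1 only, stand in for the conditional MLPs."""
    if depth == 0 or x.size == 1:
        return x, np.ones_like(x)               # leaf: identity map, d = 1
    d1 = x.size // 2
    x1, x2 = x[:d1], x[d1:]
    up, d_up = tree_phi(x1, depth - 1, w)       # recurse on upper subtree
    lo, d_lo = tree_phi(x2, depth - 1, w)       # recurse on lower subtree
    s = 1.5 + np.tanh(w * x1.sum()) * np.ones_like(lo)   # scale s(x1) (> 0)
    t = w * x1.sum() * np.ones_like(lo)                  # shift t(x1)
    out = np.concatenate([up, s * lo + t])
    diag = np.concatenate([d_up, s * d_lo])     # diag(J) = [d_upper ; s * d_lower]
    return out, diag

x = np.linspace(-1.0, 1.0, 8)
phi_x, diag = tree_phi(x, depth=3)
div = diag.sum()                                # tr(grad phi) = 1^T d, exact
```

Because the scale and shift at each node depend only on x1, their contribution to the Jacobian is strictly off-diagonal, which is why the diagonal concatenation above is exact rather than an estimate.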
For non-Gaussian priors such as the Rosenbrock density exp( −a(x_1 − μ)² − b(x_2 − x_1²)² ), the super-exponential decay in all directions similarly ensures the boundary conditions hold.

A.6 EMPIRICAL VERIFICATION

Figure 12 provides an empirical check of Stein's identity. For each of 250 test observations y, we draw N = 5,000 posterior samples and compute the dimension-averaged sample mean of the control variate, (1/(Nd)) Σ_{i,j} g_j(x_i, y), which should be zero by construction. Across all four experiments—Gaussian (d = 4, analytical score), Rosenbrock (d = 2), nonlinear (d = 4), and Darcy (d = 100), the latter three with learned scores—this quantity is tightly concentrated around zero (mean ≈ 0, std ≈ 10⁻³), confirming that the zero-mean guarantee holds in practice.

B PER-COMPONENT VARIANCE REDUCTION ON DARCY FLOW

Figure 13 shows the per-component VRF for each of the 100 KL coefficients, sorted by decreasing prior eigenvalue. Two regimes are visible:

Low-frequency modes (components 1–30). These modes correspond to large prior eigenvalues λ_j and carry most of the prior variance. Because the data strongly constrain these components, the posterior variance Var(z_j | y) is already much smaller than the prior variance λ_j². The control variate must match the posterior mean function E[z_j | y]—a nonlinear function of the observation—which is harder for the leading modes where the data is most informative. This yields VRF close to 1.

High-frequency modes (components 60–100). These modes have small prior eigenvalues and are barely constrained by the data: p(z_j | y) ≈ p(z_j) = N(0, 1). Here the posterior score is dominated by the prior score ∇_{z_j} log p(z_j) = −z_j, which the coupling layers can approximate easily, yielding VRF as low as 0.04.

The component-averaged VRF is approximately 0.16, while the per-observation VRF reported in Section 4.2 (∼0.18) aggregates differently across samples. In both cases, the overall reduction is substantial because the majority of components (those with smaller eigenvalues) achieve very low VRF.

Figure 12: Empirical verification of Stein's identity. Each panel shows the distribution of (1/(Nd)) Σ_{i,j} g_j(x_i, y) across 250 test observations (N = 5,000 samples each); panel means: Gaussian −0.0002, Rosenbrock −0.0001, nonlinear 0.0000, Darcy flow −0.0000. All four distributions are centered at zero, confirming the zero-mean guarantee.

Figure 13: Per-KL-component VRF on the Darcy flow problem (d = 100), sorted by decreasing prior eigenvalue λ_j (orange curve, right axis). High-frequency modes (small λ_j, right side) achieve VRF as low as 0.04; low-frequency modes (large λ_j, left side) have VRF ≈ 1. The dashed line marks VRF = 1 (no improvement).

C CNF QUALITY DIAGNOSTICS FOR DARCY FLOW

Since the Darcy flow experiment relies entirely on a learned CNF for both posterior sampling and score computation, we provide diagnostics verifying the quality of the trained CNF (Section 4.2).

Figure 14 projects the 100-dimensional posterior onto the first two KL components—i.e., those with the largest prior eigenvalues—for three held-out observations. In each case, the CNF posterior (orange) is correctly concentrated around the true parameter (red star), with the CNF posterior mean (green diamond) close to the truth. As expected, the prior samples (blue) are substantially more dispersed, confirming that the CNF has learned meaningful posterior concentration.

Figure 15 compares the CNF score ∇_z log p_φ(z|y) against the physics-based posterior score computed via adjoint PDE solves and the GRF prior (Section 4.2), for each of the 100 KL components across 100 held-out observations. We observe that the per-component Pearson correlation is high (mean r̄ = 0.94, median 0.97), confirming that the CNF learns an accurate score across nearly all KL modes.

Figure 14: First two KL components of the CNF posterior for three held-out observations. Left column: prior samples (blue) and CNF posterior samples (orange) with the true parameter (red star); the CNF correctly concentrates around the truth. Right column: zoomed view showing the CNF posterior mean (green diamond) near the truth.

Figure 15: Per-component Pearson correlation between the CNF score ŝ_j = ∂_{z_j} log p_φ(z|y) and the physics-based posterior score s_j (adjoint PDE + prior), evaluated at 100 held-out test observations. Mean r̄ = 0.94, median 0.97.

Figure 16: Training and validation loss curves. (a) CNF negative log-likelihood: raw per-batch training loss (light), smoothed training loss (dark), and validation NLL per epoch. (b) CNCV mean-squared error: same layout. Both models converge without overfitting.
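The diagnostic behind Figure 15 is generic: a per-component correlation between a candidate score and a reference gradient of the log-density. A minimal sketch, with a Gaussian log-density standing in for the CNF and central finite differences standing in for the adjoint-PDE score (both illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_obs = 5, 100
mu = rng.normal(size=d)

def log_p(z):                        # stand-in for the learned log p_phi(z|y)
    return -0.5 * np.sum((z - mu) ** 2)

def fd_score(z, eps=1e-5):           # central finite-difference gradient of log_p
    e = np.eye(len(z)) * eps
    return np.array([(log_p(z + e[j]) - log_p(z - e[j])) / (2 * eps)
                     for j in range(len(z))])

def ref_score(z):                    # reference ("physics-based") score
    return -(z - mu)

zs = rng.normal(size=(n_obs, d))
S_hat = np.array([fd_score(z) for z in zs])
S_ref = np.array([ref_score(z) for z in zs])
# per-component Pearson correlation r_j, as plotted in Figure 15
r = np.array([np.corrcoef(S_hat[:, j], S_ref[:, j])[0, 1] for j in range(d)])
```

In the actual experiment the candidate score comes from automatic differentiation of the CNF log-density rather than finite differences; the correlation bookkeeping is identical.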
We note that CNFs trained without weight decay (λ = 0) produced inaccurate scores due to coupling-scale explosion; a weight decay of 10⁻⁴ in the AdamW optimizer was essential for stable training.

Figure 16 shows the training dynamics for both the CNF and the CNCV ensemble. Both models converge smoothly, with validation losses tracking the training losses closely, indicating no significant overfitting.

D ADDITIONAL DARCY POSTERIOR VISUALIZATIONS

Figure 17 shows posterior summaries for three additional held-out observations, demonstrating that CNCV generalizes beyond the single observation shown in Figure 10. Each row follows the same format: pressure field p† with sensor locations, true log-permeability u†, CNCV posterior mean, and posterior standard deviation. Across all observations, the CNCV posterior mean recovers the large-scale structure of the truth, with posterior uncertainty concentrated in the domain interior, away from sensors and Dirichlet boundaries.

Figure 17: Darcy posterior summaries for three additional held-out test observations. Each row shows a different observation. Columns: (a) pressure p† with sensors, (b) true u†, (c) CNCV posterior mean, (d) posterior std.

E EXPERIMENTAL DETAILS AND HYPERPARAMETERS

We provide complete hyperparameter specifications for all experiments. All models are trained with the Adam optimizer [Kingma and Ba, 2015] unless otherwise noted (AdamW for the Darcy CNF). Learning rates follow a cosine schedule from lr_init to lr_final.

E.1 CNCV ENSEMBLE (SECTIONS 4.1 AND 4.2)

Table 3 summarizes the CNCV hyperparameters. The same architecture (L = 16, depth 2, 64 hidden units, 3-layer MLP) is used across all stylized experiments; only the Darcy problem requires a deeper MLP (5 layers) and a lower learning rate due to the higher dimensionality (d = 100).

Table 3: CNCV ensemble hyperparameters.

                   Gauss.    Rosenb.   Nonlin.   Darcy
    d              2–16      2         4         100
    L (ensemble)   16        16        16        16
    Tree depth     2         2         2         3
    MLP layers     3         3         3         5
    Hidden units   64        64        64        128
    Batch size     2048      2048      2048      4096
    Train samples  65,536    65,536    65,536    120,000
    Epochs         50        50        100       400
    lr_init        10⁻³      10⁻³      10⁻³      10⁻⁴
    lr_final       10⁻⁴      10⁻⁴      10⁻⁴      10⁻⁵
    Seed           12        1         12        12

E.2 CONDITIONAL NORMALIZING FLOWS (SECTIONS 4.1 AND 4.2)

Table 4 shows the CNF hyperparameters. The Gaussian and Rosenbrock CNFs use 12 flow layers with 128 hidden units, while the nonlinear CNF uses 8 flow layers with 64 hidden units; all are trained on 262k joint samples. The Darcy CNF is substantially larger (12 flow layers, 128 hidden units, 17.2M parameters) to handle the 100-dimensional KL space with 33-dimensional observations.
A weight decay of 10⁻⁴ is applied only for the Darcy CNF, where it prevents coupling-scale explosion (Appendix C).

Figure 18: Individual posterior samples for the same three observations as Figure 17. Each row shows the true field (a) and three independent draws from the CNF approximate posterior (b–d). Samples preserve the frequency content of the truth; the smoothness of the posterior mean in Figure 17(c) arises from averaging over uncertain high-frequency modes.

Table 4: Conditional normalizing flow hyperparameters.

                   Gauss.    Rosenb.   Nonlin.   Darcy
    Flow layers    12        12        8         12
    Tree depth     2         2         2         6
    MLP layers     3         3         3         3
    Hidden units   128       128       64        128
    Batch size     8192      4096      8192      4096
    Train samples  262,144   262,144   262,144   120,000
    Epochs         1000      1000      1000      200
    lr_init        10⁻³      10⁻³      10⁻³      10⁻³
    lr_final       10⁻⁵      10⁻⁵      10⁻⁵      10⁻⁵
    Weight decay   0         0         0         10⁻⁴
    Parameters     0.6M      0.2M      0.1M      17.2M

E.3 DIFFUSION MODEL FOR THE NONLINEAR EXPERIMENT

The denoising diffusion probabilistic model used for posterior sampling in the nonlinear experiment (Section 4.1) uses 1000 diffusion timesteps with a linear noise schedule β_t ∈ [10⁻⁴, 0.02]. The denoising network is a 5-layer MLP with 512 hidden units, conditioned on the observation y via concatenation and on the timestep via sinusoidal position embeddings. Training uses 65,536 joint samples, batch size 4096, and a cosine learning rate schedule from 10⁻³ to 10⁻⁴ over 1000 epochs.

E.4 COMPUTATIONAL COST

Training the CNCV ensemble takes approximately 5 minutes for the stylized experiments (d ≤ 16) and 50 minutes for the Darcy problem (d = 100) on a single NVIDIA RTX 2000 Ada Generation GPU (16 GB). The Darcy CNF requires approximately 2 hours of training. At inference time, evaluating g_ens for a batch of M samples requires L = 16 forward passes through the coupling layer networks plus M score evaluations. For the Darcy problem with M = 200 samples, inference takes ∼0.5 s, compared to ∼100 s for 200 PDE solves.
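For reference, the variance-reduction accounting implied by these numbers: a VRF of ρ means the CV estimator matches plain MC's MSE with ρ times the samples, i.e., a 1/ρ effective sample-size multiplier.

```python
# Back-of-the-envelope check using the reported Darcy numbers.
vrf = 0.18                                   # reported per-observation VRF
eff_multiplier = 1.0 / vrf                   # effective sample-size gain (~5.5x)
n = 200                                      # CNF samples used at inference
n_mc_equivalent = round(n * eff_multiplier)  # plain-MC samples needed for equal MSE
```

This is the same arithmetic behind the "∼5.5× lower MSE" statement in Section 4.2.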
