One-Step Sampler for Boltzmann Distributions via Drifting


Authors: Wenhan Cao, Keyu Yan, Lin Zhao

One-Step Sampler for Boltzmann Distributions via Drifting

Wenhan Cao (1,*), Keyu Yan (1,*), Lin Zhao (1,†)
(1) Department of Electrical and Computer Engineering, National University of Singapore
{wenhan,yky,zhaolin}@nus.edu.sg
* Equal contribution  † Corresponding author

Abstract

We present a drifting-based framework for amortized sampling of Boltzmann distributions defined by energy functions. The method trains a one-step neural generator by projecting samples along a Gaussian-smoothed score field from the current model distribution toward the target Boltzmann distribution. For targets specified only up to an unknown normalization constant, we derive a practical target-side drift from a smoothed energy and use two estimators: a local importance-sampling mean-shift estimator and a second-order curvature-corrected approximation. Combined with a mini-batch Gaussian mean-shift estimate of the sampler-side smoothed score, this yields a simple stop-gradient objective for stable one-step training. On a four-mode Gaussian-mixture Boltzmann target, our sampler achieves mean error 0.0754, covariance error 0.0425, and RBF MMD 0.0020. Additional double-well and banana targets show that the same formulation also handles nonconvex and curved low-energy geometries. Overall, the results support drifting as an effective way to amortize iterative sampling from Boltzmann distributions into a single forward pass at test time.

1 Introduction

Sampling from complex, high-dimensional distributions underlies many important problems in computational science, with applications spanning molecular modeling, Bayesian inference, and generative modeling. In this paper, we are interested in sampling from target distributions specified through an energy function E(x), namely Boltzmann distributions of the form

    p(x) = \frac{1}{Z} \exp(-E(x)), \qquad Z = \int \exp(-E(x)) \, dx < \infty,

where Z is an unknown normalization constant.
More generally, one may write p(x) = \frac{1}{Z} \exp(-\frac{1}{\tau} E(x)) with temperature \tau > 0; throughout this paper, we absorb \tau into E and set \tau = 1 for simplicity.

Boltzmann distributions provide a flexible way to represent structured target densities through their energy landscapes. Their main difficulty, however, lies in sampling. Since the normalization constant is typically intractable, exact sampling is unavailable, and one usually resorts to iterative procedures such as Langevin dynamics or Hamiltonian Monte Carlo [Roberts and Tweedie, 1996, Roberts and Rosenthal, 1998, Duane et al., 1987, Neal, 2011]. These samplers often require many target evaluations and can mix poorly in complex or multi-modal landscapes, making them expensive to use at test time.

A natural alternative is to learn an amortized sampler [Gershman and Goodman, 2014, Kingma and Welling, 2014, Rezende and Mohamed, 2015, Dinh et al., 2017]: a neural generator x = f_\theta(z), z \sim p_0, whose pushforward distribution q_\theta approximates the target p. Once trained, such a sampler produces approximate draws from the Boltzmann distribution using a single forward pass. The challenge is how to train f_\theta when the target is specified only up to its energy function and an unknown normalization constant.

In this paper, we propose to use the drifting model [Deng et al., 2026] as a general framework for training neural samplers for Boltzmann distributions, closely related to particle transport ideas such as Stein variational gradient descent [Liu and Wang, 2016]. Drifting casts sampler learning as a projected transport problem: given the current sampler distribution q_\theta, one defines a transport field that moves samples from q_\theta toward the target p, and then regresses the generator toward the transported samples.
This perspective is especially appealing here because the target-side transport can be expressed directly in terms of the energy function, without requiring the normalization constant.

Our starting point is to instantiate drifting with the Gaussian-smoothed score operator [Hyvärinen and Dayan, 2005, Hyvärinen, 2007, Vincent, 2011, Song et al., 2021]. Under this choice, the drift field is given by the difference between the smoothed target score and the smoothed sampler score. For a Boltzmann target, we show that the smoothed score admits a convenient representation in terms of a smoothed energy, and can be interpreted either as a local posterior mean shift or as a locally averaged energy gradient. This leads to a practical training objective for neural samplers of Boltzmann distributions.

We further develop two approximations for the target-side drift. The first is a local importance-sampling estimator based on Gaussian perturbations around the current sample. The second is a second-order approximation that yields a curvature-corrected energy gradient, in spirit related to modern score-based and diffusion modeling developments [Sohl-Dickstein et al., 2015, Ho et al., 2020, Song et al., 2020]. Combined with a simple mini-batch estimator of the sampler-side smoothed score, these approximations result in an efficient algorithm for amortized sampling from Boltzmann targets.

Contributions. Our contributions are threefold. First, we formulate sampling from Boltzmann distributions as a drifting problem and derive a training objective based on smoothed-score transport. Second, we characterize the target-side drift induced by an energy function and show how it can be approximated efficiently in practice. Third, we obtain a simple stop-gradient algorithm that trains a neural sampler to approximate Boltzmann sampling dynamics in an amortized manner.
2 Preliminaries

2.1 Drifting

Let p denote the target distribution and let q_\theta be the distribution induced by a generator x = f_\theta(z), z \sim p_0. Drifting introduces a vector field V_{p,q_\theta}(x) that specifies how a sample x \sim q_\theta should be moved so as to better match the target distribution. The corresponding training objective is

    L_{\mathrm{Drift}}(\theta) = \mathbb{E}_{z \sim p_0} \big\| f_\theta(z) - \mathrm{sg}\big( f_\theta(z) + V_{p,q_\theta}(f_\theta(z)) \big) \big\|_2^2,

where sg(·) denotes the stop-gradient operator. At the objective level, this is equivalent to

    L_{\mathrm{Drift}}(\theta) = \mathbb{E}_{x \sim q_\theta} \| V_{p,q_\theta}(x) \|_2^2.

The vector field should vanish when the current sampler already matches the target. A sufficient condition is anti-symmetry, V_{p,q}(x) = -V_{q,p}(x), which implies V_{p,p}(x) = 0. A natural construction is

    V_{p,q}(x) = \eta \big( T[p](x) - T[q](x) \big),

where T is an operator on distributions and \eta > 0 is a step size.

2.2 Smoothed-score drifting

In this work, we choose T to be the Gaussian-smoothed score operator,

    T_\sigma[p](x) = \nabla_x \log (p * \phi_\sigma)(x),

where \phi_\sigma is the Gaussian kernel with bandwidth \sigma. The resulting drift is

    V^{(\sigma)}_{p,q}(x) = \eta \big( \nabla_x \log(p * \phi_\sigma)(x) - \nabla_x \log(q * \phi_\sigma)(x) \big).

The drifting objective then becomes

    L_{\mathrm{Drift}}(\theta) = \eta^2 \, \mathbb{E}_{x \sim q_\theta} \big\| \nabla_x \log(p * \phi_\sigma)(x) - \nabla_x \log(q_\theta * \phi_\sigma)(x) \big\|_2^2.

This can be viewed as a reverse Fisher-type discrepancy between the smoothed target and sampler scores.

2.3 Problem setup: amortized sampling for Boltzmann distributions

We focus on the case where the target is a Boltzmann distribution. Classical training strategies for such energy-defined models include contrastive divergence and noise-contrastive estimation [Hinton, 2002, Gutmann and Hyvärinen, 2010], and modern variants include cooperative and short-run training schemes [Xie et al., 2020, Han et al., 2017, Nijkamp et al., 2020].
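The defining properties of the smoothed-score drift in Section 2.2 (anti-symmetry, and vanishing when the sampler matches the target) can be checked in closed form for one-dimensional Gaussians, since if p = N(m, s^2) then p * \phi_\sigma = N(m, s^2 + \sigma^2). The following sketch is illustrative only and is not part of the paper's implementation; the function names are our own.

```python
import numpy as np

# Closed-form smoothed score for a 1-D Gaussian: if p = N(m, s^2), then
# p * phi_sigma = N(m, s^2 + sigma^2), so the smoothed score is linear in x.
def smoothed_score(x, mean, var, sigma):
    return -(x - mean) / (var + sigma**2)

def drift(x, p, q, sigma, eta=0.1):
    # V(x) = eta * (T_sigma[p](x) - T_sigma[q](x)); anti-symmetric in (p, q).
    return eta * (smoothed_score(x, *p, sigma) - smoothed_score(x, *q, sigma))

x = np.linspace(-3.0, 3.0, 7)
p, q = (1.0, 0.5), (0.0, 1.0)   # (mean, variance) pairs
print(np.max(np.abs(drift(x, p, p, 0.3))))                        # vanishes when q = p
print(np.max(np.abs(drift(x, p, q, 0.3) + drift(x, q, p, 0.3))))  # anti-symmetry
```

Both printed values are zero, confirming the two structural properties for this closed-form special case.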
Here, amortized sampling means learning a parametric sampler once during training so that test-time sampling is obtained by a single forward pass, thereby amortizing the cost of iterative per-sample MCMC updates. The target is

    p(x) = \frac{1}{Z} \exp(-E(x)),

with energy function E(x) and unknown partition function Z. Our goal is to train a generator f_\theta such that q_\theta \approx p, thereby replacing iterative test-time MCMC with a single forward pass through the generator. The central technical problem is therefore to compute or approximate the target-side smoothed score \nabla_x \log(p * \phi_\sigma)(x) directly from the energy function.

3 Method

3.1 Target-side drift for Boltzmann distributions

For a Boltzmann target p(x) = \frac{1}{Z} \exp(-E(x)), define the Gaussian-smoothed density

    p_\sigma(x) := (p * \phi_\sigma)(x) = \frac{1}{Z} \int \exp(-E(u)) \, \phi_\sigma(x - u) \, du.

For a Gaussian kernel,

    \phi_\sigma(x - u) = (2\pi\sigma^2)^{-d/2} \exp\!\left( -\frac{\|u - x\|^2}{2\sigma^2} \right),

we can rewrite \log p_\sigma(x), up to an additive constant independent of x, as \log p_\sigma(x) = -\bar{E}_\sigma(x) + C, where

    \bar{E}_\sigma(x) := -\log \int \exp\!\left( -E(u) - \frac{\|u - x\|^2}{2\sigma^2} \right) du.

It follows that

    \nabla_x \log(p * \phi_\sigma)(x) = -\nabla_x \bar{E}_\sigma(x).

Hence the Boltzmann drift takes the form

    V^{(\sigma)}_{E,q}(x) = \eta \big( -\nabla_x \bar{E}_\sigma(x) - \nabla_x \log(q * \phi_\sigma)(x) \big),

and the corresponding objective is

    L_{E\text{-}\mathrm{Drift}}(\theta) = \eta^2 \, \mathbb{E}_{x \sim q_\theta} \big\| -\nabla_x \bar{E}_\sigma(x) - \nabla_x \log(q_\theta * \phi_\sigma)(x) \big\|_2^2.

This shows that sampling from Boltzmann distributions can be cast as matching the sampler-side smoothed score to an energy-induced target-side drift.

3.2 Interpreting the smoothed energy gradient

To obtain a more explicit form, define

    Z_\sigma(x) := \int \exp\!\left( -E(u) - \frac{\|u - x\|^2}{2\sigma^2} \right) du,

so that \bar{E}_\sigma(x) = -\log Z_\sigma(x). Differentiating yields

    -\nabla_x \bar{E}_\sigma(x) = \frac{1}{\sigma^2} \left( \mathbb{E}_{u \sim \pi_\sigma(\cdot \mid x)}[u] - x \right),

where

    \pi_\sigma(u \mid x) := \frac{1}{Z_\sigma(x)} \exp\!\left( -E(u) - \frac{\|u - x\|^2}{2\sigma^2} \right)

is a local posterior centered at x.
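The mean-shift identity of Section 3.2 can be verified numerically for a one-dimensional energy: the finite-difference gradient of the smoothed energy should match the local-posterior mean shift. The sketch below uses a double-well energy and simple grid quadrature; the specific energy, grid, and bandwidth are illustrative choices, not values from the paper.

```python
import numpy as np

def energy(u):
    # Illustrative 1-D double-well energy (an assumption for this check).
    return (u**2 - 1.0)**2

sigma = 0.3
u = np.linspace(-4.0, 4.0, 4001)   # quadrature grid
du = u[1] - u[0]

def smoothed_energy(x):
    # E_bar_sigma(x) = -log ∫ exp(-E(u) - ||u - x||^2 / (2 sigma^2)) du
    logits = -energy(u) - (u - x)**2 / (2 * sigma**2)
    m = logits.max()                       # max-subtraction for stability
    return -(m + np.log(np.sum(np.exp(logits - m)) * du))

def mean_shift_drift(x):
    # (E_{pi_sigma(.|x)}[u] - x) / sigma^2 with pi_sigma the local posterior
    logits = -energy(u) - (u - x)**2 / (2 * sigma**2)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return (np.sum(w * u) - x) / sigma**2

x0, h = 0.4, 1e-5
fd_grad = (smoothed_energy(x0 + h) - smoothed_energy(x0 - h)) / (2 * h)
print(abs(-fd_grad - mean_shift_drift(x0)))   # near machine precision
```

On the discrete grid the two quantities agree exactly up to finite-difference error, since the mean shift is the analytic derivative of the log of the quadrature sum.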
Equivalently, by integration by parts,

    -\nabla_x \bar{E}_\sigma(x) = -\mathbb{E}_{u \sim \pi_\sigma(\cdot \mid x)} \big[ \nabla_u E(u) \big].

Therefore, the target-side drift can be interpreted either as a local mean shift under the energy model or as a locally averaged energy gradient. This characterization is useful because it turns the intractable global sampling problem into a local estimation problem around each current sample.

3.3 Approximating the target-side drift

Monte Carlo approximation. A simple estimator can be obtained from local Gaussian perturbations. For a given sample x, draw

    u_\ell = x + \sigma \varepsilon_\ell, \qquad \varepsilon_\ell \sim \mathcal{N}(0, I), \qquad \ell = 1, \dots, L,

and define

    w_\ell = \exp(-E(u_\ell)), \qquad \bar{w}_\ell = \frac{w_\ell}{\sum_{m=1}^L w_m}.

Then

    -\nabla_x \bar{E}_\sigma(x) \approx \frac{1}{\sigma^2} \left( \sum_{\ell=1}^L \bar{w}_\ell u_\ell - x \right).

This estimator performs a local importance-weighted mean shift toward low-energy regions.

Second-order approximation. When second-order information is available, we can derive a local closed-form approximation. Let g(x) = \nabla E(x) and H(x) = \nabla^2 E(x). Using the quadratic expansion

    E(u) \approx E(x) + g(x)^\top (u - x) + \tfrac{1}{2} (u - x)^\top H(x) (u - x),

the local posterior becomes approximately Gaussian, yielding

    \mathbb{E}[u - x \mid x] \approx -\sigma^2 \big( I + \sigma^2 H(x) \big)^{-1} g(x).

Therefore,

    -\nabla_x \bar{E}_\sigma(x) \approx -\big( I + \sigma^2 H(x) \big)^{-1} \nabla E(x).

This approximation reveals that the smoothed target score is a curvature-corrected energy gradient.

3.4 Estimating the sampler-side smoothed score

The second term in the drift, \nabla_x \log(q_\theta * \phi_\sigma)(x), depends on the current sampler distribution. We estimate it directly from a mini-batch using Gaussian mean shift. Given generated samples \{x_j\}_{j=1}^N, define

    K_\sigma(x_j, x_i) = \exp\!\left( -\frac{\|x_j - x_i\|^2}{2\sigma^2} \right).

The sampler-side smoothed score at x_i is estimated by

    \hat{s}_{q,\sigma}(x_i) = \frac{1}{\sigma^2} \left( \frac{\sum_{j=1}^N K_\sigma(x_j, x_i) \, x_j}{\sum_{j=1}^N K_\sigma(x_j, x_i)} - x_i \right).
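The three estimators of Sections 3.3 and 3.4 are each a few lines of numpy. The sketch below implements them directly from the formulas; function names and the quadratic sanity-check energy are our own. For a quadratic energy the second-order formula is exact, so the Monte Carlo estimator should agree with it.

```python
import numpy as np

def mc_target_drift(x, energy, sigma, L=256, rng=None):
    """Sec 3.3: local importance-sampling mean-shift estimate of -grad E_bar_sigma(x)."""
    rng = np.random.default_rng(0) if rng is None else rng
    u = x + sigma * rng.standard_normal((L, x.shape[0]))   # u_l = x + sigma * eps_l
    logw = -energy(u)                                      # log w_l = -E(u_l)
    w = np.exp(logw - logw.max())
    w /= w.sum()                                           # normalized weights
    return (w @ u - x) / sigma**2

def curvature_target_drift(x, grad_E, hess_E, sigma):
    """Sec 3.3: second-order approximation -(I + sigma^2 H(x))^{-1} grad E(x)."""
    d = x.shape[0]
    return -np.linalg.solve(np.eye(d) + sigma**2 * hess_E(x), grad_E(x))

def sampler_side_score(X, sigma):
    """Sec 3.4: Gaussian mean-shift estimate of grad log(q * phi_sigma) at each row of X."""
    D2 = ((X[:, None, :] - X[None, :, :])**2).sum(-1)      # pairwise squared distances
    K = np.exp(-D2 / (2 * sigma**2))
    local_mean = K @ X / K.sum(axis=1, keepdims=True)
    return (local_mean - X) / sigma**2

# Sanity check on E(u) = ||u||^2 / 2 (an illustrative energy, not from the paper):
# both target-side estimates should be close to -x / (1 + sigma^2).
energy = lambda u: 0.5 * (u**2).sum(-1)
x, sigma = np.array([1.0, -0.5]), 0.3
exact = curvature_target_drift(x, lambda v: v, lambda v: np.eye(2), sigma)
approx = mc_target_drift(x, energy, sigma, L=20000)
print(exact, approx)
```

The log-sum-exp style weight normalization mirrors the stability trick one would need for sharply peaked energies.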
3.5 Training objective

In practice, we optimize the stop-gradient form of drifting rather than directly differentiating the squared field norm. For a latent sample z \sim p_0, let x = f_\theta(z). We first estimate the drift

    \hat{V}(x) = \eta \big( \hat{g}_{E,\sigma}(x) - \hat{s}_{q,\sigma}(x) \big),

where \hat{g}_{E,\sigma}(x) is computed using either the Monte Carlo or second-order approximation. We then construct the frozen target

    \tilde{x} = \mathrm{sg}\big( x + \hat{V}(x) \big),

and minimize

    L^{\mathrm{sg}}_{E\text{-}\mathrm{Drift}}(\theta) = \mathbb{E}_{z \sim p_0} \| f_\theta(z) - \tilde{x} \|_2^2.

This objective has a simple interpretation: samples produced by the current generator are first transported toward the Boltzmann target under the estimated drift, and the generator is then updated to match the transported samples. In this way, the generator amortizes the local transport dynamics induced by the energy model.

3.6 Mini-batch implementation

Given a mini-batch z_i \sim p_0, x_i = f_\theta(z_i), i = 1, \dots, N, our algorithm proceeds as follows:

1. Estimate the sampler-side smoothed score \hat{s}_{q,\sigma}(x_i) using Gaussian mean shift over the current batch.
2. Estimate the target-side drift \hat{g}_{E,\sigma}(x_i) using either local importance sampling or the second-order approximation.
3. Form the drift \hat{V}_i = \eta \big( \hat{g}_{E,\sigma}(x_i) - \hat{s}_{q,\sigma}(x_i) \big).
4. Construct frozen targets \tilde{x}_i = \mathrm{sg}(x_i + \hat{V}_i).
5. Update \theta by minimizing

       \hat{L}^{\mathrm{sg}}_{E\text{-}\mathrm{Drift}}(\theta) = \frac{1}{N} \sum_{i=1}^N \| x_i - \tilde{x}_i \|_2^2.

Method                   | Mean ℓ2 ↓ | Cov. Fro. ↓ | RBF MMD ↓ | Generated energy | Reference energy
Gaussian kernel drifting | 0.0754    | 0.0425      | 0.0020    | 1.0045           | 1.0263

Table 1: Quantitative summary for Gaussian kernel drifting on the Gaussian-mixture Boltzmann target. Lower is better for mean error, covariance error, and MMD.

Figure 1: Qualitative behavior of Gaussian kernel drifting on the Gaussian-mixture Boltzmann target. The learned one-step sampler captures all four modes at the correct locations and with approximately the correct spread.
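The five steps of Section 3.6 can be sketched end to end with a linear generator and a simple quadratic energy. This is a minimal illustration, not the paper's setup: the paper uses a residual MLP and a mixture target, whereas here the energy E(x) = ||x - mu||^2 / 2 (target N(mu, I)), the linear map, and all hyperparameters are our own assumptions chosen so the dynamics are easy to follow.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical quadratic energy E(x) = ||x - mu||^2 / 2, so the target is N(mu, I).
mu = np.array([2.0, -1.0])
energy = lambda v: 0.5 * ((v - mu)**2).sum(-1)

sigma, eta, lr, N, L = 0.3, 0.3, 0.05, 512, 64
W, b = np.eye(2), np.zeros(2)              # linear generator f_theta(z) = W z + b

for step in range(300):
    z = rng.standard_normal((N, 2))
    x = z @ W.T + b
    # Step 2: target-side drift via local importance-sampling mean shift (Sec 3.3).
    u = x[:, None, :] + sigma * rng.standard_normal((N, L, 2))
    logw = -energy(u)                                     # shape (N, L)
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    g = ((w[..., None] * u).sum(axis=1) - x) / sigma**2
    # Step 1: sampler-side smoothed score via Gaussian mean shift (Sec 3.4).
    D2 = ((x[:, None, :] - x[None, :, :])**2).sum(-1)
    K = np.exp(-D2 / (2 * sigma**2))
    s = (K @ x / K.sum(axis=1, keepdims=True) - x) / sigma**2
    # Steps 3-4: drift and frozen targets (stop-gradient: targets are constants).
    tgt = x + eta * (g - s)
    # Step 5: one gradient step on (1/N) * sum_i ||f_theta(z_i) - tgt_i||^2.
    grad_x = 2.0 * (x - tgt) / N
    W -= lr * grad_x.T @ z
    b -= lr * grad_x.sum(axis=0)

print(x.mean(axis=0))   # batch mean moves toward mu
```

Because the targets are treated as constants, the parameter update is an ordinary least-squares regression step toward the transported batch, exactly as in the stop-gradient objective of Section 3.5.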
4 Experiments

4.1 Experimental setup

We evaluate Gaussian kernel drifting on the two-dimensional Gaussian-mixture Boltzmann distribution used throughout the paper. The target distribution has four symmetric modes, so the experiment tests whether a one-step amortized sampler can recover both the local shape of each mode and the global mode balance. In this draft, we focus on understanding the behavior of our sampler itself rather than comparing against external baselines.

For Gaussian kernel drifting, we use a residual MLP generator with latent dimension 32, hidden width 256, and three hidden layers. The model is trained for 10,000 drifting updates with batch size 1024 and learning rate 10^{-3}. The Gaussian kernel bandwidth and drift step are both set to \sigma = \eta = 0.22, and the target-side drift is estimated by the Monte Carlo local mean-shift estimator with 256 local perturbations per sample. All reported numbers use 5000 generated samples and 5000 reference samples from the target, with random seed 42. We report the \ell_2 error of the sample mean, the Frobenius norm of the covariance error, RBF MMD, and the mean energy of generated samples. Lower is better for the first three metrics, while the mean energy should be interpreted relative to the reference energy.

4.2 Quantitative summary

Table 1 summarizes the final sampling quality of the trained one-step sampler. The learned generator attains mean error 0.0754, covariance error 0.0425, and RBF MMD 0.0020, indicating that the generated distribution closely matches the target in both first- and second-order structure. The generated mean energy 1.0045 is also close to the reference value 1.0263, which shows that the transported samples lie in the correct low-energy region of the Boltzmann target.
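The three sample-based metrics reported above can be computed with a few lines of numpy. This sketch uses the biased (V-statistic) RBF MMD^2 and a median-heuristic bandwidth; the paper does not state which MMD variant or kernel scale it uses, so both are assumptions.

```python
import numpy as np

def eval_metrics(X, Y, gamma=None):
    """Mean l2 error, covariance Frobenius error, and biased RBF MMD^2
    between generated samples X and reference samples Y (rows are samples)."""
    mean_err = np.linalg.norm(X.mean(axis=0) - Y.mean(axis=0))
    cov_err = np.linalg.norm(np.cov(X.T) - np.cov(Y.T), 'fro')
    if gamma is None:
        # Median heuristic for the kernel scale (an assumption, see above).
        gamma = 1.0 / np.median(((X[:, None] - Y[None])**2).sum(-1))
    k = lambda A, B: np.exp(-gamma * ((A[:, None] - B[None])**2).sum(-1))
    mmd2 = k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()
    return mean_err, cov_err, mmd2

rng = np.random.default_rng(42)
X = rng.standard_normal((500, 2))
Y = rng.standard_normal((500, 2)) + np.array([0.5, 0.0])
print(eval_metrics(X, Y))   # mean error near 0.5, positive MMD^2
print(eval_metrics(X, X))   # all three metrics vanish for identical samples
```

The biased MMD^2 estimator is nonnegative by construction and exactly zero when the two sample sets coincide, which makes it a convenient self-check.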
Taken together, these numbers suggest that the drifting objective is sufficient to train a single neural map whose pushforward already matches the target distribution well, without any iterative sampling procedure at test time.

Figure 2: Additional target examples for Gaussian kernel drifting. On the left, the learned sampler captures the two-well structure of the double-well energy. On the right, it follows the curved low-energy manifold of the banana-shaped target.

4.3 Qualitative analysis

Figure 1 visualizes the learned sampler together with the target contour and reference samples. The generated samples clearly occupy all four modes and align well with the target geometry. The mass allocation across the four quadrants is (1141, 1244, 1332, 1283) out of 5000 samples, which is close to the ideal balanced count of 1250 per mode for this symmetric target.

The figure also reveals the main residual error mode of the sampler: a small amount of probability mass appears between neighboring modes, especially along the lower half of the square. This is consistent with the fact that the generator is trained as a smooth one-step transport map. Even so, the learned samples remain concentrated near the correct basins, and the qualitative picture matches the low MMD and covariance errors reported in Table 1.

To illustrate that the same procedure extends beyond the Gaussian-mixture example, Figure 2 shows results on the double-well and banana targets with the same Gaussian kernel drifting framework. In the double-well case, the sampler recovers the two low-energy basins and the connecting bridge region, achieving mean error 0.0308, covariance error 0.0326, and MMD 0.0014. In the banana case, the generated samples follow the nonlinear curved geometry of the target, with mean error 0.0226 and MMD 0.0019.
These additional examples suggest that the drifting objective is not restricted to mixtures of nearly Gaussian modes, but also adapts to targets with nonconvex and highly curved support.

5 Conclusion

We proposed a drifting-based approach to amortized sampling for Boltzmann distributions, where a one-step generator is trained to match transported samples under a Gaussian-smoothed score field. The framework gives a concrete target-side drift for unnormalized energies and supports both Monte Carlo local mean-shift and second-order approximations, together with a mini-batch estimator for the sampler-side score. This combination leads to a practical and lightweight training algorithm.

Empirically, the method learns high-quality one-step samplers on multimodal toy Boltzmann distributions, with strong quantitative agreement on moment errors and MMD and with good qualitative mode coverage. Results on Gaussian-mixture, double-well, and banana targets suggest that the approach is robust beyond near-Gaussian settings and can follow curved low-energy manifolds. Future work will test the method on higher-dimensional image Boltzmann distributions, compare directly to strong iterative samplers under matched compute budgets, and improve stability via adaptive bandwidths and lower-variance target-side estimators.

References

Mingyang Deng, He Li, Tianhong Li, Yilun Du, and Kaiming He. Generative modeling via drifting. arXiv preprint arXiv:2602.04770, 2026.

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. In International Conference on Learning Representations, 2017.

Simon Duane, Anthony D. Kennedy, Brian J. Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.

Samuel J. Gershman and Noah D. Goodman. Amortized inference in probabilistic reasoning. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 36, pages 517–522, 2014.
Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304. JMLR Workshop and Conference Proceedings, 2010.

Tian Han, Yang Lu, Song-Chun Zhu, and Ying Nian Wu. Alternating back-propagation for generator network. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 1976–1984, 2017. doi: 10.1609/aaai.v31i1.10902.

Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020.

Aapo Hyvärinen. Some extensions of score matching. Computational Statistics & Data Analysis, 51(5):2499–2512, 2007.

Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.

Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, 2016.

Radford M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, pages 47–95, 2011.

Erik Nijkamp, Mitch Hill, Tian Han, Song-Chun Zhu, and Ying Nian Wu. On the anatomy of MCMC-based maximum likelihood learning of energy-based models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5272–5280, 2020.

Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538, 2015.

Gareth O. Roberts and Jeffrey S. Rosenthal.
Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B, 60(1):255–268, 1998.

Gareth O. Roberts and Richard L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265, 2015.

Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced score matching: A scalable approach to density and score estimation. In Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, volume 115 of Proceedings of Machine Learning Research, pages 574–584, 2020.

Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.

Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.

Jianwen Xie, Yang Lu, Ruiqi Gao, Song-Chun Zhu, and Ying Nian Wu. Cooperative training of descriptor and generator networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(1):27–45, 2020. doi: 10.1109/TPAMI.2018.2879081.
