Stochastic-gradient MCMC methods enable scalable Bayesian posterior sampling but often suffer from sensitivity to minibatch size and gradient noise. To address this, we propose the Stochastic Gradient Lattice Random Walk (SGLRW), an extension of the Lattice Random Walk discretisation. Unlike conventional Stochastic Gradient Langevin Dynamics (SGLD), SGLRW introduces stochastic noise only through the off-diagonal elements of the update covariance; this yields greater robustness to minibatch size while retaining asymptotic correctness. Furthermore, as a point of comparison, we analyze a natural analogue of SGLD that utilizes gradient clipping. Experimental validation on Bayesian regression and classification demonstrates that SGLRW remains stable in regimes where SGLD fails, including in the presence of heavy-tailed gradient noise, and matches or improves predictive performance.
Bayesian methods provide a principled framework for learning probabilistic models from data and natively capturing uncertainty by replacing the parameter point estimates of frequentist methods with a posterior distribution over parameters. By marginalizing over parameters, Bayesian methods act as a form of regularization and enable uncertainty quantification and robust model selection (Neal, 2012). In doing so, Bayesian models can potentially mitigate overfitting and miscalibration, which are prevalent in modern large-scale, overparameterized neural networks (Guo et al., 2017). Realizing these benefits in the modern hyperscaling era, however, requires posterior inference algorithms that scale to both dataset size and model complexity.

Figure 1: Comparing SGLD (left) and SGLRW (right) discretisations of Langevin dynamics, we observe that the lattice-based discretisation suppresses large parameter jumps that occur due to minibatch noise, resulting in more stable sampling.
Within Bayesian methods, Markov chain Monte Carlo (MCMC) (Neal, 1993; Robert et al., 1999) remains the gold standard for posterior sampling, but it is also among the methods most constrained by computational cost and poor scalability (Gelman et al., 1997). Alternative approaches, including variational inference (Blei et al., 2017), Laplace approximations (Tierney & Kadane, 1986), and single-pass methods (Gal & Ghahramani, 2016), are often less computationally demanding, but still introduce substantial overhead in training and inference (Blei et al., 2017; Lakshminarayanan et al., 2017; Wilson & Izmailov, 2020). As a result, these methods have, in some settings, fallen out of favour relative to modern approaches for assessing model trustworthiness, such as explainable and interpretable models (Li et al., 2023).
One core issue of MCMC methods for Bayesian posterior inference is the theoretical requirement to evaluate the gradient of the posterior over the entire dataset at each iteration (Welling & Teh, 2011;Ma et al., 2015). With growing model complexity and dataset size, this is often prohibitively expensive. Stochastic-gradient variants of MCMC methods, such as Stochastic Gradient Langevin Dynamics (SGLD) (Welling & Teh, 2011), alleviate this concern to some extent by allowing the gradient to be evaluated on a small minibatch of data at each iteration. However, these methods are still known to be sensitive to the minibatch size (Brosse et al., 2018). As a result, they do not scale to regimes where only a small minibatch is available at each evaluation step, or where only a small number of samples from the dataset can be stored in memory, as is becoming increasingly common.
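The SGLD update described above can be sketched as follows. This is a minimal illustration of the standard Welling & Teh (2011) update, not the paper's implementation; all function and variable names, as well as the toy conjugate-Gaussian model in the demo, are illustrative choices of ours. The key points are that the minibatch log-likelihood gradient is rescaled by N/n to give an unbiased estimate of the full-data gradient, and that Gaussian noise with variance matching the step size is injected at every step.

```python
import numpy as np

def sgld_step(theta, minibatch, grad_log_prior, grad_log_lik,
              step_size, data_size, rng):
    """One SGLD update: a gradient step on a minibatch estimate of the
    log-posterior gradient, plus Gaussian noise whose variance matches
    the step size."""
    n = len(minibatch)
    # Rescale the minibatch log-likelihood gradient by N/n so the
    # full-data gradient estimate is unbiased.
    grad = grad_log_prior(theta) + (data_size / n) * sum(
        grad_log_lik(theta, x) for x in minibatch
    )
    noise = rng.normal(0.0, np.sqrt(step_size), size=theta.shape)
    return theta + 0.5 * step_size * grad + noise

# Toy demo: likelihood N(x | theta, 1) with prior N(theta | 0, 1), so the
# posterior over theta is Gaussian with mean N * x_bar / (N + 1).
rng = np.random.default_rng(0)
data = rng.normal(1.0, 1.0, size=100)
theta = np.zeros(1)
samples = []
for _ in range(5000):
    batch = rng.choice(data, size=10, replace=False)
    theta = sgld_step(theta, batch,
                      grad_log_prior=lambda th: -th,      # d/dtheta log N(theta | 0, 1)
                      grad_log_lik=lambda th, x: x - th,  # d/dtheta log N(x | theta, 1)
                      step_size=1e-3, data_size=100, rng=rng)
    samples.append(theta[0])
posterior_mean = 100 * data.mean() / 101
```

Because the model is conjugate, the chain's long-run mean can be checked against the closed-form posterior mean N·x̄/(N+1); with small minibatches, the rescaled gradient noise inflates the sample variance, which is precisely the sensitivity discussed above.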
In this work, we propose Stochastic Gradient Lattice Random Walk (SGLRW), a stochastic-gradient extension of the recently introduced Lattice Random Walk (LRW) (Duffield et al., 2025) discretisation of overdamped Langevin dynamics. LRW replaces the Gaussian increments of Langevin dynamics with bounded binary or ternary updates on a lattice. As we show, unlike SGLD, the stochastic gradient noise in SGLRW enters only through the off-diagonal elements of the covariance matrix of the update and therefore remains robust to the minibatch size. This allows SGLRW to sample from the posterior distribution with the same asymptotic correctness as SGLD, but with improved stability for small minibatches, as shown in Figure 1.
In short, our contributions are as follows:
• We propose Stochastic Gradient Lattice Random Walk (SGLRW), a lattice-based stochastic-gradient discretisation of overdamped Langevin dynamics (Section 4.2).
• Extending the analysis of Chen et al. (2015), we provide a mean-squared-error analysis that justifies its improved stability for small minibatches (Section 4.3).
• We validate our theoretical findings on a mix of analytically understood problems and real-world tasks, including sentiment classification using an LLM (Section 5).
• We discuss a clipped version of SGLD as a strong baseline that is analogous to gradient clipping in stochastic gradient descent (Section 5.1).
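The clipped-SGLD baseline mentioned in the last contribution can be sketched generically; the paper's exact clipped variant is specified in Section 5.1, so the following is only an illustrative sketch of the standard idea it is analogous to: rescale the stochastic gradient estimate to a maximum norm before taking the Langevin step. Function and parameter names here are our own.

```python
import numpy as np

def clipped_sgld_step(theta, grad_est, step_size, clip_norm, rng):
    """One SGLD update with the stochastic gradient clipped to a maximum
    norm before the Langevin step, analogous to gradient clipping in SGD.
    Illustrative sketch only, not the paper's exact baseline."""
    norm = np.linalg.norm(grad_est)
    if norm > clip_norm:
        # Rescale to the clipping threshold, preserving direction, so a
        # single heavy-tailed gradient estimate cannot cause a huge jump.
        grad_est = grad_est * (clip_norm / norm)
    noise = rng.normal(0.0, np.sqrt(step_size), size=theta.shape)
    return theta + 0.5 * step_size * grad_est + noise

# Example: a gradient estimate of norm 50 is clipped down to norm 5
# before the update is applied.
rng = np.random.default_rng(0)
theta = np.zeros(2)
out = clipped_sgld_step(theta, np.array([30.0, 40.0]),
                        step_size=0.01, clip_norm=5.0, rng=rng)
```

Clipping bounds the drift term of each update, which mirrors how the lattice discretisation bounds step sizes, though clipping biases the gradient rather than restructuring the noise.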
As stated in the introduction, we consider the problem of minibatch-induced instability in stochastic gradient MCMC methods for Bayesian posterior sampling. Here, we recap the necessary background on Bayesian machine learning, posterior inference, and stochastic gradient methods.
We consider the supervised learning setting where we have observed data D = {(x_i, y_i)}_{i=1}^N and aim to infer a posterior distribution p(θ | D) over the parameter vector θ ∈ ℝ^d. In contrast to frequentist approaches, which seek a single point estimate θ* that maximises the likelihood p(D | θ), Bayesian machine learning maintains a full distribution over parameters, p(θ | D). This posterior distribution captures the uncertainty in our parameter estimates given the observed data.
The posterior distribution is given by Bayes' theorem as

p(θ | D) = p(D | θ) p(θ) / p(D),

where p(D | θ) is again the likelihood, p(θ) is the prior, and p(D) = ∫ p(D | θ) p(θ) dθ is the marginal likelihood. Notably, we will