Utilising Gradient-Based Proposals Within Sequential Monte Carlo Samplers for Training of Partial Bayesian Neural Networks
Partial Bayesian neural networks (pBNNs) have been shown to perform competitively with fully Bayesian neural networks while treating only a subset of their parameters as stochastic. Using sequential Monte Carlo (SMC) samplers as the inference method for pBNNs yields a non-parametric probabilistic estimate of the stochastic parameters, and has shown improved performance over parametric methods. In this paper we introduce a new SMC-based training method for pBNNs that utilises a guided proposal and incorporates gradient-based Markov kernels, giving better scalability on high-dimensional problems. We show that our new method outperforms the state-of-the-art in terms of predictive performance and optimal loss. We also show that pBNNs scale well with larger batch sizes, resulting in significantly reduced training times and often better performance.
💡 Research Summary
This paper introduces a novel training algorithm for partial Bayesian neural networks (pBNNs) that leverages gradient‑based proposals within a sequential Monte‑Carlo (SMC) framework, called Guided Open‑Horizon SMC (GOHSMC). In a pBNN, a subset of the network parameters (θ) is treated as stochastic while the remaining parameters (ψ) are deterministic. The goal is to infer the posterior distribution p(θ | y, ψ) non‑parametrically using SMC and to optimise ψ via stochastic gradient descent. Existing SMC‑based methods such as Open‑Horizon SMC (OHSMC) employ a bootstrap random‑walk kernel, which is inefficient in high‑dimensional θ spaces and does not exploit information from the target posterior. GOHSMC addresses these shortcomings by (1) introducing a guided proposal that leaves the current posterior invariant, and (2) employing a gradient‑based Markov kernel—specifically unadjusted Langevin dynamics (LD)—to move particles in the direction of ∇ log π(θ | ψ). The LD step is combined with a forward‑proposal (FP) L‑kernel, allowing the exact computation of importance weights that correct for the non‑invariance of the unadjusted dynamics. The resulting weight update (Equation 20) incorporates both the posterior ratio π(θ_t | ψ_{t‑1})/π(θ_{t‑1} | ψ_{t‑2}) and the Gaussian density of the momentum variables, ensuring that particles retain the correct distribution after each iteration.
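The move-and-reweight step described above can be sketched numerically. The following is a minimal NumPy illustration, not the paper's implementation: one unadjusted Langevin move followed by the forward-proposal (FP) L-kernel weight correction, in which the forward proposal density is evaluated with its arguments swapped. The function name and signature are hypothetical, and the Gaussian normalising constants are omitted because they cancel in the ratio.

```python
import numpy as np

def ula_propose_and_weight(theta, log_w, grad_log_post, log_post_prev, log_post_curr, step):
    """One unadjusted Langevin move with a forward-proposal (FP) L-kernel.

    theta: (J, D) particles; log_w: (J,) unnormalised log-weights.
    grad_log_post(theta) -> (J, D): gradient of the *current* log-posterior.
    log_post_prev / log_post_curr: log-densities of the previous / current targets.
    All names here are illustrative; this is a sketch of the generic SMC update,
    not the paper's exact Equation (20).
    """
    J, D = theta.shape
    noise = np.random.randn(J, D)
    drift = 0.5 * step * grad_log_post(theta)
    theta_new = theta + drift + np.sqrt(step) * noise  # unadjusted Langevin proposal

    def log_q(x_to, x_from):
        # Gaussian proposal density q(x_to | x_from) up to a constant that
        # cancels in the weight ratio (both directions share variance `step`).
        mean = x_from + 0.5 * step * grad_log_post(x_from)
        return -0.5 * np.sum((x_to - mean) ** 2, axis=1) / step

    # FP L-kernel: the backward kernel is the forward proposal with swapped
    # arguments, which makes the incremental weight exactly computable even
    # though the unadjusted dynamics do not leave the target invariant.
    log_inc = (log_post_curr(theta_new) + log_q(theta, theta_new)
               - log_post_prev(theta) - log_q(theta_new, theta))
    return theta_new, log_w + log_inc
```

For a Gaussian target the drift pulls particles toward the mode while the weight update corrects for the discretisation bias of the unadjusted step.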
Algorithmically, GOHSMC proceeds as follows: at each iteration a mini‑batch y_S^M is sampled; the effective sample size (ESS) is monitored, and multinomial resampling is triggered when the ESS falls below J/2; each particle is propagated using the LD proposal conditioned on the current ψ; weights are updated using the derived importance weight; the normalized weights are used to compute a stochastic gradient g(ψ) = (N/M) Σ_j w̃_j ∇_ψ log p(y_S^M | θ_j, ψ); and finally ψ is updated with learning rate ε. Because the posterior from the previous iteration serves as the initial distribution for the next, the method enjoys a warm‑start effect that accelerates convergence compared with SGSMC, which restarts from the prior at every step.
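The full loop can be illustrated on a toy problem. The sketch below uses a deliberately simple stand-in model (y ~ N(θ·x + ψ, σ²) with scalar stochastic θ and deterministic ψ — all names and hyperparameters are invented for illustration) and a simplified incremental weight consisting of the posterior ratio only; the paper's Equation (20) additionally carries a Gaussian momentum/L-kernel correction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pBNN stand-in: y ~ N(theta * x + psi, sigma^2),
# theta stochastic (inferred by SMC), psi deterministic (trained by SGD).
sigma = 0.5
x_full = rng.normal(size=200)
y_full = 1.5 * x_full + 0.7 + sigma * rng.normal(size=200)
N = x_full.size

def log_lik(theta, psi, xb, yb):
    """Mini-batch log-likelihood per particle; theta is (J, 1), returns (J,)."""
    resid = yb[None, :] - theta * xb[None, :] - psi
    return -0.5 * np.sum(resid ** 2, axis=1) / sigma ** 2

def log_post(theta, psi, xb, yb):
    """Mini-batch estimate of the log-posterior, with an N(0, 1) prior on theta."""
    return -0.5 * theta[:, 0] ** 2 + (N / xb.size) * log_lik(theta, psi, xb, yb)

J, M, step, lr = 100, 32, 1e-3, 1e-4
theta = rng.normal(size=(J, 1))   # particles initialised from the prior
log_w = np.zeros(J)
psi = 0.0

for _ in range(300):
    idx = rng.choice(N, size=M, replace=False)   # sample mini-batch y_S^M
    xb, yb = x_full[idx], y_full[idx]

    # ESS check; multinomial resampling when ESS drops below J/2.
    w = np.exp(log_w - log_w.max()); w /= w.sum()
    if 1.0 / np.sum(w ** 2) < J / 2:
        theta = theta[rng.choice(J, size=J, p=w)]
        log_w = np.zeros(J)

    # Unadjusted Langevin move using the scaled mini-batch posterior gradient.
    resid = yb[None, :] - theta * xb[None, :] - psi
    grad = -theta + (N / M) * np.sum(resid * xb[None, :], axis=1, keepdims=True) / sigma ** 2
    lp_old = log_post(theta, psi, xb, yb)
    theta = theta + 0.5 * step * grad + np.sqrt(step) * rng.normal(size=(J, 1))

    # Simplified incremental weight: posterior ratio only (no momentum term).
    log_w = log_w + log_post(theta, psi, xb, yb) - lp_old

    # Weighted stochastic gradient g(psi) = (N/M) sum_j w_j d/dpsi log p(y|theta_j, psi),
    # followed by a plain gradient-ascent step on psi.
    w = np.exp(log_w - log_w.max()); w /= w.sum()
    dpsi = np.sum(yb[None, :] - theta * xb[None, :] - psi, axis=1) / sigma ** 2
    psi = psi + lr * (N / M) * np.sum(w * dpsi)

posterior_mean = float(np.sum(w * theta[:, 0]))
```

Because the particle cloud from one iteration initialises the next, no restart from the prior is needed, which is the warm-start effect described above.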
The authors evaluate GOHSMC on six UCI regression benchmarks (Red/White Wine Quality, California Housing, Concrete Strength, Yacht Hydrodynamics, Naval Propulsion) using a three‑layer feed‑forward network with GeLU activations. The first layer is stochastic (widths ranging from 350 to 900), while subsequent layers are deterministic and trained with Adam (lr = 0.01). They compare against OHSMC with a random‑walk kernel, variational inference (VI), stochastic‑gradient HMC (SGHMC), stochastic weight averaging–Gaussian (SWAG), and Stein variational gradient descent (SVGD). All methods use 100 particles and are run for 100 epochs, with results averaged over five random splits (60/30/10 train/val/test).
Across metrics—root‑mean‑square error (RMSE), coefficient of determination (R²), bias, negative log‑likelihood (NLL), and continuous ranked probability score (CRPS)—GOHSMC consistently outperforms the baselines. It achieves the lowest RMSE on five of six datasets and the highest R² on four, while also delivering competitive or superior NLL and CRPS values. The performance gap is most pronounced on high‑dimensional tasks (e.g., Naval Propulsion with a 900‑unit stochastic layer), where the LD proposal maintains particle diversity and accurately captures the posterior. Although some error bars overlap, the overall trend demonstrates that the guided, gradient‑driven approach yields more stable and accurate posterior estimates than the original OHSMC and other state‑of‑the‑art methods.
In conclusion, GOHSMC demonstrates that integrating gradient‑based Markov kernels and guided proposals into SMC samplers provides an effective, scalable solution for training pBNNs. The method reduces training time by allowing larger mini‑batches without sacrificing accuracy, and it improves predictive performance and uncertainty quantification. The paper suggests future extensions such as incorporating full Hamiltonian Monte‑Carlo kernels with adaptive trajectory lengths (e.g., No‑U‑Turn Sampler or ChEES criteria) and applying the framework to more complex architectures like convolutional or transformer models. These directions promise further gains in efficiency and robustness for Bayesian deep learning in high‑dimensional settings.