Investigating Batch Inference in a Sequential Monte Carlo Framework for Neural Networks


Bayesian inference allows us to define a posterior distribution over the weights of a generic neural network (NN). Exact posteriors are usually intractable, in which case approximations can be employed. One such approximation - variational inference - is computationally efficient when using mini-batch stochastic gradient descent, as subsets of the data are used for likelihood and gradient evaluations, though the approach relies on the selection of a variational distribution which sufficiently matches the form of the posterior. Particle-based methods such as Markov chain Monte Carlo and Sequential Monte Carlo (SMC) do not assume a parametric family for the posterior but typically require higher computational cost. These sampling methods typically use the full batch of data for likelihood and gradient evaluations, which contributes to this computational expense. We explore several methods of gradually introducing more mini-batches of data (data annealing) into likelihood and gradient evaluations of an SMC sampler. We find that we can achieve up to $6\times$ faster training with minimal loss in accuracy on benchmark image classification problems using NNs.


💡 Research Summary

This paper investigates how to accelerate Bayesian neural network (BNN) training using Sequential Monte Carlo (SMC) samplers by introducing mini‑batch based data annealing (DA) strategies. Traditional SMC methods draw particles from the prior, propagate them with a Markov kernel (e.g., Hamiltonian Monte Carlo or Langevin dynamics), re‑weight them using the full‑dataset log‑likelihood, and resample when particle degeneracy occurs. Because each SMC iteration requires evaluating the log‑likelihood and its gradient on the entire training set, the computational cost grows linearly with the data size and quickly becomes prohibitive for modern image‑classification tasks.
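The sample-propagate-reweight-resample cycle described above can be sketched on a toy problem. The snippet below is a minimal illustration, not the paper's implementation: it replaces the neural network with a single weight w (model y ~ N(w·x, 1)), uses a likelihood-tempering schedule and a random-walk jitter in place of the HMC/Langevin kernel, and the particle count, prior, and step sizes are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a network: a single weight w, model y ~ N(w * x, 1).
x = rng.normal(size=200)
y = 1.5 * x + rng.normal(size=200)

def log_lik(w):
    # Full-batch Gaussian log-likelihood for each particle (up to a constant).
    resid = y[None, :] - w[:, None] * x[None, :]
    return -0.5 * (resid ** 2).sum(axis=1)

P = 256
w = rng.normal(0.0, 2.0, size=P)      # draw particles from the N(0, 4) prior
logw = np.zeros(P)                    # log-weights
betas = np.linspace(0.0, 1.0, 21)     # likelihood-tempering schedule

for b_prev, b in zip(betas[:-1], betas[1:]):
    # Re-weight: incremental weight is the tempered full-batch likelihood.
    logw += (b - b_prev) * log_lik(w)
    wts = np.exp(logw - logw.max())
    wts /= wts.sum()
    ess = 1.0 / np.sum(wts ** 2)      # effective sample size
    if ess < P / 2:                   # particle degeneracy: resample
        idx = rng.choice(P, size=P, p=wts)
        w, logw = w[idx], np.zeros(P)
    # Propagate with random-walk jitter (stand-in for an HMC/Langevin kernel).
    w = w + 0.05 * rng.normal(size=P)

wts = np.exp(logw - logw.max())
posterior_mean = np.average(w, weights=wts)
```

By the final iteration the particles concentrate near the data-generating weight; the ESS check is what triggers resampling when the weights degenerate, exactly the failure mode the full-dataset likelihood evaluations make expensive at scale.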

The authors propose to replace the full‑batch evaluation with a schedule that gradually increases the amount of data used for likelihood and gradient estimates. Starting from a small mini‑batch size C (500 samples), they append a fixed‑size mini‑batch κ (also 500) at each annealing step, thereby constructing a sequence of batch sizes M_k that grows from C to the full dataset size N. Five deterministic schedules are examined: Constant (fixed mini‑batch), Full‑batch (always N), Constant‑to‑refine (small batch for most iterations, then a brief full‑batch phase), Linear (add one mini‑batch each iteration), and Automated (a linear schedule that reaches full‑batch at 90 % of the total iterations). In addition, an entropy‑driven adaptive schedule called Smooth Data Annealing (SDA) is introduced. SDA monitors the Shannon entropy of the log‑likelihood contributions across particles and adjusts a temperature‑like parameter β_k so that the entropy change ΔS remains constant. When β_k reaches 1, a new mini‑batch is added and β_k is reset, ensuring a smooth, data‑driven expansion of the batch.
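The deterministic schedules can be written as a single function returning the batch size M_k at each SMC iteration. This is a hedged sketch: the schedule names, the C = κ = 500 defaults, and the 90 % growth/refinement fractions follow the summary above, but the exact step formulas (e.g. how "Automated" rounds its increment) are illustrative assumptions.

```python
import math

def batch_schedule(kind, K, N, C=500, kappa=500):
    """Batch size M_k for each of K SMC iterations over N training samples.

    Sketch of the paper's deterministic data-annealing schedules; the 90%
    fractions and rounding choices here are assumptions, not the paper's code.
    """
    refine_at = int(0.9 * K)          # assumed refinement / full-batch point
    sizes = []
    for k in range(K):
        if kind == "constant":                 # fixed mini-batch throughout
            m = C
        elif kind == "full":                   # always the full dataset
            m = N
        elif kind == "constant_to_refine":     # small batch, then full-batch phase
            m = C if k < refine_at else N
        elif kind == "linear":                 # append one mini-batch per step
            m = min(C + k * kappa, N)
        elif kind == "automated":              # linear, reaching N at 90% of K
            step = max(1, math.ceil((N - C) / (0.9 * K)))
            m = min(C + k * step, N)
        else:
            raise ValueError(f"unknown schedule: {kind}")
        sizes.append(m)
    return sizes
```

For K = 200 iterations on N = 70 000 samples, "linear" reaches the full batch around iteration 139, while "automated" stretches its increments so the full batch arrives at iteration 180 (90 % of K), matching the description above. The entropy-driven SDA schedule is not shown, as it depends on the per-particle log-likelihoods rather than the iteration index alone.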

The SMC framework uses gradient‑based Markov kernels. For HMC, three leapfrog steps with step size h = 0.002 are employed; for Langevin dynamics, a single leapfrog step is used. The authors avoid an explicit accept‑reject step by designing the proposal distribution q_k and the L‑kernel L_k such that the Jacobian determinants cancel, leading to a simplified weight update (Equation 12). This keeps the algorithm efficient while preserving the correct target distribution.
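A standard leapfrog integrator with the reported settings (three steps, h = 0.002; a single step recovers the Langevin-style kernel) can be sketched as follows. This is the generic HMC proposal, not the paper's exact kernel: the paper's q_k/L_k construction and simplified weight update (Equation 12) are omitted, and the function and argument names are assumptions.

```python
import numpy as np

def leapfrog_propose(theta, grad_log_post, h=0.002, n_steps=3, rng=None):
    """One gradient-based HMC proposal: n_steps leapfrog steps of size h.

    theta: particle weights; grad_log_post: gradient of the log-posterior
    (in the paper, estimated on the current mini-batch M_k).
    """
    if rng is None:
        rng = np.random.default_rng()
    p = rng.normal(size=theta.shape)           # fresh momentum per proposal
    p = p + 0.5 * h * grad_log_post(theta)     # initial momentum half-step
    for i in range(n_steps):
        theta = theta + h * p                  # full position step
        if i < n_steps - 1:
            p = p + h * grad_log_post(theta)   # full momentum step in between
    p = p + 0.5 * h * grad_log_post(theta)     # final momentum half-step
    return theta, p
```

On a quadratic log-posterior the integrator conserves the Hamiltonian to O(h²), which is what makes dropping the explicit accept-reject step (via the cancelling-Jacobian L-kernel construction) viable without large discretisation bias.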

Experiments are conducted on three benchmark image datasets: MNIST (70 k grayscale digits), Fashion‑MNIST (70 k grayscale clothing items), and a variant of Full‑MNIST. Two network architectures are used: a LeNet‑5 model with 61 706 parameters and a larger CNN with 96 658 parameters. Each configuration is trained for 200 SMC iterations over five random seeds, and the authors report mean ± standard deviation of test loss, test accuracy, and wall‑clock runtime.

Key findings include:

  • Full‑batch SMC yields the highest accuracy (≈ 98 %) and lowest loss but incurs the longest runtime.
  • The Constant schedule provides the greatest speed‑up (≈ 20×) but suffers a modest accuracy drop (≈ 1 %).
  • Constant‑to‑refine (CTR) offers a strong compromise: it matches Full‑batch accuracy (≈ 97.8 %) while reducing runtime by a factor of 6.6, because it only uses the full batch during a short refinement phase.
  • Linear and Automated schedules outperform Constant but are slightly less efficient than CTR.
  • The entropy‑driven SDA consistently delivers stable performance across datasets, confirming that adaptive batch growth based on particle uncertainty is beneficial.
  • Across all schedules, HMC outperforms Langevin dynamics, even with only three leapfrog steps, indicating that richer Hamiltonian trajectories improve particle exploration without a prohibitive cost increase.

The authors conclude that data annealing dramatically lowers the computational burden of SMC samplers for Bayesian neural networks while preserving most of the statistical efficiency. They suggest future work on adaptive trajectory lengths, automatic tuning of the mass matrix, and scaling the approach to distributed GPU clusters.

