Parallel resampling in the particle filter
Modern parallel computing devices, such as the graphics processing unit (GPU), have gained significant traction in scientific and statistical computing. They are particularly well-suited to data-parallel algorithms such as the particle filter, or more generally Sequential Monte Carlo (SMC), which are increasingly used in statistical inference. SMC methods carry a set of weighted particles through repeated propagation, weighting and resampling steps. The propagation and weighting steps are straightforward to parallelise, as they require only independent operations on each particle. The resampling step is more difficult, as standard schemes require a collective operation, such as a sum, across particle weights. Focusing on this resampling step, we analyse two alternative schemes that do not involve a collective operation (Metropolis and rejection resamplers), and compare them to standard schemes (multinomial, stratified and systematic resamplers). We find that, in certain circumstances, the alternative resamplers can perform significantly faster on a GPU, and to a lesser extent on a CPU, than the standard approaches. Moreover, in single precision, the standard approaches are numerically biased for upwards of hundreds of thousands of particles, while the alternatives are not. This is particularly important given greater single- than double-precision throughput on modern devices, and the consequent temptation to use single precision with a greater number of particles. Finally, we provide auxiliary functions useful for implementation, such as for the permutation of ancestry vectors to enable in-place propagation.
💡 Research Summary
The paper addresses a critical bottleneck in particle filtering and more generally Sequential Monte Carlo (SMC) methods: the resampling step. While propagation and weighting are embarrassingly parallel, traditional resampling schemes (multinomial, stratified, systematic) require a collective operation—typically a prefix sum of particle weights—to generate ancestor indices. This collective operation hampers parallel efficiency on modern hardware such as GPUs, which excel at data‑parallel tasks but struggle with global synchronisation and reduction. Moreover, the prefix‑sum approach can suffer from numerical instability in single‑precision arithmetic when the number of particles becomes large (hundreds of thousands to millions), leading to biased estimates.
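To make the collective operation concrete, here is a minimal sketch (ours, not the paper's implementation) of systematic resampling built on an inclusive prefix sum of the weights; the function and variable names are illustrative. The `np.cumsum` call is exactly the global reduction that the alternative schemes avoid.

```python
import numpy as np

def systematic_resample(weights, rng):
    """Return ancestor indices via one shared uniform draw per stratum."""
    n = len(weights)
    cumulative = np.cumsum(weights)      # the collective prefix sum
    cumulative /= cumulative[-1]         # normalise to [0, 1]
    # One uniform offset shared by all N evenly spaced strata.
    positions = (rng.random() + np.arange(n)) / n
    return np.searchsorted(cumulative, positions)

rng = np.random.default_rng(0)
w = rng.random(8)
ancestors = systematic_resample(w, rng)
```

Because the strata positions are increasing, the resulting ancestor indices come out sorted, which is convenient for the permutation step discussed later in the summary.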
To overcome these issues, the authors propose two alternative resampling algorithms that avoid any global aggregation of weights: a Metropolis‑based resampler and a rejection‑sampling resampler.
Metropolis Resampler
Each particle initiates a short Markov chain that targets the categorical distribution defined by the normalized weights. In each iteration a candidate particle is drawn uniformly from the whole set, and the move is accepted with probability min(1, w_candidate / w_current). After B iterations the current state becomes the ancestor for that particle. The algorithm only needs pairwise weight ratios, eliminating the need for a sum over all weights. Because B is finite, the resulting sample is biased; the bias decreases exponentially with B as (1‑β)^B, where β is a lower bound on the acceptance probability (essentially the ratio of the mean weight to the maximum weight). The authors provide a theoretical bound on the total‑variation distance and suggest estimating β by the empirical mean‑to‑max weight ratio, allowing practitioners to choose B that balances speed and bias. The computational cost is O(N·B).
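The per-particle chain described above can be sketched as follows (our own serial rendering of the idea; the paper runs one such chain per GPU thread). Note that only the ratio `weights[j] / weights[k]` is ever evaluated, so unnormalised weights suffice and no global sum appears.

```python
import numpy as np

def metropolis_resample(weights, b, rng):
    """Draw one ancestor per particle with a B-step Metropolis chain."""
    n = len(weights)
    ancestors = np.empty(n, dtype=int)
    for i in range(n):
        k = i                                  # chain starts at particle i
        for _ in range(b):                     # B Metropolis iterations
            j = rng.integers(n)                # uniform candidate
            if rng.random() <= weights[j] / weights[k]:
                k = j                          # accept w.p. min(1, w_j / w_k)
        ancestors[i] = k
    return ancestors

rng = np.random.default_rng(1)
w = rng.random(16)
ancestors = metropolis_resample(w, b=20, rng=rng)
```

In practice B would be chosen from the bias bound mentioned above, using the empirical mean-to-max weight ratio as an estimate of the acceptance lower bound β.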
Rejection Resampler
When an upper bound w_max on the weights is known a priori, rejection sampling can be used to draw ancestors without any global operation. For each particle i, the algorithm first proposes j = i (thus favouring self‑selection) and draws a uniform u. If u ≤ w_j / w_max the proposal is accepted; otherwise a new j is drawn uniformly from {1,…,N} and the test repeats. This method is unbiased and can increase the probability of retaining the same particle, thereby reducing variance in the ancestry vector. Its expected cost is O(N·w_max / w̄), where w̄ is the mean weight. However, on GPUs the while‑loop leads to variable task lengths across threads, causing warp divergence and potentially degrading performance. The authors discuss persistent‑thread strategies to mitigate this but note the overhead often outweighs the benefits.
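A serial sketch of this loop (again ours, with illustrative names) makes the self-selection and the variable-length iteration visible; on a GPU each `i` would run in its own thread, which is where the divergence problem arises.

```python
import numpy as np

def rejection_resample(weights, w_max, rng):
    """Draw one ancestor per particle, given an upper bound on the weights."""
    n = len(weights)
    ancestors = np.empty(n, dtype=int)
    for i in range(n):
        j = i                                   # first proposal: the particle itself
        while rng.random() > weights[j] / w_max:
            j = rng.integers(n)                 # redraw uniformly until accepted
        ancestors[i] = j
    return ancestors

rng = np.random.default_rng(2)
w = rng.random(16)
ancestors = rejection_resample(w, w.max(), rng)
```

The loop length for particle i is geometric with mean w_max / w̄, which is the source of both the expected-cost formula and the uneven thread workloads noted above.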
Experimental Evaluation
The authors conduct extensive benchmarks on both CPU (multi‑core) and GPU (CUDA) platforms, varying the number of particles from 10⁴ to 10⁶ and manipulating weight variance to simulate different degeneracy levels. They compare the proposed methods against standard multinomial, stratified, and systematic resamplers in terms of execution time, bias, mean‑squared error (MSE), and numerical stability under single‑precision arithmetic. Key findings include:
- Speed – On the GPU, the Metropolis resampler with modest B (≈10–20) outperforms standard schemes by a factor of 2–3, thanks to the elimination of the prefix‑sum reduction. The rejection sampler is competitive when the acceptance probability (≈ w̄ / w_max) is high; otherwise, warp divergence leads to slower runtimes.
- Bias – The Metropolis method exhibits bias that diminishes as B grows; with B ≥ 30 the bias becomes negligible for the tested scenarios. The rejection sampler remains unbiased by construction.
- Numerical Stability – In single‑precision, the standard resamplers show noticeable bias for particle counts above ~10⁵ due to accumulated floating‑point errors in the cumulative sum. Both alternative methods avoid this issue because they never compute a global sum, preserving accuracy even with millions of particles.
- CPU Results – While the speed advantage is less dramatic on CPUs (because reductions are less costly), the Metropolis and rejection schemes still provide modest gains, especially when the weight distribution is highly skewed.
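The single-precision failure mode is easy to demonstrate with a toy prefix sum (our own illustration, not the paper's benchmark): once a float32 running total reaches 2²³, increments of 0.5 are rounded away entirely, so the cumulative sum of 20 million equal weights stalls far short of the true total.

```python
import numpy as np

n = 20_000_000
true_total = 0.5 * n                    # 10,000,000 exactly

# Sequential prefix sums of n copies of 0.5 in each precision.
total32 = np.cumsum(np.full(n, 0.5, dtype=np.float32))[-1]
total64 = np.cumsum(np.full(n, 0.5, dtype=np.float64))[-1]

# The float32 accumulator stalls at 2**23 = 8,388,608, because 0.5 is
# below half an ulp there; float64 is exact for this example.
```

A weight vector normalised by such a sum is visibly wrong, which is the mechanism behind the bias reported for the standard schemes at large particle counts.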
Implementation Details
The paper supplies auxiliary algorithms for permuting ancestry vectors to avoid read/write conflicts during in‑place propagation, and functions to convert between offspring counts and ancestor indices. Pseudocode for all methods, as well as CUDA‑specific considerations (memory coalescing, thread block sizing, handling of warp divergence), are provided in the appendices.
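The goal of the permutation can be illustrated with a simple serial sketch (ours; the paper gives parallel variants): reorder the ancestor vector so that every particle index with at least one offspring appears at its own position, letting propagation overwrite particle storage in place.

```python
def permute_ancestors(ancestors):
    """Reorder so each surviving ancestor index v sits at position v."""
    a = list(ancestors)
    n = len(a)
    for i in range(n):
        # Swap the value at i into its home slot unless that slot is
        # already correctly occupied; each swap fixes one position for good.
        while a[i] != i and a[a[i]] != a[i]:
            j = a[i]
            a[i], a[j] = a[j], a[i]
    return a

permuted = permute_ancestors([2, 2, 0, 3])
```

After permutation, particle i can be propagated from slot a[i] into slot i without a second buffer, since any slot being overwritten either holds its own survivor or holds a particle with no offspring.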
Conclusions
By removing the need for a global weight sum, the Metropolis and rejection resamplers achieve superior parallel scalability and maintain numerical robustness in single‑precision environments. The Metropolis approach offers a tunable trade‑off between speed and bias, suitable for applications where slight bias is acceptable (e.g., standard particle filtering). The rejection approach is preferable when unbiasedness is essential (e.g., particle MCMC) and the weight distribution is not excessively skewed. The authors recommend selecting the resampling scheme based on the target hardware, particle count, weight variance, and the statistical requirements of the downstream inference task.