High-Performance Physics Simulations Using Multi-Core CPUs and GPGPUs in a Volunteer Computing Context
This paper presents two conceptually simple methods for parallelizing a Parallel Tempering Monte Carlo simulation in a distributed volunteer computing context, where computers belonging to the general public are used. The first method uses conventional multi-threading; the second uses CUDA, NVIDIA's platform for general-purpose computing on graphics cards. Parallel Tempering is described, and challenges such as parallel random number generation and the mapping of Monte Carlo chains to threads are explained. While conventional multi-threading on CPUs is well established, GPGPU programming techniques and technologies are still developing and present several challenges, such as the effective use of a relatively large number of threads. Having multiple chains in Parallel Tempering allows parallelization in a manner that closely follows the serial algorithm. Volunteer computing imposes important constraints on high-performance computing, and we show that both versions of the application adapt themselves to the varying and unpredictable computing resources of volunteers’ computers while leaving the machines responsive enough to use. We present experiments demonstrating the scalable performance of these two approaches and showing that the efficiency of both methods increases with larger problem sizes.
💡 Research Summary
The paper addresses the challenge of running large‑scale Parallel Tempering Monte Carlo (PTMC) simulations on volunteer‑computing platforms such as BOINC, where the hardware is heterogeneous, unpredictable, and must remain responsive to the user. Two parallelization strategies are presented: (1) a conventional multi‑core CPU approach using OpenMP, and (2) a CUDA‑based GPGPU implementation.
PTMC naturally lends itself to parallel execution because each replica (Markov chain) runs at a distinct temperature and can be swept independently; after a set of sweeps, adjacent replicas are probabilistically swapped. The authors exploit this two‑level parallelism—chain‑level and sweep‑level—to map the algorithm onto both CPUs and GPUs.
For the CPU version, each chain is assigned to a thread. Simple round‑robin assignment proved inefficient because hotter chains require more work than colder ones, leading to load imbalance. The authors therefore introduced a work‑pool: threads repeatedly fetch the next unprocessed chain from a shared counter inside a critical section. This dynamic scheduling equalizes the workload, keeps all cores busy, and reduces cache misses. OpenMP’s default thread creation is overridden to match BOINC’s “below‑normal” priority, ensuring that the volunteer’s desktop remains usable even when CPU utilization reaches ~94 %.
Random number generation, essential for Metropolis updates and replica swaps, is handled by giving each execution thread its own Mersenne‑Twister state vector and index. This eliminates the need for locks and preserves the statistical quality of the random stream. The same per‑thread state strategy is applied on the GPU, where each CUDA thread (or group of threads) maintains an independent state.
The GPU implementation mirrors the CPU’s two‑level parallelism. At the first level, each PTMC chain is mapped to a streaming multiprocessor (SM); at the second level, the SM’s many CUDA cores concurrently update independent sub‑regions of the chain. The authors split each chain into two groups of 64‑bit blocks that can be updated without inter‑thread interference, allowing thousands of threads to run simultaneously. Data transfer overhead—a major bottleneck in GPU computing—is mitigated by copying the entire set of chains to device memory once, performing many sweeps and swaps locally, and copying back only the final results. Kernel launches are organized so that each block processes a single chain, and threads within the block handle the sub‑regions, achieving good occupancy and minimizing synchronization.
Performance experiments cover a range of problem sizes (from a few dozen to several hundred chains) and compare the CPU and GPU versions. For small workloads the CPU version is faster because the GPU’s host‑device transfer dominates. As the number of chains grows beyond ~200, the GPU version outperforms the CPU by a factor of 2–3, and both implementations exhibit near‑linear scaling with problem size. The results confirm that the proposed methods are well‑suited to the volunteer‑computing context: they adapt to varying numbers of cores, respect user‑interface responsiveness, and exploit the massive parallelism of modern GPUs without overwhelming the host system.
In summary, the paper demonstrates that a relatively simple redesign of PTMC—assigning independent random‑number streams, using dynamic work‑pool scheduling on CPUs, and exploiting hierarchical parallelism on GPUs—can deliver high performance on heterogeneous, opportunistic computing resources. The techniques are broadly applicable to other Monte Carlo‑based scientific codes that must run on volunteer platforms while preserving a good user experience.