High-Performance Physics Simulations Using Multi-Core CPUs
and GPGPUs in a Volunteer Computing Context
Kamran Karimi
Neil G. Dickson, Firas Hamze
D-Wave Systems Inc.
100-4401 Still Creek Drive
Burnaby, British Columbia
Canada, V5C 6G9
{kkarimi, ndickson, fhamze}@dwavesys.com
Abstract
This paper presents two conceptually simple methods for parallelizing a Parallel
Tempering Monte Carlo simulation in a distributed volunteer computing context, where
computers belonging to the general public are used. The first method uses conventional
multi-threading. The second method uses CUDA, a graphics card computing system.
Parallel Tempering is described, and challenges such as parallel random number
generation and mapping of Monte Carlo chains to different threads are explained. While
conventional multi-threading on CPUs is well-established, GPGPU programming
techniques and technologies are still developing and present several challenges, such as
the effective use of a relatively large number of threads. Having multiple chains in
Parallel Tempering allows parallelization in a manner that is similar to the serial
algorithm. Volunteer computing introduces important constraints to high performance
computing, and we show that both versions of the application are able to adapt
themselves to the varying and unpredictable computing resources of volunteers’
computers, while leaving the machines responsive enough to use. We present
experiments to show the scalable performance of these two approaches, and indicate that
the efficiency of the methods increases with bigger problem sizes.
1. Introduction
Many fields of science and technology require vast computational resources. Simulation
of physical systems is one notable example. Insufficient computing power can result in
unfeasibly long running times or poor accuracy of results. Parallelizing such applications
enables use of more processing power. However, parallelization of different applications
may need different approaches in order to effectively use the processing units. In this
paper, we parallelize a common Monte Carlo simulation technique known as Parallel
Tempering Monte Carlo (PTMC) [8].
We focus on two technologies that enable parallelism: 1) multi-core Central Processing
Units (CPU), and 2) streaming processors on Graphics Processing Units (GPU) [15]. In
particular, we use NVIDIA’s CUDA for GPU processing [15], though similar design
principles may be applied to other GPU brands. Our primary design goals have been
suitability for a volunteer computing environment, simplicity of the parallelization
method, and low synchronization overhead. All of these goals have been achieved by
leveraging the characteristics of our original (serial) algorithm.
The nature of Parallel Tempering Monte Carlo allows us to partition a large simulation
into smaller, independent simulations, groups of which can be solved by threads on a
CPU. The number of threads is chosen to be the same as the number of cores present in
the system to avoid cache contention among the threads. GPGPUs allow for the execution
of threads on a larger number of processing elements. Although these processing
elements are typically much slower than those of a CPU, having a large number of
threads may make it possible to surpass the performance of current multi-core CPUs.
In GPU programming, the creation of small, short-lived threads has been emphasized as a
means towards achieving good speedup. An example would be multiplying two matrices,
which can be done by dividing the task into threads, each of which computes one element
of the target matrix. This low-level parallelism can be used in many algorithms and has
been a significant source of GPU parallelization success. We show that, similar to CPU
multi-threading, one can design a suitable GPU parallel algorithm that uses a coarser
level of parallelization, running longer sequences of code on each GPU processor.
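The two granularities can be contrasted in CUDA kernel form. The following is a sketch under stated assumptions: the matrix-multiply kernel is the standard one-thread-per-output-element pattern, while `ChainState` and `doSweep` in the second kernel are hypothetical names standing in for the simulation's chain data and device-side update routine.

```cuda
// Fine-grained: one short-lived thread per output element of C = A * B.
__global__ void matMul(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}

// Coarse-grained: one long-lived thread per Monte Carlo chain. Each thread
// runs a full sequence of sweeps on its own chain, mirroring the serial
// algorithm. ChainState and doSweep are hypothetical placeholders.
__global__ void sweepChains(ChainState* chains, int numChains, int sweeps) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numChains)
        for (int s = 0; s < sweeps; ++s)
            doSweep(&chains[i]);  // device function updating one chain
}
```

In the coarse-grained kernel, each thread executes a long sequence of work between synchronization points, rather than a single arithmetic result.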
One matter that differs between multi-threaded CPU and GPU programming is memory
structure. In a multi-threaded CPU application, all threads have access to a common
memory space, so the primary challenge is synchronizing access to shared data structures.
GPU programming, on the other hand, involves threads that run in a separate memory
space from the main application (which runs on the CPU). The implication is that the
code and the data on which the GPU threads operate must be transferred to the GPU
memory before any processing can be started, and the results of the computation must be
copied back to the main memory of a host computer. This data transfer can result in poor
performance and must be kept to a minimum. It is thus important to make sure that the
time spent processing data on the GPU is long compared to data transfer times.
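The host-side pattern this implies is to transfer chain data to the GPU once, launch long-running kernels repeatedly, and copy results back only at the end. The sketch below assumes the hypothetical `ChainState` type and `sweepChains` kernel from the discussion above; it shows the transfer-amortization structure, not the paper's actual code.

```cuda
#include <cuda_runtime.h>

// Copy chains to the device once, run many sweeps per kernel launch, and
// copy results back once, so transfer time is small relative to compute.
void simulateOnGpu(ChainState* hostChains, int numChains,
                   int totalSweeps, int sweepsPerLaunch) {
    ChainState* devChains = nullptr;
    size_t bytes = numChains * sizeof(ChainState);
    cudaMalloc(&devChains, bytes);
    cudaMemcpy(devChains, hostChains, bytes, cudaMemcpyHostToDevice);  // once

    int threadsPerBlock = 128;
    int blocks = (numChains + threadsPerBlock - 1) / threadsPerBlock;
    for (int done = 0; done < totalSweeps; done += sweepsPerLaunch) {
        // A long-running kernel amortizes the up-front transfer cost.
        sweepChains<<<blocks, threadsPerBlock>>>(devChains, numChains,
                                                sweepsPerLaunch);
        cudaDeviceSynchronize();
        // (Replica-exchange decisions between chains could be made here.)
    }

    cudaMemcpy(hostChains, devChains, bytes, cudaMemcpyDeviceToHost);  // once
    cudaFree(devChains);
}
```

Keeping the number of host–device round trips independent of the number of sweeps is what keeps the transfer overhead negligible.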
Another characteristic of parallel programming with GPUs is the ability to start a large
number of threads with little overhead [15].
…(Full text truncated)…