High-Performance Physics Simulations Using Multi-Core CPUs and GPGPUs in a Volunteer Computing Context


📝 Original Info

  • Title: High-Performance Physics Simulations Using Multi-Core CPUs and GPGPUs in a Volunteer Computing Context
  • ArXiv ID: 1004.0023
  • Date: 2011-03-31
  • Authors: Kamran Karimi, Neil G. Dickson, Firas Hamze

📝 Abstract

This paper presents two conceptually simple methods for parallelizing a Parallel Tempering Monte Carlo simulation in a distributed volunteer computing context, where computers belonging to the general public are used. The first method uses conventional multi-threading. The second method uses CUDA, a graphics card computing system. Parallel Tempering is described, and challenges such as parallel random number generation and mapping of Monte Carlo chains to different threads are explained. While conventional multi-threading on CPUs is well-established, GPGPU programming techniques and technologies are still developing and present several challenges, such as the effective use of a relatively large number of threads. Having multiple chains in Parallel Tempering allows parallelization in a manner that is similar to the serial algorithm. Volunteer computing introduces important constraints to high performance computing, and we show that both versions of the application are able to adapt themselves to the varying and unpredictable computing resources of volunteers' computers, while leaving the machines responsive enough to use. We present experiments to show the scalable performance of these two approaches, and indicate that the efficiency of the methods increases with bigger problem sizes.

📄 Full Content

High-Performance Physics Simulations Using Multi-Core CPUs and GPGPUs in a Volunteer Computing Context

Kamran Karimi, Neil G. Dickson, Firas Hamze

D-Wave Systems Inc., 100-4401 Still Creek Drive, Burnaby, British Columbia, Canada V5C 6G9 {kkarimi, ndickson, fhamze}@dwavesys.com

Abstract
This paper presents two conceptually simple methods for parallelizing a Parallel Tempering Monte Carlo simulation in a distributed volunteer computing context, where computers belonging to the general public are used. The first method uses conventional multi-threading. The second method uses CUDA, a graphics card computing system. Parallel Tempering is described, and challenges such as parallel random number generation and mapping of Monte Carlo chains to different threads are explained. While conventional multi-threading on CPUs is well-established, GPGPU programming techniques and technologies are still developing and present several challenges, such as the effective use of a relatively large number of threads. Having multiple chains in Parallel Tempering allows parallelization in a manner that is similar to the serial algorithm. Volunteer computing introduces important constraints to high performance computing, and we show that both versions of the application are able to adapt themselves to the varying and unpredictable computing resources of volunteers’ computers, while leaving the machines responsive enough to use. We present experiments to show the scalable performance of these two approaches, and indicate that the efficiency of the methods increases with bigger problem sizes.

1. Introduction

Many fields of science and technology require vast computational resources. Simulation of physical systems is one notable example. Insufficient computing power can result in infeasibly long running times or poor accuracy of results. Parallelizing such applications enables use of more processing power. However, parallelization of different applications may need different approaches in order to effectively use the processing units. In this paper, we parallelize a common Monte Carlo simulation technique known as Parallel Tempering Monte Carlo (PTMC) [8].
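As a brief illustration of the Parallel Tempering scheme named above (a sketch of the standard textbook rule, not the authors' implementation): neighbouring chains at inverse temperatures β_i and β_j with current energies E_i and E_j periodically attempt to exchange configurations, accepting the swap with probability min(1, exp((β_i − β_j)(E_i − E_j))). The function names below are illustrative.

```cpp
#include <cmath>
#include <random>

// Standard Parallel Tempering swap acceptance probability for two chains
// at inverse temperatures beta_i, beta_j with current energies e_i, e_j:
// min(1, exp((beta_i - beta_j) * (e_i - e_j))).
double swap_probability(double beta_i, double beta_j, double e_i, double e_j) {
    double delta = (beta_i - beta_j) * (e_i - e_j);
    return delta >= 0.0 ? 1.0 : std::exp(delta);
}

// Metropolis-style swap decision between two neighbouring chains.
bool attempt_swap(double beta_i, double beta_j, double e_i, double e_j,
                  std::mt19937 &rng) {
    std::uniform_real_distribution<double> u(0.0, 1.0);
    return u(rng) < swap_probability(beta_i, beta_j, e_i, e_j);
}
```

Swaps between adjacent temperatures are the only point at which chains interact; between swap attempts each chain evolves independently, which is what makes the coarse-grained parallelization described later in the paper possible.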

We focus on two technologies that enable parallelism: 1) multi-core Central Processing Units (CPU), and 2) streaming processors on Graphics Processing Units (GPU) [15]. In particular, we use NVIDIA’s CUDA for GPU processing [15], though similar design principles may be applied to other GPU brands. Our primary design goals have been suitability for a volunteer computing environment, simplicity of the parallelization method, and low synchronization overhead. All of these goals have been achieved by leveraging the characteristics of our original (serial) algorithm.

The nature of Parallel Tempering Monte Carlo allows us to partition a large simulation into smaller, independent simulations, groups of which can be solved by threads on a CPU. The number of threads is chosen to be the same as the number of cores present in the system to avoid cache contention among the threads. GPGPUs allow for the execution of threads on a larger number of processing elements. Although these processing elements are typically much slower than those of a CPU, having a large number of threads may make it possible to surpass the performance of current multi-core CPUs.
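The chain-to-thread mapping described above can be sketched as follows. This is a minimal illustration, not the authors' code: it launches one worker per hardware core and hands each worker a disjoint subset of chains, so the sweeps themselves need no locking. `do_sweeps` is a hypothetical stand-in for the per-chain Monte Carlo update.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Partition n independent Monte Carlo chains across one worker thread
// per CPU core, as the text describes. Each thread owns a disjoint,
// strided subset of chains, so there are no shared writes during sweeps.
void run_chains(std::vector<double> &chain_state,
                void (*do_sweeps)(double &)) {
    unsigned n_threads = std::max(1u, std::thread::hardware_concurrency());
    std::size_t n_chains = chain_state.size();
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&, t] {
            // Thread t updates chains t, t + n_threads, t + 2*n_threads, ...
            for (std::size_t c = t; c < n_chains; c += n_threads)
                do_sweeps(chain_state[c]);
        });
    }
    for (auto &w : workers) w.join();
}
```

Matching the thread count to the core count, as the paper notes, avoids oversubscription and cache contention; the strided assignment is just one simple way to balance chains across threads.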

In GPU programming, the creation of small, short-lived threads has been emphasized as a means towards achieving good speedup. An example would be multiplying two matrices, which can be done by dividing the task into threads, each of which computes one element of the target matrix. This low-level parallelism can be used in many algorithms and has been a significant source of GPU parallelization success. We show that, similar to CPU multi-threading, one can design a suitable GPU parallel algorithm that uses a coarser level of parallelization, running longer sequences of code on each GPU processor.
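The per-element matrix-multiplication decomposition mentioned above can be sketched as follows. This is a CPU rendering of the GPU pattern for illustration only: each "thread" is identified by a flat index and computes exactly one element of C = A·B; on a GPU, all n² such bodies would be launched concurrently, whereas here they are invoked serially.

```cpp
#include <cstddef>
#include <vector>

// The body of one GPU-style thread: the thread with flat index idx
// computes the single element C[row][col] of the n x n product A * B.
// Matrices are stored row-major in flat vectors.
void multiply_element(const std::vector<double> &A,
                      const std::vector<double> &B,
                      std::vector<double> &C,
                      std::size_t n, std::size_t idx) {
    std::size_t row = idx / n, col = idx % n;
    double sum = 0.0;
    for (std::size_t k = 0; k < n; ++k)
        sum += A[row * n + k] * B[k * n + col];
    C[idx] = sum;
}

// On a GPU, n*n instances of multiply_element would run concurrently;
// this serial loop stands in for that launch.
void multiply(const std::vector<double> &A, const std::vector<double> &B,
              std::vector<double> &C, std::size_t n) {
    for (std::size_t idx = 0; idx < n * n; ++idx)
        multiply_element(A, B, C, n, idx);
}
```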

One matter that differs between multi-threaded CPU and GPU programming is memory structure. In a multi-threaded CPU application, all threads have access to a common memory space, so the primary challenge is synchronizing access to shared data structures. GPU programming, on the other hand, involves threads that run in a memory space separate from the main application (which runs on the CPU). The implication is that the code and the data on which the GPU threads operate must be transferred to the GPU memory before any processing can start, and the results of the computation must be copied back to the host computer's main memory. This data transfer can result in poor performance and must be kept to a minimum. It is thus important to make sure that the time spent processing data on the GPU is long compared to data transfer times.

Another characteristic of parallel programming with GPUs is the ability to start a large number of threads with little overhead [15].

…(Full text truncated)…

Reference

This content is AI-processed based on ArXiv data.
