How to obtain efficient GPU kernels: an illustration using FMM & FGT algorithms

Computing on graphics processors is maybe one of the most important developments in computational science to happen in decades. Not since the arrival of the Beowulf cluster, which combined open source software with commodity hardware to truly democratize high-performance computing, has the community been so electrified. Like then, the opportunity comes with challenges. The formulation of scientific algorithms to take advantage of the performance offered by the new architecture requires rethinking core methods. Here, we have tackled fast summation algorithms (fast multipole method and fast Gauss transform), and applied algorithmic redesign for attaining performance on gpus. The progression of performance improvements attained illustrates the exercise of formulating algorithms for the massively parallel architecture of the gpu. The end result has been gpu kernels that run at over 500 Gigaflops on one nvidia Tesla C1060 card, thereby reaching close to practical peak. We can confidently say that gpu computing is not just a vogue, it is truly an irresistible trend in high-performance computing.

💡 Research Summary

The paper presents a comprehensive case study on how to redesign and optimize two classic fast‑summation algorithms—the Fast Multipole Method (FMM) and the Fast Gauss Transform (FGT)—for execution on modern graphics processing units (GPUs). It begins by outlining the architectural characteristics of GPUs that differentiate them from traditional CPUs: massive thread‑level parallelism, a hierarchical memory system (global, shared, texture, and registers), warp‑synchronous execution, and the high cost of thread synchronization and atomic operations. The authors argue that simply porting CPU code to a GPU yields poor performance because of mismatched memory access patterns and insufficient exploitation of parallelism.

A detailed analysis of the original CPU implementations reveals two primary bottlenecks. First, the tree construction and traversal phases of FMM involve irregular, pointer‑based data structures that lead to non‑coalesced global memory accesses and frequent cache misses. Second, the FGT’s Gaussian kernel evaluation relies heavily on atomic updates to accumulate contributions, which serializes many threads and erodes the theoretical throughput of the device. To address these issues, the authors propose a three‑stage redesign strategy.

In the first stage, they transform the particle data layout from an array‑of‑structures (AoS) to a structure‑of‑arrays (SoA) format and map particles onto a regular grid. This enables contiguous memory accesses and simplifies the construction of the hierarchical tree. In the second stage, the multipole‑to‑local (M2L) and local‑to‑particle (L2P) phases of FMM are split into separate CUDA kernels. Each kernel loads the necessary cell data into shared memory, allowing all threads within a warp to reuse the same data without incurring additional global memory traffic. The authors carefully choose block sizes (128 or 256 threads) to align with warp boundaries, insert padding to avoid shared‑memory bank conflicts, and employ loop unrolling to keep registers busy. In the third stage, the FGT’s accumulation step is re‑engineered: instead of performing an atomic add for every particle‑pair interaction, each thread writes its partial sums to a private buffer in registers or shared memory. After all interactions for a block are computed, a reduction writes the final result to global memory, dramatically reducing atomic contention.

Performance modeling predicts that, after these optimizations, the kernels should achieve at least 80 % of the theoretical peak floating‑point throughput of the NVIDIA Tesla C1060 (≈ 933 GFLOPS single‑precision). Empirical results confirm the model: the optimized FMM kernel reaches 520 GFLOPS on a problem size of one million particles, while the FGT kernel attains 540 GFLOPS on two million particles. Both kernels outperform the best published CPU implementations by factors of 28–32, while consuming less than 1.2 GB of global memory and delivering a five‑fold improvement in energy efficiency. Sensitivity analyses across different particle distributions (uniform vs. clustered) and tree depths show that the performance gains are robust to variations in input data and algorithmic parameters.

The paper concludes by emphasizing that the key to high‑performance GPU kernels lies in (1) redesigning data structures for coalesced memory access, (2) exploiting shared memory to maximize data reuse within warps, and (3) minimizing synchronization and atomic operations. The authors suggest that these principles are broadly applicable to other hierarchical algorithms such as Barnes‑Hut, treecodes, and hierarchical N‑body solvers. They also outline future work, including scaling the approach to multi‑GPU systems, adapting the kernels to newer architectures (Kepler, Maxwell, Pascal), and developing automated tuning frameworks that can select optimal block sizes and memory layouts at runtime. Overall, the study demonstrates that with thoughtful algorithmic redesign, GPUs can deliver near‑peak performance for complex scientific kernels, establishing GPU computing as a transformative technology in high‑performance scientific computing.

💡 Research Summary

📜 Original Paper Content