Block-Relaxation Methods for 3D Constant-Coefficient Stencils on GPUs and Multicore CPUs

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

Block iterative methods are extremely important as smoothers for multigrid methods, as preconditioners for Krylov methods, and as solvers for diagonally dominant linear systems. Developing robust and efficient algorithms suitable for current and evolving GPU and multicore CPU systems is a significant challenge. We address this issue in the case of constant-coefficient stencils arising in the solution of elliptic partial differential equations on structured 3D uniform and adaptively refined grids. Robust, highly parallel implementations of block Jacobi and chaotic block Gauss-Seidel algorithms with exact inversion of the blocks are developed using different parallelization techniques. Experimental results for NVIDIA Fermi GPUs and AMD multicore systems are presented.


💡 Research Summary

The paper tackles the challenge of designing high‑performance block‑relaxation methods for three‑dimensional constant‑coefficient stencils, which are ubiquitous in the discretisation of elliptic partial differential equations on structured uniform and adaptively refined grids. The authors focus on two classic smoothers—block Jacobi and block Gauss‑Seidel—and propose robust, highly parallel implementations that exploit exact inversion of the small dense blocks that arise from the stencil. By pre‑computing an LU factorisation for each possible block shape and storing the factors in fast on‑chip memory (registers or shared memory on GPUs, L1/L2 cache on CPUs), the per‑iteration cost of solving the block systems becomes essentially constant, shifting the bottleneck from memory bandwidth to arithmetic throughput.
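The factor-once, solve-many idea can be sketched in isolation. The block size, stencil, and Doolittle routine below are illustrative assumptions, not the paper's code; the point is that the factorisation is computed a single time per block shape and then reused for every right-hand side.

```python
import numpy as np

def lu_factor(A):
    """In-place Doolittle LU factorisation without pivoting.

    Pivoting can be skipped here because the diagonal blocks of a
    diagonally dominant stencil matrix are themselves diagonally
    dominant."""
    LU = A.astype(float).copy()
    n = LU.shape[0]
    for k in range(n - 1):
        LU[k+1:, k] /= LU[k, k]
        LU[k+1:, k+1:] -= np.outer(LU[k+1:, k], LU[k, k+1:])
    return LU  # unit-lower L below the diagonal, U on and above it

def lu_solve(LU, b):
    """Reuse a precomputed factorisation for each right-hand side."""
    n = LU.shape[0]
    y = b.astype(float).copy()
    for i in range(1, n):                  # forward solve with unit L
        y[i] -= LU[i, :i] @ y[:i]
    for i in range(n - 1, -1, -1):         # backward solve with U
        y[i] = (y[i] - LU[i, i+1:] @ y[i+1:]) / LU[i, i]
    return y

# One block "shape": a 1D slice of the 3-point Laplacian stencil.
bs = 8
D = 2*np.eye(bs) - np.eye(bs, k=1) - np.eye(bs, k=-1)
LU = lu_factor(D)                                       # factor once ...
xs = [lu_solve(LU, np.eye(bs)[j]) for j in range(bs)]   # ... solve many
```

In the paper's setting the stored factors live in fast on-chip memory, so each of the repeated solves touches only the small triangular factors rather than refactoring the block.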

Two distinct parallelisation strategies are examined for NVIDIA’s Fermi GPU architecture. The first, a “thread‑per‑cell” approach, assigns one CUDA thread to each grid cell; block updates are performed in shared memory, but the method is limited by the amount of shared memory required per block. The second, a “thread‑per‑block” scheme, distributes the work of a single block among many threads, cooperating through shared memory with minimal synchronisation to keep the data path short. This latter strategy achieves higher SM occupancy and better utilisation of Fermi’s two‑level cache hierarchy.

On the CPU side, the authors implement an OpenMP‑based version for AMD multicore processors. Each core receives a set of blocks, with careful NUMA‑aware memory placement to minimise remote accesses. SIMD vectorisation is applied to the dense block solves, and loop unrolling is tuned to the block size, ensuring that the arithmetic intensity of the block solve matches the capabilities of modern superscalar pipelines.
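The data-parallel structure being exploited here can be sketched in NumPy: because all interior blocks of a constant-coefficient stencil share one shape, a single precomputed inverse serves every block, and applying it to many blocks collapses into one matrix-matrix product. The block size and count below are illustrative assumptions.

```python
import numpy as np

bs, nblocks = 8, 512
# All interior blocks of a constant-coefficient stencil share one
# shape, so a single inverse (or factorisation) serves every block.
D = 2*np.eye(bs) - np.eye(bs, k=1) - np.eye(bs, k=-1)
Dinv = np.linalg.inv(D)                    # computed once, reused everywhere

rhs = np.random.default_rng(1).standard_normal((nblocks, bs))
# One matmul updates every block at once -- the same arithmetic a
# vectorised, loop-unrolled CPU kernel performs lane by lane.
x = rhs @ Dinv.T
```

The same batching idea is what lets a SIMD unit keep its lanes full: each lane (or each row of the matmul) carries one independent block solve.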

A novel contribution is the introduction of a “chaotic” block Gauss‑Seidel method. Traditional block Gauss‑Seidel requires a strict ordering of block updates, which severely limits parallelism. By allowing blocks to be updated in a random (chaotic) order while still using the exact block inverse, the authors demonstrate experimentally that convergence rates remain comparable to the sequential Gauss‑Seidel, yet the method scales efficiently on both GPUs and CPUs.
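A minimal sketch of the chaotic ordering on a 1D Poisson model problem follows; the problem, block size, and sweep count are assumptions chosen for illustration (the paper works with 3D stencils), but the update rule is the standard block relaxation with an exact block inverse, applied to blocks in a random order each sweep.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1D Poisson model problem, partitioned into blocks of size bs.
n, bs = 32, 8
A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = np.zeros(n)

blocks = [slice(i, i + bs) for i in range(0, n, bs)]
inv = [np.linalg.inv(A[s, s]) for s in blocks]  # exact block inverses

for sweep in range(2000):
    for k in rng.permutation(len(blocks)):      # chaotic block order
        s = blocks[k]
        # Residual of block k with its own unknowns' contribution
        # removed, followed by an exact solve for those unknowns.
        r = b[s] - A[s, :] @ x + A[s, s] @ x[s]
        x[s] = inv[k] @ r
```

In a parallel setting each block update in the inner loop would run on its own thread or core without a fixed schedule; the randomised order above mimics that lack of ordering in serial form.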

Performance experiments are conducted on a GTX 480 (Fermi) GPU and a multicore AMD Opteron 6176 CPU. Test cases include uniform grids of size 128³ and 256³, as well as adaptively refined meshes with 27‑point stencils. For a 5×5×5 block size, the chaotic block Gauss‑Seidel achieves a speed‑up of roughly 1.8× over the standard block Jacobi on the GPU and 2.1× on the CPU, while maintaining similar or slightly improved iteration counts. Memory bandwidth utilisation reaches 85 % on the GPU and 70 % on the CPU, indicating that the implementations are close to the hardware limits. Importantly, the same code base works on adaptive grids without modification, showing strong algorithmic robustness to irregular block boundaries.

The authors conclude that exact block inversion combined with chaotic ordering provides a powerful recipe for building smoothers and preconditioners that are both scalable and resilient on contemporary heterogeneous architectures. They suggest future work on extending the approach to variable‑coefficient stencils, unstructured meshes, and multi‑node GPU/CPU clusters, where the same principles could be leveraged to achieve exascale‑level performance in multigrid and Krylov‑subspace solvers.

