Spherical harmonic transform with GPUs
We describe an algorithm for computing an inverse spherical harmonic transform suitable for graphics processing units (GPUs). We use CUDA and base our implementation on a Fortran 90 routine included in a publicly available parallel package, S2HAT. We focus our attention on the two major sequential steps involved in the computation of the transform, retaining the efficient parallel framework of the original code. We detail the optimization techniques used to enhance the performance of the CUDA-based code and contrast them with those implemented in the Fortran 90 version. We also present performance comparisons of a single CPU plus GPU unit with the S2HAT code running on either one or four processors. In particular, we find that the latest generation of GPUs, such as the NVIDIA GF100 (Fermi), can accelerate the spherical harmonic transforms by as much as 18 times with respect to S2HAT executed on a single core, and by as much as 5.5 times with respect to S2HAT on four cores, with the overall performance being limited by the fast Fourier transforms. The work presented here has been performed in the context of Cosmic Microwave Background simulations and analysis. However, we expect that the developed software will be of more general interest and applicability.
💡 Research Summary
The paper addresses the computational bottleneck inherent in spherical harmonic transforms (SHTs), a core operation in Cosmic Microwave Background (CMB) simulations and in many other fields that require analysis of data defined on the sphere. The authors start from S2HAT, a publicly available Fortran 90 library that implements forward and inverse SHTs using MPI-based parallelism across multiple CPU cores. While S2HAT scales reasonably well on a small number of CPUs, its most time-consuming component, the evaluation of associated Legendre functions within the “alm → map” step, cannot exploit the massive parallelism offered by modern graphics processing units (GPUs).
To exploit GPU capabilities, the authors rewrite the inverse SHT kernel in CUDA, preserving the overall algorithmic structure of S2HAT but redesigning data layout and execution patterns to match the GPU memory hierarchy. The input spherical‑harmonic coefficients (alm) are stored in a two‑dimensional layout that maps naturally onto CUDA’s block‑grid organization. Each thread computes the contribution of a subset of (ℓ, m) modes to a specific pixel (θ, φ). Critical to performance is the handling of the Legendre recursion: the authors keep intermediate values in shared memory, apply loop unrolling, and use template‑based inline functions to encourage the compiler to generate highly vectorized code. They also tune block dimensions (e.g., 32 × 8 threads) and monitor register pressure to achieve an occupancy of roughly 80 % per streaming multiprocessor, thereby minimizing idle cycles.
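The mathematical core of this step is the standard three-term recurrence for the associated Legendre functions P_ℓ^m(x). A minimal pure-Python sketch of that recurrence (a naive reference for the mathematics only, not the S2HAT or CUDA implementation; the function name is an illustrative choice) is:

```python
import math

def assoc_legendre(l_max, m, x):
    """Return [P_m^m(x), ..., P_{l_max}^m(x)] via the standard
    upward recurrence in l (unnormalized convention).
    Illustrative sketch only, not the paper's code."""
    # Seed: P_m^m(x) = (-1)^m (2m-1)!! (1 - x^2)^{m/2}
    pmm = 1.0
    somx2 = math.sqrt((1.0 - x) * (1.0 + x))
    fact = 1.0
    for _ in range(m):
        pmm *= -fact * somx2
        fact += 2.0
    out = [pmm]
    if l_max == m:
        return out
    # Second seed: P_{m+1}^m(x) = x (2m+1) P_m^m(x)
    pmmp1 = x * (2 * m + 1) * pmm
    out.append(pmmp1)
    # Upward recurrence:
    # (l - m) P_l^m = (2l-1) x P_{l-1}^m - (l+m-1) P_{l-2}^m
    for l in range(m + 2, l_max + 1):
        pll = ((2 * l - 1) * x * pmmp1 - (l + m - 1) * pmm) / (l - m)
        pmm, pmmp1 = pmmp1, pll
        out.append(pll)
    return out
```

Because each (m, ring) pair runs this recurrence independently, the computation maps naturally onto the thread-per-pixel decomposition described above.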
The second major component of the transform—the Fast Fourier Transform (FFT) over the azimuthal coordinate—is delegated to NVIDIA’s cuFFT library. The authors integrate cuFFT calls seamlessly into the pipeline, converting the intermediate real‑space data to complex format, performing a 1‑D FFT for each latitude ring, and then converting back. Profiling reveals that the FFT stage consumes more than 60 % of the total runtime, establishing it as the dominant performance limiter even after the GPU acceleration of the Legendre step.
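The per-ring synthesis that cuFFT carries out can be written, in naive form, as an inverse discrete Fourier transform over the azimuthal index m. A pure-Python sketch of this idea (illustrative only; the function name and the real-field convention, with coefficients stored for m ≥ 0, are assumptions rather than the paper's code):

```python
import cmath

def synthesize_ring(delta_m, n_phi):
    """Given complex coefficients delta_m[m] (m = 0..mmax) of a
    real-valued ring, return n_phi equispaced pixel values
    f(phi_p) = delta_0 + 2*Re(sum_{m>=1} delta_m * exp(i m phi_p)).
    A naive O(mmax * n_phi) stand-in for the per-ring FFT."""
    out = []
    for p in range(n_phi):
        phi = 2.0 * cmath.pi * p / n_phi
        s = delta_m[0].real
        for m in range(1, len(delta_m)):
            s += 2.0 * (delta_m[m] * cmath.exp(1j * m * phi)).real
        out.append(s)
    return out
```

An FFT performs the same sum in O(n_phi log n_phi) per ring, which is why the library call dominates once the Legendre stage has been accelerated.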
Performance experiments are conducted on an NVIDIA GF100 (Fermi) GPU and an Intel Xeon CPU. The authors vary the maximum harmonic degree ℓmax (up to 4096) and the HEALPix resolution parameter Nside, covering the range of resolutions typical for current CMB analyses. Compared with a single CPU core running the original S2HAT code, the CUDA implementation achieves speed‑ups of up to 18×; when compared with S2HAT executed on four MPI‑distributed CPU cores, the speed‑up remains substantial at up to 5.5×. Memory consumption stays comparable to the original implementation, and scaling with ℓmax remains close to linear, indicating that the GPU version can handle very high‑resolution maps without degradation.
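To put these resolution parameters in context: the pixel count of a HEALPix map follows the standard relation N_pix = 12 · Nside², and the Legendre stage of a recursion-based SHT requires on the order of ℓmax³ operations. These are standard scaling facts, not figures taken from the paper:

```python
def healpix_npix(nside):
    """Number of pixels in a HEALPix map: N_pix = 12 * Nside^2
    (standard HEALPix relation)."""
    return 12 * nside * nside

# Illustrative sizes for a high-resolution CMB analysis
nside, lmax = 2048, 4096
npix = healpix_npix(nside)   # 12 * 2048^2 = 50,331,648 pixels
legendre_ops = lmax ** 3     # ~7e10 operations, order of magnitude only
```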
The authors discuss the implications of these results. The dramatic acceleration of the Legendre recursion demonstrates that GPU architectures are well‑suited to the highly arithmetic‑intensive, data‑parallel nature of SHT. However, the FFT bottleneck suggests that further gains will require either a custom GPU‑optimized FFT implementation or the use of newer GPU features such as asynchronous streams and multi‑GPU cooperation to overlap data transfers with computation. The paper also notes that the code has been released publicly, encouraging adoption in other domains—such as geophysics, medical imaging, and computer graphics—where spherical harmonic analysis is relevant.
In conclusion, the work provides a concrete, well‑documented pathway to port a widely used scientific library to modern heterogeneous hardware. By preserving the algorithmic integrity of S2HAT while applying a suite of CUDA‑specific optimizations (coalesced memory accesses, shared‑memory staging, loop unrolling, occupancy tuning), the authors achieve an order‑of‑magnitude reduction in execution time for inverse spherical harmonic transforms. The study highlights both the opportunities (massive parallelism for Legendre evaluations) and the remaining challenges (FFT dominance) that will shape future efforts to fully harness GPUs for spherical data processing.