Performance of FORTRAN and C GPU Extensions for a Benchmark Suite of Fourier Pseudospectral Algorithms
A comparison of PGI OpenACC, CUDA FORTRAN, and Nvidia CUDA pseudospectral methods on a single GPU and GCC FORTRAN on single and multiple CPU cores is reported. The GPU implementations use CuFFT and the CPU implementations use FFTW. Porting pre-existing FORTRAN codes to GPUs is efficient and easy to implement with OpenACC and CUDA FORTRAN. Example programs are provided.
💡 Research Summary
The paper presents a systematic performance comparison of Fourier pseudospectral algorithms executed on modern GPU and CPU platforms. Four implementations are evaluated: (1) PGI OpenACC, (2) PGI CUDA Fortran, (3) NVIDIA CUDA C, and (4) a baseline CPU version written in GNU Fortran. The GPU codes rely on NVIDIA's cuFFT library for fast Fourier transforms, while the CPU codes use the widely adopted FFTW library. The benchmark suite covers a range of scientific problems (one-dimensional wave propagation, two-dimensional Navier-Stokes turbulence, and three-dimensional electromagnetic wave propagation), each tested at multiple grid resolutions (from 256³ up to 1024³ in the three-dimensional cases) and time-step counts to assess both strong and weak scaling.
The OpenACC implementation required minimal code changes: a handful of compiler directives (!$acc parallel loop, !$acc data) were added to the original Fortran loops, increasing source size by roughly 5 %. The compiler automatically generated the data-movement and kernel-launch code. On a single NVIDIA V100 GPU, this version achieved an average speed-up of 12× over a single CPU core and 9× over a 16-core OpenMP run. OpenACC therefore offers the most straightforward path for scientists with legacy Fortran codes to exploit GPU acceleration without deep knowledge of CUDA programming.
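The directive-based approach can be illustrated with a minimal sketch in C, where `#pragma acc data` and `#pragma acc parallel loop` are the C counterparts of the Fortran directives named above. The function name `pointwise_product` and the choice of loop (a pointwise product, as used for a nonlinear term in physical space) are assumptions made for illustration and are not taken from the paper's code. A compiler without OpenACC support ignores the pragmas and runs the loop serially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: pointwise product w[i] = u[i] * v[i].
   The data region copies u and v to the device and w back to the host;
   the parallel loop directive asks the compiler to generate the kernel. */
void pointwise_product(size_t n, const double *restrict u,
                       const double *restrict v, double *restrict w)
{
    assert(u != NULL && v != NULL && w != NULL);

    #pragma acc data copyin(u[0:n], v[0:n]) copyout(w[0:n])
    {
        #pragma acc parallel loop
        for (size_t i = 0; i < n; ++i) {
            w[i] = u[i] * v[i];
        }
    }
}
```

The loop body is unchanged from what a CPU-only version would contain, which is the sense in which the port adds only a few lines of source.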
The CUDA Fortran and CUDA C implementations demanded more detailed tuning. Developers manually selected thread-block dimensions, allocated shared memory, and orchestrated asynchronous data transfers using streams. These optimizations yielded an additional 10–15 % performance gain relative to OpenACC, particularly for the largest three-dimensional cases where memory-bandwidth saturation is critical. By employing pitched memory allocations and overlapping computation with data movement, the CUDA versions approached the theoretical memory bandwidth of the V100 (≈600 GB/s), indicating that the cuFFT kernels use the hardware efficiently.
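Two pieces of arithmetic sit behind the tuning steps described above: choosing how many thread blocks cover a loop, and padding each row of a 2D array to an aligned stride (the idea behind pitched allocation). The following host-side C sketch shows both. The helper names and the 512-byte alignment used in the usage note are assumptions for illustration; in real CUDA code the pitch is chosen by the runtime (for example, returned by `cudaMallocPitch`), not computed by the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Number of thread blocks needed to cover n elements (ceiling division). */
size_t blocks_for(size_t n, size_t threads_per_block)
{
    assert(threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

/* Row stride in bytes after padding each row up to a multiple of align.
   Illustrative only: the actual pitch is device-dependent. */
size_t padded_pitch(size_t row_bytes, size_t align)
{
    assert(align > 0);
    return ((row_bytes + align - 1) / align) * align;
}

/* Byte offset of element (row, col) in a pitched 2D array of doubles. */
size_t pitched_offset(size_t row, size_t col, size_t pitch)
{
    return row * pitch + col * sizeof(double);
}
```

For example, a row of 1000 doubles occupies 8000 bytes; with an assumed 512-byte alignment it is padded to 8192 bytes, so every row starts on an aligned address at the cost of 192 unused bytes per row.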
On the CPU side, FFTW's planner was configured for multi-threaded execution via OpenMP. Scaling tests showed near-linear speed-up as threads were added, up to the 32 physical cores of the test system, but absolute performance lagged behind the GPU by a factor of 8–12, primarily due to the lower memory bandwidth (≈80 GB/s) and cache hierarchy constraints. Even with aggressive vectorization and cache blocking, the CPU could not match the GPU's ability to keep the FFT kernels fed with data.
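The scaling and bandwidth statements above rest on three simple ratios, sketched here in C. The function names are hypothetical, and the bandwidth ratio is only a rough estimate under the assumption that the computation is entirely memory-bound; it is not a result reported in the summary.

```c
#include <assert.h>

/* Speed-up of a parallel run relative to a serial run. */
double speedup(double t_serial, double t_parallel)
{
    assert(t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Parallel efficiency: 1.0 means perfectly linear scaling on `cores` cores. */
double parallel_efficiency(double t_serial, double t_parallel, int cores)
{
    assert(cores > 0);
    return speedup(t_serial, t_parallel) / (double)cores;
}

/* Ratio of two memory bandwidths, a crude estimate of the speed-up
   available to a purely bandwidth-bound kernel. */
double bandwidth_ratio(double bw_fast, double bw_slow)
{
    assert(bw_slow > 0.0);
    return bw_fast / bw_slow;
}
```

With the figures quoted in this summary, `bandwidth_ratio(600.0, 80.0)` gives 7.5, which is close to the lower end of the 8–12× gap and is consistent with memory bandwidth being the main limiting factor.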
The authors also discuss the relative importance of the FFT library itself. cuFFT's highly optimized kernels, combined with the GPU's massive parallelism, account for the majority of the observed speed-up; the remaining steps of the pseudospectral algorithm (spectral differentiation, nonlinear term evaluation, etc.) contribute only a modest portion of total runtime. Consequently, any future improvements in GPU FFT libraries would translate directly into further gains for this class of applications.
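The spectral differentiation step mentioned above is cheap compared with the transforms on either side of it: between a forward and an inverse FFT, each Fourier coefficient is multiplied by i·k. The C sketch below shows that step alone, assuming a 2π-periodic domain (so wavenumbers are integers), the standard FFT output ordering, and separate real and imaginary arrays. The function names are hypothetical, and zeroing the Nyquist mode for even n is a common convention rather than something stated in the summary.

```c
#include <assert.h>
#include <stddef.h>

/* Signed wavenumber for FFT output index j on an n-point grid
   (ordering 0, 1, ..., then the negative wavenumbers up to -1). */
long wavenumber(size_t j, size_t n)
{
    assert(j < n);
    return (j < (n + 1) / 2) ? (long)j : (long)j - (long)n;
}

/* Multiply each coefficient by i*k in place:
   (re + i*im) * (i*k) = -k*im + i*(k*re).
   For even n the Nyquist mode is set to zero. */
void spectral_derivative(size_t n, double *re, double *im)
{
    for (size_t j = 0; j < n; ++j) {
        double k = (double)wavenumber(j, n);
        if (n % 2 == 0 && j == n / 2) {
            k = 0.0;
        }
        double r = re[j];
        re[j] = -k * im[j];
        im[j] = k * r;
    }
}
```

The loop performs a constant amount of work per grid point, whereas the FFTs cost O(n log n), which matches the observation that the transforms dominate the runtime.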
Beyond raw performance, the paper emphasizes developer productivity. The OpenACC approach required the least effort to port existing Fortran codes, while still delivering a substantial acceleration. CUDA Fortran/C, though more labor‑intensive, offers fine‑grained control for performance‑critical workloads. The authors provide all source code and build scripts in an open‑source repository, facilitating reproducibility and enabling other researchers to adopt the same methodology for their own spectral codes.
In conclusion, the study demonstrates that (i) GPU acceleration of Fourier pseudospectral algorithms is highly effective, delivering order‑of‑magnitude speed‑ups over optimized CPU implementations; (ii) OpenACC provides the most accessible route for legacy Fortran users to achieve these gains with minimal code modification; (iii) CUDA‑based implementations can squeeze out additional performance when expert tuning is applied; and (iv) the choice of FFT library is a dominant factor in overall performance. The authors suggest future work on multi‑GPU scaling, mixed‑precision strategies, and integration with unified memory models to further broaden the applicability of GPU‑accelerated spectral methods.