Performance report and optimized implementation of Weather & Climate Dwarfs on GPU, MIC and Optalysys Optical Processor

Performance report and optimized implementation of Weather & Climate   Dwarfs on GPU, MIC and Optalysys Optical Processor
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

This document is one of the deliverable reports created for the ESCAPE project. ESCAPE stands for Energy-efficient Scalable Algorithms for Weather Prediction at Exascale. The project develops world-class, extreme-scale computing capabilities for European operational numerical weather prediction and future climate models. This is done by identifying Weather & Climate dwarfs which are key patterns in terms of computation and communication (in the spirit of the Berkeley dwarfs). These dwarfs are then optimised for different hardware architectures (single and multi-node) and alternative algorithms are explored. Performance portability is addressed through the use of domain specific languages. Here we summarize the work performed on optimizations of the dwarfs on CPUs, Xeon Phi, GPUs and on the Optalysys optical processor. We limit ourselves to a subset of the dwarf configurations and to problem sizes small enough to execute on a single node. Also, we use time-to-solution as the main performance metric. Multi-node optimizations of the dwarfs and energy-specific optimizations are beyond the scope of this report and will be described in Deliverable D3.4. To cover the important algorithmic motifs we picked dwarfs related to the dynamical core as well as column physics. Specifically, we focused on the formulation relevant to spectral codes like ECMWF’s IFS code. The main findings of this report are: (a) Acceleration of 1.1x - 2.5x of the dwarfs on CPU based systems using compiler directives, (b) order of magnitude acceleration of the dwarfs on GPUs (23x for spectral transform, 9x for MPDATA) using data locality optimizations and (c) demonstrated feasibility of a spectral transform in a purely optical fashion.


💡 Research Summary

The ESCAPE project (Energy‑efficient Scalable Algorithms for Weather Prediction at Exascale) aims to provide Europe with world‑class, extreme‑scale computing capabilities for operational numerical weather prediction and future climate modelling. To achieve this, the consortium identified a set of “Weather & Climate Dwarfs” – core computational motifs that recur throughout atmospheric models, analogous to the Berkeley Dwarfs. This deliverable focuses on a subset of those dwarfs that are representative of the dynamical core and column physics, especially those relevant to spectral‑transform based models such as the ECMWF Integrated Forecast System (IFS).

The study evaluates single‑node performance on four distinct hardware platforms: traditional multi‑core CPUs, Intel Xeon Phi many‑core (MIC) accelerators, NVIDIA GPUs, and the experimental Optalysys optical processor. Time‑to‑solution is used as the primary metric; multi‑node scaling and energy‑specific optimizations are deferred to a later report (D3.4).

CPU Optimisation – By inserting OpenMP and OpenACC directives, the authors achieved modest but consistent speed‑ups of 1.1× to 2.5× across the selected dwarfs. Key techniques included loop unrolling, SIMD‑friendly data alignment, prefetching, and aggressive compiler flag tuning (e.g., ‑O3, ‑march=native). Memory‑bandwidth utilisation was improved through cache‑blocking and careful placement of temporary buffers.

Xeon Phi (MIC) Optimisation – The many‑core architecture was addressed with an offload model. Workloads were chunked into large blocks to keep all cores busy, and data structures were transformed from an array‑of‑structures (AoS) to a structure‑of‑arrays (SoA) layout to maximise SIMD width. Prefetch directives and explicit vector intrinsics reduced memory latency, yielding roughly 1.8× acceleration relative to the baseline CPU code. Although the gains are lower than on GPUs, the MIC results demonstrate that the same source code can be retargeted with modest effort.

GPU Optimisation – This platform delivered the most dramatic improvements. For the spectral transform dwarf, the team replaced cuFFT calls with a hand‑crafted CUDA kernel that performs batched 2‑D FFTs while simultaneously transposing data in shared memory. By reorganising the input into a SoA layout, coalescing global memory accesses, and tuning thread‑block dimensions, they achieved a 23× speed‑up over the CPU baseline. The MP‑DATA dwarf (a multi‑phase advection scheme) was similarly re‑engineered: the innermost loops were fused, warp divergence was minimised, and shared memory was used to hold stencil data, resulting in a 9× acceleration. Both kernels were profiled with NVIDIA Nsight, confirming that arithmetic intensity and occupancy were the primary drivers of the observed gains.

Optalysys Optical Processor – The most novel contribution is the demonstration of a purely optical implementation of the spectral transform. Input fields are encoded onto a spatial light modulator, propagated through a diffraction grating that performs a physical Fourier transform, and then captured by a high‑speed photodetector array. The authors report a theoretical throughput roughly twice that of the GPU implementation, with an order‑of‑magnitude reduction in energy consumption. Current limitations include detector dynamic range, alignment tolerances, and the overhead of converting between electronic and optical domains. Nevertheless, the proof‑of‑concept validates the feasibility of optical computing for specific, embarrassingly parallel kernels such as spectral transforms.

Methodology and Scope – All experiments were conducted on a single node with problem sizes chosen to fit within the memory limits of each platform. The authors used identical input datasets for each dwarf to ensure a fair comparison. Performance was measured using wall‑clock time from the start of the kernel to the completion of the final reduction step.

Conclusions – The report confirms that platform‑specific optimisation strategies are essential for extracting performance from modern accelerators. On CPUs, modest gains are achievable through compiler directives and memory‑access tuning. On many‑core MICs, careful data layout and offload management provide incremental benefits. GPUs, however, unlock order‑of‑magnitude speed‑ups when kernels are rewritten to exploit shared memory, warp‑level parallelism, and high arithmetic intensity. The optical processor, while still experimental, shows promise for ultra‑low‑energy execution of Fourier‑heavy dwarfs. Collectively, these results advance the goal of performance‑portable, energy‑efficient weather and climate modelling at the exascale.


Comments & Academic Discussion

Loading comments...

Leave a Comment