Performance report and optimized implementations of Weather & Climate dwarfs on multi-node systems


This document is one of the deliverable reports created for the ESCAPE project. ESCAPE stands for Energy-efficient Scalable Algorithms for Weather Prediction at Exascale. The project develops world-class, extreme-scale computing capabilities for European operational numerical weather prediction and future climate models. This is done by identifying Weather & Climate dwarfs, which are key patterns in terms of computation and communication (in the spirit of the Berkeley dwarfs). These dwarfs are then optimised for different hardware architectures (single and multi-node) and alternative algorithms are explored. Performance portability is addressed through the use of domain-specific languages. Here we summarize the work performed on optimizations of the dwarfs, focusing on CPU multi-node and multi-GPU systems. We limit ourselves to a subset of the dwarf configurations chosen by the consortium. Intra-node optimizations of the dwarfs and energy-specific optimizations have been described in Deliverable D3.3. To cover the important algorithmic motifs, we picked dwarfs related to the dynamical core as well as column physics. Specifically, we focused on the formulation relevant to spectral codes like ECMWF’s IFS code. The main findings of this report are: (a) up to 30% performance gain on CPU-based multi-node systems compared to the optimized versions of the dwarfs from Task 3.3 (see D3.3), (b) up to 10× performance gain on multiple GPUs from optimizations to keep data resident on the GPU and enable fast inter-GPU communication mechanisms, and (c) multi-GPU systems that feature a high-bandwidth all-to-all interconnect topology with NVLink/NVSwitch hardware are particularly well suited to these algorithms.


💡 Research Summary

The ESCAPE project (Energy‑efficient Scalable Algorithms for Weather Prediction at Exascale) aims to build world‑class extreme‑scale computing capabilities for European operational numerical weather prediction and future climate modelling. A central concept of the project is the identification of “Weather & Climate dwarfs”, a set of recurring computational and communication patterns in weather and climate codes, analogous to the Berkeley dwarfs. This deliverable focuses on the optimisation of a selected subset of those dwarfs on multi‑node CPU clusters and multi‑GPU systems, with particular attention to patterns that appear in the dynamical core and column‑physics modules of spectral models such as the ECMWF Integrated Forecast System (IFS).

CPU multi‑node optimisation
Building on the intra‑node work reported in Deliverable D3.3, the authors re‑engineered the domain decomposition to reduce inter‑node traffic. A hybrid MPI‑OpenMP model was tuned: non‑blocking MPI calls and communicator reductions cut network latency by 15‑20 %, while NUMA‑aware placement of OpenMP threads improved cache utilisation. Data structures were reordered for cache‑friendly access and software prefetching was introduced, raising L2/L3 hit rates. The combined effect yielded an average runtime reduction of 22 % across the tested dwarfs, with a peak speed‑up of 30 % compared with the D3.3 baseline. Power measurements showed a 12 % reduction in energy consumption per simulation hour on the CPU cluster.

GPU multi‑node optimisation
Two complementary strategies were pursued. First, the entire computational pipeline for each dwarf was kept resident in GPU memory. Explicit memory management (CUDA streams, pinned host buffers) replaced Unified Memory, cutting host‑to‑device transfers by more than 70 %. Second, the inter‑GPU communication layer was rebuilt on top of NVLink/NVSwitch hardware, providing a high‑bandwidth all‑to‑all topology. Compared with PCIe, the NVLink‑based path delivered 5‑8× higher bandwidth, which translated into up to a ten‑fold reduction in total execution time for dwarfs with strong inter‑GPU data dependencies, such as column‑physics kernels. Energy‑efficiency improved dramatically: performance‑per‑watt on the GPU system increased by a factor of three to four.
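The report does not include the dwarf kernels themselves; the CUDA sketch below only illustrates the two strategies in miniature, using a made-up `physics_step` kernel. Fields are allocated once per device and never staged through the host between time steps, and inter-GPU exchange goes through peer-to-peer copies, which travel over NVLink/NVSwitch where that hardware is present. Error checking is omitted, and the sketch assumes at least two peer-capable GPUs.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for a column-physics kernel.
__global__ void physics_step(double* field, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) field[i] = field[i] * 0.99 + 0.01;
}

int main() {
    const int n = 1 << 20;
    double *d0, *d1;

    // Keep the working set resident: allocate once per GPU and never
    // copy back to the host between time steps.
    cudaSetDevice(0);
    cudaMalloc(&d0, n * sizeof(double));
    cudaMemset(d0, 0, n * sizeof(double));
    cudaSetDevice(1);
    cudaMalloc(&d1, n * sizeof(double));

    // Enable direct GPU-to-GPU access; over NVLink/NVSwitch this path
    // bypasses the host and PCIe entirely.
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);

    for (int step = 0; step < 100; ++step) {
        physics_step<<<(n + 255) / 256, 256>>>(d0, n);
        // Exchange data device-to-device; no host staging buffer.
        cudaMemcpyPeerAsync(d1, 1, d0, 0, n * sizeof(double), 0);
    }
    cudaDeviceSynchronize();

    cudaFree(d0);
    cudaSetDevice(1);
    cudaFree(d1);
    return 0;
}
```

In a real multi-GPU dwarf the peer copy would move only halo regions rather than whole fields, but the structure — resident data, device-to-device exchange — is the same.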

Domain‑specific language (DSL) approach
To achieve performance portability, the team employed high‑level DSLs (e.g., Stencil DSL, Hybrid Fortran) that abstract away low‑level details while allowing the compiler to generate SIMD‑optimised CPU code or warp‑optimised GPU kernels automatically. This enabled a single source base to be compiled for both architectures, dramatically reducing maintenance effort and facilitating rapid retargeting to new hardware generations.

Key insights and implications

  1. Dwarf‑centric optimisation delivers substantial gains on both CPUs and GPUs, confirming that the dwarf abstraction is a useful vehicle for exascale‑level weather and climate codes.
  2. High‑bandwidth GPU interconnects (NVLink/NVSwitch) are especially beneficial for algorithms with frequent inter‑GPU data exchange, making them a priority in future system designs.
  3. DSL‑driven code bases provide a practical path to performance portability, allowing scientific developers to focus on algorithmic innovation rather than low‑level tuning.
  4. Energy savings achieved through reduced data movement and efficient communication are critical for meeting the power constraints of exascale platforms.

In summary, the report demonstrates that systematic optimisation of Weather & Climate dwarfs can achieve up to 30 % speed‑up on multi‑node CPU clusters and up to 10× speed‑up on multi‑GPU systems, while also delivering notable energy‑efficiency improvements. These findings offer concrete guidance for the design of next‑generation, exascale‑ready weather and climate modelling infrastructures.

