Projections of achievable performance for Weather & Climate Dwarfs, and for entire NWP applications, on hybrid architectures
This document is one of the deliverable reports created for the ESCAPE project. ESCAPE stands for Energy-efficient Scalable Algorithms for Weather Prediction at Exascale. The project develops world-class, extreme-scale computing capabilities for European operational numerical weather prediction and future climate models. This is done by identifying Weather & Climate dwarfs, which are key patterns in terms of computation and communication (in the spirit of the Berkeley dwarfs). These dwarfs are then optimised for different hardware architectures (single and multi-node) and alternative algorithms are explored. Performance portability is addressed through the use of domain specific languages. This deliverable describes the performance and energy models for the selected Weather & Climate dwarfs on different hardware architectures, in particular multi-node systems with GPU accelerators. The presented performance models extend the model provided in Deliverable 3.2. With some further enhancements, they are incorporated into the DCworms simulator. In particular, the extended models predict computational and energy performance on different architectures: single- and multi-node systems equipped with CPUs and GPU accelerators. This enables feasible performance projections at system scale.
💡 Research Summary
The D3.5 deliverable of the ESCAPE project presents a comprehensive methodology for predicting both performance and energy consumption of key “dwarf” kernels that underpin modern weather and climate models when executed on hybrid CPU‑GPU supercomputers. Building on the single‑CPU model introduced in Deliverable 3.2, the authors extend the framework to (i) multi‑node environments using MPI, (ii) three generations of Nvidia GPUs (Fermi, Kepler, Maxwell), and (iii) an energy model that distinguishes between processor (PKG) and DRAM power.
The paper first reviews the architecture of Nvidia GPUs, emphasizing the SIMT execution model, the grouping of threads into warps executed on streaming multiprocessors (SMs), and the hierarchy of memory (registers, shared, L1/L2, global). Using detailed profiling (CUDA counters, instruction mix, memory coalescing statistics), the authors construct Roofline models for each GPU generation. These models map each dwarf kernel—ACRANEB2 (radiation), Spherical Harmonics (global spectral transforms), and BiFFT (limited‑area FFT)—onto the performance space defined by operational intensity (flops/byte) and hardware limits (peak bandwidth, peak FP64 throughput). The Roofline plots (Figures 3‑9) reveal that ACRANEB2 is typically compute‑bound on Kepler/Maxwell, while BiFFT is memory‑bound on Fermi, guiding kernel‑level optimizations such as block size, register pressure, and shared‑memory usage.
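The core of a Roofline characterization is a single formula: attainable performance is the minimum of the peak compute rate and the product of operational intensity and peak memory bandwidth. The sketch below illustrates this; the peak numbers are illustrative, roughly K20-class placeholders, not the measured values from the deliverable.

```python
# Minimal Roofline sketch: attainable performance as a function of
# operational intensity. Peak figures are illustrative placeholders
# (roughly Kepler/K20-class), not values from the deliverable.

def roofline_attainable(intensity_flop_per_byte, peak_gflops, peak_gbps):
    """Attainable GFLOP/s = min(peak compute, intensity * peak bandwidth)."""
    return min(peak_gflops, intensity_flop_per_byte * peak_gbps)

peak_gflops, peak_gbps = 1310.0, 208.0          # illustrative FP64 peak, GB/s
ridge = peak_gflops / peak_gbps                  # intensity where the roof flattens

for oi in (0.5, 2.0, 10.0):
    perf = roofline_attainable(oi, peak_gflops, peak_gbps)
    bound = "compute" if oi >= ridge else "memory"
    print(f"OI={oi:>5.1f} flop/byte -> {perf:7.1f} GFLOP/s ({bound}-bound)")
```

A kernel whose measured operational intensity falls left of the ridge point (here about 6.3 flop/byte) is memory-bound, which is the distinction the deliverable uses to separate, e.g., BiFFT on Fermi from ACRANEB2 on Kepler/Maxwell.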
Next, the multi‑node extension incorporates MPI communication costs. For Spherical Harmonics and BiFFT, the authors quantify per‑node compute work, halo exchange volume, and latency/bandwidth of the interconnect. They embed these parameters into the performance model, allowing the prediction of strong‑ and weak‑scaling behavior as the node count grows. Overlap of communication with computation is shown to mitigate the scaling penalty, and the model accurately reproduces measured runtimes up to 48 MPI tasks with less than 10 % error.
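The shape of such a multi-node model can be sketched as follows. This is a simplified latency/bandwidth communication term added to the per-node compute time, with an optional perfect-overlap variant; the parameter values and the exact functional form are illustrative assumptions, not the deliverable's calibrated model.

```python
# Hedged sketch of an MPI-aware time-per-step model: per-node compute
# plus a latency/bandwidth halo-exchange cost, with optional overlap.
# All parameter values below are illustrative, not measured.

def step_time(total_flop, node_gflops, halo_bytes, latency_s, bw_gbps,
              nodes, overlap=False):
    """Predicted wall time for one step on `nodes` nodes."""
    t_comp = total_flop / (nodes * node_gflops * 1e9)       # compute phase
    t_comm = latency_s + halo_bytes / (bw_gbps * 1e9)       # halo exchange
    # Perfect overlap hides the shorter phase; otherwise the phases add.
    return max(t_comp, t_comm) if overlap else t_comp + t_comm

# Strong-scaling sweep: compute shrinks with node count, communication
# does not, so efficiency degrades unless overlap hides the exchange.
for n in (1, 4, 16, 48):
    t = step_time(1e12, 100.0, 1e8, 1e-6, 10.0, nodes=n)
    print(f"{n:>2} nodes: {t:.3f} s")
```

Fitting the latency, bandwidth, and halo-volume parameters to measurements is what allows a model of this shape to reproduce strong- and weak-scaling behavior, as the deliverable reports up to 48 MPI tasks.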
The energy model is built from empirical power measurements on Intel Xeon E5‑2697v3 CPUs and Nvidia GeForce 970, Tesla K20m, and Tesla 2070‑q GPUs. Separate coefficients for package (PKG) and DRAM are derived (Tables 34‑38). By feeding the per‑kernel execution time and hardware utilization into these coefficients, the authors obtain energy‑to‑solution estimates for each dwarf. The results demonstrate that GPU acceleration can reduce energy consumption by 30‑45 % for the same time‑to‑solution, especially when double‑precision performance is required.
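The structure of a split-component energy model can be sketched as below: energy-to-solution is the product of runtime and the sum of per-component power draws. The coefficients here are illustrative placeholders, not the measured values from Tables 34‑38.

```python
# Hedged sketch of a split PKG/DRAM(/accelerator) energy model.
# Power coefficients are illustrative placeholders, not the values
# derived in the deliverable's tables.

def energy_to_solution(runtime_s, p_pkg_w, p_dram_w, p_accel_w=0.0):
    """Energy [J] = (PKG + DRAM [+ accelerator]) power * runtime."""
    return (p_pkg_w + p_dram_w + p_accel_w) * runtime_s

# Illustrative comparison: a GPU-accelerated run can draw more power
# while active yet still consume less energy if it finishes sooner.
e_cpu = energy_to_solution(runtime_s=100.0, p_pkg_w=120.0, p_dram_w=30.0)
e_gpu = energy_to_solution(runtime_s=40.0, p_pkg_w=60.0, p_dram_w=20.0,
                           p_accel_w=150.0)
print(f"CPU-only: {e_cpu:.0f} J, CPU+GPU: {e_gpu:.0f} J")
```

Separating PKG and DRAM coefficients matters because memory-bound dwarfs keep DRAM power high even when the package is lightly loaded, which a single node-level coefficient would miss.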
All models are integrated into the DCworms simulator, which can compose an entire NWP workflow as a directed graph of dwarfs running on heterogeneous resources. The paper showcases a representative workflow where BiFFT runs on CPUs while ACRANEB2 runs on GPUs, and presents system‑scale projections (Figures 29‑30) for various hardware mixes. The simulator predicts that a hybrid Xeon + GeForce 970 system achieves a 1.8× speed‑up and a 30 % energy saving compared with a CPU‑only configuration at the same node count.
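Composing a workflow as a directed graph of dwarfs can be illustrated with a critical-path estimate over stage times, in the spirit of how DCworms schedules a mixed CPU/GPU workflow. The stage names, graph, and timings below are illustrative placeholders, not DCworms' actual API or the deliverable's measured workflow.

```python
# Hedged sketch: an NWP workflow as a DAG of dwarf stages, with the
# makespan estimated from the critical (longest) path. Stage names,
# edges, and times are illustrative placeholders only.
import functools

EDGES = {                       # stage -> downstream stages
    "input":    ["BiFFT", "ACRANEB2"],
    "BiFFT":    ["output"],     # e.g. assigned to CPU nodes
    "ACRANEB2": ["output"],     # e.g. assigned to GPU nodes
    "output":   [],
}
TIME_S = {"input": 0.0, "BiFFT": 3.0, "ACRANEB2": 2.0, "output": 0.5}

@functools.lru_cache(maxsize=None)
def time_from(stage):
    """Critical-path time from `stage` to the end of the workflow."""
    downstream = max((time_from(s) for s in EDGES[stage]), default=0.0)
    return TIME_S[stage] + downstream

print(f"estimated makespan: {time_from('input'):.1f} s")
```

Because the two dwarf stages run on disjoint resources, the makespan is governed by the slower branch, which is the kind of system-scale trade-off the simulator's hardware-mix projections explore.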
In the conclusions, the authors stress that (1) the Roofline‑based characterization provides a portable way to identify bottlenecks across GPU generations, (2) the MPI‑aware performance model enables reliable scaling predictions, (3) the split PKG/DRAM energy model offers actionable insight for energy‑efficient scheduling, and (4) DCworms serves as a practical tool for architects and domain scientists to evaluate future exascale weather‑prediction platforms. Future work includes extending the framework to emerging accelerators such as Intel Xeon Phi, automating model calibration, and validating the approach on production‑grade operational runs.
Overall, the deliverable delivers a rigorously validated, end‑to‑end performance‑energy modeling stack that bridges algorithmic dwarfs, hardware characteristics, and system‑scale simulation, thereby supporting the design of next‑generation, energy‑aware weather and climate forecasting infrastructures.