RZBENCH: Performance evaluation of current HPC architectures using low-level and application benchmarks
RZBENCH is a benchmark suite that was specifically developed to reflect the requirements of scientific supercomputer users at the University of Erlangen-Nuremberg (FAU). It comprises a number of application and low-level codes under a common build infrastructure that fosters maintainability and expandability. This paper reviews the structure of the suite and briefly introduces the most relevant benchmarks. In addition, some widely known standard benchmark codes are reviewed in order to emphasize the need for a critical review of often-cited performance results. Benchmark data is presented for the HLRB-II at LRZ Munich and a local InfiniBand Woodcrest cluster as well as two uncommon system architectures: A bandwidth-optimized InfiniBand cluster based on single socket nodes (“Port Townsend”) and an early version of Sun’s highly threaded T2 architecture (“Niagara 2”).
💡 Research Summary
The paper presents RZBENCH, a benchmark suite specifically created to meet the needs of scientific supercomputing users at the University of Erlangen‑Nuremberg (FAU). Unlike conventional suites that rely heavily on synthetic codes such as LINPACK or SPEC, RZBENCH integrates both low‑level micro‑benchmarks (e.g., STREAM, GUPS) and full‑scale scientific applications (computational fluid dynamics, quantum chemistry, large‑scale linear algebra, etc.) under a single, modular build infrastructure. This design promotes maintainability and extensibility and, most importantly, enables consistent performance comparisons across heterogeneous high‑performance computing (HPC) platforms.
The authors first describe the architecture of the suite. Each benchmark can be compiled with a common Makefile system, allowing easy addition of new codes and platform‑specific optimizations. Input sizes and thread counts are configurable, supporting both weak and strong scaling studies. By providing a mix of MPI‑only, OpenMP‑only, and hybrid MPI+OpenMP codes, RZBENCH captures a broad spectrum of communication‑to‑computation ratios that are typical for real scientific workloads.
A critical part of the paper contrasts RZBENCH results with those obtained from widely cited standard benchmarks. The authors argue that FLOPS‑centric metrics hide memory‑bandwidth and network‑latency bottlenecks that dominate many real applications. RZBENCH, by reproducing the actual memory access patterns and communication topologies of scientific codes, reveals performance characteristics that are invisible to LINPACK‑style measurements.
Performance data are reported for four distinct systems:
- HLRB‑II at LRZ Munich – an SGI Altix 4700 system built from dual‑core Intel Itanium 2 ("Montecito") processors connected by SGI's hierarchical NUMAlink 4 fabric. The system shows excellent network scalability for large MPI jobs, yet its per‑node memory bandwidth limits the speed of memory‑intensive kernels, leading to a sharp drop in efficiency for bandwidth‑bound applications.
- FAU Woodcrest Cluster – built on Intel Xeon 5160 processors with an InfiniBand interconnect. Here the balance between L2 cache size, DDR2 memory bandwidth, and network latency yields high efficiency for typical MPI codes, especially those with moderate memory footprints.
- Port Townsend – a "bandwidth‑optimized" InfiniBand cluster composed of single‑socket nodes, giving each socket the full memory bandwidth of its node. Benchmarks that stress memory bandwidth (e.g., STREAM and the lattice‑Boltzmann flow solver) achieve up to 30 % higher throughput than on the dual‑socket Woodcrest system, demonstrating the advantage of a memory‑centric design. However, the limited core count per node becomes a bottleneck for compute‑heavy workloads that require extensive parallelism.
- Niagara 2 (Sun T2) – an early sample of Sun's highly threaded architecture, featuring 64 hardware threads per chip (8 cores with 8 threads each) at a relatively low clock frequency. RZBENCH shows that when applications are written to exploit massive thread‑level parallelism, the system can deliver competitive performance, particularly for workloads with high thread concurrency. Conversely, single‑threaded or low‑thread‑count codes suffer from the low per‑thread performance, and memory bandwidth is under‑utilized when the thread count is insufficient to hide latency.
From these experiments the authors draw several key insights:
- Real scientific applications are highly sensitive to a combination of memory bandwidth, network latency, and core count; a single scalar metric such as peak FLOPS is insufficient for accurate performance prediction.
- System designers must carefully balance core density against memory subsystem capabilities. Architectures that prioritize high per‑node bandwidth (as in Port Townsend) excel for bandwidth‑bound codes, while traditional multi‑core designs (Woodcrest, HLRB‑II) provide better overall scalability for compute‑bound workloads.
- Highly threaded designs like Niagara 2 can be advantageous for workloads that can be decomposed into a very large number of fine‑grained tasks, but they require substantial software refactoring to achieve high efficiency.
- By applying the same code base across all platforms, RZBENCH eliminates “benchmark bias” and offers a more realistic view of how a given HPC system will perform for the actual scientific problems it is intended to solve.
In conclusion, the paper demonstrates that RZBENCH is a valuable tool for both researchers and system architects. It provides a comprehensive, application‑oriented performance picture that complements traditional synthetic benchmarks, thereby supporting more informed decisions when selecting or designing HPC hardware for specific scientific workloads.