RZBENCH: Performance evaluation of current HPC architectures using low-level and application benchmarks

Reading time: 5 minutes
...

📝 Original Info

  • Title: RZBENCH: Performance evaluation of current HPC architectures using low-level and application benchmarks
  • ArXiv ID: 0712.3389
  • Date: 2007-12-21
  • Authors: Not specified in the source data provided. (The actual paper most likely lists researchers from the FAU HPC group and collaborating institutions as authors.)

📝 Abstract

RZBENCH is a benchmark suite that was specifically developed to reflect the requirements of scientific supercomputer users at the University of Erlangen-Nuremberg (FAU). It comprises a number of application and low-level codes under a common build infrastructure that fosters maintainability and expandability. This paper reviews the structure of the suite and briefly introduces the most relevant benchmarks. In addition, some widely known standard benchmark codes are reviewed in order to emphasize the need for a critical review of often-cited performance results. Benchmark data is presented for the HLRB-II at LRZ Munich and a local InfiniBand Woodcrest cluster as well as two uncommon system architectures: A bandwidth-optimized InfiniBand cluster based on single socket nodes ("Port Townsend") and an early version of Sun's highly threaded T2 architecture ("Niagara 2").

💡 Deep Analysis

📄 Full Content

Benchmark rankings are of prime importance in High Performance Computing. Decisions about future procurements are mostly based on results obtained by benchmarking early-access systems. Often, standardized suites like SPEC [5] or the NAS Parallel Benchmarks (NPB) [6] are used because their results are publicly available. The downside is that the mix of requirements for running the standard benchmarks fast is not guaranteed to be in line with the needs of the local users. Even worse, compiler vendors go to great lengths to make their compilers produce tailor-made machine code for well-known code constellations. This does not reflect a real user situation.

For those reasons, the application benchmarks contained in the RZBENCH suite are for the most part widely used by scientists at FAU. They have been adapted to fit into the build framework and produce comprehensible performance numbers for a fixed set of inputs. A central customized makefile provides all the necessary information such as compiler names, library paths, etc. After building the suite, customizable run scripts provide a streamlined user interface through which all required parameters (e.g., numbers of threads/processes and others) can be specified. Where numerical accuracy is an issue, mechanisms for correctness checking have been employed. Output data is produced in “raw” and “cooked” formats, the latter as a mere higher-is-better performance number and the former as the full output of the application. The cooked performance data can then easily be post-processed by scripts and fed into plotting tools or spreadsheets.
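
To make the raw/cooked distinction concrete, the following is a minimal C sketch of what such a benchmark driver might look like. This is not actual RZBENCH code; the kernel, the file name `bench.raw`, and the MFlop/s metric are illustrative assumptions.

```c
/* Hypothetical sketch of the raw/cooked output convention described above.
 * Not actual RZBENCH code; kernel, file name and metric are placeholders. */
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000
#define FLOPS_PER_ITER 2.0        /* one add + one multiply per loop iteration */

static double wall_time(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1.0e-9 * ts.tv_nsec;
}

int main(void)
{
    double *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b);
    for (long i = 0; i < N; ++i) { a[i] = 1.0; b[i] = 2.0; }

    double t0 = wall_time();
    for (long i = 0; i < N; ++i)
        a[i] = a[i] + 1.5 * b[i];          /* simple kernel standing in for an application */
    double runtime = wall_time() - t0;

    /* "raw" output: full result of the run, kept for correctness checking */
    FILE *raw = fopen("bench.raw", "w");
    fprintf(raw, "runtime=%f s  checksum=%f\n", runtime, a[N / 2]);
    fclose(raw);

    /* "cooked" output: a single higher-is-better number, easy to post-process */
    printf("%.1f\n", N * FLOPS_PER_ITER / runtime / 1.0e6);   /* MFlop/s */

    free(a); free(b);
    return 0;
}
```

The single cooked number on stdout is what the run scripts would collect and feed into plotting tools or spreadsheets, while the raw file retains everything needed for correctness checks.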

The suite contains codes from a wide variety of application areas and uses all of the languages and parallelization methods that are important in HPC: C, C++, Fortran 77, Fortran 90, MPI, and OpenMP.
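
As a generic illustration of the hybrid MPI/OpenMP model that several of these codes rely on, a minimal skeleton in C looks like the following (this is not taken from the suite):

```c
/* Minimal hybrid MPI+OpenMP skeleton: each MPI process spawns a team of
 * OpenMP threads. Generic example, not code from the RZBENCH suite. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    #pragma omp parallel
    {
        /* work would be distributed across processes and threads here */
        printf("rank %d of %d, thread %d of %d\n",
               rank, size, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```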

All state-of-the-art HPC systems are nowadays based on dual-core and quad-core processor chips. In this analysis the focus is on standard dual-core chips such as the Intel Montecito and Intel Woodcrest/Conroe processor series. The Intel Clovertown quad-core series is of no interest here, since it merely places two completely separate dual-core chips on the same carrier. We compare these standard technologies with a new architecture, the Sun UltraSPARC T2 (codenamed “Niagara 2”), which might be a first glance at potential future chip designs: a highly threaded server-on-a-chip using many “simple” cores which run at low clock speed but support a large number of threads.
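
Low-level kernels such as a vector triad are a common way to probe how the memory interface of such chips scales with the number of threads. The following OpenMP sketch is a generic example, not the suite's actual low-level benchmark; array size, repeat count, and the byte accounting are arbitrary illustrative choices.

```c
/* Generic OpenMP vector triad for probing how memory bandwidth scales with
 * thread count on multi-core / multi-threaded chips. Illustrative sketch only. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 20000000L
#define REPEAT 10

int main(void)
{
    double *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c), *d = malloc(N * sizeof *d);

    /* first-touch initialization so pages are placed near the threads using them */
    #pragma omp parallel for
    for (long i = 0; i < N; ++i) { b[i] = 1.0; c[i] = 2.0; d[i] = 0.5; a[i] = 0.0; }

    double t0 = omp_get_wtime();
    for (int r = 0; r < REPEAT; ++r) {
        #pragma omp parallel for
        for (long i = 0; i < N; ++i)
            a[i] = b[i] + c[i] * d[i];          /* vector triad */
        if (a[N / 2] < 0.0) printf("dummy\n");  /* keep the compiler from removing the loop */
    }
    double runtime = omp_get_wtime() - t0;

    /* 4 arrays x 8 bytes touched per iteration (write-allocate traffic not counted) */
    printf("%.2f GB/s with %d threads\n",
           REPEAT * N * 4.0 * sizeof(double) / runtime / 1.0e9,
           omp_get_max_threads());

    free(a); free(b); free(c); free(d);
    return 0;
}
```

Running such a kernel with increasing thread counts exposes whether additional cores or hardware threads actually translate into additional usable memory bandwidth.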

Table 1. Specifications of the different compute nodes, sorted according to single-core, single-socket, and single-node properties. The L2 cache sizes marked in boldface refer to shared on-chip caches; all other caches are local to each core.

The SGI Altix 4700 system at LRZ Munich comprises 9728 Intel Itanium2 processor cores integrated into the SGI NUMALink4 network. It is configured as 19 ccNUMA nodes each holding 512 cores and a total of 2 Tbyte of shared memory per partition. The 13 standard nodes are equipped with a single socket per memory channel, while in the six “high density” nodes two sockets, i.e. four cores, have to share a single memory channel. Table 1 presents the single core specifications of the Intel Itanium2 processor used for HLRB II. A striking feature of this processor is its large on-chip L3 cache of 9 Mbyte per core. A more detailed discussion of the Intel Itanium2 architecture is presented in Ref. [8].

The NUMALink4 network provides high bandwidth (3.2 Gbyte/s per direction and link) and low latency (MPI latency can be less than 2 µs); see Fig. 1 for a possible network topology in a small Altix system. However, the network topology implemented does not allow the bisection bandwidth to be kept constant across the system. Even the nominal bisection bandwidth per socket (0.8 Gbyte/s per direction) in a single standard node (256 sockets) falls short of a single point-to-point connection by a factor of four. With the nodes connected in a 2D-torus NUMALink topology, the situation gets even worse. For a more detailed picture of the current network topology status we refer to Ref. [9].
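
Figures like these are typically checked with simple ping-pong microbenchmarks. The following generic C/MPI sketch (not part of RZBENCH; message size and iteration count are arbitrary) estimates the effective point-to-point bandwidth between two processes; run with very small messages, the same loop yields a latency estimate.

```c
/* Generic MPI ping-pong sketch for estimating point-to-point bandwidth
 * (and, with small messages, latency) between two processes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NMSG  1000
#define BYTES (1 << 20)   /* 1 MiB: bandwidth regime; use a few bytes to estimate latency */

int main(int argc, char **argv)
{
    int rank;
    char *buf = malloc(BYTES);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < NMSG; ++i) {
        if (rank == 0) {
            MPI_Send(buf, BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = MPI_Wtime() - t0;

    if (rank == 0) {
        double half_rt = t / NMSG / 2.0;   /* half round-trip time per message */
        printf("time per message: %.2f us, effective bandwidth: %.2f GB/s\n",
               half_rt * 1.0e6, BYTES / half_rt / 1.0e9);
    }

    MPI_Finalize();
    free(buf);
    return 0;
}
```

Whether the two processes share a node, a NUMALink router, or sit on opposite sides of the 2D torus makes a measurable difference, which is exactly the bisection-bandwidth issue discussed above.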

All measurements presented were done within a single standard node.

The Woodcrest system at RRZE represents the prototypical design of modern commodity HPC clusters: 217 compute nodes (see Fig. 2) are connected to a single InfiniBand (IB) switch (Voltaire ISR9288 with a maximum of 288 ports, cf. [10]). The dual-socket compute nodes (HP DL140G3) are equipped with 8 Gbytes of main memory, two Intel Xeon 5160 dual-core chips (codenamed “Woodcrest”) running at 3.0 GHz, and the bandwidth-optimized “Greencreek” chipset. With Intel’s new Core2 architecture, several improvements were introduced as compared to the Netburst design, aiming at higher instruction throughput, shorter pipelines, and faster caches, to name a few that are important for High Performance Computing. Each node features a DDR IB HCA in its PCIe-8x slot, so the maximum IB communication bandwidth of 2 Gbyte/s per direction (DDR) is available to each node.

Reference

This content is AI-processed based on open access ArXiv data.
