OMI4papps: Optimisation, Modelling and Implementation for Highly Parallel Applications
This article reports on first results of the KONWIHR-II project OMI4papps at the Leibniz Supercomputing Centre (LRZ). The first part describes Apex-MAP, a tunable synthetic benchmark designed to simulate the performance of typical scientific applications. Apex-MAP mimics common memory access patterns and the varying computational intensity of scientific codes. An approach for modelling LRZ’s application mix is presented which makes use of performance counter measurements of real applications running on “HLRB II”, an SGI Altix system based on 9728 Intel Montecito dual-cores. The second part shows how the Apex-MAP benchmark can be used to simulate the performance of two mathematical kernels frequently used in scientific applications: a dense matrix-matrix multiplication and a sparse matrix-vector multiplication. The performance of both kernels has been studied intensively on x86 cores and hardware accelerators. We compare the predicted performance with measured data to validate our Apex-MAP approach.
💡 Research Summary
The paper presents the first results of the OMI4papps project (Optimisation, Modelling and Implementation for Highly Parallel Applications) carried out at the Leibniz Supercomputing Centre (LRZ). The work is divided into two complementary parts. In the first part the authors introduce Apex‑MAP, a highly tunable synthetic benchmark that can emulate the memory‑access characteristics and computational intensity of a wide range of scientific codes. Apex‑MAP exposes three groups of parameters: (1) memory‑access stride and range, which allow the user to reproduce sequential, blocked, or random access patterns; (2) compute intensity, controlled by inserting or removing floating‑point operations inside the benchmark loop, thereby setting a desired FLOP‑per‑byte ratio; and (3) parallelism, by varying the number of OpenMP threads and the scheduling policy. By systematically varying these knobs, Apex‑MAP can generate a multidimensional performance space that mirrors the behavior of real applications on modern cache hierarchies.
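The parameter groups above can be sketched as a toy model. This is a minimal illustration only: the real Apex‑MAP is a compiled benchmark, and the function names, the wrap‑around stride scheme, and the dummy FLOP loop below are assumptions made for the sketch, not the benchmark's actual code.

```python
import random

def make_indices(n, pattern, stride=1, seed=0):
    """Generate an access sequence over n elements:
    'sequential', 'strided' (with wrap-around), or 'random'."""
    if pattern == "sequential":
        return list(range(n))
    if pattern == "strided":
        return [(i * stride) % n for i in range(n)]
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(n)]

def apex_map_kernel(data, indices, intensity):
    """Walk `data` in the order given by `indices`, performing
    `intensity` dummy floating-point operations per loaded element --
    the knob that sets the FLOP-per-byte ratio of the synthetic load."""
    acc = 0.0
    for i in indices:
        x = data[i]
        for _ in range(intensity):
            x = x * 1.0000001 + 0.5  # filler FLOPs, kept data-dependent
        acc += x
    return acc  # returned so the work cannot be optimized away
```

Timing `apex_map_kernel` over different `(pattern, stride, intensity)` tuples traces out the kind of multidimensional performance space described above; the third parameter group of the real benchmark, thread count and scheduling, would correspond to running several such walks concurrently.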
To build a realistic model of the LRZ workload, the authors collected hardware performance counters (via PAPI) from a representative set of applications running on HLRB II, an SGI Altix system equipped with 9728 Intel Montecito dual‑core processors. The measured metrics include L1/L2 cache‑miss rates, memory‑bandwidth utilization, FLOP/Byte ratios, and instructions per cycle. Using a least‑squares fitting procedure, each application’s counter profile was mapped onto a point in the Apex‑MAP parameter space. This mapping revealed that the entire application mix can be represented by only ten distinct Apex‑MAP configurations, each corresponding to a characteristic pattern: high compute intensity with regular memory access, low compute intensity with irregular access, or an intermediate mix of the two. The reduction dramatically lowers the cost of system‑wide performance prediction while preserving the essential characteristics of the workload.
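The mapping step can be illustrated with a minimal nearest‑configuration search. The metric vectors, configuration names, and the three‑metric profile format below are hypothetical placeholders; the paper's actual least‑squares fit operates on full PAPI counter profiles.

```python
def nearest_config(profile, configs):
    """Return the name of the configuration whose metric vector is
    closest, in the least-squares sense, to the measured profile.
    Metric vectors are assumed pre-normalized to comparable scales."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(configs, key=lambda name: sq_dist(profile, configs[name]))

# Hypothetical reference points: (FLOP/Byte, L2 miss rate, IPC).
# The real study used ten such configurations; two shown for brevity.
configs = {
    "dense_regular":    (8.0, 0.01, 2.0),
    "sparse_irregular": (0.5, 0.20, 0.4),
}
```

Classifying each measured application profile this way, and then running only the matched Apex‑MAP configuration instead of the full application, is what makes the workload model cheap to evaluate on new hardware.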
The second part of the study validates the model by applying the ten representative Apex‑MAP configurations to two kernels that dominate many scientific codes: dense matrix‑matrix multiplication (DGEMM) and sparse matrix‑vector multiplication (SpMV). For DGEMM the authors selected a high compute‑intensity setting (≈8 FLOP/Byte) and a stride of one to emulate the regular, bandwidth‑friendly access pattern of dense linear algebra. For SpMV they chose a low compute‑intensity setting (≈0.5 FLOP/Byte) together with a large stride (≈64) to reproduce the irregular, memory‑bound nature of sparse operations. The benchmark was executed on a range of hardware platforms: native x86 cores, NVIDIA GPUs, and Intel Xeon Phi accelerators, varying the number of cores or processing elements from 1 up to 64. Predicted performance, expressed as achieved GFLOP/s and effective memory bandwidth, was compared against directly measured values. Across all platforms the average absolute error was below 5 %, with the GPU results for SpMV showing particularly good agreement because the synthetic benchmark captured the dominant memory‑latency bottleneck. The Xeon Phi runs exhibited a slightly larger deviation (≈8 %) due to unmodelled effects of the many‑core cache‑coherence protocol and non‑uniform memory access (NUMA) penalties.
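The contrast between the two kernel settings can be motivated with back‑of‑the‑envelope arithmetic intensity estimates. The byte accounting below is one common convention (8‑byte double values, 4‑byte column indices, each array streamed once) and is an assumption of this sketch; the exact FLOP/Byte figure always depends on which memory traffic is counted.

```python
def dgemm_intensity(n):
    """Arithmetic intensity of an n x n dense matrix multiply, assuming
    ideal blocking so each of the three matrices is streamed from
    memory once: 2*n^3 FLOPs over 3*n^2 double-precision values."""
    flops = 2.0 * n ** 3
    bytes_moved = 3 * n * n * 8
    return flops / bytes_moved  # grows linearly with n

def spmv_intensity(nnz, n):
    """Arithmetic intensity of a CSR sparse matrix-vector multiply:
    2 FLOPs per nonzero; each nonzero moves ~12 bytes (8-byte value
    plus 4-byte column index), plus ~16 bytes per row for the input
    and output vectors."""
    flops = 2.0 * nnz
    bytes_moved = nnz * 12 + n * 16
    return flops / bytes_moved  # bounded well below 1 FLOP/Byte
```

Dense matrix multiplication thus reaches high, size‑dependent intensity (8 FLOP/Byte already at n = 96 under this accounting), while SpMV stays pinned at a small constant, which is why the former is modelled as compute‑bound with unit stride and the latter as memory‑bound with large, irregular strides.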
Key insights emerging from the work are: (1) a small set of synthetic profiles can faithfully represent a large, heterogeneous application mix, enabling fast, low‑overhead performance forecasting for future hardware generations; (2) explicit control of memory‑access pattern and compute intensity provides a portable way to extrapolate performance to novel architectures such as next‑generation GPUs or memory‑centric CPUs; (3) the remaining prediction errors point to the need for extending the model to include NUMA topology, cache‑coherence overhead, and other non‑linear system effects; and (4) Apex‑MAP can serve not only as a benchmark but also as a design‑space exploration tool for code‑porting, compiler optimization, and architectural evaluation. In summary, the paper demonstrates that a carefully calibrated synthetic benchmark, combined with systematic performance‑counter analysis, can deliver accurate, architecture‑agnostic performance models for highly parallel scientific applications.