HONEI: A collection of libraries for numerical computations targeting multiple processor architectures

Reading time: 6 minutes

📝 Original Info

  • Title: HONEI: A collection of libraries for numerical computations targeting multiple processor architectures
  • ArXiv ID: 0904.4152
  • Date: 2009-04-27
  • Authors: Danny van Dyk, Markus Geveler, Sven Mallach, Dirk Ribbrock, Dominik Goeddeke, Carsten Gutwenger

📝 Abstract

We present HONEI, an open-source collection of libraries offering a hardware-oriented approach to numerical calculations. HONEI abstracts the hardware, and applications written on top of HONEI can be executed on a wide range of computer architectures such as CPUs, GPUs and the Cell processor. We demonstrate the flexibility and performance of our approach with two test applications, a Finite Element multigrid solver for the Poisson problem and a robust and fast simulation of shallow water waves. By linking against HONEI's libraries, we achieve a twofold speedup over straightforward C++ code using HONEI's SSE backend, and an additional 3-4 and 4-16 times faster execution on the Cell and a GPU. A second important aspect of our approach is that the full performance capabilities of the hardware under consideration can be exploited by adding optimised application-specific operations to the HONEI libraries. HONEI provides all necessary infrastructure for development and evaluation of such kernels, significantly simplifying their development.

📄 Full Content

Computational science in general and numerical simulation in particular have reached a turning point. The revolution developers are facing is not primarily driven by a change in (problem-specific) methodology, but rather by the fundamental paradigm shift of the underlying hardware towards heterogeneity and parallelism.

The general expectation that the performance of serial codes improves automatically is no longer true. Power and heat considerations have put an end to aggressively optimised circuits and ever deeper processor pipelines (power wall). Frequency scaling is thus limited to ‘natural frequency scaling’ by process shrinking. Ever wider superscalar architectures with more aggressive out-of-order execution have also reached their limits: the hardware is no longer able to extract enough instruction-level parallelism to hide latencies (ILP wall).

The primary solution provided by processor manufacturers is the migration towards on-chip parallelism in the form of chip multiprocessors (CMP); the (approximately) doubled amount of transistors per chip generation is invested into a doubling of processing cores. Over the last few years, both Intel and AMD have established multicore processors in the mass market.

At the same time, memory performance continues to improve at a significantly slower rate than compute performance. Due in particular to pin limits, this memory wall problem is even worsened by multicore architectures. Top-end quadcore processors can achieve a theoretical peak floating point performance of 12 GFLOP/s per core in double precision (24 GFLOP/s in single precision), whereas the bandwidth to off-chip memory has barely increased over the last hardware generations, now reaching approximately 12 GB/s shared between all cores on a die: bandwidth asymptotically scales with the number of sockets, not with the number of cores. In the scope of this paper, the arithmetic intensity (defined as the number of floating point operations performed per data item) determines the attainable performance; ratios of 1:1 are not uncommon. The traditional approach of adding ever larger on-chip cache memory hierarchies with sophisticated prefetching and replacement policies only alleviates the memory wall problem if data reuse is possible.

In the context of these three fundamental problems, alternative processor designs such as graphics processors (GPUs) or the Cell BE processor, which was primarily developed for the gaming console PlayStation 3, are of particular interest. For instance, NVIDIA’s flagship GPU at the time of writing, the GeForce GTX 285, features 30 multiprocessors (each comprising 8 single precision streaming processors (SPs), one double precision SP, a shared lock-step instruction unit and support for transcendentals), delivering more than 1 TFLOP/s theoretical peak performance in single precision and, more importantly, a sustained, achievable memory bandwidth of 160 GB/s. This outstanding level of performance is achieved by keeping tens of thousands of threads ‘in flight’ simultaneously to effectively hide latencies; context switches between threads are handled by the hardware at no additional cost. The Cell processor, on the other hand, is an example of on-chip heterogeneity, comprising a standard Power 5 CPU (PPE) and eight on-chip floating point co-processors called Synergistic Processing Elements (SPEs). The SPEs and the PPE are connected via a fast 200 GB/s on-chip ring bus; the theoretical peak performance is 230 GFLOP/s in single precision, and off-chip memory is accessible at 25.6 GB/s. With achievable performance levels significantly exceeding those of commodity designs, and a dramatic improvement in toolchain support over the past two years, such processors are migrating from specialised solutions to the general-purpose computing domain and are widely considered forerunners of future manycore designs.

This parallelisation and heterogenisation of resources poses a major challenge to the application programmer: practitioners in computational science (numerics, natural sciences and engineering) often concentrate primarily on advancements in domain-specific methodology and hesitate to deal with the complications resulting from the paradigm shift in the underlying hardware. An important observation in many existing software solutions is that algorithms are formulated, at the application level, to concentrate most of the actual work in a few kernels, separated from the generic control flow. Examples include both low-level kernels such as matrix-vector products and high-level kernels such as black-box iterative solvers for sparse linear systems. In these cases, well-defined interfaces between the kernels enable individual tuning and, ideally, specialisation for different hardware resources. The idea of encapsulating the resulting highly tuned kernels into libraries is as old as scientific computing.

For applications involving sparse rather than dense matrices and correspondingly a l

…(Full text truncated)…

Reference

This content is AI-processed based on ArXiv data.
