Programmatic Control of a Compiler for Generating High-performance Spatial Hardware


📝 Abstract

This methodology paper addresses high-performance high-productivity programming on spatial architectures. Spatial architectures are efficient for executing dataflow algorithms, yet for high-performance programming, the productivity is low and verification is painful. We show that coding and verification are the biggest obstacle to the wide adoption of spatial architectures. We propose a new programming methodology, T2S (Temporal to Spatial), to remove this obstacle. A programmer specifies a temporal definition and a spatial mapping. The temporal definition defines the functionality to compute, while the spatial mapping defines how to decompose the functionality and map the decomposed pieces onto a spatial architecture. The specification precisely controls a compiler to actually implement the loop and data transformations specified in the mapping. The specification is loop-nest- and matrix-oriented, and thus lends itself to the compiler for automatic, static verification. Many generic, strategic loop and data optimizations can be systematically expressed. Consequently, high performance is expected with substantially higher productivity: compared with high-performance programming in today’s high-level synthesis (HLS) languages or hardware description languages (HDLs), the engineering effort on coding and verification is expected to be reduced from months to hours, a reduction of 2 or 3 orders of magnitude.

📄 Content

Hongbo Rong
Parallel Computing Lab (PCL), Intel Corporation
hongbo.rong@intel.com
Keywords: High-performance computing (HPC), spatial programming, productivity, language, compiler, FPGA, CGRA

1 Introduction

In this paper, we address high-performance high-productivity programming on spatial architectures. A spatial architecture is composed of (many) distributed hardware resources, including memory blocks, arithmetic/logical elements, and their interconnections. The arithmetic/logical elements execute whenever their input data are available. This assumption covers a wide spectrum of spatial architectures, from a fine-grain Field-Programmable Gate Array (FPGA) [2] to a Coarse-Grain Reconfigurable Architecture (CGRA) [3–7, 25].
In contrast with a temporal architecture (i.e., a von Neumann architecture such as a CPU or GPU), which uses a global instruction pointer to fetch instructions from memory and then executes them through a fixed pipeline, a spatial architecture has no instruction pointer or instruction fetch. Instead, instructions are directly synthesized into pipelines. This specializes the spatial architecture to a specific computation, and thus offers power-efficiency advantages over general-purpose CPUs or GPUs.

¹ If high performance is not the target, spatial programming does not take much time and is not hard. Programmers can write simple loop nests and add a few pragmas to easily get average performance. This user scenario is important in itself, but is beyond our scope. This paper focuses on HPC programming only.

Spatial architectures are usually used as special-purpose accelerator devices for dataflow algorithms. Dataflow algorithms are driven by data availability, which enables massive parallelism for high performance. Performance and productivity, however, are conflicting goals. Table 1 describes several high-profile workloads that are representative of various domains and compute patterns. They include SGEMM (single-precision general matrix multiply), PairHMM (Pair Hidden Markov Model), a neural network (VGG-16, mainly the convolution and ReLU layers), SpMV (sparse-matrix dense-vector multiply), and merge sort. Figure 1 shows the engineering time spent on high-performance programming of these workloads on an FPGA or CGRA, written in several languages. The time was collected from two companies and one school based on their real-world products and research. As we can see, the productivity to achieve high performance is low: it often takes several to tens of months¹.
On temporal architectures, such a conflict between performance and productivity exists as well, but significant progress has been made to address it from all perspectives: languages, compilers, libraries, runtimes, auto-tuning tools, and hardware [10, 11, 26, 31, 32, 51]. In particular, the Halide language [11] addresses the conflict well in the domain of image processing and, more generally, dense matrix computation. A Halide program is a specification, including an algorithm and a schedule. The algorithm expresses a problem as a dataflow function. The schedule specifies how to optimize the function to run on hardware. Figure 2(a) illustrates the concept with a simple example, where B is a 1-dimensional (1-D) floating-point matrix, f is an arbitrary function, and i is a loop index.

This content is AI-processed based on arXiv data.
