Programmatic Control of a Compiler for Generating High-performance Spatial Hardware
Programmatic Control of a Compiler for Generating High-performance Spatial Hardware
Hongbo Rong
Parallel Computing Lab (PCL), Intel Corporation
hongbo.rong@intel.com
Abstract

This methodology paper addresses high-performance, high-productivity programming on spatial architectures. Spatial architectures are efficient for executing dataflow algorithms, yet for high-performance programming, productivity is low and verification is painful.

We show that coding and verification are the biggest obstacle to the wide adoption of spatial architectures. We propose a new programming methodology, T2S (Temporal to Spatial), to remove this obstacle. A programmer specifies a temporal definition and a spatial mapping. The temporal definition defines the functionality to compute, while the spatial mapping defines how to decompose the functionality and map the decomposed pieces onto a spatial architecture. The specification precisely controls a compiler to actually implement the loop and data transformations specified in the mapping. The specification is loop-nest- and matrix-oriented, and thus lends itself to automatic, static verification by the compiler. Many generic, strategic loop and data optimizations can be systematically expressed. Consequently, high performance is expected with substantially higher productivity: compared with high-performance programming in today's high-level synthesis (HLS) languages or hardware description languages (HDLs), the engineering effort on coding and verification is expected to be reduced from months to hours, a reduction of 2 or 3 orders of magnitude.
Keywords: High-performance computing (HPC), spatial programming, productivity, language, compiler, FPGA, CGRA
1 Introduction
In this paper, we address high-performance, high-productivity programming on spatial architectures. A spatial architecture is composed of (many) distributed hardware resources, including memory blocks, arithmetic/logical elements, and their interconnections. The arithmetic/logical elements execute whenever their input data are available. This definition covers a wide spectrum of spatial architectures, from a fine-grain Field-Programmable Gate Array (FPGA) [2] to a Coarse-Grain Reconfigurable Architecture (CGRA) [3-7, 25].
In contrast with a temporal architecture (i.e., a Von Neumann architecture such as a CPU or GPU), which uses a global instruction pointer to fetch instructions from memory and then executes them through a fixed pipeline, a spatial architecture has no instruction pointer or instruction fetch. Instead, instructions are directly synthesized into pipelines. This specializes the spatial architecture to match a specific computation, and thus offers power-efficiency advantages over general-purpose CPUs or GPUs.
Spatial architectures are usually used as special-purpose accelerator devices for dataflow algorithms. Dataflow algorithms are driven by data availability, which enables massive parallelism for high performance.

Performance and productivity, however, are conflicting goals. Table 1 describes several high-profile workloads that are representative of various domains and compute patterns. They include SGEMM (single-precision matrix multiply), PairHMM (Pair Hidden Markov Model), a neural network (VGG-16, mainly the convolution and ReLU layers), SpMV (sparse-matrix dense-vector multiply), and merge sort. Figure 1 shows the engineering time spent on high-performance programming of these workloads on an FPGA or CGRA, written in several languages. The time is collected from two companies and one school based on their real-world products and research. As we can see, the productivity of achieving high performance is low: it often takes several to tens of months¹.

¹ If high performance is not the target, spatial programming does not take much time and is not hard. Programmers can write simple loop nests and add a few pragmas to easily get average performance. This user scenario is important in itself, but is beyond our scope. This paper focuses on HPC programming only.
On temporal architectures, the same conflict between performance and productivity exists as well, but significant progress has been made to address it from all perspectives: languages, compilers, libraries, runtimes, auto-tuning tools, and hardware [10, 11, 26, 31, 32, 51]. In particular, the Halide language [11] addresses the conflict well in the domain of image processing and, more generally, dense matrix computation.

A Halide program is a specification, consisting of an algorithm and a schedule. The algorithm expresses a problem as a dataflow function. The schedule specifies how to optimize the function to run on hardware. Figure 2(a) illustrates the concept with a simple example, where B is a 1-dimensional (1-D) floating-point matrix, f is an arbitrary function, and i is a loop index.