High Level Synthesis with a Dataflow Architectural Template

Reading time: 5 minutes

📝 Abstract

In this work, we present a new approach to high level synthesis (HLS), where high level functions are first mapped to an architectural template, before hardware synthesis is performed. As FPGA platforms are especially suitable for implementing streaming processing pipelines, we perform transformations on conventional high level programs where they are turned into multi-stage dataflow engines [1]. This target template naturally overlaps slow memory data accesses with computations and therefore has much better tolerance towards memory subsystem latency. Using a state-of-the-art HLS tool for the actual circuit generation, we observe up to 9x improvement in overall performance when the dataflow architectural template is used as an intermediate compilation target.

📄 Content

High Level Synthesis with a Dataflow Architectural Template

Shaoyi Cheng and John Wawrzynek
Department of EECS, UC Berkeley, California, USA 94720
Email: sh cheng@berkeley.edu, johnw@eecs.berkeley.edu

Abstract—In this work, we present a new approach to high level synthesis (HLS), where high level functions are first mapped to an architectural template, before hardware synthesis is performed. As FPGA platforms are especially suitable for implementing streaming processing pipelines, we perform transformations on conventional high level programs where they are turned into multi-stage dataflow engines [1]. This target template naturally overlaps slow memory data accesses with computations and therefore has much better tolerance towards memory subsystem latency. Using a state-of-the-art HLS tool for the actual circuit generation, we observe up to 9x improvement in overall performance when the dataflow architectural template is used as an intermediate compilation target.

Index Terms—FPGA, Overlay Architecture, Hardware design template, High-level Synthesis, Pipeline Parallelism

I. INTRODUCTION

As the complexity of both the FPGA devices and their applications increases, the task of efficiently mapping the desired functionality is getting ever more challenging. To alleviate the difficulty of designing for FPGAs, there has been a trend towards using higher levels of abstraction. Tools taking in high-level function specifications and generating hardware IP blocks have been developed both in academia [2], [3] and industry [4], [5]. Of course, the semantics of high level languages like C/C++ are vastly different from the description of hardware behavior at clock cycle granularity. The tools often try to bridge this gap by fitting the control data flow graph (CDFG) of the original program into particular hardware paradigms such as Finite State Machine with Datapath (FSMD).
Depending on the nature of the application, these approaches may or may not generate hardware taking full advantage of what the FPGA has to offer. User guidance in the form of directives or pragmas is often needed to expose parallelism of various kinds and to optimize the design. An important dimension of the space is the mechanism with which memory data are accessed. Designers sometimes need to restructure the original code to separate out memory accesses before invoking HLS. Also, it is often desirable to convert from conventional memory accesses to a streaming model and to insert DMA engines [6]. Further enhancements can be achieved by including accelerator-specific caching and burst accesses.

In this paper, we realize an intermediate architectural template (section II) that complements existing work in HLS. It captures some of the common patterns applied in optimizing HLS-generated designs. In particular, by taking advantage of FPGAs as throughput-oriented devices, it structures the computation and data accesses into a series of coarse-grained pipeline stages, through which data flows. To target this architectural template, we have developed a tool to slice the original CDFG of the performance-critical loop nests into subgraphs, connected by communication channels (section III). This decouples the scheduling of operations between different subgraphs and subsequently improves the overall throughput in the presence of data fetch stalls. Then, each of the subgraphs is fed to a conventional high-level synthesis flow, generating independent datapaths and controllers. FIFO channels are instantiated to connect the datapaths, forming the final system (section IV). The performance, when compared against directly synthesized accelerators, is far superior (section V), demonstrating the advantage of targeting the dataflow architectural template during HLS.
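The slicing described above can be pictured with a small software analogue: a loop that loads a value, computes on it, and emits the result is split into a memory-access stage and a compute stage, connected by a bounded FIFO so the two stages are scheduled independently. This is only an illustrative sketch in Python threads, not the authors' tool; the names `fetch_stage`, `compute_stage`, and `FIFO_DEPTH` are our own.

```python
import threading
import queue
import time

FIFO_DEPTH = 8            # bounded channel, like a hardware FIFO
SENTINEL = object()       # end-of-stream marker

def fetch_stage(data, fifo):
    # Models the memory-access subgraph: each load has some latency,
    # but the stage can run ahead of the consumer up to FIFO_DEPTH.
    for x in data:
        time.sleep(0.001)          # stand-in for off-chip access latency
        fifo.put(x)
    fifo.put(SENTINEL)

def compute_stage(fifo, out):
    # Models the compute subgraph: consumes values as they arrive,
    # stalling only when the FIFO is empty rather than on every load.
    while True:
        x = fifo.get()
        if x is SENTINEL:
            break
        out.append(x * x)          # stand-in for the loop body's computation

data = list(range(16))
fifo = queue.Queue(maxsize=FIFO_DEPTH)
out = []
producer = threading.Thread(target=fetch_stage, args=(data, fifo))
producer.start()
compute_stage(fifo, out)
producer.join()
print(out == [x * x for x in data])   # True
```

Because the only coupling between the stages is the FIFO, a stall in one subgraph does not freeze the other, which is the property the dataflow template exploits in hardware.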
II. THE DATAFLOW ARCHITECTURAL TEMPLATE

Currently, HLS tools use a simple static model for scheduling operations. Different parts of the generated hardware run in lockstep with each other, with no need for dynamic dependency checking mechanisms such as scoreboarding or load-store queueing. This rigid scheduling of operators, while producing circuits of simpler structure and smaller area, is vulnerable to stalls introduced by cache misses or variable latency operations. The entire compute engine is halted as the state machine in the controller waits for the completion of an outstanding operation. This effect becomes very pronounced when irregular off-chip data accesses are encoded in the function. Under these circumstances, the traditional approach where data movements are explicitly managed using DMA may not be effective, as the access pattern is not known statically. Also, there may not be sufficient on-chip memory to buffer the entirety of the involved data structure. As a result, the overall performance can deteriorate significantly. To alleviate this problem, instead of directly synthesizing the accelerator from the original control dataflow graph, we first map the input function to an architectural template.
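A back-of-the-envelope cost model (our own simplification, not from the paper) makes the stall argument concrete: a lockstep FSMD pays memory latency plus compute latency on every iteration, whereas decoupled stages overlap the two, so steady-state cost per iteration drops to the larger of the two latencies. The functions and cycle counts below are illustrative assumptions only.

```python
def fsmd_cycles(n_iters, mem_lat, compute_lat):
    # Static lockstep schedule: every access serializes with compute,
    # so each iteration costs the sum of both latencies.
    return n_iters * (mem_lat + compute_lat)

def dataflow_cycles(n_iters, mem_lat, compute_lat):
    # Decoupled fetch and compute stages run concurrently through a FIFO:
    # steady-state cost per iteration is the slower stage, plus a one-time
    # pipeline-fill cost for the faster stage.
    return n_iters * max(mem_lat, compute_lat) + min(mem_lat, compute_lat)

# Hypothetical loop: 1000 iterations, 60-cycle loads, 50-cycle compute.
n, mem, comp = 1000, 60, 50
print(fsmd_cycles(n, mem, comp))       # 110000
print(dataflow_cycles(n, mem, comp))   # 60050
```

With roughly balanced stages this model predicts close to 2x improvement; the larger gains reported in the paper come from deeper multi-stage pipelines and from the fetch stage keeping multiple accesses in flight, which this two-stage model does not capture.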

This content is AI-processed based on ArXiv data.
