A Soft Processor Overlay with Tightly-coupled FPGA Accelerator

Reading time: 6 minutes
...

📝 Abstract

FPGA overlays are commonly implemented as coarse-grained reconfigurable architectures with the goal of improving designers’ productivity by balancing flexibility and ease of configuration of the underlying fabric. To truly facilitate full application acceleration, it is often necessary to also include a highly efficient processor that integrates and collaborates with the accelerators while retaining the benefits of being implemented within the same overlay framework. This paper presents an open-source soft processor that is designed to couple tightly with FPGA accelerators as part of an overlay framework. RISC-V is chosen as the instruction set for its openness and portability, and the soft processor is designed as a 4-stage pipeline to balance resource consumption and performance when implemented on FPGAs. The processor is generically implemented so as to promote design portability and compatibility across different FPGA platforms. Experimental results show that integrated software-hardware applications using the proposed tightly-coupled architecture achieve performance comparable to that of hardware-only accelerators, while the proposed architecture provides additional run-time flexibility. The processor has been synthesized for both low-end and high-performance FPGA families from different vendors, achieving a maximum frequency of 268.67 MHz and resource consumption comparable to existing RISC-V designs.

📄 Content

A Soft Processor Overlay with Tightly-coupled FPGA Accelerator

Ho-Cheung Ng, Cheng Liu, Hayden Kwok-Hay So
Department of Electrical & Electronic Engineering, The University of Hong Kong
{hcng, liucheng, hso}@eee.hku.hk

Abstract—FPGA overlays are commonly implemented as coarse-grained reconfigurable architectures with the goal of improving designers’ productivity by balancing flexibility and ease of configuration of the underlying fabric. To truly facilitate full application acceleration, it is often necessary to also include a highly efficient processor that integrates and collaborates with the accelerators while retaining the benefits of being implemented within the same overlay framework. This paper presents an open-source soft processor that is designed to couple tightly with FPGA accelerators as part of an overlay framework. RISC-V is chosen as the instruction set for its openness and portability, and the soft processor is designed as a 4-stage pipeline to balance resource consumption and performance when implemented on FPGAs. The processor is generically implemented so as to promote design portability and compatibility across different FPGA platforms. Experimental results show that integrated software-hardware applications using the proposed tightly-coupled architecture achieve performance comparable to that of hardware-only accelerators, while the proposed architecture provides additional run-time flexibility. The processor has been synthesized for both low-end and high-performance FPGA families from different vendors, achieving a maximum frequency of 268.67 MHz and resource consumption comparable to existing RISC-V designs.

I. INTRODUCTION

By raising the abstraction level of the underlying configurable fabric, many early works have already demonstrated the promise of using FPGA overlays to improve designers’ productivity in developing hardware accelerators [1], [2].
While such hardware accelerators can often deliver significant performance improvement over their software counterparts, they are often fixed in functionality and lack the flexibility to process irregular input or data that depends on run-time dynamics. To truly take advantage of the performance benefit of hardware accelerators, it is therefore desirable to have an efficient CPU in the overlay tightly-coupled with the accelerator to control its operations and to maintain compatibility with the rest of the software system.

To illustrate these intricate hardware-software codesign challenges, Algorithm 1 shows a simple design that accelerates the Sobel edge detection algorithm in such a heterogeneous system. In this implementation, an accelerator that computes 16 × 16 output pixels at a time is implemented in the FPGA. During run time, depending on the size of the user's input image, the software reuses this hardware accelerator for as many complete 16 × 16 blocks of output pixels as possible. The remaining odd pixels, as well as pixels on the boundary of the image where the standard filter kernel cannot readily operate, are handled in software.

Data: Pixels of size N × N
#define BUF 16  // HW computes 16x16 output pixels
for r := 0 to N − 1 do
    for c := 0 to N − 1 do
        if pixel[r, c] is edge then
            SW_SOBEL(pixel, r, c);
        else if ((r − 1) % BUF) == 0 && ((c − 1) % BUF) == 0 then
            HW_SOBEL(pixel, r, c);
        else
            continue;
        end
    end
end

Algorithm 1: Pseudocode for Sobel edge detector. As the hardware accelerator operates on a fixed 16 × 16 array of output pixels at a time, software passes control to the accelerator only for cases when all 17×17 pixels are available. Otherwise, the computation is carried out in software. Assume N − 2 is a multiple of BUF.

While the design of Algorithm 1 may be specific to the particular implementation of Sobel edge detection, it highlights several challenges commonly faced by many real-world hardware-software designers.
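
The dispatch logic of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `hw_sobel` is a software stand-in for the accelerator, the names `sobel_at`, `sw_sobel`, `hw_sobel` and `hybrid_sobel` are hypothetical, the gradient magnitude is approximated as |gx| + |gy|, and border pixels are handled by clamping coordinates (the paper does not specify either choice here). The sketch also relaxes the assumption that N − 2 is a multiple of BUF by sending leftover interior pixels to software, as the prose describes.

```python
BUF = 16  # tile edge: the accelerator produces BUF x BUF output pixels per call

# Standard 3x3 Sobel kernels.
GX = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
GY = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_at(img, r, c):
    """Return |gx| + |gy| at (r, c); coordinates are clamped at the border (assumption)."""
    n = len(img)
    gx = gy = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr = min(max(r + dr, 0), n - 1)
            cc = min(max(c + dc, 0), n - 1)
            gx += GX[dr + 1][dc + 1] * img[rr][cc]
            gy += GY[dr + 1][dc + 1] * img[rr][cc]
    return abs(gx) + abs(gy)


def sw_sobel(img, out, r, c):
    """Software path: compute a single output pixel."""
    out[r][c] = sobel_at(img, r, c)


def hw_sobel(img, out, r, c):
    """Stand-in for the accelerator: fill the BUF x BUF tile whose top-left is (r, c)."""
    for i in range(r, r + BUF):
        for j in range(c, c + BUF):
            out[i][j] = sobel_at(img, i, j)


def hybrid_sobel(img):
    """Process a square N x N image, returning the output and per-path call counts."""
    n = len(img)
    out = [[0] * n for _ in range(n)]
    calls = {"hw": 0, "sw": 0}
    # Rows/columns 1..full of the interior are covered by complete tiles.
    full = max(n - 2, 0) // BUF * BUF
    for r in range(n):
        for c in range(n):
            in_tiles = 1 <= r <= full and 1 <= c <= full
            if not in_tiles:
                # Image border or leftover pixels outside any complete tile.
                sw_sobel(img, out, r, c)
                calls["sw"] += 1
            elif (r - 1) % BUF == 0 and (c - 1) % BUF == 0:
                # Tile origin: hand the whole tile to the accelerator.
                hw_sobel(img, out, r, c)
                calls["hw"] += 1
            # Otherwise the pixel was already produced by an earlier tile.
    return out, calls
```

For a 34 × 34 image, where N − 2 = 32 is a multiple of BUF, this sketch issues 4 accelerator calls and 132 software calls, one per border pixel. Counting calls this way makes the switching overhead between the two paths visible, which is the cost a tightly-coupled design aims to reduce.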
First of all, because of the limited flexibility of most hardware accelerators, the controlling software must ensure the necessary input data are available before the accelerator is launched. Furthermore, unless the hardware accelerator is arbitrarily flexible, software running on the CPU must also be able to process any run-time data that cannot readily be processed by the accelerator.

In view of the above, this paper proposes the use of a small, open-source soft processor to provide fine-grained control for the hardware accelerator in the context of an overlay framework. The core is designed to be tightly-coupled with the hardware accelerator in order to minimize the overhead involved with switching control between hardware and software. RISC-V RV32I [3] is chosen as the ISA for its openness and simplicity. Finally, the core is generically designed in order to promote design portability and compatibility.

As such, we consider that the main contribution of this work rests on the demonstration of the benefits of tightly coupling a lightweight CPU with a hardware accelerator to serve within a combined overlay architecture

This content is AI-processed based on ArXiv data.
