Soft GPGPUs for Embedded FPGAs: An Architectural Evaluation

📝 Abstract

We present a customizable soft architecture which allows for the execution of GPGPU code on an FPGA without the need to recompile the design. Issues related to scaling the overlay architecture to multiple GPGPU multiprocessors are considered along with application-class architectural optimizations. The overlay architecture is optimized for FPGA implementation to support efficient use of embedded block memories and DSP blocks. This architecture supports direct CUDA compilation of integer computations to a binary which is executable on the FPGA-based GPGPU. The benefits of our architecture are evaluated for a collection of five standard CUDA benchmarks which are compiled using standard GPGPU compilation tools. Speedups of 44x, on average, versus a MicroBlaze microprocessor are achieved. We show dynamic energy savings versus a soft-core processor of 80% on average. Application-customized versions of the soft GPGPU can be used to further reduce dynamic energy consumption by an average of 14%.

📄 Content

Soft GPGPUs for Embedded FPGAs: An Architectural Evaluation
Kevin Andryc, Tedy Thomas, and Russell Tessier
Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA 01003

Abstract: We present a customizable soft architecture which allows for the execution of GPGPU code on an FPGA without the need to recompile the design. Issues related to scaling the overlay architecture to multiple GPGPU multiprocessors are considered along with application-class architectural optimizations. The overlay architecture is optimized for FPGA implementation to support efficient use of embedded block memories and DSP blocks. This architecture supports direct CUDA compilation of integer computations to a binary which is executable on the FPGA-based GPGPU. The benefits of our architecture are evaluated for a collection of five standard CUDA benchmarks which are compiled using standard GPGPU compilation tools. Speedups of 44×, on average, versus a MicroBlaze microprocessor are achieved. We show dynamic energy savings versus a soft-core processor of 80% on average. Application-customized versions of the soft GPGPU can be used to further reduce dynamic energy consumption by an average of 14%.

1. Introduction

FPGAs are used in a wide variety of embedded systems, such as automotive applications, appliances, and other consumer products. Most of the processing is performed by low-end embedded microprocessors and FPGAs. In some cases, just an FPGA is used and one or more microprocessors are fashioned from FPGA logic to execute specific code types. The benefits of this approach include the ability of software designers to specify functionality in a familiar high-level language (e.g., C) and the flexibility to modify this functionality for the FPGA device without the need to recompile FPGA logic, a time-consuming process that can range from minutes to days.

This paper focuses on an exploration of soft GPGPU architectures in FPGAs. We describe the architectural customization and scalability of FlexGrip (FLEXible GRaphIcs Processor for general-purpose computing), a fully CUDA binary-compatible integer GPGPU, optimized for FPGA implementation [1]. Specifically, we focus on expanding our architecture to include multiple multiprocessors per GPGPU and optimizing away architectural features which are not needed by classes of applications. In developing the soft GPGPU, a series of FPGA-specific optimizations are used. These optimizations, which include the effective use of block RAMs and DSP blocks, are critical to FlexGrip performance.

Specific contributions of our work include: (1) we characterize benchmarks into classes and analyze tradeoffs as we vary the amount of conditional execution hardware, the number of processor operands, and the functions supported by the processors, which allows for the optimization of area and energy; and (2) we consider FPGA performance tradeoffs as the number of processors and multiprocessors in the soft GPGPU is varied.
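To make the "conditional execution hardware" mentioned in the contributions more concrete, the sketch below models one common way SIMT processors handle a divergent if/else: all lanes of a warp run in lockstep, and an active mask selects which lanes commit results on each pass. This is a minimal illustration of the general technique, not a description of FlexGrip's implementation; the function name `simt_if_else` and the two-pass model are assumptions made for this example.

```python
from typing import Callable, List, Tuple


def simt_if_else(
    values: List[int],
    cond: Callable[[int], bool],
    then_op: Callable[[int], int],
    else_op: Callable[[int], int],
) -> Tuple[List[int], int]:
    """Run an integer if/else over one warp in lockstep using an active mask.

    Returns the per-lane results and the number of serialized passes taken.
    Hypothetical model: real hardware typically tracks masks on a stack.
    """
    # Every lane evaluates the branch condition to build the active mask.
    mask = [cond(v) for v in values]
    out = list(values)
    passes = 0

    # Pass 1: only lanes whose condition is true commit the 'then' result.
    if any(mask):
        passes += 1
        out = [then_op(v) if m else o for v, m, o in zip(values, mask, out)]

    # Pass 2: only the remaining lanes commit the 'else' result.
    if not all(mask):
        passes += 1
        out = [else_op(v) if not m else o for v, m, o in zip(values, mask, out)]

    return out, passes


if __name__ == "__main__":
    divergent, n_div = simt_if_else(
        list(range(8)), lambda v: v % 2 == 0, lambda v: v * 2, lambda v: v + 100
    )
    uniform, n_uni = simt_if_else(
        [0, 2, 4, 6], lambda v: v % 2 == 0, lambda v: v * 2, lambda v: v + 100
    )
    print(divergent, n_div)
    print(uniform, n_uni)
```

In this model a warp whose lanes all take the same path needs a single pass, while a divergent warp needs two. Under that assumption, an application class with no divergent branches would not exercise the divergence-handling logic, which is the kind of feature the text describes optimizing away.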
2. Background and Related Work

Our soft GPGPU is part of a larger trend in FPGA usage to eliminate long FPGA compile times and difficult hardware design cycles for many designers. Instead of application-specific custom hardware, an architectural overlay [2] is implemented in FPGA hardware. Although these architectures exhibit lower performance and higher energy consumption than their full custom counterparts, they can be swapped into the FPGA on-demand, providing flexibility. For example, over the past ten years, the implementation of soft vector processors on FPGAs has matured significantly [3] [4]. These architectures typically support a customizable number of operations performed in parallel, an optimized memory interface, and a compiler. FPGA usage also allows for the customization of the soft vector processor instruction set and data bit widths [4]. A recent project [3] exploited the pipeline parallelism found in FPGAs to create custom modules that can be integrated into the soft vector processor datapath.

Several FPGA-targeted projects considered the mapping of GPGPU applications represented in OpenCL to multi-threaded FPGA implementations. Labrecque and Steffan [5] described the multithreading of a single processor core. Hazard logic is removed from the processor and hazards are avoided by switching between up to seven different threads. Another work [6] considered an extension of this idea to include multiple cores of these simple multi-threaded processors operating in parallel. Kingyens and Steffan [7] described a GPU-like architecture that has some similarities to our architecture. Their GPU-like architecture includes multithreading across 32 "batches", small cores which contain ALUs. In general, these architectures do not scale to multiple independently-controlled multiprocessors or offer

Copyright held by the owner/author(s). Presented at 2nd International Workshop on Overlay Architectures for FPGAs (OLAF2016), Monterey, CA, USA, Feb. 21, 2016.
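The hazard-avoidance idea attributed to Labrecque and Steffan [5] above can be illustrated with a small scheduling model. The sketch issues one instruction per cycle, rotating round-robin over the threads, and checks that two consecutive instructions from the same thread are always far enough apart for the earlier one to have left the pipeline. The pipeline depth, the function names, and the "gap must be at least the depth" rule are simplifying assumptions for illustration; the source states only that up to seven threads are interleaved.

```python
from typing import List, Optional, Tuple


def interleave(num_threads: int, instrs_per_thread: int) -> List[Tuple[int, int, int]]:
    """Round-robin issue schedule as (cycle, thread_id, instruction_index)."""
    schedule = []
    cycle = 0
    for i in range(instrs_per_thread):
        for t in range(num_threads):
            schedule.append((cycle, t, i))
            cycle += 1
    return schedule


def min_same_thread_gap(schedule: List[Tuple[int, int, int]]) -> Optional[int]:
    """Smallest cycle distance between consecutive issues of any one thread."""
    last_issue = {}
    gap = None
    for cycle, t, _ in schedule:
        if t in last_issue:
            d = cycle - last_issue[t]
            gap = d if gap is None else min(gap, d)
        last_issue[t] = cycle
    return gap


def hazard_free(num_threads: int, pipeline_depth: int, instrs_per_thread: int = 4) -> bool:
    """True if no thread issues again before its previous instruction completes.

    Assumption: an instruction's result is visible pipeline_depth cycles
    after issue, so interlock logic is unnecessary when gap >= depth.
    """
    gap = min_same_thread_gap(interleave(num_threads, instrs_per_thread))
    return gap is None or gap >= pipeline_depth


if __name__ == "__main__":
    print(hazard_free(num_threads=7, pipeline_depth=7))
    print(hazard_free(num_threads=3, pipeline_depth=7))
```

With as many threads as (hypothetical) pipeline stages, every thread's next instruction issues only after its previous one has completed, so dependent instructions never overlap in flight. With fewer threads than stages the gap is too small, and some other mechanism such as stalling would be required.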
