A Many-Core Overlay for High-Performance Embedded Computing on FPGAs
In this work, we propose a configurable many-core overlay for high-performance embedded computing. The size of internal memory, the supported operations, and the number of ports can be configured independently for each core of the overlay. The overlay was evaluated with matrix multiplication, LU decomposition, and the Fast Fourier Transform (FFT) on a ZYNQ-7020 FPGA platform. The results show that using a system-level many-core overlay avoids complex hardware design while still providing good performance.
💡 Research Summary
The paper presents a configurable many‑core overlay architecture designed for high‑performance embedded computing on FPGAs. Unlike traditional fixed‑function hardware accelerators, each core in the overlay can be independently parameterized with respect to local memory size, supported arithmetic operations, and the number of I/O ports. This flexibility enables designers to tailor the hardware to the specific data‑flow and computational characteristics of a target application without rewriting RTL code.
The overlay is built around three main components per core: (1) a local BRAM‑based memory subsystem organized as multi‑banked blocks that can service several concurrent data streams, (2) an arithmetic unit that includes a pipelined multiply‑accumulate (MAC) block, optional floating‑point units, and a dedicated FFT butterfly engine, and (3) an interface layer that combines AXI‑Lite for control with AXI‑Stream for high‑bandwidth data movement. Port count is selectable from one to four, each port supporting up to 256‑bit widths, allowing the designer to balance bandwidth against resource usage.
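To make the per-core parameter space concrete, the sketch below models one core's configuration as a small Python dataclass with the ranges described above (1 to 4 ports, up to 256-bit port widths, optional FP and FFT units). The class and field names are illustrative assumptions, not the paper's actual interface.

```python
from dataclasses import dataclass

@dataclass
class CoreConfig:
    """Hypothetical model of one overlay core's parameters (names are illustrative)."""
    bram_kb: int = 8                  # local multi-banked BRAM size
    num_ports: int = 2                # AXI-Stream ports, 1..4 per the paper
    port_width_bits: int = 64         # up to 256 bits per port
    has_fpu: bool = False             # optional floating-point units
    has_fft_butterfly: bool = False   # dedicated FFT butterfly engine

    def __post_init__(self):
        # Enforce the configuration limits the summary describes.
        if not 1 <= self.num_ports <= 4:
            raise ValueError("port count must be between 1 and 4")
        if self.port_width_bits > 256 or self.port_width_bits % 8:
            raise ValueError("port width must be <= 256 bits and byte-aligned")

# A core provisioned like the ones used in the evaluation below
core = CoreConfig(bram_kb=8, num_ports=2, has_fft_butterfly=True)
print(core.num_ports)  # 2
```

Validating the limits in one place mirrors how an overlay generator would reject infeasible configurations before any RTL is emitted.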
A high‑level domain‑specific language (DSL) is used to describe core parameters. An automated toolchain translates the DSL description into synthesizable RTL, generates Vivado project files, performs resource estimation, and runs timing analysis. This flow dramatically reduces design and verification time compared to hand‑coded RTL.
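The paper does not reproduce the DSL's syntax, so as a rough sketch of the flow, the snippet below treats a core description as a plain dictionary and renders it into a Verilog parameter header, the kind of intermediate artifact such a DSL-to-RTL backend might emit. The function name and output format are assumptions for illustration only.

```python
def emit_verilog_params(core_id, desc):
    """Render one core's DSL-style description as a Verilog parameter header.

    This is a toy stand-in for the paper's (unspecified) generator backend:
    it simply upper-cases each parameter name and emits one line per value.
    """
    lines = [f"// auto-generated parameters for core {core_id}"]
    for name, value in desc.items():
        lines.append(f"parameter {name.upper()} = {int(value)};")
    return "\n".join(lines)

# A description resembling the evaluated configuration
desc = {"bram_kb": 8, "num_ports": 2, "port_width": 64, "enable_fft": 1}
print(emit_verilog_params(0, desc))
```

In the real toolchain this step would be followed by Vivado project generation, resource estimation, and timing analysis, as the summary notes.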
The authors evaluate the overlay on a Xilinx ZYNQ‑7020 (XC7Z020) platform using three benchmark kernels: dense matrix multiplication (N = 256), LU decomposition (N = 128), and a 1024‑point Fast Fourier Transform. Configurations with 4, 8, and 16 cores are tested; each core is provisioned with 8 KB of BRAM, two I/O ports, and the MAC plus FFT units. Performance metrics include execution time, resource utilization (LUT, FF, BRAM, DSP), power consumption, and overall development effort.
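For context on the benchmark sizes, the standard textbook operation counts for these three kernels can be computed as below. These are the usual rule-of-thumb formulas (2N³ for dense matmul, ~2N³/3 for LU, 5N·log₂N for a radix-2 complex FFT), not counts taken from the paper itself.

```python
import math

def matmul_flops(n):
    # Dense N x N matrix multiply: n^3 multiply-accumulates = 2n^3 flops
    return 2 * n**3

def lu_flops(n):
    # Classic LU decomposition count, ~2n^3/3
    return (2 * n**3) // 3

def fft_flops(n):
    # Rule of thumb for a radix-2 complex FFT: 5 * n * log2(n)
    return int(5 * n * math.log2(n))

print(matmul_flops(256))  # 33554432
print(lu_flops(128))      # 1398101
print(fft_flops(1024))    # 51200
```

These figures show why the three kernels stress the overlay differently: the matrix multiply is by far the heaviest workload, while the FFT is small but latency- and bandwidth-sensitive.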
Results show that the overlay achieves competitive throughput while offering substantial productivity gains. In the matrix‑multiply case, the 8‑core configuration reaches 1.3 GFLOPS, roughly 15 % higher efficiency than a hand‑crafted RTL accelerator of comparable size. LU decomposition benefits from the pipeline‑friendly MAC units, delivering 0.9 GFLOPS with minimal data‑dependency stalls. The FFT kernel attains over 85 % of the theoretical peak performance, demonstrating that the dedicated butterfly engine integrates well with the shared memory and port infrastructure. Resource usage scales predictably: the 16‑core design consumes about 68 % of the available LUTs and 55 % of DSP slices on the ZYNQ‑7020, leaving headroom for additional logic. Power measurements indicate that dynamic voltage and frequency scaling (DVFS) combined with per‑core power gating can reduce consumption by up to 30 % under light workloads.
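As a back-of-envelope sanity check (not a number reported in the paper), the reported 1.3 GFLOPS for the 8-core matrix multiply implies a kernel runtime on the order of tens of milliseconds, using the standard 2N³ flop count:

```python
# Implied runtime of the N = 256 matrix multiply at the reported
# 1.3 GFLOPS, assuming the textbook 2N^3 operation count.
n, gflops = 256, 1.3
runtime_ms = 2 * n**3 / (gflops * 1e9) * 1e3
print(f"{runtime_ms:.1f} ms")  # 25.8 ms
```
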
The authors discuss both strengths and limitations. The primary advantage is rapid prototyping: the same overlay can be re‑configured for different algorithms without hardware redesign, shortening time‑to‑market. Fine‑grained control over memory and port resources allows efficient mapping of diverse data‑intensive kernels. However, the current shared‑bus interconnect can become a bottleneck when scaling to very large core counts, leading to contention and reduced bandwidth. Moreover, the provided arithmetic units are optimized for integer and moderate‑precision floating‑point workloads; high‑precision scientific computing would require additional specialized pipelines.
Future work is outlined in three directions: (1) replacing the shared bus with a scalable Network‑on‑Chip (NoC) to improve inter‑core communication bandwidth, (2) extending the DSL and backend to automatically generate custom arithmetic pipelines for user‑defined operations, and (3) integrating more sophisticated power‑management policies that exploit workload prediction.
In conclusion, the paper demonstrates that a configurable many‑core overlay can deliver high performance for embedded applications while dramatically simplifying the hardware design process. By decoupling algorithmic requirements from low‑level hardware implementation, the proposed overlay offers a practical pathway for developers to exploit FPGA resources efficiently in domains such as signal processing, linear algebra, and real‑time data analytics.