DSL-based Design Space Exploration for Temporal and Spatial Parallelism of Custom Stream Computing

Notice: This research summary and analysis were automatically generated using AI. For authoritative details, please refer to the original arXiv paper.

Stream computation is one of the approaches well suited to FPGA-based custom computing due to the high throughput made possible by pipelining with regular memory access. To increase the performance of iterative stream computation, we can exploit both temporal and spatial parallelism by deepening and duplicating pipelines, respectively. However, performance is constrained by several factors, including the hardware resources available on the FPGA, external memory bandwidth, and the utilization of pipeline stages; we therefore need to find the best mix of the different forms of parallelism to achieve the highest performance per watt. In this paper, we present a domain-specific language (DSL) based design space exploration for temporally and/or spatially parallel stream computation on an FPGA. We define a DSL in which we can easily design a hierarchical structure of parallel stream computation with an abstract description of the computation. For the iterative stream computation of a fluid dynamics simulation, we design hardware structures with different mixes of temporal and spatial parallelism. By measuring performance and power consumption, we identify the best among them.


💡 Research Summary

This paper addresses the challenge of maximizing performance and energy efficiency for iterative stream computations on FPGAs by jointly exploiting temporal parallelism (deepening pipelines) and spatial parallelism (duplicating pipelines). While stream processing naturally benefits from pipelining and regular memory accesses, real hardware imposes constraints: limited logic, DSP, and on‑chip memory resources; finite external memory bandwidth; and the need to keep pipeline stages well utilized. The authors propose a domain‑specific language (DSL) that lets designers describe hierarchical stream architectures at a high level, specifying computation stages, data flow, iteration counts, and two key parallelism parameters – pipeline depth (temporal) and duplication factor (spatial).
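The paper's concrete DSL syntax is not reproduced in this summary. As a rough illustration of the two parallelism knobs it exposes, here is a hypothetical Python-embedded sketch; all names and numbers (such as `cell_update` and `ops_per_element`) are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Stage:
    """One computation stage of a stream pipeline (abstract kernel)."""
    name: str
    ops_per_element: int          # arithmetic ops applied to each element

@dataclass
class StreamPipeline:
    """Hierarchical description: a chain of stages, deepened (temporal
    parallelism) and duplicated (spatial parallelism)."""
    stages: list
    depth: int = 1                # cascaded pipeline copies
    duplication: int = 1          # side-by-side pipeline copies

    def iterations_per_pass(self):
        # A depth-d pipeline advances the iterative computation d steps
        # per streaming pass over the data.
        return self.depth

    def total_stage_instances(self):
        return len(self.stages) * self.depth * self.duplication

# Hypothetical fluid-dynamics kernel: a single update stage per sweep.
update = Stage("cell_update", ops_per_element=14)
design = StreamPipeline(stages=[update], depth=2, duplication=2)
```

Changing only `depth` and `duplication` then yields different hardware structures from the same abstract computation, which is the separation the DSL is described as providing.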

A dedicated compiler translates the DSL description into a parametrized hardware template. The template contains configurable pipeline stages and replication units, and the compiler automatically generates a design space consisting of all feasible combinations of depth and duplication that satisfy the FPGA resource model and the external memory bandwidth model. A design‑space‑exploration engine evaluates each candidate using analytical models for throughput, latency, resource consumption, and power, producing a Pareto front of solutions. Designers can then select the configuration that best matches their performance‑per‑watt target without manually writing or modifying RTL code.
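The summary does not give the paper's actual cost models. The following sketch shows, under assumed budgets and coefficients (none of the numbers come from the paper), how such an exploration engine might enumerate (depth, duplication) pairs, prune infeasible candidates against resource and bandwidth models, and keep a performance-versus-power Pareto front:

```python
from itertools import product

# Illustrative budgets and model coefficients; these are assumptions for
# the sketch, not numbers from the paper.
LUT_BUDGET, BRAM_BUDGET = 100_000, 1_000
F_CLK_MHZ = 200                  # pipeline clock
BW_MELEMS = 600                  # external-memory ceiling, Melements/s
LUT_PER_STAGE, BRAM_PER_STAGE = 8_000, 40

def evaluate(depth, dup):
    """Analytical model: (Mupdates/s, watts) for one candidate, or None."""
    if LUT_PER_STAGE * depth * dup > LUT_BUDGET:
        return None
    if BRAM_PER_STAGE * depth * dup > BRAM_BUDGET:
        return None
    # Element rate is capped by shared memory bandwidth; a depth-d pipeline
    # performs d iterations on every element it streams in.
    elem_rate = min(dup * F_CLK_MHZ, BW_MELEMS)
    perf = depth * elem_rate              # M cell-updates per second
    power = 5.0 + 0.5 * depth * dup       # static + per-stage dynamic
    return perf, power

space = {(d, p): evaluate(d, p) for d, p in product(range(1, 9), repeat=2)}
feasible = {k: v for k, v in space.items() if v is not None}

# Pareto front: drop any design beaten or matched on both speed and power
# by a design with different metrics.
pareto = [k for k, (pf, pw) in feasible.items()
          if not any(pf2 >= pf and pw2 <= pw and (pf2, pw2) != (pf, pw)
                     for pf2, pw2 in feasible.values())]
```

Even with these toy coefficients the model reproduces the qualitative behavior described above: once duplication saturates the bandwidth ceiling, further copies add power without adding performance, so those candidates fall off the Pareto front.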

To validate the methodology, the authors implement a fluid‑dynamics simulation (a simplified 2‑D Navier‑Stokes stencil) as a representative iterative stream kernel. The kernel repeatedly updates grid cells and applies boundary conditions, requiring regular reads/writes to off‑chip DRAM. Four distinct parallelism mixes are synthesized on a Xilinx UltraScale+ board: (1) depth = 1, duplication = 1 (baseline); (2) depth = 4, duplication = 1 (pure temporal); (3) depth = 1, duplication = 4 (pure spatial); and (4) depth = 2, duplication = 2 (mixed). For each design the authors measure clock frequency, logic/DSP/BRAM utilization, DDR4 bandwidth usage, and power consumption.
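The exact kernel is not specified in this summary. A minimal software analogue of such an iterative stencil stream, assuming a 4-point diffusion-style update with fixed boundary values, where a depth-d pipeline fuses d sweeps into each external-memory pass, might look like:

```python
def stencil_sweep(grid, alpha=0.25):
    """One sweep: 4-point diffusion update on interior cells; fixed
    (Dirichlet) boundary cells are left untouched."""
    ny, nx = len(grid), len(grid[0])
    out = [row[:] for row in grid]                 # double buffer
    for y in range(1, ny - 1):
        for x in range(1, nx - 1):
            out[y][x] = grid[y][x] + alpha * (
                grid[y - 1][x] + grid[y + 1][x] +
                grid[y][x - 1] + grid[y][x + 1] - 4 * grid[y][x])
    return out

def run(grid, iters, depth=1):
    """A depth-d pipeline applies d sweeps on-chip per streaming pass,
    so `iters` iterations need only iters // depth external-memory passes."""
    for _ in range(iters // depth):
        for _ in range(depth):                     # fused on-chip in hardware
            grid = stencil_sweep(grid)
    return grid

# Demo: 4x4 grid, hot boundary (1.0), cold interior (0.0).
g = [[1.0] * 4 for _ in range(4)]
for y in (1, 2):
    for x in (1, 2):
        g[y][x] = 0.0
result = run(g, iters=2, depth=2)   # one memory pass, two fused sweeps
```

The numerical result is independent of `depth`; only the number of trips to external memory changes, which is exactly why deepening trades BRAM for bandwidth.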

Results reveal clear trade‑offs. Increasing depth reduces latency and raises clock speed but consumes more BRAM for pipeline registers, eventually limiting the maximum feasible depth. Increasing duplication spreads computation across parallel pipelines, boosting raw throughput, but it also multiplies the demand on external memory bandwidth; when the DDR4 interface saturates, performance plateaus or even degrades. The configuration with depth = 2 and duplication = 2 strikes the best balance: it fits within the available on‑chip resources, stays under the memory‑bandwidth ceiling, and achieves the highest throughput per watt—approximately 1.35× better than the pure‑depth (depth = 4) or pure‑duplication (duplication = 4) extremes.

A notable contribution is the quantitative analysis of pipeline utilization. By modeling stage activity, the exploration engine can discard designs with many idle stages (over‑deep pipelines) early, saving synthesis time. Memory‑access pattern analysis shows that designs that maximize data reuse on‑chip reduce external bandwidth pressure, further improving energy efficiency.
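The stage-activity model itself is not detailed in this summary. One simple, assumed formulation counts the stage slots idled on the final pass whenever the iteration count is not a multiple of the pipeline depth:

```python
import math

def pipeline_utilization(iters, depth):
    """Fraction of stage slots doing useful work when `iters` iterations
    run on a depth-`depth` pipeline: if depth does not divide iters, the
    final pass leaves depth - (iters % depth) stages idle."""
    passes = math.ceil(iters / depth)
    return iters / (passes * depth)

# Illustrative sweep: 100 iterations on increasingly deep pipelines.
util = {d: pipeline_utilization(100, d) for d in (1, 2, 4, 8, 16, 32)}
```

Under this model a depth-32 pipeline running 100 iterations keeps only about 78% of its stage slots busy, which is the kind of over-deep candidate the exploration engine can discard before synthesis.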

In summary, the paper demonstrates that a DSL‑driven, automated design‑space‑exploration framework can effectively navigate the complex interplay between temporal and spatial parallelism, FPGA resource limits, and memory bandwidth constraints. It provides a practical workflow for developers to obtain high‑performance, low‑power stream accelerators without hand‑tuned RTL. Suggested future work includes extending the approach to multi‑kernel workloads, dynamic power‑management schemes, and adaptive runtime reconfiguration, which would enable real‑time, application‑aware optimization of parallelism ratios.

