High Level Synthesis with a Dataflow Architectural Template


In this work, we present a new approach to high level synthesis (HLS), where high level functions are first mapped to an architectural template before hardware synthesis is performed. As FPGA platforms are especially suitable for implementing streaming processing pipelines, we transform conventional high level programs into multi-stage dataflow engines [1]. This target template naturally overlaps slow memory accesses with computation and therefore tolerates memory subsystem latency much better. Using a state-of-the-art HLS tool for the actual circuit generation, we observe up to 9x improvement in overall performance when the dataflow architectural template is used as an intermediate compilation target.


💡 Research Summary

The paper introduces a novel high‑level synthesis (HLS) flow that inserts a data‑flow architectural template between the source code and the conventional HLS tool. The authors observe that many FPGA‑accelerated applications are naturally expressed as streaming pipelines, yet traditional HLS compilers treat the program as a monolithic block. Consequently, memory accesses—especially to off‑chip DRAM—become a dominant source of latency and limit overall throughput.

To address this, the proposed methodology first parses the high‑level C/C++ program into a data‑dependency graph. Nodes representing computations and memory operations are identified, and the graph is partitioned into clusters that become individual pipeline stages. Between stages, FIFO buffers are automatically inserted, allowing each stage to operate asynchronously. The buffer depth is chosen to hide the worst‑case memory latency, effectively overlapping data fetches with subsequent computations. This transformation yields a multi‑stage data‑flow engine that can be fed directly into an existing HLS tool (e.g., Xilinx Vivado HLS) without any modification of the tool itself.

The authors implemented the flow and evaluated it on a Xilinx Zynq‑7000 platform using four representative benchmarks: a Gaussian blur filter, dense matrix multiplication, a streaming FFT, and a video‑frame transformation kernel. Compared with a baseline HLS flow that synthesizes the original sequential code, the data‑flow‑templated designs achieved speed‑ups ranging from 4× to 9×, with an average improvement of 5.6×. Resource utilization (LUTs, registers, BRAM) increased modestly—about 30 % on average—remaining well within the capacity of the target device. The performance gain stems primarily from the ability to hide memory latency: while one stage stalls waiting for data, the next stage continues processing previously fetched data, keeping the pipeline fully occupied.

Beyond raw performance, the approach offers significant productivity benefits. Designers write high‑level code as usual; the automated transformation handles pipeline scheduling, buffer insertion, and back‑pressure management, tasks that would otherwise require manual RTL coding or extensive HLS pragmas. Because the data‑flow template is an intermediate representation, it can be combined with any commercial HLS tool, preserving existing verification flows and toolchains.

The paper also discusses limitations. Applications with strong inter‑stage data dependencies, irregular control flow, or those that cannot be naturally expressed as streams may not benefit from the template; in such cases, traditional HLS optimizations (loop unrolling, array partitioning) may be more appropriate. Moreover, the automatic clustering algorithm currently relies on heuristic thresholds for latency estimation, which could be refined with more sophisticated performance models.

In conclusion, inserting a data‑flow architectural template before HLS synthesis provides a practical and effective way to mitigate memory‑bound bottlenecks in FPGA designs. The method delivers up to ninefold speed‑ups on typical streaming workloads while requiring minimal changes to existing design flows. Future work will focus on improving the clustering heuristics, exploring dynamic buffer management, and extending the approach to heterogeneous platforms that combine CPUs and FPGAs.