A High-level Synthesis Toolchain for the Julia Language
With the push towards exascale computing and data-driven methods, problem sizes have increased dramatically, raising the computational requirements of the underlying algorithms. This has led to a push to offload computations to general-purpose hardware accelerators such as GPUs and TPUs, and to renewed interest in designing problem-specific accelerators using FPGAs. However, the development of these problem-specific accelerators currently suffers from the "two-language problem": algorithms are developed in one (usually higher-level) language, but the kernels are implemented in another language at a completely different level of abstraction, requiring fundamentally different expertise. To address this problem, we propose a new MLIR-based compiler toolchain that unifies the development process by automatically compiling kernels written in the Julia programming language into SystemVerilog without the need for any additional directives or language customisations. Our toolchain supports both dynamic and static scheduling, directly integrates with the AXI4-Stream protocol to interface with subsystems like on- and off-chip memory, and generates vendor-agnostic RTL. This prototype toolchain is able to synthesize a set of signal-processing and mathematical benchmarks that operate at 100 MHz on real FPGA devices, achieving between 59.71% and 82.6% of the throughput of designs generated by state-of-the-art toolchains that only compile from low-level languages like C or C++. Overall, this toolchain allows domain experts to write compute kernels in Julia as they normally would, and then retarget them to an FPGA without additional pragmas or modifications.

CCS Concepts: • Hardware → High-level and register-transfer level synthesis; Hardware description languages and compilation; Hardware accelerators.
💡 Research Summary
The paper addresses a fundamental bottleneck in modern FPGA‑based accelerator development: the “two‑language problem,” where high‑level algorithmic code is written in a language such as Python or MATLAB, while the performance‑critical kernels must be re‑implemented in a low‑level hardware‑oriented language like C, C++, or OpenCL. To eliminate this gap, the authors introduce a complete MLIR‑based compilation toolchain that takes kernels written in the high‑level, dynamic language Julia and automatically generates vendor‑agnostic SystemVerilog RTL without any additional pragmas, directives, or language extensions.
The core of the approach is a multi-stage lowering pipeline built on the Multi-Level Intermediate Representation (MLIR) infrastructure. In the first stage, Julia source code is parsed and transformed into a Julia-specific MLIR dialect that captures the language's multiple dispatch, dynamic typing, and higher-order functions, lowering them into static single-assignment (SSA) form. Standard compiler passes (constant propagation, dead-code elimination, function inlining, and loop canonicalization) are applied to produce a clean, analyzable IR.
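As a hedged illustration (the function name and signature here are ours, not the paper's), the kind of kernel such a frontend ingests can be plain Julia with concrete types, so that after type inference no dynamic dispatch remains and the loop lowers cleanly to SSA form:

```julia
# Hypothetical example: a dot-product kernel written in ordinary Julia.
# With concrete Vector{Float32} arguments, type inference makes the body
# fully static, so a frontend like the one described can lower it to an
# SSA-form IR with no runtime dispatch left.
function dot_kernel(a::Vector{Float32}, b::Vector{Float32})
    acc = 0.0f0
    for i in eachindex(a)      # counted loop: a natural affine-loop candidate
        acc += a[i] * b[i]     # multiply-accumulate, a fused-multiply-add candidate
    end
    return acc
end
```

Because every type is concrete, passes such as constant propagation and inlining see a fully monomorphic call graph, which is what makes the subsequent hardware-centric analyses tractable.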
In the second stage, the toolchain performs hardware‑centric transformations. Loop nests are analyzed for parallelism and pipelining opportunities; the compiler automatically decides whether to unroll, pipeline, or fuse loops based on resource estimates. Memory accesses are examined to generate on‑chip buffers and streaming interfaces, and data‑flow graphs are constructed to expose natural streaming pipelines. Crucially, the toolchain supports both static scheduling (where all loop bounds and parameters are known at compile time) and dynamic scheduling (where runtime parameters are mapped to configurable registers). This dual capability enables a wide range of applications, from fixed‑size signal‑processing kernels to adaptable machine‑learning primitives.
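The static/dynamic distinction can be sketched in source terms (both functions below are our own illustrative examples, not code from the paper): in the static case every trip count is a compile-time constant, so the schedule can be fixed at synthesis time; in the dynamic case a bound arrives at run time and would map to a configuration register in hardware.

```julia
# Static scheduling sketch: the trip count (8) is a compile-time constant,
# so initiation interval and pipeline depth can be fixed during synthesis.
function scale_static(x::NTuple{8,Float32})
    return ntuple(i -> 2.0f0 * x[i], 8)
end

# Dynamic scheduling sketch: `n` is only known at run time; in hardware it
# would be held in a configurable register and the loop controlled by
# handshaking rather than a fixed schedule.
function scale_dynamic!(y::Vector{Float32}, x::Vector{Float32}, n::Int)
    for i in 1:n
        y[i] = 2.0f0 * x[i]
    end
    return y
end
```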
Integration with the AXI4-Stream protocol is built into the backend. The compiler emits SystemVerilog modules that expose standard AXI-Stream master and slave ports for input and output streams, as well as AXI-Lite control registers for configuration. Because the generated RTL adheres to the AXI4-Stream specification, it can be instantiated directly in existing FPGA designs that already use Xilinx, Intel, or other vendor IP blocks, eliminating the need for custom glue logic. The final output is pure SystemVerilog, deliberately avoiding vendor-specific extensions, so the design can be synthesized with any mainstream tool (e.g., AMD Vivado or Intel Quartus).
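A streaming kernel's software shape can be modeled with Julia channels (this is our own analogy, not the paper's API): each element consumed corresponds to a beat on an AXI4-Stream slave port, each element produced to a beat on the master port, and backpressure is what the TVALID/TREADY handshake provides in hardware.

```julia
# Hypothetical software model of a streaming kernel. Reading from `input`
# stands in for the slave-side AXI4-Stream handshake; `put!` on `output`
# stands in for the master side. Channel blocking plays the role of
# TVALID/TREADY backpressure.
function stream_mac(input::Channel{Float32}, output::Channel{Float32}, coeff::Float32)
    acc = 0.0f0
    for x in input              # blocks until a beat is available (TVALID)
        acc += coeff * x        # running multiply-accumulate
        put!(output, acc)       # blocks until the consumer is ready (TREADY)
    end
    close(output)               # end of stream, analogous to TLAST
end
```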
The authors evaluate the prototype on a representative suite of signal-processing and mathematical kernels: Fast Fourier Transform (FFT), finite-impulse-response (FIR) filters, matrix multiplication, convolution, and a simple neural-network inference block. Each benchmark is synthesized to run at a 100 MHz clock frequency on a mid-range Xilinx UltraScale+ and an Intel Stratix 10 FPGA. Throughput is measured against reference implementations generated by state-of-the-art HLS tools that accept C/C++ input (Xilinx Vitis HLS, Intel HLS). The Julia-based toolchain achieves between 59.71% and 82.6% of the reference throughput, with area overheads ranging from 5% to 15% depending on the benchmark. Importantly, the development time is dramatically reduced: a domain expert can write a kernel in native Julia, reuse existing Julia packages for verification, and obtain a synthesizable RTL design in a matter of hours, compared to days of manual C/HLS coding and debugging.
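The workflow of writing a benchmark kernel in native Julia and verifying it in software before synthesis can be sketched as follows (a minimal FIR example of our own; the paper does not publish its benchmark source):

```julia
# Hypothetical FIR benchmark kernel in plain Julia. The same source can act
# both as the synthesis input and as the software golden model, so the
# design is checked in Julia before any RTL is generated.
function fir(x::Vector{Float32}, h::Vector{Float32})
    n, k = length(x), length(h)
    y = zeros(Float32, n)
    for i in 1:n
        for j in 1:min(k, i)          # convolve with the causal taps available
            y[i] += h[j] * x[i - j + 1]
        end
    end
    return y
end
```

A domain expert can exercise `fir` directly in the REPL against known signals, then hand the very same function to the toolchain for synthesis.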
Limitations are acknowledged. Currently the compiler supports only a subset of Julia’s language features (no metaprogramming, limited support for complex numbers, and restricted use of mutable structs). Advanced memory hierarchies such as multi‑level caches or external DRAM controllers are not automatically generated and must be added manually. Power‑aware optimizations and multi‑objective design space exploration (area, latency, energy) are left for future work.
In conclusion, the paper presents a compelling end‑to‑end solution that bridges high‑level algorithmic development and low‑level hardware implementation. By leveraging MLIR to retain the expressive power of Julia while systematically lowering to RTL, the toolchain empowers domain scientists and engineers to target FPGAs without learning a separate hardware description language. The demonstrated performance, combined with the significant productivity gains, suggests that high‑level synthesis from languages like Julia could become a practical pathway for accelerating the increasingly data‑intensive workloads of exascale and AI‑driven computing.