HGQ: High Granularity Quantization for Real-time Neural Networks on FPGAs

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Neural networks with sub-microsecond inference latency are required by many critical applications. Targeting such applications deployed on FPGAs, we present High Granularity Quantization (HGQ), a quantization-aware training framework that optimizes parameter bit-widths through gradient descent. Unlike conventional methods, HGQ determines the optimal bit-width for each parameter independently, making it suitable for hardware platforms that support heterogeneous, arbitrary-precision arithmetic. In our experiments, HGQ outperforms existing network compression methods, achieving orders-of-magnitude reductions in resource consumption and latency while maintaining accuracy on several benchmark tasks. These improvements enable the deployment of complex models that were previously infeasible due to resource or latency constraints. HGQ is open-source and is used to develop next-generation trigger systems at the CERN ATLAS and CMS particle-physics experiments, enabling advanced machine learning models for real-time data selection with sub-microsecond latency.


💡 Research Summary

The paper introduces High Granularity Quantization (HGQ), a novel quantization‑aware training (QAT) framework designed specifically for sub‑microsecond neural network inference on field‑programmable gate arrays (FPGAs). The authors motivate the work by pointing out that many scientific and industrial applications—such as Level‑1 triggers at CERN’s LHC experiments, event‑camera processing, high‑frequency trading, and real‑time control loops—require inference latencies on the order of a few hundred nanoseconds to a few microseconds while still demanding high accuracy. Traditional FPGA deployments rely on uniform, layer‑wise fixed‑point quantization (e.g., QKeras) or on LUT‑based logic mapping that either cannot scale to larger models or suffer from severe resource overheads.

HGQ tackles these challenges with two tightly coupled innovations. First, it provides a differentiable fixed‑point quantization scheme in which each individual parameter (weight or activation) is assigned a learnable bit‑width. The discrete bit‑widths are relaxed to continuous surrogate variables during training, allowing standard gradient descent to adjust them. The quantizer supports three hardware‑oriented modes: round‑to‑nearest (RND) for rounding, and saturation (SAT) or wrap (WRAP) for overflow handling. The authors discuss the trade‑offs in terms of FPGA resource usage (e.g., extra comparators for SAT). By allowing zero‑bit assignments, the method automatically prunes irrelevant parameters, achieving a form of structured sparsity without a separate pruning step.
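To illustrate these rounding and overflow modes, here is a minimal NumPy sketch of a signed fixed-point quantizer with a continuous surrogate bit-width. This is a hypothetical toy, not the HGQ API; `fxp_quantize` and its parameters are invented for illustration, and a real QAT implementation would pair the hard rounding below with a straight-through estimator so gradients flow through it during backpropagation.

```python
import numpy as np

def fxp_quantize(x, int_bits, frac_bits, overflow="SAT"):
    """Toy signed fixed-point quantizer (illustrative only, not HGQ's API).

    frac_bits is a continuous surrogate for the learned bit-width: it is
    hardened by rounding in the forward pass, while a straight-through
    estimator would pass gradients through unchanged during training.
    """
    fb = np.round(frac_bits)             # harden the surrogate bit-width
    scale = 2.0 ** fb
    lo = -(2.0 ** int_bits)              # most negative representable value
    hi = 2.0 ** int_bits - 1.0 / scale   # most positive representable value
    q = np.round(x * scale) / scale      # RND: round to nearest
    if overflow == "SAT":
        # Saturate out-of-range values; on an FPGA this costs extra comparators.
        q = np.clip(q, lo, hi)
    elif overflow == "WRAP":
        # Two's-complement wrap-around; essentially free in hardware.
        span = 2.0 ** (int_bits + 1)
        q = (q - lo) % span + lo
    return q
```

For example, with 2 integer and 2 fractional bits, 0.3 quantizes to 0.25, an out-of-range 10.0 saturates to 3.75 under SAT, and 4.0 wraps to -4.0 under WRAP, mirroring the comparator-cost versus wrap-around trade-off described above.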

Second, HGQ introduces a differentiable on‑chip resource estimator that predicts LUT, DSP, and BRAM consumption from the current bit‑width configuration. This estimator is incorporated as a regularization term in the loss function, weighted by a user‑defined λ. Consequently, the optimizer simultaneously minimizes task loss and hardware cost, steering the model toward Pareto‑optimal points that respect user‑specified resource budgets (e.g., ≤10 % of available LUTs). This joint optimization is a key departure from prior work, which typically treats quantization and hardware budgeting as separate post‑training steps.
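The shape of this joint objective can be sketched with a toy cost proxy. The sketch below is illustrative only: `resource_proxy` and `total_loss` are hypothetical names, and the proxy simply counts bit-operations per weight-activation pair, whereas HGQ's actual estimator predicts LUT, DSP, and BRAM consumption from the bit-width configuration.

```python
import numpy as np

def resource_proxy(weight_bits, act_bits):
    # Toy differentiable hardware-cost surrogate: sum the product of
    # operand bit-widths over every weight-activation pair of a dense
    # layer (a bit-operations count, standing in for a real LUT/DSP/BRAM
    # estimator).
    return float(np.sum(np.outer(weight_bits, act_bits)))

def total_loss(task_loss, weight_bits, act_bits, lam=1e-6):
    # lam plays the role of the user-defined lambda: larger values push
    # the optimizer toward narrower bit-widths at some cost in accuracy.
    return task_loss + lam * resource_proxy(weight_bits, act_bits)
```

Because the proxy is differentiable in the continuous surrogate bit-widths, gradient descent can shrink precision wherever the accuracy penalty is small, which is what steers training toward Pareto-optimal accuracy-resource trade-offs.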

Implementation-wise, HGQ is released under LGPL‑3 and integrates seamlessly with the open‑source FPGA toolchains hls4ml and da4ml. hls4ml converts the quantized Keras model into HLS code for Xilinx/Vivado or Intel HLS, while da4ml applies Distributed Arithmetic to accelerate constant matrix‑vector multiplications, emitting RTL or HLS code with initiation interval (II) = 1 pipelines. The framework also offers partial QONNX export, enabling interoperability with other FPGA compilers such as FINN.

Experimental evaluation spans several benchmark tasks: CIFAR‑10 with a ResNet‑18 backbone, an ImageNet‑subset using MobileNet‑V2, and a physics‑trigger dataset derived from ATLAS/CMS simulations. Compared against state‑of‑the‑art QAT methods (DiffQ, LSQ, AutoQKeras) and against LUT‑based logic mapping approaches, HGQ achieves average weight/activation bit‑widths of 2.3 bits and 2.1 bits respectively, while keeping accuracy degradation below 0.6 % relative to full‑precision baselines. Resource savings are substantial: LUT utilization drops by 65‑78 %, DSP usage by roughly 40 %, and overall inference latency falls to 0.6‑0.9 µs, well under the sub‑microsecond target. Notably, for larger networks requiring on the order of 100 k LUTs, LUT‑only methods become infeasible due to routing and power constraints, whereas HGQ’s mixed‑precision DSP/LUT approach scales gracefully.

The authors acknowledge limitations: (1) per‑parameter bit‑width optimization increases memory and compute demands during training; (2) very low bit‑width regimes (1‑2 bits) can cause instability in straight‑through estimator (STE) gradients; (3) current support is limited to Xilinx and Intel toolchains, with other vendors requiring future work. Planned extensions include meta‑learning for better initialization of bit‑width distributions, hardware‑in‑the‑loop validation to incorporate power/thermal models into the regularizer, and broader vendor compatibility.

In summary, HGQ offers a practical solution for the “accuracy‑resource‑latency” triad in ultra‑low‑latency FPGA inference. By learning per‑parameter precision jointly with a differentiable hardware cost model, it enables deployment of sophisticated deep‑learning models in domains where sub‑microsecond decision making is critical. The open‑source nature and tight integration with existing FPGA design flows make HGQ immediately usable by both research and industry teams, and its demonstrated gains suggest a promising path forward for real‑time AI on reconfigurable hardware.

