BitLogic: Training Framework for Gradient-Based FPGA-Native Neural Networks
The energy and latency costs of deep neural network inference are increasingly driven by deployment rather than training, motivating hardware-specialized alternatives to arithmetic-heavy models. Field-Programmable Gate Arrays (FPGAs) provide an attractive substrate for such specialization, yet existing FPGA-based neural approaches are fragmented and difficult to compare. We present BitLogic, a fully gradient-based, end-to-end trainable framework for FPGA-native neural networks built around Lookup Table (LUT) computation. BitLogic replaces multiply-accumulate operations with differentiable LUT nodes that map directly to FPGA primitives, enabling native binary computation, sparse connectivity, and efficient hardware realization. The framework offers a modular functional API supporting diverse architectures, along with learned encoders, hardware-aware heads, and multiple boundary-consistent LUT relaxations. An automated Register Transfer Level (RTL) export pipeline translates trained PyTorch models into synthesizable HDL, ensuring equivalence between software and hardware inference. Experiments across standard vision benchmarks and heterogeneous hardware platforms demonstrate competitive accuracy and substantial gains in FPGA efficiency, including 72.3% test accuracy on CIFAR-10 achieved with fewer than 0.3M logic gates, while attaining sub-20 ns single-sample inference using only LUT resources.
💡 Research Summary
BitLogic is a comprehensive, gradient‑based framework that enables the design, training, and hardware deployment of FPGA‑native neural networks using lookup‑table (LUT) computation as the fundamental operation. Traditional deep neural networks rely heavily on multiply‑accumulate (MAC) units, which map poorly onto the bit‑level logic fabric of FPGAs. BitLogic replaces MACs with differentiable LUT nodes that directly correspond to the FPGA’s native K‑input LUT primitives. Each LUT node implements an n‑input Boolean function via a truth table of 2ⁿ entries; during training the discrete truth table is relaxed to a continuous, differentiable surrogate (e.g., a probabilistic expectation over Bernoulli‑distributed inputs or a straight‑through estimator). This relaxation allows back‑propagation while preserving an exact mapping to hardware after training: a simple 0.5 threshold quantizes the learned truth‑table parameters to binary values, yielding an exact Boolean function that can be instantiated as a physical LUT.
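The probabilistic relaxation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not BitLogic's actual implementation; `lut_forward` and `quantize` are hypothetical names. For an n‑input node with truth table θ ∈ [0, 1]^(2ⁿ), the surrogate output is the expectation of the table over independent Bernoulli input bits:

```python
import torch

def lut_forward(theta, p):
    """Differentiable surrogate for an n-input LUT node (illustrative sketch).

    theta: (2**n,) learnable truth-table entries in [0, 1]
    p:     (batch, n) probabilities that each input bit is 1
    Returns the expected LUT output under independent Bernoulli inputs.
    """
    n = p.shape[1]
    # Enumerate all 2**n binary input assignments a in {0,1}^n
    assignments = torch.tensor(
        [[(idx >> bit) & 1 for bit in range(n)] for idx in range(2**n)],
        dtype=p.dtype,
    )  # shape (2**n, n)
    # P(inputs == a) = prod_i p_i^{a_i} * (1 - p_i)^{1 - a_i}
    probs = torch.where(assignments.bool(), p.unsqueeze(1), 1 - p.unsqueeze(1))
    joint = probs.prod(dim=2)    # (batch, 2**n): probability of each assignment
    return joint @ theta         # (batch,): expectation of the truth table

def quantize(theta):
    """Post-training: threshold at 0.5 to recover the exact Boolean table."""
    return (theta >= 0.5).to(torch.uint8)
```

With hard input bits (p ∈ {0, 1}) the expectation collapses to a plain table lookup, which is exactly why the quantized network matches the trained surrogate on binary inputs.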
The framework is built around three modular components: (1) Encoders that transform real‑valued inputs into binary streams (thermometer encoding, quantization‑based encodings, data‑driven threshold learning). (2) Layers/Blocks composed of many LUT nodes, where connections between inputs and LUTs are defined by a mapping Mⱼ that can be random, locality‑based, or learned; a top‑K sparsity scheme is introduced so that each node selects only the K most salient input bits, dramatically reducing LUT count and routing complexity. Blocks reuse the same LUT layer across sliding windows, providing convolution‑like behavior while sharing parameters, and additional block types (residual, attention, transposed convolution) are supported. (3) Heads that aggregate binary feature vectors into real‑valued outputs for classification or regression, ranging from simple group‑sum (pop‑count per class) to weighted linear combinations. Heads and encoders are implemented efficiently on FPGA using DSP blocks for arithmetic and BRAM for threshold storage.
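Two of these pieces can be sketched compactly in PyTorch. The function names below are illustrative, not BitLogic's API: a thermometer encoder that turns a real value into a monotone bit pattern, and a top‑K selection that reduces each node's candidate inputs to its K most salient bits.

```python
import torch

def thermometer_encode(x, thresholds):
    """Thermometer-encode real values into binary bit-vectors (sketch).

    x:          (batch,) real-valued inputs
    thresholds: (T,) sorted cut points (fixed, or learned from data)
    Returns (batch, T) bits where bit t is 1 iff x > thresholds[t].
    """
    return (x.unsqueeze(1) > thresholds.unsqueeze(0)).float()

def topk_mapping(saliency, k):
    """Top-K sparse connectivity (sketch): each LUT node keeps only the
    indices of its k most salient candidate input bits.

    saliency: (num_nodes, num_inputs) learned importance scores
    Returns (num_nodes, k) integer index map, one row per node.
    """
    return saliency.topk(k, dim=1).indices
```

Because thermometer codes are monotone, a larger input simply sets more bits, which preserves ordering information for the downstream Boolean logic.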
BitLogic’s API is configuration‑driven: a model is described as an ordered list of registered components (encoders → blocks/layers → heads). All architectural choices (widths, kernel sizes, connectivity, node type, sparsity level) are specified in a JSON‑like configuration, allowing rapid swapping of components without code changes. This promotes reproducible experimentation and fair comparison across different LUT‑based designs.
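A configuration in this spirit might look like the following. The field names here are invented for illustration; the framework's actual schema may differ:

```json
{
  "model": [
    {"type": "thermometer_encoder", "bits_per_feature": 8},
    {"type": "lut_layer", "width": 1024, "lut_inputs": 6, "mapping": "learned", "topk": 6},
    {"type": "lut_conv_block", "kernel_size": 3, "stride": 1},
    {"type": "group_sum_head", "num_classes": 10}
  ]
}
```

Swapping, say, `"mapping": "learned"` for `"random"` then changes only the configuration, keeping the rest of the experiment identical.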
Training incorporates several stability techniques: gradient stabilization, in‑layer bit‑flip regularization, probabilistic node formulations, and specialized initialization schemes. Early training may use dense connectivity, which is gradually pruned to the final sparse mapping required for hardware synthesis. The framework supports multiple differentiable relaxations and exact gradient options, giving users flexibility to balance training speed and fidelity to the discrete target.
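One relaxation of the kind mentioned above, the straight‑through estimator, can be sketched in PyTorch as follows (an illustrative sketch, not the framework's exact formulation): the forward pass applies the hard 0.5 threshold, while the backward pass passes gradients through unchanged.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Straight-through estimator (sketch): hard 0/1 threshold in the
    forward pass, identity gradient in the backward pass."""

    @staticmethod
    def forward(ctx, theta):
        # Exact binarization, matching the post-training quantization rule
        return (theta >= 0.5).to(theta.dtype)

    @staticmethod
    def backward(ctx, grad_out):
        # Pretend the threshold was the identity so gradients still flow
        return grad_out

theta = torch.tensor([0.3, 0.8], requires_grad=True)
out = BinarizeSTE.apply(theta)   # -> tensor([0., 1.])
out.sum().backward()             # theta.grad is all ones
```

Training through the hard threshold this way keeps the forward computation faithful to the discrete target while still providing a usable gradient signal.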
After training, BitLogic automatically exports the PyTorch model to synthesizable RTL. Each component implements a to_hdl() method that generates hierarchical Verilog/VHDL modules; the export pipeline assembles these into a complete design ready for Vivado, Quartus, or other FPGA toolchains. The authors demonstrate the workflow on Xilinx UltraScale+ devices, achieving sub‑20 ns single‑sample inference latency while using fewer than 0.3 M LUTs. Benchmark results include 72.3 % test accuracy on CIFAR‑10, 23.4 % on CIFAR‑100, 93.8 % on Fashion‑MNIST, and 99.1 % on MNIST, all with resource usage and latency superior to prior LUT‑centric approaches such as LUTNet, LogicNets, PolyLUT, and neural‑gate networks.
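The per-component HDL generation can be illustrated with a toy emitter for a single quantized LUT node. This is a hedged sketch under simplifying assumptions; BitLogic's real `to_hdl()` output and module conventions are not shown in the summary and will differ:

```python
def lut_to_hdl(name, table):
    """Emit a tiny Verilog module for one quantized LUT node (sketch).

    name:  module name (hypothetical naming; not BitLogic's actual scheme)
    table: list of 0/1 truth-table entries, length 2**n
    """
    n = (len(table) - 1).bit_length()  # number of address bits
    init = "".join(str(b) for b in reversed(table))  # MSB-first init vector
    return (
        f"module {name} (input [{n-1}:0] addr, output out);\n"
        f"  wire [{len(table)-1}:0] table_ = {len(table)}'b{init};\n"
        f"  assign out = table_[addr];\n"
        "endmodule\n"
    )
```

An export pipeline along these lines would walk the trained model, call each component's generator, and stitch the resulting modules into a top-level design for the synthesis toolchain.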
In summary, BitLogic unifies the fragmented landscape of FPGA‑specific neural network research by providing a fully differentiable LUT‑based computation model, a modular configuration‑driven API, and an end‑to‑end hardware export flow. It enables designers to co‑optimize algorithmic accuracy and hardware cost from the outset, delivering state‑of‑the‑art performance on edge devices where power and latency are critical. Future work may extend scalability to larger input dimensions, multi‑FPGA clusters, and broader application domains such as speech and time‑series processing.