Exploration of Unary Arithmetic-Based Matrix Multiply Units for Low Precision DL Accelerators


General matrix multiplication (GEMM) is a fundamental operation in deep learning (DL). With DL moving increasingly toward low precision, recent works have proposed novel unary GEMM designs as an alternative to conventional binary GEMM hardware. A rigorous evaluation of recent unary and binary GEMM designs is needed to assess the potential of unary hardware for future DL compute. This paper focuses on unary GEMM designs for integer-based DL inference and performs a detailed evaluation of three latest unary design proposals, namely, uGEMM, tuGEMM and tubGEMM, by comparing them to a conventional binary GEMM. Rigorous post-synthesis evaluations beyond prior works are performed across varying bit-widths and matrix sizes to assess the designs’ tradeoffs and determine optimal sweet spots. Further, we perform weight sparsity analysis across eight pretrained convolutional neural networks (CNNs) and the LLaMA2 large language model (LLM). In this work, we demonstrate how unary GEMM can be effectively used for energy-efficient compute in future edge AI accelerators.


💡 Research Summary

The paper presents a comprehensive evaluation of three recent unary‑based general matrix‑multiply (GEMM) units—uGEMM, tuGEMM, and tubGEMM—against a conventional binary GEMM (bGEMM) in the context of low‑precision integer deep‑learning inference. The authors first clarify the two primary unary coding schemes: rate‑coding (frequency of 1s) and temporal‑coding (run‑length of consecutive 1s). uGEMM implements a unified architecture that can handle both rate and temporal encodings, using a single AND gate for multiplication and parallel adder trees for accumulation. tuGEMM is the first fully temporal‑coded design; it replaces arithmetic units with simple counters and shift‑based accumulators, achieving the smallest silicon area and lowest static power. tubGEMM introduces a hybrid approach, encoding one operand temporally and the other in binary, and employs a novel “2‑unary” scheme that halves the worst‑case latency compared with tuGEMM.
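A minimal sketch of the two coding schemes, with temporal‑unary multiplication reduced to counting in the style of the counter‑based tuGEMM (the function names and stream handling here are our own simplification for illustration, not the paper's RTL):

```python
# Illustrative sketch only: rate vs. temporal unary encodings of a small
# unsigned integer, and temporal-unary multiplication realized as counting.

def rate_encode(value, length):
    """Rate coding: `value` ones spread roughly evenly over `length` bits."""
    return [1 if (i + 1) * value // length > i * value // length else 0
            for i in range(length)]

def temporal_encode(value, length):
    """Temporal coding: a run of `value` consecutive ones, then zeros."""
    return [1] * value + [0] * (length - value)

def temporal_multiply(a_stream, b_stream):
    """For every '1' bit of `a`, stream `b` and count its ones.
    Summing b's ones a times yields a * b -- a multiplier built from
    a counter instead of an adder tree."""
    count = 0
    for bit_a in a_stream:
        if bit_a:
            count += sum(b_stream)
    return count
```

Note that the inner loop is replayed once per '1' in the first operand, which is exactly why temporal-unary latency grows with the magnitudes being multiplied.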

All four designs are synthesized in the same Nangate 45 nm open‑source standard‑cell library using Synopsys Design Compiler, operating at 400 MHz. Post‑synthesis metrics are collected for 2‑, 4‑, and 8‑bit precisions and for two matrix dimensions (16×16 and 32×32). The authors also extend the evaluation to larger arrays (64×64 and 128×128) for 4‑bit precision to emulate edge‑TPU and cloud‑TPU scales. The measured quantities include silicon area (µm²), static power (mW), energy (nJ) derived from worst‑case latency, and area‑delay product (ADP) as a combined spatio‑temporal figure of merit.
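As a rough sketch of how these figures of merit combine, assuming the stated 400 MHz clock (the helper names and any example numbers are illustrative, not the paper's measured values):

```python
CLOCK_HZ = 400e6  # synthesis clock stated above

def energy_nj(power_mw, latency_cycles, clock_hz=CLOCK_HZ):
    """Energy in nJ: power multiplied by worst-case execution time."""
    time_s = latency_cycles / clock_hz
    return power_mw * 1e-3 * time_s * 1e9

def adp(area_um2, latency_cycles, clock_hz=CLOCK_HZ):
    """Area-delay product (um^2 * s): the combined spatio-temporal metric."""
    return area_um2 * latency_cycles / clock_hz
```

For instance, a hypothetical 2 mW unit that needs 400 cycles runs for 1 µs and thus consumes 2 nJ; the same latency multiplies directly into its ADP.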

Key quantitative findings:

  • Area & Power – tuGEMM consistently achieves the smallest area and lowest power across all configurations, thanks to its counter‑centric architecture that avoids large adder trees. tubGEMM follows closely, requiring only roughly 10–15 % of uGEMM’s area and 12–20 % of its power. bGEMM’s area grows rapidly with bit‑width, while uGEMM remains the most power‑hungry due to its extensive parallel adders.
  • Latency & Energy – tuGEMM’s worst‑case latency grows quadratically with the represented value range, i.e., exponentially with bit‑width (≈ N·(2^w − 1)^2 cycles), leading to the highest energy consumption despite its low static power. For 8‑bit 32×32 GEMM, tuGEMM’s energy exceeds 12 µJ, far above the other designs. tubGEMM reduces worst‑case latency to N·(2^w − 2) cycles, making it the most energy‑efficient at 2‑bit and comparable to bGEMM at 4‑bit. bGEMM’s energy is minimal at higher precisions because its latency is a constant N cycles, independent of bit‑width.
  • Area‑Delay Product (ADP) – bGEMM has the lowest ADP for small matrices due to its short latency, but tubGEMM’s ADP becomes competitive for larger arrays (64×64, 128×128) where its modest latency increase is outweighed by its area advantage. tuGEMM’s ADP is prohibitively high (up to 20× larger than bGEMM) because of its extreme latency.
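Taking the worst‑case cycle counts quoted in the bullets above at face value (they are reproduced here, not re‑derived), a short tabulation makes the scaling gap concrete; `worst_case_cycles` is our own helper name:

```python
# Worst-case latency in cycles for inner dimension N and bit-width w,
# per the formulas quoted above: bGEMM is N, tuGEMM is N*(2^w - 1)^2,
# tubGEMM is N*(2^w - 2).

def worst_case_cycles(design, n, w):
    levels = 2 ** w
    return {
        "bGEMM":   n,
        "tubGEMM": n * (levels - 2),
        "tuGEMM":  n * (levels - 1) ** 2,
    }[design]

for w in (2, 4, 8):
    row = {d: worst_case_cycles(d, 32, w)
           for d in ("bGEMM", "tubGEMM", "tuGEMM")}
    print(f"w={w}: {row}")
```

At 2‑bit the three designs stay within an order of magnitude of each other, while at 8‑bit tuGEMM's quadratic term pushes it roughly four orders of magnitude above bGEMM, matching the energy ranking reported above.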

The authors also profile weight sparsity for eight popular CNNs (MobileNetV2/V3, InceptionV3, ShuffleNetV2, GoogleNet, ResNet18/50, ResNeXt101) and a quantized LLaMA2‑70B large language model. Two sparsity metrics are considered: word sparsity (percentage of zero‑valued weights) and bit sparsity (percentage of zero bits in the temporal‑unary stream). Measured bit sparsities range from 60 % to 70 % for the CNNs and exceed 50 % for the LLM’s most significant bits. Because temporal‑unary processing time is proportional to the number of ‘1’s, higher bit sparsity directly reduces dynamic latency for tuGEMM and tubGEMM. The authors estimate a dynamic latency reduction of roughly 30–40 % on average for the evaluated models.
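The two sparsity metrics can be sketched as follows for a list of quantized integer weights (a simplified illustration; the encoding length of 2^bits − 1 slots per temporally coded value is our assumption, and the function names are ours):

```python
def word_sparsity(weights):
    """Fraction of exactly-zero weights."""
    return sum(1 for w in weights if w == 0) / len(weights)

def temporal_bit_sparsity(weights, bits=4):
    """Fraction of zero bits when each magnitude |w| is encoded as a run
    of |w| ones within a stream of 2**bits - 1 slots (temporal unary)."""
    stream_len = 2 ** bits - 1
    ones = sum(abs(w) for w in weights)
    return 1.0 - ones / (len(weights) * stream_len)
```

Because temporal‑unary processing time is proportional to the number of ones streamed, a bit sparsity in the 60–70 % range translates directly into proportionally fewer dynamic cycles for tuGEMM and tubGEMM.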

From these results the paper draws several design recommendations for edge AI accelerators:

  1. Ultra‑low precision (2‑bit) – tubGEMM offers the best energy efficiency and a favorable area‑power trade‑off, making it ideal for ultra‑compact, battery‑powered devices.
  2. Moderate precision (4‑bit) with larger PE arrays – tubGEMM scales well; for 64×64 and 128×128 arrays its energy is within 1.2× of bGEMM and its ADP is lower, suggesting it as a superior alternative to conventional binary units in future edge TPUs.
  3. Area‑constrained designs where static power dominates – tuGEMM’s minimal silicon footprint and static power make it attractive when die area is at a premium and latency tolerances are relaxed (e.g., batch‑processed inference).
  4. Workloads with high weight sparsity – both tuGEMM and tubGEMM can exploit bit‑level sparsity to cut dynamic cycles, providing additional energy savings beyond the worst‑case numbers reported.

In summary, the study demonstrates that unary GEMM units can indeed compete with traditional binary MAC arrays under specific operating points. While tuGEMM excels in area and static power, its latency penalty limits its energy efficiency. tubGEMM strikes a balanced compromise, delivering near‑binary energy performance at 2‑bit and superior scalability at 4‑bit for larger matrices. uGEMM, despite its flexibility, lags behind in all metrics for the evaluated configurations. The paper’s systematic post‑synthesis analysis, combined with realistic sparsity profiling of modern CNNs and LLMs, provides a solid foundation for architects to consider unary arithmetic as a viable path toward energy‑efficient edge AI accelerators.

