In-Pipeline Integration of Digital In-Memory-Computing into RISC-V Vector Architecture to Accelerate Deep Learning
Expanding Deep Learning applications toward edge computing demands architectures capable of delivering high computational performance and efficiency while adhering to tight power and memory constraints. Digital In-Memory Computing (DIMC) addresses this need by moving part of the computation directly into the memory arrays, significantly reducing data movement and improving energy efficiency. This paper introduces a novel architecture that extends the RISC-V Vector Instruction Set Architecture (ISA) to integrate a tightly coupled DIMC unit directly into the execution stage of the pipeline, accelerating Deep Learning inference at the edge. Specifically, the proposed approach adds four custom instructions dedicated to data loading, computation, and write-back, enabling flexible, fine-grained control of inference execution on the target architecture. Experimental results demonstrate high utilization of the DIMC tile within the vector pipeline and sustained throughput across the ResNet-50 model, achieving a peak performance of 137 GOP/s. The proposed architecture achieves a 217x speedup over the baseline core and a 50x area-normalized speedup, even when operating near the hardware resource limits. These results confirm the high potential of the proposed architecture as a scalable and efficient solution for accelerating Deep Learning inference at the edge.
💡 Research Summary
The paper addresses the growing mismatch between the computational demands of modern deep‑learning models and the limited power, area, and memory budgets of edge devices. It proposes a tightly‑coupled integration of a digital in‑memory computing (DIMC) tile directly into the execution stage of a RISC‑V vector processor, thereby eliminating the costly data movement between separate compute and memory units.
The DIMC tile, originally described in a prior ISSCC paper, consists of a 32 KiB SRAM array organized as 32 rows of 1024 bits, plus a 1024‑bit input buffer. Each row can be configured as a compute unit, performing 256 parallel 4‑bit multiply‑accumulate (MAC) operations per cycle or, via precision reconfiguration, 512 × 2‑bit or 1024 × 1‑bit MACs. The tile's internal sub‑arrays share read wordlines and bitlines, and an adder tree accumulates partial results into a 24‑bit sum that can optionally pass through a ReLU before being written back.
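The three precision modes follow directly from dividing each 1024-bit compute row into equal operand slots. A minimal Python sketch of that relationship (the even-split assumption is ours; the tile's actual datapath layout is not detailed in this summary):

```python
# Precision reconfiguration of one 1024-bit DIMC compute row: narrower
# operands yield proportionally more parallel MAC slots per cycle.
ROW_BITS = 1024

def macs_per_row(precision_bits: int) -> int:
    """Parallel MACs one compute row performs per cycle at a given precision."""
    return ROW_BITS // precision_bits

for p in (4, 2, 1):
    print(f"{p}-bit operands -> {macs_per_row(p)} MACs/row/cycle")
# 4-bit -> 256, 2-bit -> 512, 1-bit -> 1024, matching the figures above
```

The same division explains why the modes trade precision for parallelism at constant row width.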
To expose this capability to software, the authors extend the RISC‑V vector ISA (profile Zve32x, VLEN = 64, ELEN = 32) with four custom vector instructions: two load instructions that move weight rows and feature vectors into the DIMC, a compute‑start instruction that triggers the in‑memory MAC operation, and a write‑back instruction that stores the result into a vector register or memory. The encoding follows the standard vector instruction format, using a dedicated opcode and custom fields while preserving mask and nvec semantics for fine‑grained control.
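To make the four-instruction flow concrete, here is a hedged functional model in Python. The mnemonics in the comments (dimc.lw, dimc.lf, dimc.start, dimc.wb) and the tile dimensions passed to the constructor are illustrative assumptions; the summary does not fix the assembly names:

```python
# Functional (not cycle-accurate) model of the four custom instructions:
# two loads, one compute-start, one write-back.
class DimcTile:
    def __init__(self, rows=32, row_elems=256):
        self.weights = [[0] * row_elems for _ in range(rows)]
        self.feature = [0] * row_elems
        self.acc = 0

    def load_weights(self, row, values):   # "dimc.lw": load one weight row
        self.weights[row] = list(values)

    def load_feature(self, values):        # "dimc.lf": load the feature vector
        self.feature = list(values)

    def start(self, row, relu=False):      # "dimc.start": in-memory MAC
        s = sum(w * x for w, x in zip(self.weights[row], self.feature))
        self.acc = max(s, 0) if relu else s  # optional fused ReLU on the sum

    def writeback(self):                   # "dimc.wb": result to VRF/memory
        return self.acc
```

Software would issue the two loads, trigger the compute, then write back, exactly the load/compute/write-back split the four instructions encode.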
Architecturally, the DIMC is added as an extra execution lane alongside the conventional vector functional units (VFUs). The vector register file (VRF) supplies 256‑bit data stripes directly to the DIMC, and the DIMC’s results are written back in the same pipeline stage, allowing the DIMC to operate in parallel with regular vector arithmetic. This tight coupling reduces latency to a few cycles, avoids extra bus transactions, and enables the vector core to act as a flexible data‑manipulation engine that reshapes, packs, or transposes tensors before they enter the DIMC.
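Since the VRF delivers 256-bit stripes while a DIMC row is 1024 bits wide, filling one row takes four stripe transfers. A small sketch of that packing, assuming stripe 0 lands in the least-significant bits (the actual transfer order is not specified in this summary):

```python
# Pack four 256-bit VRF stripes into one 1024-bit DIMC row.
ROW_BITS, STRIPE_BITS = 1024, 256

def pack_row(stripes):
    """Combine 1024/256 = 4 stripes into a single row-sized integer."""
    assert len(stripes) == ROW_BITS // STRIPE_BITS
    row = 0
    for i, s in enumerate(stripes):
        assert 0 <= s < (1 << STRIPE_BITS)
        row |= s << (i * STRIPE_BITS)  # stripe 0 occupies the low 256 bits
    return row
```

The vector core's shuffle and pack instructions can prepare each stripe (reshaping or transposing tensors) before the load instructions move it into the tile.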
The authors evaluate the design on an industrial‑grade RISC‑V vector core implemented in a 28 nm CMOS process. Using a single DIMC tile, they run the full ResNet‑50 inference workload. The measured peak throughput reaches 137 GOPS, corresponding to a 217× speedup over the baseline vector core without DIMC, and a 50× speedup when normalized by silicon area. Although explicit power numbers are not reported, the SRAM‑based DIMC is expected to consume orders of magnitude less energy per MAC than a conventional DRAM‑to‑CPU data path, because the dominant energy cost of off‑chip memory accesses is eliminated.
The paper positions its contribution relative to prior work. Earlier tightly‑coupled designs such as AI‑PiM and RDCIM embed IMC in scalar pipelines, limiting parallelism, while VECIM integrates CIM as a register‑file but does not expose a dedicated compute lane. Loosely‑coupled approaches like VPU‑CIM and CIMR‑V treat the IMC as an off‑core accelerator, incurring significant communication overhead. By contrast, the proposed architecture merges DIMC directly into the vector pipeline, leverages the vector ISA’s data‑level parallelism, and retains full programmability through standard toolchains.
Scalability is discussed: the current prototype uses a single tile, but the same pipeline could host multiple DIMC tiles, linearly increasing bandwidth and compute capacity. The custom ISA extensions are defined in a way that they could be standardized as optional RISC‑V extensions, facilitating adoption by other vector cores or heterogeneous SoCs.
In conclusion, the work demonstrates that a tightly‑coupled DIMC unit, controlled by a small set of custom vector instructions, can dramatically improve the performance‑per‑area and performance‑per‑energy of edge AI inference. It provides a concrete path toward programmable, high‑throughput deep‑learning accelerators that combine the flexibility of a general‑purpose vector processor with the efficiency of in‑memory computation.