J3DAI: A tiny DNN-Based Edge AI Accelerator for 3D-Stacked CMOS Image Sensor
This paper presents J3DAI, a tiny deep neural network-based hardware accelerator for a 3-layer 3D-stacked CMOS image sensor featuring an artificial intelligence (AI) chip integrating a Deep Neural Network (DNN)-based accelerator. The DNN accelerator is designed to efficiently perform neural network tasks such as image classification and segmentation. This paper focuses on the digital system of J3DAI, highlighting its Performance-Power-Area (PPA) characteristics and showcasing advanced edge AI capabilities on a CMOS image sensor. To support the hardware, we utilized the Aidge comprehensive software framework, which enables the programming of both the host processor and the DNN accelerator. Aidge supports post-training quantization, significantly reducing memory footprint and computational complexity, which is crucial for deploying models on resource-constrained hardware like J3DAI. Our experimental results demonstrate the versatility and efficiency of this innovative design in the field of edge AI, showcasing its potential to handle both simple and computationally intensive tasks.
💡 Research Summary
The paper introduces J3DAI, a compact deep‑neural‑network (DNN) accelerator that is tightly integrated with a three‑layer 3‑D‑stacked CMOS image sensor. By embedding the AI accelerator directly beneath the sensor, the architecture eliminates the costly off‑chip data transfer that typically dominates power and bandwidth consumption in conventional imaging pipelines. The digital subsystem consists of fixed‑point SIMD compute cores and a dedicated DMA engine. The SIMD cores are pipelined to handle 2‑D convolutions, pooling, and fully‑connected layers in parallel, while the DMA engine streams raw pixel data from the sensor straight into on‑chip buffers, bypassing external memory accesses.
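The DMA/SIMD overlap described above can be illustrated with a toy double-buffering sketch. This is not the actual J3DAI microarchitecture (the paper does not give register-level detail); it is a minimal simulation, assuming a tile-based pipeline in which the DMA engine fetches tile *i+1* into one buffer while the compute cores process tile *i* from the other:

```python
import numpy as np

def stream_tiles(frame, tile_rows, process):
    """Simulate DMA/SIMD overlap via double buffering: while the compute
    cores work on the current tile, the next tile is already being copied
    into the second on-chip buffer, so compute never waits on memory."""
    buffers = [None, None]                       # two on-chip tile buffers
    n_tiles = frame.shape[0] // tile_rows
    results = []
    buffers[0] = frame[0:tile_rows]              # DMA pre-loads the first tile
    for i in range(n_tiles):
        if i + 1 < n_tiles:                      # DMA fetches tile i+1 ...
            nxt = frame[(i + 1) * tile_rows:(i + 2) * tile_rows]
            buffers[(i + 1) % 2] = nxt           # ... into the idle buffer
        results.append(process(buffers[i % 2]))  # SIMD cores process tile i
    return np.concatenate(results)

frame = np.arange(64, dtype=np.int16).reshape(8, 8)
out = stream_tiles(frame, tile_rows=2, process=lambda t: t * 2)  # toy compute
assert np.array_equal(out, frame * 2)
```

In hardware the two steps run truly in parallel; the key invariant shown here is that the buffer being written by the DMA is never the one being read by the compute cores.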
To program the hardware, the authors employ the Aidge software framework, which supports post‑training quantization (primarily 8‑bit) together with layer‑wise scaling and calibration. Quantization reduces model parameters and activation sizes by more than fourfold, shrinking memory footprints and allowing the compute units to operate with narrower datapaths, thereby cutting dynamic power. Accuracy loss is mitigated through sensitivity analysis and automatic fine‑tuning of quantization parameters, ensuring that the quantized models retain most of their original performance.
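A minimal sketch of the kind of 8-bit post-training quantization described above, assuming a symmetric per-tensor scheme (the paper does not specify Aidge's exact scheme, so the helper names and the scheme itself are illustrative). It shows where the fourfold size reduction comes from (float32 to int8) and why the accuracy loss is bounded by the quantization step:

```python
import numpy as np

def quantize_sym8(x):
    """Symmetric per-tensor 8-bit quantization (illustrative PTQ scheme)."""
    m = float(np.abs(x).max())
    scale = m / 127.0 if m > 0 else 1.0          # one step of the int8 grid
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(64).astype(np.float32)   # toy weight tensor
qw, s = quantize_sym8(w)

assert qw.nbytes * 4 == w.nbytes                 # 4x smaller (int8 vs float32)
err = np.abs(dequantize(qw, s) - w).max()
assert err <= s / 2 + 1e-6                       # error within half a step
```

Per-channel scales and calibration on representative inputs (as the layer-wise scaling mentioned above suggests) tighten the error bound further, since each channel gets its own `scale`.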
Experimental validation uses two representative benchmarks: CIFAR‑10 image classification and Cityscapes semantic segmentation. After quantization, the models are mapped onto J3DAI and achieve >92 % classification accuracy on CIFAR‑10 and ~68 % mean Intersection‑over‑Union on Cityscapes. Inference latency stays below 5 ms and power consumption is limited to approximately 45 mW, representing roughly a two‑fold throughput improvement and a 30 % reduction in power compared with typical MCU‑based edge AI solutions.
The hardware design is modular, allowing straightforward scaling to four‑ or five‑layer stacks and compatibility with higher‑resolution sensors. The Aidge toolchain provides a unified programming model for both the host processor and the accelerator, enabling developers to port models trained in mainstream frameworks (e.g., PyTorch, TensorFlow) without extensive redesign. This co‑design approach dramatically lowers development effort and accelerates time‑to‑market.
In summary, J3DAI demonstrates that integrating a DNN accelerator within a 3‑D‑stacked image sensor, combined with aggressive post‑training quantization and an efficient SIMD‑DMA architecture, can simultaneously satisfy the stringent performance, power, and area constraints of edge AI. The solution is poised for deployment in a range of real‑time vision applications such as smart cameras, AR/VR headsets, and autonomous robotics, where on‑device inference on high‑resolution video streams is essential.