On-Chip Communication Network for Efficient Training of Deep Convolutional Networks on Heterogeneous Manycore Systems
Convolutional Neural Networks (CNNs) have shown a great deal of success in diverse application domains including computer vision, speech recognition, and natural language processing. However, as the size of datasets and the depth of neural network architectures continue to grow, it is imperative to design high-performance and energy-efficient computing hardware for training CNNs. In this paper, we consider the problem of designing specialized CPU-GPU based heterogeneous manycore systems for energy-efficient training of CNNs. It has already been shown that the typical on-chip communication infrastructures employed in conventional CPU-GPU based heterogeneous manycore platforms are unable to handle both CPU and GPU communication requirements efficiently. To address this issue, we first analyze the on-chip traffic patterns that arise from the computational processes associated with training two deep CNN architectures, namely, LeNet and CDBNet, to perform image classification. By leveraging this knowledge, we design a hybrid Network-on-Chip (NoC) architecture, which consists of both wireline and wireless links, to improve the performance of CPU-GPU based heterogeneous manycore platforms running the above-mentioned CNN training workloads. The proposed NoC achieves a 1.8x reduction in network latency and improves network throughput by a factor of 2.2 for training CNNs, compared to a highly optimized wireline mesh NoC. For the considered CNN workloads, these network-level improvements translate into 25% savings in full-system energy-delay-product (EDP). This demonstrates that the proposed hybrid NoC for heterogeneous manycore architectures is capable of significantly accelerating training of CNNs while remaining energy-efficient.
💡 Research Summary
This paper tackles the growing challenge of efficiently training deep convolutional neural networks (CNNs) on heterogeneous many‑core platforms that combine general‑purpose CPUs with GPUs. While such systems have become the de facto standard for high‑performance deep‑learning workloads, their on‑chip communication fabrics—typically 2‑D mesh Network‑on‑Chip (NoC) designs—are ill‑suited to the asymmetric and bursty traffic patterns generated during CNN training. To expose the root of the problem, the authors first profile two representative CNN models, LeNet and CDBNet, across the full training pipeline (data preprocessing on the CPU, forward and backward propagation on the GPU, and frequent parameter synchronization between the two). Their analysis reveals that the backward pass produces large, continuous streams of gradient data that must travel from GPU to CPU, while control messages and small‑scale updates flow in the opposite direction. The conventional mesh NoC becomes a bottleneck because it cannot simultaneously accommodate high‑bandwidth, low‑latency streams and numerous small control packets without incurring severe contention and latency spikes.
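To see why the backward pass dominates GPU-to-CPU traffic, it helps to estimate the gradient volume produced per parameter-synchronization step. The sketch below uses layer shapes approximating the classic LeNet-5 architecture; these sizes are illustrative assumptions, not figures taken from the paper:

```python
# Rough per-sync-step gradient traffic estimate. Layer shapes approximate
# LeNet-5 (weights + biases); the exact sizes are illustrative assumptions.
LAYER_PARAMS = {
    "conv1": 6 * (5 * 5 * 1) + 6,    # 6 filters of 5x5 on 1 input channel
    "conv2": 16 * (5 * 5 * 6) + 16,  # 16 filters of 5x5 on 6 channels
    "fc1":   120 * 400 + 120,        # flattened 16*5*5 = 400 inputs
    "fc2":   84 * 120 + 84,
    "fc3":   10 * 84 + 10,           # 10-way classifier
}

BYTES_PER_GRADIENT = 4  # one fp32 value per parameter

def gradient_bytes_per_sync(layers=LAYER_PARAMS):
    """Bytes of gradient data streamed GPU -> CPU on each sync step."""
    return sum(layers.values()) * BYTES_PER_GRADIENT

n = sum(LAYER_PARAMS.values())
print(f"{n:,} parameters -> {gradient_bytes_per_sync():,} bytes per sync")
```

Even for a small network like LeNet, every synchronization step moves a few hundred kilobytes of gradients in one direction, while the reverse direction carries only small control packets — exactly the asymmetry the traffic analysis identifies.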
Guided by these insights, the authors propose a hybrid NoC architecture that integrates both wired (traditional mesh routers) and wireless (short‑range, high‑bandwidth radio) links. The wireless subsystem is strategically placed to connect the CPU and GPU clusters, providing a dedicated high‑throughput, low‑latency channel for the bulk data transfers identified in the traffic analysis. To avoid interference, the wireless medium employs a time‑division multiple access (TDMA) schedule, ensuring deterministic access and eliminating packet collisions. The wired mesh remains responsible for routing smaller control messages and for providing connectivity to other cores and memory controllers.
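The collision-free medium access described above can be sketched as a fixed-slot TDMA arbiter. This is a minimal model, not the paper's implementation; the slot granularity, frame length, and node names (`cpu_hub`, `gpu_hub`) are hypothetical:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class TdmaWirelessChannel:
    """Single shared wireless channel arbitrated by a fixed TDMA schedule.

    Each wireless interface owns exactly one slot per frame, so access is
    deterministic and packet collisions cannot occur. The schedule contents
    and slot granularity here are illustrative assumptions.
    """
    schedule: list            # node IDs, one per slot in the frame
    slot: int = 0             # index of the current slot
    queues: dict = field(default_factory=dict)

    def enqueue(self, node, packet):
        """Queue a packet at a node's wireless interface."""
        self.queues.setdefault(node, deque()).append(packet)

    def tick(self):
        """Advance one slot; only the slot owner may transmit."""
        owner = self.schedule[self.slot]
        self.slot = (self.slot + 1) % len(self.schedule)
        q = self.queues.get(owner)
        return q.popleft() if q else None

chan = TdmaWirelessChannel(schedule=["cpu_hub", "gpu_hub"])
chan.enqueue("gpu_hub", "gradient_burst_0")
first = chan.tick()   # slot 0: cpu_hub owns it, nothing queued -> None
second = chan.tick()  # slot 1: gpu_hub transmits its queued packet
```

Because ownership of the medium rotates on a fixed schedule, no handshaking or collision detection is needed, which keeps the wireless transceivers simple and their latency predictable.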
A dynamic routing policy selects the optimal path on a per‑packet basis: large gradient or feature‑map packets are automatically steered onto the wireless link, while lightweight synchronization or control packets continue to use the mesh. This selective routing dramatically reduces contention on the mesh, balances load across the network, and leverages the energy‑efficiency of short‑range wireless transmission, which consumes roughly 30 % less power per bit than the wired interconnects at comparable data rates.
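The per-packet path selection can be sketched as a size-threshold policy: bulk cross-cluster transfers go wireless, everything else stays on the mesh. The threshold value and packet attributes below are hypothetical; the paper's actual policy may weigh additional criteria such as link occupancy:

```python
# Hypothetical cutoff separating "bulk" transfers (gradients, feature maps)
# from small control/synchronization packets.
WIRELESS_THRESHOLD_BYTES = 1024

def select_path(packet_size, src_cluster, dst_cluster):
    """Steer large inter-cluster transfers over the wireless link; keep
    small control packets and intra-cluster traffic on the wired mesh."""
    crosses_clusters = src_cluster != dst_cluster
    if crosses_clusters and packet_size >= WIRELESS_THRESHOLD_BYTES:
        return "wireless"
    return "mesh"

assert select_path(64, "gpu", "cpu") == "mesh"        # small sync packet
assert select_path(4096, "gpu", "cpu") == "wireless"  # bulk gradient stream
assert select_path(4096, "gpu", "gpu") == "mesh"      # intra-cluster traffic
```

Keeping the decision purely local to each packet means the router needs no global state, while still diverting exactly the traffic class that the profiling step showed to be responsible for mesh congestion.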
Extensive cycle‑accurate simulations compare the hybrid NoC against a highly optimized pure‑mesh baseline. The hybrid design achieves a 44 % reduction in average packet latency (equivalent to a 1.8× speed‑up) and a 120 % increase in aggregate throughput (2.2× improvement). When these network‑level gains are propagated to the full system level, the authors observe a 25 % reduction in the energy‑delay product (EDP) for the CNN training workloads, translating directly into faster training times and lower power consumption for data‑center‑scale deployments.
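The reported figures are mutually consistent, which the arithmetic below verifies: a 1.8x speed-up corresponds to a 1 - 1/1.8 ≈ 44% latency reduction, and a 2.2x throughput improvement is a 120% increase. The energy and delay numbers used to illustrate the EDP metric are made up for the example; only the 25% saving comes from the paper:

```python
def edp(energy_j, delay_s):
    """Energy-delay product: a joint energy/performance figure of merit."""
    return energy_j * delay_s

# Consistency of the reported network-level figures.
latency_reduction = 1 - 1 / 1.8  # 1.8x speed-up -> ~44% lower latency
throughput_gain = 2.2 - 1        # 2.2x throughput -> +120%

# Hypothetical full-system numbers illustrating a 25% EDP saving:
baseline = edp(energy_j=100.0, delay_s=10.0)
hybrid = edp(energy_j=90.0, delay_s=10.0 / 1.2)
saving = 1 - hybrid / baseline

print(f"latency reduced by {latency_reduction:.0%}")  # ~44%
print(f"throughput up {throughput_gain:.0%}")         # 120%
print(f"EDP saving: {saving:.0%}")                    # 25%
```

Because EDP multiplies energy by delay, even a modest energy reduction compounds with the shorter training time, which is why the network-level gains produce a sizable full-system improvement.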
Beyond the immediate performance benefits, the work demonstrates a scalable pathway for future heterogeneous many‑core systems. By decoupling high‑volume data streams from the conventional mesh and handling them with a dedicated wireless fabric, designers can alleviate the inherent scalability limits of mesh NoCs without incurring prohibitive area or power overheads. The authors suggest that the same hybrid approach could be extended to other emerging workloads—such as memory‑centric accelerators, AI‑edge devices, and sensor‑fusion platforms—where asymmetric communication patterns are prevalent.
Future research directions outlined include exploring multi‑channel wireless extensions to increase aggregate bandwidth, adaptive power‑gating of wireless transceivers based on traffic intensity, and silicon‑level prototyping to validate the simulated gains in a real hardware environment. Overall, the paper provides a compelling case that a thoughtfully integrated wired‑wireless NoC can become a cornerstone of energy‑efficient, high‑performance training for deep CNNs on heterogeneous many‑core architectures.