EmBench: Quantifying Performance Variations of Deep Neural Networks across Modern Commodity Devices


In recent years, advances in deep learning have resulted in unprecedented leaps in diverse tasks spanning from speech and object recognition to context awareness and health monitoring. As a result, an increasing number of AI-enabled applications are being developed targeting ubiquitous and mobile devices. While deep neural networks (DNNs) are getting bigger and more complex, they also impose a heavy computational and energy burden on the host devices, which has led to the integration of various specialized processors in commodity devices. Given the broad range of competing DNN architectures and the heterogeneity of the target hardware, there is an emerging need to understand the compatibility between DNN-platform pairs and the expected performance benefits on each platform. This work attempts to demystify this landscape by systematically evaluating a collection of state-of-the-art DNNs on a wide variety of commodity devices. In this respect, we identify potential bottlenecks in each architecture and provide important guidelines that can assist the community in the co-design of more efficient DNNs and accelerators.


💡 Research Summary

The paper “EmBench: Quantifying Performance Variations of Deep Neural Networks across Modern Commodity Devices” presents a systematic benchmark of a broad set of state‑of‑the‑art image‑classification deep neural networks (DNNs) on five representative commodity platforms that span the spectrum from high‑end servers to low‑power edge devices. The authors first compile a list of 20 popular DNN architectures (including AlexNet, VGG‑11/13/16/19, ResNet‑18/34/50/101/152, Inception‑v3/v4, DenseNet‑121/169/201/161, MobileNetV2, ShuffleNet‑v2, MnasNet, PNASNet, NasNet, NasNet‑Mobile, and SqueezeNet variants). For each model they record FLOPs, parameter count, and top‑1/top‑5 ImageNet accuracy, illustrating the wide range of computational complexity and accuracy trade‑offs.

The hardware side comprises: (1) an NVIDIA RTX 2080 Ti GPU (desktop/server class, 1.5 GHz, 11 GB GDDR6, 250 W TDP, Tensor Cores), (2) an Intel Xeon 4116 CPU (12 cores, 2.1 GHz, 256 GB RAM, 85 W), (3) an NVIDIA Jetson Xavier SoC (512‑core Volta GPU with 64 Tensor Cores, 1.3 GHz, 16 GB, 30 W), (4) a Qualcomm Snapdragon 845 (Kryo 385) mobile CPU‑GPU combo (8 cores, 2.8/1.8 GHz, 6 GB, ~5 W), and (5) an Intel Neural Compute Stick 2 (Movidius Myriad X VPU, 1 W, 1 TOPS). By covering both GPU‑centric and CPU‑centric, high‑power and low‑power platforms, the study captures the heterogeneity of today’s edge ecosystem.

All models are first exported to ONNX and then run with the most appropriate runtime on each device: PyTorch with CUDA/cuDNN on the RTX 2080 Ti, PyTorch's CPU backend on the Xeon, PyTorch with CUDA on the Jetson Xavier, Facebook's AI Performance Evaluation Platform (FAI‑PEP) with the Caffe2 backend and NNPACK on the Snapdragon CPU, and OpenVINO on the NCS 2. Each configuration varies three factors: (i) DNN architecture, (ii) batch size (powers of two from 1 to 512), and (iii) target platform. For every run the authors perform 50 repetitions (10 warm‑up, 40 measured) with a 2‑second cool‑down between runs to avoid thermal throttling, and report the minimum latency as the performance metric.
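The measurement protocol above (warm‑up runs, repeated timed runs, cool‑down, minimum latency) can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's released code: `benchmark` and the placeholder workload are hypothetical names, and a real run would pass a model's forward pass as `run_inference`.

```python
import time

def benchmark(run_inference, warmup=10, measured=40, cooldown=2.0):
    """Minimum latency (seconds) over `measured` timed runs, taken
    after `warmup` untimed runs, mirroring the paper's protocol."""
    for _ in range(warmup):              # warm-up: caches, lazy init, JIT
        run_inference()
    latencies = []
    for _ in range(measured):            # measured repetitions
        t0 = time.perf_counter()
        run_inference()
        latencies.append(time.perf_counter() - t0)
    time.sleep(cooldown)                 # cool-down against thermal throttling
    return min(latencies)                # minimum latency is the reported metric

# placeholder workload standing in for a model forward pass
lat = benchmark(lambda: sum(i * i for i in range(10_000)), cooldown=0.0)
print(f"min latency: {lat * 1e3:.3f} ms")
```

Reporting the minimum (rather than the mean) filters out interference from the OS scheduler and background processes, which only ever add latency.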

Key findings:

  1. FLOPs are not a reliable predictor of latency. On the RTX 2080 Ti, networks with similar FLOP counts can differ in inference time by up to an order of magnitude, due to differences in layer composition, memory access patterns, and how well the hardware’s Tensor Cores are utilized.

  2. Batch size has a non‑linear impact. The high‑end GPU achieves peak throughput at batch sizes of 128–256, after which additional batching yields diminishing returns because memory bandwidth becomes saturated. The Xeon CPU plateaus much earlier (≈32) because its core count and cache hierarchy limit parallelism. Mobile CPUs (Kryo 385) show modest gains up to batch‑size 32, constrained by both compute and memory bandwidth. The NCS 2 exhibits almost no batch‑size benefit because its on‑chip memory is tiny and data transfer dominates.

  3. Operation‑level bottlenecks vary per platform. Convolution (both standard and depth‑wise) dominates on GPUs and CPUs, but on the VPU the memory‑transfer and tensor‑reordering stages consume a large fraction of time. On Jetson Xavier, FP16 support and Tensor Core usage reduce GEMM latency, yet the lack of full 8‑bit support still leaves certain layers (e.g., batch‑norm) as bottlenecks.

  4. Accuracy vs. throughput trade‑offs are hardware‑dependent. Models such as MobileNetV2 achieve comparable accuracy to VGG‑16 while requiring far fewer FLOPs, yet on some platforms (e.g., the Xeon) the absolute throughput advantage is modest because the CPU is not FLOP‑bound. Conversely, on the RTX 2080 Ti the same FLOP reduction translates into a large speed‑up, highlighting the importance of matching model characteristics to device capabilities.
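A back‑of‑the‑envelope calculation illustrates why FLOP counts diverge from measured latency: a depth‑wise convolution has many times fewer multiply‑accumulates (MACs) than a standard one, yet (as findings 1 and 3 note) hardware rarely translates that into a proportional speed‑up. The helper below is a hypothetical sketch, assuming stride 1 and 'same' padding.

```python
def conv_macs(h, w, cin, cout, k, groups=1):
    """Multiply-accumulate count of a 2-D convolution over an
    h x w feature map (stride 1, 'same' padding assumed)."""
    return h * w * (cin // groups) * k * k * cout

# 3x3 conv on a 56x56 map with 64 input/output channels:
std = conv_macs(56, 56, 64, 64, 3)             # standard convolution
dw  = conv_macs(56, 56, 64, 64, 3, groups=64)  # depth-wise: groups == channels
print(std // dw)  # -> 64: 64x fewer MACs, yet rarely a 64x latency win
```

The gap arises because depth‑wise layers are memory‑bound and map poorly onto units such as Tensor Cores that are tuned for dense GEMMs.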

From these observations the authors propose three practical guidelines for co‑design: (a) select or design architectures that align with the target device’s specialized units (e.g., Tensor Cores, VPU kernels); (b) tune batch size to the memory and compute envelope of the device rather than assuming “larger is better”; and (c) prioritize reducing memory traffic and favoring operations that map efficiently to the hardware (e.g., replace depth‑wise convolutions with grouped convolutions when the device lacks optimized depth‑wise kernels).
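Guideline (b) amounts to sweeping batch sizes against measured latency rather than assuming "larger is better". A minimal sketch follows, using a toy latency model (entirely hypothetical) whose quadratic term stands in for memory‑bandwidth saturation; in practice `latency_fn` would wrap a real measurement such as the benchmarking loop above.

```python
def best_batch_size(latency_fn, sizes=(1, 2, 4, 8, 16, 32, 64, 128, 256, 512)):
    """Return the (batch_size, throughput) pair maximizing samples/sec
    over the paper's batch-size sweep of powers of two from 1 to 512."""
    return max(((b, b / latency_fn(b)) for b in sizes), key=lambda p: p[1])

# toy latency model: fixed launch overhead + linear compute term
# + quadratic memory-saturation term (all coefficients made up)
toy_latency = lambda b: 0.005 + 0.001 * b + 1e-5 * b * b

b, thr = best_batch_size(toy_latency)
print(f"best batch size: {b}, throughput: {thr:.1f} samples/s")
```

With this toy model the optimum lands well below the largest batch size, matching the paper's observation that throughput plateaus (and can regress) once memory bandwidth saturates.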

The paper concludes that a comprehensive benchmark like EmBench is essential for both DNN researchers and hardware architects. It provides a publicly available dataset of latency, throughput, and operation‑level breakdowns across a diverse hardware portfolio, enabling more informed decisions in model selection, hardware procurement, and future accelerator design. The authors suggest extending the study to include power/energy measurements, real‑time streaming workloads, and automated model‑hardware matching tools as future work.

