Towards lightweight convolutional neural networks for object detection
We propose a model with a larger spatial size of feature maps and evaluate it on the object detection task. To choose the best feature-extraction network for our model, we compare several popular lightweight networks. We then conduct a set of experiments with channel-reduction algorithms in order to accelerate execution. Our vehicle detection models are accurate and fast, and therefore well suited for embedded visual applications. At only 1.5 GFLOPs, our best model achieves 93.39 AP on the validation subset of the challenging DETRAC dataset. The smallest of our models is the first to achieve real-time inference speed on CPU, with a reasonable accuracy drop to 91.43 AP.
💡 Research Summary
The paper addresses the challenge of building a fast, accurate object detector that can run in real‑time on embedded CPUs, focusing on vehicle detection in the DETRAC dataset. The authors start from the SSD detection framework, which is known for a good speed‑accuracy trade‑off, and explore how to design a lightweight backbone that preserves high spatial resolution while drastically reducing the number of convolutional channels.
Five candidate backbones are evaluated: SqueezeNet 1.0, SqueezeNet 1.0 with batch‑normalization, MobileNet, PV‑ANet, and a shallow ResNet‑10. Their ImageNet top‑1/top‑5 accuracies and FLOPs are reported (Table 1). Contrary to the common belief that the backbone with the highest classification accuracy will yield the best detection performance, the experiments show that ResNet‑10, despite modest ImageNet scores, provides the highest AP on DETRAC when integrated into SSD.
The core architectural modification consists of removing the last two spatial down‑sampling layers (which normally reduce the feature map by 16× and 32×) and replacing them with dilated convolutions (dilation rates 2 and 4). This keeps the feature maps relatively large (e.g., 40 × 30 instead of 10 × 8) and therefore retains fine‑grained localization cues that are crucial for detecting small vehicles. To keep the computational budget low, the authors systematically prune channels in the convolutional layers. Three pruning strategies are investigated:
- One‑shot random sampling – randomly keep a subset of channels and fine‑tune the resulting sub‑network.
- One‑shot L1 pruning – compute the L1 norm of each filter, discard a fixed percentage (5 % for early layers, 10 % for later layers), then fine‑tune.
- Iterative pruning – repeat the L1 pruning and fine‑tuning cycle until a target FLOP count is reached.
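The L1-based strategies above can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' Caffe implementation: it ranks output filters of a convolutional weight tensor by L1 norm, drops the weakest fraction, and (for the iterative variant) repeats until a target channel count is reached. The `finetune` callback is a hypothetical stand-in for the fine-tuning step between pruning rounds.

```python
import numpy as np

def l1_prune(weights, drop_frac):
    """weights: (C_out, C_in, k, k). Rank output filters by L1 norm and
    return the sorted indices of the filters to keep after dropping the
    drop_frac fraction with the smallest norms."""
    norms = np.abs(weights).sum(axis=(1, 2, 3))
    n_keep = max(1, int(round(weights.shape[0] * (1.0 - drop_frac))))
    keep = np.argsort(norms)[-n_keep:]          # largest-norm filters survive
    return np.sort(keep)

def iterative_prune(weights, drop_frac, target_filters, finetune=None):
    """Repeat the prune + fine-tune cycle until the layer is reduced to
    target_filters output channels (the iterative strategy)."""
    while weights.shape[0] > target_filters:
        keep = l1_prune(weights, drop_frac)
        weights = weights[keep]
        if finetune is not None:
            weights = finetune(weights)          # stand-in for SGD fine-tuning
    return weights
```

One-shot pruning corresponds to a single `l1_prune` call followed by fine-tuning; the paper's schedule of 5 % for early layers and 10 % for later layers maps onto per-layer `drop_frac` values.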
Iterative pruning yields the best balance, achieving 91.43 AP at only 0.47 GFLOPs (473 MFLOPs) with a model size of 0.24 M parameters. A slightly larger model (0.75 GFLOPs) reaches 92.49 AP, while the full‑size SSDR‑1.5 (1.5 GFLOPs, 1.1 M parameters) attains 93.39 AP on the DETRAC validation split.
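Budgets like 0.47 GFLOPs come from summing per-layer costs, which for a standard convolution follow a simple closed form. The sketch below uses the usual formula with illustrative (hypothetical) layer shapes, not the paper's exact architecture; note that conventions differ by a factor of two depending on whether a multiply-accumulate counts as one or two FLOPs.

```python
def conv_flops(h, w, c_in, c_out, k, stride=1):
    """Multiply-accumulate count of one k x k convolution layer on an
    h x w input (one MAC counted as one FLOP; some papers count two)."""
    return (h // stride) * (w // stride) * c_in * c_out * k * k

# Illustrative layer at the preserved 40 x 30 resolution: a 3x3 conv with
# 64 input and 64 output channels costs ~44 MFLOPs, which is why channel
# pruning dominates the budget when feature maps are kept large.
print(conv_flops(40, 30, 64, 64, 3))  # 44236800
```

Because the cost is quadratic in the channel count (through `c_in * c_out`), halving the channels of adjacent layers cuts a layer's FLOPs roughly fourfold, which is what makes aggressive channel pruning so effective at high resolution.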
The authors also compare their models against several state‑of‑the‑art detectors (Faster‑RCNN, YOLO‑2, clustered‑prior SSD, and Feature Pyramid Networks). Their SSDR variants consistently outperform these baselines in both accuracy and speed, especially on CPU. The smallest model runs at ~34 frames per second on an Intel Core i7‑6700K using MKL and Caffe, making it the first CPU‑real‑time vehicle detector with AP above 90 % on DETRAC.
To demonstrate generality, the same design principles are applied to the PASCAL VOC 2007 detection task. Using MobileNet as backbone, a modified SSD (named SSDM 7.5) with dilated convolutions achieves 73.08 AP, surpassing the vanilla SSD (70.04 AP) and other lightweight baselines. The authors note that on VOC the ranking of backbones aligns more closely with ImageNet classification performance, whereas on DETRAC the correlation is weaker, likely due to the single‑class nature of the vehicle dataset.
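The dilated-convolution substitution used in both setups can be sketched as follows. This is a naive numpy illustration of the idea (the paper used Caffe layers): with dilation rate d, a k x k kernel samples the input on a stretched grid with effective extent k + (k-1)(d-1), and with matching "same" padding the spatial resolution is preserved instead of being halved by a stride-2 layer.

```python
import numpy as np

def dilated_conv2d(x, w, dilation=1):
    """Naive 'same'-padded 2-D cross-correlation with dilation, stride 1.
    x: (H, W) input; w: (k, k) kernel. Output has the same H x W shape,
    which is how the model keeps 40 x 30 maps instead of downsampling."""
    k = w.shape[0]
    eff = k + (k - 1) * (dilation - 1)   # effective kernel extent
    pad = eff // 2                       # 'same' padding for odd extents
    xp = np.pad(x, pad)
    H, W = x.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            for a in range(k):
                for b in range(k):
                    out[i, j] += w[a, b] * xp[i + a * dilation, j + b * dilation]
    return out
```

With d = 2 a 3x3 kernel covers a 5x5 neighborhood, and with d = 4 a 9x9 one, so the receptive field grows as if the removed stride-2 layers were still present, while localization detail for small vehicles is retained.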
In the discussion, the paper emphasizes that preserving high‑resolution feature maps while aggressively pruning channels is a simple yet effective strategy that does not rely on hardware‑specific optimizations such as quantization or sparse matrix kernels. This makes the approach portable across different platforms. The authors suggest that further gains could be obtained by combining their method with quantization, neural architecture search, or specialized inference libraries.
In conclusion, the work presents a practical recipe for building lightweight, high‑resolution object detectors that achieve state‑of‑the‑art accuracy on challenging benchmarks while running in real‑time on commodity CPUs. The proposed models are well suited for embedded visual applications such as driver assistance, traffic monitoring, and smart‑city surveillance, where computational resources are limited but detection reliability remains critical.