EXTD: Extremely Tiny Face Detector via Iterative Filter Reuse
In this paper, we propose a new multi-scale face detector with an extremely tiny number of parameters (EXTD), less than 0.1 million, while achieving performance comparable to deep, heavy detectors. While existing multi-scale face detectors extract feature maps at different scales from a single backbone network, our method generates the feature maps by iteratively reusing a shared lightweight and shallow backbone network. This iterative sharing of the backbone network significantly reduces the number of parameters and also provides the abstract image semantics, captured in the higher stages of the network, to the lower-level feature maps. The proposed idea is applied to various model architectures and evaluated through extensive experiments. In experiments on the WIDER FACE dataset, we show that the proposed face detector can handle faces of various scales and conditions, and achieves performance comparable to far more massive face detectors that are tens to hundreds of times heavier in model size and floating-point operations.
💡 Research Summary
The paper introduces EXTD (Extremely Tiny Face Detector), a multi‑scale face detection framework that uses fewer than 0.1 million parameters (approximately 400 KB) while achieving performance comparable to much larger state‑of‑the‑art detectors on the WIDER FACE benchmark. The central contribution is an “Iterative Filter Reuse” strategy: a shallow, lightweight backbone network—built from inverted residual blocks inspired by MobileNet‑V2—is applied repeatedly to the same input feature map, halving the spatial resolution at each iteration. Starting from a stride‑2 convolution on a 640 × 640 image, the backbone generates six feature maps of sizes 160 × 160, 80 × 80, 40 × 40, 20 × 20, 10 × 10, and 5 × 5.
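The iterative-reuse idea can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's exact architecture: the channel width and the simplified depthwise/pointwise block are our assumptions, standing in for the paper's inverted residual blocks; only the control flow (one shared stride-2 block applied repeatedly to produce all six scales) reflects the described method.

```python
import torch
import torch.nn as nn

class IterativeBackbone(nn.Module):
    """Sketch of EXTD-style iterative filter reuse: a single shared,
    shallow block is applied repeatedly, halving the spatial resolution
    on every pass and emitting one feature map per detection scale.
    Channel width and block contents here are illustrative only."""

    def __init__(self, channels=32, num_scales=6):
        super().__init__()
        self.num_scales = num_scales
        # Stride-2 stem: 3 x 640 x 640 -> C x 320 x 320
        self.stem = nn.Conv2d(3, channels, 3, stride=2, padding=1)
        # Shared block (simplified stand-in for an inverted residual
        # block): depthwise stride-2 conv + pointwise conv + ReLU.
        self.shared = nn.Sequential(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1,
                      groups=channels),        # depthwise, halves H and W
            nn.Conv2d(channels, channels, 1),  # pointwise
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        feats = []
        x = self.stem(x)
        for _ in range(self.num_scales):
            x = self.shared(x)  # the SAME weights are reused at every scale
            feats.append(x)
        return feats

if __name__ == "__main__":
    maps = IterativeBackbone()(torch.zeros(1, 3, 640, 640))
    print([f.shape[-1] for f in maps])  # [160, 80, 40, 20, 10, 5]
```

Because one set of weights serves all six scales, the parameter count is that of a single shallow block rather than a deep multi-stage backbone, which is where the bulk of the savings comes from.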
Two detector heads are explored. In the SSD‑like variant, each of the six feature maps directly feeds a 3 × 3 convolutional classification head and a 3 × 3 regression head. In the FPN‑like variant, each feature map is up‑sampled using bilinear interpolation followed by a depth‑wise and point‑wise convolution block, and the result is added to the next higher‑resolution map via a skip connection. This produces a set of enriched feature maps that combine high‑level semantics with fine‑grained spatial detail at only a small parameter overhead.
The classification head outputs two channels (face vs. background) for all scales except the highest‑resolution map, where four channels are produced (three background hypotheses and one face score) and a Maxout operation keeps only the most confident background response, mirroring the strategy used in S3FD to reduce false positives on tiny faces. The regression head predicts four offsets (center‑x, center‑y, width, height) per anchor, following the standard RPN formulation.
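A minimal sketch of the Maxout background step, assuming the S3FD-style 3-background + 1-face channel split described above (the function name is ours):

```python
import torch

def maxout_background(cls_logits):
    """Reduce a four-channel prediction (three background hypotheses +
    one face score) to the standard two-channel face/background output
    by taking the max over the background hypotheses.

    cls_logits: (N, 4, H, W) -> returns (N, 2, H, W)
    """
    bg = cls_logits[:, :3].max(dim=1, keepdim=True).values  # best background
    face = cls_logits[:, 3:4]                               # face score
    return torch.cat([bg, face], dim=1)

if __name__ == "__main__":
    logits = torch.randn(2, 4, 160, 160)
    print(maxout_background(logits).shape)  # torch.Size([2, 2, 160, 160])
```

Letting the network propose several background explanations and keeping the strongest makes it easier to reject background patches that superficially resemble tiny faces.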
Training employs a multitask loss consisting of cross‑entropy classification loss and smooth‑L1 regression loss, balanced by a factor λ = 4. To address the severe class imbalance inherent in face detection, the authors adopt online hard negative mining (maintaining a 3:1 negative‑to‑positive ratio) and the scale‑compensation anchor matching scheme from S3FD, which ensures sufficient positive samples for very small faces. Data augmentation follows the S3FD pipeline (color jitter, random cropping, horizontal and vertical flips). The entire network is implemented in PyTorch and trained on the NSML platform.
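The training objective described above can be sketched as follows. This is a simplified stand-in, not the paper's exact implementation: tensor layouts and variable names are ours, anchor matching is assumed to have already produced per-anchor labels and box targets, and the λ placement is one reasonable reading of "balanced by a factor λ = 4".

```python
import torch
import torch.nn.functional as F

def extd_style_loss(cls_logits, box_preds, labels, box_targets,
                    lam=4.0, neg_ratio=3):
    """Sketch of the multitask loss: cross-entropy classification with
    online hard negative mining at a 3:1 negative:positive ratio, plus
    smooth-L1 regression on positive anchors, balanced by lambda.

    cls_logits: (A, 2)   labels: (A,) long, 1 = face, 0 = background
    box_preds, box_targets: (A, 4) RPN-style offsets
    """
    pos = labels == 1
    num_pos = int(pos.sum())
    ce = F.cross_entropy(cls_logits, labels, reduction="none")
    # Online hard negative mining: keep only the highest-loss negatives,
    # capped at neg_ratio x the number of positives.
    neg_losses = ce[~pos]
    num_neg = min(neg_ratio * max(num_pos, 1), neg_losses.numel())
    hard_neg, _ = neg_losses.topk(num_neg)
    cls_loss = (ce[pos].sum() + hard_neg.sum()) / max(num_pos, 1)
    # Smooth-L1 box regression, computed on positive anchors only.
    reg_loss = F.smooth_l1_loss(box_preds[pos], box_targets[pos],
                                reduction="sum") / max(num_pos, 1)
    return cls_loss + lam * reg_loss
```

Mining only the hardest negatives keeps the otherwise overwhelming background anchors from dominating the gradient, which matters even more here given how many anchors the 160 × 160 map contributes.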
Experimental results on the WIDER FACE dataset show that both EXTD‑SSD and EXTD‑FPN achieve average precision (AP) scores on the Easy, Medium, and Hard subsets that are virtually indistinguishable from S3FD when the latter uses a MobileFaceNet backbone. However, EXTD’s model size is 20–100× smaller and its FLOPs are reduced to under 1 G, making real‑time inference feasible on CPUs and low‑power mobile GPUs. Notably, the FPN‑like version outperforms the SSD‑like version on the Hard subset (which contains many tiny faces), confirming that the iterative reuse of the backbone injects richer semantic context into low‑resolution maps.
In summary, EXTD demonstrates that extensive parameter sharing across scales—implemented via iterative application of a compact inverted‑residual backbone—can dramatically shrink model size while preserving, or even enhancing, detection accuracy for small faces. This design is highly suitable for embedded and mobile applications where memory and compute budgets are tight, and it opens avenues for extending iterative filter reuse to other object detection tasks or to combine with alternative lightweight backbones such as ShuffleNet or EfficientNet‑Lite.