Enhanced Detection of Tiny Objects in Aerial Images


While one-stage detectors like YOLOv8 offer fast training speed, they often under-perform on detecting small objects as a trade-off. This becomes even more critical when detecting tiny objects in aerial imagery due to low-resolution targets and cluttered backgrounds. To address this, we introduce four enhancement strategies that can be easily implemented on YOLOv8: input image resolution adjustment, data augmentation, attention mechanisms, and an alternative gating function for attention modules. We demonstrate that enlarging the input image size and applying augmentation appropriately can improve detection performance. Additionally, we designed a Mixture of Orthogonal Neural-modules Network (MoonNet) pipeline which consists of multiple attention-module-augmented CNNs. Two well-known attention modules, the Squeeze-and-Excitation (SE) Block and the Convolutional Block Attention Module (CBAM), were integrated into the backbone of YOLOv8 to form the MoonNet design, and the MoonNet backbone obtained improved detection accuracy compared to the original YOLOv8 backbone and single-type attention-module-augmented backbones. MoonNet further proved its adaptability and potential by achieving state-of-the-art performance on a tiny-object benchmark when integrated with the YOLC model. Our code is available at: https://github.com/Kihyun11/MoonNet


💡 Research Summary

The paper addresses the persistent challenge of detecting tiny objects in aerial imagery, where objects often occupy only a few dozen pixels and are embedded in cluttered backgrounds. While one‑stage detectors such as YOLOv8 provide fast training and inference, they typically sacrifice small‑object performance due to coarse feature maps and the absence of a proposal stage. To overcome these limitations, the authors propose four complementary enhancement strategies that can be applied with minimal code changes to the standard YOLOv8 pipeline: (1) increasing the input image resolution, (2) applying targeted data augmentation, (3) integrating attention mechanisms, and (4) replacing the conventional sigmoid gating function with an identity‑safe alternative (1 + tanh).

Resolution Scaling
Three input sizes (640×640, 800×800, and 928×928) are evaluated on a modified version of the DOTAv2.0 dataset that retains only five classes with consistently small spatial footprints. The highest resolution (928×928) yields the best results, improving AP₅₀ from 0.621 to 0.696 (+7.5 %p, a 12 % relative gain) and recall from 0.539 to 0.613 (+7.4 %p) over the 640×640 baseline. The authors argue that larger inputs preserve fine-grained details that are otherwise lost in down-sampling layers.
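Why resolution matters so much can be seen with a quick back-of-the-envelope calculation (an illustrative sketch, not from the paper; the object and image sizes below are assumptions): a tiny object shrinks at every down-sampling stage, so its footprint on a stride-8 feature map can fall below a single cell at 640×640 while still surviving at 928×928.

```python
def cells_on_feature_map(obj_px: float, src_px: float,
                         input_px: int, stride: int = 8) -> float:
    """Approximate width, in feature-map cells, of an object that is
    `obj_px` wide in a `src_px`-wide source image, after the image is
    resized to `input_px` and passed through a stride-`stride` stage."""
    resized = obj_px * input_px / src_px  # width after a plain resize
    return resized / stride               # width on the stride-8 map

# A hypothetical 20 px vehicle in a 4000 px-wide aerial image:
print(cells_on_feature_map(20, 4000, 640))  # 0.4 cells: sub-cell, easily lost
print(cells_on_feature_map(20, 4000, 928))  # 0.58 cells: ~45 % more footprint
```

The same arithmetic explains why the gains saturate: resolution only helps until the object spans enough cells for the detection head to localize it.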

Data Augmentation
Three augmentation packages are tested: the default Ultralytics set, a geometry‑focused set (rotations, scaling, flips), and a color‑space set (HSV adjustments). The geometry‑focused package, combined with the default augmentations, provides the most balanced improvement, confirming that mild, class‑aware augmentations help mitigate the severe class imbalance typical of tiny‑object datasets.
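Geometry-focused augmentation changes the labels as well as the pixels. The sketch below (hypothetical helper names, not the paper's code) shows two of the label transforms such a package must apply to normalized (cx, cy, w, h) boxes so that annotations stay consistent with the augmented image.

```python
def hflip_box(box):
    """Mirror a normalized (cx, cy, w, h) box for a horizontal flip."""
    cx, cy, w, h = box
    return (1.0 - cx, cy, w, h)

def scale_box(box, factor):
    """Grow or shrink a box about its own center, clamped to [0, 1]."""
    cx, cy, w, h = box
    return (cx, cy, min(w * factor, 1.0), min(h * factor, 1.0))

# A tiny object near the left edge moves to the right edge when flipped.
print(hflip_box((0.1, 0.5, 0.02, 0.02)))  # (0.9, 0.5, 0.02, 0.02)
print(scale_box((0.5, 0.5, 0.02, 0.02), 1.5))
```

Color-space (HSV) augmentation, by contrast, leaves the boxes untouched, which is one reason the two packages behave so differently on box-level metrics.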

Hybrid Attention Backbone (MoonNet)
The core contribution is the design of a mixed-attention backbone named MoonNet, which interleaves Squeeze-and-Excitation (SE) blocks and Convolutional Block Attention Modules (CBAM) within the YOLOv8 CSP backbone. Six backbone variants are explored, ranging from single-type (SE-only or CBAM-only) to mixed configurations. The mixed SE + CBAM design consistently outperforms the single-type counterparts, achieving AP₅₀ = 0.503 and recall = 0.313 on the modified DOTAv2.0 dataset. When the same mixed backbone is transferred to the multi-scale YOLC framework (HRNet-based), it further improves state-of-the-art results on the VisDrone benchmark, raising AP₅₀ from 0.530 (HRNet alone) to 0.550 (+2.0 %p, a 3.8 % relative gain) and AP₇₅ from 0.311 to 0.336 (+2.5 %p).
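The two modules compute their channel gates in slightly different ways, which is why mixing them can add complementary cues. A minimal dependency-free sketch (plain Python, not the paper's implementation; CBAM's spatial branch and learned 7×7 convolution are omitted):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlp(z, w1, w2):
    """Two-layer bottleneck MLP with ReLU, shared by both modules.
    w1 reduces C -> C/r, w2 expands C/r -> C (weights as nested lists)."""
    hidden = [max(sum(w * v for w, v in zip(row, z)), 0.0) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def se_gate(avg_pool, w1, w2):
    """SE block: channel gates from global-average-pooled statistics."""
    return [sigmoid(a) for a in mlp(avg_pool, w1, w2)]

def cbam_channel_gate(avg_pool, max_pool, w1, w2):
    """CBAM channel attention: the shared MLP sees both average- and
    max-pooled statistics; its two outputs are summed before gating."""
    return [sigmoid(a + m)
            for a, m in zip(mlp(avg_pool, w1, w2), mlp(max_pool, w1, w2))]

# Toy 2-channel example with identity weights: every gate lies in (0, 1)
# and multiplies its channel's feature map element-wise.
w1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [[1.0, 0.0], [0.0, 1.0]]
print(se_gate([0.0, 2.0], w1, w2))
print(cbam_channel_gate([0.0, 2.0], [1.0, 3.0], w1, w2))
```

In MoonNet the two module types are placed at different backbone stages rather than stacked at one point; the sketch only shows how each gate is computed.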

Alternative Gating Function
Both SE and CBAM traditionally use a sigmoid gating function to modulate feature maps. The authors replace sigmoid with an identity-safe function, 1 + tanh(·), which preserves the original feature magnitude while amplifying subtle signals. In the YOLOv8 setting this change yields negligible gains, but within the YOLC framework it produces a noticeable boost (AP₅₀ +3.77 %p, AP +6.13 %p) over the sigmoid version, suggesting that the gating impact is architecture-dependent.
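The motivation for the swap is easy to see numerically (a sketch of the gating functions themselves, not of either model): sigmoid maps a zero pre-activation to 0.5 and so halves every feature it is undecided about, whereas 1 + tanh maps zero to exactly 1, passing undecided features through unchanged, and can amplify confident ones by up to 2×.

```python
import math

def sigmoid_gate(a):
    """Conventional SE/CBAM gate: output in (0, 1), so it only attenuates."""
    return 1.0 / (1.0 + math.exp(-a))

def tanh_gate(a):
    """The paper's alternative: output in (0, 2); identity at a = 0."""
    return 1.0 + math.tanh(a)

for a in (-2.0, 0.0, 2.0):
    print(a, sigmoid_gate(a), tanh_gate(a))
# At a = 0, sigmoid_gate halves the feature while tanh_gate leaves it intact.
```

This preserves faint tiny-object responses that a sigmoid gate would shrink at every attention layer they pass through.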

Training Protocol and Computational Cost
All experiments are conducted on a single RTX 4080 (YOLOv8) or A40 (YOLC) GPU. The final MoonNet-enhanced YOLOv8n-obb model is trained from scratch for 150 epochs with batch size 4, using the AdamW optimizer. Compared to the baseline, MoonNet adds only 0.1 GFLOPs and 0.9 ms of end-to-end latency (5.2 ms vs. 4.3 ms), confirming its suitability for real-time deployment.

Ablation Study
The authors quantify the contribution of each component: resolution scaling yields a +21.38 %p gain in AP₅₀ and +33.44 %p in AP; targeted augmentation adds +11.41 %p in AP₅₀ and +18 %p in AP; the MoonNet backbone contributes a modest but consistent +0.45 %p in AP₅₀ and +0.21 %p in AP, while delivering the highest absolute performance (AP₅₀ = 0.667, AP = 0.486).

Conclusions and Future Work
The study demonstrates that (i) higher input resolution is essential for tiny‑object detection, (ii) carefully designed augmentations alleviate class imbalance, (iii) hybrid attention modules exploit complementary channel and spatial cues, and (iv) alternative gating functions can further enhance performance when matched to the right architecture. MoonNet’s modest computational overhead makes it attractive for UAV and drone applications where real‑time processing is critical. Future directions include exploring additional lightweight attention blocks (e.g., ECA, CoordAtt), automated search for optimal resolution‑augmentation combinations, and extending the mixed‑attention concept to other domains such as satellite or medical imaging.

