Enhancing Vehicle Detection under Adverse Weather Conditions with Contrastive Learning
Aside from common remote-sensing challenges such as small, sparse targets and tight computation budgets, detecting vehicles in UAV images over the Nordic regions faces severe visibility degradation and domain shifts caused by varying levels of snow coverage. Although annotated data are expensive, unannotated data are cheap to obtain simply by flying the drones. In this work, we propose a sideload-CL-adaptation framework that exploits such unannotated data to improve vehicle detection with lightweight models. Specifically, we train a CNN-based representation extractor through contrastive learning on the unannotated data in a pretraining stage, and then sideload it onto a frozen YOLO11n backbone in the fine-tuning stage. To find a robust sideload-CL-adaptation, we conduct extensive experiments comparing various fusion methods and granularities. Our proposed sideload-CL-adaptation model improves detection performance by 3.8% to 9.5% in terms of mAP50 on the NVD dataset.
💡 Research Summary
The paper tackles the problem of detecting vehicles in UAV imagery captured over snow‑covered Nordic terrain, where limited annotated data and severe domain shifts caused by varying snow depth, illumination, and cloud cover degrade the performance of lightweight detectors such as YOLO11n. To exploit the abundance of unlabelled video frames, the authors propose a two‑stage Sideload‑Contrastive‑Learning‑Adaptation (SCLA) framework.
In the pre‑training stage, a side CNN (architecturally identical to the YOLO11n backbone) is trained on unannotated NVD videos using a novel self‑supervised method called Feature‑Map‑Patch Contrastive Learning (FM‑PaCL). Instead of compressing an entire image into a single embedding, FM‑PaCL extracts overlapping patches directly from the intermediate feature map, flattens each patch, projects it through a small MLP head, and treats patches at the same spatial location across two photometrically‑augmented views as positive pairs while all other patches in the batch serve as negatives. An InfoNCE loss is applied at the patch level, encouraging invariance to colour jitter, blur, etc., while preserving fine‑grained spatial correspondence—crucial for small‑object detection. After pre‑training, the MLP head is discarded and the side CNN is frozen.
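The patch-level objective described above can be sketched in PyTorch. This is a minimal illustration, not the authors' implementation: the patch size, stride, temperature, and MLP dimensions are assumed values, and real feature maps would come from the side CNN rather than random tensors. Patches at the same index across the two augmented views form positive pairs; every other patch in the batch is a negative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchProjector(nn.Module):
    """Small MLP head applied to flattened feature-map patches (sizes are assumptions)."""
    def __init__(self, in_dim, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit-length embeddings for cosine similarity

def fm_pacl_loss(feat_a, feat_b, projector, patch=4, stride=2, tau=0.1):
    """Patch-level InfoNCE over two feature maps of photometrically augmented views.

    feat_a, feat_b: (B, C, H, W) intermediate feature maps of the same frames
    under two augmentations; positives are patches at the same spatial index.
    """
    def to_patches(f):
        # Extract overlapping patch x patch windows -> (B, C*patch*patch, L),
        # then flatten to one row per patch: (B*L, C*patch*patch).
        p = F.unfold(f, kernel_size=patch, stride=stride)
        return p.transpose(1, 2).reshape(-1, p.shape[1])

    za = projector(to_patches(feat_a))          # (N, D)
    zb = projector(to_patches(feat_b))          # (N, D)
    logits = za @ zb.t() / tau                  # pairwise similarities
    targets = torch.arange(za.shape[0])         # positive pair sits on the diagonal
    # Symmetrised InfoNCE (cross-entropy against the diagonal in both directions).
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

Because the loss is computed per patch rather than per image, spatial correspondence is preserved, which is the property the paper argues matters for small-object detection.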
During fine‑tuning, labelled NVD frames are fed simultaneously to the frozen side CNN and the frozen COCO‑pre‑trained YOLO11n backbone. Their respective feature maps (identical in channel and spatial dimensions) are merged through a dynamic fusion block. The authors evaluate several fusion strategies: simple addition, concatenation, and learnable gating mechanisms (including SE‑style gating). Static methods perform poorly because the two feature streams have mismatched distributions; the dynamic gating learns to weight each channel adaptively based on the input, effectively balancing domain‑specific cues from FM‑PaCL with generic visual knowledge from COCO. The fused representation is then passed to YOLO11n’s neck and head for detection.
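The dynamic gating idea can be sketched as an SE-style fusion block. This is a hedged illustration of the general mechanism, not the paper's exact module: the reduction ratio and the pooled-concatenation gating input are assumptions. The gate predicts per-channel weights from both streams and blends the frozen backbone features with the side-CNN features accordingly.

```python
import torch
import torch.nn as nn

class DynamicGateFusion(nn.Module):
    """SE-style channel gate blending two feature streams of identical shape.

    A sketch of input-dependent gating: global-average-pool both streams,
    predict a per-channel weight in [0, 1], and mix the two maps with it.
    """
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # per-channel mixing weight in [0, 1]
        )

    def forward(self, f_backbone, f_side):
        # (B, 2C) summary of both streams via global average pooling.
        s = torch.cat([f_backbone.mean(dim=(2, 3)), f_side.mean(dim=(2, 3))], dim=1)
        g = self.gate(s).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        # Convex, input-dependent blend of the two feature maps.
        return g * f_backbone + (1 - g) * f_side
```

Unlike static addition or concatenation, the weights here depend on the input, so the block can lean on COCO features for familiar scenes and on the contrastively learned features when snow-specific cues dominate.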
Experiments on the Nordic Vehicle Dataset (NVD) demonstrate substantial gains. Using the original train/val/test split, the SCLA model improves mAP@50 by 8.9 percentage points over the YOLO11n baseline. In a more realistic split where training, validation, and testing videos are completely disjoint, the improvement is 2.8 percentage points, confirming that the method generalises across unseen scenes. An ablation study shows that removing FM‑PaCL or using a naïve side‑CNN pre‑training leads to performance drops, highlighting the importance of the patch‑level contrastive objective and the careful fusion design.
The paper’s contributions are threefold: (1) introduction of FM‑PaCL, a patch‑level contrastive learning scheme tailored for small‑object UAV detection; (2) a stable side‑loading architecture that fuses domain‑agnostic COCO features with domain‑specific representations learned from unlabelled data; (3) empirical evidence that this combination yields up to ~9% absolute mAP@50 gains without any additional annotation cost.
Limitations include increased memory consumption due to high‑resolution feature‑map patches and added computational overhead from the dynamic gating module. Future work could explore memory‑efficient patch sampling, multi‑scale fusion, application to other lightweight detectors (e.g., NanoDet), and validation on different sensor modalities or geographic regions. Overall, SCLA offers a practical pathway to boost UAV‑based vehicle detection in harsh, snow‑laden environments while keeping the model lightweight enough for edge deployment.