Alignment-Aware and Reliability-Gated Multimodal Fusion for Unmanned Aerial Vehicle Detection Across Heterogeneous Thermal-Visual Sensors

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Reliable unmanned aerial vehicle (UAV) detection is critical for autonomous airspace monitoring but remains challenging when integrating sensor streams that differ substantially in resolution, perspective, and field of view. Conventional fusion methods-such as wavelet-, Laplacian-, and decision-level approaches-often fail to preserve spatial correspondence across modalities and suffer from annotation of inconsistencies, limiting their robustness in real-world settings. This study introduces two fusion strategies, Registration-aware Guided Image Fusion (RGIF) and Reliability-Gated Modality-Attention Fusion (RGMAF), designed to overcome these limitations. RGIF employs Enhanced Correlation Coefficient (ECC)-based affine registration combined with guided filtering to maintain thermal saliency while enhancing structural detail. RGMAF integrates affine and optical-flow registration with a reliability-weighted attention mechanism that adaptively balances thermal contrast and visual sharpness. Experiments were conducted on the Multi-Sensor and Multi-View Fixed-Wing (MMFW)-UAV dataset comprising 147,417 annotated air-to-air frames collected from infrared, wide-angle, and zoom sensors. Among single-modality detectors, YOLOv10x demonstrated the most stable cross-domain performance and was selected as the detection backbone for evaluating fused imagery. RGIF improved the visual baseline by 2.13% mAP@50 (achieving 97.65%), while RGMAF attained the highest recall of 98.64%. These findings show that registration-aware and reliability-adaptive fusion provides a robust framework for integrating heterogeneous modalities, substantially enhancing UAV detection performance in multimodal environments.

💡 Research Summary

The paper addresses the critical problem of detecting unmanned aerial vehicles (UAVs) in complex airspace by fusing heterogeneous thermal (infrared) and visual (RGB) sensor streams that differ markedly in resolution, perspective, and field of view. Recognizing that conventional fusion techniques—wavelet, Laplacian pyramid, simple pixel‑level blending, or decision‑level voting—assume spatially aligned, uniformly sized inputs, the authors propose two novel fusion strategies that explicitly handle mis‑registration and modality reliability.

The first method, Registration‑aware Guided Image Fusion (RGIF), first aligns the thermal and visual images using an Enhanced Correlation Coefficient (ECC) based affine transformation. After global registration, a guided filter separates the thermal image into a base layer (preserving contrast and heat signatures) and the visual image into a detail layer (preserving edges and texture). The guided filter then injects the visual detail into the thermal base, producing a fused image that retains the salient thermal contrast of UAVs while gaining structural sharpness from the RGB channel.

The second method, Reliability‑Gated Modality‑Attention Fusion (RGMAF), builds on RGIF by adding a dense optical‑flow refinement step to correct residual local mis‑alignments. It then computes a reliability score for each modality: thermal reliability is derived from temperature contrast metrics, while visual reliability is estimated from image sharpness (Laplacian variance). These scores feed a soft‑attention mechanism that dynamically weights the contribution of each modality on a per‑pixel basis, allowing the system to favor the more trustworthy source under varying illumination, weather, or motion conditions.

Experiments are conducted on the Multi‑Sensor and Multi‑View Fixed‑Wing UAV (MMFW‑UAV) dataset, which contains 147,417 air‑to‑air frames captured by three heterogeneous sensors (infrared 1024×1280, wide‑angle RGB 2160×3840, and zoom‑RGB). The authors first benchmark single‑modality detectors using several YOLO variants; YOLOv10x emerges as the most stable across domains and is selected as the detection backbone for all fusion experiments.

Quantitative results show that RGIF improves the visual‑only baseline mAP@50 from 95.42 % to 97.65 % (a 2.13 % absolute gain) while preserving the original detection speed. RGMAF achieves the highest recall of 98.64 % and an F1‑score of 97.98 %, demonstrating superior robustness when thermal contrast is weak or visual sharpness is degraded. Traditional fusion baselines (Laplacian pyramid, wavelet, alpha blending) lag behind, achieving mAP values in the mid‑90 % range and exhibiting pronounced sensitivity to resolution mismatches. Additional ablation studies where synthetic registration errors are injected confirm that RGIF and RGMAF degrade gracefully (less than 1 % performance loss), whereas classical methods suffer severe drops.

The paper also discusses computational overhead: ECC registration and optical‑flow refinement add modest latency (≈15 ms per frame on a high‑end GPU), which is acceptable for many surveillance platforms but may require further optimization for edge devices. Limitations include the focus on only two modalities and the lack of real‑time embedded implementation.

In conclusion, the study demonstrates that a fusion pipeline combining explicit geometric alignment with reliability‑aware attention can substantially boost UAV detection performance in heterogeneous sensor environments. The proposed RGIF and RGMAF frameworks set a new benchmark for multimodal UAV perception and open avenues for extending the approach to additional modalities (e.g., radar, acoustic) and for deploying lightweight versions on onboard processors for real‑time airspace monitoring.

Alignment-Aware and Reliability-Gated Multimodal Fusion for Unmanned Aerial Vehicle Detection Across Heterogeneous Thermal-Visual Sensors

💡 Research Summary

Comments & Academic Discussion

Leave a Comment