CA-YOLO: Cross Attention Empowered YOLO for Biomimetic Localization
In modern complex environments, achieving accurate and efficient target localization is essential in numerous fields. However, existing systems often face limitations in both accuracy and the ability to recognize small targets. In this study, we propose a bionic stabilized localization system based on CA-YOLO, designed to enhance both target localization accuracy and small target recognition capability. Acting as the “brain” of the system, the target detection algorithm emulates the visual focusing mechanism of animals by integrating bionic modules into the YOLO backbone network. These modules include a small target detection head and a newly developed Characteristic Fusion Attention Mechanism (CFAM). Furthermore, drawing inspiration from the human Vestibulo-Ocular Reflex (VOR), a bionic pan-tilt tracking control strategy is developed, which incorporates central positioning, stability optimization, adaptive control coefficient adjustment, and an intelligent recapture function. Experimental results show that CA-YOLO outperforms the original model on standard datasets (COCO and VisDrone), with average accuracy metrics improved by 3.94% and 4.90%, respectively. Further time-sensitive target localization experiments validate the effectiveness and practicality of this bionic stabilized localization system.
💡 Research Summary
The paper presents a biomimetic stabilized localization system that combines a novel object detection network, CA‑YOLO, with a VOR‑inspired pan‑tilt control strategy to achieve accurate detection of small targets and robust tracking of moving objects in complex environments.
CA‑YOLO builds upon the YOLOv8 backbone and introduces three key enhancements: (1) a Multi‑Head Self‑Attention (MHSA) module placed after the Spatial Pyramid Pooling Fast (SPPF) layer, which captures global contextual relationships and enriches fine‑grained details that are crucial for small objects; (2) a dedicated small‑target detection head (named xSmall) that adds a high‑resolution detection branch to shallow feature maps, thereby preserving detail and improving recall for tiny objects; and (3) a Characteristic Fusion Attention Mechanism (CFAM) that replaces the conventional concatenation operation in the head, applying combined channel‑ and spatial‑wise attention to fuse multi‑scale features more effectively. These modifications increase the model’s ability to detect objects across scales while keeping computational overhead modest.
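The summary describes CFAM as combined channel- and spatial-wise attention applied in place of plain concatenation, without giving its exact formulation. A minimal NumPy sketch of that idea follows; the pooling and gating choices (global average pooling, mean-centred sigmoid gates) are illustrative assumptions, not the authors' design:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cfam_fuse(feat_a, feat_b):
    """Hypothetical CFAM-style fusion of two multi-scale feature maps.

    Concatenates along the channel axis, then reweights the result with a
    channel-wise gate followed by a spatial-wise gate.
    """
    x = np.concatenate([feat_a, feat_b], axis=0)      # (C_a + C_b, H, W)

    # Channel attention: global average pooling -> per-channel sigmoid gate
    chan_desc = x.mean(axis=(1, 2))                   # (C,)
    chan_gate = sigmoid(chan_desc - chan_desc.mean())
    x = x * chan_gate[:, None, None]

    # Spatial attention: pool across channels -> per-pixel sigmoid gate
    spat_desc = x.mean(axis=0)                        # (H, W)
    spat_gate = sigmoid(spat_desc - spat_desc.mean())
    return x * spat_gate[None, :, :]

a = np.random.rand(8, 4, 4).astype(np.float32)
b = np.random.rand(8, 4, 4).astype(np.float32)
fused = cfam_fuse(a, b)
print(fused.shape)  # → (16, 4, 4)
```

In a real network the gates would be produced by small learned layers (as in CBAM-style attention) rather than parameter-free pooling, but the data flow, i.e. concatenate, then gate channels, then gate positions, matches the fusion role CFAM plays in the detection head.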
Training and evaluation were performed on the COCO and VisDrone datasets. Compared with the original YOLOv8, CA‑YOLO achieved mean average‑precision (mAP) improvements of 3.94 % on COCO and 4.90 % on VisDrone, while maintaining real‑time inference speeds (>30 fps). An ablation study confirmed that MHSA, the xSmall head, and CFAM each contributed roughly 1.2–1.8 % of the overall mAP gain.
Beyond the detection algorithm, the authors designed a hardware‑software system that mimics the human vestibulo‑ocular reflex (VOR). The hardware stack consists of a high‑resolution camera (retina analogue), a computer terminal (vestibular nucleus analogue), an STM32 microcontroller (oculomotor nucleus analogue), and a servo‑driven pan‑tilt unit (extra‑ocular muscles analogue). The detection module runs CA‑YOLO on incoming video frames, outputs the target’s screen‑center offset, and feeds this error to the control module. The control algorithm proceeds through four stages: (i) target‑center positioning, (ii) stability optimization via a PID controller, (iii) adaptive adjustment of the PID gains based on real‑time estimates of target velocity and acceleration, and (iv) an intelligent recapture routine that initiates a search pattern when the target leaves the field of view, achieving reacquisition within 0.8 s. Experimental trials demonstrated angular errors below 0.03 rad and successful tracking of targets moving up to 15 m/s, even under vibration and lighting changes.
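The four control stages above can be sketched as a single loop in pure Python. All gains, the velocity-based adaptation rule, and the recapture sweep amplitude below are illustrative placeholders, not the authors' tuned values:

```python
class AdaptivePanTiltPID:
    """Sketch of the VOR-inspired pan-tilt control loop (illustrative gains)."""

    def __init__(self, kp=0.8, ki=0.05, kd=0.2, dt=0.033):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_err = 0.0
        self.prev_vel = 0.0

    def step(self, pixel_error):
        # (i) target-centre positioning: error is the target's offset
        #     from the image centre, supplied by the detector
        err = pixel_error

        # (iii) adaptive gain adjustment: raise Kp as the estimated
        #       target velocity grows (scale factor is an assumption)
        vel = (err - self.prev_err) / self.dt
        kp = self.kp * (1.0 + min(abs(vel) * 1e-3, 1.0))

        # (ii) stability optimisation via a standard PID law
        self.integral += err * self.dt
        deriv = (err - self.prev_err) / self.dt
        cmd = kp * err + self.ki * self.integral + self.kd * deriv

        self.prev_err, self.prev_vel = err, vel
        return cmd

    def recapture(self, lost_frames):
        # (iv) intelligent recapture: alternate sweep direction each frame
        #      while the target is out of view
        return 0.1 * (-1) ** lost_frames if lost_frames > 0 else 0.0

ctrl = AdaptivePanTiltPID()
cmd = ctrl.step(10.0)   # positive offset -> positive correction command
```

On the real hardware, `cmd` would be converted to a servo set-point and sent to the STM32 over serial each frame; the acceleration estimate mentioned in the paper could feed the derivative gain in the same way velocity feeds the proportional gain here.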
System‑level tests on both a UAV and a ground robot confirmed that the integrated solution can maintain detection rates above 92 % in cluttered, dynamic scenes, and that the pan‑tilt latency stays under 45 ms, satisfying real‑time requirements.
In conclusion, the study shows that embedding biologically inspired attention mechanisms and a small‑target head into YOLO markedly improves small‑object detection, and that coupling this detector with a VOR‑based pan‑tilt controller yields a highly stable, adaptable tracking platform. Limitations include a modest increase in computational load due to CFAM and the need for further validation under extreme low‑light conditions. Future work is suggested on extending the control to full 3‑DoF gimbal motion and on optimizing the attention module for edge‑device deployment.