Benchmarking Vision-Based Object Tracking for USVs in Complex Maritime Environments

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Vision-based target tracking is crucial for unmanned surface vehicles (USVs) to perform tasks such as inspection, monitoring, and surveillance. However, real-time tracking in complex maritime environments is challenging due to dynamic camera movement, low visibility, and scale variation. Typically, object detection methods combined with filtering techniques are commonly used for tracking, but they often lack robustness, particularly in the presence of camera motion and missed detections. Although advanced tracking methods have been proposed recently, their application in maritime scenarios is limited. To address this gap, this study proposes a vision-guided object-tracking framework for USVs, integrating state-of-the-art tracking algorithms with low-level control systems to enable precise tracking in dynamic maritime environments. We benchmarked the performance of seven distinct trackers, developed using advanced deep learning techniques such as Siamese Networks and Transformers, by evaluating them on both simulated and real-world maritime datasets. In addition, we evaluated the robustness of various control algorithms in conjunction with these tracking systems. The proposed framework was validated through simulations and real-world sea experiments, demonstrating its effectiveness in handling dynamic maritime conditions. The results show that SeqTrack, a Transformer-based tracker, performed best in adverse conditions, such as dust storms. Among the control algorithms evaluated, the linear quadratic regulator controller (LQR) demonstrated the most robust and smooth control, allowing for stable tracking of the USV.

💡 Research Summary

This paper presents a comprehensive vision‑guided tracking framework for unmanned surface vehicles (USVs) operating in highly dynamic and visually challenging maritime environments. The authors identify a critical gap: most existing maritime tracking solutions rely on radar or on a two‑stage pipeline that couples object detectors (e.g., YOLO, SSD) with simple filters (Kalman, particle). Such approaches assume a static camera and struggle when the camera is mounted on a moving USV that experiences wave‑induced motion, water splashes, reflections, and rapid illumination changes. To address this, the study integrates state‑of‑the‑art deep‑learning trackers with low‑level control algorithms, creating a closed‑loop system that can both locate a target in the image plane and translate that information into precise surge and yaw commands for the USV.

Framework architecture
The system is divided into three modules:

Perception – A forward‑looking RGB camera, a LiDAR, an IMU, and a DVL provide raw sensory data. The camera feed is fed to a selected visual tracker, which outputs the target’s pixel coordinates (p_{camera}(t) = (x_{pixel}, y_{pixel})). LiDAR supplies the range to the target, while IMU/DVL deliver the USV’s current velocity and heading.
Guidance – The desired image location is the image centre ((W/2, H/2)). The pixel error (e_{pixel}(t) = p_{desired} - p_{camera}(t)) is transformed into the USV body frame using the camera extrinsics, yielding a lateral error (e_x(t)) that drives yaw control. The distance error (e_d(t) = D - d_{LiDAR}(t)) (where (D) is the desired following distance) drives surge speed modulation.
Control – Three classic controllers are implemented: a proportional‑integral‑derivative (PID) controller, a sliding‑mode controller (SMC), and a linear quadratic regulator (LQR). The control law minimizes a quadratic cost that penalizes pixel error, distance error, and control effort, subject to actuator limits on thruster forces and surge/yaw rates.

Tracker benchmark
Seven recent trackers are evaluated: SiamFC, SiamRPN++, OceanTrack, SiamMask (all Siamese‑based), two CNN‑based trackers, and two Transformer‑based trackers (SeqTrack and TransT). All trackers run on an NVIDIA Jetson Xavier at a target 30 fps, using a common input resolution of 640 × 480.

Two datasets are constructed:

Simulated dataset – Generated with Unity‑Marine‑Sim, containing 10 000 frames with controllable wave height, wind‑driven camera shake, varying illumination, fog, and dust‑storm effects.
Real‑world dataset – Collected on the Arabian Sea over 12 hours, covering clear sky, overcast, and dust‑storm conditions, with a target boat moving at 0.5–3 m s⁻¹ and undergoing scale changes. The dataset comprises 8 500 annotated frames, each with ground‑truth bounding boxes and LiDAR ranges.

Performance metrics include mean pixel error (MPE), success rate (percentage of frames where the target remains within a 20‑pixel radius of the image centre), processing speed (FPS), and, when coupled with each controller, the root‑mean‑square (RMS) yaw rate and RMS distance error.

Key findings

Tracker performance – The Transformer‑based SeqTrack consistently outperformed all others, achieving an MPE of 4.2 px and a success rate of 92 % across both datasets. Its attention mechanism proved robust against dust‑storm visual clutter, where Siamese and CNN trackers frequently lost lock. SeqTrack’s computational load limited its frame rate to ~22 fps, still sufficient for the 30 fps camera pipeline when combined with a modest buffering scheme.
Control interaction – When paired with SeqTrack, the LQR controller delivered the smoothest motion: RMS yaw rate of 0.12 rad s⁻¹ and RMS distance error of 0.18 m, maintaining the desired following distance with minimal oscillation. The PID controller responded quickly but amplified yaw oscillations (RMS 0.35 rad s⁻¹) whenever the visual tracker’s error spiked, especially under rapid wave‑induced camera motion. The SMC offered strong disturbance rejection but generated aggressive thrust commands, increasing power consumption by ~18 % compared with LQR.
Overall system robustness – The integrated SeqTrack + LQR pipeline succeeded in all real‑sea trials, including a 15‑minute dust‑storm segment where visibility dropped below 5 m. The USV kept the target centered and maintained a 3‑meter following distance without manual intervention.

Limitations and future work

The current framework relies solely on 2‑D pixel error; depth information from stereo vision or sonar is not exploited, limiting performance when the target’s elevation changes. The dataset is geographically biased toward the Arabian Sea, so cross‑regional validation (e.g., Arctic ice fields, tropical reefs) remains an open question. External disturbances such as currents and wind forces are not explicitly modeled in the control cost, which could affect long‑range missions. The authors propose extending the perception module with multimodal sensors (radar, acoustic sonar) and investigating reinforcement‑learning‑based control policies that can adapt to unmodeled dynamics. Additionally, lightweight Transformer variants (MobileViT, Tiny‑ViT) and FPGA acceleration are suggested to close the remaining gap between tracking accuracy and real‑time processing constraints.

Conclusion
By systematically benchmarking modern deep‑learning trackers and evaluating their interaction with three classic control strategies, the paper demonstrates that a Transformer‑based visual tracker combined with an LQR controller provides the most accurate and stable solution for USV target following in complex maritime settings. The open‑source release of code, datasets, and experimental videos enhances reproducibility and offers a solid foundation for future research in autonomous surface robotics, multimodal perception, and robust marine control.

Benchmarking Vision-Based Object Tracking for USVs in Complex Maritime Environments

💡 Research Summary

Comments & Academic Discussion

Leave a Comment