Lightweight RGB-T Tracking with Mobile Vision Transformers
Single-modality tracking (RGB-only) struggles under low illumination, weather, and occlusion. Multimodal tracking addresses this by combining complementary cues. While Vision Transformer-based trackers achieve strong accuracy, they are often too large for real-time deployment. We propose a lightweight RGB-T tracker built on MobileViT with a progressive fusion framework that models intra- and inter-modal interactions using separable mixed attention. This design delivers compact, effective features for accurate localization, with under 4M parameters and real-time performance of 25.7 FPS on CPU and 122 FPS on GPU, supporting embedded and mobile platforms. To the best of our knowledge, this is the first MobileViT-based multimodal tracker. Model code and weights are available in the GitHub repository.
💡 Research Summary
The paper addresses the well‑known weakness of single‑modality RGB object tracking under challenging conditions such as low illumination, adverse weather, and occlusions. By fusing thermal infrared (IR) data with RGB, the authors aim to create a more robust multimodal tracker. However, recent multimodal trackers that rely on large Vision Transformers (ViT, HiViT) achieve high accuracy at the cost of massive model size and computational demand, making real‑time deployment on embedded devices impractical.
To solve this, the authors propose a lightweight RGB‑T tracker built on MobileViT‑v2, a transformer architecture specifically designed for mobile platforms. The core contribution is a progressive fusion framework that employs separable mixed attention (SMA) in two stages: (1) intra‑modal attention within each modality (RGB and IR) and (2) inter‑modal attention after concatenating the two modality token streams. In the backbone, after an initial depthwise‑separable convolution and inverted residual blocks that down‑sample the spatial resolution, Layer 3 tokenizes each modality’s template and search features into non‑overlapping patches (size p₁) and processes them with L layers of SMA. This yields modality‑specific global context with linear complexity O(N·d). Layer 4 then concatenates the RGB and IR token sequences and applies another L layers of SMA, enabling the model to learn cross‑modal relationships only after sufficient intra‑modal reasoning has taken place.
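The linear complexity claimed above comes from the separable attention design popularized by MobileViT-v2: instead of an N×N token-to-token attention matrix, a single latent projection produces one context score per token, and a global context vector re-weights the value projections. A minimal numpy sketch of this idea (function and weight names are illustrative, not taken from the paper):

```python
import numpy as np

def separable_self_attention(x, w_i, W_k, W_v, W_o):
    """Sketch of separable self-attention with O(N*d) cost.

    x: (N, d) token sequence. A latent vector w_i produces one scalar
    score per token (no N x N attention matrix); the scores pool the
    key projections into a single global context vector that gates the
    value projections.
    """
    scores = x @ w_i                            # (N,) one score per token
    scores = np.exp(scores - scores.max())
    scores = scores / scores.sum()              # softmax over tokens
    context = (scores[:, None] * (x @ W_k)).sum(axis=0)  # (d,) global context
    out = np.maximum(x @ W_v, 0.0) * context    # ReLU values, broadcast gate
    return out @ W_o                            # (N, d)

rng = np.random.default_rng(0)
N, d = 16, 8
x = rng.standard_normal((N, d))
y = separable_self_attention(
    x,
    rng.standard_normal(d),
    rng.standard_normal((d, d)),
    rng.standard_normal((d, d)),
    rng.standard_normal((d, d)),
)
print(y.shape)  # (16, 8)
```

In the progressive scheme, Layer 3 would run such attention over each modality's tokens separately, while Layer 4 would run it over the concatenated RGB+IR sequence, so cross-modal mixing only happens in the later stage.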
The neck module performs pixel‑wise cross‑correlation (PW‑XCorr) between template and search features for each modality, producing fused feature maps (C_f channels). A lightweight cross‑modal fusion transformer further refines these maps: after patching (size p₂) and flattening, L SMA layers are applied, and the final fused representation is obtained by a learnable channel‑wise weighted sum σ(W_RGB)·F_RGB + σ(W_IR)·F_IR. The prediction head follows the SMA‑T design, with parallel classification and regression branches that include a 3×3 convolution, separable self‑attention, and a final 3×3 convolution.
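The neck's two operations can be sketched in a few lines. In pixel-wise cross-correlation, each template pixel's feature vector acts as a 1×1 kernel over the search features, and the fusion step is a sigmoid-gated channel-wise sum. Shapes and names below are illustrative, not from the paper:

```python
import numpy as np

def pw_xcorr(template, search):
    """Pixel-wise cross-correlation (PW-XCorr) sketch.

    template: (C, Ht, Wt), search: (C, Hs, Ws). Each of the Ht*Wt
    template pixels correlates with every search location, so the
    output has Ht*Wt channels at the search resolution.
    """
    C, Ht, Wt = template.shape
    _, Hs, Ws = search.shape
    t = template.reshape(C, Ht * Wt)            # (C, Ht*Wt)
    s = search.reshape(C, Hs * Ws)              # (C, Hs*Ws)
    return (t.T @ s).reshape(Ht * Wt, Hs, Ws)   # (Ht*Wt, Hs, Ws)

def weighted_fusion(f_rgb, f_ir, w_rgb, w_ir):
    """Learnable gate: sigmoid(W_RGB)*F_RGB + sigmoid(W_IR)*F_IR, per channel."""
    sig = lambda w: 1.0 / (1.0 + np.exp(-w))
    return sig(w_rgb)[:, None, None] * f_rgb + sig(w_ir)[:, None, None] * f_ir

rng = np.random.default_rng(1)
f_rgb = pw_xcorr(rng.standard_normal((8, 4, 4)), rng.standard_normal((8, 6, 6)))
f_ir  = pw_xcorr(rng.standard_normal((8, 4, 4)), rng.standard_normal((8, 6, 6)))
fused = weighted_fusion(f_rgb, f_ir, rng.standard_normal(16), rng.standard_normal(16))
print(fused.shape)  # (16, 6, 6)
```

The sigmoid gates let the network learn, per channel, how much to trust each modality's correlation map before the prediction head runs.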
Training is conducted on the large‑scale LasHeR RGB‑T benchmark for 60 epochs using AdamW (lr = 4e‑4) with a reduced learning rate for the backbone (×0.1). Data augmentation consists of horizontal flips and brightness jitter. Input template and search sizes are 128×128 and 256×256, respectively, and the network progressively downsamples by a factor of 2 four times, yielding 8×8 (template) and 16×16 (search) feature maps.
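The quoted feature-map sizes follow directly from the four stride-2 stages, which shrink spatial resolution by 2⁴ = 16:

```python
# Four stride-2 downsampling stages reduce each spatial dimension by 2**4 = 16.
def feature_size(input_size, num_downsamples=4):
    return input_size // (2 ** num_downsamples)

print(feature_size(128))  # template: 128 -> 8
print(feature_size(256))  # search:   256 -> 16
```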
Experimental results on three standard RGB‑T benchmarks—LasHeR, RGBT234, and GTOT—demonstrate that the proposed model achieves competitive tracking accuracy while dramatically reducing model size and increasing speed. The tracker contains only 3.93 M parameters, runs at 121.9 FPS on an RTX 3090 GPU and 25.7 FPS on an Intel i9‑12900KF CPU, and outperforms or matches state‑of‑the‑art efficient multimodal trackers such as SUT‑track‑Tiny (22 M params, ~100 FPS) and EMT‑track (which has a larger parameter budget). On GTOT, it even attains the highest Precision Rate (PR) and Success Rate (SR), indicating strong performance on small, fast‑moving targets.
Ablation studies explore (a) the placement of fusion (no fusion, all fusion, and the proposed progressive design) and (b) the contribution of the IR stream and the cross‑modal transformer. Results confirm that early, full‑modal fusion harms discriminability, while the progressive scheme yields the best trade‑off. Removing the IR modality drops PR and SR by roughly 5 %, and omitting the cross‑modal transformer reduces performance by an additional 3‑4 %.
The authors acknowledge a limitation: concatenating RGB and IR tokens doubles the token count, slightly increasing memory usage and inference time. They suggest future work on token pruning or other efficient token reduction techniques. Moreover, the framework is readily extensible to other modalities (e.g., depth, event cameras), opening avenues for broader multimodal tracking research.
In summary, this work presents a well‑engineered, mobile‑friendly RGB‑T tracker that balances accuracy, model size, and speed through a novel progressive intra‑to‑inter‑modal fusion strategy based on separable mixed attention. It demonstrates that high‑performance multimodal tracking can be achieved on resource‑constrained platforms, making it highly relevant for real‑world applications such as autonomous vehicles, UAV surveillance, and portable robotics.