Uncertainty-Guided Inference-Time Depth Adaptation for Transformer-Based Visual Tracking
Transformer-based single-object trackers achieve state-of-the-art accuracy but rely on fixed-depth inference, executing the full encoder–decoder stack for every frame regardless of visual complexity, thereby incurring unnecessary computational cost in long video sequences dominated by temporally coherent frames. We propose UncL-STARK, an architecture-preserving approach that enables dynamic, uncertainty-aware depth adaptation in transformer-based trackers without modifying the underlying network or adding auxiliary heads. The model is fine-tuned to retain predictive robustness at multiple intermediate depths using random-depth training with knowledge distillation, enabling safe inference-time truncation. At runtime, we derive a lightweight uncertainty estimate directly from the model’s corner localization heatmaps and use it in a feedback-driven policy that, exploiting temporal coherence in video, selects the encoder and decoder depth for the next frame based on the current prediction confidence. Extensive experiments on GOT-10k and LaSOT demonstrate up to 12% GFLOPs reduction, 8.9% latency reduction, and 10.8% energy savings while maintaining tracking accuracy within 0.2% of the full-depth baseline across both short-term and long-term sequences.
💡 Research Summary
The paper addresses a fundamental inefficiency in modern transformer‑based single‑object trackers such as STARK: every video frame is processed with the full encoder‑decoder stack regardless of visual complexity. In long sequences most frames are temporally coherent and visually simple, so the fixed‑depth inference wastes computation and energy. The authors propose UncL‑STARK, an architecture‑preserving framework that enables dynamic, uncertainty‑aware depth adaptation without adding auxiliary heads or modifying the original network topology.
The method consists of two complementary components. First, during a fine‑tuning stage the model is trained with random‑depth sampling. For each training sample a full‑depth “teacher” pass (the original STARK configuration) and a randomly truncated “student” pass are executed. A knowledge‑distillation loss aligns the student’s predictions with the teacher’s output, while the standard tracking loss is also applied. This forces the network to produce reliable predictions at any intermediate encoder‑decoder depth, effectively turning the original model into a multi‑exit network without explicit early‑exit heads.
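The random-depth fine-tuning stage described above can be sketched as a single training step: a full-depth teacher pass, a randomly truncated student pass, and a combined task-plus-distillation loss. This is a minimal illustrative sketch, not the paper's code; `ToyTracker` is a hypothetical stand-in for STARK's encoder–decoder stack, and the layer shapes and loss choices are assumptions.

```python
import random
import torch
import torch.nn as nn

class ToyTracker(nn.Module):
    """Hypothetical stand-in for a STARK-style encoder-decoder tracker.

    Depths are controlled at forward time by truncating the layer lists,
    mirroring the paper's depth-adaptive inference without extra heads.
    """
    def __init__(self, d_model=32, n_layers=5):
        super().__init__()
        self.encoder = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_layers))
        self.decoder = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_layers))
        self.head = nn.Linear(d_model, 4)  # box prediction head (illustrative)

    def forward(self, x, depth_e=None, depth_d=None):
        depth_e = depth_e or len(self.encoder)
        depth_d = depth_d or len(self.decoder)
        for layer in self.encoder[:depth_e]:
            x = torch.relu(layer(x))
        for layer in self.decoder[:depth_d]:
            x = torch.relu(layer(x))
        return self.head(x)

def train_step(model, x, target_box, lam=0.5, max_depth=5, min_depth=1):
    """One random-depth KD step: full-depth teacher, truncated student."""
    # Teacher pass at full depth (the original configuration), no gradient.
    with torch.no_grad():
        teacher_out = model(x, max_depth, max_depth)
    # Student pass at randomly sampled depths; min_depth=1 keeps the first
    # encoder and decoder layers, matching the paper's MIN = 1 constraint.
    de = random.randint(min_depth, max_depth)
    dd = random.randint(min_depth, max_depth)
    student_out = model(x, de, dd)
    # Standard tracking loss plus distillation toward the teacher output.
    task_loss = nn.functional.l1_loss(student_out, target_box)
    kd_loss = nn.functional.mse_loss(student_out, teacher_out)
    return task_loss + lam * kd_loss
```

Because the teacher and student share the same weights, no second network is stored; only an extra full-depth forward pass is paid during fine-tuning.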
Second, at inference time a lightweight uncertainty estimate is derived directly from the corner localization heatmaps that STARK already generates. After a spatial softmax, the top‑k probability mass of each heatmap (top‑left and bottom‑right corners) is averaged to obtain a scalar confidence score C. High C indicates a sharp, peaked heatmap (high certainty); low C indicates a diffuse heatmap (uncertainty due to occlusion, motion blur, etc.). A simple threshold‑based policy uses two pre‑defined thresholds (τ_high > τ_low) to select one of three depth configurations for the next frame: an “easy” shallow pair (E_easy, D_easy), a “medium” pair, or a “hard” deep pair. Because video frames are temporally correlated, a confident prediction on frame t typically implies that frame t + 1 will be similarly easy, allowing the system to skip unnecessary transformer layers.
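The confidence score and threshold policy above amount to a few lines of code. The sketch below assumes illustrative threshold values and depth pairs (the paper tunes these per benchmark); the function names are hypothetical.

```python
import torch

def heatmap_confidence(tl_heatmap, br_heatmap, k=5):
    """Scalar confidence C: mean top-k probability mass of the two corner
    heatmaps after a spatial softmax. Peaked heatmaps concentrate mass in
    few cells (high C); diffuse heatmaps spread it out (low C)."""
    masses = []
    for hm in (tl_heatmap, br_heatmap):
        probs = torch.softmax(hm.flatten(), dim=0)
        masses.append(probs.topk(k).values.sum())
    return torch.stack(masses).mean().item()

def select_depth(C, tau_high=0.7, tau_low=0.4):
    """Map confidence on frame t to (encoder, decoder) depths for t + 1.
    Thresholds and depth pairs here are illustrative assumptions."""
    if C >= tau_high:
        return (1, 1)   # "easy": shallow pair (E_easy, D_easy)
    if C >= tau_low:
        return (3, 3)   # "medium" pair
    return (5, 5)       # "hard": full-depth pair
```

Because the heatmaps are byproducts of the tracker's existing corner head, the estimate adds essentially no overhead beyond a softmax and a top-k selection per frame.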
Experiments are conducted on the large‑scale GOT‑10k and LaSOT benchmarks. The authors first evaluate symmetric depth pairs (i,i) for i = 1…5, showing that without random‑depth training the shallow configurations suffer severe accuracy loss, while the proposed RD + KD training narrows the gap dramatically. For example, depth (3,3) retains almost full‑depth performance (within 0.2 % of the (5,5) baseline) while saving roughly 12 % of GFLOPs.
Applying the uncertainty‑driven feedback policy yields further gains: overall GFLOPs are reduced by 11.9 %–12.0 %, latency drops by 7 %–8.9 %, and GPU energy consumption is cut by 4 %–10.8 % across datasets. Tracking accuracy measured by AO (GOT‑10k) and AUC (LaSOT) degrades by less than 0.2 % relative to the fixed‑depth baseline, confirming that the dynamic depth selection does not sacrifice performance.
Ablation studies examine the impact of minimum depth constraints, the distillation weight λ, and the top‑k value used for confidence computation. Results indicate that keeping at least the first encoder and decoder layers (MIN = 1) is essential for stable features, λ ≈ 0.5 provides the best trade‑off between teacher guidance and task loss, and the confidence metric is robust to the exact choice of k (5–10).
In summary, UncL‑STARK demonstrates that a transformer‑based tracker can be made computationally adaptive by (1) training it to be robust at multiple depths via random‑depth knowledge distillation, and (2) exploiting the inherent confidence signal in its heatmap outputs to decide depth on a per‑frame basis. The approach achieves significant reductions in FLOPs, latency, and energy while preserving state‑of‑the‑art tracking accuracy, making it attractive for real‑time and resource‑constrained applications such as mobile robotics, AR/VR, and embedded vision systems.