Benchmarking SAM2-based Trackers on FMOX

Several object tracking pipelines extending the Segment Anything Model 2 (SAM2) have been proposed in the past year, in which the object is followed and segmented from a single exemplar template provided by the user in an initialization frame. We propose to benchmark these high-performing trackers (SAM2, EfficientTAM, DAM4SAM, and SAMURAI) on datasets containing fast-moving objects (FMOs), specifically designed to be challenging for tracking approaches. The goal is to better understand the current limitations of state-of-the-art trackers by providing more detailed insights into their behavior. We show that, overall, DAM4SAM and SAMURAI perform well on the more challenging sequences.


💡 Research Summary

This paper presents a comprehensive benchmarking study evaluating the performance of several state-of-the-art object tracking pipelines based on Meta’s Segment Anything Model 2 (SAM2) on datasets containing Fast Moving Objects (FMOs). The study aims to provide detailed insights into the limitations and capabilities of these trackers under challenging conditions characterized by high-speed motion and small object sizes.

The researchers benchmark four SAM2-based trackers: the original SAM2, EfficientTAM (optimized for efficiency), DAM4SAM (designed to handle distractors), and SAMURAI (incorporating motion-aware memory management). The evaluation is conducted on the FMOX dataset, a curated collection of 46 video sequences from four existing FMO datasets: Falling Object, TbD, TbD-3D, and FMOv2. FMOX is notably challenging, featuring sequences with extremely small objects and very low bounding box overlap between consecutive frames, pushing trackers to their limits.

The methodology is rigorously defined. All trackers are initialized using the ground-truth bounding box from the first frame where the target object appears in each sequence. Performance is measured using two standard metrics: the mean Intersection over Union (mIoU) and the mean Dice Score (mDice), computed over all frames containing ground truth. Frames where a tracker fails to produce a prediction are assigned a score of zero, ensuring that complete tracking failures are accurately reflected in the final scores.
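The scoring protocol described above is simple enough to sketch directly. The snippet below is an illustrative implementation (not the authors' code) of per-sequence mIoU and mDice over binary masks, where frames with no prediction contribute a score of zero; the function and variable names are assumptions for the sake of the example.

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over Union between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union > 0 else 0.0

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice score between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    total = pred.sum() + gt.sum()
    return float(2.0 * inter / total) if total > 0 else 0.0

def sequence_scores(predictions: dict, ground_truth: dict):
    """Mean IoU / Dice over all frames that carry ground truth.

    `predictions` maps frame index -> binary mask (the key is absent
    when the tracker produced no output); `ground_truth` maps frame
    index -> binary mask. Missing predictions score zero, so complete
    tracking failures are reflected in the final means, as in the
    paper's protocol.
    """
    ious, dices = [], []
    for idx, gt_mask in ground_truth.items():
        pred = predictions.get(idx)
        if pred is None:
            ious.append(0.0)
            dices.append(0.0)
        else:
            ious.append(iou(pred, gt_mask))
            dices.append(dice(pred, gt_mask))
    return float(np.mean(ious)), float(np.mean(dices))
```

Averaging these per-sequence scores across FMOX then yields the overall mIoU and mDice reported for each tracker.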

The key findings reveal a clear hierarchy in performance on the overall FMOX dataset. DAM4SAM consistently achieves the best results, topping both the mean and median scores for mIoU and mDice. SAMURAI follows closely with competitive performance. The original SAM2 shows moderate results, while EfficientTAM ranks lowest among the four. This aligns with the design focus of each tracker; DAM4SAM’s distractor-aware memory and SAMURAI’s motion-modeling are particularly beneficial for the confusing and dynamic scenarios in FMO data, whereas EfficientTAM’s architectural simplifications for efficiency seem to compromise accuracy in these demanding conditions.

A deeper analysis per dataset (Falling Object, TbD-3D, FMOv2, TbD) shows that performance is highly dependent on sequence characteristics. While DAM4SAM leads in several datasets, SAMURAI performs best on the most challenging FMOv2 dataset, and SAM2 surprisingly leads on TbD-3D. The box plots visually underscore the difficulty of FMOv2 and TbD, where all trackers exhibit wide performance spreads and frequent low/zero scores, indicating sequences where they completely fail. The paper also notes that initialization with motion-blurred frames, common in FMO data, can adversely affect tracker performance.

In conclusion, the study demonstrates that while SAM2-based trackers are powerful, their performance on fast-moving objects varies significantly based on their architectural enhancements. Trackers like DAM4SAM and SAMURAI, which employ advanced memory update strategies beyond SAM2’s simple FIFO approach, show greater robustness. The work highlights that the combined challenge of small object size and extreme inter-frame displacement, as formalized in the FMOX benchmark, remains a significant hurdle for current state-of-the-art trackers, offering a clear direction for future research in robust visual tracking.
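The contrast between SAM2's simple FIFO memory and the gated update strategies of DAM4SAM and SAMURAI can be illustrated schematically. The sketch below is purely conceptual and does not reproduce either tracker's actual implementation; the class names, the capacity of 7, and the confidence threshold are all assumptions chosen for illustration.

```python
from collections import deque

class FIFOMemory:
    """SAM2-style memory bank: always keep the N most recent frames,
    regardless of prediction quality."""
    def __init__(self, capacity: int = 7):
        self.bank = deque(maxlen=capacity)

    def update(self, frame_feature, confidence: float):
        # Every frame is committed; a blurred or occluded frame can
        # displace a clean one.
        self.bank.append(frame_feature)

class GatedMemory:
    """Conceptual sketch of a quality-gated update, in the spirit of
    SAMURAI's motion-aware and DAM4SAM's distractor-aware memory:
    only frames whose prediction confidence clears a threshold are
    committed, so unreliable frames do not pollute the bank."""
    def __init__(self, capacity: int = 7, threshold: float = 0.6):
        self.bank = deque(maxlen=capacity)
        self.threshold = threshold

    def update(self, frame_feature, confidence: float):
        if confidence >= self.threshold:
            self.bank.append(frame_feature)
```

Under fast motion, where many frames are blurred, the gated variant retains a cleaner set of templates, which is consistent with the robustness advantage the paper observes for DAM4SAM and SAMURAI.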

