MS-Temba: Multi-Scale Temporal Mamba for Understanding Long Untrimmed Videos

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original ArXiv source.

Temporal Action Detection (TAD) in untrimmed videos poses significant challenges, particularly for Activities of Daily Living (ADL), which require models to (1) process long-duration videos, (2) capture temporal variations in actions, and (3) simultaneously detect dense, overlapping actions. Existing CNN- and Transformer-based approaches struggle to jointly capture fine-grained detail and long-range structure at scale. The State-space Model (SSM) based Mamba offers powerful long-range modeling, but naive application to TAD collapses fine-grained temporal structure and fails to account for the challenges inherent to TAD. To this end, we propose Multi-Scale Temporal Mamba (MS-Temba), which extends Mamba to TAD with newly introduced dilated SSMs. Each Temba block, comprising dilated SSMs coupled with our proposed additional losses, enables the learning of discriminative representations across temporal scales. A lightweight Multi-scale Mamba Fuser then unifies these multi-scale features via SSM-based aggregation, yielding precise action-boundary localization. With only 17M parameters, MS-Temba achieves state-of-the-art performance on the densely labeled ADL benchmarks TSU & Charades, and further generalizes to long-form video summarization, setting new state-of-the-art results on TVSum & SumMe.


💡 Research Summary

The research paper introduces “MS-Temba,” a novel architecture designed to tackle the complexities of Temporal Action Detection (TAD) in long, untrimmed videos. The primary challenge in TAD, especially for Activities of Daily Living (ADL), lies in the necessity to process extended durations, capture multi-scale temporal variations, and identify overlapping, dense actions. While traditional CNNs struggle with long-range context and Transformers suffer from quadratic computational complexity, the recently emerged State-space Model (SSM), Mamba, offers a linear-scaling alternative. However, the authors identify a critical flaw: applying Mamba naively to TAD leads to the collapse of fine-grained temporal structures, making precise boundary detection impossible.

To overcome this, the authors propose the Multi-Scale Temporal Mamba (MS-Temba). The core innovation is the introduction of “Dilated SSMs.” By applying dilation to the SSM’s temporal scan, the model captures temporal patterns across multiple scales, effectively bridging the gap between fine-grained local details and long-range global structure. This allows the model to maintain high resolution for short-term action boundaries while simultaneously understanding the broader context of long-duration activities.
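The dilated-scan idea can be sketched in a few lines. The following is a minimal illustrative NumPy sketch, not the paper’s implementation: it substitutes a toy diagonal recurrence h[t] = a·h[t−1] + x[t] for a full selective SSM, and the function names (`ssm_scan`, `dilated_ssm`) are our own.

```python
import numpy as np

def ssm_scan(x, a=0.9):
    """Toy diagonal SSM recurrence: h[t] = a * h[t-1] + x[t].
    x: (T, C) array of per-frame features."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + x[t]
        out[t] = h
    return out

def dilated_ssm(x, dilation):
    """Scan `dilation` interleaved subsequences independently, so each
    recurrence step jumps `dilation` frames: larger dilations see
    longer-range structure, while dilation=1 keeps full resolution."""
    T = x.shape[0]
    out = np.empty_like(x)
    for offset in range(dilation):
        idx = np.arange(offset, T, dilation)
        out[idx] = ssm_scan(x[idx])
    return out
```

With dilation=1 this reduces to an ordinary scan; larger dilations trade local detail for temporal reach, which is why MS-Temba combines several dilation rates rather than committing to one.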

The architecture is further enhanced by “Temba blocks,” which integrate these dilated SSMs with specialized loss functions designed to promote the learning of highly discriminative representations. To unify the multi-scale features extracted by these blocks, the authors developed a “Multi-scale Mamba Fuser.” This lightweight, SSM-based aggregation mechanism intelligently merges features from different temporal resolutions, ensuring precise action-boundary localization.
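As a rough sketch of that fusion step (our own minimal NumPy illustration, assuming the per-scale features have already been computed; the recurrence and names are simplified stand-ins for the paper’s SSM-based fuser):

```python
import numpy as np

def fuse_scales(per_scale_feats, a=0.9):
    """per_scale_feats: list of (T, C) arrays, one per temporal scale
    (e.g. outputs of blocks run at different dilation rates).
    Concatenate along channels, then aggregate over time with a toy
    diagonal SSM recurrence h[t] = a * h[t-1] + x[t]."""
    fused = np.concatenate(per_scale_feats, axis=-1)  # (T, C * num_scales)
    h = np.zeros(fused.shape[1])
    out = np.empty_like(fused)
    for t in range(fused.shape[0]):
        h = a * h + fused[t]
        out[t] = h
    return out  # (T, C * num_scales), fed to the detection head
```

The key design point is that aggregation itself is sequential (an SSM pass over time) rather than a per-frame MLP, so the fused representation still carries long-range temporal context.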

The efficiency of MS-Temba is a standout feature. With only 17 million parameters, the model achieves state-of-the-art (SOTA) performance on the TSU and Charades benchmarks, which are critical for ADL analysis. Furthermore, the model demonstrates remarkable generalization capabilities; when applied to the task of long-form video summarization, it sets new SOTA records on the TVSum and SumMe datasets. This dual success in both detection and summarization highlights MS-Temba’s robustness and potential as a foundational architecture for advanced video understanding tasks, offering a highly efficient and scalable solution for processing the increasingly long and complex video data prevalent in modern AI applications.

