Text-driven Online Action Detection
Detecting actions as they occur is essential for applications like video surveillance, autonomous driving, and human-robot interaction. Known as online action detection, this task requires classifying actions in streaming videos, handling background noise, and coping with incomplete actions. Transformer architectures are the current state-of-the-art, yet the potential of recent advancements in computer vision, particularly vision-language models (VLMs), remains largely untapped for this problem, partly due to high computational costs. In this paper, we introduce TOAD: a Text-driven Online Action Detection architecture that supports zero-shot and few-shot learning. TOAD leverages CLIP (Contrastive Language-Image Pretraining) textual embeddings, enabling efficient use of VLMs without significant computational overhead. Our model achieves 82.46% mAP on the THUMOS14 dataset, outperforming existing methods, and sets new baselines for zero-shot and few-shot performance on the THUMOS14 and TVSeries datasets.
💡 Research Summary
The paper introduces TOAD (Text‑driven Online Action Detection), a novel architecture that brings vision‑language models (VLMs), specifically CLIP, into the online action detection (OAD) problem. OAD requires frame‑wise classification of streaming video without future context, handling background frames and incomplete actions. While transformer‑based methods have become state‑of‑the‑art, they are computationally heavy due to self‑attention over long sequences. TOAD addresses this by exploiting CLIP’s pretrained textual embeddings as a frozen classifier, thereby eliminating the need for large contrastive learning batches and reducing computational overhead.
The pipeline consists of four main components: CLIP’s visual encoder, a temporal transformer encoder, CLIP’s textual encoder, and a future‑anticipation branch. First, video frames are uniformly sampled and passed through CLIP’s visual encoder to obtain per‑frame visual features. These features are fed into a standard transformer encoder (6 layers, 12 heads) that aggregates temporal information and produces a video embedding by averaging the token representations. Next, class names are encoded with CLIP’s textual encoder. The authors explore three initialization strategies: the raw class name, a prompt (“a video of a person {action}”), and an average of both. Experiments show that the prompt‑based embedding consistently yields the best performance, likely because it aligns more closely with CLIP’s pretraining objective.
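The three text-initialization strategies can be sketched as follows. This is a minimal illustration, not the paper's code: the `encode_text` function here is a mock stand-in for CLIP's frozen text encoder (a deterministic unit vector keyed on the string), and the 512-dim size matches CLIP ViT-B but is otherwise an assumption.

```python
import numpy as np

EMBED_DIM = 512  # CLIP ViT-B text embedding size (assumed here)

def encode_text(text: str, dim: int = EMBED_DIM) -> np.ndarray:
    """Mock stand-in for CLIP's frozen text encoder: a deterministic
    unit vector keyed on the input string (illustration only)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def class_embeddings(class_names, strategy="prompt"):
    """The three initialization strategies described in the paper."""
    embs = []
    for name in class_names:
        raw = encode_text(name)
        prompted = encode_text(f"a video of a person {name}")
        if strategy == "raw":
            e = raw
        elif strategy == "prompt":          # best-performing variant
            e = prompted
        else:                               # "mixed": average of both
            e = (raw + prompted) / 2.0
            e = e / np.linalg.norm(e)
        embs.append(e)
    return np.stack(embs)                   # (num_classes, EMBED_DIM)

W = class_embeddings(["long jump", "pole vault"], strategy="prompt")
```

These per-class unit vectors then serve as the frozen classifier weights against which video embeddings are scored.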
Third, TOAD adds a future‑anticipation branch. A separate textual prompt (“a video of a person {action} in the future”) generates embeddings for the next action. Because the visual appearance of an ongoing action differs from its future continuation, the video embedding is passed through an additional fully‑connected layer with ReLU before computing the dot product with the future‑action text embedding. This yields a second set of logits for future prediction, although in the reported experiments the weighting factor λ for the future loss is set to zero, focusing training on the current‑action loss.
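The future-anticipation branch reduces to a projection plus a dot product. The sketch below uses numpy with random placeholder embeddings and a hypothetical FC weight matrix (`W_fc`, `b_fc`); the class count of 20 is illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 512

# Placeholder inputs: one video embedding from the transformer encoder,
# and per-class text embeddings for the "future" prompt
# ("a video of a person {action} in the future").
video_emb = rng.standard_normal(EMBED_DIM)
future_text = rng.standard_normal((20, EMBED_DIM))   # 20 classes (illustrative)

# Hypothetical FC + ReLU that maps the current-action video embedding
# into "future" space before scoring, as the paper describes.
W_fc = rng.standard_normal((EMBED_DIM, EMBED_DIM)) * 0.02
b_fc = np.zeros(EMBED_DIM)

future_emb = np.maximum(video_emb @ W_fc + b_fc, 0.0)  # ReLU
future_logits = future_text @ future_emb               # (20,) future-action scores
```

The extra projection is what lets a single video embedding produce two distinct sets of logits: one against the current-action prompts and one against the future-action prompts.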
During training, both CLIP visual and textual encoders are frozen; only the transformer encoder and the small FC layer for future anticipation are updated using cross‑entropy loss. The probability distribution is obtained by scaling the logits with the temperature τ learned from CLIP. The total loss is a weighted sum of the current‑action and future‑action cross‑entropy terms.
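The objective can be written out in a few lines. The logit values below are made-up placeholders; `tau = 0.07` is a typical CLIP temperature, assumed here for illustration, and `lam = 0.0` reflects the paper's reported setting where the future-loss weight λ is zero.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(logits, label, tau):
    """Cross-entropy on temperature-scaled logits (tau from CLIP)."""
    p = softmax(logits / tau)
    return -np.log(p[label])

# Illustrative logits over 5 classes (placeholder values, not from the paper).
cur_logits = np.array([2.0, 0.5, -1.0, 0.1, 0.0])
fut_logits = np.array([0.2, 1.5, -0.3, 0.0, 0.4])
tau = 0.07   # typical CLIP learned temperature (assumed)
lam = 0.0    # lambda = 0 in the reported experiments

# Total loss: weighted sum of current- and future-action cross-entropy.
total = cross_entropy(cur_logits, 0, tau) + lam * cross_entropy(fut_logits, 1, tau)
```

With λ = 0 the future branch still produces logits at inference time, but gradients flow only from the current-action term.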
The authors evaluate TOAD on two widely used OAD benchmarks: THUMOS14 and TVSeries. On THUMOS14, TOAD achieves 82.46% mean average precision (mAP), surpassing recent transformer‑based methods such as OadTR, LightTR, LSTR, and TesTra. Importantly, TOAD also establishes new baselines for zero‑shot and few‑shot OAD. In zero‑shot mode, only the textual prompts are used without any fine‑tuning, yet the model attains competitive mAP. In few‑shot experiments (1, 5, and 10 labeled videos per class), TOAD quickly closes the gap to fully supervised performance, demonstrating strong data efficiency.
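Zero-shot inference in this setup amounts to cosine similarity between the video embedding and the frozen prompt embeddings, with no fine-tuning. A minimal sketch with hand-made toy embeddings (3 classes, 8 dimensions, chosen only for illustration):

```python
import numpy as np

def zero_shot_classify(video_emb, text_embs):
    """Frozen text embeddings act as classifier weights: the predicted
    class is the prompt with the highest cosine similarity."""
    v = video_emb / np.linalg.norm(video_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = t @ v
    return int(np.argmax(sims)), sims

# Toy 3-class example (illustrative embeddings, not CLIP outputs).
text_embs = np.eye(3, 8)                  # one "prompt embedding" per class
video_emb = np.array([0.1, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
pred, sims = zero_shot_classify(video_emb, text_embs)   # pred == 1
```

Few-shot operation differs only in that the transformer encoder is briefly fine-tuned on the handful of labeled videos; the text-derived classifier weights stay frozen throughout.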
A comprehensive ablation study dissects the contributions of each component. The prompt‑based textual initialization outperforms raw class‑name embeddings and a simple average of both. Adding the future‑anticipation branch yields modest gains, confirming that the model can benefit from predicting upcoming actions, though the primary performance boost comes from the text‑driven classifier. Varying the depth and head count of the transformer shows that the chosen 6‑layer, 12‑head configuration balances accuracy and computational cost effectively.
Key insights from the work include: (1) Leveraging frozen CLIP textual embeddings as classifier weights provides a lightweight yet powerful way to align visual features with semantic concepts, eliminating the need for large contrastive loss matrices; (2) Prompt engineering is crucial—well‑crafted natural‑language prompts bridge the gap between CLIP’s pretraining and downstream OAD tasks; (3) The architecture remains fully RGB‑only, avoiding the computational burden of optical‑flow extraction while still achieving state‑of‑the‑art results; (4) The method naturally supports zero‑shot and few‑shot scenarios, reducing reliance on extensive frame‑level annotations.
In conclusion, TOAD demonstrates that vision‑language models can be efficiently repurposed for online action detection, achieving superior accuracy with significantly lower computational demands. The paper opens avenues for future research such as richer multimodal prompting, continual learning on unlabelled video streams, and deployment on edge devices where real‑time constraints are stringent.