Video Understanding: Through a Temporal Lens
This thesis explores the central question of how to leverage temporal relations among video elements to advance video understanding. Addressing the limitations of existing methods, the work presents a five-fold contribution: (1) an automatic annotation framework that utilizes large vision-language models and a noise-robust contrastive learning objective with a subtractive angular margin; (2) a parameter-efficient fine-tuning strategy using “recurrent adapters” to capture temporal dynamics in low-data regimes; (3) the integration of State Space Layers (SSL) for efficient long-form video modeling, supported by the introduction of two new long-term benchmarks for egocentric and feature-length content; (4) a novel contrastive learning framework designed to explicitly model fine-grained relations between motions and video moments; and (5) a comprehensive empirical study on Large Vision-Language Models (LVLMs) that identifies the visual-language interface as a bottleneck for temporal reasoning, leading to a new “temporal-oriented recipe” for upscaled video understanding. Collectively, these contributions demonstrate that explicit temporal modeling significantly enhances a model’s ability to represent and reason about the fluid nature of video content.
💡 Research Summary
This dissertation tackles the central question of how to exploit temporal relations among video elements to advance video understanding. It delivers five major contributions, each addressing a distinct limitation of current approaches.
- Automatic Annotation with Noise‑Robust Contrastive Learning – Leveraging large vision‑language models (LVLMs), the author generates textual descriptions for videos automatically. Because these annotations are noisy, a novel contrastive objective with a subtractive angular margin is introduced. The margin penalizes perfect similarity between video and caption embeddings, reflecting the intuition that the two modalities overlap but are not identical, and empirically improves recall by about 7 % over standard contrastive losses.
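The subtractive angular margin can be sketched as a small modification to a standard InfoNCE loss: the positive video–caption pair's angle has the margin subtracted before taking the cosine, so the objective is optimal when the pair is close but not identical. This is a minimal numpy sketch under that reading; the thesis's exact formulation may differ.

```python
import numpy as np

def subtractive_margin_nce(video_emb, text_emb, margin=0.2, temperature=0.07):
    """InfoNCE-style loss with a subtractive angular margin on positives.

    Rows of each matrix are L2-normalised embeddings; row i of video_emb is
    paired with row i of text_emb. Because the positive logit is
    cos(theta - margin), the loss is minimised when the video-caption angle
    equals the margin, not zero -- "overlapping but not identical".
    (Illustrative sketch, not the thesis's exact objective.)
    """
    sims = video_emb @ text_emb.T                         # (B, B) cosine sims
    theta = np.arccos(np.clip(np.diag(sims), -1.0, 1.0))  # positive-pair angles
    relaxed_pos = np.cos(theta - margin)                  # margin-shifted positives
    logits = sims / temperature
    np.fill_diagonal(logits, relaxed_pos / temperature)
    # Cross-entropy with the diagonal (true caption) as the target class.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Note that with `margin=0` this reduces exactly to the standard symmetric-free InfoNCE over one direction; the margin only reshapes where the positive term is optimal.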
- Parameter‑Efficient Fine‑Tuning via Recurrent Adapters (READ) – To avoid over‑fitting in low‑resource settings, only a small set of adapter modules is trained. READ equips these adapters with recurrent layers, enabling them to capture temporal dynamics while keeping the bulk of the transformer frozen. A partial video‑language alignment (PVLA) loss further aligns temporal semantics between modalities. READ‑based models outperform full‑parameter fine‑tuning when only 1 % of labeled data is available, achieving a 4.3 % accuracy gain.
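A READ-style adapter can be pictured as a bottleneck with a recurrent cell over the frame axis, added residually to frozen backbone features. The sketch below uses a plain tanh RNN cell and zero-initialised up-projection (so the adapter starts as an identity map); the cell type and names are illustrative assumptions, not the thesis's exact architecture.

```python
import numpy as np

class RecurrentAdapter:
    """Minimal recurrent-adapter sketch: down-project, recur over frames,
    up-project, and add back to the frozen features. Only these small
    matrices would be trained; the backbone stays frozen."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        self.down = rng.normal(0, 1.0 / np.sqrt(dim), (dim, bottleneck))
        self.w_h = rng.normal(0, 1.0 / np.sqrt(bottleneck), (bottleneck, bottleneck))
        self.up = np.zeros((bottleneck, dim))  # zero-init: adapter starts as identity

    def __call__(self, frames):
        # frames: (T, dim) frozen transformer features for T video frames.
        x = frames @ self.down
        h = np.zeros(x.shape[1])
        outs = []
        for t in range(x.shape[0]):
            # The recurrence lets the adapter accumulate temporal context
            # that the frozen, per-frame backbone features lack.
            h = np.tanh(x[t] + h @ self.w_h)
            outs.append(h)
        return frames + np.stack(outs) @ self.up  # residual connection
```

The zero-initialised up-projection is a common adapter trick: training starts from the frozen model's behaviour and drifts away only as far as the data warrants, which suits the low-resource setting READ targets.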
- State‑Space Layers for Long‑Form Video Modeling – Traditional self‑attention scales quadratically with sequence length, making it impractical for videos spanning minutes to hours. The author integrates a Gated State‑Space Multi‑Modal Transformer (GSMT) that processes densely sampled frames with linear complexity. Two new benchmarks, Ego‑QA (average 18 min) and MAD‑QA (up to 2 h), are introduced, featuring GPT‑4‑generated, highly challenging questions that require simultaneous summarization, comparison, and compression across long temporal horizons. On these benchmarks, the SSL‑augmented model surpasses vanilla transformers by 12‑18 % in accuracy, especially on composite reasoning tasks.
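The linear-complexity claim comes from the state-space recurrence itself: each frame updates a fixed-size hidden state once, so a sequence of T frames costs O(T·N) instead of attention's O(T²). A minimal diagonal-SSM scan, with illustrative per-channel parameters `a`, `b`, `c` (not the GSMT's actual parameterisation), looks like this:

```python
import numpy as np

def ssm_scan(u, a, b, c):
    """Linear-time scan of a diagonal state-space layer.

    Implements h_t = a * h_{t-1} + b * u_t and y_t = c * h_t elementwise,
    where u is a (T, N) sequence of frame features and a, b, c are (N,)
    per-channel parameters. One pass over T frames touches each frame once,
    so cost is O(T * N) rather than self-attention's O(T^2)."""
    T, N = u.shape
    h = np.zeros(N)
    y = np.empty_like(u)
    for t in range(T):
        h = a * h + b * u[t]   # state update (diagonal A -> elementwise)
        y[t] = c * h           # readout
    return y
```

With `|a| < 1` the state is an exponentially decaying summary of the past, which is what lets hour-long inputs fit in constant memory per step; gated variants like GSMT modulate this recurrence with learned gates.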
- Motion‑Aware Contrastive Learning for Fine‑Grained Temporal Relations – Existing contrastive methods focus on coarse video‑text alignment, neglecting the rich relations between individual motions or moments. The thesis proposes a contrastive framework that explicitly models pairwise motion relations and extends to moment‑level relations using triplet‑based and shuffle‑based contrastive objectives. This yields a 5.4 % increase in mean average precision on action recognition, moment detection, and video QA tasks, demonstrating that explicit temporal relation modeling makes representations more discriminative.
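A shuffle-based objective of the kind described can be sketched as a triplet loss: the video embedding should be closer to its moments encoded in the true temporal order than to the same moments shuffled. The position-weighted pooling below is an illustrative stand-in for the thesis's order-aware encoder, not its actual architecture.

```python
import numpy as np

def shuffle_triplet_loss(video_emb, moments, margin=0.2, seed=0):
    """Triplet-style shuffle objective (sketch).

    moments: (T, D) moment embeddings in their true temporal order.
    The positive is the ordered sequence, the negative the same moments
    shuffled; an order-sensitive pooling makes the two distinguishable."""
    rng = np.random.default_rng(seed)
    T = moments.shape[0]
    w = np.linspace(0.5, 1.5, T)[:, None]   # order-sensitive weights

    def encode(seq):
        z = (w * seq).mean(axis=0)
        return z / (np.linalg.norm(z) + 1e-8)

    pos = encode(moments)                       # true order
    neg = encode(moments[rng.permutation(T)])   # shuffled order
    v = video_emb / (np.linalg.norm(video_emb) + 1e-8)
    # Hinge: the ordered sequence must beat the shuffled one by the margin.
    return max(0.0, margin - (v @ pos - v @ neg))
```

An order-agnostic encoder (e.g. plain mean pooling) would make positive and negative identical and the loss uninformative, which is exactly why the objective forces temporal sensitivity into the representation.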
- Temporal‑Oriented Recipe for Scaling Large Vision‑Language Models (LVLMs) – An extensive empirical study reveals that the bottleneck for temporal reasoning in LVLMs lies in the intermediate interface between the visual encoder and the large language model. The author designs a temporal‑oriented training scheme and an upscaled interface that injects temporal cues directly into the LLM. When applied to standard video understanding benchmarks, this recipe improves LVLM performance by over 9 % relative to the best prior LVLMs.
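One simple way to "inject temporal cues at the interface", consistent with the description above, is to add sinusoidal encodings of each frame's timestamp to its visual tokens before projecting them into the LLM's embedding space. The projection and encoding scheme here are illustrative assumptions, not the thesis's published recipe.

```python
import numpy as np

def temporal_interface(visual_tokens, timestamps, llm_dim, seed=0):
    """Sketch of a temporal-oriented vision-language interface.

    visual_tokens: (T, D) per-frame features (D assumed even).
    timestamps:    (T,) real-valued frame times in seconds.
    Transformer-style sinusoidal encodings of the timestamps are added to
    the tokens, then a linear map projects into the LLM embedding space,
    so the LLM sees *when* each frame occurred, not just what it contains."""
    rng = np.random.default_rng(seed)
    T, D = visual_tokens.shape
    i = np.arange(D // 2)
    freqs = 1.0 / (10000.0 ** (2 * i / D))
    ang = timestamps[:, None] * freqs[None, :]
    te = np.concatenate([np.sin(ang), np.cos(ang)], axis=1)   # (T, D)
    proj = rng.normal(0, 1.0 / np.sqrt(D), (D, llm_dim))      # interface projection
    return (visual_tokens + te) @ proj   # token sequence the LLM consumes
```

The point of placing the cue at the interface, per the study's finding, is that the visual encoder and the LLM can both remain largely unchanged while temporal order still becomes visible to the language model.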
Collectively, the work shows that (i) automatic, noise‑robust annotation can supply abundant supervision; (ii) lightweight recurrent adapters enable effective transfer in data‑scarce regimes; (iii) state‑space layers make long‑form video processing tractable; (iv) contrastive learning of fine‑grained motion relations enriches temporal awareness; and (v) redesigning the vision‑language interface unlocks the full potential of large multimodal models for temporal reasoning. The thesis not only advances the state of the art across a suite of video tasks but also provides new benchmarks and methodological tools that will shape future research in video AI.