DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs
Large Language Models (LLMs) have recently been extended to the video domain, enabling sophisticated video-language understanding. However, existing Video LLMs often exhibit limitations in fine-grained temporal reasoning, restricting their ability to precisely attribute responses to specific video moments, especially under constrained supervision. We introduce DaMO, a data-efficient Video LLM explicitly designed for accurate temporal reasoning and multimodal understanding. At its core, the proposed Temporal-aware Fuseformer employs a hierarchical dual-stream architecture that progressively captures temporal dynamics within each modality and effectively fuses complementary visual and audio information. To further enhance computational efficiency, DaMO integrates a global residual that reduces spatial redundancy while preserving essential semantic details. We train DaMO via a structured four-stage progressive training paradigm, incrementally equipping the model with multimodal alignment, semantic grounding, and temporal reasoning capabilities. This work also contributes multiple datasets augmented from existing ones with LLM-generated temporally grounded QA pairs for tasks requiring temporal supervision. Comprehensive experiments on temporal grounding and video QA benchmarks demonstrate that DaMO consistently surpasses prior methods, particularly in tasks demanding precise temporal alignment and reasoning. Our work establishes a promising direction for data-efficient video-language modeling.
💡 Research Summary
DaMO (Data‑efficient Multimodal Orchestrator) addresses three persistent shortcomings of current video‑large language models (Video‑LLMs): insufficient multimodal integration, heavy reliance on massive pre‑training datasets, and loss of fine‑grained temporal information during spatial reduction. The core of DaMO is the Temporal‑aware Fuseformer (T‑Fuseformer), a hierarchical dual‑stream transformer that first refines each modality (visual and audio) with self‑attention and feed‑forward layers, then compresses the temporal token stream using a small set of learnable queries via cross‑attention. These queries serve three purposes: (i) discarding redundant temporal tokens, (ii) preserving salient moments, and (iii) producing a compact representation. After modality‑specific refinement, a set of “FUSION” queries attends jointly to the compressed visual and audio streams, integrating complementary information across multiple temporal scales.
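The query-based temporal compression described above can be made concrete with a minimal NumPy sketch of cross-attention pooling. Everything here is a simplifying assumption, not the paper's implementation: single-head attention without projection matrices, a shared query set for both modalities, and illustrative shapes.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress_temporal(tokens, queries):
    """Cross-attention pooling: K learnable queries attend over T temporal
    tokens and return a compact (K, d) summary, with K << T."""
    d = tokens.shape[-1]
    attn = softmax(queries @ tokens.T / np.sqrt(d))   # (K, T) attention weights
    return attn @ tokens                              # (K, d) compressed stream

rng = np.random.default_rng(0)
T, K, d = 64, 8, 32
visual = rng.standard_normal((T, d))                  # per-frame visual features
audio = rng.standard_normal((T, d))                   # temporally aligned audio features
queries = rng.standard_normal((K, d))                 # learnable temporal queries

vis_c = compress_temporal(visual, queries)            # (8, 32) compressed visual stream
aud_c = compress_temporal(audio, queries)             # (8, 32) compressed audio stream

# FUSION queries attend jointly over both compressed streams.
fusion_q = rng.standard_normal((K, d))
fused = compress_temporal(np.concatenate([vis_c, aud_c]), fusion_q)
print(fused.shape)  # (8, 32)
```

Because the queries are few and learned, the softmax concentrates each query on the temporal tokens most relevant to it, which is what lets the model discard redundancy while keeping salient moments.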
To keep spatial redundancy low without sacrificing global context, DaMO introduces a Global Residual mechanism. For each frame, the CLS token (visual) or mean‑pooled audio token is treated as a global feature and passed through a lightweight feed‑forward network. The remaining local tokens are down‑sampled with adaptive average pooling. The final representation is the sum of the pooled local features and the refined global residual, dramatically reducing the spatial dimension while preserving scene‑level semantics.
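A minimal sketch of the Global Residual computation, under two stated assumptions: token 0 of each frame is the CLS token, and fixed random weights stand in for the learned feed-forward network.

```python
import numpy as np

def global_residual(frame_tokens, out_tokens=4):
    """Sketch of the Global Residual: refine the global (CLS-like) token with a
    tiny FFN, adaptively average-pool the remaining local tokens, then sum.
    frame_tokens: (N, d) tokens for one frame; token 0 is assumed to be CLS."""
    cls, local = frame_tokens[0], frame_tokens[1:]
    d = frame_tokens.shape[-1]
    rng = np.random.default_rng(0)                    # fixed weights stand in for learned ones
    W1 = 0.05 * rng.standard_normal((d, d))
    W2 = 0.05 * rng.standard_normal((d, d))
    refined = np.maximum(cls @ W1, 0.0) @ W2          # two-layer FFN with ReLU, shape (d,)
    # "adaptive average pooling": split local tokens into out_tokens groups, mean each
    pooled = np.stack([c.mean(axis=0) for c in np.array_split(local, out_tokens)])
    return pooled + refined                           # (out_tokens, d), global residual added

frame = np.random.default_rng(1).standard_normal((1 + 196, 64))  # CLS + 14x14 patch tokens
out = global_residual(frame)
print(out.shape)  # (4, 64)
```

The spatial dimension drops from 196 local tokens to 4, while the broadcast-added global residual keeps scene-level semantics present in every surviving token.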
Temporal modeling is further enhanced by inserting explicit Temporal Embeddings, a blend of learnable positional vectors and fixed sinusoidal encodings, into both modalities before they are fed to the T‑Fuseformer. This lets the model capture both flexible (learned) and structured (sinusoidal) temporal relationships.
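The blend can be sketched as the sum of a learnable table and the standard Transformer-style sinusoidal encoding. The dimensions and the additive combination are illustrative assumptions; only the learned-plus-sinusoidal design is from the summary.

```python
import numpy as np

def sinusoidal(T, d):
    """Fixed sinusoidal positional encoding (Transformer-style)."""
    pos = np.arange(T)[:, None]                       # (T, 1) positions
    i = np.arange(d)[None, :]                         # (1, d) channel indices
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

T, d = 16, 32
# Learnable component: a trained parameter in the real model; random init here.
learnable = 0.02 * np.random.default_rng(0).standard_normal((T, d))
temporal_emb = learnable + sinusoidal(T, d)           # (T, d) blended embedding

# Added to the per-frame tokens of both modalities before the T-Fuseformer.
tokens = np.random.default_rng(1).standard_normal((T, d)) + temporal_emb
print(temporal_emb.shape)  # (16, 32)
```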
Training proceeds in four progressive stages: (1) Video‑Text Alignment aligns the fused multimodal embeddings with textual descriptions, establishing a basic cross‑modal grounding; (2) Representation Bridging projects these embeddings into the frozen LLM’s token space using a Q‑Former and LoRA adapters; (3) Temporal Perception Learning explicitly teaches event localization, ordering, and duration using newly created temporally‑grounded QA pairs generated by large language models; (4) Dialogue Tuning fine‑tunes the system on multi‑turn conversational data to improve temporal reasoning in dialogue contexts.
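The four-stage schedule can be written down as a simple configuration. The stage names and objectives follow the description above; the `modules` entry is stated in the summary only for stage 2, so treat the structure as an illustrative sketch rather than the authors' training script.

```python
# Four-stage progressive training schedule (objectives paraphrase the summary;
# module names like "q_former" are illustrative, not the authors' exact code).
STAGES = [
    {"stage": 1, "name": "video_text_alignment",
     "objective": "align fused multimodal embeddings with textual descriptions"},
    {"stage": 2, "name": "representation_bridging",
     "objective": "project embeddings into the frozen LLM's token space",
     "modules": ["q_former", "lora_adapters"]},  # stated in the summary
    {"stage": 3, "name": "temporal_perception_learning",
     "objective": "event localization, ordering, and duration",
     "data": "LLM-generated temporally grounded QA pairs"},
    {"stage": 4, "name": "dialogue_tuning",
     "objective": "multi-turn temporal reasoning in dialogue"},
]

for s in STAGES:
    print(s["stage"], s["name"])
```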
A key contribution is the creation of several temporally‑annotated QA datasets. Existing video‑language corpora are enriched by prompting LLMs to produce time‑grounded question‑answer pairs, providing supervision without costly human annotation.
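This augmentation pipeline amounts to prompting an LLM with timestamped captions. A hypothetical prompt template along these lines (the wording and function name are invented for illustration, not taken from the paper):

```python
def temporal_qa_prompt(caption: str, start_s: float, end_s: float) -> str:
    """Hypothetical prompt for turning a timestamped caption into a
    temporally grounded QA pair; illustrative wording, not the paper's prompt."""
    return (
        f"A video contains the event '{caption}' from {start_s:.1f}s to {end_s:.1f}s. "
        "Generate one question that asks when this event occurs, plus an answer "
        "that states the exact time span."
    )

prompt = temporal_qa_prompt("a dog catches a frisbee", 12.0, 15.5)
print(prompt)
```

Because the timestamps come from the existing annotations, the generated QA pairs are grounded by construction, and no human labeling is needed.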
Extensive experiments on benchmarks such as TVQA‑Temporal, Ego4D Temporal Grounding, and AVSD‑Temporal demonstrate that DaMO consistently outperforms prior methods (Video‑LLaMA, VTimeLLM, PLLaVA, etc.) in both temporal grounding accuracy and video question answering. Notably, DaMO retains high accuracy even when trained on less than 10% of the full dataset, confirming its data efficiency. Ablation studies reveal that each component (the global residual, temporal embeddings, query‑based compression, and the dual‑stream architecture) contributes measurably to the overall gains.
In summary, DaMO offers a practical, scalable solution for fine‑grained temporal reasoning in multimodal video contexts. By jointly optimizing spatial reduction, temporal modeling, and multimodal fusion, and by leveraging a staged training regime with automatically generated temporal supervision, DaMO lowers the barrier to building powerful video‑LLMs that can be deployed in real‑world applications such as education, surveillance, and multimedia retrieval.