Unified Interactive Multimodal Moment Retrieval via Cascaded Embedding-Reranking and Temporal-Aware Score Fusion

Reading time: 5 minutes
...

📝 Original Info

  • Title: Unified Interactive Multimodal Moment Retrieval via Cascaded Embedding-Reranking and Temporal-Aware Score Fusion
  • ArXiv ID: 2512.12935
  • Date: 2025-12-15
  • Authors: Toan Le Ngo Thanh, Phat Ha Huu, Tan Nguyen Dang Duy, Thong Nguyen Le Minh, Anh Nguyen Nhu Tinh

📝 Abstract

The exponential growth of video content has created an urgent need for efficient multimodal moment retrieval systems. However, existing approaches face three critical challenges: (1) fixed-weight fusion strategies fail under cross-modal noise and ambiguous queries, (2) temporal modeling struggles to capture coherent event sequences while penalizing unrealistic gaps, and (3) systems require manual modality selection, reducing usability. We propose a unified multimodal moment retrieval system with three key innovations. First, a cascaded dual-embedding pipeline combines BEiT-3 and SigLIP for broad retrieval, refined by BLIP-2-based reranking to balance recall and precision. Second, a temporal-aware scoring mechanism applies exponential decay penalties to large temporal gaps via beam search, constructing coherent event sequences rather than isolated frames. Third, agent-guided query decomposition (GPT-4o) automatically interprets ambiguous queries, decomposes them into modality-specific sub-queries (visual/OCR/ASR), and performs adaptive score fusion, eliminating manual modality selection. Qualitative analysis demonstrates that our system effectively handles ambiguous queries, retrieves temporally coherent sequences, and dynamically adapts fusion strategies, advancing interactive moment search capabilities.
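
The temporal-aware scoring mechanism is the most algorithmic of the three innovations, so a minimal sketch may help fix ideas. This is a reconstruction under stated assumptions, not the paper's implementation: the decay rate `lam`, the beam width, and the way the exponential penalty multiplies each frame's relevance score are all illustrative choices.

```python
import math
from typing import List, Tuple

Frame = Tuple[float, float]  # (timestamp in seconds, relevance score)

def beam_search_sequences(candidates_per_event: List[List[Frame]],
                          beam_width: int = 5,
                          lam: float = 0.05) -> List[Tuple[float, List[Frame]]]:
    """Build temporally coherent frame sequences, one frame per sub-event.

    A partial sequence's score is the sum of its frames' relevance scores,
    but each newly appended frame is damped by exp(-lam * gap), where gap is
    the time distance to the previous frame, so unrealistically large jumps
    contribute almost nothing (the exponential decay penalty).
    """
    # Seed the beam with the candidates of the first sub-event.
    beam = [(score, [(t, score)]) for t, score in candidates_per_event[0]]
    beam = sorted(beam, key=lambda b: b[0], reverse=True)[:beam_width]

    for event_candidates in candidates_per_event[1:]:
        expanded = []
        for total, seq in beam:
            prev_t = seq[-1][0]
            for t, score in event_candidates:
                gap = abs(t - prev_t)
                damped = score * math.exp(-lam * gap)
                expanded.append((total + damped, seq + [(t, score)]))
        beam = sorted(expanded, key=lambda b: b[0], reverse=True)[:beam_width]
    return beam

# Toy usage: two sub-events, each with two candidate frames.
event_a = [(12.0, 0.91), (305.0, 0.95)]
event_b = [(18.5, 0.88), (600.0, 0.97)]
best_score, best_seq = beam_search_sequences([event_a, event_b])[0]
print(best_score, best_seq)  # picks 12.0 -> 18.5: nearby frames beat higher raw scores
```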


📄 Full Content

Unified Interactive Multimodal Moment Retrieval via Cascaded Embedding-Reranking and Temporal-Aware Score Fusion

Thanh Toan Le Ngo (1,5), Huu Phat Ha (1,5), Duy Tan Nguyen Dang (4), Minh Thong Nguyen Le (2,5), Tinh Anh Nguyen Nhu (3,5)
(1) University of Information Technology, Ho Chi Minh City, Vietnam; (2) International University, Ho Chi Minh City, Vietnam; (3) Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam; (4) AI VIET NAM; (5) Vietnam National University, Ho Chi Minh City, Vietnam
23521603@gm.uit.edu.vn, 22521067@gm.uit.edu.vn, nddtan2011@gmail.com, thongnlm29@mp.hcmiu.edu.vn, anh.nguyennhu2306@hcmut.edu.vn

Abstract

The exponential growth of video content has created an urgent need for efficient multimodal moment retrieval systems. However, existing approaches face three critical challenges: (1) fixed-weight fusion strategies fail under cross-modal noise and ambiguous queries, (2) temporal modeling struggles to capture coherent event sequences while penalizing unrealistic gaps, and (3) systems require manual modality selection, reducing usability. We propose a unified multimodal moment retrieval system with three key innovations. First, a cascaded dual-embedding pipeline combines BEiT-3 and SigLIP for broad retrieval, refined by BLIP-2-based reranking to balance recall and precision. Second, a temporal-aware scoring mechanism applies exponential decay penalties to large temporal gaps via beam search, constructing coherent event sequences rather than isolated frames. Third, agent-guided query decomposition (GPT-4o) automatically interprets ambiguous queries, decomposes them into modality-specific sub-queries (visual/OCR/ASR), and performs adaptive score fusion, eliminating manual modality selection. Qualitative analysis demonstrates that our system effectively handles ambiguous queries, retrieves temporally coherent sequences, and dynamically adapts fusion strategies, advancing interactive moment search capabilities.

Introduction

The exponential growth of video content across multiple domains has made efficient video retrieval a critical challenge. In 2022 alone, over 500 hours of new video were uploaded every minute to online platforms (Navarrete et al. 2025), a trend further accelerated by the emergence of new platforms such as TikTok and similar short-form video services. This content spans diverse domains, from educational lectures and tutorials to news broadcasts and entertainment, creating an increasingly heterogeneous and complex video ecosystem. Moreover, each video encodes information across multiple modalities: visual scenes depicting objects and actions, spoken dialogue and background audio, and textual information appearing on-screen (e.g., captions, signs, and UI elements) (Wan et al. 2025; Nguyen, Tran, and Quang-Hoang 2024). Real user queries are often free-form and unclear (Zamani et al. 2019). People rarely say which channel to search (visual, OCR, or ASR), and the quality of each channel can vary greatly (e.g., noisy audio, OCR mistakes). Simple, fixed fusion breaks under this ambiguity and cross-modal noise, and asking users to build queries themselves makes the system harder to use (Zamani et al. 2020).
This multimodal richness raises a fundamental question: how can we design a multimodal moment retrieval system that can understand and decompose users' ambiguous queries in natural language, then flexibly select and fuse modalities (visual/OCR/ASR) to return relevant results?

However, leveraging multiple modalities effectively is far from straightforward. Francis et al. (2019) demonstrated that, under background noise in audio tracks or erroneous OCR extractions, simple fusion strategies (e.g., averaging or Reciprocal Rank Fusion; see the sketch below) can actually degrade retrieval performance. Alternative methods segment videos into shots or keyframes and then embed each unit individually. This fine-grained indexing, which creates separate vectors for each scene or keyframe, improves retrieval of specific moments but requires processing substantially larger data volumes (Rossetto et al. 2021; Nguyen et al. 2025). Sun et al. (2020) emphasized the importance of jointly encoding multiple modalities. Similarly, Chen et al. (2024) introduced the VERIFIED benchmark and observed that many user queries remain rather coarse-grained, indicating the need for models capable of capturing more fine-grained video semantics.

Currently, temporal modeling methods in moment retrieval can generally be split into three main categories. The first category is fixed temporal windows, which are widely used in many Video Browser Showdown (VBS) systems but often struggle to handle events of varying durations. The second is attention-based methods, which apply temporal attention mechanisms to assign weights across time, yet often lack explicit strategies
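
For contrast, the Reciprocal Rank Fusion mentioned above fits in a few lines. The sketch below uses the standard formula score(d) = Σ_m 1/(k + rank_m(d)) with the commonly used constant k = 60, and toy frame IDs; it shows the failure mode the excerpt describes: a noisy OCR list that misses the relevant frame still casts an equal vote, so frames that merely appear in every list can outrank one that both reliable modalities put first.

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(ranked_lists: Dict[str, List[str]], k: int = 60) -> Dict[str, float]:
    """Standard RRF: score(d) = sum over lists of 1 / (k + rank(d)), ranks starting at 1."""
    fused = defaultdict(float)
    for modality, ranking in ranked_lists.items():
        for rank, frame_id in enumerate(ranking, start=1):
            fused[frame_id] += 1.0 / (k + rank)
    return dict(sorted(fused.items(), key=lambda kv: kv[1], reverse=True))

# Toy ranked lists per modality; the OCR list is "noisy" and misses the
# relevant frame f_42 entirely, yet still casts a full vote for other frames.
rankings = {
    "visual": ["f_42", "f_17", "f_88"],
    "asr":    ["f_42", "f_88", "f_17"],
    "ocr":    ["f_17", "f_88", "f_99"],   # OCR errors: f_42 not retrieved
}
print(reciprocal_rank_fusion(rankings))
# f_17 and f_88 outrank the relevant f_42, which the noisy OCR list missed.
```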

📸 Image Gallery

Fusion.png GPTExpand.png QueryExample.jpg Rerank.png Temporal.png pipeline.png

Reference

This content is AI-processed based on open-access arXiv data.
