Beyond Closed-Pool Video Retrieval: A Benchmark and Agent Framework for Real-World Video Search and Moment Localization
Traditional video retrieval benchmarks focus on matching precise descriptions to closed video pools, failing to reflect real-world searches characterized by fuzzy, multi-dimensional memories on the open web. We present \textbf{RVMS-Bench}, a comprehensive benchmark for evaluating real-world video memory search. It consists of \textbf{1,440 samples} spanning \textbf{20 diverse categories} and \textbf{four duration groups}, sourced from \textbf{real-world open-web videos}. RVMS-Bench uses a hierarchical description framework encompassing \textbf{Global Impression, Key Moment, Temporal Context, and Auditory Memory} to mimic realistic multi-dimensional search cues, with all samples strictly verified via a human-in-the-loop protocol. We further propose \textbf{RACLO}, an agentic framework that employs abductive reasoning to simulate the human ``Recall-Search-Verify'' cognitive process, effectively addressing the challenge of searching for videos via fuzzy memories in the real world. Experiments reveal that existing MLLMs remain markedly deficient at real-world video retrieval and moment localization from fuzzy memories. We believe this work will advance the robustness of video retrieval in real-world unstructured scenarios.
💡 Research Summary
This paper addresses a critical gap in video retrieval research: existing benchmarks focus on matching precise textual descriptions to short clips within closed candidate pools, which does not reflect how users actually search for videos on the open web. In real‑world scenarios, users rely on fuzzy, multi‑dimensional memories that combine a global impression of the video, a specific key moment, temporal context, and auditory cues. To evaluate models under these realistic conditions, the authors introduce RVMS‑Bench, a new benchmark comprising 1,440 video samples drawn from YouTube across 20 diverse topics and four duration ranges (under 3 min, 3‑10 min, 10‑30 min, 30‑60 min). Each sample is annotated with a hierarchical description set: Global Impression, Key Moment, Temporal Context, and Auditory Memory. The annotation pipeline uses Gemini 3 Pro for initial generation, followed by rigorous human‑in‑the‑loop verification by ten experts to eliminate hallucinations and ensure factual consistency.
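The four-cue annotation scheme can be pictured as a simple record per sample. This is an illustrative sketch only; the field names and the ground-truth format are assumptions, not the released schema.

```python
from dataclasses import dataclass

@dataclass
class RVMSSample:
    """One hypothetical RVMS-Bench sample with its hierarchical memory cues.

    Field names are illustrative guesses; the actual released schema may differ.
    """
    video_url: str          # source YouTube URL
    category: str           # one of the 20 topic categories
    duration_group: str     # "<3min", "3-10min", "10-30min", or "30-60min"
    global_impression: str  # G: coarse impression of the whole video
    key_moment: str         # K: description of the specific target moment
    temporal_context: str   # T: what happens around the key moment
    auditory_memory: str    # A: remembered speech or sound cues
    gt_timestamp: float     # assumed ground-truth moment time, in seconds
```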
RVMS‑Bench defines nine query types that vary in information density: single‑dimensional (G or K), dual‑modal (K+T, K+A, K+G), tri‑modal (K+G+T, K+G+A, K+T+A), and the full four‑modal combination (K+G+T+A). This systematic variation allows researchers to diagnose model robustness when faced with incomplete or conflicting cues.
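The nine query types follow a regular pattern: G and K alone, plus K combined with every non-empty subset of {G, T, A}. A short sketch makes the enumeration explicit (the tuple representation is my own, not the benchmark's notation):

```python
from itertools import combinations

def query_types():
    """Enumerate the nine RVMS-Bench query types as cue tuples.

    K anchors every multi-modal combination; G and K alone are the
    two single-dimensional types described in the paper.
    """
    types = [("G",), ("K",)]          # single-dimensional queries
    optional = ["G", "T", "A"]        # cues that can accompany K
    for r in (1, 2, 3):               # dual-, tri-, and four-modal
        for combo in combinations(optional, r):
            types.append(("K",) + combo)
    return types
```

Running `query_types()` yields exactly the nine combinations listed above, e.g. `("K", "T", "A")` for the K+T+A tri-modal query.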
Complementing the benchmark, the authors propose RACLO (Real‑world Abductive Cognitive Logic Retrieval), an agentic framework that mimics the human “Recall‑Search‑Verify” loop. RACLO first employs abductive reasoning to expand vague memory fragments into plausible search queries. A chain‑of‑thought prompt guides the model to infer likely video titles, tags, or concepts from the fragmented cues. The agent then follows a ReAct‑style “Observe‑Think‑Act” cycle, issuing queries to a web search engine, filtering results for accessible YouTube URLs, and downloading candidate videos.
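The Observe-Think-Act cycle described above can be sketched as a small loop. All helper callables here (`llm_propose_queries`, `web_search`, `is_accessible_youtube_url`) are hypothetical stand-ins for RACLO's actual tool calls, not an API the paper specifies:

```python
def retrieve_candidates(memory_fragments, llm_propose_queries, web_search,
                        is_accessible_youtube_url, max_steps=5):
    """Minimal sketch of a ReAct-style retrieval loop under assumed tools."""
    candidates, history = [], []
    for _ in range(max_steps):
        # Think: abductively expand the fuzzy memory fragments into
        # plausible search queries, conditioned on earlier observations.
        queries = llm_propose_queries(memory_fragments, history)
        for q in queries:
            # Act: issue the query to a web search engine.
            results = web_search(q)
            # Observe: keep only reachable YouTube URLs for download.
            hits = [r for r in results if is_accessible_youtube_url(r)]
            candidates.extend(hits)
            history.append((q, results))
        if candidates:          # stop once any candidate is found
            break
    return candidates
```

In practice the stopping criterion and query budget would be tuned; the point is the alternation between LLM reasoning and tool observation.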
Once candidates are collected, RACLO performs parallel verification and localization. For video‑level verification, it first checks exact URL matches; if the URL differs (e.g., due to re‑uploads), it samples 64 frames and the full audio track, feeding them to a multimodal LLM to assess alignment with the Global Impression. A second verification step uses Gemini 2.5 Pro for consensus. Simultaneously, for moment localization, the agent inputs the Key Moment, Temporal Context, and Auditory Memory into the model, which cross‑examines dense frame sequences and audio‑visual synchronization cues to predict the most relevant frame index. The predicted frame is then re‑validated against the ground‑truth frame using a final prompt. This dual‑track design ensures robustness against video modifications and noisy web data.
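The video-level verification order (cheap exact URL match first, content-based judgment as a fallback) can be expressed compactly. The helpers `sample_frames`, `extract_audio`, and `mllm_judge` are hypothetical placeholders for the paper's actual tooling:

```python
def verify_candidate(candidate_url, target_url, global_impression,
                     sample_frames, extract_audio, mllm_judge,
                     num_frames=64):
    """Sketch of two-stage video verification with assumed helper tools."""
    if candidate_url == target_url:
        return True  # fast path: exact URL match
    # Fallback for re-uploads with different URLs: judge whether the
    # candidate's content aligns with the Global Impression, using
    # 64 sampled frames plus the full audio track.
    frames = sample_frames(candidate_url, num_frames)
    audio = extract_audio(candidate_url)
    return mllm_judge(frames, audio, global_impression)
```

Only the fallback branch incurs the cost of downloading and judging the video, which matches the dual-track design's intent of staying robust to re-uploads without paying that cost on every candidate.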
Extensive experiments evaluate several state‑of‑the‑art multimodal large language models (including Gemini 3 Pro and GPT‑4V) across all nine query types. Results reveal that current models perform poorly, especially on queries involving auditory memory and on longer videos (30‑60 min). The findings highlight a substantial deficiency in long‑range temporal reasoning and multimodal integration. Moreover, the study demonstrates that traditional ID‑based matching is brittle in open‑world settings, whereas the proposed content‑based verification with multi‑model consensus provides a more realistic performance metric.
In summary, the paper makes two major contributions: (1) the release of RVMS‑Bench, the first benchmark explicitly designed for real‑world video memory search with hierarchical, multi‑modal annotations; and (2) the RACLO framework, which operationalizes abductive reasoning and a human‑like verification loop to tackle fuzzy memory retrieval and precise moment localization on the open web. The work establishes a new evaluation paradigm and underscores the need for future research on stronger multimodal reasoning, better long‑term temporal modeling, and more sophisticated web‑scale retrieval pipelines.