LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts
Long videos contain a vast amount of information, making video-text retrieval an essential and challenging task in multimodal learning. However, existing benchmarks suffer from limited video duration, low-quality captions, and coarse annotation granularity, which hinder the evaluation of advanced video-text retrieval methods. To address these limitations, we introduce LoVR, a benchmark specifically designed for long video-text retrieval. LoVR contains 467 long videos and 40,804 fine-grained clips with high-quality captions. To overcome the issue of poor machine-generated annotations, we propose an efficient caption generation framework that integrates automatic generation with a vision-language model (VLM), caption quality scoring, and dynamic refinement. This pipeline improves annotation accuracy while maintaining scalability. Furthermore, we introduce a semantic fusion method to generate coherent full-video captions without losing important contextual information. Our benchmark offers longer videos, more detailed captions, and a larger-scale dataset, presenting new challenges for video understanding and retrieval. Extensive experiments on various advanced embedding models demonstrate that LoVR is a challenging benchmark, revealing the limitations of current approaches and providing valuable insights for future research. We release the code and dataset at https://lovrbench.github.io/
💡 Research Summary
The paper introduces LoVR, a new benchmark specifically designed for long video-text retrieval. Existing datasets such as MSR-VTT, DiDeMo, and MSVD consist mainly of short clips (average length under one minute) with coarse, often machine-generated captions, which limits their ability to evaluate models on real-world, hour-long content. LoVR addresses these gaps by providing 467 videos longer than 15 minutes (average 26 minutes) and a total of 40,804 fine-grained clips (average 17.9 seconds).
Data construction follows a three-stage pipeline. First, videos are segmented into clips using PySceneDetect's ContentDetector, which places cut points wherever the inter-frame content difference exceeds a threshold (τ = 34). This handles high-dynamic videos (frequent scene changes) and low-dynamic videos (e.g., lectures and other largely static footage) differently by construction, preventing over-segmentation.
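The thresholding idea behind this segmentation step can be sketched in a few lines. The function below is a simplified stand-in, not PySceneDetect's actual implementation (ContentDetector compares frames in HSV space and supports minimum scene lengths); it only illustrates how an inter-frame difference threshold turns a frame sequence into clip boundaries.

```python
import numpy as np

def segment_by_content_change(frames, threshold=34.0):
    """Split a frame sequence into clips wherever the mean absolute
    inter-frame pixel difference exceeds `threshold`.

    Simplified sketch of threshold-based scene detection; PySceneDetect's
    ContentDetector uses a weighted HSV-space difference instead of raw
    pixel values.
    """
    boundaries = [0]
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(float) - frames[i - 1].astype(float)).mean()
        if diff > threshold:
            boundaries.append(i)  # content change large enough: start a new clip
    boundaries.append(len(frames))
    # Return (start, end) frame-index pairs, one per clip.
    return list(zip(boundaries[:-1], boundaries[1:]))
```

With the real library, the equivalent call is roughly `detect("video.mp4", ContentDetector(threshold=34))`, which returns scene boundaries as timecodes rather than frame indices.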
Second, a hybrid caption generation framework is employed. The state‑of‑the‑art vision‑language model Qwen2.5‑VL‑Instruct generates initial captions for each clip. These captions are automatically scored with EVQAScore, a quality estimator aligned with human judgments. Clips whose scores fall below a predefined threshold are sent to a human verification round, where annotators correct or rewrite the text. This loop balances scalability with high fidelity, dramatically reducing manual labor while preserving quality.
Third, full‑video captions are constructed by clustering the clip‑level captions and summarizing them through a semantic fusion method that preserves contextual information. A final human review ensures consistency across the dataset. The resulting captions are exceptionally rich: average clip caption length is 1,393 tokens, and full‑video captions average 106,075 tokens.
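A toy version of the grouping step can illustrate the idea. The sketch below greedily merges consecutive clip captions by word-level Jaccard similarity; the paper's actual semantic fusion would use learned representations and a model-based summarizer, so the similarity measure, threshold, and join-based "summary" here are all simplifying assumptions:

```python
def fuse_captions(clip_captions, sim_threshold=0.3):
    """Greedy sketch of clip-caption fusion: group consecutive captions
    whose word-level Jaccard similarity exceeds `sim_threshold`, then
    emit one representative caption per group.

    Toy illustration only: real semantic fusion would use sentence
    embeddings for similarity and an LLM for summarization.
    """
    def jaccard(a, b):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

    groups = []
    for cap in clip_captions:
        if groups and jaccard(groups[-1][-1], cap) > sim_threshold:
            groups[-1].append(cap)   # same semantic segment: merge
        else:
            groups.append([cap])     # new semantic segment
    # Taking the first caption per group stands in for summarization.
    return " ".join(group[0] for group in groups)
```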
Benchmark statistics show that LoVR far exceeds prior datasets in both video duration and caption density. Experiments with several leading retrieval models (CLIP4Clip, CLIP2TV, VSE++) reveal that performance drops by 10-20% relative to short-video benchmarks, highlighting three core challenges: (1) semantic sparsity over long temporal spans, (2) difficulty aligning long, nuanced captions with visual features, and (3) increased computational cost for feature extraction and indexing. Error analysis identifies theme mismatches, missing content, and temporal alignment errors as the most common failure modes.
The authors discuss limitations: the current release contains only English captions, lacks multimodal fusion of audio and OCR, and the clip boundaries may not always correspond to true semantic units. Future work is suggested in multilingual expansion, incorporation of additional modalities, and meaning‑driven dynamic clip re‑segmentation.
Overall, LoVR provides the first large‑scale, high‑quality benchmark for long‑form video‑text retrieval, introduces an efficient semi‑automatic captioning pipeline, and exposes critical gaps in existing retrieval models, thereby offering a valuable platform for advancing multimodal understanding of lengthy video content.