This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding. While multimodal large language models (MLLMs) excel at various video understanding tasks, the recipes for optimizing them for VTG remain under-explored. In this paper, we present TimeLens, a systematic investigation into building MLLMs with strong VTG ability, along two primary dimensions: data quality and algorithmic design. We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks under strict quality criteria. Our analysis reveals dramatic model re-rankings compared to legacy benchmarks, confirming the unreliability of prior evaluation standards. We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training dataset. Building on this data foundation, we conduct in-depth explorations of algorithmic design principles, yielding a series of meaningful insights and effective yet efficient practices. These include interleaved textual encoding for time representation, a thinking-free reinforcement learning with verifiable rewards (RLVR) approach as the training paradigm, and carefully designed recipes for RLVR training. These efforts culminate in the TimeLens models, a family of MLLMs that achieve state-of-the-art VTG performance among open-source models and even surpass proprietary models such as GPT-5 and Gemini-2.5-Flash. All code, data, and models will be released to facilitate future research.
Figure 1. We systematically explore the key factors for building performant video temporal grounding models, dissecting our efforts along two primary dimensions: data quality and algorithmic design. For data quality, we focus on benchmark diagnosis, benchmark refinement, and creating a reliable evaluation suite. For algorithmic design, we study various aspects including time encoding, training recipes, and optimization strategies to establish best practices and develop the TimeLens models.

Recent multimodal large language models (MLLMs) have excelled at understanding "what" happens in a video, yet they largely fail when asked "when." This limitation is central to the task of video temporal grounding (VTG). The challenge is twofold: 1) VTG necessitates a fundamental shift from coarse semantic aggregation to fine-grained, time-aware perception; 2) distinguishing queried events requires modeling long-term visual dynamics over appearance-centric features, which are notoriously difficult to annotate and learn. As MLLMs become integral to perception [42,43,53,56] and reasoning systems [6,13,36,38,39,64], equipping them with robust temporal awareness is no longer optional, but essential [26,34,45,48,52].

This work focuses on post-training MLLMs with leading temporal grounding ability. This investigation is a straightforward extension given the recent progress in pretrained foundation MLLMs [2,3,51]. Unlike the heavily studied general understanding tasks, recipes for fine-grained grounding tasks have yet to be established. This paper aims to systematically investigate the core components of building time-aware MLLMs (Fig. 1) along two primary dimensions: data quality and algorithmic design.

Our investigation starts by exposing critical flaws in evaluation benchmarks. We find that existing VTG benchmarks [11,25,27] not only lack a clear comparison between leading proprietary and open-source models but are also rife with low-quality queries and erroneous timestamps. This noisy data may render current leaderboards misleading and misguide research efforts. To rectify this, we undertook a meticulous data overhaul. We first defined strict criteria for query and timestamp quality, in terms of uniqueness, existence, clarity, and accuracy. We then manually re-annotated three popular datasets (Charades-STA [11], ActivityNet Captions [25], QVHighlights [27]) to create TimeLens-Bench, a rigorously cross-validated benchmark. As shown in Fig. 2a, the necessity of this correction is confirmed by a dramatic re-ranking of models on TimeLens-Bench compared to their performance on legacy benchmarks, proving the unreliability of prior evaluation standards. Beyond evaluation, we also fix the noisy training data via automated re-annotation, yielding TimeLens-100K, a large-scale, high-quality training dataset.
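For concreteness, VTG benchmarks such as Charades-STA are commonly scored as Recall@1 at fixed temporal-IoU thresholds, and the re-rankings in Fig. 2a are reported in such terms. The sketch below is an illustrative implementation of this standard metric, not the released TimeLens-Bench evaluation code; the function names and thresholds are our own assumptions.

```python
# Illustrative sketch of standard VTG scoring: temporal IoU between a predicted
# [start, end] window and the ground truth, aggregated as Recall@1 at IoU thresholds.
from typing import Dict, List, Tuple


def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """Intersection-over-union of two [start, end] segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0


def recall_at_1(preds: List[Tuple[float, float]],
                gts: List[Tuple[float, float]],
                thresholds: Tuple[float, ...] = (0.3, 0.5, 0.7)) -> Dict[str, float]:
    """Fraction of queries whose top-1 predicted window reaches each IoU threshold."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return {f"R@1,IoU>={t}": sum(i >= t for i in ious) / len(ious) for t in thresholds}


# Example: a single query whose prediction overlaps the ground truth with IoU = 0.85.
print(recall_at_1([(12.0, 20.5)], [(11.0, 21.0)]))
```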
With our curated data suite as a solid foundation, we conduct in-depth explorations of algorithmic design principles from three key aspects. First, for timestamp representation, we find that a simple yet effective interleaved textual encoding strategy outperforms more complex alternatives. Second, we determine that VTG is fundamentally a perception-driven task, and thus employ a purely thinking-free reinforcement learning with verifiable rewards (RLVR) approach that outperforms other training paradigms in both efficiency and performance. Finally, our detailed analysis of RLVR training reveals two key recipes for both performance and training efficiency: (1) early stopping when reward metrics plateau, and (2) difficulty-based data sampling. By integrating these insights and design principles, we ultimately develop the TimeLens models, a family of MLLMs with superior VTG capability. As shown in Fig. 2b, our model achieves state-of-the-art performance among open-source models and even surpasses proprietary models such as GPT-5 and Gemini-2.5-Flash.
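To make these design choices concrete, the following sketch illustrates, under our own assumptions rather than the authors' released implementation, three of the ingredients described above: a plain-text interleaving of timestamps with per-frame placeholders, a verifiable reward that scores a textual answer by its temporal IoU with the ground-truth span, and a difficulty-based filter over training samples. The answer format "<start> to <end>" and all names here are hypothetical.

```python
# Minimal sketch (assumptions, not the authors' implementation) of the ingredients above.
import re
from typing import List, Optional, Tuple


# (1) Interleaved textual time encoding: plain-text timestamps interleaved with
#     per-frame placeholder tokens in the prompt (the exact format is hypothetical).
def interleave_timestamps(frame_tokens: List[str], frame_times: List[float]) -> str:
    return " ".join(f"{t:.1f}s {tok}" for tok, t in zip(frame_tokens, frame_times))


# (2) Verifiable reward for thinking-free RLVR: parse "<start> to <end>" from the
#     model's answer and return its temporal IoU with the ground truth (0 if unparsable).
TIME_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(?:to|-)\s*(\d+(?:\.\d+)?)")


def parse_span(answer: str) -> Optional[Tuple[float, float]]:
    m = TIME_PATTERN.search(answer)
    if m is None:
        return None
    start, end = float(m.group(1)), float(m.group(2))
    return (start, end) if end > start else None


def verifiable_reward(answer: str, gt: Tuple[float, float]) -> float:
    span = parse_span(answer)
    if span is None:
        return 0.0
    inter = max(0.0, min(span[1], gt[1]) - max(span[0], gt[0]))
    union = max(span[1], gt[1]) - min(span[0], gt[0])
    return inter / union if union > 0 else 0.0


# (3) Difficulty-based sampling: keep samples whose base-model pass rate is neither
#     near 0 (hopeless) nor near 1 (trivial), so RLVR gradients stay informative.
def difficulty_filter(pass_rates: List[float], low: float = 0.1, high: float = 0.9) -> List[int]:
    return [i for i, r in enumerate(pass_rates) if low <= r <= high]
```

Because the reward is computed directly from the parsed span and the ground truth, it is verifiable without a learned judge, which is what makes a thinking-free RLVR setup of this kind feasible.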
Through these efforts, we identified and addressed long-overlooked quality issues in existing datasets, and derived a series of insights and best practices in algorithmic design. We hope TimeLens can serve as a solid foundation, in both data curation and algorithmic design principles, for future research on building MLLMs with strong VTG capabilities. Our code, data, and models will be open-sourced.
Temporal Grounding Datasets. Numerous VTG datasets have been proposed, spanning diverse domains [14,22,25,27,40,44,49]. Early works [11,37,63] trained and evaluated models on the training and test splits of a single benchmark [25,44] to assess their ability to fit a single-domain data distribution. More recent works [17,36,45] aggregate large, diverse corpora composed of multiple source datasets [1,22,35,40,49,58] for training, and use a suite of distinct benchmarks [11,25,27] to probe the models’ real-world cross-domain generalizability.
However, the critical issue of data quality has been overlooked.

Figure 3. Qualitative examples of errors and fixes.