Moment and Highlight Detection via MLLM Frame Segmentation
Detecting video moments and highlights from natural-language queries has been unified by transformer-based methods. Other works use generative Multimodal LLMs (MLLMs) to predict moments and/or highlights as textual timestamps, exploiting their reasoning capability. While effective, text-based generation cannot provide direct gradients for frame-level predictions because the model only emits language tokens. Recent Reinforcement Learning (RL) methods attempt to address this issue; we instead propose a novel approach that applies segmentation objectives directly to the LLM’s output tokens. The LLM is fed a fixed number of frames alongside a prompt that instructs it to output a contiguous sequence of “0” and/or “1” characters, one character per frame. These “0”/“1” characters benefit from the LLM’s inherent language capability while also acting as background and foreground probabilities, respectively. Training employs segmentation losses on these probabilities alongside the standard causal LM loss. At inference, beam search generates a sequence and its logits, which serve as moments and saliency scores, respectively. Despite sampling only 25 frames, less than half of comparable methods, our method achieves strong highlight detection (56.74 HIT@1) on QVHighlights. Additionally, our efficient method scores above the baseline (35.28 MAP) for moment retrieval. Empirically, the segmentation losses provide a stable complementary learning signal even when the causal LM loss plateaus.
💡 Research Summary
The paper tackles the unified problem of video moment retrieval (MR) and highlight detection (HD) from natural‑language queries by leveraging a multimodal large language model (MLLM). Existing transformer‑based unified methods rely on custom prediction heads and IoU‑based losses, while recent MLLM approaches generate timestamps as textual tokens. The latter enjoy the reasoning power of LLMs but cannot provide frame‑level gradients because the output is a sequence of language tokens, leaving the training objective disconnected from the visual signal. Reinforcement‑learning (RL) methods have been proposed to bridge this gap, yet they require carefully designed reward functions and are computationally intensive.
The authors propose a novel “frame‑segmentation” formulation. A video is uniformly sampled to a small number of frames (f = 25). Each frame is processed by a visual encoder (SigLIP) and a vision tokenizer (Perceiver Resampler) to obtain a fixed‑length token sequence (l = 128 tokens per frame). These visual tokens are interleaved with a textual prompt that instructs the LLM to output a binary string of length f, where each character is either “0” (background) or “1” (foreground). In the BLIP‑3 tokenizer, “0” and “1” correspond to single token IDs (29900 and 29896), guaranteeing a one‑to‑one mapping between frames and output tokens.
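The one-to-one frame/token correspondence can be sketched as follows. The token IDs are the BLIP‑3 tokenizer values quoted above; the helper names are illustrative, not taken from the paper's code.

```python
# Sketch of the frame <-> token mapping: each sampled frame is labeled by
# exactly one single-token character, "0" (background) or "1" (foreground).
# Token IDs 29900/29896 are the BLIP-3 values cited in the summary.
TOKEN_ID = {"0": 29900, "1": 29896}
ID_TO_CHAR = {v: k for k, v in TOKEN_ID.items()}

def mask_to_token_ids(frame_mask):
    """Per-frame binary labels (0=background, 1=foreground) -> LLM target token IDs."""
    return [TOKEN_ID[str(int(m))] for m in frame_mask]

def token_ids_to_mask(token_ids):
    """Decoded LLM output token IDs -> per-frame binary labels."""
    return [int(ID_TO_CHAR[t]) for t in token_ids]

# 25 sampled frames; frames 8-14 fall inside the queried moment
mask = [0] * 8 + [1] * 7 + [0] * 10
ids = mask_to_token_ids(mask)
assert len(ids) == 25 and token_ids_to_mask(ids) == mask
```

Because each character maps to a single token ID, the model's logits at every output position correspond directly to exactly one frame, which is what makes frame-level supervision possible.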
During training, the model is optimized with a weighted sum of four losses: (1) the standard causal language‑model loss (L_lm) to preserve the LLM’s generative abilities; (2) binary cross‑entropy (L_bce) on the per‑frame foreground probabilities, with a positive class weight of 2.3378 to counter the strong background bias; (3) Tversky loss (L_tv) with β = 0.7 to favor recall and penalize false negatives; and (4) generalized Dice loss (L_gd) to automatically balance intra‑batch class imbalance. The segmentation losses are initially set to zero and linearly ramped up after a warm‑up period, allowing the model to first learn language modeling before receiving frame‑level supervision. The final loss is L = w_lm L_lm + w_bce L_bce + w_tv L_tv + w_gd L_gd.
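The combined objective can be sketched in NumPy as below. The warm-up length, ramp schedule, and the assumption that per-frame foreground probabilities are already extracted from the token logits are illustrative; the summary does not specify these details.

```python
import numpy as np

def bce_loss(p, y, pos_weight=2.3378, eps=1e-7):
    """Binary cross-entropy with the positive-class weight quoted in the summary."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(pos_weight * y * np.log(p) + (1 - y) * np.log(1 - p))

def tversky_loss(p, y, beta=0.7, eps=1e-6):
    """Tversky loss: beta=0.7 weights false negatives more, favoring recall."""
    tp = (p * y).sum()
    fp = (p * (1 - y)).sum()
    fn = ((1 - p) * y).sum()
    return 1 - (tp + eps) / (tp + (1 - beta) * fp + beta * fn + eps)

def generalized_dice_loss(p, y, eps=1e-6):
    """Two-class generalized Dice: class weights ~ 1 / (class volume)^2."""
    probs = np.stack([1 - p, p])    # background, foreground probabilities
    labels = np.stack([1 - y, y])
    w = 1.0 / (labels.sum(axis=1) ** 2 + eps)
    num = (w * (probs * labels).sum(axis=1)).sum()
    den = (w * (probs + labels).sum(axis=1)).sum()
    return 1 - 2 * num / (den + eps)

def seg_ramp(step, warmup=500, ramp=500):
    """Segmentation weight: zero during warm-up, then a linear ramp to 1."""
    return 0.0 if step < warmup else min(1.0, (step - warmup) / ramp)

def total_loss(lm, p, y, step, w_lm=1.0, w_bce=1.0, w_tv=1.0, w_gd=1.0):
    """L = w_lm L_lm + ramp * (w_bce L_bce + w_tv L_tv + w_gd L_gd)."""
    s = seg_ramp(step)
    return w_lm * lm + s * (w_bce * bce_loss(p, y)
                            + w_tv * tversky_loss(p, y)
                            + w_gd * generalized_dice_loss(p, y))
```

Early in training `seg_ramp` returns 0, so only the causal LM loss is active, matching the described schedule of learning the output format before receiving frame-level supervision.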
Inference proceeds by beam‑search decoding of the binary sequence. The logits for the “1” token at each position are interpreted as saliency scores for HD, while contiguous runs of “1” tokens define temporal windows that are mapped back to timestamps for MR. Because only 25 frames are sampled from a 150‑second video, the authors linearly interpolate between frame scores to obtain per‑second highlight scores, handling videos whose duration is not an exact multiple of the sampling interval.
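The post-processing steps can be sketched as below. The exact frame-to-timestamp convention and interpolation grid are assumptions for illustration; only the overall scheme (runs of “1” become windows, frame scores are interpolated to per-second saliency) comes from the summary.

```python
import numpy as np

def runs_to_moments(mask, duration):
    """Map contiguous runs of 1s over f frames to [start_s, end_s) windows.

    Assumes frame i covers the interval [i, i+1) * duration / f (illustrative).
    """
    f = len(mask)
    moments, start = [], None
    for i, m in enumerate(list(mask) + [0]):  # sentinel 0 closes a trailing run
        if m and start is None:
            start = i
        elif not m and start is not None:
            moments.append((start * duration / f, i * duration / f))
            start = None
    return moments

def per_second_saliency(frame_scores, duration):
    """Linearly interpolate f per-frame scores to one score per second."""
    frame_times = np.linspace(0, duration, len(frame_scores))
    seconds = np.arange(duration + 1)
    return np.interp(seconds, frame_times, frame_scores)
```

With 25 frames over a 150-second video, a run spanning frames 8-14 maps to roughly the 48-90 s window, and the interpolated saliency curve has one value per second regardless of whether the duration divides evenly by the frame count.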
Experiments are conducted on QVHighlights, the only benchmark that provides both moment boundaries and fine‑grained highlight scores. The dataset contains 7,218 training videos (≈150 s each) with soft labels at 2‑second intervals and a background‑to‑foreground ratio of roughly 70:30. The visual encoder is frozen, while the LLM is fine‑tuned using PiSSA and rsLoRA, updating only 135 M parameters (≈1.35 % of the full 4.36 B‑parameter model). Training runs for 11 epochs on a single NVIDIA H100 GPU, taking about 9 h 20 min total.
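A PiSSA + rsLoRA setup of the kind described above could look like the following. Whether the authors used the Hugging Face `peft` library is an assumption, and the rank and target modules are placeholders, not values from the paper.

```python
# Hypothetical parameter-efficient fine-tuning config (assumes Hugging Face peft).
# Rank, alpha, and target_modules are illustrative placeholders.
from peft import LoraConfig

lora_cfg = LoraConfig(
    r=64,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    init_lora_weights="pissa",  # PiSSA: init adapters from principal singular components
    use_rslora=True,            # rsLoRA: rank-stabilized scaling alpha / sqrt(r)
)
# model = get_peft_model(base_llm, lora_cfg)  # only adapter parameters are trainable
```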
Results show that the proposed method achieves a highlight HIT@1 of 56.74, only 0.36 points below the strongest RL‑based baseline (TempSamp‑R1, 57.10) while using less than half the number of frames (25 vs. 60+). For moment retrieval, the model attains a MAP of 35.28, surpassing the Moment‑DETR baseline (33.12) and approaching the performance of larger SFT/RL models. Importantly, the segmentation losses continue to decrease even after the causal LM loss plateaus, confirming that they provide a stable auxiliary learning signal.
The paper’s contributions are threefold: (1) introducing a frame‑wise binary mask output for MLLMs, enabling direct frame‑level supervision; (2) demonstrating that a combination of BCE, Tversky, and Dice losses together with the LM loss yields robust training dynamics; and (3) showing that strong MR and HD performance can be achieved with a very compact temporal representation, dramatically reducing computational cost.
Limitations include the coarse temporal granularity (≈6 s between sampled frames), which may miss very short moments, and the decision to keep the visual encoder frozen, potentially limiting the model’s ability to adapt to video‑specific visual patterns. Future work could explore denser or adaptive frame sampling, joint fine‑tuning of the visual backbone, and scaling to larger LLMs or multimodal pre‑training on video data to further close the gap with RL‑based methods while preserving efficiency.