Video-OPD: Efficient Post-Training of Multimodal Large Language Models for Temporal Video Grounding via On-Policy Distillation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Reinforcement learning has emerged as a principled post-training paradigm for Temporal Video Grounding (TVG) due to its on-policy optimization, yet existing GRPO-based methods remain fundamentally constrained by sparse reward signals and substantial computational overhead. We propose Video-OPD, an efficient post-training framework for TVG inspired by recent advances in on-policy distillation. Video-OPD optimizes trajectories sampled directly from the current policy, thereby preserving alignment between training and inference distributions, while a frontier teacher supplies dense, token-level supervision via a reverse KL divergence objective. This formulation preserves the on-policy property critical for mitigating distributional shift, while converting sparse, episode-level feedback into fine-grained, step-wise learning signals. Building on Video-OPD, we introduce Teacher-Validated Disagreement Focusing (TVDF), a lightweight training curriculum that iteratively prioritizes trajectories that are both teacher-reliable and maximally informative for the student, thereby improving training efficiency. Empirical results demonstrate that Video-OPD consistently outperforms GRPO while achieving substantially faster convergence and lower computational cost, establishing on-policy distillation as an effective alternative to conventional reinforcement learning for TVG.


💡 Research Summary

The paper addresses the limitations of existing post‑training methods for Temporal Video Grounding (TVG), namely the sparse‑reward problem of on‑policy reinforcement learning (specifically Group Relative Policy Optimization, GRPO) and the heavy computational burden caused by multiple rollouts. To overcome these issues, the authors propose Video‑OPD, an on‑policy distillation framework that combines the strengths of on‑policy sampling with dense, token‑level supervision from a fixed “frontier” teacher model.

In Video‑OPD, a student multimodal large language model (MLLM) first samples a trajectory τ = (a₁,…,a_T) from its current policy π_θ given a video‑query pair. The teacher, which is pre‑trained and kept frozen, produces a probability distribution p_T over the same token space at each step. The student's distribution is then matched to the teacher's via a reverse KL divergence D_KL(π_θ‖p_T), i.e., the KL taken with the student distribution first. This reverse KL is summed across all tokens and used as a dense loss, effectively turning the sparse episode‑level reward into fine‑grained step‑wise feedback. Because the loss is computed on trajectories actually generated by the student, the on‑policy property is preserved, mitigating distributional shift between training and inference.
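The dense loss described above can be sketched as follows. This is a minimal illustration, assuming both models expose full-vocabulary log-probabilities at every sampled step; the function names are ours, not the paper's, and a real implementation would operate on batched logit tensors rather than Python lists:

```python
import math

def reverse_kl_per_token(student_logprobs, teacher_logprobs):
    """Reverse KL D_KL(pi_theta || p_T) at one token position:
    sum over the vocabulary of pi_theta(v) * (log pi_theta(v) - log p_T(v))."""
    return sum(
        math.exp(ls) * (ls - lt)
        for ls, lt in zip(student_logprobs, teacher_logprobs)
    )

def distillation_loss(student_seq, teacher_seq):
    """Dense trajectory loss: sum the per-token reverse KL over every
    step of the student-sampled trajectory tau = (a_1, ..., a_T)."""
    return sum(
        reverse_kl_per_token(s, t)
        for s, t in zip(student_seq, teacher_seq)
    )
```

Because every token position contributes its own term, credit assignment is step-wise rather than episode-level; the reverse (mode-seeking) direction also penalizes the student for placing probability mass where the teacher places very little.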

The framework eliminates the need for multiple rollouts: a single on‑policy sample provides sufficient learning signal, dramatically reducing GPU time and memory consumption.

To further improve sample efficiency, the authors introduce Teacher‑Validated Disagreement Focusing (TVDF). TVDF uses ground‑truth temporal annotations only to validate the teacher’s reliability (e.g., by checking that the teacher’s IoU with the ground truth exceeds a threshold). Among the reliable samples, it prioritizes those where the aggregated reverse KL (i.e., teacher‑student disagreement) is highest, under the intuition that large disagreement carries the most informative learning signal. This curriculum steers training toward data points that are both trustworthy and maximally beneficial, without directly injecting the ground‑truth loss.
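The TVDF selection step can be sketched as below. The sample fields, the 0.5 IoU threshold, and the `top_k` cutoff are illustrative assumptions; the paper specifies only that ground truth validates the teacher and that the highest-disagreement reliable samples are prioritized:

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) intervals, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def tvdf_select(samples, iou_threshold=0.5, top_k=None):
    """Teacher-Validated Disagreement Focusing (sketch): ground truth is
    used only to validate the teacher's prediction; training then focuses
    on reliable samples with the largest teacher-student disagreement
    (aggregated reverse KL). No ground-truth loss is injected directly."""
    reliable = [
        s for s in samples
        if temporal_iou(s["teacher_span"], s["gt_span"]) >= iou_threshold
    ]
    reliable.sort(key=lambda s: s["reverse_kl"], reverse=True)
    return reliable if top_k is None else reliable[:top_k]
```

Note the ordering of the two filters: reliability gating comes first, so a sample with huge disagreement but an untrustworthy teacher prediction is discarded rather than amplified.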

Extensive experiments on three major TVG benchmarks—Charades‑TimeLens, ActivityNet‑TimeLens, and QVHighlights‑TimeLens—show that Video‑OPD consistently outperforms GRPO, achieving an average performance gain of over 17% (versus ~12% for GRPO). Training dynamics analysis reveals substantially faster convergence and a 40‑50% reduction in computational cost. The method also generalizes well to broader video‑understanding tasks such as TempCompass, MVBench, and Video‑MME, indicating its applicability beyond TVG.

In summary, the paper makes four key contributions: (1) it identifies the dual bottlenecks of sparse rewards and multi‑rollout overhead in GRPO‑based TVG post‑training; (2) it proposes Video‑OPD, which replaces sparse episode‑level rewards with dense token‑level reverse‑KL supervision from a teacher, preserving on‑policy alignment and enabling better credit assignment; (3) it introduces TVDF, a lightweight curriculum that validates teacher reliability and focuses on high‑disagreement trajectories to boost sample efficiency; and (4) it provides comprehensive empirical evidence that the proposed approach yields superior accuracy, faster convergence, and lower computational demands across both TVG and general video‑understanding benchmarks.

