Factorized Learning for Temporally Grounded Video-Language Models

Reading time: 1 minute
...

๐Ÿ“ Original Info

  • Title: Factorized Learning for Temporally Grounded Video-Language Models
  • ArXiv ID: 2512.24097
  • Date: 2025-12-30
  • Authors: Wenzheng Zeng, Difei Gao, Mike Zheng Shou, Hwee Tou Ng

๐Ÿ“ Abstract

Figure 1. (a) Performance: our method outperforms SOTA methods across various tasks (here we draw the maximum performance across methods; detailed in Sec. 6). (b) Model: we propose a new framework, D²VLM, where we decompose the generation objective into a "grounding then answering with evidence referencing" paradigm and introduce evidence tokens to emphasize explicit event-level visual semantic capture. (c) Training algorithm: we introduce Factorized Preference Optimization (FPO), which explicitly addresses both temporal grounding and textual response. A factorized data synthesis approach is also designed to support FPO.
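
As one concrete reading of the "factorized perturbation" shown in Figure 1(c), the sketch below synthesizes preference pairs for FPO by corrupting the temporal span and the textual answer independently, mirroring the figure's example (correct: "Small bag ... [12.3s-15.6s]"; perturbed: wrong span, or wrong answer "Large basket"). This is a minimal illustration under stated assumptions, not the paper's released pipeline; the class and function names and the specific perturbation strategies are assumptions.

```python
# Minimal sketch (not the authors' code) of factorized preference-pair
# synthesis suggested by Figure 1(c): perturb the temporal grounding and
# the textual answer *separately*, so each pair isolates one error type.
import random
from dataclasses import dataclass

@dataclass
class Response:
    answer: str   # textual answer, e.g. "Small bag."
    start: float  # grounded event start time (seconds)
    end: float    # grounded event end time (seconds)

    def render(self) -> str:
        return f"{self.answer} The relevant event happens in [{self.start}s-{self.end}s]."

def perturb_time(r: Response, video_len: float) -> Response:
    """Keep the answer, corrupt the temporal span (assumed strategy)."""
    start = round(random.uniform(0.0, video_len * 0.8), 1)
    end = round(random.uniform(start + 1.0, video_len), 1)
    return Response(r.answer, start, end)

def perturb_text(r: Response, distractors: list[str]) -> Response:
    """Keep the span, corrupt the textual answer (assumed strategy)."""
    return Response(random.choice(distractors), r.start, r.end)

def make_preference_pairs(gold: Response, video_len: float, distractors: list[str]):
    """Yield (preferred, dispreferred) pairs, one per perturbed factor."""
    return [
        (gold.render(), perturb_time(gold, video_len).render()),    # grounding pair
        (gold.render(), perturb_text(gold, distractors).render()),  # response pair
    ]

if __name__ == "__main__":
    gold = Response("Small bag.", 12.3, 15.6)
    for chosen, rejected in make_preference_pairs(gold, 60.0, ["Large basket."]):
        print("preferred:   ", chosen)
        print("dispreferred:", rejected)
```

Keeping the two perturbations disjoint is what makes the optimization "factorized": each pair provides a learning signal about either grounding or answering, rather than entangling both.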

📄 Full Content

...(๋ณธ๋ฌธ ๋‚ด์šฉ์ด ๊ธธ์–ด ์ƒ๋žต๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์‚ฌ์ดํŠธ์—์„œ ์ „๋ฌธ์„ ํ™•์ธํ•ด ์ฃผ์„ธ์š”.)
