Factorized Learning for Temporally Grounded Video-Language Models

Reading time: 1 minute
...

๐Ÿ“ Original Info

  • Title: Factorized Learning for Temporally Grounded Video-Language Models
  • ArXiv ID: 2512.24097
  • Date: 2025-12-30
  • Authors: Wenzheng Zeng, Difei Gao, Mike Zheng Shou, Hwee Tou Ng

๐Ÿ“ Abstract

Figure 1. (a) Performance: our method outperforms SOTA methods across various tasks (here we draw the maximum performance across methods; detailed in Sec. 6). (b) Model: we propose a new framework, D²VLM, where we decompose the generation objective into a "grounding then answering with evidence referencing" paradigm and introduce evidence tokens to emphasize explicit event-level visual semantic capture. (c) Training algorithm: we introduce Factorized Preference Optimization (FPO), which explicitly addresses both temporal grounding and textual response. A factorized data synthesis approach is also designed to support FPO.
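
As one concrete reading of the "factorized perturbation" shown in Figure 1(c), the sketch below synthesizes preference pairs for FPO by corrupting the temporal span and the textual answer independently, mirroring the figure's example (correct: "Small bag ... [12.3s-15.6s]"; perturbed: wrong span, or wrong answer "Large basket"). This is a minimal illustration under stated assumptions, not the paper's released pipeline; the class and function names and the specific perturbation strategies are assumptions.

```python
# Minimal sketch (not the authors' code) of factorized preference-pair
# synthesis suggested by Figure 1(c): perturb the temporal grounding and
# the textual answer *separately*, so each pair isolates one error type.
import random
from dataclasses import dataclass

@dataclass
class Response:
    answer: str   # textual answer, e.g. "Small bag."
    start: float  # grounded event start time (seconds)
    end: float    # grounded event end time (seconds)

    def render(self) -> str:
        return f"{self.answer} The relevant event happens in [{self.start}s-{self.end}s]."

def perturb_time(r: Response, video_len: float) -> Response:
    """Keep the answer, corrupt the temporal span (assumed strategy)."""
    start = round(random.uniform(0.0, video_len * 0.8), 1)
    end = round(random.uniform(start + 1.0, video_len), 1)
    return Response(r.answer, start, end)

def perturb_text(r: Response, distractors: list[str]) -> Response:
    """Keep the span, corrupt the textual answer (assumed strategy)."""
    return Response(random.choice(distractors), r.start, r.end)

def make_preference_pairs(gold: Response, video_len: float, distractors: list[str]):
    """Yield (preferred, dispreferred) pairs, one per perturbed factor."""
    return [
        (gold.render(), perturb_time(gold, video_len).render()),    # grounding pair
        (gold.render(), perturb_text(gold, distractors).render()),  # response pair
    ]

if __name__ == "__main__":
    gold = Response("Small bag.", 12.3, 15.6)
    for chosen, rejected in make_preference_pairs(gold, 60.0, ["Large basket."]):
        print("preferred:   ", chosen)
        print("dispreferred:", rejected)
```

Keeping the two perturbations disjoint is what makes the optimization "factorized": each pair provides a learning signal about either grounding or answering, rather than entangling both.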

📄 Full Content

...(๋ณธ๋ฌธ ๋‚ด์šฉ์ด ๊ธธ์–ด ์ƒ๋žต๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์‚ฌ์ดํŠธ์—์„œ ์ „๋ฌธ์„ ํ™•์ธํ•ด ์ฃผ์„ธ์š”.)
