E.M.Ground: A Temporal Grounding Vid-LLM with Holistic Event Perception and Matching
Despite recent advances in Video Large Language Models (Vid-LLMs), Temporal Video Grounding (TVG), which aims to precisely localize the time segments corresponding to query events, remains a significant challenge. Existing methods often match start and end frames by comparing frame features with two separate tokens, relying heavily on exact timestamps. However, this approach fails to capture the event's semantic continuity and integrity, leading to ambiguities. To address this, we propose E.M.Ground, a novel Vid-LLM for TVG that focuses on holistic and coherent event perception. E.M.Ground introduces three key innovations.
💡 Research Summary
The paper introduces E.M.Ground, a novel Video Large Language Model (Vid‑LLM) designed to tackle Temporal Video Grounding (TVG), the task of precisely localizing video segments that correspond to natural‑language queries. Existing Vid‑LLM approaches, exemplified by E.T.Chat, rely on two separate tokens—one for the start frame and one for the end frame—and select the frames with the highest token‑to‑frame similarity. This strategy over‑emphasizes exact timestamps, ignores the semantic continuity of the event, and suffers from noisy similarity scores and information loss caused by aggressive visual compression, especially on longer videos.
E.M.Ground departs from this paradigm with three key innovations.
- Holistic Event Token – Instead of two boundary tokens, a single special token is introduced to represent the entire query event. During training, this token is encouraged to match all frames within the ground-truth interval, preserving semantic continuity. The token's representation is taken from the second-to-last layer of the language model, projected into a shared space, and compared with frame-level visual features via cosine similarity. A smoothed binary label (with a small decay α outside the interval) guides the model toward a continuous similarity curve rather than isolated peaks.
- Savitzky-Golay Smoothing – Token-to-frame similarity scores are inherently noisy across adjacent frames. The authors apply a Savitzky-Golay filter, a polynomial-regression-based moving average, to the similarity sequence during inference. This filter removes high-frequency noise while preserving peak shapes, which is crucial for detecting short-duration events and for the Video Highlight Detection (VHD) sub-task. The smoothing yields more stable start/end predictions and reduces the abrupt changes that previously caused boundary errors.
- Multi-Grained Visual Feature Aggregation – To mitigate information loss from aggressive frame sampling, the visual encoder extracts features from multiple transformer layers (L layers). For each frame, features from all layers are averaged, yielding a multi-grained representation that captures both low-level texture and high-level semantic cues. This richer representation improves alignment with the holistic event token, especially in long videos where intermediate frames carry valuable context.
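The Savitzky-Golay step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the window length, polynomial order, and the simple thresholding heuristic used to decode a segment from the smoothed curve are all illustrative assumptions.

```python
# Sketch of inference-time smoothing of a token-to-frame similarity curve,
# assuming similarities are already available as a 1-D array in [0, 1].
# Window length (9), polyorder (2), and the 0.5 threshold are illustrative.
import numpy as np
from scipy.signal import savgol_filter

def smooth_and_localize(similarities, window=9, polyorder=2, threshold=0.5):
    """Smooth a per-frame similarity curve and decode a predicted segment.

    Frames whose smoothed score exceeds `threshold` are taken as the event;
    the segment spans the first and last such frame (a simple heuristic).
    """
    smoothed = savgol_filter(similarities, window_length=window, polyorder=polyorder)
    above = np.where(smoothed >= threshold)[0]
    if above.size == 0:
        return None, smoothed
    return (int(above[0]), int(above[-1])), smoothed

# Synthetic noisy similarity curve: the event spans frames 20..39.
rng = np.random.default_rng(0)
scores = np.full(60, 0.1)
scores[20:40] = 0.9
scores += rng.normal(0, 0.05, size=60)

segment, smoothed = smooth_and_localize(scores)
```

Because the filter fits a local polynomial rather than a flat average, isolated noise spikes outside the event are suppressed while the plateau inside the ground-truth interval keeps its shape, which is what makes short events recoverable.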
The training objective combines the standard Negative Log‑Likelihood loss for multimodal language generation with an auxiliary matching loss (binary cross‑entropy between the smoothed ground‑truth label y_t and the similarity score s_t). This dual‑loss formulation simultaneously optimizes textual generation and temporal alignment.
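The auxiliary matching term can be sketched as below. This is an assumption-laden illustration of the BCE between smoothed labels y_t and similarity scores s_t; the exact decay value α and any loss weighting against the NLL term are not specified here and are placeholders.

```python
# Sketch of the auxiliary matching loss: binary cross-entropy between
# smoothed ground-truth labels y_t and per-frame similarity scores s_t.
# alpha (the label value outside the ground-truth interval) is illustrative.
import numpy as np

def smoothed_labels(num_frames, gt_start, gt_end, alpha=0.1):
    """Soft targets: 1 inside the ground-truth interval, alpha outside."""
    y = np.full(num_frames, alpha)
    y[gt_start:gt_end + 1] = 1.0
    return y

def matching_loss(s, y, eps=1e-8):
    """Binary cross-entropy between labels y_t and similarities s_t in (0, 1)."""
    s = np.clip(s, eps, 1 - eps)
    return float(-np.mean(y * np.log(s) + (1 - y) * np.log(1 - s)))
```

In the full objective this term would be added to the standard NLL language-modeling loss, so gradients shape both the generated text and the token-to-frame similarity curve.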
Extensive experiments on benchmark TVG datasets (Charades-STA, TACoS, ActivityNet-Captions) and on VHD demonstrate consistent gains over state-of-the-art Vid-LLMs. E.M.Ground achieves higher Recall@0.3/0.5/0.7, larger mean IoU, and superior F1 scores on VHD, with particularly notable improvements on longer videos where prior methods degrade. Ablation studies isolate the contribution of each component: the holistic token, the Savitzky-Golay smoothing, and the multi-grained feature aggregation.
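For reference, the Recall@IoU metrics reported above are computed as follows; the (start, end) tuple format for segments is an assumption for illustration.

```python
# Temporal IoU between a predicted and a ground-truth segment, and
# Recall@IoU-threshold over a set of queries (one prediction per query).
def temporal_iou(pred, gt):
    """IoU of two (start, end) segments on the time axis."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at(preds, gts, threshold):
    """Fraction of queries whose prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)
```

Recall@0.3/0.5/0.7 simply evaluates `recall_at` at the three thresholds, and mean IoU averages `temporal_iou` over all queries.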
In summary, E.M.Ground redefines temporal grounding for video‑language models by (i) representing the whole event with a single token to capture semantic continuity, (ii) smoothing noisy similarity signals with a proven signal‑processing technique, and (iii) enriching visual representations through multi‑layer aggregation. This combination overcomes the limitations of timestamp‑centric matching and opens the door for future multimodal LLMs to handle fine‑grained temporal tasks such as automated video editing, interactive search, and real‑time highlight extraction.