While recognizing actions, LMMs struggle to detect core interaction events


📝 Abstract

Large multi-modal models (LMMs) show increasing performance in realistic visual tasks for images and, more recently, for videos. For example, given a video sequence, such models are able to describe in detail objects, the surroundings and dynamic actions. In this study, we explored the extent to which these models ground their semantic understanding in the actual visual input. Specifically, given sequences of hands interacting with objects, we asked models when and where the interaction begins or ends. For this purpose, we introduce a first of its kind, large-scale dataset with more than 20K annotated interactions on videos from the Something-Something-V2 dataset. 250 AMTurk human annotators labeled core interaction events, particularly when and where objects and agents become attached (‘contact’) or detached (‘release’). We asked two LMMs (Qwen-2.5VL and GPT-4o) to locate these events in short videos, each with a single event. The results show that although the models can reliably name the target objects, identify the action and provide coherent reasoning, they consistently fail to identify the frame where the interaction begins or ends and cannot localize the event within the scene. Our findings suggest that in struggling to pinpoint the moment and location of physical contact that defines the interaction, the models lack the perceptual grounding required for deeper understanding of dynamic scenes.


📄 Content

Discovering and understanding actions and interactions are fundamental cognitive capabilities of humans and other intelligent beings, necessary for interpreting and planning dynamic events between objects and agents in the surrounding environment [20]. Infants develop early sensitivity to spatiotemporal continuity in simple events, for example, the physical contact between an agent and a target object. This perceptual sensitivity to causality guides infants, at a very young age, in learning to detect and interpret interactions between objects and agents, including launching, entraining and expulsion events [2,12,16,21].

Computationally, recent vision models have shown increasing performance in recognizing actions and interactions in realistic video sequences [7,17,24]. Some models include special architectural designs to improve internal representation learning [31], while others use a common architecture but train on very large unlabeled video datasets using self-supervised learning paradigms [5,14,26]. The introduction of large multi-modal models (LMMs) allowed semantic information from large language models (LLMs) to be combined with foundational visual representations, enabling generalization to unseen videos and actions without explicit training [24].

Despite the increasing success of models in generalizing to more complex tasks without explicit training, recent studies have revealed fundamental limitations in the models' ability to reason about the performed tasks and to develop human-like, generalizable understanding [4,11,22]. We are interested in whether the enhanced generalization behavior of LMMs in interpreting actions reflects a deeper understanding of the dynamic events at the core of interactions, as in intelligent beings, or merely a seemingly convincing "storytelling" ability about reliably detected objects in the scenes.

For this purpose we introduce a large-scale dataset, the Contact-Release Interaction Dataset. This first-of-its-kind dataset consists of more than 20K annotated interactions, based on 10,000 action videos from the "Something-Something v.2" (SSv2) dataset [8]. Using the Amazon Mechanical Turk (AMTurk) crowdsourcing platform, we conducted a survey in which 250 human annotators labeled 24,222 core interaction events, including: (i) the type of agent acting upon the target object (e.g., a hand or another object), (ii) the type of core interaction event, i.e., 'contact' if a target object becomes attached to an agent, or 'release' if an object becomes detached from an agent, (iii) the spatiotemporal location of the event, i.e., frame number and image coordinates (see Fig. 1 and Sec. 3).

Figure 1. Collecting human annotations for interactions using the Amazon Mechanical Turk platform. Human subjects were asked to annotate core interaction events in videos from the SSv2 dataset [8]. Shown here are example annotations for 'contact' and 'release' events, where the target object (white candle) comes in contact with a hand (left) and a surface (middle), or is detached from the hand (right). The annotations include the event type, the kind of agent-object pair and the spatiotemporal location of the event (frame and image coordinates).
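Each annotation can be thought of as a small structured record combining the event type, the agent type, and the spatiotemporal location. The field names and agent categories below are illustrative assumptions, not the dataset's published schema:

```python
from dataclasses import dataclass
from typing import Literal, Tuple

@dataclass
class InteractionEvent:
    """One annotated core interaction event (illustrative schema)."""
    video_id: str                                # SSv2 video identifier
    event_type: Literal["contact", "release"]    # core event category
    agent_type: Literal["hand", "object"]        # what acts on the target object
    frame: int                                   # temporal location of the event
    location: Tuple[float, float]                # (x, y) image coordinates

# Hypothetical example: a hand releasing a candle at frame 42
event = InteractionEvent("ssv2_000123", "release", "hand", 42, (0.48, 0.61))
```

A flat record like this makes it straightforward to query, say, all 'contact' events involving hands, or to compare a model's predicted frame against the annotated one.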

Using the new annotations, we conducted a series of experiments to evaluate the ability of current LMMs to detect the spatiotemporal location of core interaction events in real-world video sequences. The experiments were conducted under several In-Context-Learning regimes, while applying two modifying conditions on the models' prompts, Reasoning and Grounding, inspired by earlier studies [10,23,27]. In the evaluation, we used an open-source version of Alibaba's Qwen-2.5VL-72B [1] and OpenAI's online GPT-4o model [18].
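An evaluation along these lines needs to score a model's predicted frame and image coordinates against the human annotation. The sketch below shows one plausible scoring scheme; the tolerance thresholds and the success criterion are assumptions for illustration, not the paper's actual protocol:

```python
from typing import Tuple

def temporal_error(pred_frame: int, gt_frame: int) -> int:
    """Absolute frame offset between predicted and annotated event time."""
    return abs(pred_frame - gt_frame)

def spatial_error(pred_xy: Tuple[float, float], gt_xy: Tuple[float, float]) -> float:
    """Euclidean distance between predicted and annotated event locations,
    in normalized image coordinates."""
    dx = pred_xy[0] - gt_xy[0]
    dy = pred_xy[1] - gt_xy[1]
    return (dx * dx + dy * dy) ** 0.5

def localized(pred_frame: int, pred_xy: Tuple[float, float],
              gt_frame: int, gt_xy: Tuple[float, float],
              frame_tol: int = 3, dist_tol: float = 0.1) -> bool:
    """Hypothetical success criterion: the event is localized within a few
    frames and a small normalized distance of the human annotation."""
    return (temporal_error(pred_frame, gt_frame) <= frame_tol
            and spatial_error(pred_xy, gt_xy) <= dist_tol)
```

With such per-event scores, accuracy can be aggregated separately over the temporal and spatial dimensions, which is what lets the study distinguish "names the right object and action" from "pinpoints when and where contact happens".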

In summary, our contributions include:

• A large-scale dataset, based on 10K action videos from the SSv2 dataset, with more than 20K first-of-their-kind human annotations of core interaction events ('Contact' and 'Release'). The annotations include details on the event and agent types, as well as the spatiotemporal locations of the events.

• A set of prompting experiments under several In-Context-Learning regimes and modifying conditions, including Reasoning and Grounding.

• A discussion of the question: Does the enhanced generalization behavior of LMMs in action recognition reflect improved video understanding, or is it merely convincing 'storytelling' about detected objects in visual scenes?

Video understanding has been thoroughly researched in the past due to its high value to the advancement of AI. Unlike action recognition in video, such as identifying people jumping, playing tennis, etc., video understanding is a more complex task, often requiring a high level of generalization. Recent studies in this area, such as Maaz et al. [15], showed that a ChatGPT agent can successfully answer complex questions when prompted with image data together with the verba

