MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation
This paper proposes a large-scale multi-modal dataset for referring motion expression video segmentation, focusing on segmenting and tracking target objects in videos based on language descriptions of the objects' motions. Existing referring video segmentation datasets often focus on salient objects and use language expressions rich in static attributes, potentially allowing the target object to be identified in a single frame. Such datasets underemphasize the role of motion in both video and language. To explore the feasibility of using motion expressions and motion-reasoning clues for pixel-level video understanding, we introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos of complex scenarios. We benchmark 15 existing methods across the 4 tasks supported by MeViS: 6 referring video object segmentation (RVOS) methods, 3 audio-guided video object segmentation (AVOS) methods, 2 referring multi-object tracking (RMOT) methods, and 4 video captioning methods for the newly introduced referring motion expression generation (RMEG) task. The results demonstrate the weaknesses and limitations of existing methods in addressing motion-expression-guided video understanding. We further analyze the challenges and propose an approach, LMPM++, for RVOS/AVOS/RMOT that achieves new state-of-the-art results. Our dataset provides a platform that facilitates the development of motion-expression-guided video understanding algorithms in complex video scenes. The proposed MeViS dataset and the method's source code are publicly available at https://henghuiding.com/MeViS/
💡 Research Summary
The paper introduces MeViS, a large‑scale multi‑modal dataset specifically designed for “referring motion expression video segmentation,” a task that requires segmenting and tracking objects in videos based solely on motion‑focused language descriptions. Existing referring video segmentation datasets (e.g., DAVIS‑17‑RVOS, Refer‑YouTube‑VOS) largely contain static attributes such as color or shape, allowing the target object to be identified from a single frame. Consequently, current methods can often treat the problem as a variant of semi‑supervised video object segmentation and achieve strong performance without truly reasoning about temporal dynamics.
MeViS addresses this gap by collecting 2,006 diverse, interaction‑rich videos that feature multiple moving objects and ambiguous visual cues. Human annotators provided 33,072 motion expressions, each in both text and audio form, covering 8,171 distinct objects. The dataset expands the earlier MeViS‑v1 with 4,502 new “motion‑reasoning” expressions (requiring multi‑step temporal inference) and “no‑target” expressions (describing motion that does not correspond to any object). Over 150,000 seconds of human‑recorded or TTS speech are included, enabling research on audio‑guided video object segmentation (AVOS). In addition to pixel‑level masks, bounding‑box trajectories are supplied, making MeViS the largest referring multi‑object tracking (RMOT) benchmark to date.
Four tasks are defined on MeViS:
- Referring Video Object Segmentation (RVOS) – segment the target(s) given a textual motion expression.
- Audio‑guided Video Object Segmentation (AVOS) – segment based on spoken motion description.
- Referring Multi‑Object Tracking (RMOT) – output bounding‑box tracks for all objects referenced by the expression (including multi‑target and no‑target cases).
- Referring Motion Expression Generation (RMEG) – generate a concise, unambiguous motion description for a selected object or set of objects.
To evaluate the difficulty of the dataset, the authors benchmark 15 state‑of‑the‑art methods: six RVOS models, three AVOS models, two RMOT models, and four video‑captioning models (used for RMEG). Across all tasks, performance drops markedly compared with results on older datasets, highlighting that current architectures struggle with pure motion cues, temporal ordering, and audio noise.
The paper also proposes a new baseline, Language‑guided Motion Perception and Matching (LMPM++). LMPM++ first uses language‑conditioned queries to detect candidate objects and extracts object‑level embeddings rather than raw frame features. These embeddings are fed into a large language model (LLM) as “object tokens,” allowing the system to process long video sequences (≈200 frames) efficiently. A temporal‑level contrastive loss forces the model to differentiate motions that share the same verbs but differ in order (e.g., “jump high then far” vs. “jump far then high”). Experiments show that LMPM++ achieves new state‑of‑the‑art results on MeViS: higher mIoU for RVOS, improved AP for AVOS, and increased MOTA for RMOT, especially excelling at correctly ignoring “no‑target” expressions.
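The temporal-level contrastive loss described above can be understood as an InfoNCE-style objective over paired expression and object-trajectory embeddings: the matched pair is pulled together while other trajectories in the batch (including order-swapped variants of the same verbs) act as negatives. The following is a minimal NumPy sketch of that idea under assumed names and shapes (`temporal_contrastive_loss`, batch-by-dimension embedding matrices), not the authors' actual implementation:

```python
import numpy as np

def temporal_contrastive_loss(expr_emb: np.ndarray,
                              traj_emb: np.ndarray,
                              temperature: float = 0.07) -> float:
    """InfoNCE-style loss: the i-th expression embedding should be most
    similar to the i-th trajectory embedding; all other trajectories in
    the batch (e.g. order-swapped motions) serve as negatives.

    expr_emb, traj_emb: (B, D) arrays of paired embeddings.
    """
    # L2-normalise rows so the dot product is cosine similarity
    e = expr_emb / np.linalg.norm(expr_emb, axis=1, keepdims=True)
    t = traj_emb / np.linalg.norm(traj_emb, axis=1, keepdims=True)
    logits = (e @ t.T) / temperature              # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives lie on the diagonal; minimise their negative log-likelihood
    return float(-np.mean(np.diag(log_probs)))
```

With this formulation, an expression whose trajectory embedding encodes "jump high then far" incurs a high loss if it is more similar to the order-swapped "jump far then high" trajectory than to its own, which is exactly the ordering sensitivity the summary attributes to LMPM++.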
The authors discuss several limitations: the object detection stage can still be confused by complex backgrounds or lighting changes; reliance on a heavyweight LLM raises computational cost; and the current audio modality focuses on English and Mandarin, leaving multilingual extensions open. Future directions include lightweight object‑token encoders, stronger multimodal attention mechanisms, real‑time inference optimizations, and integration with embodied AI scenarios such as robotics or autonomous driving.
In summary, MeViS constitutes the first comprehensive dataset that foregrounds motion as the primary referential cue, providing rich multimodal annotations and supporting a suite of four challenging tasks. The extensive benchmark reveals substantial gaps in existing methods, while the proposed LMPM++ demonstrates a promising pathway toward motion‑aware video‑language understanding. By releasing the dataset and code publicly, the authors aim to catalyze further research on temporal reasoning, audio‑visual grounding, and motion‑centric perception in complex, real‑world video environments.