FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos
Motion understanding is fundamental to physical reasoning, enabling models to infer dynamics and predict future states. However, state-of-the-art models still struggle on recent motion benchmarks, primarily due to the scarcity of large-scale, fine-grained motion datasets. Existing motion datasets are often constructed from costly manual annotation, severely limiting scalability. To address this challenge, we introduce FoundationMotion, a fully automated data curation pipeline that constructs large-scale motion datasets. Our approach first detects and tracks objects in videos to extract their trajectories, then leverages these trajectories and video frames with Large Language Models (LLMs) to generate fine-grained captions and diverse question-answer pairs about motion and spatial reasoning. Using datasets produced by this pipeline, we fine-tune open-source models including NVILA-Video-15B and Qwen2.5-7B, achieving substantial improvements in motion understanding without compromising performance on other tasks. Notably, our models outperform strong closed-source baselines like Gemini-2.5 Flash and large open-source models such as Qwen2.5-VL-72B across diverse motion understanding datasets and benchmarks. FoundationMotion thus provides a scalable solution for curating fine-grained motion datasets that enable effective fine-tuning of diverse models to enhance motion understanding and spatial reasoning capabilities.
💡 Research Summary
FoundationMotion tackles a fundamental bottleneck in video‑language understanding: the scarcity of fine‑grained “how” motion data. While recent vision‑language models (VLMs) such as Gemini and Qwen‑VL can recognize objects and describe events, they still falter on tasks that require reasoning about the precise way objects move and interact in space. To close this gap, the authors present a fully automated pipeline that converts raw videos into a massive motion‑centric dataset, then uses this dataset to fine‑tune open‑source VLMs, achieving performance that rivals or exceeds strong closed‑source baselines.
The pipeline consists of four stages. First, videos are temporally cropped to 5‑10 second clips; clips with excessive camera motion are filtered out using a pretrained camera‑pose estimator (VGGT) to ensure reliable tracking. Second, objects are detected in two complementary ways. General objects are identified by prompting the Qwen2.5‑VL‑7B model to list salient categories, which are then localized with Grounded‑DINO. Human‑centric detection proceeds hierarchically: Cascade Mask R‑CNN with a ViTDet‑H backbone finds persons, ViTPose+ extracts full‑body and hand keypoints, and the Hands23 model refines hand boxes, determines left/right side, and predicts contact states (no contact, self‑contact, object‑contact, other‑contact) together with the associated object box. Third, temporal coherence is maintained by SAM2, which propagates detections across frames to produce per‑object trajectories stored as JSON. Finally, the trajectories and sampled frames are fed to large language models (LLMs) via carefully crafted prompts. The LLMs generate concise motion captions and a diverse set of question‑answer (QA) pairs that probe not only "what moves" but also "how it moves", relative spatial relations, geometric constraints, and temporal ordering. In total, the system produces roughly 500K captions and QA pairs, collectively called the FoundationMotion Dataset.
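To make the third and fourth stages concrete, the sketch below shows what a per‑object trajectory record might look like and how coarse motion cues (net displacement and dominant direction) could be derived from tracked box centers before being handed to an LLM for captioning. The JSON field names and the summarization logic are our own illustrative assumptions, not the paper's exact schema.

```python
import json
import math

# Hypothetical per-object trajectory record; the field names are a guess
# at the kind of JSON the pipeline stores, not the paper's exact format.
trajectory = {
    "object_id": 3,
    "category": "cup",
    "boxes": [  # per-frame [x1, y1, x2, y2] boxes propagated by the tracker
        [100, 200, 140, 260],
        [120, 195, 160, 255],
        [150, 190, 190, 250],
    ],
    "frame_indices": [0, 8, 16],
}

def box_center(box):
    """Center (cx, cy) of an [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def summarize_motion(traj):
    """Reduce a box trajectory to a coarse motion description
    (net displacement and dominant image-plane direction) that an
    LLM prompt could consume alongside sampled frames."""
    centers = [box_center(b) for b in traj["boxes"]]
    dx = centers[-1][0] - centers[0][0]
    dy = centers[-1][1] - centers[0][1]
    direction = "right" if dx > 0 else "left"
    if abs(dy) > abs(dx):
        direction = "down" if dy > 0 else "up"  # image y grows downward
    return {
        "object": traj["category"],
        "displacement_px": round(math.hypot(dx, dy), 1),
        "direction": direction,
    }

summary = summarize_motion(trajectory)
print(json.dumps(summary))
# → {"object": "cup", "displacement_px": 51.0, "direction": "right"}
```

A structured summary like this is one plausible way to ground LLM‑generated captions in measured motion rather than in the model's visual guesswork, which is the core idea behind feeding trajectories (not just frames) into the language stage.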
To evaluate the utility of this data, the authors fine‑tune two open‑source VLMs—NVILA‑Video‑15B and Qwen2.5‑7B—on the new dataset. They benchmark the resulting models on existing motion‑focused suites (MotionBench, FAVOR‑Bench) and on a newly curated "how‑motion" benchmark covering hand manipulation, robot‑arm tasks, and autonomous‑driving vehicle trajectories. Across all metrics, the fine‑tuned models outperform the state‑of‑the‑art closed‑source Gemini‑2.5 Flash and even larger open‑source models such as Qwen2.5‑VL‑72B, especially on "how" questions, where gains of 12–18 percentage points are reported. Importantly, performance on broader video‑language tasks (image captioning, generic video QA) remains unchanged or slightly improves, demonstrating that the motion‑specific fine‑tuning does not degrade general capabilities.
Key technical contributions include: (1) a fully automated, scalable pipeline that integrates cutting‑edge detection (Grounded‑DINO, SAM2), human‑centric hand‑contact analysis (Hands23), and LLM‑driven language generation to produce high‑quality motion annotations without any human labeling; (2) the creation of a large‑scale motion dataset with both captions and QA pairs, filling the “how” data gap that has limited prior work; (3) a systematic evaluation showing that motion‑centric data can substantially boost VLM reasoning about dynamics while preserving overall versatility; and (4) open release of code, data, and benchmarks to catalyze community progress.
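As an illustration of contribution (2), the snippet below sketches what one dataset record pairing a motion caption with QA pairs might look like when serialized as JSONL for fine‑tuning. The schema, file name, and question wording are all hypothetical examples, not the released dataset's actual format.

```python
import json

# Hypothetical example of one generated record; the schema and wording
# are illustrative guesses, not the FoundationMotion Dataset's exact output.
qa_record = {
    "video": "clip_00042.mp4",
    "caption": "A right hand picks up a cup and moves it to the right.",
    "qa_pairs": [
        {
            "question": "In which direction does the cup move after being grasped?",
            "answer": "To the right.",
            "type": "how-motion",
        },
        {
            "question": "Which hand makes contact with the cup?",
            "answer": "The right hand.",
            "type": "contact",
        },
    ],
}

# Serialize to a single JSONL line, a common format for VLM fine-tuning data.
line = json.dumps(qa_record)
print(line[:60] + "...")
```

Pairing each clip with both a caption and typed QA pairs lets one corpus serve supervised fine‑tuning for captioning and for question answering at once, which matches the dual caption/QA design described above.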
The paper also discusses limitations. Automated tracking can still fail on fast or occluded motions, leading to noisy trajectories; LLM‑generated text may contain factual inconsistencies that would benefit from human verification or automated quality checks. Moreover, the current system operates on 2‑D bounding boxes and trajectories, which may be insufficient for tasks that require true 3‑D spatial reasoning (e.g., robot manipulation in cluttered environments). Future work could incorporate depth sensors, multi‑view video, or 3‑D reconstruction to enrich the motion representation, and develop self‑supervised validation loops to further improve annotation reliability.
Overall, FoundationMotion demonstrates that large‑scale, automatically curated “how” motion data is feasible and highly effective. By bridging the data scarcity gap, it enables open‑source VLMs to achieve motion understanding on par with proprietary systems, opening new avenues for research in physical reasoning, embodied AI, and multimodal learning.