Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation
Reality is a dance between rigid constraints and deformable structures. For video models, that means generating motion that preserves fidelity as well as structure. Despite progress in diffusion models, producing realistic structure-preserving motion remains challenging, especially for articulated and deformable objects such as humans and animals. Scaling training data alone has so far failed to resolve physically implausible transitions, and existing approaches rely on conditioning with noisy motion representations, such as optical flow or skeletons extracted by an imperfect external model. To address these challenges, we introduce an algorithm to distill structure-preserving motion priors from an autoregressive video tracking model (SAM2) into a bidirectional video diffusion model (CogVideoX). With our method, we train SAM2VideoX, which contains two innovations: (1) a bidirectional feature fusion module that extracts global structure-preserving motion priors from a recurrent model like SAM2; (2) a Local Gram Flow loss that aligns how local features move together. Experiments on VBench and in human studies show that SAM2VideoX delivers consistent gains (+2.60 points on VBench, 21-22% lower FVD, and 71.4% human preference) over prior baselines. Specifically, on VBench we achieve 95.51%, surpassing REPA (92.91%) by 2.60 percentage points, and reduce FVD to 360.57, a 21.20% and 22.46% improvement over REPA- and LoRA-finetuning, respectively. The project website can be found at https://sam2videox.github.io/.
💡 Research Summary
This paper, “Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation,” addresses a fundamental challenge in video generation: producing realistic and physically plausible motion that preserves the structural integrity of objects, particularly articulated and deformable entities like humans and animals. While diffusion models have excelled in generating high-fidelity static images, they often fail to create coherent dynamic motion, resulting in artifacts such as limb disconnection, texture tearing, and implausible transitions.
The core problem is that scaling training data or relying on explicit but noisy motion representations (e.g., optical flow or skeletons extracted by imperfect external models) has proven insufficient. The authors propose a novel solution centered on the idea of “deriving structure from tracking.” They posit that a model trained to understand and track motion over time (a tracker) possesses implicit knowledge about structural preservation that can be transferred to a model trained to generate motion (a generator).
The proposed framework, named SAM2VideoX, distills structure-preserving motion priors from a state-of-the-art autoregressive video tracking model, SAM2 (Segment Anything Model 2), into a bidirectional video diffusion model, CogVideoX. SAM2 is designed for robust, long-term object tracking across occlusions, meaning its internal feature representations inherently encode how object parts move together, maintain connectivity, and resolve occlusions over time—precisely the knowledge lacking in standard generators.
The technical methodology involves two key innovations:
- Bidirectional Feature Fusion Module: A fundamental architectural mismatch exists: SAM2 processes videos causally (recurrently), while diffusion transformers (DiTs) like CogVideoX use bidirectional attention. To bridge this gap, the authors extract SAM2’s memory features from both the original video (forward pass) and the temporally reversed video (backward pass). These features are then fused at the level of their Local Gram Flow representations to create a unified teacher signal that approximates global video context, making it suitable for distillation into the bidirectional generator.
- Local Gram Flow (LGF) Loss: Instead of using direct feature matching (e.g., L2 loss), which is ineffective for capturing relational motion, a novel alignment loss is introduced. For a given spatial token in frame t, the LGF operator computes its similarity (dot product) with tokens in a local 7x7 neighborhood in frame t+1. This results in a vector representing potential motion trajectories from that location. Both the student (projected DiT features) and teacher (fused SAM2 features) undergo this operation. The resulting similarity vectors are converted into probability distributions via a softmax, and the distance between these distributions is minimized using Kullback-Leibler (KL) divergence. This forces the generator to learn the relative motion patterns—how features move in relation to each other—rather than absolute feature values, leading to more effective knowledge transfer of structural motion priors.
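The LGF operation and its KL alignment described above can be sketched in NumPy as follows. This is a minimal illustrative sketch, not the paper's implementation: the `(H, W, C)` feature-map shapes, the `edge` padding at borders, and the fusion of forward/backward teacher passes by simply averaging their LGF distributions are all assumptions made for clarity.

```python
import numpy as np

def local_gram_flow(feats_t, feats_t1, k=7):
    """Local Gram Flow: for each spatial token in frame t, a softmax
    distribution over a k x k neighborhood in frame t+1.
    Shapes: (H, W, C) x 2 -> (H, W, k*k)."""
    H, W, C = feats_t.shape
    r = k // 2
    padded = np.pad(feats_t1, ((r, r), (r, r), (0, 0)), mode="edge")
    sims = np.empty((H, W, k * k))
    for i in range(H):
        for j in range(W):
            # Dot products between token (i, j) at frame t and its local
            # neighborhood at frame t+1: a vector of candidate motion targets.
            nbhd = padded[i:i + k, j:j + k, :].reshape(k * k, C)
            sims[i, j] = nbhd @ feats_t[i, j]
    sims -= sims.max(axis=-1, keepdims=True)  # numerically stable softmax
    p = np.exp(sims)
    return p / p.sum(axis=-1, keepdims=True)

def lgf_kl_loss(student, teacher_fwd, teacher_bwd=None, k=7, eps=1e-8):
    """KL(teacher || student) between Local Gram Flow distributions,
    averaged over all spatial tokens.

    student / teacher_*: pairs (feats_t, feats_t1) of (H, W, C) arrays.
    If teacher_bwd is given, the two teacher LGF distributions are
    averaged -- a simple stand-in for the bidirectional fusion step.
    """
    q = local_gram_flow(student[0], student[1], k)
    p = local_gram_flow(teacher_fwd[0], teacher_fwd[1], k)
    if teacher_bwd is not None:
        p = 0.5 * (p + local_gram_flow(teacher_bwd[0], teacher_bwd[1], k))
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(kl.mean())
```

Because the loss compares softmax-normalized similarity vectors rather than raw features, the student only has to match *where* mass moves within each local neighborhood, not the teacher's absolute feature values, which is the relational-motion property the paper emphasizes.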
The model is trained on a curated dataset of ~10,000 video clips focusing on human and animal motions. Extensive evaluations demonstrate significant improvements:
- On the VBench benchmark (Matched Dynamic Degree), SAM2VideoX achieves a score of 95.51%, surpassing the previous best method, REPA (92.91%), by 2.60 percentage points.
- It substantially improves video quality metrics, reducing the Fréchet Video Distance (FVD) to 360.57, which is a 21.20% and 22.46% improvement over REPA-finetuning and LoRA-finetuning baselines, respectively.
- A human preference study confirms the perceptual superiority of the generated motions, with 71.4% of ratings favoring SAM2VideoX outputs over those from strong baselines like CogVideoX and HunyuanVid. Qualitative results show that SAM2VideoX generates videos where animals walk with correct leg alternation, human limbs follow plausible trajectories, and human-object interactions appear more natural.
In summary, this work presents a paradigm shift for enhancing video generation. By distilling implicit, structure-centric motion knowledge from a powerful tracking model into a generative model, it significantly advances the state-of-the-art in producing physically plausible and structurally coherent dynamic content without relying on external control signals during inference. This approach moves video generation closer to becoming a reliable simulator of dynamic real-world processes.