MoCA-Video: Motion-Aware Concept Alignment for Consistent Video Editing

Notice: This research summary and analysis were generated automatically using AI. For complete accuracy, please refer to the original arXiv paper.

We present MoCA-Video, a training-free framework for semantic mixing in videos. Operating in the latent space of a frozen video diffusion model, MoCA-Video uses class-agnostic segmentation together with a diagonal denoising scheduler to localize and track the target object across frames. To ensure temporal stability under semantic shifts, we introduce a momentum-based correction that approximates novel hybrid distributions beyond the training data distribution, alongside a lightweight gamma residual module that smooths out visual artifacts. We evaluate the model’s performance using SSIM, LPIPS, and a proposed metric, CASS, which quantifies semantic alignment between reference and output. Extensive evaluation demonstrates that our model consistently outperforms both training-free and trained baselines, achieving superior semantic mixing and temporal coherence without retraining. These results establish that structured manipulation of diffusion noise trajectories enables controllable, high-quality video editing under semantic shifts.


💡 Research Summary

MoCA‑Video introduces a training‑free framework for semantic mixing in videos by manipulating the latent noise trajectory of a frozen text‑to‑video diffusion model (VideoCrafter2). The method first recovers the latent trajectory of a source video using DDIM inversion. At selected timesteps, when the target object is sufficiently formed but still semantically pliable, a class‑agnostic segmentation model (Grounded‑SAM2) is applied to the predicted clean frame (x̂₀) to obtain a binary mask of the object. An IoU‑based overlap maximization algorithm tracks this mask across subsequent timesteps, ensuring spatial consistency despite the presence of noise.
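The IoU-based overlap maximization described above reduces to a simple greedy matcher: at each timestep, pick the candidate mask that best overlaps the mask tracked at the previous timestep. The sketch below is an illustration of that idea, not the paper's implementation; mask extraction via Grounded-SAM2 is assumed to happen elsewhere.

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two binary masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def track_mask(prev_mask: np.ndarray, candidate_masks: list) -> np.ndarray:
    """Greedy IoU-maximizing tracker: among the masks segmented at the
    current timestep, keep the one that best overlaps the previous mask."""
    return max(candidate_masks, key=lambda m: iou(prev_mask, m))
```

Because the masks are computed on the predicted clean frame x̂₀ rather than on the noisy latent, the overlap stays meaningful even at intermediate noise levels.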

The reference image is encoded into the same latent space, and its features are injected into the masked region of each latent frame using a soft‑fusion formula: x_mixₜ = xₜ·(1‑mₜ) + λₜ·z_ref·mₜ, where λₜ is a time‑dependent strength that peaks around the injection timestep and decays thereafter. This soft mask allows the diffusion process to naturally smooth minor segmentation errors.

To preserve temporal coherence, MoCA‑Video augments the standard DDIM update with a momentum‑corrected term. The deviation introduced by feature injection is captured as gₜ = xₜ – xₜ₋₁ + λ·dirₜ, where dirₜ is the usual DDIM direction. A momentum vector vₜ = β·vₜ₋₁ + (1‑β)·gₜ accumulates these deviations, and a scaling factor κₜ (decreasing with time) adds a correction to the predicted clean image: x̂₀(corr) = x̂₀(DDIM) + κₜ·vₜ. This heuristic steers the denoising trajectory toward a hybrid distribution that lies outside the original training manifold, enabling the generation of novel entities such as “astronaut‑cat” or “surfer‑kayak”.
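The momentum correction above can be written as a short update rule. This sketch follows the three equations quoted in the paragraph; the default values for λ, β, and κₜ are placeholders, as the summary does not report the actual hyperparameters.

```python
import numpy as np

def momentum_correct(x_t, x_prev, dir_t, v_prev, x0_ddim,
                     lam=0.5, beta=0.9, kappa_t=0.1):
    """Momentum-corrected clean-frame estimate (per the summary):
        g_t      = x_t - x_prev + lam * dir_t
        v_t      = beta * v_prev + (1 - beta) * g_t
        x0_corr  = x0_ddim + kappa_t * v_t
    lam, beta, kappa_t defaults are illustrative, not the paper's values."""
    g_t = x_t - x_prev + lam * dir_t
    v_t = beta * v_prev + (1.0 - beta) * g_t
    x0_corr = x0_ddim + kappa_t * v_t
    return x0_corr, v_t
```

Because vₜ averages deviations over many steps, a single noisy injection cannot jerk the trajectory; the correction only builds up when the feature injection pushes consistently in one direction.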

A lightweight γ‑residual noise module further stabilizes the process by adding a small calibrated Gaussian noise term (γ·ε) to each fused latent, damping flicker and inter‑frame artifacts without degrading visual fidelity.
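In code, the γ-residual step is a one-liner: draw Gaussian noise and add a small scaled copy to the fused latent. The sketch below assumes γ is a fixed small scalar; how γ is calibrated is not detailed in this summary.

```python
import numpy as np

def gamma_residual(latent: np.ndarray, gamma: float,
                   rng: np.random.Generator) -> np.ndarray:
    """Add a small calibrated Gaussian residual (gamma * eps) to a fused
    latent, damping flicker between frames. gamma is assumed scalar here."""
    eps = rng.standard_normal(latent.shape)
    return latent + gamma * eps
```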

For evaluation, the authors construct an “Entity Blending” dataset by combining the CTIB super‑categories with DAVIS‑16 video segmentation classes, yielding 100 diverse intra‑ and inter‑category pairs. Metrics include SSIM, LPIPS, and a newly proposed Conceptual Alignment Shift Score (CASS) together with its normalized variant relCASS, which measure semantic alignment using CLIP embeddings. MoCA‑Video consistently outperforms training‑free baselines (FreeBlend, RAVE) and strong pretrained methods (AnimateDiff‑V2V, TokenFlow) across all metrics (e.g., +3.2 % SSIM, –0.07 LPIPS, +0.12 CASS). Qualitative results demonstrate temporally stable hybrid objects and smooth semantic transitions.
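The summary does not reproduce the exact CASS formula, only that it measures semantic alignment via CLIP embeddings. As a toy illustration of the underlying idea (how much the edited output moves toward the reference concept relative to the source), one could compare cosine similarities in embedding space; the function name and the difference-of-similarities form below are assumptions, not the paper's definition.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def alignment_shift(e_src: np.ndarray, e_out: np.ndarray, e_ref: np.ndarray) -> float:
    """Illustrative alignment-shift score (NOT the paper's exact CASS):
    positive when the output is closer to the reference concept than the
    source was, in (e.g. CLIP) embedding space."""
    return cosine(e_out, e_ref) - cosine(e_src, e_ref)
```

Here e_src, e_out, and e_ref stand in for precomputed CLIP embeddings of the source video, edited video, and reference concept.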

The paper’s contributions are: (1) the first training‑free video semantic mixing framework, (2) a combined mask‑tracking and momentum‑corrected denoising scheme that approximates novel hybrid distributions, (3) the CASS family of metrics for evaluating semantic alignment in video editing, and (4) a comprehensive benchmark for entity‑level blending. Limitations include reliance on mask quality and occasional tracking failures on fast‑motion scenes. Future work may explore optical‑flow‑enhanced tracking, multi‑object simultaneous editing, learnable momentum parameters, and automated λ selection guided by CASS.

