Beyond Rigid: Benchmarking Non-Rigid Video Editing

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv source.

Despite the remarkable progress in text-driven video editing, generating coherent non-rigid deformations remains a critical challenge, often plagued by physical distortion and temporal flicker. To bridge this gap, we propose NRVBench, the first dedicated and comprehensive benchmark designed to evaluate non-rigid video editing. First, we curate a high-quality dataset consisting of 180 non-rigid motion videos from six physics-based categories, equipped with 2,340 fine-grained task instructions and 360 multiple-choice questions. Second, we propose NRVE-Acc, a novel evaluation metric based on Vision-Language Models that can rigorously assess physical compliance, temporal consistency, and instruction alignment, overcoming the limitations of general metrics in capturing complex dynamics. Third, we introduce a training-free baseline, VM-Edit, which utilizes a dual-region denoising mechanism to achieve structure-aware control, balancing structural preservation and dynamic deformation. Extensive experiments demonstrate that while current methods have shortcomings in maintaining physical plausibility, our method achieves excellent performance across both standard and proposed metrics. We believe the benchmark could serve as a standard testing platform for advancing physics-aware video editing.
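
The abstract names VM-Edit's dual-region denoising mechanism without spelling it out. As a rough sketch of the general idea only, one denoising step could predict noise under both the editing prompt and a source-preserving prompt, denoise each branch, and recombine the results through the edit mask. All names below are hypothetical, the APIs are diffusers-style placeholders, and the authors' actual formulation may differ:

```python
def dual_region_denoise_step(latents, mask, t, unet, scheduler,
                             cond_edit, cond_source):
    """One denoising step that treats the masked edit region and the
    background separately, then recombines them.

    latents:     current noisy latents, shape (B, C, H, W)
    mask:        1 inside the region allowed to deform, 0 where
                 structure should be preserved
    cond_edit:   text embedding for the editing instruction
    cond_source: text embedding describing the original content

    `unet` and `scheduler` are assumed to follow diffusers-style
    interfaces; this is an illustrative sketch, not the paper's code.
    """
    # Predict noise twice: once under the edit prompt, once under the
    # source prompt that anchors the untouched structure.
    noise_edit = unet(latents, t, encoder_hidden_states=cond_edit).sample
    noise_keep = unet(latents, t, encoder_hidden_states=cond_source).sample

    # Denoise each branch independently.
    x_edit = scheduler.step(noise_edit, t, latents).prev_sample
    x_keep = scheduler.step(noise_keep, t, latents).prev_sample

    # Recombine: dynamic deformation inside the mask, structural
    # preservation outside it.
    return mask * x_edit + (1.0 - mask) * x_keep
```

Blending in latent space keeps the background effectively frozen while giving the masked region freedom to deform, which matches the stated goal of balancing structural preservation and dynamic deformation.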


💡 Research Summary

The paper addresses a critical gap in text‑driven video editing: the generation of coherent non‑rigid deformations. While recent diffusion‑based models have excelled at rigid appearance changes, they often produce physically implausible motions and temporal flicker when applied to deformable objects such as cloth, fluids, hair, or soft bodies. Existing benchmarks (e.g., TGVE, FiVE) focus on generic visual quality and semantic alignment, lacking dedicated protocols for assessing physical plausibility and temporal consistency of non‑rigid edits. To fill this void, the authors introduce three major contributions.

First, they construct NRVBench, a dedicated benchmark for non‑rigid video editing. The dataset comprises 180 high‑resolution videos, each trimmed to 60 frames, drawn from six physics‑grounded categories: Articulated Soft Bodies (ASB), Cloth and Thin‑Shells (CTS), Hair/Fur/Feathers (HFF), Liquid Free Surfaces (LFS), Gas/Smoke/Fire (GSF), and Deformable Solid Objects (DSO). Across these videos, they generate 2,340 fine‑grained editing instructions (13 per video) using GPT‑4o, create pixel‑accurate masks with SAM2 followed by human verification, and design 360 multiple‑choice questions (MCQs, two per video) to enable automated evaluation. The instruction set follows a hierarchical difficulty taxonomy (Degree Editing, Topology Editing, Attribute Editing), allowing systematic analysis of model performance across increasingly challenging physical manipulations.
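
The release format of NRVBench is not described in this summary. Purely as an illustrative sketch, a single benchmark entry might bundle the pieces above as follows (all field names are assumptions, not the authors' schema):

```python
from dataclasses import dataclass, field

CATEGORIES = ["ASB", "CTS", "HFF", "LFS", "GSF", "DSO"]
DIFFICULTY_TIERS = ["degree", "topology", "attribute"]

@dataclass
class MCQ:
    question: str
    options: list[str]
    answer_index: int          # index of the correct option

@dataclass
class NRVBenchEntry:
    video_path: str            # 60-frame clip
    category: str              # one of CATEGORIES
    mask_path: str             # SAM2 mask, human-verified
    # instructions keyed by difficulty tier,
    # e.g. {"degree": [...], "topology": [...], "attribute": [...]}
    instructions: dict[str, list[str]] = field(default_factory=dict)
    mcqs: list[MCQ] = field(default_factory=list)
```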

Second, they propose NRVE‑Acc, a novel evaluation metric that leverages a large Vision‑Language Model (Qwen2.5‑VL‑7B) to assess three orthogonal dimensions: (1) Instruction Alignment – measured via MCQ correctness, yielding a score in [0, 1]; (2) Physical Compliance; and (3) Temporal Consistency.
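
The summary does not show the exact scoring protocol. Assuming the Instruction Alignment dimension is simply the fraction of MCQs the VLM answers correctly when shown the edited clip, the aggregation could look like the sketch below, reusing the MCQ records from the earlier schema; `ask_vlm` is a hypothetical stand-in for a real Qwen2.5‑VL‑7B query, not an actual API:

```python
def instruction_alignment_score(edited_video, mcqs, ask_vlm):
    """Fraction of multiple-choice questions answered correctly by the
    VLM on the edited video; a value in [0, 1].

    ask_vlm(video, question, options) -> int is a placeholder for a
    real Qwen2.5-VL-7B call; its prompt format is an assumption.
    """
    if not mcqs:
        return 0.0
    correct = sum(
        ask_vlm(edited_video, q.question, q.options) == q.answer_index
        for q in mcqs
    )
    return correct / len(mcqs)
```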

