OpenVE-3M: A Large-Scale High-Quality Dataset for Instruction-Guided Video Editing

Notice: This research summary and analysis were automatically generated using AI. For authoritative details, please refer to the original arXiv paper.

The quality and diversity of instruction-based image editing datasets are continuously increasing, yet large-scale, high-quality datasets for instruction-based video editing remain scarce. To address this gap, we introduce OpenVE-3M, an open-source, large-scale, and high-quality dataset for instruction-based video editing. It comprises two primary categories: spatially-aligned edits (Global Style, Background Change, Local Change, Local Remove, Local Add, and Subtitles Edit) and non-spatially-aligned edits (Camera Multi-Shot Edit and Creative Edit). All edit types are generated via a meticulously designed data pipeline with rigorous quality filtering. OpenVE-3M surpasses existing open-source datasets in terms of scale, diversity of edit types, instruction length, and overall quality. Furthermore, to address the lack of a unified benchmark in the field, we construct OpenVE-Bench, containing 431 video-edit pairs that cover a diverse range of editing tasks, with three key metrics that align closely with human judgment. We present OpenVE-Edit, a 5B model trained on our dataset that demonstrates remarkable efficiency and effectiveness by setting a new state-of-the-art on OpenVE-Bench, outperforming all prior open-source models including a 14B baseline. The project page is at https://lewandofskee.github.io/projects/OpenVE.


💡 Research Summary

The paper addresses a critical gap in instruction‑guided video editing (IVE) research: the lack of large‑scale, high‑quality, and diverse datasets. The authors introduce OpenVE‑3M, an open‑source dataset comprising three million video‑edit pairs. The dataset is organized into two main groups—spatially‑aligned (SA) edits and non‑spatially‑aligned (NSA) edits—covering eight sub‑categories: Global Style, Background Change, Local Change, Local Remove, Local Add, Subtitles Edit, Camera Multi‑Shot Edit, and Creative Edit. Each edit type is generated through a carefully engineered three‑stage pipeline.

Stage 1 – Video Pre‑processing: A base library of one million high‑quality videos (65–129 frames at 720p) is assembled from Open‑Sora‑Plan, OpenViD‑HD, and UltraVideo. For each clip, Qwen2.5‑VL‑72B produces a textual description and extracts object names. Depth maps (Video‑DepthAnything) and edge maps (OpenCV Canny) are computed, while Grounded‑SAM2 provides detection and segmentation masks. Localized object descriptions are generated with DAM.
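
The clip-selection criteria above (65–129 frames at 720p) reduce to a simple metadata filter. The sketch below is illustrative only; the `Clip` record and field names are our own, not the paper's code:

```python
from dataclasses import dataclass

@dataclass
class Clip:
    """Hypothetical metadata record for one candidate source video."""
    num_frames: int
    height: int

def is_eligible(clip: Clip) -> bool:
    """Keep clips matching the base-library criteria: 65-129 frames at 720p."""
    return 65 <= clip.num_frames <= 129 and clip.height == 720

# Example: only the middle clip satisfies both criteria.
clips = [Clip(64, 720), Clip(100, 720), Clip(129, 480)]
eligible = [c for c in clips if is_eligible(c)]
```

In the actual pipeline this gate would sit in front of the captioning and depth/edge/mask extraction steps, so downstream models only see clips in the target length and resolution range.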

Stage 2 – Taxonomy‑Guided Generation: GPT‑4o creates detailed natural‑language instructions for each edit category. These instructions drive a suite of generative models: FLUX‑K for image‑level edits, Wan2.1‑Control for video‑level diffusion conditioned on Canny/depth maps, Seedance for multi‑shot and creative generation, and DiffEraser for in‑painting. The pipeline handles complex tasks such as precise background replacement (using IoU > 0.95 foreground masks), bidirectional local add/remove (synthesizing a source video by adding an object, then removing it to create the counterpart), and subtitle manipulation via rendering tools. NSA tasks leverage Seedance’s “Native Multi‑Shot” feature and creative I2V prompts.
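
The IoU > 0.95 gate on foreground masks mentioned above can be computed as follows on binary mask arrays. This is a generic numpy sketch of mask IoU, not the paper's implementation:

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-Union of two boolean foreground masks."""
    a, b = a.astype(bool), b.astype(bool)
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union else 1.0

# Example: two nearly identical 100x100 masks.
m1 = np.zeros((100, 100), dtype=bool)
m1[10:90, 10:90] = True            # 6400 foreground pixels
m2 = m1.copy()
m2[10, 10:30] = False              # perturb 20 pixels
iou = mask_iou(m1, m2)             # 6380/6400 = 0.996875, above the 0.95 gate
```

A pair of masks passing this threshold indicates the foreground subject is essentially unchanged, which is what background-replacement edits require.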

Stage 3 – High‑Quality Filtering: Three evaluation dimensions—Instruction Compliance, Consistency & Detail Fidelity, and Visual Quality & Stability—are scored 1–5 by large vision‑language models. Human annotation of 300 samples establishes an average score above 3 as the threshold for positive examples. Intern‑VL‑3.5‑38B is selected as the primary scorer (≈66% accuracy) and applied to the entire set; only pairs with a composite score ≥ 3 are retained, yielding the final 3M high‑quality samples. Compared to prior open‑source IVE datasets (InsViE‑1M, Señorita‑2M, Ditto‑1M), OpenVE‑3M offers the longest average instruction length (40.6 words), higher resolution (1280 × 720), more frames per clip, and a broader edit taxonomy.
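
Stage 3's retention rule amounts to averaging the three 1–5 dimension scores and keeping pairs at or above 3. A minimal sketch (function and variable names are ours, for illustration):

```python
def composite_score(instruction: float, consistency: float, quality: float) -> float:
    """Average the three 1-5 VLM dimension scores used in Stage 3 filtering."""
    return (instruction + consistency + quality) / 3.0

def keep_pair(scores: tuple[float, float, float], threshold: float = 3.0) -> bool:
    """Retain a video-edit pair only if its composite score meets the threshold."""
    return composite_score(*scores) >= threshold

# Example: the middle pair averages below 3 and is filtered out.
scored_pairs = [(5, 4, 4), (2, 3, 2), (3, 3, 3)]
retained = [s for s in scored_pairs if keep_pair(s)]
```

Applied across the full generated pool, this kind of threshold filter is what trims the raw pipeline output down to the 3M retained samples.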

To provide a unified evaluation platform, the authors construct OpenVE‑Bench, a curated benchmark of 431 video‑edit pairs spanning all eight categories. For each pair, three prompts target the same evaluation dimensions used in filtering; scores are obtained by feeding the original video, edited video, and instruction into a vision‑language model, ensuring strong alignment with human judgments.

The paper also introduces OpenVE‑Edit, a 5‑billion‑parameter instruction‑guided video editing model. Its architecture consists of:

  1. A multimodal large language model (MLLM) that jointly encodes video frames and textual instructions, extracting high‑level semantic representations while preserving spatio‑temporal context.
  2. An MoE‑Connector that routes these representations to specialized expert subnetworks tailored to each edit type (e.g., style transfer, object manipulation, shot transition). The final linear layers of the experts are initialized to zero to prevent random feature interference during early training.
  3. A Diffusion Transformer (DiT) that synthesizes the edited video, maintaining temporal consistency and high visual fidelity.
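
The zero-initialization trick in the MoE-Connector (point 2) can be illustrated with a toy numpy model: when an expert's final linear layer starts at zero, the expert outputs exactly zero at initialization, so untrained experts contribute no random noise to the features passed onward. This is an illustrative sketch under our own simplified architecture, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

class ExpertHead:
    """Toy two-layer expert; only the final projection is zero-initialized."""
    def __init__(self, d_in: int, d_hidden: int, d_out: int):
        self.w1 = rng.normal(scale=0.02, size=(d_in, d_hidden))  # random hidden layer
        self.w2 = np.zeros((d_hidden, d_out))                    # zero-initialized output weights
        self.b2 = np.zeros(d_out)                                # zero-initialized output bias

    def __call__(self, x: np.ndarray) -> np.ndarray:
        h = np.maximum(x @ self.w1, 0.0)  # ReLU hidden activation (nonzero)
        return h @ self.w2 + self.b2      # all zeros until w2/b2 receive gradient updates

expert = ExpertHead(16, 32, 16)
x = rng.normal(size=(4, 16))
out = expert(x)  # exactly zero: no random feature interference at the start of training
```

Gradients still flow into `w2` through the nonzero hidden activations, so the expert learns normally; it simply starts as an identity-preserving no-op from the downstream network's perspective.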

Training leverages the OpenVE‑3M dataset; evaluation on OpenVE‑Bench shows that OpenVE‑Edit surpasses all existing open‑source IVE models, including a 14‑billion‑parameter baseline, achieving state‑of‑the‑art performance especially on the more challenging NSA tasks. This demonstrates that a well‑designed data pipeline and modular architecture can yield superior results without resorting to massive model sizes.

In summary, the contributions are:

  • OpenVE‑3M: a 3 M‑sample, high‑quality, multi‑category video editing dataset with long, detailed instructions.
  • Robust Construction Pipeline: a scalable three‑stage process combining LLMs, detection/segmentation, depth/edge estimation, and diverse generative models, followed by VLM‑based quality filtering.
  • OpenVE‑Bench: a comprehensive benchmark with human‑aligned evaluation metrics.
  • OpenVE‑Edit: a 5B‑parameter model that sets a new SOTA on the benchmark, outperforming larger open‑source counterparts.

The work paves the way for future research in instruction‑guided video editing, encouraging exploration of more complex edits (e.g., physics‑based effects), multimodal interaction (audio, gestures), and real‑time or user‑personalized editing systems built upon the OpenVE‑3M foundation.

