FilmWeaver: Weaving Consistent Multi-Shot Videos with Cache-Guided Autoregressive Diffusion
Current video generation models perform well at single-shot synthesis but struggle with multi-shot videos, facing critical challenges in maintaining character and background consistency across shots and flexibly generating videos of arbitrary length and shot count. To address these limitations, we introduce **FilmWeaver**, a novel framework designed to generate consistent, multi-shot videos of arbitrary length. First, it employs an autoregressive diffusion paradigm to achieve arbitrary-length video generation. To address the challenge of consistency, our key insight is to decouple the problem into inter-shot consistency and intra-shot coherence. We achieve this through a dual-level cache mechanism: a shot memory caches keyframes from preceding shots to maintain character and scene identity, while a temporal memory retains a history of frames from the current shot to ensure smooth, continuous motion. The proposed framework allows for flexible, multi-round user interaction to create multi-shot videos. Furthermore, due to this decoupled design, our method demonstrates high versatility by supporting downstream tasks such as multi-concept injection and video extension. To facilitate the training of our consistency-aware method, we also developed a comprehensive pipeline to construct a high-quality multi-shot video dataset. Extensive experimental results demonstrate that our method surpasses existing approaches on metrics for both consistency and aesthetic quality, opening up new possibilities for creating more consistent, controllable, and narrative-driven video content. Project Page: https://filmweaver.github.io
💡 Research Summary
FilmWeaver tackles the under‑explored problem of generating multi‑shot videos that are both visually consistent and arbitrarily long. Existing video diffusion models excel at single‑shot synthesis but fail to preserve character, background, and style continuity across shot boundaries, and they typically require a fixed number of frames per clip. FilmWeaver’s core contribution is a dual‑level cache system that explicitly separates inter‑shot consistency from intra‑shot coherence and injects these memories into an autoregressive diffusion pipeline without altering the underlying architecture.
The Shot Cache stores a small set of keyframes from previously generated shots. Keyframes are selected by measuring cosine similarity between the CLIP embedding of the current textual prompt and the CLIP image embeddings of all candidate frames, then taking the top‑K most relevant frames. This cache acts as a long‑term memory, providing the model with a concise visual summary of the narrative so far, thereby ensuring that characters, objects, and environments reappear with the same appearance in later shots.
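The retrieval step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the CLIP text and image embeddings have already been computed and share one embedding space, and uses random stand-in vectors for the toy usage at the bottom.

```python
import numpy as np

def select_keyframes(prompt_emb, frame_embs, k=4):
    """Pick the top-k candidate frames whose (precomputed) CLIP image
    embeddings best match the CLIP text embedding of the current prompt,
    ranked by cosine similarity. Shapes: prompt_emb (d,), frame_embs (n, d)."""
    p = prompt_emb / np.linalg.norm(prompt_emb)
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    sims = f @ p                       # cosine similarity of each frame to the prompt
    topk = np.argsort(-sims)[:k]       # indices of the k most relevant frames
    return topk, sims[topk]

# Toy usage with random stand-in embeddings (real use would call a CLIP encoder).
rng = np.random.default_rng(0)
prompt = rng.standard_normal(512)
frames = rng.standard_normal((10, 512))
idx, scores = select_keyframes(prompt, frames, k=3)
```

The selected frames (indexed by `idx`) would then populate the Shot Cache as the long-term visual memory for subsequent shots.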
The Temporal Cache operates at a finer granularity within a single shot. It maintains a sliding window of the most recent frames, using a differential compression scheme: frames close to the generation point are stored at high fidelity, while older frames are progressively compressed. This short‑term memory guarantees smooth motion and prevents flickering, while keeping computational overhead manageable.
FilmWeaver adopts an autoregressive diffusion framework (e.g., a 3D‑DiT backbone). During denoising, the model receives three conditioning inputs: the text prompt, the Temporal Cache, and the Shot Cache. The loss remains the standard mean‑squared error between predicted and true noise, but the conditioning vectors are concatenated to the latent representation, effectively performing in‑context learning. Because the cache is injected as additional inputs rather than architectural changes, the method is compatible with any pre‑trained text‑to‑video diffusion model.
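The in-context injection amounts to concatenating cache tokens with the noisy latent along the sequence axis before it enters the backbone. The token ordering and dimensions below are hypothetical; the point is that no new modules are required, only a longer input sequence.

```python
import numpy as np

def build_conditioned_sequence(latent_tokens, text_tokens,
                               shot_cache_tokens, temporal_cache_tokens):
    """Concatenate conditioning tokens with the noisy latent along the
    sequence axis so a pre-trained diffusion transformer can attend to
    them in-context, with no architectural changes. All inputs are
    (num_tokens, dim) arrays sharing the same feature dimension."""
    return np.concatenate(
        [text_tokens, shot_cache_tokens, temporal_cache_tokens, latent_tokens],
        axis=0,
    )

# Toy shapes: 8 text tokens, 2 shot-cache tokens, 6 temporal-cache
# tokens, and 16 latent tokens, all with feature dimension 4.
seq = build_conditioned_sequence(
    np.zeros((16, 4)), np.zeros((8, 4)), np.zeros((2, 4)), np.zeros((6, 4))
)
```

Only the positions corresponding to `latent_tokens` would contribute to the standard noise-prediction MSE loss; the cache tokens serve purely as context.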
Training proceeds in two stages. Stage 1 focuses solely on long single‑shot generation, disabling the Shot Cache and training the model with only the Temporal Cache. This allows the network to master motion dynamics without the added complexity of cross‑shot dependencies. Stage 2 introduces the Shot Cache and trains on a mixed curriculum that includes all four cache configurations: (1) no cache (pure text‑to‑video), (2) Temporal‑only, (3) Shot‑only, and (4) Full‑cache. This progressive curriculum stabilizes convergence and equips the model to handle arbitrary sequences of shots and extensions.
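The curriculum above can be expressed as a small sampler over the four cache configurations. The Stage 2 mixing weights here are invented for illustration; the source does not state the actual proportions.

```python
import random

# The four cache configurations named in the curriculum.
CACHE_CONFIGS = ["none", "temporal_only", "shot_only", "full"]

def sample_cache_config(stage, rng=random):
    """Stage 1 trains with the Temporal Cache only (Shot Cache disabled).
    Stage 2 mixes all four configurations; the weights below are
    illustrative assumptions, not the paper's values."""
    if stage == 1:
        return "temporal_only"
    return rng.choices(CACHE_CONFIGS, weights=[0.1, 0.2, 0.2, 0.5], k=1)[0]
```

Sampling a configuration per training example is one simple way to realize a mixed curriculum; it ensures the model also remains usable as a plain text-to-video generator when both caches are empty.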
A major bottleneck for multi‑shot research is the scarcity of suitable datasets. The authors therefore construct a high‑quality multi‑shot video dataset via a dedicated pipeline: (i) automatic shot boundary detection and clustering into scenes, (ii) multi‑level agent annotation that provides detailed captions and selects representative keyframes, and (iii) refinement steps to ensure consistent labeling across shots. This dataset supplies the necessary supervision for both cache learning and evaluation.
Evaluation introduces three new consistency metrics—Identity Consistency, Background Consistency, and Temporal Smoothness—alongside standard video quality scores (PSNR, FID, CLIP‑Score). Human studies confirm that FilmWeaver’s outputs are perceived as more coherent and narratively stable than prior methods such as TTT, LCT, and EchoShot. Moreover, the framework supports downstream tasks: multi‑concept injection (by manually seeding the Shot Cache with additional characters), interactive shot extension, and flexible shot‑by‑shot editing, making it a practical tool for filmmakers, advertisers, and game developers.
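Since the summary does not reproduce the exact metric definitions, the following is only a hedged stand-in: one plausible way to score identity consistency as the mean pairwise cosine similarity between per-shot identity embeddings (e.g., character or face features extracted by some off-the-shelf encoder).

```python
import numpy as np

def identity_consistency(shot_embs):
    """Mean pairwise cosine similarity between per-shot identity
    embeddings. A stand-in sketch of an Identity-Consistency-style
    metric; the paper's precise formulation is not given here.
    shot_embs: (num_shots, dim) array, one embedding per shot."""
    e = np.asarray(shot_embs, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    sims = e @ e.T                        # all pairwise cosine similarities
    iu = np.triu_indices(len(e), k=1)     # unique shot pairs only
    return float(sims[iu].mean())
```

A score near 1.0 indicates the same identity appearance across shots; near 0 indicates unrelated appearances. Background Consistency could be scored analogously on scene-level features.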
Limitations include sensitivity to cache hyper‑parameters (e.g., number of keyframes K, compression ratios) and the computational cost of updating caches during real‑time interaction. Future work may explore adaptive cache selection, more efficient compression, and integration with real‑time rendering pipelines.
In summary, FilmWeaver presents a memory‑augmented autoregressive diffusion approach that simultaneously achieves inter‑shot consistency and intra‑shot coherence, enabling the generation of high‑quality, arbitrarily long multi‑shot videos without redesigning the underlying diffusion architecture. This represents a significant step toward controllable, narrative‑driven video synthesis.