An Innovative Approach to Injecting World-Knowledge Memory into Diffusion Transformer Video Generation
📝 Abstract
Diffusion Transformer (DiT) based video generation models have recently achieved impressive visual quality and temporal coherence, but they still frequently violate basic physical laws and commonsense dynamics, revealing a lack of explicit world knowledge. In this work, we explore how to equip them with a plug-and-play memory that injects useful world knowledge. Motivated by in-context memory in Transformer-based LLMs, we conduct empirical studies showing that DiT can be steered via interventions on its hidden states, and that simple low-pass and high-pass filters in the embedding space naturally disentangle low-level appearance and high-level physical/semantic cues, enabling targeted guidance. Building on these observations, we propose DiT-Mem, a learnable memory encoder composed of stacked 3D CNNs, low-/high-pass filters, and self-attention layers. The encoder maps reference videos into a compact set of memory tokens, which are concatenated as in-context memory within the DiT self-attention layers. During training, we keep the diffusion backbone frozen and optimize only the memory encoder. This yields an efficient training process with few trainable parameters (e.g. 150M) and only 10K data samples, and enables plug-and-play usage at inference time. Extensive experiments on state-of-the-art models demonstrate the effectiveness of our method in improving physical-rule following and video fidelity. Our code and data are publicly released at https://thrcle421.github.io/DiT-Mem-Web/.
📄 Content
Recent advances in diffusion-based generative models have dramatically improved the visual quality and temporal coherence of video generation [13,14]. Built on the diffusion Transformer (DiT) architecture [33], recent models [21,48,54] leverage scaling in both model size and training data, achieving remarkable capability in producing high-fidelity video clips. Yet, generating truly realistic videos remains a major challenge. Beyond sharp frames and smooth motion, videos must obey the same physical laws and commonsense dynamics that govern the real world. In practice, even state-of-the-art models often violate basic physics and commonsense, e.g. objects passing through solid walls, liquids flowing upward, or collisions that leave both objects completely unaffected. Such failures highlight the broader issue that current video diffusion models still lack sufficient world knowledge to reliably enforce physical and commonsense consistency during generation.
Equipping video diffusion models with rich world knowledge remains a fundamental yet challenging problem. Existing efforts are predominantly training-based and follow two main strategies. One line of work scales up model capacity and training data in the hope that broad exposure to diverse scenes allows the model to implicitly internalize the regularities of the world [34]. Another line introduces additional constraints or supervision into the training objective [27,38,57]. However, both strategies substantially increase training cost and complexity, and still struggle to comprehensively cover the vast and diverse knowledge required to handle real-world scenarios [20].
In this paper, we aim to develop an efficient solution that equips video diffusion models with a memory providing useful world knowledge to guide generation. Inspired by the success of in-context memory for Transformer-based LLMs [5,6,24,52], we consider a plug-and-play memory design for DiT-based video generation models. To this end, we conduct preliminary experiments to study whether we can guide DiT through in-context intervention. We retrieve a few relevant videos, encode them into embeddings, and then inject these embeddings into the hidden states of the DiT during inference. Surprisingly, we find that performing low-pass and high-pass filtering [7,22] in the embedding space can guide the DiT to generate the desired objects and to better follow physical rules, respectively. This suggests that DiT supports guidance via interventions on its hidden states, and that straightforward filtering in the frequency domain can disentangle low-level and high-level concept embeddings that effectively steer the generation process.
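The frequency-domain disentanglement described above can be illustrated with a small sketch: complementary FFT masks along the token axis split an embedding sequence into a low-pass component (slowly varying) and a high-pass component (rapidly varying). The function name, the normalized cutoff, and the FFT-based masking are illustrative assumptions, not the paper's exact filter design.

```python
import numpy as np

def split_frequency_bands(x, cutoff):
    """Split token embeddings x (T, D) into complementary low- and
    high-frequency components along the token axis via FFT masks.
    `cutoff` is a normalized frequency in [0, 0.5] (hypothetical knob)."""
    X = np.fft.fft(x, axis=0)
    freqs = np.fft.fftfreq(x.shape[0])            # normalized bin frequencies
    low_mask = (np.abs(freqs) <= cutoff)[:, None]
    low = np.fft.ifft(X * low_mask, axis=0).real
    high = np.fft.ifft(X * ~low_mask, axis=0).real
    return low, high

rng = np.random.default_rng(0)
emb = rng.standard_normal((16, 8))                # 16 tokens, 8-dim embeddings
low, high = split_frequency_bands(emb, cutoff=0.1)
# The two masks are complementary, so low + high reconstructs emb exactly.
```

Because the decomposition is exact, either branch can be injected on its own without losing the other's information.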
Building on these findings, we propose DiT-Mem, a learnable, general-purpose memory module that can be plugged into existing video diffusion models to provide guidance during generation. Specifically, we design a lightweight memory encoder composed of stacked 3D CNN layers for spatiotemporal downsampling, high-pass and low-pass filters to extract disentangled high-level (semantic/dynamic) and low-level (appearance/texture) features, and a self-attention layer for feature aggregation. Given a reference video, DiT-Mem compresses it into a small set of memory embeddings, which can be concatenated with hidden states from self-attention layers of the DiT as in-context memory tokens. During training, we only update the parameters of the memory encoder while keeping the DiT backbone frozen, resulting in a parameter-efficient learning process and a truly plug-and-play memory that can be used to guide generation at inference time.
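The key injection step, concatenating memory tokens into the DiT self-attention, can be sketched as follows: queries come from the hidden states only, while keys and values additionally attend over the memory, so the sequence length of the output is unchanged. All names, shapes, and the single-head formulation here are hypothetical simplifications of the described design, not the released code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_memory(h, mem, Wq, Wk, Wv):
    """Single-head attention where memory tokens join the keys/values
    only: queries come from the DiT hidden states h (N, D), so the
    output keeps N tokens while attending over N + M positions."""
    q = h @ Wq                                    # queries: hidden states only
    kv = np.concatenate([h, mem], axis=0)         # (N + M, D) in-context memory
    k, v = kv @ Wk, kv @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])       # (N, N + M) attention logits
    return softmax(scores) @ v                    # (N, D) memory-guided output

rng = np.random.default_rng(0)
D = 8
h = rng.standard_normal((6, D))                   # 6 video latent tokens
mem = rng.standard_normal((3, D))                 # 3 compact memory tokens
Wq, Wk, Wv = (0.1 * rng.standard_normal((D, D)) for _ in range(3))
out = attention_with_memory(h, mem, Wq, Wk, Wv)   # shape (6, 8)
```

Since memory tokens enter only as extra keys/values, the frozen DiT's output shape and residual stream are untouched, which is what makes the module plug-and-play.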
In practice, we only need to fine-tune a small number of parameters (e.g. 150M) on 10K data samples, yielding an efficient training recipe for learning a general-purpose memory for frozen DiT-based video diffusion models. To demonstrate the effectiveness of our approach, we integrate the proposed memory module into state-of-the-art open-source models (e.g. Wan2.1 T2V-1.3B [46] and Wan2.2 TI2V-5B [48]) and evaluate them on a diverse suite of video generation benchmarks. Across multiple evaluation settings covering object controllability, physical consistency, and long-horizon dynamics, our method yields improvements in visual quality and rule compliance, even surpassing closed-source models (e.g. Pika [36] and Kling [23]).
DiT-based Video Generation Model. We build on a text-to-video (T2V) generation model where a DiT serves as the backbone. A DiT-based video generation model typically consists of a text encoder, a vision Transformer (ViT), and a pre-trained variational autoencoder (VAE) [40]. Given a natural-language prompt p as input, the text encoder first produces a conditioning embedding e(p). At inference time, starting from randomly sampled pure noise in the latent space, the ViT iteratively denoises the latent under the guidance of the text condition. The final clean latent z_0 is decoded by the VAE into the output video v = D(z_0) that realizes the semantics of the prompt.
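The sampling process described above can be summarized in a toy loop (a simplified Euler-style sketch, not Wan's actual scheduler); `denoise_fn` stands in for the text-conditioned ViT, and the returned z_0 would then be decoded by the VAE.

```python
import numpy as np

def sample_video_latent(denoise_fn, text_emb, steps=10, shape=(4, 8), seed=0):
    """Toy sampling loop: start from pure latent noise z_T and apply the
    text-conditioned denoiser step by step; the returned z_0 would then
    be decoded by the VAE into the video v = D(z_0)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(shape)                # z_T: pure noise in latent space
    for t in range(steps, 0, -1):
        eps_hat = denoise_fn(z, t, text_emb)      # predicted noise at step t
        z = z - eps_hat / steps                   # simple Euler-style update
    return z                                      # z_0: clean latent

# Toy denoiser that "predicts" the latent itself, so every step shrinks
# z by a factor of (1 - 1/steps); a real model predicts actual noise.
z0 = sample_video_latent(lambda z, t, e: z, text_emb=None)
```

In DiT-Mem, the memory tokens would enter inside `denoise_fn` at the self-attention layers; the outer sampling loop itself is unchanged.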
This content is AI-processed based on ArXiv data.