When to Lock Attention: Training-Free KV Control in Video Diffusion


Maintaining background consistency while enhancing foreground quality remains a core challenge in video editing. Injecting full-image information often leads to background artifacts, whereas rigid background locking severely constrains the model’s capacity for foreground generation. To address this issue, we propose KV-Lock, a training-free framework tailored for DiT-based video diffusion models. Our core insight is that the hallucination metric (the variance of the denoising prediction) directly quantifies generation diversity, which is inherently linked to the classifier-free guidance (CFG) scale. Building on this, KV-Lock leverages diffusion hallucination detection to dynamically schedule two key components: the fusion ratio between cached background key-values (KVs) and newly generated KVs, and the CFG scale. When hallucination risk is detected, KV-Lock strengthens background KV locking and simultaneously amplifies conditional guidance for foreground generation, thereby mitigating artifacts and improving generation fidelity. As a training-free, plug-and-play module, KV-Lock can be easily integrated into any pre-trained DiT-based model. Extensive experiments validate that our method outperforms existing approaches, achieving improved foreground quality with high background fidelity across various video editing tasks.


💡 Research Summary

The paper tackles a persistent problem in video diffusion editing: preserving background fidelity while allowing substantial modifications to foreground objects. Existing training‑based approaches achieve decent results but require costly fine‑tuning, whereas current training‑free methods often inject full‑image information or manipulate attention in a coarse manner, leading to background artifacts or degraded foreground quality. To resolve this “when to lock attention” dilemma, the authors propose KV‑Lock, a training‑free plug‑and‑play framework for DiT‑based video diffusion models.

The central insight is that the variance of the model’s predicted clean sample (ˆx₀) over a range of denoising timesteps—referred to as the hallucination metric—directly reflects generation diversity. Prior work has shown that this diversity is tightly coupled with the classifier‑free guidance (CFG) scale. By monitoring the hallucination metric in real time, KV‑Lock can infer the risk of hallucination at each timestep.
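The metric above can be sketched concretely. A minimal, illustrative implementation (the window size and the reduction to a scalar score are assumptions, not details from the paper) computes the per-element variance of the x̂₀ predictions across a sliding window of recent timesteps and averages it into a single risk score:

```python
import numpy as np

def hallucination_metric(x0_preds):
    """Variance-based hallucination score.

    x0_preds: sequence of W arrays, the predicted clean samples (x-hat_0)
    from the last W denoising timesteps. Higher variance across the window
    means higher generation diversity, i.e. higher hallucination risk.
    """
    preds = np.stack(x0_preds, axis=0)      # shape (W, ...)
    # Per-element variance over the window, reduced to a scalar score.
    return float(preds.var(axis=0).mean())
```

With identical predictions across the window the score is exactly 0 (no diversity); any disagreement between steps pushes it above 0.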

KV‑Lock first caches the key‑value (KV) pairs of the source video’s background tokens during a forward pass of the original video. During editing, a token‑level mask derived from the user‑provided spatial mask determines which tokens receive cached KV (background) and which receive newly generated KV (foreground). The attention mechanism then computes a weighted combination of cached and fresh KV, controlled by a dynamic fusion coefficient αₖ. αₖ is computed as αₖ = clamp(σ²ₖ / τ, 0, 1), where σ²ₖ is the variance of ˆx₀ at timestep k and τ is an empirically set threshold. When the variance is low (low hallucination risk), αₖ is small, allowing more influence from newly generated KV and thus greater foreground flexibility. When variance spikes, αₖ approaches 1, strongly locking the background to its cached representation.

Simultaneously, the CFG scale is modulated by the same variance signal: higher variance triggers a larger CFG scale, strengthening the conditional guidance from the target prompt and steering the foreground toward the desired attributes (color, pose, etc.). This dual‑dynamic scheduling transforms the heuristic “when to lock” decision into a principled, variance‑driven process.
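One way to realize this coupling is a linear schedule driven by the same clamped risk signal. The mapping below and its constants (base and maximum scales) are illustrative assumptions; the paper specifies only that higher variance triggers a larger CFG scale. The guidance combination itself is the standard classifier-free guidance formula:

```python
def dynamic_cfg_scale(var_k, tau, base_scale=7.5, max_scale=12.0):
    """Hypothetical schedule: interpolate the CFG scale with the
    clamped hallucination risk. Constants are illustrative."""
    risk = min(max(var_k / tau, 0.0), 1.0)
    return base_scale + risk * (max_scale - base_scale)

def cfg_combine(eps_uncond, eps_cond, scale):
    # Standard classifier-free guidance: push the prediction along the
    # conditional direction by the guidance scale.
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

Zero risk leaves the base scale untouched; saturated risk applies the maximum scale, strengthening the pull toward the target prompt exactly when the background lock is tightest.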

Implementation-wise, KV‑Lock requires only a few additional steps: (1) encode the video with a 3‑D VAE, (2) generate a token‑level binary mask via max‑pooling, (3) extract KV pairs from each DiT block at the same timesteps used for denoising, (4) compute the hallucination metric on‑the‑fly, and (5) fuse KV and adjust CFG per timestep. No extra training or fine‑tuning is needed, making the method compatible with any pre‑trained DiT model.
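Step (2) of the recipe above, the max-pooled token mask, can be sketched as follows. The patch size and the assumption that the token grid tiles the pixel mask evenly are illustrative; the rule itself (a token is foreground if any pixel in its patch is masked) is what max-pooling a binary mask computes:

```python
import numpy as np

def token_mask_from_pixel_mask(pixel_mask, patch):
    """Downsample a binary spatial mask to token resolution via max-pooling.

    pixel_mask: (H, W) binary array from the user-provided spatial mask.
    patch: side length of the square pixel region covered by one token.
    Returns a (H // patch, W // patch) binary token-level mask.
    """
    h, w = pixel_mask.shape
    m = pixel_mask[: h - h % patch, : w - w % patch]
    m = m.reshape(h // patch, patch, w // patch, patch)
    return m.max(axis=(1, 3))
```

A single masked pixel is enough to flag its whole token, which errs on the side of treating boundary tokens as foreground rather than locking them to the cached background.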

Extensive experiments compare KV‑Lock against state‑of‑the‑art methods such as VACE, ProEdit, and LongLive across both reference‑based and reference‑free editing tasks. Quantitative metrics (FVD, CLIP‑Score, PSNR, SSIM) show consistent improvements: background fidelity rises by up to 12 % and foreground quality gains 0.08–0.15 CLIP‑Score points. Qualitative results demonstrate that even under aggressive transformations (e.g., drastic color changes, pose alterations), the background remains virtually unchanged while the foreground accurately follows the target prompt.

The authors acknowledge limitations: the variance‑based risk estimator depends on the chosen timestep window and the threshold τ, which may need dataset‑specific tuning; KV‑Lock is currently tailored to transformer‑based diffusion models and may not directly apply to CNN‑based architectures. Future work is suggested to explore alternative uncertainty measures (e.g., entropy), multi‑scale KV caching, and extensions to non‑transformer diffusion models.

In summary, KV‑Lock offers a theoretically grounded, training‑free solution to the long‑standing trade‑off between background preservation and foreground creativity in video diffusion editing, delivering superior visual quality with minimal engineering overhead.

