CHAI: CacHe Attention Inference for text2video
Text-to-video diffusion models deliver impressive results but remain slow because of the sequential denoising of 3D latents. Existing approaches to speeding up inference either require expensive model retraining or rely on heuristic step skipping, which struggles to maintain video quality as the number of denoising steps decreases. Our work, CHAI, uses cross-inference caching to reduce latency while maintaining video quality. We introduce Cache Attention, an effective method for attending to shared objects/scenes across cross-inference latents. This selective attention mechanism enables effective reuse of cached latents across semantically related prompts, yielding high cache hit rates. We show that it is possible to generate high-quality videos using Cache Attention with as few as 8 denoising steps. When integrated into the overall system, CHAI is 1.65×–3.35× faster than baseline OpenSora 1.2 while maintaining video quality.
💡 Research Summary
The paper introduces CHAI (Cache Attention Inference), a training‑free acceleration framework for text‑to‑video diffusion models that dramatically reduces inference latency while preserving visual quality. The authors first identify the core bottleneck of modern video diffusion: the sequential denoising of high‑dimensional spatio‑temporal latents, which typically requires 30–50 steps to achieve acceptable results. Existing acceleration strategies fall into two categories. Training‑based methods demand costly retraining and are tightly coupled to specific architectures, making them impractical for deployment. Training‑free intra‑inference caching skips intermediate steps by reusing latents within a single generation run, but because early steps introduce large structural changes, aggressive skipping leads to severe degradation in motion consistency and object fidelity.
CHAI tackles this limitation by moving the caching granularity from whole prompts to shared semantic entities (objects or scenes) across different inference runs. The authors observe that, even though video prompts are often unique, a large proportion share at least one common entity (e.g., “tiger”, “beach”). Leveraging this, CHAI builds an entity‑level cross‑inference cache. An Entity Extractor parses incoming prompts, producing a set of entity tokens. These tokens are embedded and stored in a vector database together with metadata linking each embedding to a previously cached latent (the latent is the 3‑D representation produced at a specific denoising step of a prior generation). A lightweight LRU policy manages storage, keeping the cache size within 1–5 GB.
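The flow described above can be sketched as a small entity-level cache: entities extracted from a prompt are embedded, matched against stored entries by cosine similarity, and evicted with an LRU policy. This is a minimal illustrative sketch; the class and method names (`EntityCache`, `lookup`, the similarity threshold) are assumptions, and a linear scan stands in for the paper's vector database.

```python
from collections import OrderedDict
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class EntityCache:
    """LRU cache mapping entity embeddings to cached denoising latents."""

    def __init__(self, capacity=1000, threshold=0.9):
        self.capacity = capacity    # max cached entities (bounds storage)
        self.threshold = threshold  # min similarity counted as a hit
        self.store = OrderedDict()  # entity name -> (embedding, latent)

    def put(self, entity, embedding, latent):
        if entity in self.store:
            self.store.move_to_end(entity)
        self.store[entity] = (embedding, latent)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used

    def lookup(self, embedding):
        # Linear scan stands in for a vector-database nearest-neighbor query.
        best, best_sim = None, 0.0
        for name, (emb, _latent) in self.store.items():
            sim = cosine(embedding, emb)
            if sim >= self.threshold and sim > best_sim:
                best, best_sim = name, sim
        if best is None:
            return None  # cache miss: fall back to full denoising
        self.store.move_to_end(best)  # refresh LRU recency on a hit
        return self.store[best][1]
```

On a miss, the system would run the usual higher-step generation and `put` the resulting entity latents back into the cache, so coverage grows as the deployment serves more prompts.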
The technical novelty lies in the “Cache Attention” module, which replaces the spatial self‑attention layer in the STDiT block of OpenSora 1.2. In standard attention, query (Q), key (K) and value (V) all derive from the current latent. In Cache Attention, Q remains the prompt‑conditioned Gaussian noise of the current step, while K and V are taken from the cached latent that matches the current step and shares the relevant entities. This design allows the model to selectively attend to cached entity features without overwriting the prompt‑specific noise, effectively injecting useful structural information while preserving the ability to generate novel details. The authors empirically find that using cached K/V for the 2nd, 3rd, and 4th denoising steps (and only for the first block of each step) yields the best trade‑off between speed and quality; earlier steps are avoided because Q is pure noise, and later steps provide diminishing returns while increasing storage cost.
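The asymmetric Q/K/V routing can be illustrated with a stripped-down single-head attention: Q is projected from the current step's latent while K and V are projected from the cached latent. This is a sketch of the mechanism only; the real module operates inside OpenSora's STDiT blocks with multi-head attention over spatio-temporal tokens, and the projection-matrix arguments here are illustrative.

```python
import numpy as np

def cache_attention(current_latent, cached_latent, wq, wk, wv):
    """Single-head sketch: Q from the current latent, K/V from the cache.

    current_latent: (tokens, dim) prompt-conditioned latent at this step
    cached_latent:  (tokens, dim) matching-step latent retrieved from cache
    wq, wk, wv:     (dim, dim) projection matrices (hypothetical stand-ins)
    """
    Q = current_latent @ wq   # queries keep the prompt-specific signal
    K = cached_latent @ wk    # keys come from the cached entity latent
    V = cached_latent @ wv    # values inject cached structural features
    scores = (Q @ K.T) / np.sqrt(Q.shape[-1])
    # Numerically stable softmax over the cached tokens.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

Because only K and V are swapped, the output is a cache-informed update of the current latent rather than a copy of the cached one, which matches the paper's goal of borrowing structure while still generating novel details.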
Experimental evaluation uses OpenSora 1.2 as the backbone. The authors cache latents from a 30‑step run and then run accelerated inference with only 8 denoising steps. With a realistic cache size (≈300–1000 prompts), entity‑based hit rates reach 80%–100%, compared to <40% when using whole‑prompt similarity. CHAI achieves 1.65×–3.35× end‑to‑end speedup over baseline OpenSora 1.2 while maintaining comparable FVD, IS, and CLIP‑Score metrics. Qualitative examples show that, for the prompt “A beautiful coastal beach in spring, waves lapping on sand,” CHAI can extract the “beach” component from a cached “A party on a beach” latent, producing a clean video with minimal noise, whereas the naive NIRVANA‑VID adaptation introduces party‑related artifacts and the 8‑step OpenSora baseline fails to render coherent structures.
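The gap between entity-level and whole-prompt hit rates comes down to the matching criterion, which the toy functions below contrast: a prompt hits the entity cache if it shares any entity with a cached prompt, whereas whole-prompt matching requires the entire prompt to match. The trivial set-intersection "extractor" is an assumption for illustration; CHAI's actual Entity Extractor and embedding-based similarity are more sophisticated.

```python
def entity_hit(prompt_entities, cached_entity_sets):
    """Hit if the prompt shares at least one entity with any cached prompt."""
    return any(prompt_entities & cached for cached in cached_entity_sets)

def whole_prompt_hit(prompt, cached_prompts):
    """Hit only if the whole prompt was seen before -- far rarer in practice."""
    return prompt in cached_prompts
```

For the paper's beach example, `{"beach", "spring", "waves"}` hits a cache containing `{"party", "beach"}` at the entity level even though the two prompts as a whole are completely different, which is why entity-level caching sustains high hit rates on unique prompts.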
The paper also discusses storage‑efficiency analyses, showing that a 2 GB cache already yields >80 % hit rates, and that the LRU eviction strategy keeps the system scalable for continuous deployment. Limitations include dependence on the quality of entity extraction and the assumption that cached latents are generated with a higher‑step schedule (30 steps) to provide richer structural cues.
In summary, CHAI demonstrates that cross‑inference caching at the entity level, combined with a carefully designed Cache Attention mechanism, can cut the number of denoising steps by a factor of four without sacrificing visual fidelity. This training‑free approach offers a practical path toward real‑time or interactive text‑to‑video generation, and opens avenues for further research on multi‑modal entity representations, cache compression, and broader applicability to other large‑scale video diffusion architectures.