REST: Diffusion-based Real-time End-to-end Streaming Talking Head Generation via ID-Context Caching and Asynchronous Streaming Distillation
Diffusion models have significantly advanced the field of talking head generation (THG). However, slow inference speeds and prevalent non-autoregressive paradigms severely constrain the application of diffusion-based THG models. In this study, we propose REST, a pioneering diffusion-based, real-time, end-to-end streaming audio-driven talking head generation framework. To support real-time end-to-end generation, a compact video latent space is first learned through a spatiotemporal variational autoencoder with a high compression ratio. Additionally, to enable semi-autoregressive streaming within the compact video latent space, we introduce an ID-Context Cache mechanism, which integrates ID-Sink and Context-Cache principles into key-value caching for maintaining identity consistency and temporal coherence during long-term streaming generation. Furthermore, an Asynchronous Streaming Distillation (ASD) strategy is proposed to mitigate error accumulation and enhance temporal consistency in streaming generation, leveraging a non-streaming teacher with an asynchronous noise schedule to supervise the streaming student. REST bridges the gap between autoregressive and diffusion-based approaches, achieving a breakthrough in efficiency for applications requiring real-time THG. Experimental results demonstrate that REST outperforms state-of-the-art methods in both generation speed and overall performance.
💡 Research Summary
The paper introduces REST, a novel diffusion‑based framework that enables real‑time, end‑to‑end streaming talking‑head generation driven by audio. Existing diffusion models achieve high visual fidelity and accurate lip‑sync but suffer from prohibitive inference latency (tens to hundreds of seconds for a few seconds of video) and a non‑streaming design that forces users to wait for the whole video to be synthesized. Autoregressive (AR) approaches, on the other hand, support streaming by generating intermediate motion representations sequentially, yet their two‑stage pipelines limit visual quality because the motion representation is less expressive than direct full‑frame generation.
REST bridges this gap through three key innovations:
- Temporal VAE for Compact Latent Space – A spatiotemporal variational auto‑encoder compresses raw video frames into a highly compact latent tensor (32 × 32 × 8 tokens per frame). This 256‑fold reduction in spatial area, combined with strong temporal down‑sampling, dramatically lowers the computational burden of diffusion while preserving essential motion cues.
- ID‑Context Cache – Integrated into a DiT (Diffusion Transformer) backbone, this mechanism combines two principles:
- ID‑Sink treats the key‑value embeddings of the reference image as a permanent attention sink, prepended to the keys and values of every attention block, guaranteeing that the generated frames retain the speaker’s identity throughout streaming.
- Context‑Cache concatenates the KV pairs of the previous chunk with those of the current chunk, effectively extending the receptive field across chunk boundaries. The resulting attention thus operates over the concatenation of the ID‑Sink KV pairs, the cached previous‑chunk KV pairs, and the current chunk’s KV pairs.
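The compression claim in the first bullet can be checked with simple arithmetic. The 32 × 32 × 8 latent shape is from the summary; the 512 × 512 RGB input resolution is an assumption for illustration (it is not stated above), and the temporal down‑sampling factor is left out since its value is not given:

```python
# Illustrative compression arithmetic for the temporal VAE latent space.
# Assumption (not stated in the summary): 512x512 RGB input frames.
# The 32x32x8 latent tokens-per-frame figure is from the summary.
H, W, C = 512, 512, 3        # assumed input frame shape
h, w, c = 32, 32, 8          # latent shape per frame (per the summary)

spatial_reduction = (H * W) / (h * w)            # area reduction per frame
values_per_frame_in = H * W * C                  # raw values per input frame
values_per_frame_lat = h * w * c                 # latent values per frame
overall_per_frame = values_per_frame_in / values_per_frame_lat

print(spatial_reduction)   # 256.0 -- matches the "256-fold" claim
print(overall_per_frame)   # 96.0  -- per-frame value reduction, before temporal down-sampling
```

On top of this per‑frame reduction, the temporal down‑sampling shrinks the number of latent frames further, which is what makes diffusion over the latent video tractable in real time.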
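The two sub‑bullets above can be sketched as a single attention step over the concatenated KV sets. This is a minimal single‑head sketch with 2‑D (tokens × dim) arrays; the class and method names are hypothetical and not from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class IDContextCache:
    """Hypothetical sketch of ID-Sink + Context-Cache key-value caching."""

    def __init__(self, sink_k, sink_v):
        # Reference-image KV: the permanent "sink", never evicted.
        self.sink_k, self.sink_v = sink_k, sink_v
        # Previous chunk's KV: rolled forward after every chunk.
        self.prev_k = self.prev_v = None

    def attend(self, q, k, v):
        # Attend over [ID-sink KV; previous-chunk KV; current-chunk KV].
        ks = [self.sink_k] + ([self.prev_k] if self.prev_k is not None else []) + [k]
        vs = [self.sink_v] + ([self.prev_v] if self.prev_v is not None else []) + [v]
        K, V = np.concatenate(ks, axis=0), np.concatenate(vs, axis=0)
        out = softmax(q @ K.T / np.sqrt(q.shape[-1])) @ V
        # Roll the cache: the current chunk becomes the next chunk's context.
        self.prev_k, self.prev_v = k, v
        return out
```

Every chunk thus attends over the identity tokens (always present) plus one chunk of sliding temporal context, so memory stays bounded during arbitrarily long streaming while identity and cross‑chunk coherence are preserved.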