Stereo-Talker: Audio-driven 3D Human Synthesis with Prior-Guided Mixture-of-Experts
This paper introduces Stereo-Talker, a novel one-shot audio-driven human video synthesis system that generates 3D talking videos with precise lip synchronization, expressive body gestures, temporally consistent photo-realistic quality, and continuous viewpoint control. The process follows a two-stage approach. In the first stage, the system maps audio input to high-fidelity motion sequences, encompassing upper-body gestures and facial expressions. To enrich motion diversity and authenticity, large language model (LLM) priors are integrated with text-aligned semantic audio features, leveraging LLMs’ cross-modal generalization power to enhance motion quality. In the second stage, we improve diffusion-based video generation models by incorporating a prior-guided Mixture-of-Experts (MoE) mechanism: a view-guided MoE focuses on view-specific attributes, while a mask-guided MoE enhances region-based rendering stability. Additionally, a mask prediction module is devised to derive human masks from motion data, improving mask stability and accuracy and enabling mask guidance during inference. We also introduce a comprehensive human video dataset with 2,203 identities, covering diverse body gestures and detailed annotations, facilitating broad generalization. The code, data, and pre-trained models will be released for research purposes.
💡 Research Summary
Stereo‑Talker is a novel one‑shot, audio‑driven system that synthesizes 3D talking videos from a single portrait image and an arbitrary audio clip. The method follows a two‑stage pipeline. In the first stage, raw speech is encoded by a pre‑trained wav2vec 2.0 model to obtain high‑level semantic audio features. These features are projected into the textual latent space of a large language model (LLM) using a lightweight projection network. The projected vectors are then processed by a LoRA‑fine‑tuned LLM encoder, which enriches the representation with linguistic context and semantic priors. The enriched latent vectors serve as conditioning for a diffusion‑based motion generator. Motion is represented with a VQ‑VAE that compresses pose sequences into a discrete codebook, allowing the diffusion model to produce diverse, one‑to‑many mappings from audio to upper‑body gestures, including facial expressions and lip movements.
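The conditioning path above — frame-level wav2vec features mapped into the LLM's latent space by a lightweight projector — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `AudioProjector` name, the 768/4096 feature dimensions, and the single linear layer are all assumptions (the actual projection network and LLM hidden size are not specified here).

```python
import numpy as np

# Assumed dimensions for illustration: wav2vec 2.0 base emits 768-d
# frame features; 4096 is a typical LLM hidden size, chosen arbitrarily.
AUDIO_DIM, LLM_DIM = 768, 4096

rng = np.random.default_rng(0)

class AudioProjector:
    """Hypothetical lightweight linear projection from audio features
    into the LLM's textual latent space (a sketch, not the paper's net)."""
    def __init__(self, d_in, d_out):
        # Small random init; the real projector would be trained end-to-end.
        self.W = rng.normal(0.0, 0.02, size=(d_in, d_out))
        self.b = np.zeros(d_out)

    def __call__(self, audio_feats):
        # audio_feats: (T, d_in) frame-level wav2vec features
        return audio_feats @ self.W + self.b  # (T, d_out) pseudo-text tokens

# 50 frames of wav2vec-style features for one utterance
audio_feats = rng.normal(size=(50, AUDIO_DIM))
proj = AudioProjector(AUDIO_DIM, LLM_DIM)
llm_tokens = proj(audio_feats)  # fed to the LoRA-tuned LLM encoder,
                                # whose outputs condition the motion diffuser
print(llm_tokens.shape)  # (50, 4096)
```

The key design point is that the projector makes continuous audio features look like text embeddings, so the frozen (LoRA-adapted) LLM can enrich them with its semantic priors.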
In the second stage, the generated motion sequence is rendered into photorealistic video frames using a U‑Net diffusion backbone augmented with two prior‑guided Mixture‑of‑Experts (MoE) modules. The view‑guided MoE contains multiple expert branches, each specialized for a particular camera viewpoint. During inference, the distance between the target viewpoint and each expert’s canonical view determines the blending weights, enabling smooth continuous viewpoint control while preserving 3D consistency. The mask‑guided MoE focuses on region‑specific rendering stability: a separate VAE‑based mask prediction network derives per‑frame human masks directly from the motion data, and these masks guide the MoE experts that specialize in different body parts (face, hands, torso). This design mitigates common artifacts such as blurry hands or unstable lip regions and improves temporal coherence.
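The view-guided MoE's blending rule described above can be sketched as a softmax over negative viewpoint distances. The summary only states that the distance between the target viewpoint and each expert's canonical view determines the blending weights; the softmax form, the temperature, and the azimuth/elevation parameterization below are illustrative assumptions.

```python
import numpy as np

def view_blend_weights(target_view, canonical_views, temperature=30.0):
    """Blend weights for view experts: nearer canonical views get
    larger weights. Softmax-over-negative-distance is an assumed
    instantiation of the paper's distance-based weighting."""
    d = np.linalg.norm(canonical_views - target_view, axis=1)  # (E,)
    logits = -d / temperature
    logits -= logits.max()           # numerical stability
    w = np.exp(logits)
    return w / w.sum()               # weights sum to 1

# Four hypothetical experts at canonical (azimuth, elevation) angles
canonical = np.array([[0.0, 0.0], [90.0, 0.0], [180.0, 0.0], [270.0, 0.0]])
target = np.array([30.0, 0.0])       # query view between experts 0 and 1
w = view_blend_weights(target, canonical)
# Expert feature maps f_e would then be fused as sum_e w[e] * f_e,
# giving smooth interpolation as the camera moves between canonical views.
print(np.round(w, 3))
```

Because the weights vary continuously with the target view, sweeping the camera produces a smooth transition between experts rather than hard switching, which is what enables continuous viewpoint control.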
To support robust training and evaluation, the authors introduce the High‑Definition Audio‑Visual (HDA‑V) dataset, comprising 2,203 identities with detailed 3D template parameters, pose annotations, and synchronized audio‑video pairs. The dataset covers a wide range of body gestures, facial expressions, and lighting conditions, facilitating strong generalization across unseen subjects.
Extensive experiments demonstrate that Stereo‑Talker outperforms prior state‑of‑the‑art methods on several metrics: lip‑sync accuracy, gesture diversity, visual fidelity under viewpoint changes, and temporal consistency. Ablation studies confirm that both the LLM‑enhanced audio‑to‑motion translation and the prior‑guided MoE modules contribute substantially to the observed gains.
Key contributions are: (1) the first framework that combines LLM semantic priors with diffusion models for high‑fidelity, one‑shot 3D talking video synthesis; (2) a prior‑guided MoE architecture that injects view‑ and mask‑specific knowledge without a large computational overhead; (3) a large‑scale, richly annotated dataset that lowers the barrier for future research in audio‑driven human video generation; and (4) a comprehensive analysis showing that semantic audio cues, rather than low‑level rhythmic features alone, are essential for expressive co‑speech gesture generation.
Limitations include the high computational cost of LLM encoders and diffusion sampling, which may hinder real‑time deployment, and sensitivity to the number and placement of MoE experts. Future work will explore model distillation, adaptive expert selection, and integration with interactive VR/AR pipelines.