DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Recent advancements in foundation models have revolutionized joint audio-video generation. However, existing approaches typically treat human-centric tasks including reference-based audio-video generation (R2AV), video editing (RV2AV) and audio-driven video animation (RA2V) as isolated objectives. Furthermore, achieving precise, disentangled control over multiple character identities and voice timbres within a single framework remains an open challenge. In this paper, we propose DreamID-Omni, a unified framework for controllable human-centric audio-video generation. Specifically, we design a Symmetric Conditional Diffusion Transformer that integrates heterogeneous conditioning signals via a symmetric conditional injection scheme. To resolve the pervasive identity-timbre binding failures and speaker confusion in multi-person scenarios, we introduce a Dual-Level Disentanglement strategy: Synchronized RoPE at the signal level to ensure rigid attention-space binding, and Structured Captions at the semantic level to establish explicit attribute-subject mappings. Furthermore, we devise a Multi-Task Progressive Training scheme that leverages weakly-constrained generative priors to regularize strongly-constrained tasks, preventing overfitting and harmonizing disparate objectives. Extensive experiments demonstrate that DreamID-Omni achieves comprehensive state-of-the-art performance across video, audio, and audio-visual consistency, even outperforming leading proprietary commercial models. We will release our code to bridge the gap between academic research and commercial-grade applications.


💡 Research Summary

DreamID‑Omni tackles the fragmented landscape of controllable human‑centric audio‑video generation by unifying three traditionally separate tasks—reference‑based generation (R2AV), video editing (RV2AV), and audio‑driven animation (RA2V)—into a single diffusion‑based architecture. At its core lies a dual‑stream Diffusion Transformer (DiT) where a video stream and an audio stream interact through bidirectional cross‑attention, enabling fine‑grained temporal synchronization and semantic alignment across modalities.
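The dual-stream interaction described above can be sketched as a toy single-head bidirectional cross-attention block. This is an illustrative simplification, not the paper's implementation: the real model operates on high-dimensional latents with multi-head attention, and all function names here are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attend(queries, keys, values):
    """Each query token attends over all key tokens (single head, toy dims)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values)) for j in range(d)])
    return out

def bidirectional_block(video_tokens, audio_tokens):
    """Video stream attends to audio, audio stream attends to video,
    with residual connections as in a standard transformer block."""
    v2a = cross_attend(video_tokens, audio_tokens, audio_tokens)
    a2v = cross_attend(audio_tokens, video_tokens, video_tokens)
    new_v = [[x + y for x, y in zip(v, u)] for v, u in zip(video_tokens, v2a)]
    new_a = [[x + y for x, y in zip(a, u)] for a, u in zip(audio_tokens, a2v)]
    return new_v, new_a
```

Because attention runs in both directions, each stream can condition on the other's full temporal context, which is what enables the fine-grained synchronization the summary describes.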

The Symmetric Conditional DiT is the key architectural innovation. Heterogeneous conditioning signals—reference images, voice timbres, source video, and driving audio—are injected symmetrically: reference features are concatenated to the noisy latent sequences, while structural cues (source video or driving audio) are added element‑wise. This dual‑injection scheme cleanly separates identity preservation from structural guidance, allowing the same parameter set to switch seamlessly among R2AV, RV2AV, and RA2V simply by toggling the presence of structural inputs.
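The asymmetry between the two injection routes can be made concrete with a minimal sketch. Assuming latents are token sequences, reference features are concatenated along the token axis while structural cues are added element-wise; passing no structural cue corresponds to pure reference-based generation. The function name and signature are illustrative, not from the paper.

```python
def inject_conditions(noisy_latents, reference_latents, structural_cue=None):
    """Toy sketch of symmetric conditional injection.

    - reference_latents (identity/timbre): concatenated to the token sequence
    - structural_cue (source-video or driving-audio latents): element-wise add
    - structural_cue=None -> reference-based generation (R2AV);
      supplying a cue switches the same model toward RV2AV / RA2V.
    """
    if structural_cue is not None:
        assert len(structural_cue) == len(noisy_latents)
        noisy_latents = [[x + c for x, c in zip(tok, cue)]
                        for tok, cue in zip(noisy_latents, structural_cue)]
    # concatenation along the token (sequence) axis
    return reference_latents + noisy_latents
```

The design choice matters: concatenation lets the model *attend* to identity features without forcing them into every frame, whereas element-wise addition pins structural guidance to specific temporal positions.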

A major obstacle in multi‑person scenarios is identity‑timbre binding failure and attribute‑content misattribution. DreamID‑Omni addresses this with a Dual‑Level Disentanglement strategy.

  1. Signal‑level Synchronized Rotary Positional Embedding (Syn‑RoPE) scales the standard RoPE frequencies to match video and audio sequence lengths and reserves a large “RoPE margin” for each identity. Features belonging to the k‑th identity occupy a distinct positional interval, forcing the attention mechanism to operate in separate rotational sub‑spaces. This yields two benefits: (i) inter‑identity decoupling—cross‑attention scores between different persons are naturally suppressed, and (ii) intra‑identity synchronization—visual and acoustic features of the same person share identical positional indices, providing implicit cross‑modal alignment without extra loss terms.
  2. Structured Captions introduce explicit anchor tokens ⟨sub k⟩ for each reference identity, followed by fine‑grained descriptions of appearance, motion, and spoken content. Generated by a large multimodal language model, these captions supply the text‑condition attention with unambiguous mappings, eliminating the ambiguity that plain prompts suffer from in multi‑subject contexts.
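The Syn‑RoPE position assignment in point 1 can be sketched as follows. This is a toy illustration under two assumptions not spelled out in the summary: the per-identity margin is a single large offset, and audio positions are linearly rescaled to span the same interval as the video positions. Names and the margin value are hypothetical.

```python
def syn_rope_positions(num_video_tokens, num_audio_tokens, identity, margin=10_000):
    """Toy Syn-RoPE position assignment.

    Both modalities of identity k share the offset k * margin, so their
    rotary positions coincide (intra-identity synchronization), while
    different identities occupy disjoint positional intervals
    (inter-identity decoupling).
    """
    base = identity * margin
    # rescale audio so both sequences span the same positional interval
    scale = num_video_tokens / num_audio_tokens
    video_pos = [base + t for t in range(num_video_tokens)]
    audio_pos = [base + t * scale for t in range(num_audio_tokens)]
    return video_pos, audio_pos
```

With a sufficiently large margin, rotary angles for different identities never overlap, which is what suppresses cross-identity attention scores without any auxiliary loss.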

Training proceeds via a Multi‑Task Progressive Training schedule. The first two stages focus exclusively on the weakly‑constrained R2AV task, using in‑pair reconstruction and cross‑pair disentanglement losses to strengthen identity‑timbre fidelity while learning robust reference embeddings. Once the model has mastered these basics, strongly‑constrained tasks (RV2AV and RA2V) are introduced jointly with R2AV. This staged approach prevents over‑fitting to the more restrictive tasks, preserves the generalization acquired on R2AV, and harmonizes the disparate objectives without catastrophic interference.
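The staged schedule above can be summarized as a task-sampling sketch. The stage boundaries follow the summary (stages 1–2: R2AV only; stage 3 onward: joint training), but the mixing weights are illustrative assumptions, not values from the paper.

```python
import random

def sample_task(stage, rng=None):
    """Toy multi-task progressive schedule.

    Stages 1-2 train only the weakly-constrained R2AV task; from stage 3,
    the strongly-constrained RV2AV / RA2V tasks are mixed in alongside
    continued R2AV so the generative prior keeps regularizing them.
    The 0.4/0.3/0.3 weights are illustrative, not taken from the paper.
    """
    rng = rng or random.Random(0)
    if stage <= 2:
        return "R2AV"
    return rng.choices(["R2AV", "RV2AV", "RA2V"], weights=[0.4, 0.3, 0.3])[0]
```

Keeping R2AV in the mix during the final stage is the mechanism the summary credits with preventing over-fitting to the more restrictive tasks.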

Extensive experiments on public benchmarks and a curated multi‑person dataset evaluate video quality (FVD, IS), audio quality (MOS, PESQ), audio‑visual sync (AV‑Sync), and identity‑audio matching accuracy. DreamID‑Omni consistently outperforms state‑of‑the‑art open‑source models (e.g., Ovi, LTX‑2) and leading commercial services such as Veo3, Sora2, and Seedance 1.5 Pro across all metrics. Notably, in multi‑person generation the identity‑timbre mismatch rate drops by ~70% compared to baselines, and the overall audiovisual consistency surpasses prior art.

In summary, DreamID‑Omni delivers a unified, controllable, and high‑fidelity human‑centric audio‑video generation system. Its symmetric conditioning, dual‑level disentanglement, and progressive multi‑task training together enable precise, disentangled control over multiple characters and voice timbres within a single model, bridging the gap between academic research and commercial‑grade applications. The authors commit to releasing code, pretrained weights, and a demo portal to foster further development in the community.

