MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning
[Figure: Framework comparison. (a) Audio-video joint generation with multimodal control; (b) audio-video joint generation with timbre control; (c) audio-video joint generation with first-frame control. Example prompt for multimodal control: A white goat @speaker 0 stands indoors and says, “I am a goat, very cute.” And point the front hoof to the opposite side.]