MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning


Figure: Framework comparison. (a) Audio-video joint generation with multimodal control; (b) audio-video joint generation with timbre control; (c) audio-video joint generation with first-frame control.

Example prompt for audio-video joint generation with multimodal control: A white goat @speaker 0 stands indoors and says, “I am a goat, very cute.” And point the front hoof to the opposite side.

