MusicInfuser: Making Video Diffusion Listen and Dance
We introduce MusicInfuser, an approach that aligns pre-trained text-to-video diffusion models to generate high-quality dance videos synchronized with specified music tracks. Rather than training a multimodal audio-video or audio-motion model from scratch, our method demonstrates how existing video diffusion models can be efficiently adapted to align with musical inputs. We propose a novel layer-wise adaptability criterion based on a guidance-inspired constructive influence function to select adaptable layers, significantly reducing training costs while preserving rich prior knowledge, even with limited, specialized datasets. Experiments show that MusicInfuser effectively bridges the gap between music and video, generating novel and diverse dance movements that respond dynamically to music. Furthermore, our framework generalizes well to unseen music tracks, longer video sequences, and unconventional subjects, outperforming baseline models in consistency and synchronization. All of this is achieved without requiring motion data, with training completed on a single GPU within a day.
💡 Research Summary
This paper introduces “MusicInfuser,” a novel method for generating high-quality dance videos that are precisely synchronized with a given music track. Instead of training a new audio-to-video model from scratch—a challenge due to the scarcity of high-quality, music-aligned dance footage—MusicInfuser efficiently adapts a pre-trained text-to-video diffusion model. The core insight is that such models, trained on vast and diverse internet video, already possess an implicit understanding of human motion and dance-like movements. MusicInfuser aims to preserve this rich “prior knowledge” while teaching the model to align its outputs with musical rhythm and style.
The technical framework addresses three main challenges: where to inject musical conditioning, how to do it stably, and how to train efficiently with limited data. First, the authors propose a layer-wise adaptability criterion. Rather than attaching music-conditioning modules to all layers of the model (which is costly and can degrade quality), they develop a metric to identify the most “adaptable” layers. This metric measures a layer’s positive influence on video structure by using it to guide the sampling process of a model that lacks that layer. This allows for pre-computation of an optimal subset of layers for adaptation without exhaustive fine-tuning.
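As a rough illustration of this guidance-inspired criterion, the sketch below treats the gap between the full model's noise prediction and that of a model with one layer skipped as a guidance direction (in the style of classifier-free guidance), and scores the layer by the magnitude of that constructive influence. All function names and the use of the score's magnitude for ranking are assumptions for illustration; the paper's exact criterion and selection rule may differ.

```python
import numpy as np

def guided_prediction(eps_full, eps_ablated, scale=2.0):
    """CFG-style extrapolation (illustrative): treat the ablated model as the
    'unconditional' branch and the layer's contribution as the guidance
    direction."""
    return eps_ablated + scale * (eps_full - eps_ablated)

def adaptability_score(eps_full, eps_ablated):
    """Simple proxy score (an assumption): magnitude of the layer's
    constructive influence on the prediction."""
    return float(np.linalg.norm(eps_full - eps_ablated))

# Toy example: two-dimensional noise predictions from the full model and from
# a model with one candidate layer skipped.
eps_full = np.array([1.0, 2.0])
eps_ablated = np.array([0.5, 1.0])
guided = guided_prediction(eps_full, eps_ablated, scale=2.0)
score = adaptability_score(eps_full, eps_ablated)
```

Because the scores are computed once per layer, an adaptable subset can be chosen up front without fine-tuning every candidate configuration.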
Second, for the actual conditioning mechanism, they introduce Zero-Initialized Cross-Attention (ZICA) blocks. These are inserted into the selected layers of the Diffusion Transformer (DiT). Crucially, the output projection matrix of this cross-attention block is initialized to zero, so the residual branch initially contributes nothing and the layer behaves as an identity function, causing no disruption to the pre-trained model's behavior. As training progresses, the block gradually learns to incorporate information from the audio tokens, ensuring stable and conservative adaptation.
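A minimal numpy sketch of the zero-initialization idea: video tokens cross-attend to audio tokens, but the output projection starts at zero, so at initialization the block reduces to the identity on the video tokens. Shapes, names, and the single-head formulation are simplifying assumptions, not the paper's exact DiT integration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class ZICABlock:
    """Single-head cross-attention with a zero-initialized output projection
    (illustrative sketch of the ZICA idea)."""

    def __init__(self, dim, rng=None):
        rng = rng or np.random.default_rng(0)
        s = 1.0 / np.sqrt(dim)
        self.Wq = rng.normal(0.0, s, (dim, dim))
        self.Wk = rng.normal(0.0, s, (dim, dim))
        self.Wv = rng.normal(0.0, s, (dim, dim))
        self.Wo = np.zeros((dim, dim))  # zero-init: block starts as identity
        self.dim = dim

    def __call__(self, video_tokens, audio_tokens):
        q = video_tokens @ self.Wq
        k = audio_tokens @ self.Wk
        v = audio_tokens @ self.Wv
        attn = softmax(q @ k.T / np.sqrt(self.dim)) @ v
        # Residual connection: while Wo == 0, the output equals the input.
        return video_tokens + attn @ self.Wo

block = ZICABlock(dim=8)
video = np.random.default_rng(1).normal(size=(4, 8))
audio = np.random.default_rng(2).normal(size=(6, 8))
out = block(video, audio)
```

During training, gradients flow into `Wo`, so the audio signal is blended in gradually rather than perturbing the pre-trained features from step one.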
Third, they employ a Beta-Uniform noise scheduling strategy during training. Typically, diffusion models sample noise levels uniformly. MusicInfuser starts training by sampling predominantly from low noise levels (which affect fine details like dance nuances), using a Beta distribution. Over time, it transitions to a uniform distribution, eventually training on all noise levels equally. This strategy prioritizes learning music-responsive details early on while preserving the model’s foundational knowledge of human motion physics learned at higher noise levels.
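The schedule above can be sketched as a mixture that anneals from a Beta distribution toward a Uniform one. The parameter values, the linear annealing, and the convention that small `t` means low noise are all assumptions for illustration; the paper's exact schedule may differ.

```python
import numpy as np

def sample_noise_level(step, total_steps, alpha=1.0, beta=3.0, rng=None):
    """Draw a noise level t in [0, 1], mixing Beta(alpha, beta) and Uniform.

    Early in training (step == 0) samples come from the Beta distribution,
    whose mass sits near t = 0 (low noise, fine detail); by step ==
    total_steps the mixture has annealed linearly to Uniform(0, 1).
    """
    rng = rng or np.random.default_rng()
    mix = min(step / total_steps, 1.0)  # 0 -> pure Beta, 1 -> pure Uniform
    if rng.random() < mix:
        return rng.uniform(0.0, 1.0)
    return rng.beta(alpha, beta)

rng = np.random.default_rng(0)
early = [sample_noise_level(0, 1000, rng=rng) for _ in range(2000)]
late = [sample_noise_level(1000, 1000, rng=rng) for _ in range(2000)]
```

Under these assumptions, early samples concentrate at low noise levels (Beta(1, 3) has mean 0.25), while late samples cover all noise levels evenly.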
The results demonstrate that MusicInfuser successfully generates diverse and novel dance movements that dynamically respond to music beats and style. It shows impressive generalization to unseen music genres, longer video sequences than those seen during training, and even unconventional subjects like dancing animals—all while maintaining text-based control over style and setting. Evaluations, including an automated framework using Video-LLMs and human assessments, show it outperforms baseline adaptation strategies in terms of video-music synchronization and overall quality. Remarkably, this effective alignment is achieved without requiring any motion capture data, and the entire adaptation process is completed on a single GPU within a day, highlighting its efficiency.