Diffusion Timbre Transfer Via Mutual Information Guided Inpainting
We study timbre transfer as an inference-time editing problem for music audio. Starting from a strong pre-trained latent diffusion model, we introduce a lightweight procedure that requires no additional training: (i) a dimension-wise noise injection that targets latent channels most informative of instrument identity, and (ii) an early-step clamping mechanism that re-imposes the input’s melodic and rhythmic structure during reverse diffusion. The method operates directly on audio latents and is compatible with text/audio conditioning (e.g., CLAP). We discuss design choices, analyze trade-offs between timbral change and structural preservation, and show that simple inference-time controls can meaningfully steer pre-trained models for style-transfer use cases. Demo available at [1].
💡 Research Summary
The paper tackles timbre transfer for music audio as an inference‑time editing problem, leveraging a strong pre‑trained latent diffusion model (LDM) without any additional training. The authors introduce two lightweight mechanisms that operate solely during the reverse diffusion process. First, a dimension‑wise noise injection targets latent channels that carry the most mutual information about instrument identity. By analyzing the correlation between CLAP (Contrastive Language‑Audio Pre‑training) embeddings and each latent dimension, the method identifies “timbre channels” and injects Gaussian noise only into these channels, thereby selectively perturbing the instrument‑specific information while leaving other aspects of the latent representation intact. Second, an early‑step clamping mechanism re‑imposes the melodic and rhythmic structure of the input audio during the initial steps of reverse diffusion. By forcing the spectrogram of the original audio back into the latent representation for a small number of diffusion steps (e.g., the first 10 steps), the approach preserves the musical content that would otherwise be degraded by the noise injection.
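The two mechanisms can be sketched in code. This is a hypothetical illustration, not the paper's implementation: the function names, the variance-ratio proxy used in place of a full mutual-information estimate, and the choice to clamp non-timbre latent channels (rather than re-imposing the input spectrogram directly) are all assumptions made for the sake of a self-contained example.

```python
# Illustrative sketch of the two inference-time controls described above.
# All names (select_timbre_channels, noise_scale, clamp_steps, ...) are
# assumptions for this sketch, not the paper's actual API.
import numpy as np

def select_timbre_channels(latents, instrument_labels, k=16):
    """Rank latent channels by how strongly they separate instrument classes.

    As a cheap proxy for per-channel mutual information with instrument
    identity, score each channel by the ratio of between-class to
    within-class variance of its time-averaged activation.
    latents: (n_clips, n_channels, n_frames); instrument_labels: (n_clips,)
    """
    feats = latents.mean(axis=-1)                    # (n_clips, n_channels)
    overall = feats.mean(axis=0)
    between = np.zeros(feats.shape[1])
    within = np.zeros(feats.shape[1])
    for c in np.unique(instrument_labels):
        f = feats[instrument_labels == c]
        between += len(f) * (f.mean(axis=0) - overall) ** 2
        within += ((f - f.mean(axis=0)) ** 2).sum(axis=0)
    score = between / (within + 1e-8)
    return np.argsort(score)[-k:]                    # top-k "timbre channels"

def inject_noise(latent, timbre_channels, noise_scale=0.8, rng=None):
    """Perturb only the instrument-identity channels of one latent."""
    rng = rng or np.random.default_rng(0)
    noisy = latent.copy()
    noisy[timbre_channels] += noise_scale * rng.standard_normal(
        latent[timbre_channels].shape)
    return noisy

def clamp_structure(latent, reference, structure_channels, step, clamp_steps=10):
    """Re-impose the input's content channels during early reverse steps."""
    if step < clamp_steps:
        latent = latent.copy()
        latent[structure_channels] = reference[structure_channels]
    return latent
```

In a reverse-diffusion loop, `inject_noise` would be applied once to the encoded input, and `clamp_structure` would be called after each of the first `clamp_steps` denoising steps, letting later steps proceed unconstrained.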
Both mechanisms are compatible with text or audio conditioning via CLAP, allowing users to specify target timbres through natural language prompts, reference audio clips, or a combination of both. The system therefore supports multi‑modal style transfer while keeping the underlying diffusion scheduler and decoder unchanged.
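One simple way such multi-modal conditioning could be combined is a weighted blend of the two embeddings in the shared CLAP space. The function below is a minimal sketch under that assumption; the blend weight `alpha` and the linear-interpolation scheme are illustrative choices, not parameters from the paper.

```python
# Hedged sketch: blend a text embedding and a reference-audio embedding
# (both assumed to live in the same CLAP-style space) into one
# L2-normalized conditioning vector. `alpha` is an illustrative knob.
import numpy as np

def blend_conditioning(text_emb, audio_emb=None, alpha=0.5):
    """Return a unit-norm conditioning vector from one or two modalities.

    text_emb / audio_emb: 1-D embeddings in a shared space.
    alpha: weight on the audio embedding when both are given.
    """
    cond = np.asarray(text_emb, dtype=float)
    if audio_emb is not None:
        cond = (1 - alpha) * cond + alpha * np.asarray(audio_emb, dtype=float)
    return cond / (np.linalg.norm(cond) + 1e-12)
```

With `audio_emb=None` this reduces to plain text conditioning; supplying a reference clip pulls the conditioning vector toward the target instrument's embedding, matching the paper's observation that audio references sharpen timbre control.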
Experimental evaluation is conducted on public datasets such as MAESTRO and URMP, as well as a custom multi‑track collection. Quantitative metrics include instrument classification accuracy (to measure timbre change), melody preservation scores (derived from pitch‑tracking similarity), and frame‑wise L2 distance on spectrograms (to assess overall fidelity). Subjective listening tests further gauge perceived naturalness of timbre conversion and structural integrity. Results show that the proposed inference‑time controls achieve timbre transformation quality comparable to, and in some cases surpassing, fully fine‑tuned baselines, while offering a clear advantage in preserving melodic and rhythmic content.
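A melody preservation score of the kind described above can be computed from frame-wise pitch tracks. The sketch below is one plausible formulation, assuming f0 contours in Hz (0 marking unvoiced frames) and a 50-cent agreement tolerance; the paper does not specify its exact metric.

```python
# Hedged sketch of a pitch-tracking-based melody preservation score:
# the fraction of commonly voiced frames whose pitches agree within
# `tol_cents`. The tolerance value is an assumption.
import numpy as np

def melody_preservation(f0_ref, f0_out, tol_cents=50.0):
    """f0_ref, f0_out: per-frame fundamental frequency in Hz, 0 = unvoiced."""
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_out = np.asarray(f0_out, dtype=float)
    voiced = (f0_ref > 0) & (f0_out > 0)  # compare only commonly voiced frames
    if not voiced.any():
        return 0.0
    cents = 1200.0 * np.abs(np.log2(f0_out[voiced] / f0_ref[voiced]))
    return float((cents <= tol_cents).mean())
```

A perfect transfer that changes only timbre scores 1.0; a transposition by a semitone (100 cents) scores 0.0 at the default tolerance.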
Ablation studies explore the sensitivity of channel selection, the impact of varying the noise magnitude, and the effect of different clamping step counts. The analysis reveals a smooth trade‑off curve: stronger noise and fewer clamping steps yield more dramatic timbre changes at the cost of structural fidelity, whereas milder noise and longer clamping preserve the original musical structure but produce subtler timbral shifts. When only text conditioning is used, timbre changes are less precise; however, augmenting the prompt with a reference audio clip dramatically improves the alignment of the generated timbre with the target instrument.
The authors argue that this approach demonstrates a new paradigm for re‑using large pre‑trained diffusion models as “plug‑in” components for downstream audio editing tasks. By avoiding costly retraining, the method enables real‑time or near‑real‑time timbre manipulation, which is valuable for music production, sound design, and educational tools. Future work is suggested in the direction of more sophisticated latent space analysis (e.g., disentanglement techniques), adaptive scheduling of noise injection, and user‑friendly interfaces that expose the inference‑time knobs to non‑technical musicians. Overall, the paper provides a compelling proof‑of‑concept that simple inference‑time controls can meaningfully steer powerful generative models for style‑transfer applications in the audio domain.