Mitigating data replication in text-to-audio generative diffusion models through anti-memorization guidance

Notice: This research summary and analysis were generated automatically using AI. For complete accuracy, please refer to the original arXiv paper.

A persistent challenge in generative audio models is data replication, where the model unintentionally generates parts of its training data during inference. In this work, we address this issue in text-to-audio diffusion models by exploring the use of anti-memorization strategies. We adopt Anti-Memorization Guidance (AMG), a technique that modifies the sampling process of pre-trained diffusion models to discourage memorization. Our study explores three types of guidance within AMG, each designed to reduce replication while preserving generation quality. We use Stable Audio Open as our backbone, leveraging its fully open-source architecture and training dataset. Our comprehensive experimental analysis suggests that AMG significantly mitigates memorization in diffusion-based text-to-audio generation without compromising audio fidelity or semantic alignment.


💡 Research Summary

This paper tackles the problem of data replication in text‑to‑audio diffusion models, where a generative model unintentionally reproduces portions of its training set during inference. Building on the Anti‑Memorization Guidance (AMG) originally proposed for image generation, the authors adapt and extend the technique for latent‑diffusion audio models. AMG operates entirely at inference time and introduces three complementary guidance signals that are added to the predicted noise whenever the generated audio becomes too similar to any training example.

  1. Despecification Guidance (g_spe) reduces the influence of an overly specific textual prompt. It does this by subtracting a scaled difference between the conditional and unconditional noise predictions, with a dynamic scale s₁ that grows with the similarity σₜ between the current reconstruction and its nearest neighbor.

  2. Caption Deduplication Guidance (g_dup) addresses duplicated captions in the training corpus. When a nearest‑neighbor caption y_ν is identified, the guidance pushes the model away from that caption using a similar scaled difference, with scale s₂ bounded by s₁ to avoid over‑steering.

  3. Dissimilarity Guidance (g_sim) directly minimizes the cosine similarity σₜ in the CLAP embedding space by adding the gradient ∇ₓσₜ (scaled by c₃) to the noise term, effectively pulling the generation away from the memorized sample.
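The three guidance terms above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function signature, the scale heuristics (s₁ = c₁·σₜ, s₂ capped at s₁), and the sign conventions are assumptions inferred from the descriptions in the summary.

```python
import numpy as np

def amg_guidance_terms(eps_cond, eps_uncond, eps_dup_cond,
                       sigma_t, grad_sigma, c1=1.0, c2=1.0, c3=1.0):
    """Sketch of the three AMG guidance terms.

    eps_cond     -- noise prediction conditioned on the user's prompt
    eps_uncond   -- unconditional noise prediction
    eps_dup_cond -- noise prediction conditioned on the nearest-neighbor caption y_nu
    sigma_t      -- similarity between the current reconstruction and its nearest neighbor
    grad_sigma   -- gradient of sigma_t w.r.t. the current latent (same shape as eps_cond)
    """
    # Despecification: subtract a scaled conditional/unconditional difference;
    # the dynamic scale s1 grows with the similarity sigma_t.
    s1 = c1 * sigma_t
    g_spe = -s1 * (eps_cond - eps_uncond)

    # Caption deduplication: push away from the duplicated caption,
    # with s2 bounded by s1 to avoid over-steering.
    s2 = min(c2 * sigma_t, s1)
    g_dup = -s2 * (eps_dup_cond - eps_uncond)

    # Dissimilarity: add the (scaled) gradient of the similarity,
    # pulling the generation away from the memorized sample.
    g_sim = c3 * grad_sigma

    return g_spe, g_dup, g_sim
```

In practice `grad_sigma` would be obtained by backpropagating the CLAP-space cosine similarity through the decoder, which the scalar sketch above deliberately omits.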

The similarity σₜ is computed using CLAP‑laion embeddings, and the nearest neighbor ν is found by minimizing the Euclidean distance in that embedding space. A threshold λₜ (parabolic schedule from 0.4 to 0.5) determines when the guidance is activated. The final noise update at each reverse‑diffusion step is:

ε̂ ← ε̂ + 𝟙[σₜ > λₜ] · (g_spe + g_dup + g_sim)
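The thresholded update can be sketched as below. The exact orientation of the parabolic schedule (which end of the reverse process maps to 0.4 versus 0.5) is an assumption; `lambda_schedule` and `amg_update` are hypothetical names for illustration.

```python
import numpy as np

def lambda_schedule(t, T, lo=0.4, hi=0.5):
    """Assumed parabolic threshold schedule from lo to hi over T reverse steps."""
    u = t / T  # normalized progress in [0, 1]
    return lo + (hi - lo) * u ** 2

def amg_update(eps_hat, g_spe, g_dup, g_sim, sigma_t, lam_t):
    """Apply the AMG correction to the predicted noise only when the
    similarity sigma_t exceeds the activation threshold lam_t."""
    if sigma_t > lam_t:
        eps_hat = eps_hat + g_spe + g_dup + g_sim
    return eps_hat
```

Because the indicator gates the correction, sampling is unchanged for generations that never approach a training example, which is how AMG preserves fidelity on non-memorized prompts.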

