Generative Audio Extension and Morphing


In audio-related creative tasks, sound designers often seek to extend and morph sounds from their libraries. Generative audio models, which can create audio using examples as references, offer promising solutions. By masking the noisy latents of a DiT and applying a novel variant of classifier-free guidance to these masked latents, we demonstrate that: (i) given an audio reference, we can extend it both forward and backward for a specified duration, and (ii) given two audio references, we can morph them seamlessly for a desired duration. Furthermore, we show that fine-tuning the model on different types of stationary audio data mitigates potential hallucinations. The effectiveness of our method is supported by objective metrics: the generated audio achieves Fréchet Audio Distances (FADs) comparable to those of real samples from the training data. We also validate our results through a subjective listening test, in which participants rated the proposed model's generations positively. This technique paves the way for more controllable and expressive generative sound frameworks, letting sound designers spend less time on tedious, repetitive tasks and more on the creative process itself.


💡 Research Summary

The paper introduces a novel generative audio framework that enables both forward/backward extension of a single audio clip and seamless morphing between two audio clips. The core idea is to operate on the latent space of a Diffusion Transformer (DiT) and to mask portions of the noisy latent vectors before denoising. By applying a new variant of classifier‑free guidance, called Audio Prompt Guidance (APG), the model is forced to generate content that aligns closely with the original audio prompt while still allowing creative variation in the masked region.

Methodology

  1. Latent Encoding – A stereo VAE compresses 48 kHz audio into a 256‑dimensional latent space, preserving left/right channel information and a difference channel that captures spatial cues.
  2. Masking Function – Gaussian noise (z_G) is fed into the DiT. A deterministic masking function (f_M(z_G, z)) replaces either the beginning, the end, or both ends of the latent sequence with the noisy latent, leaving the rest untouched. This enables the model to keep the original prompt latent intact while only re‑generating the masked segment.
  3. Audio Prompt Guidance (APG) – Extending the classic CFG formulation, APG introduces a guidance scale (\gamma) that directly weights the difference between the DiT output on the masked latent and the output on the unmasked noise:
    \[ \hat{v} = v_\theta(z_G) + \gamma \,\big( v_\theta(f_M(z_G, z)) - v_\theta(z_G) \big) \]
    where (v_\theta(\cdot)) denotes the DiT output.
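The masking and guidance steps above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the names `f_M`, `z_G`, `z`, and `gamma` come from the summary, while `v_theta`, `mask_len`, and the `mode` argument are illustrative assumptions standing in for the DiT and its conditioning setup.

```python
import numpy as np

def f_M(z_G, z, mask_len, mode="forward"):
    """Deterministic masking (sketch): keep the prompt latent z intact and
    replace only the region to be generated with the noisy latent z_G.
    z, z_G: (T, d) latent sequences; mask_len: number of frames to regenerate."""
    out = z.copy()
    if mode == "forward":            # extend past the end of the prompt
        out[-mask_len:] = z_G[-mask_len:]
    elif mode == "backward":         # extend before the start of the prompt
        out[:mask_len] = z_G[:mask_len]
    else:                            # "both": regenerate both ends (morphing)
        out[:mask_len] = z_G[:mask_len]
        out[-mask_len:] = z_G[-mask_len:]
    return out

def apg(v_theta, z_G, z, mask_len, gamma, mode="forward"):
    """Audio Prompt Guidance (sketch): weight the difference between the
    DiT output on the masked latent and its output on the unmasked noise."""
    v_uncond = v_theta(z_G)                            # plain noise input
    v_prompt = v_theta(f_M(z_G, z, mask_len, mode))    # prompt-masked input
    return v_uncond + gamma * (v_prompt - v_uncond)
```

With `gamma = 1` this reduces to the plain prompt-masked prediction, and `gamma > 1` pushes the output further toward the audio prompt, mirroring how the guidance scale works in classic CFG.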
