Singing Voice Separation Using Diffusion Models: An Efficient Generative Approach Based on Latent Diffusion

Reading time: 6 minutes

📝 Original Info

  • Title: Singing Voice Separation Using Diffusion Models: An Efficient Generative Approach Based on Latent Diffusion
  • ArXiv ID: 2511.20470
  • Date: 2025-11-26
  • Authors: Genís Plaja-Roglans, Yun-Ning Hung, Xavier Serra, and Igor Pereira

📝 Abstract

Extracting individual elements from music mixtures is a valuable tool for music production and practice. While neural networks optimized to mask or transform mixture spectrograms into the individual source(s) have been the leading approach, the source overlap and correlation in music signals poses an inherent challenge. Also, accessing all sources in the mixture is crucial to train these systems, while complicated. Attempts to address these challenges in a generative fashion exist, however, the separation performance and inference efficiency remain limited. In this work, we study the potential of diffusion models to advance toward bridging this gap, focusing on generative singing voice separation relying only on corresponding pairs of isolated vocals and mixtures for training. To align with creative workflows, we leverage latent diffusion: the system generates samples encoded in a compact latent space, and subsequently decodes these into audio. This enables efficient optimization and faster inference. Our system is trained using only open data. We outperform existing generative separation systems, and level the compared non-generative systems on a list of signal quality measures and on interference removal. We provide a noise robustness study on the latent encoder, providing insights on its potential for the task. We release a modular toolkit for further research on the topic.
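The abstract mentions a noise-robustness study on the latent encoder. A toy sketch of what such a probe can look like, perturbing latents with Gaussian noise at several scales and measuring reconstruction SNR, is shown below; the orthogonal-projection "encoder"/"decoder" pair here is a hypothetical stand-in, not the paper's actual autoencoder.

```python
# Toy noise-robustness probe for a latent codec: perturb latents with
# Gaussian noise at several scales, decode, and measure reconstruction SNR.
# The "autoencoder" here is a random orthonormal basis (so decoding the
# clean latent reconstructs the input exactly); all names are illustrative.
import numpy as np

rng = np.random.default_rng(0)

dim = 64
basis, _ = np.linalg.qr(rng.standard_normal((dim, dim)))  # orthonormal basis
encode = lambda x: basis.T @ x
decode = lambda z: basis @ z

def snr_db(ref, est):
    """Signal-to-noise ratio of an estimate against a reference, in dB."""
    return 10 * np.log10(np.sum(ref**2) / np.sum((ref - est)**2))

x = np.sin(np.linspace(0, 8 * np.pi, dim))  # toy "audio" frame
z = encode(x)
for sigma in (0.01, 0.1, 0.5):
    z_noisy = z + sigma * rng.standard_normal(z.shape)
    print(f"sigma={sigma}: SNR = {snr_db(x, decode(z_noisy)):.1f} dB")
```

Larger latent perturbations yield lower decode SNR; the paper's study asks the analogous question of its learned codec, whose tolerance to residual diffusion noise bounds achievable output quality.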

💡 Deep Analysis

Deep Dive into Singing Voice Separation Using Diffusion Models: An Efficient Generative Approach Based on Latent Diffusion.


📄 Full Content

Efficient and Fast Generative-Based Singing Voice Separation using a Latent Diffusion Model

Genís Plaja-Roglans∗†, Yun-Ning Hung∗, Xavier Serra†, and Igor Pereira∗

∗Music.AI, Salt Lake City, Utah, United States
†Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain

Index Terms: Denoising Diffusion Probabilistic Models, Music Source Separation, Generative Modeling

I. INTRODUCTION

Deep generative audio modeling has emerged as a widely explored topic, with important advances especially attributed to denoising diffusion probabilistic models (DDPM) [1].
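As a refresher on the DDPM machinery referenced here, the forward (noising) process admits a closed form, x_t = √(ᾱ_t)·x_0 + √(1 − ᾱ_t)·ε. A minimal sketch with an illustrative linear schedule (not the paper's actual configuration) follows:

```python
# Minimal sketch of the DDPM forward (noising) process:
#   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
# The linear beta schedule below is illustrative only.
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)    # linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)   # cumulative signal retention

def q_sample(x0, t, eps):
    """Sample x_t ~ q(x_t | x_0) in closed form."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = np.sin(np.linspace(0, 4 * np.pi, 128))  # toy clean signal
eps = rng.standard_normal(x0.shape)
for t in (0, 500, 999):
    xt = q_sample(x0, t, eps)
    print(f"t={t}: signal fraction = {np.sqrt(alpha_bar[t]):.4f}")
```

By the final step the retained signal fraction is close to zero, so x_T is essentially pure noise; the reverse process learns to invert this corruption step by step.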
These generative systems have demonstrated impressive performance for creative purposes [2, 3]. By introducing stringent conditioning techniques [4, 5], the generative potential of DDPM may be used to address audio inverse problems [4, 6, 7], showing promise for multiple applications crucial to music creation. A few examples are audio or speech enhancement [4], upsampling [8], and more recently source separation [7, 9].

Music source separation (MSS) involves isolating individual elements from a musical mixture [10]. It plays an important role in music creation, practice, and analysis [11]. This task is generally addressed via neural networks that mask or transform the spectrogram of a mixture to extract the individual sources [12]. However, these face an inherent challenge due to the significant overlap between musical sources, which may limit performance. Additionally, synthesizing the estimated spectrograms into the time domain introduces further complexity [10], and predicting the phase of complex spectrograms is a studied but challenging task [13]. Moreover, having access to all sources that linearly sum up to the mixture is crucial to train these systems [14], but acquiring such data is costly [15].

These challenges are crucial, as music practitioners value high-quality, clean separations. While recent deterministic models have achieved impressive performance on objective separation metrics [16, 17], it remains unclear whether these metrics fully capture perceptual quality [18, 19]. This is more pronounced for generative models, which inherently sample from a modeled data distribution. This often results in outputs with minor, potentially imperceptible deviations from the target when addressing inverse problems. These subtle differences are disproportionately penalized by the separation metrics [20, 21].

¹ Please see https://github.com/WeAreMusicAI/dmx-diffusion
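The masking paradigm described above can be illustrated with an "ideal ratio mask" computed directly from ground-truth magnitudes. Real separators predict such a mask with a neural network from the mixture alone; this toy sketch (synthetic random "spectrograms") only shows the mechanics:

```python
# Sketch of spectrogram masking for source separation: an "ideal ratio mask"
# built from ground-truth magnitudes, applied to the mixture magnitude.
# Real systems must *predict* the mask from the mixture; this is illustrative.
import numpy as np

rng = np.random.default_rng(1)
vocals = np.abs(rng.standard_normal((64, 32)))  # |spectrogram| of vocals
accomp = np.abs(rng.standard_normal((64, 32)))  # |spectrogram| of accompaniment
mixture = vocals + accomp                       # magnitudes assumed additive

mask = vocals / (mixture + 1e-8)                # ideal ratio mask, in [0, 1]
vocals_est = mask * mixture                     # masked mixture ~= vocals

err = np.max(np.abs(vocals_est - vocals))
print(f"max reconstruction error with the ideal mask: {err:.2e}")
```

Even this oracle mask recovers only the magnitude; synthesizing audio still requires a phase estimate, which is one of the challenges the text notes for the masking approach.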
However, users may prioritize perceptual quality and cleanliness over an exact copy of the target signal.

Despite their generation potential, the computational cost of training DDPM remains a challenge, while large datasets are normally required [1, 4]. To alleviate this, latent diffusion models (LDM) were proposed [22]. These systems are trained to generate samples encoded in a learned and compact latent representation, which is obtained from an autoencoder optimized for the target data. Thereby, latent diffusion enables faster and more efficient optimization. More importantly, inference can be run effectively with fewer computing resources, which is crucial to bring these tools to music practitioners. We explore the potential of latent diffusion to separate the singing voice, a crucial but complex source, having access solely to solo vocal tracks and the corresponding mixtures for training.

Recently, DDPM have been employed to separate musical sources both in the time [7] and latent domains [9]. However, these studies primarily focus on synthetic instrumental mixtures and often exclude vocals
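The latent-diffusion separation workflow described above (sample a latent by reverse diffusion conditioned on the mixture, then decode it to audio) can be sketched at a high level. The denoiser below is a hypothetical placeholder, not the paper's trained network, and the final decode step is elided:

```python
# High-level sketch of conditional latent-diffusion inference for separation:
# start from noise in latent space, run the reverse diffusion loop conditioned
# on the mixture latent, then decode the result to audio (decode not shown).
import numpy as np

rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def denoiser(z_t, t, cond):
    """Placeholder eps-predictor; a trained model would condition on `cond`."""
    return 0.1 * z_t + 0.0 * cond  # hypothetical, for control flow only

def separate(mixture_latent):
    z = rng.standard_normal(mixture_latent.shape)  # start from pure noise
    for t in reversed(range(T)):                   # reverse diffusion loop
        eps = denoiser(z, t, mixture_latent)
        # Standard DDPM posterior-mean update; fresh noise is added for t > 0.
        z = (z - betas[t] / np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            z += np.sqrt(betas[t]) * rng.standard_normal(z.shape)
    return z  # a real system would now decode z to a vocal waveform

vocals_latent = separate(np.zeros(16))
print("generated latent shape:", vocals_latent.shape)
```

Because the loop runs in a compact latent space rather than over raw audio or full spectrograms, each denoising step is cheap, which is the efficiency argument the text makes for LDM-based separation.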

…(Full text truncated)…


Reference

This content is AI-processed based on ArXiv data.
