Supervised Speech Separation Based on Deep Learning: An Overview
Speech separation is the task of separating target speech from background interference. Traditionally, speech separation is studied as a signal processing problem. A more recent approach formulates speech separation as a supervised learning problem, where the discriminative patterns of speech, speakers, and background noise are learned from training data. Over the past decade, many supervised separation algorithms have been put forward. In particular, the recent introduction of deep learning to supervised speech separation has dramatically accelerated progress and boosted separation performance. This article provides a comprehensive overview of the research on deep learning based supervised speech separation in the last several years. We first introduce the background of speech separation and the formulation of supervised separation. Then we discuss three main components of supervised separation: learning machines, training targets, and acoustic features. Much of the overview is on separation algorithms where we review monaural methods, including speech enhancement (speech-nonspeech separation), speaker separation (multi-talker separation), and speech dereverberation, as well as multi-microphone techniques. The important issue of generalization, unique to supervised learning, is discussed. This overview provides a historical perspective on how advances are made. In addition, we discuss a number of conceptual issues, including what constitutes the target source.
💡 Research Summary
This paper provides a comprehensive overview of supervised speech separation research that leverages deep learning. It begins by framing speech separation as the classic “cocktail‑party problem,” highlighting the limitations of traditional signal‑processing approaches such as spectral subtraction, computational auditory scene analysis (CASA), and beamforming. The authors then introduce the supervised paradigm, where the task is cast as a classification or regression problem using time‑frequency (T‑F) masks, with the ideal binary mask (IBM) serving as the earliest training target.
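To make the IBM concrete (this is an illustration, not code from the paper), here is a minimal NumPy sketch that thresholds the local SNR of each T‑F unit against a local criterion, assuming premixed speech and noise magnitude spectrograms are available during training; the function and parameter names are our own:

```python
import numpy as np

def ideal_binary_mask(speech_mag, noise_mag, lc_db=0.0):
    """Ideal binary mask: 1 where the local SNR exceeds the local
    criterion (LC), else 0.

    speech_mag, noise_mag: magnitude spectrograms (freq x time).
    lc_db: local SNR criterion in dB (0 dB is a common choice).
    """
    eps = 1e-10  # avoid log(0) and division by zero
    local_snr_db = 20.0 * np.log10((speech_mag + eps) / (noise_mag + eps))
    return (local_snr_db > lc_db).astype(np.float32)

# Toy example: one speech-dominant and one noise-dominant T-F unit.
speech = np.array([[4.0, 0.5]])
noise = np.array([[1.0, 2.0]])
mask = ideal_binary_mask(speech, noise)  # -> [[1. 0.]]
```

Multiplying the mixture spectrogram by this mask and resynthesizing keeps the speech-dominant units and discards the rest, which is the classification view of separation described above.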
The review is organized around three core components: learning machines, training targets, and acoustic features.
Learning machines – The paper surveys the evolution from shallow multilayer perceptrons (MLPs) to modern deep architectures. It discusses how unsupervised pre‑training (with restricted Boltzmann machines, RBMs), rectified linear units (ReLUs), and skip connections mitigated the vanishing‑gradient problem, enabling deeper MLPs. Convolutional neural networks (CNNs) are presented as efficient for exploiting local invariances in spectrograms, while recurrent neural networks (RNNs), especially long short‑term memory (LSTM) units, capture the strong temporal dynamics of speech. The authors also note emerging generative adversarial networks (GANs) that learn to produce masks or waveforms through a generator‑discriminator game, offering more natural‑sounding reconstructions.
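As a minimal sketch of the mask‑estimation setup these architectures share (not the paper's own implementation), the forward pass of a small ReLU MLP with a sigmoid output layer, which keeps estimated mask values in (0, 1); all names and shapes here are hypothetical:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mlp_mask_estimator(features, weights, biases):
    """Forward pass of a hypothetical ReLU MLP mapping acoustic
    features to a T-F mask.

    features: (frames, feat_dim) input features.
    weights, biases: per-layer parameters; the last layer's output
    dimension equals the number of frequency bins to mask.
    """
    h = features
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)          # hidden layers: affine + ReLU
    logits = h @ weights[-1] + biases[-1]
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid keeps mask in (0, 1)
```

Training such a network amounts to regressing the output against an ideal mask (e.g., the IBM or IRM) under a suitable loss; CNN and LSTM variants change only how the hidden representation `h` is computed.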
Training targets – Beyond the IBM, the paper enumerates a spectrum of targets: the ideal ratio mask (IRM), the phase‑sensitive mask (PSM), complex‑valued masks, and direct waveform regression (e.g., time‑domain end‑to‑end models). Each target pairs naturally with a loss function: cross‑entropy for binary masks, mean‑square error for ratio masks, and scale‑invariant signal‑to‑noise ratio (SI‑SNR) for waveform prediction. Empirical evidence shows that continuous masks (IRM, PSM) generally outperform binary masks on perceptual metrics such as PESQ and STOI.
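Two of the quantities above can be sketched directly in NumPy; the IRM is shown in its common square‑root form, and SI‑SNR follows the usual projection definition (function names and the epsilon values are illustrative, not from the paper):

```python
import numpy as np

def ideal_ratio_mask(speech_pow, noise_pow):
    """IRM in its common square-root form: sqrt(S / (S + N)),
    with values in [0, 1]. Inputs are power spectrograms."""
    return np.sqrt(speech_pow / (speech_pow + noise_pow + 1e-10))

def si_snr(estimate, target):
    """Scale-invariant SNR in dB: project the estimate onto the
    target, then compare the energies of the projection and the
    residual. Invariant to rescaling of the estimate."""
    target = target - target.mean()
    estimate = estimate - estimate.mean()
    s_target = (np.dot(estimate, target) / np.dot(target, target)) * target
    e_noise = estimate - s_target
    return 10.0 * np.log10(
        np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + 1e-10))
```

Because SI‑SNR ignores scale, a perfectly separated but louder or quieter estimate is not penalized, which is why it is the standard training loss for time‑domain models.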
Acoustic features – The most common input is the log‑power spectrogram, often augmented with mel‑frequency cepstral coefficients (MFCCs) or phase information. Recent work feeds raw waveforms into convolutional‑recurrent hybrids (e.g., Conv‑TasNet, DPRNN), eliminating the need for hand‑crafted spectral features and achieving state‑of‑the‑art performance.
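The log‑power spectrogram mentioned above is straightforward to compute by framing, windowing, and applying a real FFT; a minimal NumPy sketch (parameter defaults are typical values, not prescribed by the paper):

```python
import numpy as np

def log_power_spectrogram(wave, n_fft=512, hop=128):
    """Log-power spectrogram, a common DNN input feature.

    wave: 1-D waveform. Returns (frames, n_fft // 2 + 1).
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    # Frame the signal with overlap and apply the analysis window.
    frames = np.stack(
        [wave[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)      # one-sided spectrum
    return np.log(np.abs(spec) ** 2 + 1e-10)  # floor avoids log(0)
```

End‑to‑end models such as Conv‑TasNet replace this fixed analysis with a learned 1‑D convolutional encoder applied to the raw waveform.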
The survey then distinguishes between monaural and multi‑microphone (array) approaches. In the monaural domain, the authors categorize algorithms into speech enhancement (speech‑noise separation), speaker separation (multiple talkers), and dereverberation. They highlight that time‑domain end‑to‑end models (Conv‑TasNet, DPRNN, Transformer‑based networks) now surpass traditional spectrogram‑based methods.
For array‑based techniques, the paper reviews classic beamformers (delay‑and‑sum, MVDR) and their deep‑learning‑enhanced variants. Mask‑based beamforming uses DNN‑estimated masks to compute spatial filters, while fully end‑to‑end spatial‑temporal networks jointly learn spatial filtering and mask estimation, showing robustness to co‑located sources and reverberation.
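To illustrate mask‑based beamforming (a sketch under simplifying assumptions, not a reference implementation): per frequency bin, a DNN‑estimated speech mask weights the outer products of the multichannel STFT to form speech and noise spatial covariance matrices, a steering vector is taken as the principal eigenvector of the speech covariance, and MVDR weights follow from the standard closed form; the regularization constant is an assumption:

```python
import numpy as np

def mask_based_mvdr(Y, speech_mask):
    """MVDR weights for one frequency bin from a speech T-F mask.

    Y: (channels, frames) complex STFT at one frequency bin.
    speech_mask: (frames,) DNN-estimated values in [0, 1].
    Returns w with the distortionless constraint w^H d = 1.
    """
    noise_mask = 1.0 - speech_mask
    # Mask-weighted spatial covariance matrices (channels x channels).
    phi_s = (speech_mask * Y) @ Y.conj().T / (speech_mask.sum() + 1e-10)
    phi_n = (noise_mask * Y) @ Y.conj().T / (noise_mask.sum() + 1e-10)
    # Steering vector: principal eigenvector of the speech covariance.
    d = np.linalg.eigh(phi_s)[1][:, -1]
    # MVDR: w = phi_n^{-1} d / (d^H phi_n^{-1} d), with diagonal loading.
    num = np.linalg.solve(phi_n + 1e-6 * np.eye(len(d)), d)
    return num / (d.conj() @ num)

# Apply per frame as: enhanced[t] = w.conj() @ Y[:, t]
```

The appeal of this recipe, as the survey notes, is that the neural network only estimates masks; the spatial filtering itself remains a well‑understood linear operation.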
A dedicated section addresses the generalization challenge inherent to supervised learning. Domain mismatch—differences in noise types, speaker identities, room acoustics, and microphone configurations—can degrade performance. The authors discuss data augmentation, domain adaptation, meta‑learning, and unsupervised pre‑training as strategies to improve robustness.
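The simplest of the augmentation strategies above is mixing clean speech with varied noises at controlled SNRs to broaden the training distribution; a minimal sketch (the function name is our own):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the mixture has the requested SNR in dB.

    speech, noise: 1-D waveforms of equal length.
    """
    speech_pow = np.mean(speech ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-10  # guard against silence
    # Solve speech_pow / (scale^2 * noise_pow) = 10^(snr_db / 10).
    scale = np.sqrt(speech_pow / (noise_pow * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```

Sampling `snr_db`, the noise type, and the speaker independently for each training mixture is the basic recipe for reducing the domain mismatch the authors describe.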
Finally, the paper raises a conceptual question: “What should be considered the target?” It argues that the optimal target may not be the clean waveform per se, but a signal optimized for human intelligibility or downstream tasks (e.g., ASR‑oriented masks).
In conclusion, deep learning has dramatically advanced supervised speech separation, with end‑to‑end time‑domain models and spatial‑temporal integration representing the current frontier. Future research directions include improving cross‑domain generalization, achieving low‑latency real‑time deployment, and tighter integration with models of human auditory perception to finally solve the cocktail‑party problem.