📝 Original Info
- Title: Supervised Speech Separation Based on Deep Learning: An Overview
- ArXiv ID: 1708.07524
- Date: 2018-06-18
- Authors: D. L. Wang and J. Chen (per the paper's author footnote)
📝 Abstract
Speech separation is the task of separating target speech from background interference. Traditionally, speech separation is studied as a signal processing problem. A more recent approach formulates speech separation as a supervised learning problem, where the discriminative patterns of speech, speakers, and background noise are learned from training data. Over the past decade, many supervised separation algorithms have been put forward. In particular, the recent introduction of deep learning to supervised speech separation has dramatically accelerated progress and boosted separation performance. This article provides a comprehensive overview of the research on deep learning based supervised speech separation in the last several years. We first introduce the background of speech separation and the formulation of supervised separation. Then we discuss three main components of supervised separation: learning machines, training targets, and acoustic features. Much of the overview is on separation algorithms where we review monaural methods, including speech enhancement (speech-nonspeech separation), speaker separation (multi-talker separation), and speech dereverberation, as well as multi-microphone techniques. The important issue of generalization, unique to supervised learning, is discussed. This overview provides a historical perspective on how advances are made. In addition, we discuss a number of conceptual issues, including what constitutes the target source.
📄 Full Content
Abstract—Speech separation is the task of separating target speech from background interference. Traditionally, speech separation is studied as a signal processing problem. A more recent approach formulates speech separation as a supervised learning problem, where the discriminative patterns of speech, speakers, and background noise are learned from training data. Over the past decade, many supervised separation algorithms have been put forward. In particular, the recent introduction of deep learning to supervised speech separation has dramatically accelerated progress and boosted separation performance. This article provides a comprehensive overview of the research on deep learning based supervised speech separation in the last several years. We first introduce the background of speech separation and the formulation of supervised separation. Then we discuss three main components of supervised separation: learning machines, training targets, and acoustic features. Much of the overview is on separation algorithms, where we review monaural methods, including speech enhancement (speech-nonspeech separation), speaker separation (multi-talker separation), and speech dereverberation, as well as multi-microphone techniques. The important issue of generalization, unique to supervised learning, is discussed. This overview provides a historical perspective on how advances are made. In addition, we discuss a number of conceptual issues, including what constitutes the target source.
Index Terms—Speech separation, speaker separation, speech enhancement, supervised speech separation, deep learning, deep neural networks, speech dereverberation, time-frequency masking, array separation, beamforming.
I. INTRODUCTION
The goal of speech separation is to separate target speech from background interference. Speech separation is a fundamental task in signal processing with a wide range of applications, including hearing prosthesis, mobile telecommunication, and robust automatic speech and speaker recognition. The human auditory system has the remarkable ability to extract one sound source from a mixture of multiple sources. In an acoustic environment like a cocktail party, we seem capable of effortlessly following one speaker in the presence of other speakers and background noises. Speech separation is commonly called the "cocktail party problem," a term coined by Cherry in his famous 1953 paper [26].

D. L. Wang is with the Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210, USA (e-mail: dwang@cse.ohio-state.edu). He also holds a visiting appointment at the Center of Intelligent Acoustics and Immersive Communications, Northwestern Polytechnical University, Xi'an, China.

J. Chen was with the Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA (email: chen.2593@osu.edu). He is now with Silicon Valley AI Lab at Baidu Research, 1195 Bordeaux Drive, Sunnyvale, CA 94089, USA.
Speech separation is a special case of sound source separation. Perceptually, source separation corresponds to auditory stream segregation, a topic of extensive research in auditory perception. The first systematic study of stream segregation was conducted by Miller and Heise [124], who noted that listeners split a signal with two alternating sine-wave tones into two streams. Bregman and his colleagues have carried out a series of studies on the subject, and in a seminal book [15] he introduced the term auditory scene analysis (ASA) to refer to the perceptual process that segregates an acoustic mixture and groups the signal originating from the same sound source. Auditory scene analysis is divided into simultaneous organization and sequential organization. Simultaneous organization (or grouping) integrates concurrent sounds, while sequential organization integrates sounds across time. With auditory patterns displayed on a time-frequency representation such as a spectrogram, the main organizational principles responsible for ASA include proximity in frequency and time, harmonicity, common amplitude and frequency modulation, onset and offset synchrony, common location, and prior knowledge (see among others [163] [15] [29] [11] [30] [32]). These grouping principles also govern speech segregation [201] [154] [31] [4] [49] [93]. From ASA studies, there seems to be a consensus that the human auditory system segregates and attends to a target sound, which can be a tone sequence, a melody, or a voice. More debatable is the role of auditory attention in stream segregation [17] [151] [148] [120]. In this overview, we use speech separation (or segregation) primarily to refer to the computational task of separating the target speech signal from a noisy mixture.
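To make the computational task concrete, the index terms mention time-frequency masking, a central technique in this literature: separation is performed by weighting each time-frequency unit of the mixture spectrogram by how strongly the target dominates it. Below is a minimal NumPy sketch of the ideal binary mask (IBM), assuming we already have target and noise magnitude spectrograms (here just toy random arrays); the function name and the local-criterion parameter `lc_db` are illustrative, not from the paper.

```python
import numpy as np

def ideal_binary_mask(target_mag, noise_mag, lc_db=0.0):
    """Return 1 for T-F units where local SNR exceeds lc_db, else 0."""
    snr_db = 20.0 * np.log10(target_mag / np.maximum(noise_mag, 1e-12))
    return (snr_db > lc_db).astype(np.float32)

# Toy magnitude spectrograms (frequency x time) standing in for STFT output.
rng = np.random.default_rng(0)
S = rng.random((4, 5))        # "target speech" magnitudes
N = rng.random((4, 5))        # "noise" magnitudes
M = ideal_binary_mask(S, N)   # binary mask over the T-F plane
estimate = M * (S + N)        # keep only speech-dominated T-F units
```

With `lc_db=0`, the mask simply keeps the T-F units where the target magnitude exceeds the noise magnitude; supervised methods surveyed in the paper train a learner to predict such a target from noisy acoustic features.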
How well do we
…(Full text truncated)…
Reference
This content is AI-processed based on ArXiv data.