Supervised Speech Separation Based on Deep Learning: An Overview

Reading time: 6 minutes
...

📝 Original Info

  • Title: Supervised Speech Separation Based on Deep Learning: An Overview
  • ArXiv ID: 1708.07524
  • Date: 2018-06-18
  • Authors: D. L. Wang and J. Chen

📝 Abstract

Speech separation is the task of separating target speech from background interference. Traditionally, speech separation is studied as a signal processing problem. A more recent approach formulates speech separation as a supervised learning problem, where the discriminative patterns of speech, speakers, and background noise are learned from training data. Over the past decade, many supervised separation algorithms have been put forward. In particular, the recent introduction of deep learning to supervised speech separation has dramatically accelerated progress and boosted separation performance. This article provides a comprehensive overview of the research on deep learning based supervised speech separation in the last several years. We first introduce the background of speech separation and the formulation of supervised separation. Then we discuss three main components of supervised separation: learning machines, training targets, and acoustic features. Much of the overview is on separation algorithms where we review monaural methods, including speech enhancement (speech-nonspeech separation), speaker separation (multi-talker separation), and speech dereverberation, as well as multi-microphone techniques. The important issue of generalization, unique to supervised learning, is discussed. This overview provides a historical perspective on how advances are made. In addition, we discuss a number of conceptual issues, including what constitutes the target source.
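To make the supervised formulation concrete, the sketch below computes one commonly used training target, the ideal ratio mask (IRM), from premixed clean speech and noise; a learning machine is then trained to predict such a mask from features of the noisy mixture. The STFT front-end, function names, and parameter values here are illustrative assumptions, not an implementation taken from the paper.

```python
import numpy as np

def stft_mag(x, frame_len=512, hop=256):
    """Magnitude spectrogram via a basic Hann-windowed STFT (illustrative front-end)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))           # shape: (frames, frequency bins)

def ideal_ratio_mask(clean_speech, noise, beta=0.5):
    """IRM(t, f) = (S^2 / (S^2 + N^2))^beta -- a soft time-frequency mask in [0, 1]."""
    S = stft_mag(clean_speech)
    N = stft_mag(noise)
    return (S ** 2 / (S ** 2 + N ** 2 + 1e-12)) ** beta  # small constant avoids division by zero

# Training pair for supervised separation (illustrative):
#   input  -> acoustic features of the noisy mixture (clean_speech + noise)
#   target -> the IRM above; at test time the predicted mask is applied to the
#             mixture spectrogram and the result is inverted back to a waveform.
```

The paper also surveys alternative training targets (such as the ideal binary mask and spectral mapping) and discusses how the choice of target and acoustic features interacts with the learning machine.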

📄 Full Content


Abstract—Speech separation is the task of separating target speech from background interference. Traditionally, speech separation is studied as a signal processing problem. A more recent approach formulates speech separation as a supervised learning problem, where the discriminative patterns of speech, speakers, and background noise are learned from training data. Over the past decade, many supervised separation algorithms have been put forward. In particular, the recent introduction of deep learning to supervised speech separation has dramatically accelerated progress and boosted separation performance. This article provides a comprehensive overview of the research on deep learning based supervised speech separation in the last several years. We first introduce the background of speech separation and the formulation of supervised separation. Then we discuss three main components of supervised separation: learning machines, training targets, and acoustic features. Much of the overview is on separation algorithms where we review monaural methods, including speech enhancement (speech-nonspeech separation), speaker separation (multi-talker separation), and speech dereverberation, as well as multi-microphone techniques. The important issue of generalization, unique to supervised learning, is discussed. This overview provides a historical perspective on how advances are made. In addition, we discuss a number of conceptual issues, including what constitutes the target source.

Index Terms—Speech separation, speaker separation, speech enhancement, supervised speech separation, deep learning, deep neural networks, speech dereverberation, time-frequency masking, array separation, beamforming.

Author affiliations: D. L. Wang is with the Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210, USA (e-mail: dwang@cse.ohio-state.edu). He also holds a visiting appointment at the Center of Intelligent Acoustics and Immersive Communications, Northwestern Polytechnical University, Xi’an, China. J. Chen was with the Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA (email: chen.2593@osu.edu). He is now with Silicon Valley AI Lab at Baidu Research, 1195 Bordeaux Drive, Sunnyvale, CA 94089, USA.

I. INTRODUCTION

The goal of speech separation is to separate target speech from background interference. Speech separation is a fundamental task in signal processing with a wide range of applications, including hearing prosthesis, mobile telecommunication, and robust automatic speech and speaker recognition. The human auditory system has the remarkable ability to extract one sound source from a mixture of multiple sources. In an acoustic environment like a cocktail party, we seem capable of effortlessly following one speaker in the presence of other speakers and background noises. Speech separation is commonly called the “cocktail party problem,” a term coined by Cherry in his famous 1953 paper [26].
Speech separation is a special case of sound source separation. Perceptually, source separation corresponds to auditory stream segregation, a topic of extensive research in auditory perception. The first systematic study on stream segregation was conducted by Miller and Heise [124], who noted that listeners split a signal with two alternating sine-wave tones into two streams. Bregman and his colleagues have carried out a series of studies on the subject, and in a seminal book [15] he introduced the term auditory scene analysis (ASA) to refer to the perceptual process that segregates an acoustic mixture and groups the signal originating from the same sound source. Auditory scene analysis is divided into simultaneous organization and sequential organization. Simultaneous organization (or grouping) integrates concurrent sounds, while sequential organization integrates sounds across time. With auditory patterns displayed on a time-frequency representation such as a spectrogram, main organizational principles responsible for ASA include: Proximity in frequency and time, harmonicity, common amplitude and frequency modulation, onset and offset synchrony, common location, and prior knowledge (see among others [163] [15] [29] [11] [30] [32]). These grouping principles also govern speech segregation [201] [154] [31] [4] [49] [93]. From ASA studies, there seems to be a consensus that the human auditory system segregates and attends to a target sound, which can be a tone sequence, a melody, or a voice. More debatable is the role of auditory attention in stream segregation [17] [151] [148] [120]. In this overview, we use speech separation (or segregation) primarily to refer to the computational task of separating the target speech signal from a noisy mixture.
How well do we

…(Full text truncated)…

Reference

This content was generated by AI processing of ArXiv data.
