Statistical Speech Enhancement Based on Probabilistic Integration of Variational Autoencoder and Non-Negative Matrix Factorization
Yoshiaki Bando¹, Masato Mimura¹, Katsutoshi Itoyama¹, Kazuyoshi Yoshii¹,², Tatsuya Kawahara¹
¹Graduate School of Informatics, Kyoto University, Sakyo-ku, Kyoto 606-8501, Japan
²Center for Advanced Intelligence Project, RIKEN, Chuo-ku, Tokyo 103-0027, Japan

ABSTRACT

This paper presents a statistical method of single-channel speech enhancement that uses a variational autoencoder (VAE) as a prior distribution on clean speech. A standard approach to speech enhancement is to train a deep neural network (DNN) to take noisy speech as input and output clean speech. This supervised approach requires a very large amount of paired data for training, yet it is still not robust against unknown environments. Another approach is to use non-negative matrix factorization (NMF) based on basis spectra trained on clean speech in advance and those adapted to noise on the fly. This semi-supervised approach, however, causes considerable signal distortion in enhanced speech due to the unrealistic assumption that speech spectrograms are linear combinations of the basis spectra. Replacing the poor linear generative model of clean speech in NMF with a VAE (a powerful nonlinear deep generative model) trained on clean speech, we formulate a unified probabilistic generative model of noisy speech. Given noisy speech as observed data, we can sample clean speech from its posterior distribution. The proposed method outperformed the conventional DNN-based method in unseen noisy environments.

Index Terms: Single-channel speech enhancement, variational autoencoder, Bayesian signal processing

1. INTRODUCTION

Deep neural networks (DNNs) have demonstrated excellent performance in single-channel speech enhancement [1-6].
The denoising autoencoder (DAE) [1], for example, is a typical variant of such networks; it is trained in a supervised manner to directly convert a noisy speech spectrogram into a clean speech spectrogram. Alternatively, a DNN can be trained to predict time-frequency (TF) masks called ideal ratio masks (IRMs), which represent the ratios of speech to input signals and are used for obtaining a speech spectrogram from a noisy spectrogram [4]. Although it is necessary to prepare as training data a large number of pairs of clean speech signals and their noisy versions, these supervised methods often deteriorate in unknown noisy environments. This calls for semi-supervised methods that are trained using only clean speech data in advance and then adapt to unseen noisy environments.

Statistical source separation methods based on the additivity of speech and noise spectrograms have also been used for speech enhancement [7-11]. Non-negative matrix factorization (NMF) [9, 12], for example, regards a noisy speech spectrogram as a non-negative matrix and approximates it as the product of two non-negative matrices (a set of basis spectra and a set of the corresponding activations). If a partial set of basis spectra is trained in advance from clean speech spectrograms, the noisy spectrogram is decomposed into the sum of speech and noise spectrograms in a semi-supervised manner. Robust principal component analysis (RPCA) [13, 14] is another promising method that can decompose a noisy spectrogram into a sparse speech spectrogram and a low-rank noise spectrogram in an unsupervised manner.

Thanks to JSPS KAKENHI No. 15J08765 for funding.

Fig. 1. Overview of our speech enhancement model: a VAE-based speech model p(s|z), pre-trained on a clean speech dataset, is combined with an NMF-based noise model to explain the observed signal x = s + n.
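The IRM mentioned above can be illustrated with a short sketch. This is our own illustrative code, not an implementation from any cited method; the mask definition |S|²/(|S|² + |N|²) is one common choice, and all names are ours:

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, eps=1e-12):
    # IRM computed from known speech and noise magnitude spectrograms:
    # |S|^2 / (|S|^2 + |N|^2), in [0, 1] per TF bin.
    s2, n2 = speech_mag ** 2, noise_mag ** 2
    return s2 / (s2 + n2 + eps)

rng = np.random.default_rng(0)
S = np.abs(rng.standard_normal((513, 100)))   # speech magnitudes (F x T)
N = np.abs(rng.standard_normal((513, 100)))   # noise magnitudes (F x T)
mask = ideal_ratio_mask(S, N)
# Apply the mask to an (approximate) magnitude mixture; in practice the
# mask is applied to the noisy spectrogram before inverting the STFT.
enhanced = mask * (S + N)
```

In supervised enhancement, a DNN is trained to predict this mask from noisy features; at test time only the noisy input is available.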
These conventional statistical methods, however, share a common problem: the linear representation or the sparseness assumption of speech spectrograms is not satisfied in reality, which results in considerable signal distortion.

Recently, deep generative models such as generative adversarial networks (GANs) and variational autoencoders (VAEs) have gained a lot of attention for learning a probability distribution over complex data (e.g., images and audio signals) that cannot be represented by conventional linear models [15-19]. GANs and VAEs are both based on two kinds of DNNs having different roles. In GANs [15], a generator is trained to synthesize data from a latent space that fool a discriminator, while the discriminator is trained to detect synthesized data, in a minimax-game fashion. In VAEs [16, 17], on the other hand, an encoder that embeds observed data into a latent space and a decoder that generates data from the latent space are trained jointly such that the lower bound of the log marginal likelihood of the observed data is maximized. Although GANs can in general generate more realistic data, VAEs provide a principled scheme for inferring the latent representations of both given and new data.

In this paper we propose a unified probabilistic generative model of noisy speech spectra by combining a VAE-based generative model of speech spectra with an NMF-based generative model of noise spectra (Fig. 1). The VAE is trained in advance on a sufficient amount of clean speech spectra, and its decoder is used as a prior distribution on the clean speech spectra included in noisy speech spectra. Given observed data, we can estimate both the latent representations of speech spectra and the basis spectra and activations of noise spectra through Bayesian inference based on a Markov chain Monte Carlo (MCMC) method initialized by the encoder of the VAE.
Our Bayesian approach can adapt to both unseen speech and noise spectra by using prior knowledge of clean speech and a low-rankness assumption on noise instead of fixing all the parameters in advance.

*Demo page: http://sap.ist.i.kyoto-u.ac.jp/members/yoshiaki/demo/vae-nmf/

2. RELATED WORK

This section overviews DNN-based speech enhancement and introduces the variational autoencoder (VAE).

2.1. DNN-based speech enhancement

Various network architectures and cost functions for enhancing speech signals have been reported [1-6]. A popular approach of DNN-based speech enhancement is to train a DNN to directly represent clean speech [6]. The DNN is trained on simulated noisy data constructed by adding noise to speech, with the noisy speech as input and the clean speech as the target. Several methods combine supervised NMF and a DNN [20, 21]: a DNN is trained to estimate activation vectors for pre-trained basis vectors corresponding to speech and noise. Bayesian WaveNet [22] uses two networks: one, called a prior network, represents how likely a signal is to be speech, and the other, called a likelihood network, represents how likely a signal is to be included in the observation. These two networks enhance the noisy speech signal with maximum a posteriori (MAP) estimation. Another reported method uses two networks that are trained to represent how likely the input signal is to be speech or noise, respectively [23]. The speech signal is enhanced by optimizing a cost function so that the estimated speech maximizes the output of the speech-likelihood network and minimizes that of the noise-likelihood network. All the above-mentioned methods are trained with datasets of both speech and noise signals.

A DNN-based method using only training data of speech signals was reported [24]. This method represents speech and noise spectra with two autoencoders (AEs). The AE for speech is pre-trained, whereas that for noise is trained at inference time to adapt to the observed noise signal.
Since the inference in this framework is under-determined, the estimated speech is constrained to be represented by a pre-trained NMF model. It thus might have the same problem as semi-supervised NMF.

2.2. Variational autoencoder

A VAE [16] is a framework for learning the probability distribution of a dataset. In this subsection, we denote by X a dataset that contains F-dimensional samples x_t ∈ R^F (t = 1, ..., T). The VAE assumes that a D-dimensional latent variable z_t ∈ R^D follows a standard Gaussian distribution and that each sample x_t is stochastically generated from a conditional distribution p(x_t | z_t):

  z_t ~ N(0, I_D),   (1)
  x_t ~ p(x_t | z_t),   (2)

where N(μ, σ) represents a Gaussian distribution with mean parameter μ and variance parameter σ. p(x_t | z_t) is called a decoder and is parameterized as a well-known probability density function whose parameters are given by nonlinear functions represented as neural networks. For example, Kingma et al. [16] reported a VAE model that has the following Gaussian likelihood function:

  x_t ~ p(x_t | z_t) = ∏_f p(x_{ft} | z_t) = ∏_f N(μ^x_f(z_t), σ^x_f(z_t)),   (3)

where μ^x_f : R^D → R and σ^x_f : R^D → R_+ are neural networks representing the mean and variance parameters, respectively.

The objective of VAE training is to find a likelihood function p(x_t | z_t) that maximizes the log marginal likelihood:

  argmax_{p(x_t|z_t)} log p(X) = argmax_{p(x_t|z_t)} ∏_t ∫ p(x_t | z_t) p(z_t) dz_t.   (4)

Since calculating this marginal likelihood is intractable, it is approximated within a variational Bayesian (VB) framework. The VAE first approximates the posterior distribution of z_t with the following variational posterior distribution q(z_t), called an encoder:

  p(z_1, ..., z_T | X) ≈ ∏_t q(z_t) = ∏_{d,t} q(z_{dt})   (5)
    = ∏_{d,t} N(μ^z_d(x_t), σ^z_d(x_t)),   (6)

where μ^z_d : R^F → R and σ^z_d : R^F → R_+ are nonlinear functions representing the mean and variance parameters, respectively. These functions are formulated with DNNs. By using the variational posterior, the log marginal likelihood is lower-bounded as follows:

  log p(X) = Σ_t log ∫ p(x_t | z_t) p(z_t) dz_t   (7)
    ≥ Σ_t ∫ q(z_t) log [ p(x_t | z_t) p(z_t) / q(z_t) ] dz_t   (8)
    = − Σ_t KL[ q(z_t) | p(z_t) ] + Σ_t E_q[ log p(x_t | z_t) ],   (9)

where KL[·|·] represents the Kullback-Leibler divergence. The VAE is trained so that p(x_t | z_t) and q(z_t) maximize this variational lower bound. The first term of Eq. (9) is analytically tractable, and the second term can be approximated with a Monte Carlo algorithm. The lower bound can be maximized by using stochastic gradient descent (SGD) [25].

Fig. 2. VAE representation of a speech spectrogram.

3. STATISTICAL SPEECH ENHANCEMENT BASED ON COMBINATION OF VAE AND NMF

This section describes the proposed probabilistic generative model, called VAE-NMF, that combines a VAE-based speech model and an NMF-based noise model. We formulate the generative process of an observed complex spectrogram X ∈ C^{F×T} by formulating the processes of a speech spectrogram S ∈ C^{F×T} and a noise spectrogram N ∈ C^{F×T}. The characteristics of the speech and noise signals are represented by their priors based on the VAE and NMF, respectively.

3.1. VAE-based speech model

In our speech model we assume a frame-wise D-dimensional latent variable Z ∈ R^{D×T}. Each time frame z_t of the latent variable is supposed to represent the characteristics of a speech spectrum such as the fundamental frequency, spectral envelope, and type of phoneme. The specific representation of z_t is obtained automatically by conducting the VAE training on a dataset of clean speech spectra.
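The analytically tractable KL term in Eq. (9), for a Gaussian encoder posterior against the standard Gaussian prior, can be checked numerically. A minimal NumPy sketch in our own notation (the function name is ours):

```python
import numpy as np

def kl_to_standard_normal(mu, var):
    # Closed-form KL[ N(mu, var) | N(0, 1) ], summed over latent dimensions:
    # 0.5 * sum( mu^2 + var - log(var) - 1 )
    return 0.5 * np.sum(mu ** 2 + var - np.log(var) - 1.0)

# When the encoder already matches the prior, the KL term vanishes.
kl_zero = kl_to_standard_normal(np.zeros(10), np.ones(10))
# Any mismatch in mean or variance makes it strictly positive.
kl_pos = kl_to_standard_normal(np.full(10, 0.5), np.full(10, 2.0))
```

During training this term regularizes the encoder toward the prior of Eq. (1), while the second term of Eq. (9) rewards accurate reconstruction.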
As in conventional VAEs, we put a standard Gaussian prior on each element of Z:

  z_{dt} ~ N(0, 1).   (10)

Since a speech spectrum is primarily characterized by its power spectral density (PSD), each TF bin follows a zero-mean complex Gaussian distribution whose variance parameter is formulated with Z (Fig. 2):

  s_{ft} ~ N_C(0, σ^s_f(z_t)),   (11)

where N_C(μ, σ) is a complex Gaussian distribution with mean parameter μ and variance parameter σ, and σ^s_f : R^D → R_+ is a nonlinear function representing the relationship between Z and the speech signal S. This function is formulated by using a DNN and obtained by the VAE training.

3.2. Generative model of mixture signals

In our Bayesian generative model, the input complex spectrogram X ∈ C^{F×T} is represented as the sum of a speech spectrogram S and a noise spectrogram N:

  x_{ft} = s_{ft} + n_{ft}.   (12)

We put the VAE-based hierarchical prior model (Eqs. (10) and (11)) on the speech spectrogram S. On the other hand, we assume that the PSD of the noise spectrogram is low-rank and put an NMF-based prior model on it. More specifically, the PSD of a noise spectrogram can be represented as the product of K spectral basis vectors W = [w_1, ..., w_K] ∈ R_+^{F×K} and their activation vectors H ∈ R_+^{K×T}. A zero-mean complex Gaussian distribution is put on each TF bin of the noise spectrogram N as follows:

  n_{ft} ~ N_C(0, Σ_k w_{fk} h_{kt}).   (13)

For mathematical convenience, we put conjugate prior distributions on W and H as follows:

  w_{fk} ~ G(a_0, b_0),  h_{kt} ~ G(a_1, b_1),   (14)

where G(α, β) is a gamma distribution with shape parameter α > 0 and rate parameter β > 0; a_0, b_0, a_1, and b_1 are hyperparameters that should be set in advance.

By marginalizing out the speech and noise complex spectrograms S and N, we obtain the following Gaussian likelihood:

  x_{ft} | W, H, Z ~ N_C(0, Σ_k w_{fk} h_{kt} + σ^s_f(z_t)).   (15)

Since this likelihood function is independent of the phase term of the input spectrogram X, it is equivalent to the following exponential likelihood:

  |x_{ft}|² | W, H, Z ~ Exp(Σ_k w_{fk} h_{kt} + σ^s_f(z_t)),   (16)

where |x_{ft}|² is the power spectrogram of X and Exp(λ) is the exponential distribution with mean parameter λ. Maximization of the exponential likelihood on a power spectrogram corresponds to minimization of the Itakura-Saito divergence, which is widely used in audio source separation [12, 26].

3.3. Pre-training of the VAE-based speech model

The goal of the pre-training of the VAE-based speech model is to find p(s_t | z_t) that maximizes the following marginal likelihood p(S) on a dataset of clean speech signals (denoted by S ∈ C^{F×T} in this subsection):

  p(S) = ∏_t ∫ p(s_t | z_t) p(z_t) dz_t.   (17)

As stated in Sec. 2.2, it is difficult to analytically calculate this marginal likelihood. We approximate it by using the variational mean-field approximation. Let q(Z) be the variational posterior distribution of Z. Since p(S | Z) is independent of the phase term of the speech spectrogram S, the variational posterior q(Z) is defined by ignoring the phase term as follows:

  q(Z) = ∏_{d,t} q(z_{dt}) = ∏_{d,t} N(μ^z_d(|s_t|²), σ^z_d(|s_t|²)),   (18)

where |s_t|² is the power spectrum of s_t, and μ^z_d : R_+^F → R and σ^z_d : R_+^F → R_+ are nonlinear functions representing the mean and variance parameters of the Gaussian distribution. These two functions are defined with DNNs. The marginal likelihood is approximately lower-bounded as follows:

  log p(S) ≥ − KL[ q(Z) | p(Z) ] + E_q[ log p(S | Z) ]   (19)
    = − Σ_{d,t} (1/2) { μ^z_d(|s_t|²)² + σ^z_d(|s_t|²) − log σ^z_d(|s_t|²) }
      + Σ_{f,t} E_q[ − log σ^s_f(z_t) − |s_{ft}|² / σ^s_f(z_t) ] + const.   (20)

The DNNs for σ^s_f, μ^z_d, and σ^z_d are optimized by using SGD so that this variational lower bound is maximized.
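The claimed equivalence between maximizing the exponential likelihood of Eq. (16) and minimizing the Itakura-Saito divergence can be verified numerically. The following sketch (our code and names) shows that the two objectives differ only by a term that does not depend on the model variance, so they share the same minimizer:

```python
import numpy as np

def exp_nll(p, lam):
    # Negative log-likelihood of Exp(lam) with mean parameter lam,
    # evaluated at an observed power p: log(lam) + p / lam.
    return np.log(lam) + p / lam

def is_div(p, lam):
    # Itakura-Saito divergence d_IS(p, lam) = p/lam - log(p/lam) - 1.
    return p / lam - np.log(p / lam) - 1.0

p = 2.5                                  # an observed power |x_ft|^2
lams = np.linspace(0.1, 10.0, 1000)      # candidate model variances lambda
# The gap equals log(p) + 1 for every lambda, so both curves are
# minimized at the same lambda (namely lambda = p).
gap = exp_nll(p, lams) - is_div(p, lams)
```

This is why the exponential likelihood view connects VAE-NMF to IS-NMF-style source separation [12, 26].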
3.4. Bayesian inference of VAE-NMF

To enhance the speech signal in a noisy observed signal, we calculate the full posterior distribution of our model, p(W, H, Z | X). Since the true posterior is analytically intractable, we approximate it with a finite number of random samples drawn with a Markov chain Monte Carlo (MCMC) algorithm [27]. The MCMC alternately and iteratively samples each of the latent variables (W, H, and Z) according to its conditional posterior distribution.

By fixing the speech parameter Z, the conditional posterior distributions p(W | X, H, Z) and p(H | X, W, Z) can be derived with a variational approximation [26, 27] as follows:

  w_{fk} | X, H, Z ~ GIG( a_0, b_0 + Σ_t h_{kt}/λ_{ft}, Σ_t |x_{ft}|² φ²_{ftk}/h_{kt} ),   (21)
  h_{kt} | X, W, Z ~ GIG( a_1, b_1 + Σ_f w_{fk}/λ_{ft}, Σ_f |x_{ft}|² φ²_{ftk}/w_{fk} ),   (22)
  λ_{ft} = Σ_k w_{fk} h_{kt} + σ^s_f(z_t),  φ_{ftk} = w_{fk} h_{kt} / λ_{ft},   (23)

where GIG(γ, ρ, τ) ∝ x^{γ−1} exp(−ρx − τ/x) is the generalized inverse Gaussian distribution, and λ_{ft} and φ_{ftk} are auxiliary variables.

The latent variable of speech Z is updated by using a Metropolis method [27] because it is hard to analytically derive the conditional posterior p(Z | X, W, H). The latent variable is sampled at each time frame by using the following Gaussian proposal distribution q(z*_t | z_t), whose mean is the previous sample z_t:

  z*_t ~ q(z*_t | z_t) = N(z_t, σ I),   (24)

where σ is the variance parameter of the proposal distribution. The candidate z*_t is accepted with the following probability:

  a(z*_t | z_t) = min{ 1, [ p(x_t | W, H, z*_t) p(z*_t) ] / [ p(x_t | W, H, z_t) p(z_t) ] }.   (25)

3.5. Reconstruction of the complex speech spectrogram

In this paper we obtain the enhanced speech by Wiener filtering, i.e., by maximizing the conditional posterior p(S | X, W, H, Z).
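The Metropolis update of Eqs. (24)-(25) and the subsequent Wiener gain can be sketched in a toy NumPy illustration. Everything here is a stand-in of ours: the decoder sigma_s is a made-up positive function replacing the trained VAE decoder, and the noise PSD (the W H term) is frozen to a constant:

```python
import numpy as np

rng = np.random.default_rng(1)
F, D, sigma = 8, 3, 0.01                 # bins, latent dim, proposal variance
A = rng.standard_normal((F, D)) * 0.1    # parameters of the toy decoder

def sigma_s(z):
    # Toy stand-in for the VAE decoder: a positive variance per bin.
    return np.exp(A @ z)

def log_joint(z, power, noise_psd):
    # log p(x_t | W, H, z) + log p(z): exponential likelihood of Eq. (16)
    # with variance lambda = W h + sigma_s(z), plus the prior of Eq. (10).
    lam = noise_psd + sigma_s(z)
    loglik = np.sum(-np.log(lam) - power / lam)
    logprior = -0.5 * np.sum(z ** 2)
    return loglik + logprior

power = rng.exponential(1.0, size=F)     # observed |x_ft|^2 at one frame
noise_psd = np.full(F, 0.5)              # frozen W @ H column for this sketch
z = np.zeros(D)
for _ in range(200):
    # Gaussian random-walk proposal, Eq. (24).
    z_prop = z + np.sqrt(sigma) * rng.standard_normal(D)
    # Accept with probability min(1, posterior ratio), Eq. (25), in log domain.
    log_a = log_joint(z_prop, power, noise_psd) - log_joint(z, power, noise_psd)
    if np.log(rng.uniform()) < log_a:
        z = z_prop

# Wiener gain per bin: speech variance over total variance.
wiener_gain = sigma_s(z) / (noise_psd + sigma_s(z))
```

In the actual method this update is interleaved with the GIG draws of W and H, and the gain is applied to the complex noisy spectrogram.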
Let Ŝ ∈ C^{F×T} be the speech spectrogram that maximizes the conditional posterior. It is given by:

  ŝ_{ft} = [ σ^s_f(z_t) / ( Σ_k w_{fk} h_{kt} + σ^s_f(z_t) ) ] x_{ft}.   (26)

We simply use the mean values of the sampled latent variables as W, H, and Z in Eq. (26).

Fig. 3. Configuration of the VAE used in Sec. 4: (a) encoder q(z_t | s_t) and (b) decoder p(s_t | z_t). Both consist of five 512-unit hidden layers with ReLU activations. The encoder maps the F = 513 power spectrum to the mean (linear output) and variance (softplus output) of the D = 10 latent variable; the decoder maps z_t to the F = 513 variance σ^s(z_t) (softplus output).

4. EXPERIMENTAL EVALUATION

This section reports experimental results on noisy speech signals whose noise components were captured in real environments.

4.1. Experimental settings

To compare VAE-NMF with a DNN-based supervised method, we used the CHiME-3 dataset [28] and the DEMAND noise database¹. The CHiME-3 dataset was used for both training and evaluation. The DEMAND database was used for constructing another evaluation dataset for unseen noise conditions. The evaluation with CHiME-3 was conducted by using its development set, which consists of 410 simulated noisy utterances in each of four different noisy environments: on a bus (BUS), in a cafe (CAF), in a pedestrian area (PED), and at a street junction (STR). The average signal-to-noise ratio (SNR) of the noisy speech signals was 5.8 dB. The evaluation with DEMAND was conducted by using 20 simulated noisy speech signals in each of four different noisy environments: in a subway (SUB), in a cafe (CAF), at a town square (SQU), and in a living room (LIV). We generated these signals by mixing the clean speech signals of the CHiME-3 development set with the noise signals in the DEMAND database.
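Constructing such simulated mixtures at a prescribed SNR can be sketched as follows. This is our illustrative code, not the authors' tooling, and the helper name is hypothetical:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    # Scale the noise so that the speech-to-noise power ratio of the
    # mixture equals the target SNR in dB, then add it to the speech.
    ps = np.mean(speech ** 2)
    pn = np.mean(noise ** 2)
    scale = np.sqrt(ps / (pn * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(2)
speech = rng.standard_normal(16000)   # 1 s of audio at 16 kHz (placeholder)
noise = rng.standard_normal(16000)
noisy = mix_at_snr(speech, noise, 5.0)
```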
The SNR of these noisy speech signals was set to 5.0 dB. The sampling rate of all signals was 16 kHz. The enhancement performance was evaluated with the signal-to-distortion ratio (SDR) [29].

To obtain the prior distribution of speech signals p(s_t | z_t), we trained a VAE that had the two networks p(s_t | z_t) and q(z_t | s_t) shown in Fig. 3. The dimension of the latent variable D was set to 10. The training data were about 15 hours of clean speech signals in the WSJ-0 corpus [30]. Their spectrograms were obtained with a short-time Fourier transform (STFT) with a window length of 1024 samples and a shift interval of 256 samples. To make the prior distribution robust against the scale of the speech power, we randomly changed the average power of each spectrogram between 0.0 and 10.0 at each parameter update.

The parameters for VAE-NMF were as follows. The number of bases K was set to 5. The hyperparameters a_0, b_0, a_1, b_1, and σ were set to 1.0, 1.0, 1.0, K/scale, and 0.01, respectively, where scale is the empirical average power of the input noisy spectrogram. After drawing 100 samples for burn-in, we drew 50 samples to estimate the latent variables. These parameters were determined empirically. The noise variables W and H were randomly initialized. Since the latent variable of speech Z depends on the initial state, its initial sample was drawn from q(z_t | s_t) by feeding the observation x_t in place of the speech signal s_t.

We compared VAE-NMF with a DNN-based supervised method and the unsupervised RPCA. We implemented a DNN that outputs IRMs (DNN-IRM). It had five hidden layers with ReLU activation functions. It takes as input 11 frames of noisy 100-channel log-Mel-scale filterbank features and predicts one frame of IRMs². We trained DNN-IRM with the training dataset of CHiME-3, which was generated by using the WSJ-0 speech utterances and noise signals. The noise signals were recorded in the same environments as those in the evaluation data.

¹http://parole.loria.fr/DEMAND/
²SDRs were evaluated by dropping 2048 samples (5 frames) at both ends.

Table 1. Enhancement performance in SDR [dB] for the CHiME-3 dataset
  Method  | Average | BUS  | CAF   | PED   | STR
  VAE-NMF | 10.10   | 9.47 | 10.62 | 10.93 | 9.39
  DNN-IRM | 10.93   | 8.92 | 11.92 | 12.92 | 9.95
  RPCA    | 7.53    | 6.13 | 8.10  | 9.13  | 6.77
  Input   | 6.02    | 3.26 | 7.21  | 8.83  | 4.78

Table 2. Enhancement performance in SDR [dB] for the DEMAND dataset
  Method  | Average | SUB   | CAF  | SQU   | LIV
  VAE-NMF | 11.17   | 10.56 | 9.57 | 12.38 | 12.16
  DNN-IRM | 9.85    | 9.13  | 9.15 | 10.69 | 10.42
  RPCA    | 7.03    | 6.48  | 6.37 | 6.99  | 8.28
  Input   | 5.21    | 5.25  | 5.24 | 5.19  | 5.16

4.2. Experimental results

The enhancement performance is shown in Tables 1 and 2. In the experiments using the CHiME-3 test set (Table 1), DNN-IRM, which was trained using noisy data recorded in the same environments as the test data, yielded the highest average SDR. The proposed VAE-NMF achieved higher SDRs than RPCA in all conditions and even outperformed the supervised DNN-IRM in the BUS condition without any prior training on noise signals. From the results obtained on the test set constructed with the DEMAND noise data (Table 2), we can see that VAE-NMF outperformed the other methods in all conditions. The noise data in DEMAND were unknown to DNN-IRM trained on the CHiME-3 training set, and its enhancement performance deteriorated significantly. These results clearly show the robustness of the proposed VAE-NMF against various noise conditions.

The SDR of VAE-NMF for the CAF condition in the DEMAND test set was lower than those for the other conditions. In this condition, the background noise contained conversational speech. Since VAE-NMF estimates the speech component independently at each time frame, the background conversations were enhanced at the time frames where the power of the target speech was relatively small.
This problem could be solved by making the VAE-based speech model maintain the time dependencies of a speech signal. The variational recurrent neural network [31] would be useful for this extension.

5. CONCLUSION

We presented a semi-supervised speech enhancement method, called VAE-NMF, that involves a probabilistic generative model of speech based on a VAE and one of noise based on NMF. Only the speech model is trained in advance, using a sufficient amount of clean speech. Using the speech model as a prior distribution, we can obtain posterior estimates of clean speech with an MCMC sampler while adapting the noise model to noisy environments. We experimentally confirmed that VAE-NMF outperformed a conventional supervised DNN-based method in unseen noisy environments.

One interesting future direction is to extend VAE-NMF to the multichannel scenario. Since complicated speech signals and a spatial mixing process can be represented by a VAE and a well-studied phase-aware linear model (e.g., [2, 3, 32]), respectively, it would be effective to integrate these models in a unified probabilistic framework. We will also investigate GAN-based training of the speech model to accurately learn the probability distribution of speech.

6. REFERENCES

[1] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Speech enhancement based on deep denoising autoencoder," in Interspeech, 2013, pp. 436-440.
[2] J. Heymann, L. Drude, and R. Haeb-Umbach, "Neural network based spectral mask estimation for acoustic beamforming," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 196-200.
[3] A. A. Nugraha, A. Liutkus, and E. Vincent, "Multichannel audio source separation with deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 9, pp. 1652-1664, 2016.
[4] A. Narayanan and D. Wang, "Ideal ratio mask estimation using deep neural networks for robust speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 7092-7096.
[5] S. Pascual, A. Bonafonte, and J. Serrà, "SEGAN: Speech enhancement generative adversarial network," in Interspeech, 2017, pp. 3642-3646.
[6] Z.-Q. Wang and D. Wang, "Recurrent deep stacking networks for supervised speech separation," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, pp. 71-75.
[7] Y. Ephraim and D. Malah, "Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109-1121, 1984.
[8] P. C. Loizou, Speech Enhancement: Theory and Practice. CRC Press, 2013.
[9] S. Mohammed and I. Tashev, "A statistical approach to semi-supervised speech enhancement with low-order non-negative matrix factorization," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, pp. 546-550.
[10] P. Smaragdis, B. Raj, and M. Shashanka, "Supervised and semi-supervised separation of sounds from single-channel mixtures," in Independent Component Analysis and Signal Separation, 2007, pp. 414-421.
[11] M. Sun, Y. Li, J. F. Gemmeke, and X. Zhang, "Speech enhancement under low SNR conditions via noise estimation using sparse and low-rank NMF with Kullback-Leibler divergence," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 7, pp. 1233-1242, 2015.
[12] C. Févotte, N. Bertin, and J. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis," Neural Computation, vol. 21, no. 3, pp. 793-830, 2009.
[13] C. Sun, Q. Zhang, J. Wang, and J. Xie, "Noise reduction based on robust principal component analysis," Journal of Computational Information Systems, vol. 10, no. 10, pp. 4403-4410, 2014.
[14] P. S. Huang, S. D. Chen, P. Smaragdis, and M. Hasegawa-Johnson, "Singing-voice separation from monaural recordings using robust principal component analysis," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2012, pp. 57-60.
[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672-2680.
[16] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.
[17] O. Fabius and J. R. van Amersfoort, "Variational recurrent auto-encoders," arXiv preprint arXiv:1412.6581, 2014.
[18] W.-N. Hsu, Y. Zhang, and J. Glass, "Learning latent representations for speech generation and transformation," in Interspeech, 2017, pp. 1273-1277.
[19] M. Blaauw and J. Bonada, "Modeling and transforming speech using variational autoencoders," in Interspeech, 2016, pp. 1770-1774.
[20] T. G. Kang, K. Kwon, J. W. Shin, and N. S. Kim, "NMF-based target source separation using deep neural network," IEEE Signal Processing Letters, vol. 22, no. 2, pp. 229-233, 2015.
[21] T. T. Vu, B. Bigot, and E. S. Chng, "Combining non-negative matrix factorization and deep neural networks for speech enhancement and automatic speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 499-503.
[22] K. Qian, Y. Zhang, S. Chang, X. Yang, D. Florêncio, and M. Hasegawa-Johnson, "Speech enhancement using Bayesian WaveNet," in Interspeech, 2017, pp. 2013-2017.
[23] E. M. Grais, M. U. Sen, and H. Erdogan, "Deep neural networks for single channel source separation," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2014, pp. 3734-3738.
[24] M. Sun, X. Zhang, and T. F. Zheng, "Unseen noise estimation using separable deep auto encoder for speech enhancement," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 1, pp. 93-104, 2016.
[25] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[26] A. T. Cemgil, "Bayesian inference for nonnegative matrix factorisation models," Computational Intelligence and Neuroscience, vol. 2009, no. 785152, pp. 1-17, 2009.
[27] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[28] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, "The third CHiME speech separation and recognition challenge: Dataset, task and baselines," in IEEE Workshop on Automatic Speech Recognition and Understanding, 2015, pp. 504-511.
[29] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462-1469, 2006.
[30] J. Garofalo, D. Graff, D. Paul, and D. Pallett, "CSR-I (WSJ0) complete," Linguistic Data Consortium, Philadelphia, 2007.
[31] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio, "A recurrent latent variable model for sequential data," in Advances in Neural Information Processing Systems, 2015, pp. 2980-2988.
[32] A. Ozerov and C. Févotte, "Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 550-563, 2010.