Deep Griffin-Lim Iteration

This paper presents a novel phase reconstruction method (only from a given amplitude spectrogram) by combining a signal-processing-based approach and a deep neural network (DNN). To retrieve a time-domain signal from its amplitude spectrogram, the corresponding phase is required.

Authors: Yoshiki Masuyama, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, Noboru Harada

This paper has been accepted to the 44th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2019). © 2019 IEEE

DEEP GRIFFIN–LIM ITERATION

Yoshiki Masuyama†, Kohei Yatabe†, Yuma Koizumi‡, Yasuhiro Oikawa†, Noboru Harada‡
† Department of Intermedia Art and Science, Waseda University, Tokyo, Japan
‡ NTT Media Intelligence Laboratories, Tokyo, Japan

ABSTRACT

This paper presents a novel phase reconstruction method (only from a given amplitude spectrogram) by combining a signal-processing-based approach and a deep neural network (DNN). To retrieve a time-domain signal from its amplitude spectrogram, the corresponding phase is required. One of the popular phase reconstruction methods is the Griffin–Lim algorithm (GLA), which is based on the redundancy of the short-time Fourier transform. However, GLA often involves many iterations and produces low-quality signals owing to the lack of prior knowledge of the target signal. In order to address these issues, in this study, we propose an architecture which stacks a sub-block including two GLA-inspired fixed layers and a DNN. The number of stacked sub-blocks is adjustable, and we can trade the performance and computational load based on requirements of applications. The effectiveness of the proposed method is investigated by reconstructing phases from amplitude spectrograms of speeches.

Index Terms — Phase reconstruction, spectrogram consistency, deep neural network, residual learning.

1. INTRODUCTION

In recent years, phase reconstruction has gained much attention in the signal processing community [1, 2]. Many ordinary speech processing methods defined in the time-frequency domain have considered only amplitude spectrograms and utilized the phase of the observed signal without modifying it.
Meanwhile, recent studies have proven that phase reconstruction can improve the quality of the reconstructed signal [3], and thus several methods have been proposed for that purpose [4, 5, 6, 7, 8]. Phase reconstruction solely from an amplitude spectrogram has also received increasing attention along with the development of short-time Fourier transform (STFT)-based speech synthesis [9, 10], which generates an amplitude spectrogram and requires phase reconstruction for generating a time-domain signal. This paper focuses on such a situation where only an amplitude spectrogram is available for reconstructing the phase.

When only an amplitude spectrogram is available and no explicit information is given for the phase, such as in STFT-based speech synthesis, the Griffin–Lim algorithm (GLA) is one of the popular methods for phase reconstruction [11]. GLA promotes the consistency of a spectrogram by iterating two projections (see Section 2.1), where a spectrogram is said to be consistent when its inter-bin dependency owing to the redundancy of STFT is retained [12]. GLA is based only on the consistency and does not take any prior knowledge about the target signal into account. Consequently, GLA often requires many iterations and results in low-quality signals.

Fig. 1. A block diagram of the proposed architecture for reconstructing phase from a given amplitude spectrogram (top), which stacks a common sub-block (bottom). The sub-block consists of two fixed GLA-inspired layers (red, blue) and a trainable DNN (green).

For incorporating prior knowledge of target signals into phase reconstruction, deep neural networks (DNNs) have been applied recently [13, 14, 15, 16]. There exist a number of approaches to reconstruct phase using DNNs. One approach is to treat it as a classification problem by discretizing the candidates of phase [13, 14],
which is effectively utilized in speech separation. Other approaches handle phase as a continuous periodic variable [15] or treat a complex-valued spectrogram [16]. While these DNN-based phase reconstruction methods have obtained successful results, the number of layers is determined when they are trained. That is, their performance and computational load are fixed at training time. It should be beneficial if one can easily trade the performance and computational load at the time of inference depending on requirements of applications.

In this study, we propose a phase reconstruction method which incorporates a DNN into GLA. The proposed method stacks a common sub-block motivated by the iterative procedure of GLA, which constructs a deep architecture, named deep Griffin–Lim iteration (DeGLI), as illustrated in Fig. 1. In the proposed architecture, the number of total layers corresponds to the number of stackings, and its depth can be adjusted afterward based on the allowable computational load in applications. Its training procedure is also proposed to effectively train the DNN within the sub-block. Our main contributions are twofold: (1) proposing a deep architecture whose sub-block contains the fixed GLA-inspired layers which enable reduction of the amount of trainable parameters (Section 3.1); and (2) proposing its training procedure which instructs the sub-block to be a denoiser, instead of requiring it to reconstruct the phase (Section 3.2). Thanks to this training procedure, the difficulty of training a DNN in phase reconstruction arising from the periodic nature of phase is circumvented. To evaluate the effectiveness of the proposed method, the quality of the signal reconstructed by GLA and by the proposed method is compared.

2. RELATED WORKS

2.1. Griffin–Lim Algorithm (GLA)

GLA is a popular phase recovery algorithm based on the consistency of a spectrogram [11].
This algorithm aims to recover a complex-valued spectrogram, which is consistent and maintains the given amplitude A, by the following alternating projection procedure:

  X^[m+1] = P_C( P_A( X^[m] ) ),   (1)

where X is a complex-valued spectrogram updated through the iteration, P_S is the metric projection onto a set S, and m is the iteration index. Here, C is the set of consistent spectrograms, and A is the set of spectrograms whose amplitude is the same as the given one. The metric projections onto these sets C and A are given by

  P_C(X) = G G† X,   (2)
  P_A(X) = A ⊙ X ⊘ |X|,   (3)

where G represents the STFT, G† is the pseudo inverse of the STFT (iSTFT), ⊙ and ⊘ are element-wise multiplication and division, respectively, and division by zero is replaced by zero. GLA is obtained as an algorithm for the following optimization problem [12]:

  min_X ‖ X − P_C(X) ‖²_Fro   s.t.   X ∈ A,   (4)

where ‖·‖_Fro is the Frobenius norm. This equation minimizes the energy of the inconsistent components under the constraint that the amplitude must be equal to the given one. Although GLA has been widely utilized because of its simplicity, GLA often involves many iterations until it converges to a certain spectrogram and results in low reconstruction quality. This is because the cost function in Eq. (4) only requires the consistency, and the characteristics of the target signal are not taken into account. Introducing prior knowledge of the target signal into the algorithm can improve the quality of reconstructed signals as discussed in [17, 7].

2.2. DNN-based phase reconstruction with fixed STFT layers

Recently, DNNs including fixed STFT (and iSTFT) layers were considered for treating phase information within the networks.
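The alternating projections of Section 2.1 can be sketched in a few lines. The following is a minimal illustration using SciPy's `stft`/`istft` as G and G† with the window settings from Section 4.1 (64 ms Hann window, 32 ms shift, 16 kHz); note the paper's own implementation uses TensorFlow, so treat this as an assumption-laden sketch rather than the authors' code.

```python
# A minimal sketch of GLA (Eqs. (1)-(3)) using SciPy's stft/istft as G and G-dagger.
# Window settings follow Section 4.1 (64 ms Hann, 32 ms shift, 16 kHz); details
# of the paper's TensorFlow implementation may differ.
import numpy as np
from scipy.signal import stft, istft

FS, NPERSEG, NOVERLAP = 16000, 1024, 512  # 64 ms window, 32 ms hop at 16 kHz

def P_A(X, A):
    """Eq. (3): replace the amplitude of X by the given A; division by zero -> 0."""
    mag = np.abs(X)
    return A * np.divide(X, mag, out=np.zeros_like(X), where=mag > 0)

def P_C(X):
    """Eq. (2): project onto the consistent set by iSTFT followed by STFT."""
    _, x = istft(X, fs=FS, window="hann", nperseg=NPERSEG, noverlap=NOVERLAP)
    _, _, Z = stft(x, fs=FS, window="hann", nperseg=NPERSEG, noverlap=NOVERLAP)
    return Z

def griffin_lim(A, n_iter=100):
    """Eq. (1): iterate X <- P_C(P_A(X)) from the zero-phase spectrogram X = A."""
    X = A.astype(complex)
    for _ in range(n_iter):
        X = P_C(P_A(X, A))
    _, x = istft(P_A(X, A), fs=FS, window="hann",
                 nperseg=NPERSEG, noverlap=NOVERLAP)
    return x
```

The zero-phase initialization mirrors the experiment in Section 4.1, where the amplitude spectrogram is directly inputted as the initial value.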
A generative adversarial network (GAN)-based approach to reconstruct a complex-valued spectrogram solely from a given amplitude spectrogram was presented in [16]. The output of the generator (a complex-valued spectrogram) is converted back to the time domain by an iSTFT layer and inputted to the discriminator, where this iSTFT layer is essential for its training as discussed in [16]. As another example, a DNN for speech separation [18] employed the multiple input spectrogram inverse (MISI) layer, which consists of the pair of STFT and iSTFT as in GLA. The MISI layer is applied to the output of the DNN for speech separation to improve its performance by considering the effect of the phase reconstruction together with the separation. In addition, in [19], the time-frequency representation was also trained with the DNN for speech separation. The success of these DNNs indicates that considering STFT (and iSTFT) together with a DNN is important for treating phase.

The common strategy for these DNNs is that fixed STFT-related layers are placed after a rich DNN. Their loss functions are evaluated after going through such STFT-related layers, and their effect is propagated for updating the parameters of the DNNs. Based on this observation, loss functions tied with STFT (and iSTFT) seem important in phase reconstruction because such loss functions are related to the concept of the consistency. At the same time, fixed STFT-related layers have several benefits for training. Since they do not contain trainable parameters, adding STFT-related layers does not increase the number of trainable parameters, while they capture the structure of complex-valued spectrograms efficiently. Therefore, use of the STFT-related layers within DNNs may be recommended for treating phase information. However, there has been little research on such DNNs containing STFT within the network.

3. PROPOSED DEEP ARCHITECTURE

Based on the above discussions, we propose an architecture for phase reconstruction, named deep Griffin–Lim iteration (DeGLI), which is a unification of GLA and a DNN. As illustrated in Fig. 1, the proposed architecture consists of a common sub-block, and it is stacked to form the whole deep architecture based on the iterative procedure of GLA. The architecture of DeGLI is introduced in Section 3.1, while its training procedure is described in Section 3.2.

3.1. Deep Griffin–Lim Iteration (DeGLI)

One interesting trend of research in deep learning is to interpret an optimization algorithm as a recurrent neural network (RNN) and construct a DNN architecture following that [20, 21, 22]. The DNN introduced in the previous section [18, 19] was also obtained by a similar approach called deep unfolding [23, 24]. In this context, the iterative procedure of GLA in Eq. (1) is interpreted as an RNN which stacks the fixed linear layer P_C and the target-dependent nonlinear layer P_A. By looking closely at Eq. (1), it can be seen that the complex-valued spectrogram at the m-th iteration X^[m] is inputted into the nonlinear layer P_A, and then its output passes through the fixed linear layer P_C consisting of the STFT G and iSTFT G† as in Eq. (2). That is, GLA is a parameter-fixed RNN consisting of STFT and iSTFT layers within the network. Inspired by the above observations, the proposed deep architecture for phase reconstruction, or DeGLI, is defined through a sub-block based on GLA. Let us consider the intermediate representations of GLA,

  Y^[m] = P_A( X^[m] ),   (5)
  Z^[m] = P_C( Y^[m] ),   (6)

where the combination of these equations recovers Eq. (1). Since Y^[m] is the amplitude-replaced version of X^[m], their difference indicates the amount of mismatch between the amplitude of the current spectrogram |X^[m]| and the desired amplitude A.
Similarly, since Z^[m] is the closest consistent spectrogram to Y^[m] (in the Euclidean sense), the difference between them indicates the amount of inconsistent components [12]. Such differences should be quite informative for phase reconstruction because the aim of GLA is to reduce them as much as possible. However, such intermediate information is not considered in the original GLA in Eq. (1). To effectively use this intermediate information in a learning scheme, we propose DeGLI as the following architecture:

  X^[m+1] = B( X^[m] )   (7)
          = Z^[m] − F( X^[m], Y^[m], Z^[m] ),   (8)

where B is the proposed DeGLI-block inspired by GLA as in Fig. 1, and F is a DNN. The whole architecture can also be viewed as an RNN or a feed-forward network in which the weights are shared. By stacking M DeGLI-blocks (which is equivalent to iterating Eq. (7) M times), the whole DeGLI architecture becomes M times deeper without increasing the number of trainable parameters. That is, the total depth of the DeGLI architecture can be adjusted afterward, which enables one to easily trade its performance and computational load for adapting to the allowable computational time of various applications. Note that, as a specific case, DeGLI reduces to the ordinary GLA when F(X^[m], Y^[m], Z^[m]) = O, where O is the zero matrix. A variant of GLA in [25] can also be obtained by setting F(X^[m], Y^[m], Z^[m]) = γ(X^[m] − Z^[m]) (0 < γ < 1), which indicates that DeGLI is a general architecture including several GLA-type algorithms as special cases.

Fig. 2. The block diagram for training the sub-block.
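One DeGLI block (Eqs. (5)-(8)) can be sketched as follows. The projections and the DNN F are toy stand-ins here (an identity consistency projection and F ≡ 0), chosen only so that the reduction to a plain GLA step noted above can be checked directly; they are not the paper's actual layers.

```python
# A minimal sketch of one DeGLI block B (Eqs. (5)-(8)). With F = O the block
# reduces to a single GLA iteration, as stated in the text.
import numpy as np

def degli_block(X, A, P_A, P_C, F):
    """One DeGLI block: compute the GLA intermediates Y and Z, then subtract the
    DNN's residual estimate from the consistent spectrogram Z (Eq. (8))."""
    Y = P_A(X, A)          # Eq. (5): replace the amplitude by the target A
    Z = P_C(Y)             # Eq. (6): project onto consistent spectrograms
    return Z - F(X, Y, Z)  # Eq. (8): residual correction by the DNN

# Toy projections for demonstration (NOT real STFT-based projections):
P_A = lambda X, A: A * np.exp(1j * np.angle(X))   # amplitude replacement
P_C = lambda Y: Y                                 # identity stands in for GG†
F_zero = lambda X, Y, Z: np.zeros_like(Z)         # F = O  =>  plain GLA step

X = np.array([[1 + 1j, 2j], [3.0, 1 - 1j]])
A = np.abs(X) * 0.5
out = degli_block(X, A, P_A, P_C, F_zero)
assert np.allclose(out, P_C(P_A(X, A)))  # with F = O, DeGLI equals GLA (Eq. (1))
```

The check at the end verifies the "specific case" claim: a zero DNN output leaves exactly the GLA update of Eq. (1).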
One of the key points of the DeGLI architecture is that Z^[m] (= P_C(P_A(X^[m]))) is the output of the GLA-inspired layers at the m-th iteration, and the proposed DeGLI-block B is defined as the subtraction of the DNN output F(X^[m], Y^[m], Z^[m]) from the output of the GLA-inspired layers Z^[m]. Defining the DeGLI-block in this way is based on two reasons: (1) the differences between the intermediate representations indicate the residual to the desired ones as discussed in the above paragraph; and (2) it is known that treating a residual is easier than directly estimating the target, according to the literature on residual learning [26, 27].

3.2. Training procedure for DeGLI-block

Since the proposed DeGLI architecture can be interpreted as a large DNN, a simple strategy for training the DeGLI-block is directly minimizing the loss of phase reconstruction measured at the output:

  min_θ D( G_θ(A), P_C( G_θ(A) ) ),   (9)

where G(·) = P_A(B(⋯B(B(·)))) represents the whole DeGLI architecture, θ represents all trainable parameters in G (i.e., the parameters in F), D(·,·) is a measure of mismatch such as a norm of difference, and the minimization is considered for all A. This problem is related to the optimization problem for GLA in Eq. (4) when D is the squared Frobenius norm of the difference of the variables (note that, since P_A is applied at the last stage of G, the constraint in Eq. (4) is always satisfied). Although the above training strategy is straightforward, the number of blocks should be defined in advance for applying it. In addition, it did not work well in our preliminary experiments.

In order to tackle this issue, we train the DeGLI-block B to be a denoiser by the training procedure illustrated in Fig. 2. Let X* be a complex-valued spectrogram of a target signal, and X̃ = X* + N be its noisy counterpart degraded by complex-valued noise N.
Then, the DeGLI-block B is trained so that B(X̃) ≈ X*, i.e.,

  B(X̃) = Z̃ − F(X̃, Ỹ, Z̃) ≈ X*,   (10)

based on the definition in Eq. (8). Since Z̃ is obtained only from the fixed layers P_A and P_C, the optimization problem for training the DNN F is given by

  min_θ D( Z̃ − X*, F_θ(X̃, Ỹ, Z̃) ).   (11)

In such denoising, the DNN F estimates the residual components Z̃ − X* which should not be contained in the GLA output Z̃ = P_C(P_A(X̃)). To be specific, the DNN takes the mismatch to the consistency and amplitude into account by inputting Ỹ and Z̃, and it implicitly eliminates the latent target signal (such as clean speech) through the hidden layers in F. This training strategy is closely related to the residual learning strategy. It has been shown that a denoising sub-block with the residual learning strategy is robust to the type and level of noise, and it can be applied to a variety of tasks as discussed in [27, 28]. The idea of applying a denoising DNN for general tasks can also be found in [29, 30].

Fig. 3. The illustration of the DNN used in the experiment. It maps the real and imaginary parts of three complex-valued spectrograms (X, Y, and Z) to those of the residual. Here, "Conv" indicates a convolutional layer with zero padding for keeping the input size, where k, s, and c are the kernel size, stride size, and the number of channels, respectively. "GLU" represents the gated linear unit.

Note that, after passing through the fixed nonlinear layer P_A, the amplitude of the complex-valued spectrogram is always replaced by the desired one. That is, the difference between Ỹ and the target X* is only phase, and thus denoising of Ỹ (= P_A(X̃)) corresponds to phase reconstruction. It can be expected that the denoising sub-block including GLA-inspired layers also works well in phase reconstruction.
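The construction of a single denoiser training example (Eqs. (10)-(11)) can be sketched as below. The helper `training_example` is hypothetical, the consistency projection is replaced by an identity placeholder, and the ℓ1 choice of D follows Section 4.1; none of this reproduces the authors' training code.

```python
# A sketch of the denoiser training target (Eqs. (10)-(11)): corrupt the clean
# spectrogram X* with complex Gaussian noise, form the GLA intermediates, and
# regress the DNN onto the residual Z~ - X* with an l1 loss (D in Section 4.1).
# P_C is an identity placeholder here, not a real GG-dagger projection.
import numpy as np

def training_example(X_star, snr_db, rng):
    """Return the noisy input X~, intermediates Y~ and Z~, and the regression
    target Z~ - X* for one training example (hypothetical helper)."""
    sig_pow = np.mean(np.abs(X_star) ** 2)
    noise_pow = sig_pow / (10 ** (snr_db / 10))
    N = np.sqrt(noise_pow / 2) * (rng.standard_normal(X_star.shape)
                                  + 1j * rng.standard_normal(X_star.shape))
    X_tilde = X_star + N
    A = np.abs(X_star)                            # clean amplitude as the target A
    Y_tilde = A * np.exp(1j * np.angle(X_tilde))  # P_A applied to the noisy input
    Z_tilde = Y_tilde                             # placeholder for P_C = GG-dagger
    target = Z_tilde - X_star                     # residual the DNN should output
    return X_tilde, Y_tilde, Z_tilde, target

def l1_loss(estimate, target):
    """D in Eq. (11): l1 norm of the (complex) estimation error."""
    return np.sum(np.abs(estimate - target))
```

The SNR argument mirrors Section 4.1, where it is drawn uniformly from −6 to 0 dB during training.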
In any case, the trained DNN F (and thus B) only affects the phase of the final output because the amplitude is always set to the given one by P_A after the last DeGLI-block.

4. EXPERIMENT

In order to validate the effectiveness of DeGLI, the quality of reconstructed speeches was evaluated by objective measures. The proposed method was compared with GLA as a baseline method.

4.1. Experimental settings

The DNN F used in the DeGLI-block B for the experiment is illustrated in Fig. 3. The 2-D convolutional layers (Conv) and the gated linear units (GLU) [31] are stacked with skip connections. In the Conv layers, the complex-valued spectrograms are treated as images, where the real and imaginary parts are concatenated along the channel direction. Note that the input of the DNN is three complex-valued spectrograms as in Figs. 1 and 2, which results in six channels as each of the three consists of the real and imaginary parts.

As the training dataset for denoising, the Wall Street Journal (WSJ-0) corpus recorded at the sampling rate of 16 kHz was utilized. 14250 speech files were randomly selected from the database to form a training set, and the rest of the data was used as a validation set. During the mini-batch training, the utterances were divided into about 2-second-long segments (32768 samples), and the Adam optimizer was utilized as the optimization solver. The network was trained for 50 epochs with a learning rate control, where the learning rate was decayed by multiplying 10^(−0.5) if the loss function on the validation set did not decrease for 2 consecutive epochs, and the initial learning rate was set to 10^(−3). As the noise utilized for training in the time-frequency domain (described in Section 3.2), complex Gaussian noise was added so that the signal-to-noise ratio was randomly selected from −6 to 0 dB, and the measure of mismatch D as the loss function in Eq. (11) was set to the ℓ1-norm of the difference.
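The GLU nonlinearity used between the Conv layers of Fig. 3 can be sketched in a few lines; the channel shapes below are illustrative, not the paper's exact layer sizes.

```python
# A minimal numpy sketch of the gated linear unit (GLU) from Fig. 3: the channel
# dimension is split in half, and one half gates the other through a sigmoid.
# Channel counts here are illustrative only.
import numpy as np

def glu(x, axis=0):
    """GLU(x) = a * sigmoid(b), where (a, b) is x split in half along `axis`."""
    a, b = np.split(x, 2, axis=axis)
    return a * (1.0 / (1.0 + np.exp(-b)))

# Example: 8 channels in -> 4 channels out for a 4x4 feature map.
x = np.random.default_rng(0).standard_normal((8, 4, 4))
y = glu(x, axis=0)
assert y.shape == (4, 4, 4)
```

Halving the channels at each gate is why GLU-based stacks typically double the Conv output channels ahead of each unit.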
STFT was implemented with the Hann window, whose duration was 64 ms, with 32 ms shifting. As the test dataset, 500 randomly selected utterances from the TIMIT dataset were utilized for obtaining amplitude spectrograms for phase reconstruction, where the initial phases were set to zero in the time-frequency domain (i.e., the amplitude spectrogram was directly inputted as the initial value).

Fig. 4. An example of the spectrograms within the proposed DeGLI-block. "GLA" represents the output of the GLA-inspired layers Z = P_C(P_A(X)), and "Original" is the clean speech signal X* to be recovered. The difference between them, Z − X*, is shown as "Residual", while its estimation F(X, Y, Z) is denoted by "DNN output". The DNN was able to accurately estimate the Residual.

4.2. Experimental results

An example of the results of the residual learning is shown in Fig. 4 for illustrating how the DNN in the proposed DeGLI-block works. As shown in the figure, the DNN appropriately estimated the residual, which is the difference between the output of the current GLA-inspired layers and the target spectrogram. Such an estimated residual is subtracted so that the difference to the ideal spectrogram is reduced. We expect that the estimation by the DNN is reasonably accurate to improve the output of the DeGLI-block.

The performance of phase reconstruction was evaluated by STOI [32] and PESQ [33]. The score per iteration averaged over the test set is shown in the upper row of Fig. 5. Both STOI and PESQ of the proposed method were always higher than those of GLA at each iteration, and the performance improved as the number of iterations increased.
Since the iteration corresponds to the depth of the whole DeGLI architecture, this result indicates that one can iterate the DeGLI-block until the quality of the reconstructed signal becomes satisfactory. Namely, one can eliminate unnecessary computation, or decide the depth based on the available computational resources at that time. We stress that this unique feature of the proposed method cannot be achieved by a single rich DNN directly mapping an inputted amplitude spectrogram into the final reconstructed signal.

Since the computational time per iteration is different between GLA and the proposed DeGLI-block, the performance was also investigated in terms of computational time for a fair comparison. In this experiment, an "Intel Core i9-7980XE (2.60 GHz)" and an "NVIDIA GeForce GTX 1080 Ti" were employed as the CPU and GPU, respectively. For both methods, STFT and iSTFT were implemented in TensorFlow. The scores per computational time are illustrated in the bottom row of Fig. 5. Since the computational time per iteration of the proposed method was about 2.4 and 9.6 times slower than GLA using GPU and CPU, respectively, the difference of the scores between the methods is closer than in the top row. Nevertheless, the proposed method notably outperformed GLA, especially in PESQ. To see the scores at some specific iterations, box plots of the scores are also shown in Fig. 6. The results were obtained from the 100th iteration for GLA and the 10th iteration for the proposed method because the computational times of these methods are roughly the same at those iteration numbers. It can be seen that the tendencies of the scores are the same as the averaged values in Fig. 5, and the effectiveness of the proposed DeGLI architecture was confirmed by a paired one-sided t-test (p < 0.01).
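The depth-at-inference property discussed above can be sketched as a loop over the shared block. The early-stopping criterion below (stalling of the update norm) is an illustrative assumption, not the paper's procedure (which fixes the depth by the computational budget), and the block is again a toy stand-in.

```python
# A sketch of depth-adjustable DeGLI inference: since every block shares its
# weights, the number of iterations M can be chosen at inference time. Here a
# stall-based stopping rule is used purely for illustration; the projection in
# the toy block is an identity stand-in for the real consistency projection.
import numpy as np

def degli_infer(A, block, X0, max_iters=100, tol=1e-6):
    """Iterate the shared DeGLI block, returning the result and the depth used."""
    X = X0
    for m in range(max_iters):
        X_next = block(X, A)
        if np.linalg.norm(X_next - X) < tol * max(np.linalg.norm(X), 1.0):
            return X_next, m + 1   # depth decided at inference time
        X = X_next
    return X, max_iters

# Toy block: a plain GLA step (F = O) with an identity consistency projection,
# which converges after one amplitude replacement.
toy_block = lambda X, A: A * np.exp(1j * np.angle(X))
A = np.array([[1.0, 2.0], [3.0, 4.0]])
X0 = np.array([[1 + 1j, -2.0], [3j, 1 - 1j]])
X, depth = degli_infer(A, toy_block, X0)
assert np.allclose(np.abs(X), A)
```

In practice the same loop lets an application spend exactly as many blocks as its latency budget allows, which is the trade-off measured in Fig. 5.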
Fig. 5. Average scores of STOI and PESQ per iteration (top) and per computational time (bottom) for GLA (blue, circles) and the proposed method (red, cross marks). The yellow dashed line indicates that the real-time factor is 1. For measuring the computational time, both methods were implemented using CPU and GPU.

Fig. 6. Box plots of the scores of STOI and PESQ, where the results of GLA were evaluated at the 100th iteration, while those of DeGLI were obtained from the output of the 10th stack. The red lines are the medians, and the boxes indicate the first and third quartiles.

In summary, it was confirmed that the proposed DeGLI architecture can be trained so that utilizing the common block for every iteration improves the performance, which should be because of the training as a denoiser and the residual learning strategy. Note that the trainable DNN used in this experiment was merely an example, and it must be possible to improve the performance by considering a DNN more suitable for phase reconstruction.

5. CONCLUSION

In this study, we proposed a deep architecture, named DeGLI, which combines a DNN with the iterative procedure of GLA. The key idea was to stack the same sub-block, so that the depth of the whole architecture can be adjusted without increasing the number of trainable parameters. This feature enables one to trade the quality of the reconstructed signal and the computational load depending on applications. The residual learning strategy was applied to train the sub-block as a denoiser, where the DNN removes the undesired components introduced by GLA.
Experimental results confirmed that a denoising sub-block is applicable to phase reconstruction, which indicates that the training task can be different from phase reconstruction itself, which is not an easy task for a DNN owing to the periodic nature of phase. Investigation of a DNN more suitable for the proposed DeGLI remains as future work.

6. REFERENCES

[1] T. Gerkmann, M. Krawczyk-Becker, and J. Le Roux, "Phase processing for single-channel speech enhancement: History and recent advances," IEEE Signal Process. Mag., vol. 32, no. 2, pp. 55–66, Mar. 2015.
[2] P. Mowlaee, R. Saeidi, and Y. Stylianou, "Advances in phase-aware signal processing in speech communication," Speech Commun., vol. 81, pp. 1–29, July 2016.
[3] K. Paliwal, K. Wójcicki, and B. Shannon, "The importance of phase in speech enhancement," Speech Commun., vol. 53, no. 4, pp. 465–494, Apr. 2011.
[4] M. Krawczyk and T. Gerkmann, "STFT phase reconstruction in voiced speech for an improved single-channel speech enhancement," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 12, pp. 1931–1940, Dec. 2014.
[5] Y. Wakabayashi, T. Fukumori, M. Nakayama, T. Nishiura, and Y. Yamashita, "Single-channel speech enhancement with phase reconstruction based on phase distortion averaging," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 26, no. 9, pp. 1559–1569, Sept. 2018.
[6] Y. Masuyama, K. Yatabe, and Y. Oikawa, "Model-based phase recovery of spectrograms via optimization on Riemannian manifolds," in Int. Workshop Acoust. Signal Enhance. (IWAENC), Sept. 2018, pp. 126–130.
[7] K. Yatabe, Y. Masuyama, and Y. Oikawa, "Rectified linear unit can assist Griffin–Lim phase recovery," in Int. Workshop Acoust. Signal Enhance. (IWAENC), Sept. 2018, pp. 555–559.
[8] Y. Masuyama, K. Yatabe, and Y. Oikawa, "Griffin–Lim like phase recovery via alternating direction method of multipliers," IEEE Signal Process. Lett., vol. 26, no. 1, pp. 184–188, Jan. 2019.
[9] S. Takaki, H. Kameoka, and J. Yamagishi, "Direct modeling of frequency spectra and waveform generation based on phase recovery for DNN-based speech synthesis," in INTERSPEECH, Aug. 2017, pp. 1128–1132.
[10] Y. Saito, S. Takamichi, and H. Saruwatari, "Text-to-speech synthesis using STFT spectra based on low-/multi-resolution generative adversarial networks," in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Apr. 2018, pp. 5299–5303.
[11] D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 2, pp. 236–243, Apr. 1984.
[12] J. Le Roux, H. Kameoka, N. Ono, and S. Sagayama, "Fast signal reconstruction from magnitude STFT spectrogram based on spectrogram consistency," in Proc. 13th Int. Conf. Digit. Audio Eff. (DAFx-10), Sept. 2010, pp. 397–403.
[13] N. Takahashi, P. Agrawal, N. Goswami, and Y. Mitsufuji, "PhaseNet: Discretized phase modeling with deep neural networks for audio source separation," in INTERSPEECH, Sept. 2018, pp. 2713–2717.
[14] J. Le Roux, G. Wichern, S. Watanabe, A. Sarroff, and J. R. Hershey, "Phasebook and friends: Leveraging discrete representations for source separation," arXiv preprint, 2018.
[15] S. Takamichi, Y. Saito, N. Takamune, D. Kitamura, and H. Saruwatari, "Phase reconstruction from amplitude spectrograms based on von-Mises-distribution deep neural network," in Int. Workshop Acoust. Signal Enhance. (IWAENC), Sept. 2018, pp. 286–290.
[16] K. Oyamada, H. Kameoka, K. Tanaka, T. Kaneko, N. Hojo, and H. Ando, "Generative adversarial network-based approach to signal reconstruction from magnitude spectrograms," in Eur. Signal Process. Conf. (EUSIPCO), Sept. 2018.
[17] P. Magron, J. Le Roux, and T. Virtanen, "Consistent anisotropic Wiener filtering for audio source separation," in IEEE Workshop Appl. Signal Process. Audio Acoust. (WASPAA), Oct. 2017, pp. 269–273.
[18] Z.-Q. Wang, J. Le Roux, D. Wang, and J. R. Hershey, "End-to-end speech separation with unfolded iterative phase reconstruction," in INTERSPEECH, Sept. 2018, pp. 2708–2712.
[19] G. Wichern and J. Le Roux, "Phase reconstruction with learned time-frequency representations for single-channel speech separation," in Int. Workshop Acoust. Signal Enhance. (IWAENC), Sept. 2018, pp. 396–400.
[20] K. Gregor and Y. LeCun, "Learning fast approximations of sparse coding," in Int. Conf. Mach. Learn. (ICML), 2010, pp. 399–406.
[21] M. Borgerding, P. Schniter, and S. Rangan, "AMP-inspired deep networks for sparse linear inverse problems," IEEE Trans. Signal Process., vol. 65, no. 16, pp. 4294–4308, Aug. 2017.
[22] Y. Yang, J. Sun, H. Li, and Z. Xu, "Deep ADMM-Net for compressive sensing MRI," in Adv. Neural Inf. Process. Syst. (NIPS), pp. 10–18. Curran Associates, Inc., 2016.
[23] J. R. Hershey, J. Le Roux, and F. Weninger, "Deep unfolding: Model-based inspiration of novel deep architectures," arXiv preprint arXiv:1409.2574, 2014.
[24] S. Wisdom, J. Hershey, J. Le Roux, and S. Watanabe, "Deep unfolding for multichannel source separation," in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Mar. 2016, pp. 121–125.
[25] N. Perraudin, P. Balazs, and P. L. Søndergaard, "A fast Griffin–Lim algorithm," in IEEE Workshop Appl. Signal Process. Audio Acoust. (WASPAA), Oct. 2013, pp. 1–4.
[26] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), June 2016.
[27] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, "Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising," IEEE Trans. Image Process., vol. 26, no. 7, pp. 3142–3155, July 2017.
[28] K. Zhang, W. Zuo, S. Gu, and L. Zhang, "Learning deep CNN denoiser prior for image restoration," in IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), July 2017, pp. 2808–2817.
[29] S. V. Venkatakrishnan, C. A. Bouman, and B. Wohlberg, "Plug-and-Play priors for model based reconstruction," in IEEE Glob. Conf. Signal, Inf. Process., Dec. 2013, pp. 945–948.
[30] Y. Romano, M. Elad, and P. Milanfar, "The little engine that could: Regularization by denoising (RED)," SIAM J. Imaging Sci., vol. 10, no. 4, pp. 1804–1844, 2017.
[31] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, "Language modeling with gated convolutional networks," arXiv preprint arXiv:1612.08083, 2016.
[32] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 7, pp. 2125–2136, 2011.
[33] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) — A new method for speech quality assessment of telephone networks and codecs," in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), May 2001, pp. 749–752.
