GELP: GAN-Excited Linear Prediction for Speech Synthesis from Mel-spectrogram



Lauri Juvela 1, Bajibabu Bollepalli 1, Junichi Yamagishi 2,3, Paavo Alku 1
1 Aalto University, Finland; 2 National Institute of Informatics, Japan; 3 University of Edinburgh, UK
{lauri.juvela, bajibabu.bollepalli, paavo.alku}@aalto.fi, jyamagis@nii.ac.jp

Abstract

Recent advances in neural network-based text-to-speech have reached human-level naturalness in synthetic speech. The present sequence-to-sequence models can directly map text to mel-spectrogram acoustic features, which are convenient for modeling, but present additional challenges for vocoding (i.e., waveform generation from the acoustic features). High-quality synthesis can be achieved with neural vocoders, such as WaveNet, but such autoregressive models suffer from slow sequential inference. Meanwhile, their existing parallel inference counterparts are difficult to train and require increasingly large model sizes. In this paper, we propose an alternative training strategy for a parallel neural vocoder utilizing generative adversarial networks, and integrate a linear predictive synthesis filter into the model. Results show that the proposed model achieves significant improvement in inference speed, while outperforming a WaveNet in copy-synthesis quality.

Index Terms: Neural vocoder, Source-filter model, GAN, WaveNet

1. Introduction

In recent years, the state of the art in text-to-speech (TTS) synthesis has been advancing rapidly. The introduction of deep neural networks for acoustic modeling [1] in statistical parametric speech synthesis (SPSS) [2] started the ongoing trend of deep learning in TTS, and was soon followed by recurrent networks [3] and, more recently, sequence-to-sequence models with attention [4, 5].
The latter learn to align input text (or phoneme) sequences with the target acoustic feature sequences, which in effect integrates the duration and acoustic models into a single neural network. Meanwhile, advances in neural network models for speech waveform generation, such as WaveNet [6], have raised the bar in synthetic speech quality. Although the original WaveNet was controlled mainly with linguistic features, it used supplementary acoustic information from duration and pitch predictor models. The role of a waveform generator model has since shifted more to a "neural vocoder" [7], i.e., the model is conditioned directly on acoustic features. With this division of labor, neural vocoders have been shown to produce high-quality synthetic speech when paired with a sequence-to-sequence model [8] or a more conventional SPSS pipeline [9]. However, WaveNet and similar models suffer from their sample-by-sample sequential inference process. Although inference can be made faster with dilation buffering [10] or by optimizing recurrent neural networks (RNNs) for real-time use [11], sequential inference remains ill-suited for massively parallel applications. To remedy this, models capable of parallel inference have been proposed [12, 13]. The training setup involves two models: a pre-trained autoregressive model (teacher) is used to score the outputs of a feed-forward parallel model (student). The student training criterion is related to inverse autoregressive flows, and further flow-based models have been presented since [14]. A major issue with these parallel synthesis models is their restriction to invertible transforms, which limits model capacity utilization. This in turn leads to large models that require a considerable amount of training [13, 14].
In addition to flow-related scoring rules, more conventional short-time Fourier transform (STFT)-based regression losses have been found essential in training parallel waveform generators [12, 13]. Furthermore, similar models have been trained solely using STFT-based losses [15]. That method builds on the classical source-filter model of speech and learns a neural filter for a mixed excitation signal constructed from harmonic and noise components [15]. However, this approach requires explicit voicing and pitch inputs, and is sensitive to the alignment between generated and target signals when an STFT phase loss is used. Meanwhile, the source-filter model has also been used with linear predictive (LP) filters to build neural models for the glottal excitation [16, 17] or the LP residual signal [18, 19]. The use of LP inverse filtering can be seen as analogous to the flow-based approaches, as both apply an invertible transform whose residual is more Gaussian and thus easier to model [17]. However, these LP-based models still use sequential inference (similar to WaveNet or WaveRNN), so the inherent speed limitations remain in place.

Generative adversarial networks (GANs) [20] are especially appealing for waveform synthesis, as they enable unrestricted use of feedforward architectures capable of parallel inference. GAN-based residual excitation models have been proposed previously for all-pole envelopes derived from mel-frequency cepstral coefficients (MFCCs) [21], with further refinements to the GAN architectures and training procedure presented in [22]. However, these models operate in a pitch-synchronous synthesis framework, which makes them sensitive to pitch marking accuracy during training and requires pitch information at synthesis time. Further, the models were optimized solely in the excitation domain, which may be perceptually sub-optimal [18, 19].
In present Tacotron-style TTS systems [8], the most commonly used acoustic feature is the log-compressed mel-spectrogram, which utilizes the classical MFCC triangular filterbank [23]. The mel-spectrogram is appealing due to its simple estimation, perceptual relevance, and overall ubiquity in speech applications. However, from a TTS perspective, the mel-spectrogram poses additional challenges due to its lack of explicit voicing and pitch information. Nevertheless, the typically used 80 mel filters have enough frequency resolution to include some harmonics, which is sufficient for rudimentary waveform reconstruction [24]. Altogether, there remains a need for a fast, high-quality neural vocoder capable of generating speech waveforms directly from mel-spectra with relatively little training.

In this paper, we combine an MFCC-based envelope model [21] with recent GAN training insights [22]. Furthermore, we use neural net architectures that operate directly on raw audio, and integrate an LP synthesis filter capable of parallel inference into the computation graph. Inspired by classic speech coding, we call the proposed method "GAN-excited linear prediction" (GELP).

2. Methods

Figure 1 shows an overview of the training setup. The model consists of three trainable components: a generator G and a discriminator D, both operating at the audio rate, and a conditioning model C, operating at the control frame rate. All the models are non-causal 1-D convolution nets, as detailed in Section 2.3. The conditioning model creates a context embedding c of the mel-spectrogram at the frame rate, which is linearly upsampled to the audio rate before being input to G and D. The input z of the generator model is a white noise sequence, sampled at the audio rate, which the model transforms into an LP residual signal ê.
To produce synthetic speech, the generated residual ê is fed as an excitation to an LP synthesis filter whose coefficients are derived from the mel-spectrogram. No direct voicing or pitch features are provided to the model, so the model has to infer this information from the input mel-spectrogram. In contrast, we explicitly use the spectral envelope information contained in the mel-spectrogram. This is achieved by first fitting an all-pole envelope to the mel-spectrum (Section 2.1) and, second, implementing a synthesis filter suitable for parallel inference (Section 2.2). To train G and C, two kinds of loss functions are used: 1) a regression loss based on STFT magnitudes, and 2) a time-domain adversarial loss propagated through the discriminator model D. The former guides the generator to learn the mel band energies of speech, which is vital for creating the harmonics of voiced sounds. The latter loss guides the system to learn, in the time domain, the speech signal's phase information and fine stochastic details.

2.1. Envelope recovery from mel-spectrogram

Spectral envelope fitting to MFCCs has been proposed previously for neural vocoding in [21]. The previous study focused on lower-order MFCCs, while a typical Tacotron configuration uses high-resolution mel-spectra with 80 mel filters and omits the discrete cosine transform to the cepstrum domain. Nevertheless, the method for fitting an all-pole envelope to a mel-spectrum is directly applicable. The mel-spectrogram m is computed as

m = log(M X),

where X is a pre-emphasized STFT magnitude spectrogram and M is a mel-filterbank matrix. A reconstruction is obtained simply by using the pseudo-inverse of M and flooring the result with a small positive ε to prevent negative values:

X̃ = max(M⁺ exp(m), ε).

An all-pole envelope in frame k is obtained by computing the autocorrelation sequence from X̃_k via the inverse FFT, and solving the resulting normal equations for the LP polynomial a_k [25].
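The envelope recovery above can be sketched in a few lines of numpy/scipy. This is a minimal single-frame illustration under stated assumptions, not the authors' implementation: the filterbank matrix `mel_basis`, the LP order, and the flooring constant are stand-ins, pre-emphasis is omitted, and the function name `allpole_from_mel` is hypothetical.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def allpole_from_mel(mel_frame, mel_basis, n_fft=1024, order=20, floor=1e-6):
    """Fit an all-pole envelope to one log-mel frame (sketch of Sec. 2.1).

    mel_frame: log mel energies, shape (n_mels,)
    mel_basis: mel filterbank matrix M, shape (n_mels, n_fft//2 + 1)
    Returns the LP polynomial a (length order+1, with a[0] == 1).
    """
    # Invert the mel compression: X~ = max(M+ exp(m), eps)
    X = np.maximum(np.linalg.pinv(mel_basis) @ np.exp(mel_frame), floor)
    # Autocorrelation via inverse FFT of the (mirrored) power spectrum
    power = np.concatenate([X, X[-2:0:-1]]) ** 2
    r = np.fft.ifft(power).real[: order + 1]
    # Solve the normal (Yule-Walker) equations via Levinson recursion
    a = solve_toeplitz(r[:order], r[1 : order + 1])
    return np.concatenate([[1.0], -a])
```

Because the autocorrelation method is used, the resulting polynomial is minimum-phase, so the corresponding synthesis filter 1/A(z) is stable, which Section 2.2 relies on when truncating its impulse response.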
2.2. Synthesis filter for parallel inference

The LP synthesis filter corresponding to a_k has an infinite impulse response. However, the impulse response can be truncated without noticeable degradation due to the minimum-phase property of LP. The synthesis filter can then be applied frame-wise in the STFT domain, and the time-domain filtered signal is obtained simply by the inverse STFT (ISTFT). More specifically, let the (complex-valued) frequency response of the LP polynomial be

A_k = FFT{a_k},    (1)

where the FFT is zero-padded to match the number of frequency bins in the STFT. The corresponding synthesis filter is the inverted frequency response

H_k = exp(−i ∠A_k) / max(|A_k|, ε),    (2)

i.e., the numerator contains the sign-inverted phase and the denominator is the magnitude floored by a small positive ε. Filtering the entire excitation signal ê = G(z, c) is applied by multiplication in the STFT domain:

x̂ = ISTFT{STFT{ê} ⊙ H}.    (3)

We use a cosine window as both the STFT analysis window and the ISTFT synthesis window. As all the operations are differentiable, the gradients with respect to the synthetic speech, ∂x̂/∂G and ∂x̂/∂C, can be computed similarly to [15] and used to update G and C.

Figure 1: Model training setup. An input mel-spectrogram is passed to a conditioning model C, upsampled, and used to control an excitation generator G. The generator transforms a white noise input into an excitation signal, which is then filtered with an all-pole spectral envelope extracted from the mel-spectrum. The resulting signal is trained to match a target speech signal by regression on the STFT magnitude and a time-domain adversarial loss provided by discriminator D.
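Eqs. (1)–(3) amount to a frame-wise complex multiplication in the STFT domain. Below is a minimal numpy/scipy sketch, not the authors' differentiable graph: it assumes the LP frames already align with the STFT frames, uses scipy's 'cosine' window as a stand-in for the paper's cosine window, and the helper name `filter_excitation` is hypothetical.

```python
import numpy as np
from scipy.signal import stft, istft

def filter_excitation(e, lp_polys, n_fft=512, floor=1e-6):
    """Apply frame-wise LP synthesis filters in the STFT domain (Sec. 2.2 sketch).

    e: excitation signal, shape (T,)
    lp_polys: LP polynomials a_k, shape (n_frames, order+1)
    """
    _, _, E = stft(e, nperseg=n_fft, noverlap=n_fft // 2, window='cosine')
    # Eq. (1): zero-padded FFT of each LP polynomial, one column per frame
    A = np.fft.rfft(lp_polys, n=n_fft, axis=1).T        # (n_fft//2 + 1, n_frames)
    # Eq. (2): sign-inverted phase over the floored magnitude
    H = np.exp(-1j * np.angle(A)) / np.maximum(np.abs(A), floor)
    # Eq. (3): frame-wise multiplication, then inverse STFT with overlap-add
    n_frames = min(E.shape[1], H.shape[1])              # guard against frame drift
    _, x = istft(E[:, :n_frames] * H[:, :n_frames],
                 nperseg=n_fft, noverlap=n_fft // 2, window='cosine')
    return x
```

Because every frame is filtered independently, the whole operation is a single batched array computation, which is what makes the synthesis filter compatible with parallel inference.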
2.3. Network architectures

All networks in this paper use the same basic architecture, namely a dilated convolution residual network with gated activations and skip connections to the output. In other words, the architecture is similar to the non-causal feedforward WaveNet [26]. More specifically, a dilated convolution block receives an input tensor x_i from the previous layer and (optionally) a conditioning tensor c. Both tensors are of shape (B, T, R), where B is the batch size, T is the number of time steps, and R is the number of residual channels. A hidden state h_i is computed from the inputs as

h_i = tanh(W_f,i ∗ x_i + V_f,i c) ⊙ σ(W_g,i ∗ x_i + V_g,i c),    (4)

where ∗ denotes (dilated) convolution. The layer output y_i = W_o,i h_i + x_i is passed as input to the next convolution layer, and the h_i are connected to a post-processing layer via skip connections. The post-processing layer concatenates the skip connections channel-wise and applies an affine projection, followed by a tanh non-linearity before a final affine projection to the output. Biases are used throughout, but omitted here for brevity.

In G and C, the convolutions are zero-padded in order to match the timesteps between the input and output. No zero padding is used in D, and the discriminator gradually reduces its input receptive field length to a single timestep. Furthermore, residual connections are not used in D. Notably, the proposed architectures are non-causal, which is permissible for the intended use with utterance-level bi-directional acoustic models.

2.4. Losses

The current study adapts the loss functions from our previous work [22]. However, we can now define the losses directly in the speech domain, since the LP synthesis filter and overlap-add are integrated into the computation graph. Denote the target signal by x, and the generator model G output passed through the synthesis filter by x̂.
Additionally, the output of the conditioning model, c = C(m), is made accessible to both G and D. The goal of the generator is to produce x̂ that appears to come from the same distribution as x ∼ p_X. The conditioning model C collaborates with G: it shares the same training objective and obtains its learning signals through G. We use the Wasserstein GAN [27] for our main GAN loss

L_GAN = −E_{x∼p_X}[D(x, c)] + E_{x̂∼p_G}[D(x̂, c)],    (5)

which the discriminator aims to minimize, and the generator tries to maximize. To keep the discriminator function sufficiently smooth, we use the gradient penalty proposed in [27]:

L_GP = E_{x∼p_X, x̂∼p_G}[ (‖∇_x̃ D(x̃, c)‖ − 1)² ],    (6)

where x̃ = ε x + (1 − ε) x̂ is sampled randomly along the line segment between x and x̂. Additionally, we regularize the discriminator gradient magnitude at the real data samples to avoid the non-convergent training dynamic described in [28]:

L_R1 = E_{x∼p_X}[ ‖∇_x D(x, c)‖² ].    (7)

Finally, we use the following loss term based on the mean squared error of the STFT magnitudes:

L_STFT = E[ (|STFT{x}| − |STFT{x̂}|)² ].    (8)

Notably, the STFT loss uses only the Fourier magnitude, which is insensitive to phase alignment within the short-time frame. All learning signals related to the phase information originate from the discriminator, which in turn does not require parallel data during training. The total training objective for G and C is to minimize

L_{G,C} = λ₁ L_STFT − L_GAN,    (9)

while D attempts to minimize

L_D = L_GAN + λ₂ L_GP + λ₃ L_R1.    (10)

3. Experiments

3.1. Data

The experiments were conducted on the "Nancy" dataset (from the Blizzard Challenge 2011 [29]), spoken by a professional female US English voice talent. The dataset comprises approximately 12000 utterances in a normal read style with medium-to-high expressiveness, totaling 16 h 45 min of speech.
The audio was resampled to 16 kHz, and utterances longer than 8.75 seconds were excluded. The remaining 11643 utterances were randomly split into training (11000 utts), validation (200 utts), and test (443 utts) sets.

3.2. Training the neural vocoder

During training, the data was split into one-second segments, such that one training iteration corresponds to processing one second of speech. The model was pre-trained in the residual excitation domain, i.e., the inverse filter was applied to the target speech signal to obtain a target excitation. Pre-training the model speeds up the training considerably; harmonics started to emerge already at around 60k iterations. After excitation model training for 200k iterations, the optimization criteria were switched to the speech signal domain and training continued until a total of 1M iterations. This corresponds to approximately 15 full epochs through the dataset. For discriminator input, we randomly cropped 32 receptive-field-length segments from the reference and generated signals at each iteration. The models were optimized using alternating minibatch updates with the Adam optimizer [30] at the default hyperparameters (LR=1e-4, β1=0.9, β2=0.999). Loss weights were chosen experimentally, so that the weighted STFT and GAN losses have roughly the same order of magnitude: we use λ1 = 10 for the STFT loss, and λ2 = 10 and λ3 = 1 for the gradient penalties. The network configurations used for the experiments are listed in Table 1.

Table 1: GELP network configurations.

                         G      D      C
  Residual channels      64     64     64
  Skip channels          64     64     64
  Filter width           5      5      5
  Dilated stacks         3      3      2
  Dilation cycle         8      7      4
  Residual connection    yes    no     yes
  Operation rate         Audio  Audio  Frame

3.3. Tacotron synthesis system

Our Tacotron system is similar to the Tacotron-1 architecture [5], with minor modifications. The main modification was to predict mel-spectrograms instead of linear spectrograms as the final output of the system.
The input linguistic features were mono-phonemes extracted using the Combilex lexicon [31] and represented as one-hot vectors. Mel-spectrograms were normalized to lie between 0 and 1 using min-max normalization. The minibatch size was set to 32, and 2 acoustic frames were predicted for each output step. The initial learning rate was set to 0.002 and adjusted during training using the Noam scheme [32] with 4000 warmup steps. All the parameters of the system were optimized using the Adam optimizer. The system was trained on a single NVIDIA Titan X GPU for 150k steps. The acoustic features in the Tacotron system use an 80 Hz frame rate and a 50 ms window. For the GELP vocoder, the features were upsampled to a 200 Hz rate.

3.4. Reference methods

As a simple baseline system for mel-spectrogram-to-waveform synthesis, we first convert from the mel scale to the linear scale via the filterbank pseudo-inverse, and then apply the Griffin-Lim phase recovery algorithm [33]. This approach avoids the use of the linear-scale spectrogram predictor post-net proposed in [5], at the cost of some quality loss. However, the intended use of phase recovery in Tacotron-based synthesis is generally not high-quality synthesis, but rather a development tool for quickly checking whether the generated acoustic features are meaningful.

For a reference autoregressive WaveNet vocoder, we trained a speaker-dependent model conditioned on mel-spectrograms, following the "Wave-30" configuration from [17]. The system uses a dilation cycle of 10 (i.e., dilations 1, 2, ..., 512) repeated three times, 64 residual channels, and 256 post-net channels. The model was trained on softmax cross-entropy using 8-bit amplitude quantization and µ-law companding. The conditioning mel-spectrogram features were stacked for a five-frame time context.
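The Griffin-Lim baseline can be sketched as pseudo-inverse mel inversion followed by plain Griffin-Lim iteration. This is an illustrative reimplementation under assumed parameters (Hann window, FFT size, iteration count), not the exact baseline configuration; `griffin_lim_from_mel` is a hypothetical helper name.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim_from_mel(log_mel, mel_basis, n_fft=1024, n_iter=32, floor=1e-6):
    """Baseline of Sec. 3.4 (sketch): mel pseudo-inverse + Griffin-Lim."""
    # Mel-to-linear magnitude via the filterbank pseudo-inverse, floored positive
    S = np.maximum(np.linalg.pinv(mel_basis) @ np.exp(log_mel), floor)
    phase = np.exp(2j * np.pi * np.random.rand(*S.shape))   # random phase init
    for _ in range(n_iter):
        # Alternate between time domain and STFT domain, keeping |S| fixed
        _, x = istft(S * phase, nperseg=n_fft, noverlap=n_fft // 2)
        _, _, X = stft(x, nperseg=n_fft, noverlap=n_fft // 2)
        phase = np.exp(1j * np.angle(X[:, : S.shape[1]]))   # guard frame drift
    _, x = istft(S * phase, nperseg=n_fft, noverlap=n_fft // 2)
    return x
```

As the text notes, this route is mainly a development tool; the quality ceiling comes from both the lossy mel inversion and the phase recovery itself.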
3.5. Listening test

Subjective quality of the systems was evaluated on the 5-level absolute category rating scale, ranging from 1 ("Bad") to 5 ("Excellent"). Figure 2 shows the test results as mean opinion scores (MOS) [34] with 95% confidence intervals, along with stacked score distribution histograms. The Mann-Whitney U-test found all differences between systems statistically significant at p < 0.05, corrected for multiple comparisons. The listening tests were split into two groups: copy-synthesis using natural acoustic features, and Tacotron TTS using generated acoustic features. Both tests include 50 randomly chosen test set utterances, and 20 evaluations were collected for each utterance and system combination. The tests were conducted on the Figure Eight crowd-sourcing platform [35]. Demonstration samples are available at https://ljuvela.github.io/GELP/demopage/.

4. Discussion

In copy-synthesis, GELP achieved higher MOS scores than the reference WaveNet vocoder. However, in the Tacotron TTS experiment, the mean quality of the proposed system dropped below that of WaveNet. The quality of the WaveNet-based TTS system is robust to mismatch between natural and generated acoustic features due to WaveNet's ability to correct its behavior based on previous predictions [18]. In contrast, the parallel inference in the proposed GELP model generates the full waveform in a single forward pass, which does not allow corrective, AR-type feedback.

On the other hand, a clear benefit of the proposed system is its parallel inference, which makes it significantly faster than the autoregressive WaveNet. When synthesizing test set utterances, GELP generated on average 389k samples/second (approximately 24 times the real-time equivalent rate) on a Titan X Pascal GPU. In comparison, the reference WaveNet with sequential inference generated 217 samples/second on average, which is slower by a factor of about 1800.
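The reported figures are mutually consistent, which a quick arithmetic check confirms (variable names are illustrative; the 16 kHz sample rate comes from Section 3.1):

```python
fs = 16_000            # audio sample rate (Hz), from Sec. 3.1
gelp_rate = 389_000    # GELP: generated samples per second
wavenet_rate = 217     # sequential WaveNet: generated samples per second

realtime_factor = gelp_rate / fs        # ~24.3, "approximately 24 times real time"
speedup = gelp_rate / wavenet_rate      # ~1793, i.e. roughly the reported 1800
print(round(realtime_factor, 1), round(speedup))
```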
Although sequential inference can be sped up by careful DSP implementation, it remains unable to fully leverage the parallel processing power provided by GPUs. Similar speedups over sequential inference have been reported for other parallel neural vocoders [12, 13, 15], but a comparison with these is left as future work due to the limited resources available for implementation and computation.

Figure 2: Mean opinion scores with 95% confidence intervals. Stacked score distribution histograms are shown in the background.

The MOS score measured for the WaveNet vocoder in the current study may seem low compared to the score (about 4.5) reported in [8]. However, similar studies reporting variations in WaveNet performance (depending on speaker and acoustic features) are found in the literature: MOS scores around 3.5 have been reported when using mel-filterbank [36] or mel-cepstrum [7] acoustic features. Meanwhile, our version of WaveNet has previously achieved MOS scores above 4 using glottal vocoder acoustic features [17]. The performance of a parallel waveform generator is likely to increase with the use of a separate pitch predictor model [15]. However, in this work we chose to limit the acoustic features to the mel-spectrogram for its relative simplicity.

5. Conclusions

This paper proposed the "GAN-excited linear prediction" (GELP) neural vocoder for fast synthesis of speech waveforms from mel-spectrogram features. The proposed model leverages the spectral envelope information contained in the mel-spectra, and implements an all-pole synthesis filter as part of the computation graph. The model is optimized using STFT magnitude regression and GAN-based losses in the time domain. The proposed model has a considerably faster inference speed than a sequential WaveNet vocoder, while outperforming the WaveNet in copy-synthesis quality. Future work includes narrowing the quality gap between natural and generated acoustic features.

6. Acknowledgements

This study was supported by the Academy of Finland (project 312490), by JST CREST Grant Number JPMJCR18A6 (VoicePersonae project), Japan, and by MEXT KAKENHI Grant Numbers 16H06302, 17H04687, 18H04120, 18H04112, and 18KT0051, Japan. We acknowledge the computational resources provided by the Aalto Science-IT project.

7. References

[1] H. Zen, A. Senior, and M. Schuster, "Statistical parametric speech synthesis using deep neural networks," in Proc. ICASSP, May 2013, pp. 7962–7966.
[2] H. Zen, K. Tokuda, and A. W. Black, "Statistical parametric speech synthesis," Speech Communication, vol. 51, no. 11, pp. 1039–1064, 2009.
[3] Y. Fan, Y. Qian, F.-L. Xie, and F. K. Soong, "TTS synthesis with bidirectional LSTM based recurrent neural networks," in Proc. Interspeech, 2014, pp. 1964–1968.
[4] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, and Y. Bengio, "Char2Wav: End-to-end speech synthesis," in ICLR, workshop track, 2017.
[5] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, "Tacotron: Towards end-to-end speech synthesis," in Proc. Interspeech, 2017, pp. 4006–4010.
[6] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv pre-print, 2016, http://arxiv.org/abs/1609.03499.
[7] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, "Speaker-dependent WaveNet vocoder," in Proc. Interspeech, 2017, pp. 1118–1122.
[8] J. Shen, R. Pang, R. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. Saurous, Y. Agiomyrgiannakis, and Y. Wu, "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in Proc. ICASSP, 2018, pp. 4779–4783.
[9] X. Wang, J. Lorenzo-Trueba, S. Takaki, L. Juvela, and J. Yamagishi, "A comparison of recent waveform generation and acoustic modeling methods for neural-network-based speech synthesis," in Proc. ICASSP, 2018, pp. 4804–4808.
[10] P. Ramachandran, T. L. Paine, P. Khorrami, M. Babaeizadeh, S. Chang, Y. Zhang, M. A. Hasegawa-Johnson, R. H. Campbell, and T. S. Huang, "Fast generation for convolutional autoregressive models," in Proc. ICLR (Workshop track), 2017.
[11] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, "Efficient neural audio synthesis," arXiv pre-print, 2018, http://arxiv.org/abs/1802.08435.
[12] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van den Driessche, E. Lockhart, L. C. Cobo, F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen, N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, and D. Hassabis, "Parallel WaveNet: Fast high-fidelity speech synthesis," arXiv pre-print, 2017, http://arxiv.org/abs/1711.10433.
[13] W. Ping, K. Peng, and J. Chen, "ClariNet: Parallel wave generation in end-to-end text-to-speech," arXiv pre-print, 2018, https://arxiv.org/abs/1807.07281.
[14] R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," arXiv pre-print, 2018, http://arxiv.org/abs/1811.00002.
[15] X. Wang, S. Takaki, and J. Yamagishi, "Neural source-filter-based waveform model for statistical parametric speech synthesis," in Proc. ICASSP, 2019, https://arxiv.org/abs/1810.11946.
[16] L. Juvela, V. Tsiaras, B. Bollepalli, M. Airaksinen, J. Yamagishi, and P. Alku, "Speaker-independent raw waveform model for glottal excitation," in Proc. Interspeech, 2018, pp. 2012–2016.
[17] L. Juvela, B. Bollepalli, V. Tsiaras, and P. Alku, "GlotNet—a raw waveform model for the glottal excitation in statistical parametric speech synthesis," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019.
[18] J.-M. Valin and J. Skoglund, "LPCNet: Improving neural speech synthesis through linear prediction," in Proc. ICASSP, 2019, https://arxiv.org/abs/1810.11846.
[19] M.-J. Hwang, F. Soong, F. Xie, X. Wang, and H.-G. Kang, "LP-WaveNet: Linear prediction-based WaveNet speech synthesis," arXiv pre-print, 2018, http://arxiv.org/abs/1811.11913.
[20] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[21] L. Juvela, B. Bollepalli, X. Wang, H. Kameoka, M. Airaksinen, J. Yamagishi, and P. Alku, "Speech waveform synthesis from MFCC sequences with generative adversarial networks," in Proc. ICASSP, 2018, pp. 5679–5683.
[22] L. Juvela, B. Bollepalli, J. Yamagishi, and P. Alku, "Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks," in Proc. ICASSP, 2019, https://arxiv.org/abs/1810.12598.
[23] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357–366, 1980.
[24] L. E. Boucheron, P. L. De Leon, and S. Sandoval, "Low bit-rate speech coding through quantization of mel-frequency cepstral coefficients," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 2, pp. 610–619, 2012.
[25] J. Makhoul, "Linear prediction: A tutorial review," Proceedings of the IEEE, vol. 63, no. 4, pp. 561–580, Apr. 1975.
[26] D. Rethage, J. Pons, and X. Serra, "A WaveNet for speech denoising," in Proc. ICASSP, 2018, pp. 5069–5073.
[27] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, "Improved training of Wasserstein GANs," in Advances in Neural Information Processing Systems 30, 2017, pp. 5767–5777.
[28] L. Mescheder, A. Geiger, and S. Nowozin, "Which training methods for GANs do actually converge?" in Proc. ICML, vol. 80, 2018, pp. 3481–3490.
[29] S. King and V. Karaiskos, "The Blizzard Challenge 2011," in Blizzard Challenge 2011 Workshop, Turin, Italy, September 2011.
[30] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. ICLR, 2015.
[31] K. Richmond, R. A. Clark, and S. Fitt, "Robust LTS rules with the Combilex speech technology lexicon," in Proc. Interspeech, Brighton, September 2009, pp. 1295–1298.
[32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[33] D. W. Griffin and J. S. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, Apr. 1984.
[34] International Telecommunication Union, "Methods for Subjective Determination of Transmission Quality," ITU-T Recommendation P.800, Geneva, Switzerland, Aug. 1996.
[35] Figure Eight Inc., "Crowd-sourcing platform," https://www.figure-eight.com/, accessed 2019-04-04.
[36] N. Adiga, V. Tsiaras, and Y. Stylianou, "On the use of WaveNet as a statistical vocoder," in Proc. ICASSP, 2018, pp. 5674–5678.
