Effect of data reduction on sequence-to-sequence neural TTS
Javier Latorre, Jakub Lachowicz, Jaime Lorenzo-Trueba, Thomas Merritt, Thomas Drugman, Srikanth Ronanki, Viacheslav Klimkov
Amazon.com
{jlatorre, lachj, truebaj, thommer, drugman, ronanks, vklimkov}@amazon.com

(Paper submitted to IEEE ICASSP 2019)

Abstract

Recent speech synthesis systems based on sampling from autoregressive neural network models can generate speech almost indistinguishable from human recordings. However, these models require large amounts of data. This paper shows that the lack of data from one speaker can be compensated with data from other speakers. The naturalness of Tacotron2-like models trained on a blend of 5k utterances from 7 speakers is better than that of speaker-dependent models trained on 15k utterances, and the multi-speaker models are always more stable. We also demonstrate that models mixing only 1250 utterances from a target speaker with 5k utterances from another 6 speakers can produce significantly better quality than state-of-the-art DNN-guided unit selection systems trained on more than 10 times the data from the target speaker.

Index Terms: statistical parametric speech synthesis, autoregressive, neural vocoder, generative models, sequence-to-sequence

1. Introduction

Data acquisition is one of the main problems of data-driven text-to-speech (TTS) systems. High-quality unit selection TTS relies on large single-speaker databases, usually of tens of hours of speech. Classical statistical parametric speech synthesis (SPSS) is more data-frugal: less than one hour of data is enough to train an intelligible speaker-dependent (SD) model. More data improves SPSS quality, but from around 4-5 hours of data onwards quality tends to saturate [1]. To reduce the dependency on a single speaker, techniques that mix data from multiple speakers into an Average Voice Model (AVM) were developed. These techniques produce reasonable quality with as little as 3 minutes of target speaker data [2]. However, when the available target speaker data exceeds 2 hours (~2k utterances), SD models were better [3].

The change of paradigm introduced by autoregressive models [4, 5, 6, 7, 8] has produced synthetic speech of unprecedented quality. These new models require much more data than traditional TTS, but they are also more efficient at integrating diverse data [9, 10, 11]. Several studies have reported that it is easy to train multi-speaker models [9, 12] and that adding more speakers improves the loss over the validation set [4]. Most approaches to multi-speaker models rely on a speaker embedding, but they vary in the type of embedding and where it is applied. Whereas some use an external model, e.g. a speaker classifier, to provide the embeddings [13, 12], others train the speaker embedding together with the model from a one-hot speaker ID vector [4, 9, 5]. Some approaches use the embedding at the input only, as global conditioning [4], whereas others apply it at different levels within the model [9, 5].
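To make the one-hot approach concrete, the following is a minimal, hypothetical PyTorch sketch of a speaker embedding learned jointly with the model and applied as global conditioning. The class name, dimensions and the concatenation strategy are illustrative assumptions, not taken from any of the cited systems.

```python
# Hypothetical sketch: a speaker embedding trained jointly with the model from a
# one-hot speaker ID and broadcast over time as a single global conditioning vector.
import torch
import torch.nn as nn

class GlobalSpeakerConditioning(nn.Module):
    def __init__(self, n_speakers: int, emb_dim: int = 64):
        super().__init__()
        # nn.Embedding is equivalent to multiplying a one-hot speaker ID by a trainable matrix.
        self.speaker_embedding = nn.Embedding(n_speakers, emb_dim)

    def forward(self, encoder_outputs: torch.Tensor, speaker_id: torch.Tensor) -> torch.Tensor:
        # encoder_outputs: (batch, time, dim); speaker_id: (batch,)
        spk = self.speaker_embedding(speaker_id).unsqueeze(1)   # (batch, 1, emb_dim)
        spk = spk.expand(-1, encoder_outputs.size(1), -1)       # repeat over all timesteps
        return torch.cat([encoder_outputs, spk], dim=-1)        # globally conditioned states
```

Approaches that condition at different levels would instead inject the same embedding into intermediate layers of the decoder or vocoder rather than only at the encoder output.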
Despite all the recent attention to multi-speaker models, to the best of our knowledge nobody has yet published a study on practical issues such as: 1) at which point an SD model becomes better than a multi-speaker one; 2) whether it is better or worse to use gender-dependent multi-speaker models; 3) what the effect is of training models with an unbalanced mixture of data from the target speaker and other speakers. This paper presents the results of several experiments aimed at answering these questions. We hope our results will help other developers and researchers in designing their systems and experiments.

The structure of the paper is as follows: Section 2 describes the basic structure of our TTS system; Sections 3 and 4 describe the experimental protocol and results respectively. Finally, conclusions are drawn in Section 5.

2. System description

Our system architecture follows that of Tacotron2 [8]. First, a sequence-to-sequence (S2S) acoustic model predicts mel-spectrograms from a sequence of linguistic inputs. Then a neural vocoder converts the mel-spectrograms into a waveform.

2.1. Acoustic model

The architecture of the acoustic model is shown in Figure 1. It is an S2S model with an attention mechanism, as in [8]. However, instead of using raw graphemes as inputs, our system first converts the graphemes into phonemes, which are then encoded as one-hot vectors. For the vowels, we use 3 different symbols depending on their level of stress (0, 1, 2). The punctuation after each word, including blanks, is treated as if it were another phoneme.

[Figure 1: Acoustic model architecture]

The attention mechanism for the S2S model follows the one proposed in [14] with normalised attention weights [15]. In this mechanism, the attention weights for the current frame depend both on the previous output of the decoder and on the attention weights of the previous frame. The speaker conditioning is similar to [4], with a one-hot speaker ID and global conditioning.

The output of the model consists of blocks of 5 frames of mel-spectrograms, each frame being an 80-dimensional vector spanning frequencies between 50 Hz and 12 kHz. Each frame is computed over 50 ms and shifted every 12.5 ms. The last frame of the previous block is passed as input to both the attention model and the decoder to generate the next 5-frame block. During training, this recursive input is randomly switched between real spectrograms and self-generated ones (scheduled sampling); the probability of taking real spectrograms is 0.9. In addition to the mel-spectrograms, the model also predicts a stop token to mark the end of the utterance. The stop token is encoded as a real number between 0 and 1 that reaches the value of 1 at the end of the sentence. The model was trained with a dropout probability of 0.1 for both the decoder and the auto-regression, but without dropout for the encoder. The dropout was also applied at inference time.

2.2. Neural vocoder

The architecture of the neural vocoder closely follows WaveRNN [16]. The autoregressive part of the network consists of a single forward Gated Recurrent Unit (GRU) with a hidden size of 896 and a pair of affine layers followed by a softmax layer with 1024 outputs to predict 10-bit mu-law samples at a 24 kHz sampling rate. The conditioning network consists of two bidirectional Long Short-Term Memory (LSTM) layers with a hidden size of 128.
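For illustration only, below is a minimal PyTorch-style sketch with the dimensions quoted above (a single forward GRU of size 896, two affine layers, a 1024-way softmax over 10-bit mu-law samples, and a two-layer bidirectional LSTM conditioning network of hidden size 128). The embedding of the previous sample and the upsampling of the conditioning features to the sample rate are assumptions; see [16, 18] for the actual systems.

```python
# Illustrative sketch of a WaveRNN-like vocoder with the dimensions described above
# (not the authors' implementation).
import torch
import torch.nn as nn

class ConditioningNetwork(nn.Module):
    """Two bidirectional LSTM layers (hidden 128) over 80-dim mel-spectrogram frames."""
    def __init__(self, n_mels: int = 80, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=2, bidirectional=True, batch_first=True)

    def forward(self, mels: torch.Tensor) -> torch.Tensor:      # (batch, frames, 80)
        cond, _ = self.lstm(mels)                                # (batch, frames, 256)
        return cond

class SampleRNN(nn.Module):
    """Single forward GRU (hidden 896), two affine layers and a 1024-way softmax
    over 10-bit mu-law samples at 24 kHz."""
    def __init__(self, cond_dim: int = 256, hidden: int = 896, n_classes: int = 1024):
        super().__init__()
        self.sample_embedding = nn.Embedding(n_classes, 256)    # previous-sample embedding (assumed size)
        self.gru = nn.GRU(256 + cond_dim, hidden, batch_first=True)
        self.affine1 = nn.Linear(hidden, hidden)
        self.affine2 = nn.Linear(hidden, n_classes)

    def forward(self, prev_samples: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # prev_samples: (batch, samples) mu-law indices; cond: conditioning upsampled to sample rate
        x = torch.cat([self.sample_embedding(prev_samples), cond], dim=-1)
        h, _ = self.gru(x)
        return self.affine2(torch.relu(self.affine1(h)))         # logits over 1024 mu-law levels
```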
The mel-spectrograms for conditioning consist of 80 coefficients extracted with the Librosa library [17] for frequencies from 50 Hz to 12 kHz. The model was trained on data from 74 speakers in 17 different languages, with between 1k and 2.5k utterances per speaker. Around two thirds of the speakers were female and the remaining third male, plus one child. More details about the vocoder architecture and how it was trained can be found in [18].

3. Experimental protocol

The research questions we attempted to answer were:
1. Can a multi-speaker model with limited data per speaker achieve quality similar to an SPSS-guided unit selection system with a large database?
2. Can we train multi-speaker models with less data for the target speaker than for the supporting speakers?
3. How much data is needed for an SD model to be better than a multi-speaker one?
4. Is it better to combine all the available speakers or only the most similar ones, e.g., only female speakers?
5. Does mixing speakers affect the speaker similarity?
The results of our experiments to answer these questions are presented in Section 4.

3.1. Training data and model stability

The speech data used to train the models came from 7 internal speakers: 2 males, 4 females and one child. The available data for these speakers was 8.5k utterances for four speakers (2 female, one male and the child), 15k for two (one male and one female) and 25k utterances for one female speaker. Out of this, we randomly selected a fixed number of utterances per speaker depending on the model. For each speaker in the model, we used 90% of the utterances for training and 10% for development. The first three columns of Table 1 show the speaker blend and the amount of data used to train each type of model in our experiments.

Table 1: Percentage of correctly generated files

model                   model name   #training utt    % stable
single speaker          sd-8500      8.5k             35.4%
(1 speaker)             sd-15000     15k              46.2%
                        sd-25000     25k              69.3%
female only             fe4-2500     4 × 2.5k         88.3%
(4 speakers)            fe4-5000     4 × 5k           77.33%
                        fe4-8500     4 × 8.5k         77.33%
mix-gender              mx7-2500     7 × 2.5k         54.5%
(7 speakers)            mx7-5000     7 × 5k           93.5%
                        mx7-8500     7 × 8.5k         95.6%
mix-gender unbalanced   mx6+1250     6 × 5k + 1.25k   91.4%
(7 speakers)            mx6+2500     6 × 5k + 2.5k    78.9%

A problem with S2S models is that the attention sometimes gets lost at inference time. This produces errors such as skipping one or more phones, repeating part of the sentence, getting stuck in silences, etc. An analysis of the stability of the models is useful to understand their robustness towards different blends of training data. To measure this, we generated 75 utterances from each speaker on each type of model and, after listening, marked those that presented any of the above-mentioned stability problems. The last column of Table 1 shows the proportion of stable utterances for each model. SD models are clearly much more unstable than multi-speaker ones, regardless of whether the multi-speaker models are female-only or mixed-gender. This result agrees with the comments in [4] about the convergence of multi-speaker models. Model stability does not seem to be directly linked to the amount of training data: the female-only model trained on 2.5k utterances per speaker was more stable than the female-only models trained on more data, and some multi-speaker models are more stable than SD ones despite being trained on less data. The type of problems seems to depend on the speaker, even in the multi-speaker models. All this suggests that stability depends on the characteristics of the data itself, but we could not find any clear pattern for it.
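As a purely hypothetical illustration of how the blends in Table 1 could be assembled, the sketch below draws a fixed number of utterances per speaker and applies the 90%/10% train/development split described above. The function name, data layout and random seed are assumptions, not part of the authors' pipeline.

```python
# Hypothetical sketch of building a training blend such as 'mx6+1250'
# (6 supporting speakers x 5k utterances + 1.25k utterances of the target speaker),
# with a random 90%/10% train/development split per speaker.
import random

def make_blend(utterances_by_speaker, utts_per_speaker, seed=0):
    """utterances_by_speaker: dict speaker -> list of utterance IDs.
    utts_per_speaker: dict speaker -> number of utterances to draw for this blend."""
    rng = random.Random(seed)
    train, dev = [], []
    for speaker, n in utts_per_speaker.items():
        selected = rng.sample(utterances_by_speaker[speaker], n)
        split = int(0.9 * n)
        train += selected[:split]
        dev += selected[split:]
    return train, dev

# Example blend sizes for an unbalanced model (hypothetical speaker names):
# blend_sizes = {"target_speaker": 1250, **{f"support_{i}": 5000 for i in range(6)}}
```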
3.2. Subjective evaluation

To address our research questions, we ran several MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA) tests [19]. The advantage of MUSHRA over Mean Opinion Score (MOS) is that, for each sentence, all the systems being evaluated can be presented simultaneously in one panel. In every test the positions of the systems on the panel were randomised. All the panels included the natural recordings as an upper anchor but, similar to [20], subjects were not forced to assign the top score to any of the systems. All tests were conducted on Amazon Mechanical Turk. Subjects were people living in the United States who define themselves as native English speakers. For each evaluation, we selected sentences of between 5 and 30 words. For all the tests, the significance of the results was analysed with a Wilcoxon signed-rank test and a standard t-test, both with Bonferroni-Holm correction applied [21]; a sketch of this analysis is given at the end of this section.

The main goal of the subjective tests was to evaluate speech quality and speaker similarity. Therefore, for each of the MUSHRA tests we chose only those utterances that did not present any stability problem with any of the systems under consideration.

3.2.1. Naturalness

For the tests on questions 1-4, subjects were asked to "rate the audio samples in terms of their naturalness" with a continuous slider between "completely unnatural" (0) and "completely natural" (100). Each stimulus panel was evaluated by 10 subjects. The sets of sentences used in each experiment were slightly different since, due to stability problems, not all the systems in each test were able to synthesise all the utterances.

In the tests for questions 1 and 2, a guided unit selection system was included among the systems to be evaluated. This guided unit selection was a standard system in which the linguistic cost is combined with an acoustic cost, computed as the distance between the F0, duration and spectrum of the units and those predicted by a state-level DNN model. The models for the acoustic cost were speaker dependent and trained with all the available data for each speaker. At synthesis time, the evaluation sentences were blacklisted so that their units could not be selected. This blacklisting removed less than 0.5% of the unit selection data. Since those tests proved that the guided unit selection was worse than the other systems, we did not include it in the rest of the experiments.

3.2.2. Speaker similarity

For the test on question 5, subjects were asked to "rate whether the speaker of the reference sounds like the same person as the speakers of the samples" between "Definitely a different person" (0) and "Definitely the same person" (100). Subjects were presented with a reference audio from the target speaker (sentence 1) and audio samples of a different sentence (sentence 2) generated by the evaluated models. The recording of sentence 2 by the target speaker was also included as an upper anchor. For each of the seven speakers, we ran an independent MUSHRA test with 10 utterances and the best available SD model for that speaker.
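The significance analysis mentioned above could be run along the following lines. This is a hedged sketch using SciPy and statsmodels; the pairing of scores per stimulus panel and the data layout are assumptions.

```python
# Sketch of the significance analysis from Section 3.2: Wilcoxon signed-rank and
# paired t-tests between all system pairs, with Bonferroni-Holm correction applied
# over the whole set of comparisons.
from itertools import combinations
from scipy.stats import wilcoxon, ttest_rel
from statsmodels.stats.multitest import multipletests

def significance(scores_by_system, alpha=0.05):
    """scores_by_system: dict system name -> array of MUSHRA scores, aligned so that
    index i refers to the same stimulus panel for every system."""
    pairs, p_values = [], []
    for (name_a, a), (name_b, b) in combinations(scores_by_system.items(), 2):
        _, p_wilcoxon = wilcoxon(a, b)
        _, p_ttest = ttest_rel(a, b)
        pairs.append((name_a, name_b))
        p_values.extend([p_wilcoxon, p_ttest])
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=alpha, method="holm")
    return pairs, p_adjusted.reshape(-1, 2), reject.reshape(-1, 2)
```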
4. Results

4.1. Multi-speaker vs unit selection

The first experiment evaluates the naturalness of two multi-speaker models, 'mx7-5000' and 'mx7-2500' (see Table 1), against the guided unit selection. As an additional reference point, we included samples re-synthesised from the original mel-spectrograms with the neural vocoder, 'nv-resynthesis'.

The evaluation consisted of 27 utterances from each of the 7 speakers, resulting in a total of 189 stimulus panels. A total of 70 subjects evaluated 27 panels each. The boxplots of the MUSHRA scores can be seen in Figure 2. All the models were significantly different from each other at p < 0.05. As expected, the recordings and the 'nv-resynthesis' samples achieved the highest scores, followed by 'mx7-5000' and 'mx7-2500'. The difference between 'mx7-2500' and 'mx7-5000' is small but statistically significant. The most surprising result was the comparatively low score of the guided unit selection, despite it being built upon more than 99% of all the available data. Obviously, there were differences between speakers, but they do not correlate with the amount of data of the unit selection system. The rank order of the systems was consistent across speakers. The median MUSHRA scores in Figure 2 also show that the gap between 'nv-resynthesis' and the recordings is very small, despite the vocoder being a generic one trained on multiple speakers in different languages. The main gap is between the models and 'nv-resynthesis', i.e. in the modelling of the mel-spectrograms. Comparatively, the gap due to differences in the amount of training data is smaller.

[Figure 2: Multi-speaker models vs. unit selection]

4.2. Balanced vs unbalanced mixture of speakers

The second experiment evaluated the naturalness of the models of the previous experiment against models trained with 5k utterances from six speakers plus 2.5k or 1.25k utterances from a target speaker, 'mx6+2500' and 'mx6+1250' respectively. We trained one 'mx6+...' model for each speaker and used it only to generate speech with the voice of that speaker. To keep the lower anchor of the previous experiment (Section 4.1), we again added samples produced by the guided unit selection system.

The evaluation consisted of 27 utterances from each of the 7 speakers, evaluated by a total of 70 subjects. Figure 3 shows the results. The ranks of 'unit-selection', 'mx7-2500', 'mx7-5000' and 'recordings' confirm the results of Section 4.1. The 'mx7-2500', 'mx6+1250' and 'mx6+2500' models are not significantly different from each other. This indicates that the benefit of using 5k utterances instead of 2.5k for the non-target speakers is not in terms of quality but in terms of stability, as was shown in Section 3.1. A second interesting result was that, in terms of quality, 1250 utterances from a target speaker mixed with sufficient data from other speakers can generate better speech quality than a state-of-the-art unit selection system. The only exception was the female speakers with more than 15k utterances. There were some other minor differences between speakers, especially in the relative ranking of the two 'mx6' models. However, with the above-mentioned exception, the rank order between systems was consistent across speakers.

[Figure 3: Mixed models with balanced vs unbalanced data]

4.3. Multi-speaker vs speaker dependent

This set of experiments compared SD models with the multi-speaker models 'mx7-5000' and 'mx7-8500' (see Table 1). We trained 'sd-8500' models for all seven speakers, 'sd-15000' for 3 speakers and 'sd-25000' for one speaker, depending on the amount of data available for each speaker. Three separate evaluations were conducted, for the SD models on 8.5k, 15k and 25k utterances.
Unfortunately, out of the seven 'sd-8500' models only 3 (two female and one male) were stable enough to generate samples. To compensate for the lack of data points in the evaluation of the 'sd-8500' models, we used 42 samples for each of the remaining 3 speakers. For the evaluation of the 'sd-15000' models we only had 3 speakers with enough data. As shown in Table 1, these models were also very unstable, especially for one of the speakers, which generated only 24 utterances correctly. To keep the number of utterances per speaker balanced, we evaluated these models with 24 utterances per speaker. Finally, for 'sd-25000' we generated 45 sentences from that speaker's model. To compensate for the lack of data, each MUSHRA panel of the 'sd-25000' evaluation was judged by 15 subjects.

Table 2 shows the median MUSHRA score and the average rank of the systems for the three tests. In the three evaluations, the differences between 'mx7-5000' and 'mx7-8500' were not statistically significant, as can be seen from the small differences in their average rank order. Both multi-speaker models were better than the 'sd-8500' models and worse than the 'sd-25000' model. The 'mx7-8500' model was better than the 'sd-15000' model. These differences were statistically significant. The differences between the 'sd-15000' model and 'mx7-5000' were not significant with the Wilcoxon signed-rank test. These results suggest that, similarly to classical SPSS, an SD model can sound more natural than a multi-speaker one when trained on a sufficient amount of data. However, multi-speaker models are better than SD models when they are trained on more than 2.3 times more data or, alternatively, when the SD model is trained on less than 15 hours. Further work is needed to clarify this last point.

Table 2: Median score and average rank (in parentheses) of the tests comparing multi-speaker vs speaker-dependent models

Evaluated SD model   Recordings    SD            mx7-8500     mx7-5000
sd-8500              71 (1.96)     61 (2.78)     63 (2.61)    62 (2.64)
sd-15000             74 (1.91)     61.5 (2.79)   63 (2.65)    62 (2.65)
sd-25000             77 (1.97)     68 (2.56)     67 (2.73)    66 (2.75)

4.4. Female only vs mixed gender

The last naturalness experiment compared models trained on all 7 speakers against those trained only on the 4 female speakers. The total amount of data was different, but the amount of data per speaker was constant. Figure 4 shows the results. The differences between models trained on different numbers of utterances per speaker were statistically significant, but the differences between models trained on the same amount of data per speaker were not.

[Figure 4: Mix of all speakers vs only female speakers]

4.5. Speaker similarity

Table 3 summarises the results. Each row represents a different speaker. On average, only the differences between the recordings and the other systems were statistically significant at p < 0.05. On a per-speaker basis, the differences for two of the speakers and the rest of the models were statistically significant. However, that significance disappears when the 8500 and 15000 speakers are considered jointly.
Table 3: Average speaker similarity

Evaluated SD model   Recordings   best SD   mx7-8500   mx7-5000   mx7-2500   mx6+1250   mx6+2500
sd-8500              78.3         69.5      70.7       70.2       68.8       70.0       70.1
                     73.0         68.1      76.3       74.7       76.7       71.9       73.0
                     76.5         -         70.7       71.3       71.0       71.3       71.4
                     79.3         -         71.5       73.3       74.6       69.1       73.2
sd-15000             68.1         68.4      65.2       68.2       67.4       68.4       64.5
                     83.4         77.3      82.3       84.3       84.2       82.3       82.3
sd-25000             75.0         72.1      69.9       71.4       70.7       71.8       70.1
Average              76.0         70.7      72.1       73.1       73.2       71.3       72.2

5. Conclusions

This paper presents several experiments aimed at reducing the amount of single-speaker data needed to train high-quality S2S TTS systems. The results show that models trained on a mixture of speakers can produce better quality than a state-of-the-art guided unit selection TTS with an inventory of units ranging between 8.5k and 25k utterances. We show that this is true for S2S models trained on 2.5k utterances from 7 speakers and also for S2S models trained on a mixture of 1.25k utterances from the target speaker and 5k utterances from 6 other speakers. Our results also show that for databases with up to 15k utterances, multi-speaker models produce better quality than speaker-dependent ones. SD models with more data can produce marginally better quality, but in terms of stability SD models are always more unstable. The most probable reason is that, by mixing multiple speakers, the alignment becomes more robust against different pronunciations, wrong sentences or different initialisation values. This also seems to be the case when training on less data from more similar speakers. The different speaker blends do not seem to affect the speech quality, but lower speaker variability seems to impact negatively on the model's stability.

6. References

[1] J. Yamagishi, Z. Ling, and S. King, "Robustness of HMM-based speech synthesis," in Proc. INTERSPEECH, 2008.
[2] J. Yamagishi, T. Kobayashi, Y. Nakano, K. Ogata, and J. Isogai, "Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 1, pp. 66-83, 2009.
[3] J. Yamagishi, B. Usabaev, S. King, O. Watts, J. Dines, J. Tian, R. Hu, Y. Guan, K. Oura, K. Tokuda, R. Karhila, and M. Kurimo, "Thousands of voices for HMM-based speech synthesis," in Proc. INTERSPEECH, pp. 420-423, 2009.
[4] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
[5] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, "Deep Voice 3: 2000-speaker neural text-to-speech," 2017.
[6] S. O. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, A. Ng, J. Raiman, S. Sengupta, and M. Shoeybi, "Deep Voice: Real-time neural text-to-speech," arXiv preprint arXiv:1702.07825, 2017.
[7] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, "Tacotron: Towards end-to-end speech synthesis," in Proc. Interspeech, 2017.
[8] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. J. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," arXiv preprint arXiv:1712.05884, 2017.
[9] S. Arik, G. Diamos, A. Gibiansky, J. Miller, K. Peng, W. Ping, J. Raiman, and Y. Zhou, "Deep Voice 2: Multi-speaker neural text-to-speech," in Advances in Neural Information Processing Systems, 2017, pp. 2966-2974.
[10] J. Lorenzo-Trueba, G. E. Henter, S. Takaki, J. Yamagishi, Y. Morino, and Y. Ochiai, "Investigating different representations for modeling and controlling multiple emotions in DNN-based speech synthesis," Speech Communication, vol. 99, pp. 135-143, 2018.
[11] M. Podsiadlo and V. Ungureanu, "Experiments with training corpora for statistical text-to-speech systems," in Proc. Interspeech 2018, 2018, pp. 2002-2006. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-2400
[12] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. Lopez-Moreno, and Y. Wu, "Transfer learning from speaker verification to multispeaker text-to-speech synthesis," arXiv preprint arXiv:1806.04558, 2018.
[13] Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, "VoiceLoop: Voice fitting and synthesis via a phonological loop," in International Conference on Learning Representations, 2018.
[14] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[15] T. Salimans and D. P. Kingma, "Weight normalization: A simple reparameterization to accelerate training of deep neural networks," arXiv preprint arXiv:1602.07868, 2016.
[16] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, "Efficient neural audio synthesis," arXiv preprint arXiv:1802.08435, 2018.
[17] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, "librosa: Audio and music signal analysis in Python," in Proceedings of the 14th Python in Science Conference, 2015, pp. 18-25.
[18] J. Lorenzo-Trueba, T. Drugman, J. Latorre, R. Barra-Chicote, T. Merritt, B. Putrycz, S. Ronanki, and V. Klimkov, "Robust universal neural vocoding," submitted to ICASSP 2019, 2019.
[19] ITU-R Recommendation BS.1534-1, "Method for the subjective assessment of intermediate sound quality (MUSHRA)," International Telecommunications Union, Geneva, 2001.
[20] T. Merritt, B. Putrycz, A. Nadolski, T. Ye, D. Korzekwa, W. Dolecki, T. Drugman, V. Klimkov, A. Moinet, A. Breen, R. Kuklinski, N. S., and R. Barra-Chicote, "Comprehensive evaluation of statistical speech waveform synthesis," in Proc. SLT, 2018.
[21] R. A. J. Clark, M. Podsiadlo, M. Fraser, C. Mayo, and S. King, "Statistical analysis of the Blizzard Challenge 2007 listening test results," in Proc. Blizzard 2007, 2007.