Investigating context features hidden in End-to-End TTS


Authors: Kohki Mametani, Tsuneo Kato, Seiichi Yamamoto

Department of Intelligent Information Engineering and Sciences, Doshisha University, Kyoto, Japan

ABSTRACT

Recent studies have introduced end-to-end TTS, which integrates the production of context and acoustic features in statistical parametric speech synthesis. As a result, a single neural network replaced laborious feature engineering with automated feature learning. However, little is known about what types of context information end-to-end TTS extracts from text input before synthesizing speech, and the previous knowledge about context features is barely utilized. In this work, we first point out the model similarity between end-to-end TTS and parametric TTS. Based on the similarity, we evaluate the quality of encoder outputs from an end-to-end TTS system against eight criteria that are derived from a standard set of context information used in parametric TTS. We conduct experiments using an evaluation procedure that has been newly developed in the machine learning literature for quantitative analysis of neural representations, while adapting it to the TTS domain. Experimental results show that the encoder outputs reflect both linguistic and phonetic contexts, such as vowel reduction at the phoneme level, lexical stress at the syllable level, and part of speech at the word level, possibly due to the joint optimization of context and acoustic features.

Index Terms — text-to-speech, end-to-end TTS, HTS

1. INTRODUCTION

Statistical parametric speech synthesis [1] has steadily advanced through the history of the annual Blizzard Challenges [2], and the Hidden Markov Model (HMM)-based speech synthesis system (HTS) [3] has been a dominant framework in this approach.
Since the first release of the HTS, acoustic modeling in this approach has markedly improved due to the progress of its generative model from HMM to deep neural network [4, 5] and recurrent neural network, especially long short-term memory (LSTM) [6, 7]. In contrast, little progress can be seen in text analysis, or "front end." Due to the very weak connection between text and speech, the front end extracts context features (also known as linguistic features), which are useful to bridge the gap between the two modalities. Conventionally, a standard set of context features gives a wide range of context information within a given text to an acoustic model, extensively covering phonetic, linguistic, and prosodic contexts [8].

Beyond the partial use of neural networks for an acoustic model, recent studies have introduced fully neural TTS systems, known as end-to-end TTS systems, which can be trained in an end-to-end fashion, requiring only pairs of an utterance and its transcript. These systems have already outperformed parametric TTS systems in terms of naturalness [9]. In addition, acoustic modeling and text processing in a parametric TTS system are integrated into a single neural network. As a result, this singular solution expels the language-specific knowledge used for the configuration of text analysis and the speech-specific tasks needed to build an acoustic model, such as segmenting and aligning audio files, making it significantly easier to develop a new TTS system. Additionally, this allows such models to be conditioned on various attributes such as the speaker's prosodic features [10], enabling a truly joint optimization over context and acoustic features. As shown by the opening of the Blizzard Machine Learning Challenge [11], TTS has partly become a subject of machine learning and is expected to move on to the end-to-end style.

This work was supported by JSPS KAKENHI Grant Number 17K02954.
These advantages, however, often come at the cost of model interpretability. Understanding the internal process of end-to-end TTS systems is difficult because neural networks are generally black boxes, making the functionality of systems based on such models unexplainable to humans. Therefore, model interpretability is essential to establish a more informed research process and improve current systems. In this vein, a unified procedure for quantitative analysis of internal representations in end-to-end neural models has recently been developed. In [12], hidden representations in an end-to-end automatic speech recognition system are thoroughly analyzed with this method, revealing the extent to which a character-based connectionist temporal classification model uses phonemes as an internal representation. Also, in [13], the same evaluation process is applied to analyze internal representations from different layers of a neural machine translation model.

In this work, by adapting this evaluation procedure to the TTS domain, we demonstrate what types of context information are utilized in end-to-end TTS systems. We meta-analytically sort out the eight most important context features from the standard feature set in parametric TTS and use them as criteria for our experiments to quantify how and to what extent encoder outputs correlate with such context features. Unlike speech recognition and machine translation tasks, the performance of TTS systems has been primarily evaluated using subjective tests such as Mean Opinion Score, which often take a lot of time and resources. For this reason, there are benefits to exploring a more convenient and objective evaluation process and investigating its usefulness for the further success of end-to-end TTS research.

2. MODEL SIMILARITY

In spite of the difference in the generative model in use, the way end-to-end TTS synthesizes speech is comparable to the way parametric TTS does, as both approaches are categorized into generative model-based TTS [14]. In the following explanation, we formally describe text input as w = {w_i | i = 1, 2, ..., L} and time-domain speech output as x = {x_j | j = 1, 2, ..., T}, where L is the number of symbols in the text and T is the number of frames of the speech waveform.

Fig. 1 (a) shows a typical speech synthesis process of the HTS, which represents parametric TTS, and it can be mainly divided into three steps. First, a front end extracts linguistic and phonetic contexts as well as prosodic ones at each of the phonemes within the text w and accordingly assigns context features l = {l_i | i = 1, 2, ..., L}. Typically, a context feature l_i is a high-dimensional vector; e.g., a 687-dimensional vector is used in the HTS-2.3.1. Second, an acoustic model generates acoustic features o = {o_j | j = 1, 2, ..., T} for given context features l, estimating features such as spectrum, F0, and duration with individually clustered context-dependent HMMs. Lastly, a vocoder synthesizes a real-time waveform x from the acoustic features o.

Fig. 1 (b) illustrates an end-to-end TTS model that achieves the integration of the production of context and acoustic features with a nonlinear function. Most end-to-end TTS models utilize an attention-based encoder-decoder framework [15], directly mapping text to a speech waveform. The implementation of the framework varies depending on the generative model in its decoder, such as LSTM [9, 16] or causal convolution [17, 18], modeling temporal dynamic behaviors of speech.
First, an encoder encodes the text w, folding context information around each symbol w_i and turning it into the corresponding high-dimensional vector representations h = {h_i | i = 1, 2, ..., L}. This is followed by a decoder with an attention mechanism. Before decoding the encoder outputs, all of h is fed to the attention mechanism. At each decoding step j, the attention produces a single vector c_j, known as the context vector, by computing the weighted sum of the sequence of encoder outputs h_i with weights α_ij (called alignments), where each α_ij is a real value in [0, 1]:

c_j = Σ_{i=1}^{L} α_ij h_i

The context vector summarizes the most important part of the encoder outputs for the current decoding step j. Although the computation of alignments varies depending on the system, the differences are trivial in this scheme. Then, the decoder takes the context vector c_j and the previous decoder output o_{j-1} as input and generates an acoustic feature o_j, and finally a vocoder is used in the same way as in parametric TTS.

This one-to-one comparison between the two models shows that both l in parametric TTS and h in end-to-end TTS play a similar role: converting each symbol in the text w into a high-dimensional vector within the corresponding model. Both represent the context information given to each symbol. This is the focus of our work. Our hypothesis is that the encoder outputs h contain the same types of context information utilized in l of parametric TTS. Moreover, due to the joint optimization with acoustic features, h should embrace extra details that are not seen in l, such as ones caused by articulation, making it a more effective context feature for the overall performance.

3. CONTEXT FEATURES IN PARAMETRIC TTS

In statistical parametric speech synthesis, there are several studies that investigate the quality of the standard set of context features.
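The context-vector computation of Section 2 can be sketched in a few lines of NumPy. The scoring function that produces the alignments varies per system and is abstracted into a plain `scores` argument here; this is an illustrative sketch, not any particular model's code:

```python
import numpy as np

def context_vector(h, scores):
    """Attention-weighted summary of encoder outputs for one decoding step.

    h:      encoder outputs, shape (L, d) -- one vector h_i per input symbol
    scores: unnormalized relevance of each h_i to the current step, shape (L,)
    """
    # Softmax turns the scores into alignments alpha_ij in [0, 1] that sum to 1.
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()
    # c_j = sum_{i=1}^{L} alpha_ij * h_i
    return alpha @ h

h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # L = 3 symbols, d = 2
c = context_vector(h, np.array([0.0, 0.0, 0.0]))    # equal scores -> uniform alignments
# c is the plain average of the three encoder outputs: [2/3, 2/3]
```

Sharper (more peaked) scores move c toward a single h_i, which is how the decoder focuses on one region of the input per step.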
The contribution of higher-level context features, such as part of speech and intonational phrase boundaries, has been studied [19]. This study reveals that features above the word level have no significant impact on the quality of synthesized speech. In [20], a Bayesian network is used to evaluate how each of the 26 commonly used context features in the standard set contributes to several aspects of acoustic features. This revealed the most important context features: those relevant to all three acoustic features (i.e., spectrum, F0, and duration), those relevant to the acoustic features except for spectrum, and those relevant to either F0 or duration. By applying a smaller feature set with the irrelevant context features removed, it is demonstrated that a parametric TTS system with fewer contexts can produce a speech waveform with a quality that is as good as that of the contextually rich system.

Fig. 1: Overview of the speech synthesis process of (a) parametric TTS, consisting of a front end, an acoustic model illustrated as HMMs, and a vocoder, and (b) an end-to-end TTS model, consisting of an encoder, an attention-based decoder illustrated as causal convolution networks, and a vocoder.

As for the representation of positional features, [21] explores the advantages of categorical and relative representations against the absolute representation used in standard models. In that study, four categories are proposed to represent positional values: "beginning" for the first element, "end" for the last element, "one" for segments of length one, and "middle" for all the others. It turns out that a system with the categorical representation generates the best speech quality among the representations. Originally, 11 features were confirmed to be important [20]. The set includes two pairs of positional features, which differ only in whether they count forward or backward.
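The four-way categorical representation of positional values can be written out as a small mapping function. This is a hypothetical helper for illustration, not the code used in [21]:

```python
def position_category(index, length):
    """Categorical positional representation with the four categories of [21].

    index:  0-based position of a segment within its parent unit
    length: number of segments in the parent unit
    """
    if length == 1:
        return "one"        # the unit contains a single segment
    if index == 0:
        return "beginning"  # first segment
    if index == length - 1:
        return "end"        # last segment
    return "middle"         # everything in between

# E.g., the three syllables of a three-syllable word:
labels = [position_category(i, 3) for i in range(3)]
# -> ["beginning", "middle", "end"]
```

Note that the category is the same whether the position is counted forward or backward, which is why this representation collapses each forward/backward pair of positional features into one.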
This difference can be canceled by using the aforementioned categorical representation, resulting in a reduction of two features. In addition, accent is considered synonymous with stress in our work. As a result, the remaining eight features can be summarized as in Table 1.

Table 1: Essential context features for parametric TTS and their cardinalities (Card.). The same ID is given to each feature as in [20].

  ID          Context information                          Card.
  p2          previous phoneme identity                      39
  p3          current phoneme identity                       39
  p4          next phoneme identity                          39
  p6 (= p7)   position of current phoneme in syllable         4
  b1 (= b2)   whether current syllable stressed or not        2
  b4 (= b5)   position of current syllable in word            4
  b16         name of vowel of current syllable              15
  e1          gpos (guess part-of-speech) of current word     8

4. EXPERIMENTS

4.1. Methodology

Recently, a unified procedure for quantitative analysis of internal representations in end-to-end neural models has been developed [22]. In our work, we apply this procedure to analyze the feature representations learned by an encoder in end-to-end TTS. Fig. 2 shows our evaluation process. After training an end-to-end model, we save its learned parameters and create a pre-trained model. Then, we dynamically extract the values from the computational graph of the model in order to collect a number of its encoder outputs. With the extracted representations, we follow the basic process of a multi-class classification task: training a classifier on a simple supervised task using the encoder outputs and then evaluating the performance of the classifier. We assume that if a feature related to the classification task is hidden in the encoder outputs, it will work as evidence for classification, and the classifier's performance will increase. In this manner, the performance of the trained classifier can be used as a proxy for an objective quality of the representations.
Since the procedure assesses only one aspect of such representations per classification task, the choice of criterion with which the classifier classifies its input needs careful consideration. In this preliminary work, we start with the eight contexts in Table 1 as evaluation criteria for the classification and iterate the experiment eight times while changing the criterion and accordingly adjusting the size of the classifier's output.

4.2. Experimental Setup

The end-to-end TTS model used in our experiments is a well-known open-source PyTorch implementation^1 of Baidu's Deep Voice 3 [18]. The model is trained on the LJ Speech Dataset [23], a public-domain speech dataset consisting of 13,100 pairs of a short English speech recording and its transcript. To make the input format correspond to parametric TTS, we build a model that takes only phonemes as input by simply converting the words in the transcripts to their phonetic representations (ARPABET) during a preprocessing step. After training the model, we synthesize speech based on 25,000 short US English sentences from the M-AILABS Speech Dataset [24] while collecting its encoded phoneme representations (encoder outputs). Depending on the classifier's criterion, each encoder output is assigned a correct label. Lexical stress is given by looking up the word in the CMU Pronouncing Dictionary, syllabification of each word is performed using an open-source tool^2, and part-of-speech tags are assigned by a pre-trained POS tagger developed in the Penn Treebank project, using eight coarse-grained fundamental tags (excluding "interjection" because of its scarcity).
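The stress-labeling step can be illustrated with ARPABET stress digits, which the CMU Pronouncing Dictionary attaches to every vowel (1 primary, 2 secondary, 0 unstressed). The two-entry dictionary below is a hand-written stand-in for the real dictionary, used only to make the sketch self-contained:

```python
# Hypothetical two-word excerpt in the style of CMU Pronouncing Dictionary
# entries; a real lookup would consult the full dictionary.
CMUDICT = {
    "permit": ["P", "ER0", "M", "IH1", "T"],  # per-MIT (noun/verb stress may differ)
    "speech": ["S", "P", "IY1", "CH"],
}

def syllable_stress(phones):
    """One binary label per syllable: 1 if stressed (primary or secondary),
    0 otherwise. In ARPABET, each vowel symbol ends with a stress digit and
    marks exactly one syllable, so scanning the vowels enumerates syllables."""
    return [int(p[-1] in "12") for p in phones if p[-1].isdigit()]

syllable_stress(CMUDICT["permit"])  # -> [0, 1]
```

Propagating each syllable's label to the phonemes inside it additionally requires the syllabification step mentioned above, which is omitted here.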
Then, the encoder outputs are split into training and test sets for the classifier in the ratio of 80/20, and finally, we evaluate the classification performance to obtain a quantitative measure of the feature representations with respect to the given contextual criterion. The implementation of the classifier is made to be as simple as the ones suggested in previous studies [12, 13]. The size of the input to the classifier is 128, which is equal to the dimension of the encoder output of the TTS model. Our classifier is a feed-forward neural network with one hidden layer, where the size of the hidden layer is set to 64. This is followed by a ReLU non-linear activation and then a softmax layer mapping onto the label set size, which depends on the cardinality of the context.

^1 Audio samples are available: https://r9y9.github.io/deepvoice3_pytorch/
^2 https://github.com/kylebgorman/syllabify

Fig. 2: Illustration of our evaluation process. After training the encoder and decoder of the end-to-end TTS, we (i) extract encoder outputs (e.g., h_3) and (ii) train a supervised classifier on a certain task using the extracted representations and evaluate its performance.

4.3. Results

4.3.1. Evaluation of phoneme identities (p2, p3, p4)

Phoneme identity is the most primitive feature in speech synthesis. As the context in which a phoneme occurs affects the speech sound, neighboring phonemes have conventionally been taken into account. We would like to understand what kinds of phonemes are more remarkable and how the identities of neighboring phonemes affect the representations of current phonemes. The overall accuracy of classification of previous, current, and next phoneme identities was 73.1%, 84.0%, and 67.1%, respectively. This suggests that a previous phoneme affects the representation of a current phoneme slightly more than the next one does. For more details, Fig. 3 (a) shows the accuracy per appearance frequency of each phoneme.
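The probe classifier of Section 4.2 (a one-hidden-layer feed-forward network, 128 -> 64 -> cardinality) can be sketched as a forward pass. NumPy is used here for illustration, with randomly initialized weights standing in for trained parameters; the paper's actual implementation follows [12, 13]:

```python
import numpy as np

rng = np.random.default_rng(0)

def probe_forward(h, W1, b1, W2, b2):
    """Probe forward pass: 128 -> 64 (ReLU) -> cardinality (softmax)."""
    z = np.maximum(h @ W1 + b1, 0.0)                  # hidden layer + ReLU
    logits = z @ W2 + b2
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)          # class probabilities

# Dimensions from the paper: encoder output 128, hidden layer 64; the output
# size equals the criterion's cardinality, e.g. 2 for lexical stress (b1).
card = 2
W1, b1 = rng.standard_normal((128, 64)) * 0.1, np.zeros(64)
W2, b2 = rng.standard_normal((64, card)) * 0.1, np.zeros(card)

p = probe_forward(rng.standard_normal((5, 128)), W1, b1, W2, b2)  # 5 encoder outputs
assert p.shape == (5, card) and np.allclose(p.sum(axis=1), 1.0)
```

Training such a probe on the 80/20 split and reading off its test accuracy gives the per-criterion numbers reported in the results below.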
The general trend is that the classification accuracy clearly drops at each vowel, even when such phonemes appear fairly frequently. In fact, the prediction accuracy for the encoder outputs derived from consonants was 88.1% on average, while it was 70.7% for those from vowels. The result makes sense phonetically. In speech, the acoustic quality of vowels is sometimes perceived as weakening because of the physical limitations of the speech organs (e.g., the tongue), which cannot move fast enough to deliver a full-quality vowel. Vowel reduction is only seen in spoken language, but the effect appears here in encoded "text" as the drops in the representation quality of vowels. This can be considered a result of the joint optimization, which passes the quality of acoustic features back to the encoder outputs.

Fig. 3: Upper: (a) classification accuracy of phonemes per appearance frequency; phonemes that make up 90% of total phoneme appearances in the training data are displayed. From bottom left to right: (b) confusion matrix for the name of the vowel of the current syllable, (c) confusion matrix for POS tagging, and (d) comparison of prediction accuracy of positional features at the levels of syllable and word.

4.3.2. Evaluation of syllable features (b1, b16)

English is a stress-timed language, so stress is a prominent syllable-level feature in English TTS systems. Even though lexical stress in English is truly unpredictable and must be memorized along with the pronunciation of an individual word, we found that the trained classifier was able to attain 86.3% accuracy on whether an encoder output was derived from a phoneme in a stressed syllable. The result shows that lexical stress is fairly influential in the encoder outputs, but it is also probably confused with a different level of stress (e.g., prosodic stress), resulting in a reduction in accuracy.
In relation to stress, which is carried by the properties of a vowel, it is interesting to examine the presence of a vowel at the syllable level. In Fig. 3 (b), we plot a confusion matrix for classification of the vowel identity in the current syllable. The classifier gave a mere 63.8% accuracy on this task. This result is attributed to the same trend as the phoneme-level contexts, where vowels are less prominent than consonants, while the accuracy drops at rarely observed phonemes (i.e., AW, OY, UH) can be ignored.

4.3.3. Evaluation of POS tagging (e1)

Part of speech (POS) is a commonly used higher-level feature that associates acoustic modeling with the grammatical structure of a given sentence. In Fig. 3 (c), we plot a confusion matrix for POS tagging results. While tags for pronouns, determiners, and conjunctions are correctly classified without trouble, much of the misclassification can be seen among nouns, verbs, adjectives, and adverbs. This follows from the fact that many words among such parts of speech look alike on the surface. For example, there are denominal adjectives and verbs that are derived from a noun and only differ in their suffix (e.g., wood - wooden). This syntactic and phonemic resemblance causes the encoder outputs of phonemes within such words to be more like each other, making them hard to classify.

4.3.4. Evaluation of positional features (b4, p6)

It is important to recognize the position of each symbol at multiple levels because a rise or fall in speech quality due to pitch often occurs at linguistic and phonetic boundaries (boundary tone). Fig. 3 (d) compares the prediction accuracy of the phoneme positions in a syllable with that of the syllable positions in a word.
Regarding the higher accuracy of the syllable positions at the end of words than at the beginning of words, a probable explanation is that speech quality changes frequently at the end of a sentence (i.e., a group of words), such as in interrogative sentences, and this makes encoder outputs in the syllables near the end of words more distinctive than others. Also, we found a reduction in accuracy at the middle of the phoneme positions in a syllable. This is possibly because vowels, which are less distinctive in the representations, are likely to be located near the middle of a syllable (nucleus).

5. CONCLUSIONS

In this work, we investigated how and what types of context information are used in an end-to-end TTS system by comparing its feature representations with the contexts used in parametric TTS. Our experiments revealed that the contexts that play an important role in parametric TTS were also remarkable in the encoder outputs of end-to-end TTS. Furthermore, it turned out that encoder outputs embrace more detailed information about various levels of context features. The main factors behind such effects are the joint optimization of context and acoustic features as well as the generative model that captures long-term structure. This work provides a unique viewpoint for understanding state-of-the-art speech synthesis. The insights gained in this work will be helpful in developing new strategies for the augmentation of an encoder, conditioning it more effectively on various contexts.

6. REFERENCES

[1] Alan W Black, Heiga Zen, and Keiichi Tokuda, "Statistical parametric speech synthesis," Speech Communication, vol. 51, no. 11, pp. 1039-1064, 2009.

[2] Simon King, "Measuring a decade of progress in text-to-speech," Loquens, vol. 1, no. 1, 2014.

[3] "HTS," http://hts.sp.nitech.ac.jp/.
[4] Heiga Zen, Andrew Senior, and Mike Schuster, "Statistical parametric speech synthesis using deep neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 7962-7966.

[5] Heiga Zen and Andrew Senior, "Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 3844-3848.

[6] Yuchen Fan, Yao Qian, Fenglong Xie, and Frank K Soong, "TTS synthesis with bidirectional LSTM based recurrent neural networks," in INTERSPEECH, 2014, pp. 1964-1968.

[7] Heiga Zen and Hasim Sak, "Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4470-4474.

[8] HTS Working Group, "An example of context-dependent label format for HMM-based speech synthesis in English," http://www.cs.columbia.edu/~ecooper/tts/lab_format.pdf, 2015.

[9] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, and Rif A. Saurous, "Tacotron: Towards end-to-end speech synthesis," arXiv preprint arXiv:1703.10135, 2017.

[10] Yuxuan Wang, Daisy Stanton, Yu Zhang, R. J. Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, and Rif A. Saurous, "Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis," arXiv preprint arXiv:1803.09017, 2018.

[11] Kei Sawada, Keiichi Tokuda, Simon King, and Alan W. Black, "The Blizzard Machine Learning Challenge 2017," in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2017, pp. 331-337.
[12] Yonatan Belinkov and James Glass, "Analyzing hidden representations in end-to-end automatic speech recognition systems," arXiv preprint arXiv:1709.04482, 2017.

[13] Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass, "What do neural machine translation models learn about morphology?," arXiv preprint arXiv:1704.03471, 2017.

[14] Heiga Zen, "Generative model-based text-to-speech synthesis," invited talk given at the CBMM workshop on speech representation, perception and recognition, 2017.

[15] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.

[16] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, R. J. Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu, "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," arXiv preprint arXiv:1712.05884, 2017.

[17] Hideyuki Tachibana, Katsuya Uenoyama, and Shunsuke Aihara, "Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention," arXiv preprint arXiv:1710.08969, 2017.

[18] Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan Ö. Arık, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller, "Deep Voice 3: Scaling text-to-speech with convolutional sequence learning," arXiv preprint arXiv:1710.07654, 2017.

[19] Oliver Watts, Junichi Yamagishi, and Simon King, "The role of higher-level linguistic features in HMM-based speech synthesis," in INTERSPEECH, 2010, pp. 841-844.

[20] Heng Lu and Simon King, "Using Bayesian networks to find relevant context features for HMM-based speech synthesis," in INTERSPEECH, 2012, pp. 1143-1146.
[21] Rasmus Dall, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, and Keiichi Tokuda, "Redefining the linguistic context feature set for HMM and DNN TTS through position and parsing," in INTERSPEECH, 2016, pp. 2851-2855.

[22] Yonatan Belinkov, "On internal language representations in deep learning: An analysis of machine translation and speech recognition," Massachusetts Institute of Technology (Doctoral Dissertation), 2018.

[23] Keith Ito, "The LJ Speech Dataset," https://keithito.com/LJ-Speech-Dataset/, 2017.

[24] Munich Artificial Intelligence Laboratories, "The M-AILABS Speech Dataset," http://www.m-ailabs.bayern/en/the-mailabs-speech-dataset/, 2018.
