PREDICTING EXPRESSIVE SPEAKING STYLE FROM TEXT IN END-TO-END SPEECH SYNTHESIS

Daisy Stanton, Yuxuan Wang, RJ Skerry-Ryan
Google, Inc.
1600 Amphitheatre Parkway, Mountain View, CA 94043

ABSTRACT

Global Style Tokens (GSTs) are a recently-proposed method to learn latent disentangled representations of high-dimensional data. GSTs can be used within Tacotron, a state-of-the-art end-to-end text-to-speech synthesis system, to uncover expressive factors of variation in speaking style. In this work, we introduce the Text-Predicted Global Style Token (TP-GST) architecture, which treats GST combination weights or style embeddings as "virtual" speaking style labels within Tacotron. TP-GST learns to predict stylistic renderings from text alone, requiring neither explicit labels during training nor auxiliary inputs for inference. We show that, when trained on a dataset of expressive speech, our system generates audio with more pitch and energy variation than two state-of-the-art baseline models. We further demonstrate that TP-GSTs can synthesize speech with background noise removed, and corroborate these analyses with positive results on human-rated listener preference audiobook tasks. Finally, we demonstrate that multi-speaker TP-GST models successfully factorize speaker identity and speaking style. We provide a website with audio samples¹ for each of our findings.

Index Terms: TTS, disentangled representations, generative models, sequence-to-sequence models, prosody

1. INTRODUCTION

A major challenge for modern text-to-speech (TTS) research is developing models that can produce a natural-sounding speaking style for a given piece of text input. Part of the challenge is that many factors contribute to "natural-sounding" speech, including high audio fidelity, correct pronunciation, and what is known as good prosody.
Prosody includes low-level characteristics such as pitch, stress, breaks, and rhythm, and impacts speaking style, which describes higher-level characteristics such as emotional valence and arousal. Prosody and style are particularly difficult to model, as they encompass information typically not specified in text: there are many different, yet valid, renderings of the same piece of text. Additionally, while considerable effort has been spent modeling such renderings using annotations, explicit labels are difficult to define precisely, costly to acquire, noisy in nature, and don't necessarily correlate with perceptual quality.

¹ https://google.github.io/tacotron/publications/text_predicting_global_style_tokens

Tacotron [1] is a state-of-the-art speech synthesis system that computes its output directly from graphemes or phonemes. Like many modern TTS systems, it learns an implicit model of prosody from statistics of the training data alone. It can learn, for example, to inflect English phrases ending in a question mark with a rise in pitch. As noted in [2], however, synthesizing long-form expressive datasets (such as audiobooks) presents a challenge, since wide-ranging voice characteristics are collapsed into a single, "averaged" model of prosodic style. While [2] learns disentangled factors of speaking style within Tacotron, it requires either audio or manually-selected weights at inference time to generate output.

Given all of the above, of particular interest would be a speech synthesis system that not only learns to represent a wide range of speaking styles, but that can synthesize expressive speech without the need for auxiliary inputs at inference time. In this work, we aim to do just that. Our main contribution is a pair of extensions to Global Style Tokens (GSTs) [2] that predict speaking style from text. The two alternative prediction pathways are easy to implement and require no additional labels.
We show that, like baseline GST models, our system can capture speaker-independent factors of variation, including speaking style and background noise. We provide audio samples, analysis, and results showing that our models are significantly preferred in subjective evaluations.

2. RELATED WORK

Attempts to model prosody and speaking style span more than three decades in the statistical speech synthesis literature. These methods, however, have largely required explicit annotations, which pose the difficulties discussed in Section 1. INTSINT [3], ToBI [4], Momel [5], landmark detection [6], Tilt [7], and SLAM [8] all describe methods to annotate or classify prosodic features such as breaks, intonation, rhythm, and melody. Notable among these is AuToBI [9], which automatically detects and classifies these features, but which requires models pretrained on labeled data to do so.

Substantial effort has also gone into modeling emotion, but these methods, too, have traditionally required keywords, semantic representations, or labels for model training. Recent examples include [10], [11], [12], [13], [14], and [15]. [16] explores various methods to predict acoustic features such as i-vectors [17] from semantic embeddings. These methods rely on a complex set of hand-designed features, however, and require training three models in separate steps (the acoustic feature predictor, a neutral-prosody synthesis model, and a speaker-adaptation model).

The recently-published VAE-Loop [18] aims to learn speaking style variations by conditioning VoiceLoop [19], an autoregressive speech synthesis model, on the global latent variable output by a conditional variational autoencoder (VAE). In inference mode, however, the latent variable z still needs to be fed into the model to achieve control.
Furthermore, while z is expected to acquire latent representations of global speaking styles, the experimental analysis ([18], Section 4.5) and audio samples [20] suggest that z has primarily learned speaker gender and identity rather than prosody or speaking style.

3. MODEL

Our model is based on an augmented version of Tacotron [1], a recently proposed state-of-the-art speech synthesis model that predicts mel spectrograms directly from grapheme or phoneme sequences. The augmented version we use is the Global Style Token (GST) [2] architecture, which adds to Tacotron a spectrogram reference encoder [21], a style attention module, and a style embedding for conditioning. During training, the style attention learns to represent the reference encoder output (called the prosody embedding) as a convex combination of trainable embeddings called style tokens. These are shared across all utterances in the training set, and capture global variation in the data, hence the name Global Style Tokens. We call the convex combination of style tokens the style embedding.

Our proposed architecture, which we call "Text-Predicted Global Style Tokens" (TP-GST), adds two possible text-prediction pathways to a GST-enhanced Tacotron. These allow the system to predict style embeddings at inference time by either:

1. interpolating the GSTs learned during training, using combination weights predicted only from the text ("TPCW"); or
2. directly predicting style embeddings from text features, ignoring style tokens and combination weights ("TPSE").

Using operators to stop gradient flow, the two text-prediction pathways can be trained jointly. At inference time, the model can be run as a TPCW-GST, as a TPSE-GST, or (by supplying auxiliary inputs) as a traditional GST-Tacotron. We describe each of the two text prediction pathways in more detail below.

3.1. Text features

Both TP-GST pathways use as features the output of Tacotron's text encoder.
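To make the style-embedding computation above concrete, the sketch below builds a convex combination of style tokens in NumPy. It is our own minimal illustration, not the paper's code: the dimensions, the random weights, and the single-head dot-product attention (the paper uses 4-headed additive attention) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the paper's single-speaker model uses 20 tokens.
num_tokens, token_dim, query_dim = 20, 256, 128

# Style tokens are trainable and pass through tanh (see [2]); here random.
style_tokens = np.tanh(rng.standard_normal((num_tokens, token_dim)))
prosody_embedding = rng.standard_normal(query_dim)  # reference encoder output

# Single-head dot-product attention as a stand-in for multi-head additive.
W_q = rng.standard_normal((query_dim, token_dim))
scores = (prosody_embedding @ W_q) @ style_tokens.T
weights = np.exp(scores - scores.max())
weights /= weights.sum()            # softmax -> convex combination weights

style_embedding = weights @ style_tokens  # the "style embedding"
print(weights.sum(), style_embedding.shape)
```

Because the weights are a softmax output, they are non-negative and sum to one, so the style embedding always lies inside the convex hull of the tokens.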
The text encoder output is computed by an encoder submodule called a CBHG [1], which explicitly models local and contextual information in the input sequence. A CBHG consists of a bank of 1-D convolutional filters, followed by highway networks [22] and a bidirectional Gated Recurrent Unit (GRU) [23] recurrent neural net (RNN). Since the text encoder outputs a variable-length sequence, the first step of TP-GST is to pass this sequence through a 64-unit time-aggregating GRU-RNN, and use its final output as a fixed-length text feature vector. The GRU-RNN acts as a summarizer for the text encoder in much the same way as the 128-unit GRU-RNN in [21] acts as a summarizer for the reference encoder; both time-aggregate their variable-length input and output a fixed-length summary. The fixed-length text features are used as input in both text prediction pathways, discussed next.

3.2. Predicting Combination Weights (TPCW)

In the GST-augmented Tacotron, a reference signal's prosody embedding serves as the query to an attention mechanism over the style tokens, and the resulting values, normalized via a softmax activation, serve as the combination weights. As illustrated in Figure 1a, the simpler version of our model treats these GST combination weights as a prediction target during training. We call this system TPCW-GST, to stand for "text-predicted combination weights". Note that since the style attention and style tokens are updated via backpropagation, the GST combination weights form moving targets during training. To learn to predict these weights, we feed the fixed-length text features from the time-aggregating GRU-RNN to a fully-connected layer. The outputs of this layer are treated as logits, and we compute the cross-entropy loss between these values and the (target) combination weights output by the style attention module.
We stop the gradient flow to ensure that text prediction error doesn't backpropagate through the GST layer, and add the cross-entropy result to the final Tacotron loss. At inference time, the style tokens are fixed, and this pathway can be used to predict the token combination weights from text features alone.

Fig. 1. TP-GST architectures. (a) TPCW-GST architecture, the first of two possible prediction pathways. This pathway uses the GST combination weights as targets during training, and adds an additional cross-entropy(Y, Ŷ) term to the Tacotron loss function. (b) TPSE-GST architecture, the second of two possible prediction pathways. This pathway treats the GST style embedding as a target during training, and adds an additional L1(Y, Ŷ) term to the Tacotron loss function.
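The TPCW pathway described above reduces to a linear layer over the fixed-length text features plus a cross-entropy loss against the attention's soft targets. The following NumPy sketch is our own approximation: the weights are random placeholders, the soft targets stand in for the style attention's output, and the stop-gradient is modeled implicitly by treating those targets as constants.

```python
import numpy as np

rng = np.random.default_rng(1)
num_tokens, feat_dim = 20, 64  # 64-unit text-feature GRU (paper); 20 tokens

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Fixed-length text features (final state of the time-aggregating GRU-RNN).
text_features = rng.standard_normal(feat_dim)

# Fully-connected layer producing logits over the style tokens.
W = rng.standard_normal((feat_dim, num_tokens)) * 0.1
b = np.zeros(num_tokens)
predicted_weights = softmax(text_features @ W + b)

# Target combination weights come from the style attention; they are soft
# (moving) targets, and gradients are stopped on this side during training.
target_weights = softmax(rng.standard_normal(num_tokens))

# Cross-entropy between prediction and soft targets, added to the loss.
tpcw_loss = -np.sum(target_weights * np.log(predicted_weights + 1e-8))
print(round(float(tpcw_loss), 3))
```

Using cross-entropy against a soft target distribution (rather than a one-hot label) is what lets the pathway track the attention's moving targets during joint training.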
Fig. 2. (a) F0 and log-C0 of an audiobook phrase, synthesized using three tokens from a single-speaker TP-GST Tacotron. (b) Mel-scale spectrograms of the same phrase corresponding to each token. See text for details.

3.3. Predicting Style Embeddings (TPSE)

Figure 1b illustrates a second, alternative prediction pathway. We call this system TPSE-GST, to stand for "text-predicted style embeddings". This version of the model feeds the text features through one or more fully-connected layers, and outputs a style embedding prediction directly. We train this pathway using an L1 loss between the predicted (TPSE-GST) and target (GST) style embeddings. As is done for a TPCW-GST, we stop the gradient flow to ensure that text prediction error doesn't backpropagate through the GST layer. We use ReLU activations for the hidden fully-connected layers, and a tanh activation on the output layer that emits the text-predicted style embedding. This is intended to match the style token tanh activation (see [2], Section 3.2.2), which, in turn, is chosen to match the GRU tanh activation of the final bidirectional RNN in the text encoder CBHG [1]. As in [2], this choice leads to better token variation. In inference mode, this pathway can be used to predict the style embedding directly from text features.
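As a minimal sketch of the TPSE pathway, the forward pass is just an MLP with ReLU hidden layers and a tanh output, trained with an L1 loss. Again this is our own NumPy illustration with placeholder weights; the hidden size of 64 follows the paper's experiments, while the 256-dimensional embedding is an assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
feat_dim, hidden, embed_dim = 64, 64, 256  # embed_dim assumed

text_features = rng.standard_normal(feat_dim)

# One ReLU hidden layer, then a tanh output layer, per the TPSE description.
W1 = rng.standard_normal((feat_dim, hidden)) * 0.1
W2 = rng.standard_normal((hidden, embed_dim)) * 0.1
h = np.maximum(0.0, text_features @ W1)   # ReLU hidden layer
predicted_style = np.tanh(h @ W2)         # tanh output matches token range

# L1 loss against the (gradient-stopped) GST style embedding target;
# the target here is a random stand-in for the attention's output.
target_style = np.tanh(rng.standard_normal(embed_dim))
tpse_loss = np.mean(np.abs(predicted_style - target_style))
print(predicted_style.shape, round(float(tpse_loss), 3))
```

The tanh output keeps predictions in the same [-1, 1] range as the tanh-activated style tokens, so the predicted embedding can substitute directly for the attention-derived one at inference time.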
Note that the model completely ignores the style tokens in this mode, since they are not needed: they are only used to compute the style embedding prediction target during training.

4. EXPERIMENTS

In this section, we evaluate the performance of synthesis using TP-GST; we examine both single- and multi-speaker models. As is common for generative models, objective metrics often do not correlate well with perception [24]. While we use visualizations for some experiments below, we strongly encourage readers to listen to the audio samples provided on our demo page.

4.1. Single Speaker Experiments

Our single-speaker TP-GST model is trained on 147 hours of American English audiobook data. The books are read by the 2013 Blizzard Challenge speaker, Catherine Byers, in an animated and emotive storytelling style. Some books contain very expressive character voices with high dynamic range, which are challenging to model.

The model uses 20 Global Style Tokens with 4-headed additive attention, and predicts both TP-GST targets (TPCW and TPSE) during training. While the number of hidden layers in the TPSE-GST pathway is configurable (see Figure 1b), these experiments use a single hidden layer of size 64.

Fig. 3. The same phrase, unseen during training, synthesized using a baseline Tacotron, TPCW-GST, and TPSE-GST. Notice that the baseline's "declining pitch" problem is fixed by both text-prediction systems.

We
train all models with a minibatch size of 32 using the Adam optimizer [25], and perform evaluations at about 250,000 steps.

4.1.1. Style Token Variation

Token variation is key for a GST model to be able to represent a large range of expressive styles in speech data. As in a GST-Tacotron, we can verify that a TP-GST system learns a rich set of style tokens during training. Figure 2 shows the fundamental frequency (F0) and smoothed average power (log-C0) of three different style tokens learned from this system, superimposed to highlight their variation. It also shows spectrograms generated by conditioning the model on each token in turn. These plots visualize what can be heard on our samples page: different tokens capture variation in pitch, energy, and speaking rate. As in a standard GST-Tacotron, conditioning the model on a particular token will result in the same F0 and C0 trend relative to the others, independent of the text input.

4.1.2. Synthesis with Style Prediction

In addition to verifying token variation, we can compare synthesis output to that of a vanilla Tacotron model. This allows us to contrast what two Tacotron-based systems can predict from text alone. Note that we choose the vanilla Tacotron as a baseline, since comparing TP-GST to a GST-Tacotron is not apples to apples: a GST-Tacotron requires either a reference signal or a manual selection of style token weights at inference time.

Figure 3 shows F0 contours and mel spectrograms generated by a baseline Tacotron model and both pathways of a TP-GST model (20 tokens, 4 heads). Each depicts the same audiobook phrase, unseen during training. We see that the TP-GST model yields a more varied F0 contour and richer spectral detail. This example also highlights a point noted in [1], which is that baseline Tacotron models trained on expressive speech can result in synthesis with a continuously declining pitch (green curve).
Like a GST-Tacotron, we see that the TP-GST model fixes this problem, but without needing a reference signal for inference.

4.1.3. Subjective Evaluation

To evaluate the quality of this method at scale, we provide side-by-side subjective test results of TP-GST synthesis versus a baseline Tacotron. The evaluation data used for this test was a set of 260 sentences from an audiobook unseen during training, including many long phrases. In each test, raters listened to the same sentence synthesized by both a baseline Tacotron and one of the TP-GST systems. They then evaluated the pair on a 7-point Likert scale ranging from "much worse than" to "much better than"; each comparison received 8 scores from different raters. The results are shown in Table 1. In both subjective tests, raters preferred the text-predicted style enhancements over the Tacotron baseline.

                 PREFERENCE (%)                 P-VALUE
          Baseline  Neutral  TP-GST      3-point     7-point
  TPCW     18.6%     22.6%    58.8%      < 10^-37    < 10^-35
  TPSE     17.5%     24.7%    57.8%      < 10^-38    < 10^-37

Table 1. Subjective preference (%) of 8 raters on 260 audiobook phrases. Each row reports preferences for a baseline Tacotron vs one of the TP-GST systems. t-test p-values are given for both a 3-point and 7-point rating system.

                 PREFERENCE (%)                 P-VALUE
  TPCW-GST   Neutral   TPSE-GST          3-point     7-point
    25.1%     45.4%      29.4%            0.063       0.054

Table 2. Subjective preference (%) of 8 raters on 260 audiobook phrases for TPCW-GST vs TPSE-GST. t-test p-values are given for both a 3-point and 7-point rating system.

Table 2 shows subjective test results comparing TPCW-GST versus TPSE-GST synthesis. The evaluation sentences, audio clips, and instructions for this test are identical to those used for the previous side-by-side results.
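The p-values in these tables come from t-tests over the pooled comparison scores. The sketch below is a hand-rolled one-sample t-test on simulated 7-point scores (the score distribution is invented for illustration; the real ratings are not published), showing how a preference skewed toward TP-GST yields a large t statistic and hence a vanishingly small p-value at this sample size.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated 7-point comparison scores for TP-GST vs baseline
# (-3 = "much worse", +3 = "much better"); 260 phrases x 8 raters,
# matching the evaluation size. The probabilities are our invention.
scores = rng.choice(np.arange(-3, 4), size=260 * 8,
                    p=[.02, .05, .11, .23, .27, .20, .12])

# One-sample t-test against zero (the "no preference" null), by hand:
# t = mean / standard error.
n = scores.size
t_stat = scores.mean() / (scores.std(ddof=1) / np.sqrt(n))
print(round(float(t_stat), 2))
```

With roughly 2,000 scores, even a modest mean preference produces a t statistic in the double digits, which is consistent with the extremely small p-values reported in Table 1.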
These results show that, as expected, raters did not have a strong preference between TPCW-GST and TPSE-GST.

4.1.4. Automatic Denoising

About 10% of the recordings used to train the model for this experiment contain audible high-frequency background noise. This noise is reproduced in many of the 260 baseline Tacotron utterances synthesized for the subjective evaluations above. By contrast, as can be heard on our samples page, both text-prediction pathways completely remove this background noise. This small empirical finding suggests that TP-GST models can not only separate clean speech from background noise (as demonstrated in [2]), but that they can do so without needing a manually-identified "clean" style token at inference time. While we did not measure the total number of utterances with and without noise, about 6% of rater comments mentioned this effect; we also provide a number of examples on our audio samples page.

4.2. Multiple Speaker Experiments

We also present results from a multi-speaker TP-GST system. For these experiments, we use the multi-speaker Tacotron architecture described in [21], conditioning the model on a 64-dimensional embedding of the input speaker's identity. For training data, we use 190 hours of American English speech, read by 22 different female speakers. Importantly, the 22 datasets include both expressive and non-expressive speech: to the expressive audiobook data from Section 4.1 (147 hours) we add 21 high-quality proprietary datasets, spoken with neutral prosody. These contain 8.7 hours of long-form news and web articles (20 speakers), and 34.2 hours of assistant-style speech (one speaker).

The multi-speaker TP-GST uses 40 tokens of dimension 252, and a 6-headed style attention. We train with a minibatch size of 32 using the Adam optimizer [25], and perform evaluations at about 250,000 steps.

4.2.1.
Shared Multi-Speaker Style Tokens

Note that, while the multi-speaker TP-GST conditions on speaker identity, style tokens are shared by all speakers. As with the single-speaker models, we can condition on individual tokens at inference time to uncover the factors of variation the model has learned. Figure 4 shows F0, log-C0, and spectrograms generated by conditioning on each of three learned style tokens. Here the model uses the expressive audiobook voice, synthesizing speech from an audiobook unseen during training. As in the single-speaker case, these plots demonstrate that different tokens capture variation in prosodic factors such as pitch, energy, and speaking rate. This effect can be heard clearly on our samples page, where individual tokens are synthesized for multiple speakers. These examples show that the learned tokens capture a variety of styles and, at the same time, that the model preserves speaker identity. Importantly, conditioning on individual tokens results in style variation even when synthesizing with the neutral-speech voices, despite the fact that these datasets have very little dynamic range for the model to learn.

4.2.2. Synthesis with Style Prediction

Our audio demo page also includes examples of generating text-predicted style from this multi-speaker TP-GST. As expected, the Tacotron baseline and TP-GST both generate audio with limited dynamic range when conditioned on the prosodically neutral voice IDs. When synthesizing with the expressive audiobook voice, however, the multi-speaker TP-GST model yields more expressive speech than a multi-speaker Tacotron conditioned on the same data. While we did not run evaluations comparing these models, we encourage listeners to verify this result for themselves.
The samples also reveal that the "expressive" multi-speaker TP-GST voice produces similarly expressive speech to that of the single-speaker voice from Section 4.1, which was trained on the same audiobook data.

Fig. 4. (a) F0 and log-C0 of an audiobook phrase, synthesized using three tokens from a multi-speaker text-prediction GST-Tacotron. (b) Mel-scale spectrograms of the same phrase corresponding to each token. See text for details.

5. CONCLUSIONS AND DISCUSSION

In this work, we have shown that a Text-Predicting Global Style Token model can learn to predict speaking style from text alone, requiring no explicit style labels during training, or input signals at inference time. We have demonstrated that TP-GSTs can synthesize audiobook speech in a manner preferred by human raters over baseline Tacotron, and that multi-speaker TP-GST models can learn a shared style space while still preserving speaker identity for synthesis.

Future research will explore multi-speaker models more fully, examining how well TP-GSTs can learn factorized representations across genders, accents, and languages. We also plan to investigate larger textual context for prediction, and would like to learn style representations for both finer-grained and hierarchical temporal resolutions.

Finally, while this work has only investigated style prediction as part of Tacotron, we believe that TP-GSTs can benefit other TTS models, too.
Traditional TTS systems, for example, can use predicted style embeddings as labels, and end-to-end TTS systems can integrate our architecture directly. More generally, we envision that TP-GSTs can be applied to other conditionally generative models that aim to reconstruct a high-dimensional signal from underspecified input.

6. ACKNOWLEDGEMENTS

The authors thank the Machine Hearing, Google Brain and Google TTS teams for their helpful discussions and feedback.

7. REFERENCES

[1] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, "Tacotron: Towards end-to-end speech synthesis," in Proc. Interspeech, Aug. 2017, pp. 4006–4010. [Online]. Available: https://arxiv.org/abs/1703.10135

[2] Y. Wang, D. Stanton, Y. Zhang, R. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, F. Ren, Y. Jia, and R. A. Saurous, "Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis," International Conference on Machine Learning, 2018. [Online]. Available: https://arxiv.org/abs/1803.09017

[3] D. J. Hirst, "La représentation linguistique des systèmes prosodiques: une approche cognitive," Ph.D. dissertation, Aix-Marseille 1, 1987.

[4] K. Silverman, M. Beckman, J. Pitrelli, M. Ostendorf, C. Wightman, P. Price, J. Pierrehumbert, and J. Hirschberg, "ToBI: A standard for labeling English prosody," in Second International Conference on Spoken Language Processing, 1992.

[5] D. Hirst and R. Espesser, "Automatic modelling of fundamental frequency using a quadratic spline function," Travaux de l'Institut de Phonétique d'Aix, vol. 15, pp. 71–85, 1993. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.40.3623

[6] S. A. Liu, "Landmark detection for distinctive feature-based speech recognition," The Journal of the Acoustical Society of America, vol. 100, no. 5, pp.
3417–3430, 1996.

[7] P. Taylor, "The Tilt intonation model," in ICSLP. International Speech Communication Association, 1998.

[8] N. Obin, J. Beliao, C. Veaux, and A. Lacheret, "SLAM: Automatic stylization and labelling of speech melody," in Speech Prosody, 2014, pp. 246–250.

[9] A. Rosenberg, "AuToBI: a tool for automatic ToBI annotation," in Interspeech, 2010, pp. 146–149. [Online]. Available: http://eniac.cs.qc.cuny.edu/andrew/autobi/

[10] J. Lee and I. Tashev, "High-level feature representation using recurrent neural network for speech emotion recognition," in Interspeech 2015, September 2015.

[11] Z.-Q. Wang and I. Tashev, "Learning utterance-level representations for speech emotion and age/gender recognition using deep neural networks," in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 5150–5154.

[12] S. Khorram, Z. Aldeneh, D. Dimitriadis, M. G. McInnis, and E. M. Provost, "Capturing long-term temporal dependencies with convolutional networks for continuous emotion recognition," CoRR, vol. abs/1708.07050, 2017.

[13] S. Latif, R. Rana, J. Qadir, and J. Epps, "Variational autoencoders for learning latent representations of speech emotion," CoRR, vol. abs/1712.08708, 2017.

[14] J. Lorenzo-Trueba, G. E. Henter, S. Takaki, J. Yamagishi, Y. Morino, and Y. Ochiai, "Investigating different representations for modeling and controlling multiple emotions in DNN-based speech synthesis," Speech Communication, vol. 99, pp. 135–143, 2018.

[15] J. Deng, X. Xu, Z. Zhang, S. Fruhholz, and B. Schuller, "Semisupervised autoencoders for speech emotion recognition," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 26, no. 1, pp. 31–43, 2018.

[16] I. Jauk, "Unsupervised learning for expressive speech synthesis," Ph.D. dissertation, Universitat Politècnica de Catalunya, 2017.

[17] N. Dehak, P. J. Kenny, R. Dehak, P.
Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.

[18] K. Akuzawa, Y. Iwasawa, and Y. Matsuo, "Expressive speech synthesis via modeling expressions with variational autoencoder," CoRR, vol. abs/1804.02135, 2018. [Online]. Available: https://arxiv.org/abs/1804.02135

[19] Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, "Voice synthesis for in-the-wild speakers via a phonological loop," CoRR, vol. abs/1707.06588, 2017.

[20] K. Akuzawa, Y. Iwasawa, and Y. Matsuo, "VAELoopDemo: audio samples generated by VAE-Loop," https://akuzeee.github.io/VAELoopDemo/, 2018.

[21] R. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. J. Weiss, R. Clark, and R. A. Saurous, "Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron," International Conference on Machine Learning, 2018. [Online]. Available: https://arxiv.org/abs/1803.09047

[22] R. K. Srivastava, K. Greff, and J. Schmidhuber, "Highway networks," in International Conference on Machine Learning: Deep Learning Workshop, 2015. [Online]. Available: http://arxiv.org/abs/1505.00387

[23] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," ACL, 2014, pp. 1724–1734.

[24] L. Theis, A. v. d. Oord, and M. Bethge, "A note on the evaluation of generative models," arXiv preprint arXiv:1511.01844, 2015.

[25] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.