Investigation of Using Disentangled and Interpretable Representations for One-shot Cross-lingual Voice Conversion

Seyed Hamidreza Mohammadi, Taehwan Kim
ObEN, Inc.
hamid@oben.com, taehwan@oben.com

Abstract

We study the problem of cross-lingual voice conversion in the non-parallel speech corpora and one-shot learning setting. Most prior work requires either parallel speech corpora or a sufficient amount of training data from the target speaker. In contrast, we convert arbitrary sentences of an arbitrary source speaker into the target speaker's voice given only one target speaker training utterance. To achieve this, we formulate the problem as learning disentangled speaker-specific and context-specific representations and follow the idea of [1], which uses a Factorized Hierarchical Variational Autoencoder (FHVAE). After training the FHVAE on multi-speaker data, given arbitrary source and target speakers' utterances, we estimate these latent representations and then reconstruct the desired utterance converted to the target speaker's voice. We investigate the effectiveness of the approach by conducting voice conversion experiments with a varying number of training utterances, and the model achieves reasonable performance with even just one training utterance. We also examine the speech representation and show that the World vocoder outperforms the Short-time Fourier Transform (STFT) used in [1]. Finally, in subjective tests for both within-language and cross-lingual voice conversion, our approach achieves significantly better or comparable results compared to VAE-STFT and GMM baselines in speech quality and similarity.

Index Terms: voice conversion, one-shot learning, cross-lingual, variational autoencoder

1. Introduction

Voice Conversion (VC) [2, 3] is the task of converting a source speaker's spoken sentences into the target speaker's voice. It requires preserving not only the target speaker's identity but also the phonetic context spoken by the source speaker. Many approaches have been proposed to tackle this problem [4, 5, 6]. However, most prior work requires a parallel spoken corpus and a sufficient amount of data to learn the target speaker's voice. Recently, approaches were proposed for voice conversion with non-parallel corpora [7, 8, 9], but they still require that the speaker identities be known a priori, i.e., included in the training data for the model.

Recently, Hsu et al. [1] proposed to use disentangled and interpretable representations to overcome these limitations by exploiting the Factorized Hierarchical Variational Autoencoder. They achieved reasonable quality with just a single utterance from a target speaker, but the quality was still not satisfactory. Moreover, most prior work focuses on voice conversion within one language. We believe that if we can capture disentangled representations of phonetic or linguistic contexts and speaker identities, the model should also handle the more challenging cross-lingual setting, in which the source and target speakers speak different languages. Therefore, we focus on investigating cross-lingual voice conversion, following the same spirit as Hsu et al. [1] while improving the performance. Our contributions are:

• We investigate different feature representations for spoken utterances by considering Mel-cepstrum (MCEP) features and other acoustic features, and achieve better results compared to baselines.
• We examine the effect of the number of training utterances from the source and target speakers, and demonstrate that with just a few, or even one, utterances we are able to achieve reasonable performance.

• We conduct cross-lingual voice conversion experiments, and our approach achieves significantly better or comparable results than baselines in speech quality and similarity in subjective tests.

2. Related Work

Voice conversion has been an important research problem for over a decade. One popular approach is spectral conversion, using for instance Gaussian mixture models (GMMs) [4] or deep neural networks (DNNs) [5]. However, these methods require a parallel spoken corpus, and dynamic time warping (DTW) is usually used to align source and target utterances. To overcome this limitation, non-parallel voice conversion approaches were proposed, for instance eigenvoice [6], i-vector [10], and Variational Autoencoder [7, 9] based models. However, the eigenvoice based approach [6] still requires reference speakers to train the model, and the VAE based approaches [7, 9] require speaker identities to be known a priori and included in the training data for the model. The i-vector based approach [10] looks promising and remains to be studied further: the i-vectors are converted by replacing the source latent variable with the target latent variable, the Gaussian mixture means are reconstructed from the converted i-vector, and the Gaussians with adjusted means are then applied to the source vector to perform the acoustic feature conversion. A Siamese autoencoder has also been proposed for decomposing speaker identity and linguistic embeddings [11]; however, this approach requires parallel training data to learn the decomposing architecture, and the decomposition is achieved by applying similarity and dissimilarity costs between the Siamese branches.

Cross-lingual voice conversion is an even more challenging task, since the target language is not known at training time, and only a few approaches have been proposed, including a GMM based approach [12] and an eigenvoice based approach [13]; these still have the inherent limitations described above.

Recently, deep generative models have been applied successfully to unsupervised learning tasks. They include the Variational Autoencoder (VAE) [14], Generative Adversarial Networks (GAN) [15], and auto-regressive models [16, 17]. Among them, the VAE can infer latent codes from data and generate data from them by jointly learning inference and generative networks, and it has also been applied to voice conversion [7, 9]. However, in those models, speaker identities are not inferred from data; instead, they must be known at model training time. GANs have also been exploited for non-parallel voice conversion [18] with the cycle consistency constraint [19], but they still have the limitation that the target speaker must be known at training time and the model must be trained for each target. To uncover the disentangled and interpretable structure of latent codes, several works were proposed, namely DC-IGN [20], InfoGAN [21], β-VAE [22], and FHVAE [1].
These approaches to uncovering disentangled representations may help voice conversion with very limited resources from the target speaker, since speaker identity information might be inferred from data without supervision, as illustrated in FHVAE [1]. However, the quality of the converted voices was not good enough; therefore, we focus on the FHVAE model structure and investigate how to improve it, even in the more challenging cross-lingual voice conversion setting.

3. Model

The Variational Autoencoder (VAE) [14] is a powerful model for uncovering hidden representations and generating new data samples. Let observations be x and latent variables be z. In the VAE, the encoder (or inference network) q_φ(z|x) outputs z given input x, and the decoder p_Φ(x|z) generates data x given z. The encoder and decoder are neural networks. Training is done by maximizing the variational lower bound (also called the evidence lower bound):

$$\ell(\Phi, \phi) = \mathbb{E}_q[\log p_\Phi(x, z)] - \mathbb{E}_q[\log q_\phi(z \mid x)] = \log p_\Phi(x) - D_{KL}\big(q_\phi(z \mid x) \,\|\, p_\Phi(z \mid x)\big),$$

where D_KL is the Kullback-Leibler divergence.

However, the VAE assumes no structure for the latent variable z. Assuming structure for z can be beneficial for exploiting the inherent structure in data. Here we describe the Factorized Hierarchical Variational Autoencoder (FHVAE) proposed by Hsu et al. [1]. Let a dataset D consist of N_seq i.i.d. sequences X^i, where each sequence X^i consists of N^i_seg observation segments X^{i,j}. We then define factorized latent variables: the latent segment variable Z_1^{i,j} and the latent sequence variable Z_2^{i,j}. In the context of voice conversion, Z_1^{i,j} is responsible for generating the phonetic context and Z_2^{i,j} for the speaker identity. To generate a segment X^{i,j}, we first sample Z_2^{i,j} from an isotropic Gaussian centered at μ^i, which is shared across the entire sequence, and sample Z_1^{i,j} independently. We then generate X^{i,j} conditioned on Z_1^{i,j} and Z_2^{i,j}. Thus, the joint probability for a sequence X^i is:

$$p_\Phi(X^i, Z_1^i, Z_2^i, \mu^i) = p_\Phi(\mu^i) \prod_{j=1}^{N^i_{seg}} p_\Phi(X^{i,j} \mid Z_1^{i,j}, Z_2^{i,j})\, p_\Phi(Z_1^{i,j})\, p_\Phi(Z_2^{i,j} \mid \mu^i).$$

This is illustrated in Figure 1.

[Figure 1: Structures of the Variational Autoencoder (upper) and the Factorized Hierarchical Variational Autoencoder (lower).]

For inference, we use variational inference to approximate the true posterior:

$$q_\phi(Z_1^i, Z_2^i, \mu^i \mid X^i) = q_\phi(\mu^i) \prod_{j=1}^{N^i_{seg}} q_\phi(Z_1^{i,j} \mid X^{i,j}, Z_2^{i,j})\, q_\phi(Z_2^{i,j} \mid X^{i,j}).$$

Since the sequence variational lower bound decomposes into segment variational lower bounds, we can train on batches of segments instead of whole sequences, maximizing:

$$\ell(\Phi, \phi; X^{i,j}) = \ell(\Phi, \phi; X^{i,j} \mid \tilde{\mu}^i) + \frac{1}{N^i_{seg}} \log p_\Phi(\tilde{\mu}^i) + \text{const},$$

$$\begin{aligned}
\ell(\Phi, \phi; X^{i,j} \mid \tilde{\mu}^i) = {}& \mathbb{E}_{q_\phi(Z_1^{i,j}, Z_2^{i,j} \mid X^{i,j})}\big[\log p_\Phi(X^{i,j} \mid Z_1^{i,j}, Z_2^{i,j})\big] \\
& - \mathbb{E}_{q_\phi(Z_2^{i,j} \mid X^{i,j})}\big[D_{KL}\big(q_\phi(Z_1^{i,j} \mid X^{i,j}, Z_2^{i,j}) \,\|\, p_\Phi(Z_1^{i,j})\big)\big] \\
& - D_{KL}\big(q_\phi(Z_2^{i,j} \mid X^{i,j}) \,\|\, p_\Phi(Z_2^{i,j} \mid \tilde{\mu}^i)\big),
\end{aligned}$$

where $\tilde{\mu}^i$ is the posterior mean of μ^i. Please refer to Hsu et al. [1] for more details. Additionally, Hsu et al. proposed a discriminative segment variational lower bound that encourages Z_2^i to be more sequence-specific by adding a term for inferring the sequence index i from Z_2^{i,j}. For our experiments, we use this FHVAE model with a sequence-to-sequence model [24] as the encoder-decoder structure for sequential data.

To perform voice conversion, we compute the average Z_2 from the training utterance(s) of the source and target speakers. For a given input utterance, we compute its Z_1 and Z_2. There are then two ways to perform the conversion. First, we can replace the Z_2 values of the source speaker with the average Z_2 of the target speaker; this approach produced overly muffled output. Second, we can compute a difference vector between the source and target averages, Z_2^diff = Z_2^trg - Z_2^src, add it to the Z_2 of the input utterance as Z_2^converted = Z_2 + Z_2^diff, and then decode with the FHVAE to obtain the converted speech features. Based on an informal listening test, we chose the second approach, since it produced significantly higher quality speech.
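The difference-vector conversion step amounts to a few lines of array code. The following is a minimal sketch of our own, not the authors' implementation: it assumes a trained FHVAE encoder has already produced one Z_2 vector per segment (arrays of shape [n_segments, 32], matching the latent size used later in Section 4.2); the function names and toy data are placeholders.

```python
import numpy as np

def speaker_embedding(z2_segments: np.ndarray) -> np.ndarray:
    """Average per-segment latent sequence variables Z2 (shape
    [n_segments, dim]) into a single speaker embedding."""
    return z2_segments.mean(axis=0)

def convert_z2(z2_input: np.ndarray,
               z2_source_avg: np.ndarray,
               z2_target_avg: np.ndarray) -> np.ndarray:
    """Shift the input utterance's Z2 by the source-to-target difference
    vector: Z2_converted = Z2 + (Z2_trg - Z2_src), as in Section 3."""
    z2_diff = z2_target_avg - z2_source_avg
    return z2_input + z2_diff

# Toy example with random 32-dimensional segment embeddings (placeholders
# for encoder outputs of real utterances).
rng = np.random.default_rng(0)
z2_src = rng.normal(size=(120, 32))  # segments of source training utterance(s)
z2_trg = rng.normal(size=(80, 32))   # segments of target training utterance(s)
z2_in = rng.normal(size=(150, 32))   # segments of the utterance to convert

z2_converted = convert_z2(z2_in,
                          speaker_embedding(z2_src),
                          speaker_embedding(z2_trg))
print(z2_converted.shape)  # (150, 32); decoded together with Z1 by the FHVAE decoder
```

Note that this variant shifts every segment's Z_2 by the same offset, which preserves the within-utterance variation of Z_2 while moving its mean toward the target speaker.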
4. Experiments

4.1. Datasets

We used the TIMIT corpus [25], a multi-speaker speech corpus, as the training data for the FHVAE model, using the training speakers suggested by the corpus. For English test speakers, we select speakers from the TIMIT test partition. We also use a proprietary Chinese speech corpus (hereafter referred to as CH) with 5200 speakers, each uttering one sentence, and we consider training the model on the combination of TIMIT and CH as well. For Chinese test speakers, we use speakers from the THCHS-30 speech corpus [26]. To observe the effect of having more utterances per speaker but fewer speakers, we also train the model on the VCTK corpus [?]. Finally, for objective testing (which requires parallel data), we used four CMU ARCTIC voices (BDL, SLT, RMS, CLB) [27]. As speech features, we used 40th-order MCEPs (excluding the energy coefficient, dimensionality D=39), extracted using the World toolkit [28] with a 5 ms frame shift. All audio files are converted to 16 kHz, 16-bit before analysis.

4.2. Experimental setting

For the encoder and decoder of the FHVAE model, we use a Long Short-Term Memory (LSTM) [29] layer with 256 hidden units as the first layer, topped with a fully-connected layer. We use 32 dimensions for each latent variable Z_1 and Z_2. The models are trained with stochastic gradient descent using a mini-batch size of 256. The Adam optimizer [30] is used with β_1 = 0.95, β_2 = 0.999, ε = 10^-8, and an initial learning rate of 10^-4. The model is trained for 500 epochs, and we select the model that performs best on the development set.

From now on, we use the abbreviation VAE for the FHVAE model. In our experiments, we consider three models: GMM (GMM MAP adaptation [4]), VAE-STFT (using STFT for speech analysis/synthesis [1]), and VAE (using World for speech analysis/synthesis [28]). We consider four gender conversions (F: female, M: male): F2F, F2M, M2F, M2M. We also consider four cross-language conversions (E: English, Z: Chinese): E2E, E2Z, Z2E, Z2Z. Voice conversion samples are available at: https://shamidreza.github.io/is18samples
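To make the feature setup of Section 4.1 concrete, below is a minimal analysis/synthesis sketch using the pyworld Python bindings of the World vocoder. This is our own illustration under stated assumptions, not the authors' exact pipeline: the file path is a placeholder, and pyworld's coded spectral envelope is used here as a stand-in for the 40th-order MCEP extraction described above.

```python
import numpy as np
import pyworld as pw
import soundfile as sf

FRAME_PERIOD_MS = 5.0   # 5 ms frame shift, as in Section 4.1
N_COEFFS = 40           # 40-dimensional coded envelope (MCEP-like stand-in)

# Load a 16 kHz mono utterance (path is a placeholder).
wav, fs = sf.read("utterance.wav")
wav = wav.astype(np.float64)

# World analysis: F0 contour, smoothed spectral envelope, aperiodicity.
f0, t = pw.harvest(wav, fs, frame_period=FRAME_PERIOD_MS)
sp = pw.cheaptrick(wav, f0, t, fs)
ap = pw.d4c(wav, f0, t, fs)

# Low-dimensional coding of the spectral envelope.
coded_sp = pw.code_spectral_envelope(sp, fs, N_COEFFS)

# ... the coded frames would be converted by the FHVAE model here ...

# Resynthesis from (possibly converted) features.
fft_size = (sp.shape[1] - 1) * 2
sp_hat = pw.decode_spectral_envelope(coded_sp, fs, fft_size)
converted = pw.synthesize(f0, sp_hat, ap, fs, frame_period=FRAME_PERIOD_MS)
sf.write("converted.wav", converted, fs)
```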
4.3. Visualizing embeddings

In this experiment, we investigate the speaker embeddings Z_2 by visualizing them in Figure 2. We use 10 test speakers from the TIMIT test set (red data points for males, blue for females) and 10 test speakers from THCHS-30 (orange/green data points for males, light blue for females), with VAE models trained on TIMIT (top), the CH corpus (middle), and TIMIT+CH (bottom). In Figure 2, the left subplots show speaker embeddings computed from 1 sentence and the right subplots from 5 sentences, projected to 2D using PCA.

[Figure 2: Visualization of speaker embeddings. Each point represents a single utterance, and colors denote speaker/language groups: blue dots are English females, light blue are Chinese females, red dots are English males, and orange dots are Chinese males. See Section 4.3 for details.]

In all subplots, the female and male embedding clusters are clearly separated. Furthermore, the plots show that the speaker embeddings of the same speaker fall near the same location, although when 5 utterances are used to compute the embedding the variation is visibly smaller than when only one sentence is used. This shows the sensitivity of the speaker embedding computed by the model to sentence variation. It is also interesting that when the combined TIMIT+CH corpus is used for training, the speaker embeddings are further apart, suggesting a better model property. One phenomenon we notice is that the speaker embeddings for different languages and genders fall in different locations. This shows that the embeddings are still language-dependent, which might suggest that the network learns to use phonetic information to encode some language information within Z_2 as an additional factor.

Furthermore, we investigate the phonetic context embedding Z_1 of one sentence for four test speakers with the TIMIT-trained VAE. The phonetic context matrices over the computed utterances (compressed using PCA) are shown in Figure 3. Ideally, the matrices should be close to each other, since the phonetic context embedding is supposed to be speaker-independent. The figure shows the closeness of the embeddings at similar time frames. There are still some minor discrepancies between the embeddings, which leaves room for further improvement of the model architecture and/or a larger speech corpus.

[Figure 3: Visualization of the phonetic context embedding sequence of the sentence "She had your dark suit in greasy wash water all year", aligned across two female speakers (top) and two female speakers (bottom). The embeddings are transformed to 2D using PCA.]

4.4. Effect of training data size

We investigate the effect of the VC training data size on the performance of the system. To run an objective test using mel-CD [4], we require parallel data from the speakers, so we use 20 parallel CMU ARCTIC utterances from each speaker to compute the objective score. We vary the number of non-parallel sentences from the source and target speakers that are used to compute the speaker embeddings. The results are shown in Figure 4. As can be seen, VAE performs better with fewer than 10 sentences; however, with more than 10 sentences, GMM starts achieving lower mel-CD. This might be because VAE has only one degree of freedom (the speaker identity vector) for converting the voice, whereas GMM can use all the training data to adapt the background GMM to better match the target speaker's data distribution.

[Figure 4: Effect of varying the number of training sentences from 1 to 100 on mel-CD (dB) for GMM and VAE conversion.]
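For reference, mel-cepstral distortion is commonly computed frame-by-frame on time-aligned parallel MCEPs (energy coefficient excluded). The sketch below follows that common definition; it is our illustration under the assumption that the frame sequences are already aligned (e.g., by DTW), not necessarily the exact variant used for the numbers above, and the toy arrays are placeholders.

```python
import numpy as np

def mel_cd(mcep_converted: np.ndarray, mcep_target: np.ndarray) -> float:
    """Mean mel-cepstral distortion in dB between two aligned MCEP
    sequences of shape [n_frames, n_coeffs] (energy coefficient excluded)."""
    diff = mcep_converted - mcep_target
    # Per-frame distortion: (10 / ln 10) * sqrt(2 * sum_d diff_d^2)
    frame_cd = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(frame_cd))

# Toy example with 39-dimensional MCEPs (dimensionality from Section 4.1).
rng = np.random.default_rng(0)
target = rng.normal(size=(200, 39))
converted = target + 0.05 * rng.normal(size=(200, 39))
print(f"mel-CD: {mel_cd(converted, target):.2f} dB")
```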
4.5. Subjective evaluation

To subjectively evaluate voice conversion performance, we performed two perceptual tests. The first test measured speech quality and was designed to answer the question "how natural does the converted speech sound?"; the second test measured speaker similarity and was designed to answer the question "how accurately does the converted speech mimic the target speaker?". The listening experiments were carried out on Amazon Mechanical Turk with participants who had approval ratings of at least 90% and were located in North America. Both perceptual tests included three trivial-to-judge trials, added to exclude unreliable listeners from the statistical analysis; no listeners were flagged as unreliable in our experiments. In this subjective experiment, we focus on the VAE trained on TIMIT; we also provide samples of the VAE trained on TIMIT, CH, TIMIT+CH, and VCTK on the samples webpage. In informal listening tests, we found that the VAE trained on TIMIT performs better than the VAE trained on VCTK, and that the VAE trained on TIMIT+CH generates better quality than the one trained on TIMIT or CH alone.

4.5.1. Speech quality

To evaluate the speech quality of the converted utterances, we conducted a Comparative Mean Opinion Score (CMOS) test. In this test, listeners heard two stimuli A and B with the same content, generated from the same source speaker but under two different processing conditions, and were then asked to indicate whether they thought B was better or worse than A on a five-point scale: +2 (much better), +1 (somewhat better), 0 (same), -1 (somewhat worse), -2 (much worse). We randomized the order of stimulus presentation, both within each A/B pair and across comparison pairs. We used three processing conditions: GMM, VAE-STFT, and VAE.

We ran two separate experiments. First, to assess the effect of using the World vocoder instead of STFT, we directly compared VAE-STFT vs. VAE, limiting this experiment to English-to-English conversion. The experiment was administered to 40 listeners, each judging 16 sentence pairs. The results show a strong preference for VAE over VAE-STFT, with a mean score of +1.25 ± 0.12 toward VAE; planned one-sample t-tests against a mean of zero gave p < 0.0001. Second, we assessed the effect of the VC approach by directly comparing GMM vs. VAE utterances. This experiment was administered to 40 listeners, each judging 80 sentence pairs. The results show that VAE has a statistically significant quality improvement over GMM, with a mean score of +0.61 ± 0.14 toward VAE. The language breakdown of the results is shown in Figure 5. Planned one-sample t-tests against a mean of zero gave p < 0.05 for each language and gender conversion pair separately, showing statistically significant improvements for all breakdowns of gender and language. The Z2E conversion achieves lower quality than the other conversion pairs; we speculate the reason is the slight noise present in the THCHS-30 recordings, which causes some distortion during vocoding.

[Figure 5: Speech quality average score (CMOS, GMM vs. VAE) with gender and language breakdowns. Positive scores favor VAE. Confidence intervals for the "all" scores are close to 0.13, and all scores are statistically significant.]
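The statistics reported above (mean CMOS with a confidence interval, and a planned one-sample t-test against zero) can be computed from a vector of per-pair ratings with standard tooling. The sketch below is only an illustration of that analysis; the ratings are made-up placeholder values, not the study's data.

```python
import numpy as np
from scipy import stats

# Hypothetical CMOS ratings on the five-point scale (-2 .. +2), one per
# A/B pair; positive values mean the listener preferred the VAE condition.
cmos_scores = np.array([2, 1, 1, 0, 2, 1, -1, 1, 2, 0, 1, 1, 2, 1, 0, 1])

mean = cmos_scores.mean()
# 95% confidence interval via the t distribution.
ci = stats.t.interval(0.95, df=len(cmos_scores) - 1,
                      loc=mean, scale=stats.sem(cmos_scores))
# Planned one-sample t-test against a mean of zero (no preference).
t_stat, p_value = stats.ttest_1samp(cmos_scores, popmean=0.0)

print(f"CMOS: {mean:+.2f}, 95% CI [{ci[0]:+.2f}, {ci[1]:+.2f}], p = {p_value:.4f}")
```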
[Figure 6: Speech similarity average score with language breakdown for GMM and VAE. Positive scores are desirable. Comparisons between GMM and VAE scores do not show statistical significance.]

4.5.2. Speaker similarity

To evaluate the speaker similarity of the converted utterances, we conducted a same-different speaker similarity test [31]. In this test, listeners heard two stimuli A and B with different content and were asked to indicate whether they thought A and B were spoken by the same speaker or by two different speakers, on a five-point scale: +2 (definitely same), +1 (probably same), 0 (unsure), -1 (probably different), -2 (definitely different). One stimulus in each pair was created by one of the two conversion methods, and the other was a purely MCEP-vocoded condition used as the reference speaker. Listeners were explicitly instructed to disregard the language of the stimuli and to judge only whether the utterances came from the same speaker. Half of the pairs used a reference speaker identical to the target speaker of the conversion (ideally eliciting "same" responses); the other half used a reference speaker of the same gender but not identical to the target speaker (ideally eliciting "different" responses). We only report the "same" scores. The experiment was administered to 40 listeners, each judging 64 sentence pairs. The results are shown in Figure 6: GMM and VAE achieve -0.18 ± 0.15 and -0.12 ± 0.16, respectively. We did not find any statistically significant difference between the GMM and VAE systems, either on average or for any of the language/gender conversion breakdowns. For both VAE and GMM, the E2E case achieves the highest average score and is the only case able to transform identity, achieving p < 0.05 in a one-sample t-test against chance. This is reasonable given that training was done only on an English corpus. Furthermore, Z2Z achieves a higher score than E2Z and Z2E, which might be due to listeners' bias toward not rating cross-language utterance pairs as highly as same-language pairs.

5. Conclusions

We proposed to exploit the FHVAE model for challenging non-parallel and cross-lingual voice conversion, even with a very small number of training utterances, down to a single target speaker utterance. We investigated the importance of the speech representation and found that the World vocoder outperformed the STFT used in [1] in our experimental evaluation, in both speech quality and similarity. We also examined the effect of the number of training utterances from the target speaker: our approach outperformed the baseline with fewer than 10 sentences and achieved reasonable performance even with only one training utterance. In the subjective tests, our approach achieved significantly better results than both VAE-STFT and GMM in speech quality, and it outperformed VAE-STFT and was comparable to GMM in speaker similarity. As future work, we are interested in joint end-to-end learning with WaveNet [17] or a WaveNet vocoder [32, 33], and in building models trained on multi-language corpora with or without explicit modeling of different languages, such as providing a language code.

6. References

[1] W.-N. Hsu, Y. Zhang, and J. Glass, "Unsupervised learning of disentangled and interpretable representations from sequential data," in Advances in Neural Information Processing Systems, 2017, pp. 1876–1887.
[2] Y. Stylianou, "Voice transformation: A survey," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2009, pp. 3585–3588.
[3] S. H. Mohammadi and A. Kain, "An overview of voice conversion systems," Speech Communication, vol. 88, pp. 65–82, 2017.
[4] T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222–2235, 2007.
[5] S. Desai, A. W. Black, B. Yegnanarayana, and K. Prahallad, "Spectral mapping using artificial neural networks for voice conversion," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 954–964, 2010.
[6] Z. Wu, T. Kinnunen, E. S. Chng, and H. Li, "Mixture of factor analyzers using priors from non-parallel speech for voice conversion," IEEE Signal Processing Letters, vol. 19, no. 12, pp. 914–917, 2012.
[7] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, "Voice conversion from non-parallel corpora using variational auto-encoder," in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2016, pp. 1–6.
[8] P. Song, W. Zheng, and L. Zhao, "Non-parallel training for voice conversion based on adaptation method," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6905–6909.
[9] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, "Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks," arXiv preprint arXiv:1704.00849, 2017.
[10] T. Kinnunen, L. Juvela, P. Alku, and J. Yamagishi, "Non-parallel voice conversion using i-vector PLDA: Towards unifying speaker verification and transformation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5535–5539.
[11] S. H. Mohammadi and A. Kain, "Siamese autoencoders for speech style extraction and switching applied to voice identification and conversion," in Proceedings of Interspeech, 2017, pp. 1293–1297.
[12] B. Ramani, M. A. Jeeva, P. Vijayalakshmi, and T. Nagarajan, "A multi-level GMM-based cross-lingual voice conversion using language-specific mixture weights for polyglot synthesis," Circuits, Systems, and Signal Processing, vol. 35, no. 4, pp. 1283–1311, 2016.
[13] M. Charlier, Y. Ohtani, T. Toda, A. Moinet, and T. Dutoit, "Cross-language voice conversion based on eigenvoices," in Proceedings of Interspeech, 2009.
[14] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.
[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[16] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu, "Pixel recurrent neural networks," arXiv preprint, 2016.
[17] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
[18] T. Kaneko and H. Kameoka, "Parallel-data-free voice conversion using cycle-consistent adversarial networks," arXiv preprint arXiv:1711.11293, 2017.
[19] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," arXiv preprint arXiv:1703.10593, 2017.
[20] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum, "Deep convolutional inverse graphics network," in Advances in Neural Information Processing Systems, 2015, pp. 2539–2547.
[21] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, "InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets," in Advances in Neural Information Processing Systems, 2016, pp. 2172–2180.
[22] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, "beta-VAE: Learning basic visual concepts with a constrained variational framework," 2016.
[23] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, "An introduction to variational methods for graphical models," Machine Learning, vol. 37, no. 2, pp. 183–233, 1999.
[24] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[25] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon Technical Report N, vol. 93, 1993.
[26] D. Wang and X. Zhang, "THCHS-30: A free Chinese speech corpus," arXiv preprint arXiv:1512.01882, 2015.
[27] J. Kominek and A. W. Black, "The CMU ARCTIC speech databases," in Fifth ISCA Workshop on Speech Synthesis, 2004.
[28] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.
[29] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[30] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[31] A. B. Kain, "High resolution voice transformation," 2001.
[32] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, "Speaker-dependent WaveNet vocoder," in Proceedings of Interspeech, 2017, pp. 1118–1122.
[33] T. Hayashi, A. Tamamori, K. Kobayashi, K. Takeda, and T. Toda, "An investigation of multi-speaker training for WaveNet vocoder," in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2017, pp. 712–718.