CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion

Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Nobukatsu Hojo
NTT Communication Science Laboratories, NTT Corporation, Japan

ABSTRACT

Non-parallel voice conversion (VC) is a technique for learning the mapping from source to target speech without relying on parallel data. This is an important task, but it has been challenging due to the disadvantages of the training conditions. Recently, CycleGAN-VC has provided a breakthrough and performed comparably to a parallel VC method without relying on any extra data, modules, or time alignment procedures. However, there is still a large gap between the real target and converted speech, and bridging this gap remains a challenge. To reduce this gap, we propose CycleGAN-VC2, which is an improved version of CycleGAN-VC incorporating three new techniques: an improved objective (two-step adversarial losses), an improved generator (2-1-2D CNN), and an improved discriminator (PatchGAN). We evaluated our method on a non-parallel VC task and analyzed the effect of each technique in detail. An objective evaluation showed that these techniques help bring the converted feature sequence closer to the target in terms of both global and local structures, which we assess by using Mel-cepstral distortion and modulation spectra distance, respectively. A subjective evaluation showed that CycleGAN-VC2 outperforms CycleGAN-VC in terms of naturalness and similarity for every speaker pair, including intra-gender and inter-gender pairs.¹

Index Terms: Voice conversion (VC), non-parallel VC, generative adversarial networks (GANs), CycleGAN, CycleGAN-VC

1. INTRODUCTION

Voice conversion (VC) is a technique for transforming the non/para-linguistic information of given speech while preserving the linguistic information.
VC has great potential for application to various tasks, such as speaking aids [1, 2] and the conversion of style [3, 4] and pronunciation [5]. Successful approaches to VC include statistical methods based on Gaussian mixture models (GMMs) [6, 7, 8]; neural network (NN)-based methods using restricted Boltzmann machines (RBMs) [9, 10], feedforward NNs (FNNs) [11, 12, 13], recurrent NNs (RNNs) [14, 15], convolutional NNs (CNNs) [5], attention networks [16, 17], and generative adversarial networks (GANs) [5]; and exemplar-based methods using non-negative matrix factorization (NMF) [18, 19].

¹ The converted speech samples are provided at http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/cyclegan-vc2/index.html.

Many VC methods (including those mentioned above) are categorized as parallel VC, which relies on the availability of parallel utterance pairs of the source and target speakers. However, collecting such data is often laborious or impractical. Even if obtaining such data is feasible, many VC methods require a time alignment procedure as a pre-process, which may occasionally fail and requires careful pre-screening or manual correction. To overcome these restrictions, this paper focuses on non-parallel VC, which does not rely on parallel utterances, transcriptions, or time alignment procedures.

In general, non-parallel VC is quite challenging and is inferior to parallel VC in terms of quality due to the disadvantages of the training conditions. To alleviate these severe conditions, several studies have incorporated an extra module (e.g., an automatic speech recognition (ASR) module [20, 21]) or extra data (e.g., parallel utterance pairs among reference speakers [22, 23, 24, 25]). Although these additional modules or data are helpful for training, preparing them imposes other costs and thus limits application.
To avoid such additional costs, recent studies have examined the use of probabilistic NNs (e.g., an RBM [26] and variational autoencoders (VAEs) [27, 28]), which embed the acoustic features into a common low-dimensional space with the supervision of speaker identification. Notably, these models are free from extra data, modules, and time alignment procedures. However, one limitation is that they need to approximate the data distribution explicitly (e.g., a Gaussian is typically used), which tends to cause over-smoothing through statistical averaging.

To overcome this limitation, recent studies [27, 29, 30] have incorporated GANs [31], which can learn a generative distribution close to the target without explicit approximation, thus avoiding the over-smoothing caused by statistical averaging. Among these, in contrast to some frame-by-frame methods [27, 30], which have difficulty in learning time dependencies, CycleGAN-VC [29] (published in [32]) makes it possible to learn a sequence-based mapping function by using CycleGAN [33, 34, 35] with a gated CNN [36] and an identity-mapping loss [37]. This allows sequential and hierarchical structures to be captured while preserving linguistic information. With this improvement, CycleGAN-VC performed comparably to a parallel VC method [7].

However, even with CycleGAN-VC, there is still a challenging gap to bridge between the real target and converted speech. To reduce this gap, we propose CycleGAN-VC2, which is an improved version of CycleGAN-VC incorporating three new techniques: an improved objective (two-step adversarial losses), an improved generator (2-1-2D CNN), and an improved discriminator (PatchGAN). We analyzed the effect of each technique on the Spoke (i.e., non-parallel VC) task of the Voice Conversion Challenge 2018 (VCC 2018) [38].
An objective evaluation showed that the proposed techniques help bring the converted acoustic feature sequence closer to the target in terms of global and local structures, which we assess by using Mel-cepstral distortion and modulation spectra distance, respectively. A subjective evaluation showed that CycleGAN-VC2 outperforms CycleGAN-VC in terms of naturalness and similarity for every speaker pair, including intra-gender and inter-gender pairs.

In Section 2 of this paper, we review the conventional CycleGAN-VC. In Section 3, we describe CycleGAN-VC2, which is an improved version of CycleGAN-VC incorporating three new techniques. In Section 4, we report the experimental results. We conclude in Section 5 with a brief summary and a mention of future work.

2. CONVENTIONAL CYCLEGAN-VC

2.1. Objective: One-Step Adversarial Loss

Let x ∈ R^{Q×T_x} and y ∈ R^{Q×T_y} be acoustic feature sequences belonging to source X and target Y, respectively, where Q is the feature dimension and T_x and T_y are the sequence lengths. The goal of CycleGAN-VC is to learn a mapping G_{X→Y} that converts x ∈ X into y ∈ Y without relying on parallel data. Inspired by CycleGAN [33], which was originally proposed in computer vision for unpaired image-to-image translation, CycleGAN-VC uses an adversarial loss [31] and a cycle-consistency loss [39]. Additionally, to encourage the preservation of linguistic information, CycleGAN-VC also uses an identity-mapping loss [37].

Adversarial loss: To make a converted feature G_{X→Y}(x) indistinguishable from a target y, an adversarial loss is used:

$$
\mathcal{L}_{adv}(G_{X \to Y}, D_Y) = \mathbb{E}_{y \sim P_Y(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim P_X(x)}[\log(1 - D_Y(G_{X \to Y}(x)))], \quad (1)
$$

where the discriminator D_Y attempts to find the best decision boundary between real and converted features by maximizing this loss, and G_{X→Y} attempts to generate a feature that can deceive D_Y by minimizing this loss.
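In code, Eq. (1) can be sketched as follows. This is an illustrative stdlib-Python sketch rather than the authors' implementation: `d_real` and `d_fake` stand for discriminator outputs D_Y(y) and D_Y(G_{X→Y}(x)) in (0, 1), and the expectations become sample means.

```python
import math

def adversarial_loss(d_real, d_fake):
    """One-step adversarial loss of Eq. (1).

    d_real: discriminator outputs D_Y(y) on real target features.
    d_fake: discriminator outputs D_Y(G_XY(x)) on converted features.
    D_Y maximizes this value; G_XY minimizes the second term.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term
```

Note that for training stability, the experiments in Section 4 replace this log-loss form with a least squares GAN objective [52]; the sketch above mirrors the formulation of Eq. (1) itself.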
Cycle-consistency loss: The adversarial loss only restricts G_{X→Y}(x) to follow the target distribution and does not guarantee the linguistic consistency between input and output features. To further regularize the mapping, a cycle-consistency loss is used:

$$
\mathcal{L}_{cyc}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{x \sim P_X(x)}[\| G_{Y \to X}(G_{X \to Y}(x)) - x \|_1] + \mathbb{E}_{y \sim P_Y(y)}[\| G_{X \to Y}(G_{Y \to X}(y)) - y \|_1], \quad (2)
$$

where the forward-inverse and inverse-forward mappings are simultaneously learned to stabilize training. This loss encourages G_{X→Y} and G_{Y→X} to find an optimal pseudo pair of (x, y) through circular conversion, as shown in Fig. 1(a).

Identity-mapping loss: To further encourage input preservation, an identity-mapping loss is used:

$$
\mathcal{L}_{id}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{y \sim P_Y(y)}[\| G_{X \to Y}(y) - y \|_1] + \mathbb{E}_{x \sim P_X(x)}[\| G_{Y \to X}(x) - x \|_1]. \quad (3)
$$

[Fig. 1. Comparison of objectives: (a) one-step adversarial loss; (b) two-step adversarial losses (proposed).]

Full objective: The full objective is written as

$$
\mathcal{L}_{full} = \mathcal{L}_{adv}(G_{X \to Y}, D_Y) + \mathcal{L}_{adv}(G_{Y \to X}, D_X) + \lambda_{cyc} \mathcal{L}_{cyc}(G_{X \to Y}, G_{Y \to X}) + \lambda_{id} \mathcal{L}_{id}(G_{X \to Y}, G_{Y \to X}), \quad (4)
$$

where λ_cyc and λ_id are trade-off parameters. In this formulation, an adversarial loss is used once for each cycle, as shown in Fig. 1(a). Hence, we call it a one-step adversarial loss.

2.2. Generator: 1D CNN

CycleGAN-VC uses a one-dimensional (1D) CNN [5] for the generator to capture the overall relationship along the feature dimension while preserving the temporal structure. This can be viewed as a direct temporal extension of a frame-by-frame model, which captures such feature relationships only per frame.
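The remaining loss terms of Eqs. (2)-(4) admit a similarly compact sketch. This is illustrative stdlib Python over toy feature vectors, assuming the generators are callables; the λ values are those reported in the training details of Section 4.1.

```python
def l1_norm(a, b):
    """||a - b||_1 for two equal-length feature vectors."""
    return sum(abs(u - v) for u, v in zip(a, b))

def cycle_consistency_loss(xs, ys, g_xy, g_yx):
    """Eq. (2): forward-inverse and inverse-forward reconstruction errors."""
    fwd = sum(l1_norm(g_yx(g_xy(x)), x) for x in xs) / len(xs)
    inv = sum(l1_norm(g_xy(g_yx(y)), y) for y in ys) / len(ys)
    return fwd + inv

def identity_mapping_loss(xs, ys, g_xy, g_yx):
    """Eq. (3): each generator should leave in-domain input unchanged."""
    return (sum(l1_norm(g_xy(y), y) for y in ys) / len(ys)
            + sum(l1_norm(g_yx(x), x) for x in xs) / len(xs))

# Trade-off parameters of Eq. (4); these values are used in Section 4.
LAMBDA_CYC, LAMBDA_ID = 10.0, 5.0
```

Both losses vanish for perfect generators (e.g., identity mappings applied to identical domains), which is the behavior Eq. (4) rewards alongside the adversarial terms.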
To capture wide-range temporal structure efficiently while preserving the input structure, the generator is composed of downsampling, residual [40], and upsampling layers, as shown in Fig. 2(a). Another notable point is that CycleGAN-VC uses a gated CNN [36] to capture the sequential and hierarchical structures of acoustic features.

2.3. Discriminator: FullGAN

CycleGAN-VC uses a 2D CNN [5] for the discriminator to focus on a 2D structure (i.e., 2D spectral texture [41]). More precisely, as shown in Fig. 3(a), a fully connected layer is used at the last layer to determine realness considering the input's overall structure. Such a model is called FullGAN.

3. CYCLEGAN-VC2

3.1. Improved Objective: Two-Step Adversarial Losses

One well-known problem of statistical models is the over-smoothing caused by statistical averaging. The adversarial loss used in Eq. (4) helps to alleviate this degradation, but the cycle-consistency loss, formulated as an L1 loss, still causes over-smoothing. To mitigate this negative effect, we introduce an additional discriminator D′_X and impose an adversarial loss on the circularly converted feature:

$$
\mathcal{L}_{adv2}(G_{X \to Y}, G_{Y \to X}, D'_X) = \mathbb{E}_{x \sim P_X(x)}[\log D'_X(x)] + \mathbb{E}_{x \sim P_X(x)}[\log(1 - D'_X(G_{Y \to X}(G_{X \to Y}(x))))]. \quad (5)
$$

Similarly, we introduce D′_Y and impose an adversarial loss L_adv2(G_{Y→X}, G_{X→Y}, D′_Y) for the inverse-forward mapping. We add these two adversarial losses to Eq. (4). In this improved objective, we use adversarial losses twice for each cycle, as shown in Fig. 1(b). Hence, we call them two-step adversarial losses.
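The structure of Eq. (5) can be sketched as follows. As before, this is an illustrative stdlib-Python sketch, with the generators and the additional discriminator D′_X modeled as callables whose outputs lie in (0, 1).

```python
import math

def two_step_adversarial_loss(xs, g_xy, g_yx, d2_x):
    """Eq. (5): adversarial loss on the circularly converted feature.

    d2_x is the additional discriminator D'_X, which scores real x
    against the round-trip reconstruction G_YX(G_XY(x)).
    """
    real_term = sum(math.log(d2_x(x)) for x in xs) / len(xs)
    cyc_term = sum(math.log(1.0 - d2_x(g_yx(g_xy(x)))) for x in xs) / len(xs)
    return real_term + cyc_term
```

The key difference from Eq. (1) is the input to the fake branch: D′_X judges the round-trip output G_{Y→X}(G_{X→Y}(x)) rather than a one-way conversion, so over-smoothed reconstructions produced by the L1 cycle loss can be penalized adversarially.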
[Fig. 2. Comparison of generator network architectures: (a) 1D CNN; (b) 2D CNN; (c) 2-1-2D CNN (proposed). Red and blue blocks indicate 1D and 2D convolution layers, respectively; r indicates a downsampling or upsampling rate.]

[Fig. 3. Comparison of discriminator network architectures: (a) FullGAN; (b) PatchGAN (proposed).]

3.2. Improved Generator: 2-1-2D CNN

In VC frameworks [5, 29] (including CycleGAN-VC), a 1D CNN (Fig. 2(a)) is commonly used as the generator, whereas in postfilter frameworks [41, 42], a 2D CNN (Fig. 2(b)) is preferred. These choices are related to the pros and cons of each network. A 1D CNN is more suitable for capturing dynamical change, as it can capture the overall relationship along the feature dimension. In contrast, a 2D CNN is better suited for converting features while preserving the original structures, as it restricts the converted region to be local. Even with a 1D CNN, residual blocks [40] can mitigate the loss of the original structure, but we find that downsampling and upsampling (which are necessary for effectively capturing wide-range structures) become a severe cause of this degradation. To alleviate this, we have developed a network architecture called a 2-1-2D CNN, shown in Fig. 2(c). In this network, 2D convolution is used for downsampling and upsampling, and 1D convolution is used for the main conversion process (i.e., the residual blocks). To adjust the channel dimension, we apply 1 × 1 convolution before or after reshaping the feature map.
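The reshaping step between the 2D and 1D stages can be illustrated with plain Python lists. This sketch covers only the tensor reshape of Fig. 2(c), with the 1 × 1 convolutions that adjust the channel dimension omitted: a feature map with C channels of size (Q/r) × (T/r) is folded into a single (C · Q/r) × (T/r) map so 1D convolution runs along time, and is unfolded afterwards.

```python
def to_1d(fmap_2d):
    """(C, Q_r, T_r) -> (C * Q_r, T_r): stack the frequency rows of all
    channels into one tall channel axis for the 1D residual blocks."""
    return [row for channel in fmap_2d for row in channel]

def to_2d(fmap_1d, c, q_r):
    """(C * Q_r, T_r) -> (C, Q_r, T_r): inverse of to_1d."""
    return [fmap_1d[i * q_r:(i + 1) * q_r] for i in range(c)]

# Round trip: C=2 channels, Q_r=2 frequency bins, T_r=2 frames.
fmap = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
assert to_2d(to_1d(fmap), c=2, q_r=2) == fmap
```

The reshape is lossless, which is the point of the design: only the convolutions before and after it change the representation.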
3.3. Improved Discriminator: PatchGAN

In previous GAN-based speech processing models [41, 42, 5, 29], FullGAN (Fig. 3(a)) has been used extensively. However, recent studies in computer vision [43, 44] indicate that the wide-range receptive fields of such a discriminator require more parameters, which causes difficulty in training. Inspired by this, we replace FullGAN with PatchGAN [45, 43, 44] (Fig. 3(b)), which uses convolution at the last layer and determines realness on the basis of patches. We experimentally examine its effect for non-parallel VC in Section 4.2.

4. EXPERIMENTS

4.1. Experimental Conditions

Dataset: We evaluated our method on the Spoke (i.e., non-parallel VC) task of the VCC 2018 [38], which includes recordings of professional US English speakers. We selected a subset of speakers so as to cover all inter-gender and intra-gender conversions: VCC2SF3 (SF), VCC2SM3 (SM), VCC2TF1 (TF), and VCC2TM1 (TM), where S, T, F, and M indicate source, target, female, and male, respectively. In the following, we use the abbreviations in parentheses (e.g., SF). Combinations of 2 sources (SF or SM) × 2 targets (TF or TM) were used for evaluation. Each speaker has a set of 81 sentences (about 5 minutes; relatively few for VC) for training and a set of 35 sentences for evaluation. In the Spoke task, the source and target speakers have different sets of sentences (no overlap) so as to allow evaluation in a non-parallel setting. The recordings were downsampled to 22.05 kHz for this challenge. We extracted 34 Mel-cepstral coefficients (MCEPs), logarithmic fundamental frequency (log F0), and aperiodicities (APs) every 5 ms by using the WORLD analyzer [46].

Conversion process: The proposed method was used to convert MCEPs (Q = 34 + 1 dimensions, including the 0th coefficient).²
The objective of these experiments was to analyze the quality of the converted MCEPs; therefore, for the other parts, we used typical methods similar to the baseline of the VCC 2018 [38]. Specifically, in inter-gender conversion, a vocoder-based VC method was used: F0 was converted by using a logarithm Gaussian normalized transformation [47], APs were used directly without modification, and the WORLD vocoder [46] was used to synthesize speech. In intra-gender conversion, we used a vocoder-free VC method [48]; more precisely, we calculated differential MCEPs by taking the difference between the source and converted MCEPs. For a similar reason, we did not use any postfilter [41, 42, 49] or powerful vocoder such as the WaveNet vocoder [50, 51]. Incorporating them is one possible direction for future work.

Training details: The implementation was almost the same as that of CycleGAN-VC, except that the improved techniques were incorporated. The details of the network architectures are given in Fig. 4. As a pre-process, we normalized the source and target MCEPs to zero mean and unit variance by using the statistics of the training sets. To stabilize training, we used a least squares GAN (LSGAN) [52]. To increase the randomness of the training data, we randomly cropped a segment (128 frames) from a randomly selected sentence instead of using a whole sentence directly. We used the Adam optimizer [53] with a batch size of 1. We trained the networks for 2 × 10^5 iterations with learning rates of 0.0002 for the generator and 0.0001 for the discriminator, and with a momentum term β1 of 0.5. We set λ_cyc = 10 and λ_id = 5. We used L_id only for the first 10^4 iterations to guide the learning direction. Note that we did not use any extra data, modules, or time alignment procedures for training.
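The logarithm Gaussian normalized transformation used for F0 admits a compact sketch. This is an illustrative implementation of the commonly used formula (matching the mean and variance of log F0 between speakers); the function name and the convention of passing unvoiced frames (F0 = 0) through unchanged are our assumptions, not prescribed by [47].

```python
import math

def convert_f0(f0_frames, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Logarithm Gaussian normalized transformation of F0.

    mu/sigma are the mean and standard deviation of log F0 computed
    over the voiced frames of each speaker's training set.
    """
    converted = []
    for f0 in f0_frames:
        if f0 <= 0.0:  # unvoiced frame: keep as-is
            converted.append(f0)
        else:
            log_f0 = (math.log(f0) - mu_src) / sigma_src * sigma_tgt + mu_tgt
            converted.append(math.exp(log_f0))
    return converted
```

For example, with equal variances, a source frame at the source's mean F0 is mapped exactly to the target's mean F0.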
² For reference, converted speech samples in which the proposed method was used to convert all acoustic features (namely, MCEPs, band APs, continuous log F0, and a voiced/unvoiced indicator) are provided at http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/cyclegan-vc2/index.html. Even in this challenging setting, CycleGAN-VC2 works reasonably well.

[Fig. 4. Network architectures of the generator (2-1-2D CNN) and discriminator (PatchGAN). In the input, output, and reshape layers, h, w, and c denote height, width, and number of channels, respectively. In each convolution layer, k, c, and s denote kernel size, number of channels, and stride size, respectively. IN, GLU, and PS indicate instance normalization [54], gated linear unit [36], and pixel shuffler [44], respectively. Since the generator is fully convolutional [55], it can take input of arbitrary length T.]

Table 1. Comparison of MCD [dB]. Nos. 1-5 are CycleGAN-VC2 variants; SF-TF and SM-TM are intra-gender, SM-TF and SF-TM inter-gender.

No. | Adv. | G | D | SF-TF | SM-TM | SM-TF | SF-TM
1 | 1Step | 2-1-2D | Patch | 6.86 ± .04 | 6.32 ± .06 | 7.36 ± .04 | 6.28 ± .04
2 | 2Step | 1D | Patch | 6.86 ± .04 | 6.73 ± .08 | 7.77 ± .07 | 6.41 ± .01
3 | 2Step | 2D | Patch | 7.01 ± .07 | 6.63 ± .03 | 7.63 ± .03 | 6.73 ± .04
4 | 2Step | 2-1-2D | Full | 7.01 ± .07 | 6.45 ± .05 | 7.41 ± .04 | 6.51 ± .02
5 | 2Step | 2-1-2D | Patch | 6.83 ± .01 | 6.31 ± .03 | 7.22 ± .05 | 6.26 ± .03
6 | CycleGAN-VC [29] | - | - | 7.37 ± .03 | 6.68 ± .07 | 7.68 ± .05 | 6.51 ± .05
7 | Frame-based CycleGAN [30] | - | - | 8.85 ± .07 | 7.27 ± .11 | 8.86 ± .27 | 8.51 ± .36

Table 2. Comparison of MSD [dB]. Same layout as Table 1.

No. | Adv. | G | D | SF-TF | SM-TM | SM-TF | SF-TM
1 | 1Step | 2-1-2D | Patch | 1.60 ± .02 | 1.63 ± .05 | 1.54 ± .03 | 1.56 ± .04
2 | 2Step | 1D | Patch | 3.31 ± .36 | 4.26 ± .37 | 2.04 ± .21 | 5.03 ± .32
3 | 2Step | 2D | Patch | 1.57 ± .07 | 1.54 ± .01 | 1.46 ± .03 | 1.66 ± .07
4 | 2Step | 2-1-2D | Full | 1.52 ± .02 | 1.56 ± .04 | 1.47 ± .01 | 1.67 ± .06
5 | 2Step | 2-1-2D | Patch | 1.49 ± .01 | 1.53 ± .02 | 1.45 ± .00 | 1.52 ± .01
6 | CycleGAN-VC [29] | - | - | 2.42 ± .08 | 2.66 ± .08 | 2.21 ± .13 | 2.65 ± .15
7 | Frame-based CycleGAN [30] | - | - | 3.78 ± .26 | 2.77 ± .10 | 3.32 ± .06 | 3.61 ± .15

4.2. Objective Evaluation

As discussed in previous studies [7, 41], it is fairly difficult to design a single metric that can comprehensively assess the quality of converted MCEPs. We therefore used two metrics to assess the global and local structures. To measure global structural differences, we used the Mel-cepstral distortion (MCD), which measures the distance between the target and converted MCEP sequences. To measure local structural differences, we used the modulation spectra distance (MSD), which is defined as the root mean square error between the target and converted logarithmic modulation spectra of MCEPs, averaged over all MCEP dimensions and modulation frequencies. For both metrics, smaller values indicate that the target and converted MCEPs are more similar.

We list the MCD and MSD in Tables 1 and 2, respectively.
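The MCD can be computed as follows. This sketch uses a common definition of MCD in dB (excluding the 0th energy coefficient) and assumes the target and converted MCEP sequences are already time-aligned; the paper does not spell out its exact variant, so treat the constant and conventions as assumptions.

```python
import math

def mel_cepstral_distortion(mc_target, mc_converted):
    """Frame-averaged Mel-cepstral distortion in dB.

    Each element of mc_target / mc_converted is one frame of MCEPs
    whose index 0 is the energy term, conventionally excluded.
    """
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    per_frame = []
    for t, c in zip(mc_target, mc_converted):
        sq_sum = sum((a - b) ** 2 for a, b in zip(t[1:], c[1:]))
        per_frame.append(scale * math.sqrt(sq_sum))
    return sum(per_frame) / len(per_frame)
```

Identical sequences give 0 dB, and larger global deviations between target and converted MCEPs increase the score, matching the "smaller is better" reading of Table 1.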
To eliminate the effect of initialization, we report the average and standard deviation of the scores over three random initializations. To analyze the effect of each technique, we conducted ablation studies on CycleGAN-VC2 (no. 5 is the full model). We also compared CycleGAN-VC2 with two state-of-the-art methods: CycleGAN-VC [29] and frame-based CycleGAN [30] (our reimplementation; we additionally used L_id to stabilize training). The comparison of one-step and two-step adversarial losses (nos. 1, 5) indicates that this technique is particularly effective for improving the MSD. The comparisons of generator (nos. 2, 3, 5) and discriminator (nos. 4, 5) network architectures indicate that they contribute to improving both the MCD and MSD. Finally, the comparison to the baselines (nos. 5, 6, 7) verifies that by incorporating the three proposed techniques, we achieve state-of-the-art performance in terms of MCD and MSD for every speaker pair.

4.3. Subjective Evaluation

We conducted listening tests to evaluate the quality of converted speech. CycleGAN-VC [29] was used as the baseline.

[Fig. 5. MOS for naturalness with 95% confidence intervals.]

[Fig. 6. Average preference score (%) on speaker similarity.]

To measure naturalness, we conducted a mean opinion score (MOS) test (5: excellent to 1: bad), in which we included the target speech as a reference (the MOS for TF and TM are 4.8). Ten sentences were randomly selected from the evaluation sets. To measure speaker similarity, we conducted an XAB test, where "A" and "B" were speech converted by the baseline and proposed methods, and "X" was the target speech. We selected ten sentence pairs randomly from the evaluation sets and presented all pairs in both orders (AB and BA) to eliminate bias in the order of stimuli.
For each sentence pair, the listeners were asked to select their preferred one ("A" or "B") or to opt for "Fair." Ten listeners participated in these listening tests. Figs. 5 and 6 show the MOS for naturalness and the preference scores for speaker similarity, respectively. These results confirm that CycleGAN-VC2 outperforms CycleGAN-VC in terms of both naturalness and similarity for every speaker pair. In particular, CycleGAN-VC is difficult to apply to a vocoder-free VC framework [48] (used in SF-TF and SM-TM), as this framework is sensitive to conversion error due to the use of differential MCEPs. However, the MOS indicates that CycleGAN-VC2 works relatively well even in this difficult setting.

5. CONCLUSIONS

To advance research on non-parallel VC, we have proposed CycleGAN-VC2, which is an improved version of CycleGAN-VC incorporating three new techniques: an improved objective (two-step adversarial losses), an improved generator (2-1-2D CNN), and an improved discriminator (PatchGAN). The experimental results demonstrate that CycleGAN-VC2 outperforms CycleGAN-VC in both objective and subjective measures for every speaker pair. Our proposed techniques are not limited to one-to-one VC, and adapting them to other settings (e.g., multi-domain VC [56]) and other applications [1, 2, 3, 4, 5] remains an interesting future direction.

Acknowledgements: This work was supported by JSPS KAKENHI 17H01763.

6. REFERENCES

[1] Alexander B. Kain, John-Paul Hosom, Xiaochuan Niu, Jan P. H. van Santen, Melanie Fried-Oken, and Janice Staehely, "Improving the intelligibility of dysarthric speech," Speech Commun., vol. 49, no. 9, pp. 743-759, 2007.

[2] Keigo Nakamura, Tomoki Toda, Hiroshi Saruwatari, and Kiyohiro Shikano, "Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech," Speech Commun., vol. 54, no. 1, pp. 134-146, 2012.
[3] Zeynep Inanoglu and Steve Young, "Data-driven emotion conversion in spoken English," Speech Commun., vol. 51, no. 3, pp. 268-283, 2009.

[4] Tomoki Toda, Mikihiro Nakagiri, and Kiyohiro Shikano, "Statistical voice conversion techniques for body-conducted unvoiced speech enhancement," IEEE Trans. Audio Speech Lang. Process., vol. 20, no. 9, pp. 2505-2517, 2012.

[5] Takuhiro Kaneko, Hirokazu Kameoka, Kaoru Hiramatsu, and Kunio Kashino, "Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks," in Proc. Interspeech, 2017, pp. 1283-1287.

[6] Yannis Stylianou, Olivier Cappé, and Eric Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. Speech Audio Process., vol. 6, no. 2, pp. 131-142, 1998.

[7] Tomoki Toda, Alan W. Black, and Keiichi Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Trans. Audio Speech Lang. Process., vol. 15, no. 8, pp. 2222-2235, 2007.

[8] Elina Helander, Tuomas Virtanen, Jani Nurminen, and Moncef Gabbouj, "Voice conversion using partial least squares regression," IEEE Trans. Audio Speech Lang. Process., vol. 18, no. 5, pp. 912-921, 2010.

[9] Ling-Hui Chen, Zhen-Hua Ling, Li-Juan Liu, and Li-Rong Dai, "Voice conversion using deep neural networks with layer-wise generative training," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 22, no. 12, pp. 1859-1872, 2014.

[10] Toru Nakashika, Tetsuya Takiguchi, and Yasuo Ariki, "Voice conversion based on speaker-dependent restricted Boltzmann machines," IEICE Trans. Inf. Syst., vol. 97, no. 6, pp. 1403-1410, 2014.

[11] Srinivas Desai, Alan W. Black, B. Yegnanarayana, and Kishore Prahallad, "Spectral mapping using artificial neural networks for voice conversion," IEEE Trans. Audio Speech Lang. Process., vol. 18, no. 5, pp. 954-964, 2010.
[12] Seyed Hamidreza Mohammadi and Alexander Kain, "Voice conversion using deep neural networks with speaker-independent pre-training," in Proc. SLT, 2014, pp. 19-23.

[13] Keisuke Oyamada, Hirokazu Kameoka, Takuhiro Kaneko, Hiroyasu Ando, Kaoru Hiramatsu, and Kunio Kashino, "Non-native speech conversion with consistency-aware recursive network and generative adversarial network," in Proc. APSIPA ASC, 2017, pp. 182-188.

[14] Toru Nakashika, Tetsuya Takiguchi, and Yasuo Ariki, "High-order sequence modeling using speaker-dependent recurrent temporal restricted Boltzmann machines for voice conversion," in Proc. Interspeech, 2014, pp. 2278-2282.

[15] Lifa Sun, Shiyin Kang, Kun Li, and Helen Meng, "Voice conversion using deep bidirectional long short-term memory based recurrent neural networks," in Proc. ICASSP, 2015, pp. 4869-4873.

[16] Kou Tanaka, Hirokazu Kameoka, Takuhiro Kaneko, and Nobukatsu Hojo, "AttS2S-VC: Sequence-to-sequence voice conversion with attention and context preservation mechanisms," in Proc. ICASSP, 2019.

[17] Hirokazu Kameoka, Kou Tanaka, Takuhiro Kaneko, and Nobukatsu Hojo, "ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion," arXiv preprint arXiv:1811.01609, Nov. 2018.

[18] Ryoichi Takashima, Tetsuya Takiguchi, and Yasuo Ariki, "Exemplar-based voice conversion using sparse representation in noisy environments," IEICE Trans. Inf. Syst., vol. E96-A, no. 10, pp. 1946-1953, 2013.

[19] Zhizheng Wu, Tuomas Virtanen, Eng Siong Chng, and Haizhou Li, "Exemplar-based sparse representation with residual compensation for voice conversion," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 22, no. 10, pp. 1506-1521, 2014.

[20] Feng-Long Xie, Frank K. Soong, and Haifeng Li, "A KL divergence and DNN-based approach to voice conversion without parallel training sentences," in Proc. Interspeech, 2016, pp. 287-291.
[21] Yuki Saito, Yusuke Ijima, Kyosuke Nishida, and Shinnosuke Takamichi, "Non-parallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors," in Proc. ICASSP, 2018, pp. 5274-5278.

[22] Athanasios Mouchtaris, Jan Van der Spiegel, and Paul Mueller, "Nonparallel training for voice conversion based on a parameter adaptation approach," IEEE Trans. Audio Speech Lang. Process., vol. 14, no. 3, pp. 952-963, 2006.

[23] Chung-Han Lee and Chung-Hsien Wu, "MAP-based adaptation for speech conversion using adaptation data selection and non-parallel training," in Proc. ICSLP, 2006, pp. 2254-2257.

[24] Tomoki Toda, Yamato Ohtani, and Kiyohiro Shikano, "Eigenvoice conversion based on Gaussian mixture model," in Proc. Interspeech, 2006, pp. 2446-2449.

[25] Daisuke Saito, Keisuke Yamamoto, Nobuaki Minematsu, and Keikichi Hirose, "One-to-many voice conversion based on tensor representation of speaker space," in Proc. Interspeech, 2011, pp. 653-656.

[26] Toru Nakashika, Tetsuya Takiguchi, and Yasuhiro Minami, "Non-parallel training in voice conversion using an adaptive restricted Boltzmann machine," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 24, no. 11, pp. 2032-2045, 2016.

[27] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang, "Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks," in Proc. Interspeech, 2017, pp. 3364-3368.

[28] Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu Hojo, "ACVAE-VC: Non-parallel many-to-many voice conversion with auxiliary classifier variational autoencoder," arXiv preprint, Aug. 2018.

[29] Takuhiro Kaneko and Hirokazu Kameoka, "Parallel-data-free voice conversion using cycle-consistent adversarial networks," arXiv preprint, Nov. 2017.
[30] Fuming Fang, Junichi Yamagishi, Isao Echizen, and Jaime Lorenzo-Trueba, "High-quality nonparallel voice conversion based on cycle-consistent adversarial network," in Proc. ICASSP, 2018, pp. 5279-5283.

[31] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in Proc. NIPS, 2014, pp. 2672-2680.

[32] Takuhiro Kaneko and Hirokazu Kameoka, "CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks," in Proc. EUSIPCO, 2018, pp. 2114-2118.

[33] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proc. ICCV, 2017, pp. 2223-2232.

[34] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong, "DualGAN: Unsupervised dual learning for image-to-image translation," in Proc. ICCV, 2017, pp. 2849-2857.

[35] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim, "Learning to discover cross-domain relations with generative adversarial networks," in Proc. ICML, 2017, pp. 1857-1865.

[36] Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier, "Language modeling with gated convolutional networks," in Proc. ICML, 2017, pp. 933-941.

[37] Yaniv Taigman, Adam Polyak, and Lior Wolf, "Unsupervised cross-domain image generation," in Proc. ICLR, 2017.

[38] Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Tomi Kinnunen, and Zhenhua Ling, "The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods," in Proc. Speaker Odyssey, 2018, pp. 195-202.

[39] Tinghui Zhou, Philipp Krähenbühl, Mathieu Aubry, Qixing Huang, and Alexei A. Efros, "Learning dense correspondence via 3D-guided cycle consistency," in Proc. CVPR, 2016, pp. 117-126.
[40] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proc. CVPR, 2016, pp. 770-778.

[41] Takuhiro Kaneko, Hirokazu Kameoka, Nobukatsu Hojo, Yusuke Ijima, Kaoru Hiramatsu, and Kunio Kashino, "Generative adversarial network-based postfilter for statistical parametric speech synthesis," in Proc. ICASSP, 2017, pp. 4910-4914.

[42] Takuhiro Kaneko, Shinji Takaki, Hirokazu Kameoka, and Junichi Yamagishi, "Generative adversarial network-based postfilter for STFT spectrograms," in Proc. Interspeech, 2017, pp. 3389-3393.

[43] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros, "Image-to-image translation with conditional adversarial networks," in Proc. CVPR, 2017, pp. 5967-5976.

[44] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang, "Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network," in Proc. CVPR, 2016, pp. 1874-1883.

[45] Chuan Li and Michael Wand, "Precomputed real-time texture synthesis with Markovian generative adversarial networks," in Proc. ECCV, 2016, pp. 702-716.

[46] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Trans. Inf. Syst., vol. 99, no. 7, pp. 1877-1884, 2016.

[47] Kun Liu, Jianping Zhang, and Yonghong Yan, "High quality voice conversion through phoneme-based linear mapping functions with STRAIGHT for Mandarin," in Proc. FSKD, 2007, pp. 410-414.

[48] Kazuhiro Kobayashi, Tomoki Toda, and Satoshi Nakamura, "F0 transformation techniques for statistical voice conversion with direct waveform modification with spectral differential," in Proc. SLT, 2016, pp. 693-700.
[49] Kou Tanaka, Takuhiro Kaneko, Nobukatsu Hojo, and Hirokazu Kameoka, "Synthetic-to-natural speech waveform conversion using cycle-consistent adversarial networks," in Proc. SLT, 2018, pp. 632-639.

[50] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, Sep. 2016.

[51] Akira Tamamori, Tomoki Hayashi, Kazuhiro Kobayashi, Kazuya Takeda, and Tomoki Toda, "Speaker-dependent WaveNet vocoder," in Proc. Interspeech, 2017, pp. 1118-1122.

[52] Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, and Stephen Paul Smolley, "Least squares generative adversarial networks," in Proc. ICCV, 2017, pp. 2794-2802.

[53] Diederik Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," in Proc. ICLR, 2015.

[54] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky, "Instance normalization: The missing ingredient for fast stylization," arXiv preprint arXiv:1607.08022, July 2016.

[55] Jonathan Long, Evan Shelhamer, and Trevor Darrell, "Fully convolutional networks for semantic segmentation," in Proc. CVPR, 2015, pp. 3431-3440.

[56] Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu Hojo, "StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks," in Proc. SLT, 2018, pp. 266-273.
