WaveCycleGAN: Synthetic-to-natural speech waveform conversion using cycle-consistent adversarial networks
Authors: Kou Tanaka, Takuhiro Kaneko, Nobukatsu Hojo, and Hirokazu Kameoka
NTT Communication Science Laboratories, NTT Corporation, Japan
Email: {tanaka.ko, kaneko.takuhiro, hojo.nobukatsu, kameoka.hirokazu}@lab.ntt.co.jp

ABSTRACT

We propose a learning-based filter that allows us to directly modify a synthetic speech waveform into a natural speech waveform. Speech-processing systems using a vocoder framework, such as statistical parametric speech synthesis and voice conversion, are convenient especially when only a limited amount of data is available, because it is possible to represent and process interpretable acoustic features over a compact space, such as the fundamental frequency (F0) and mel-cepstrum. However, a well-known problem that leads to the quality degradation of generated speech is an over-smoothing effect that eliminates some of the detailed structure of generated/converted acoustic features. To address this issue, we propose a synthetic-to-natural speech waveform conversion technique that uses cycle-consistent adversarial networks and does not require any explicit assumption about the speech waveform in adversarial learning. In contrast to current techniques, since our modification is performed at the waveform level, we expect that the proposed method will also make it possible to generate "vocoder-less" sounding speech even if the input speech is synthesized using a vocoder framework. The experimental results demonstrate that our proposed method can 1) alleviate the over-smoothing effect of the acoustic features despite operating directly on the waveform and 2) greatly improve the naturalness of the generated speech sounds.
Index Terms — Statistical parametric speech synthesis, postfilter, deep neural network, generative adversarial network, cycle-consistent adversarial network

1. INTRODUCTION

Speech processing systems such as statistical parametric speech synthesis [1] and statistical voice conversion [2] are well-known frameworks. These approaches, which use a vocoder framework, have a significant advantage, especially for a limited amount of data, because it is possible to represent interpretable acoustic features over a compact space, such as the fundamental frequency (F0) and mel-cepstrum, which are lower-dimensional acoustic features than a short-term Fourier transform (STFT) spectrogram.

[Fig. 1. Three major factors [3] that degrade the quality of synthesized speech during statistical parametric speech synthesis (accuracy of acoustic models, over-smoothing, vocoding error), and general approaches to generating more natural-sounding speech by post-processing: a) a postfilter applied to the acoustic features before waveform synthesis; b) a filter applied to the synthesized waveform. Our proposed framework corresponds to process b), which can address not only the over-smoothing problem but also the vocoding error.]

Although these systems aim to produce speech with a quality indistinguishable from that of clean and real speech, processed and synthesized speech can usually be distinguished from natural speech. The realization of synthetic-to-natural speech waveform conversion would therefore provide a significant benefit to many speech processing approaches, especially those using a vocoder framework. Three major factors reported in [3] degrade speech synthesized by a statistical parametric speech synthesis technique: the accuracy of the acoustic models; over-smoothing, which eliminates some of the detailed structure of generated/converted acoustic features; and vocoding.
In this paper, we focus on vocoding and over-smoothing. To address the over-smoothing effect, several techniques for restoring the fine structure of natural speech over the acoustic features have been proposed [2, 4, 5]. These approaches, as shown in Fig. 1 a), have achieved significant improvements in the naturalness of synthesized speech in their respective directions. However, heuristic approaches such as enhancement of the global variance [2] and the modulation spectrum [4] are unsuitable for covering all the negative factors. On the other hand, although a learning-based postfilter [5] enables us to restore not only the global variance and modulation spectrum but also other factors that degrade the quality of synthesized speech, it is still insufficient for generating natural speech, because the postfilter is applied not to the waveform but to heuristic acoustic features such as the mel-cepstrum. Furthermore, all of these approaches suffer from vocoding error because they use the vocoder framework to synthesize the speech waveform.

To avoid this limitation, an end-to-end speech enhancement method [6] has been proposed within a generative adversarial framework. As shown in Fig. 1 b), since the waveform of the input speech is directly operated on to obtain that of the desired speech after the vocoding part, [6] has the potential to address not only the over-smoothing effect but also the vocoding error. Furthermore, the generative adversarial framework does not require us to design, in advance, any hand-crafted feature that creates a gap between natural speech and synthetic speech. In preliminary experiments, however, we found that this method is unsuitable when the alignments¹ between the input waveform and the desired waveform are not perfect.
For example, noise reduction of noisy speech simulated by adding noise to a speech waveform recorded in an ideal environment succeeded because of the perfect alignment between the simulated noisy speech and the clean source speech. However, converting synthetic speech generated by text-to-speech synthesis or voice conversion processing into natural speech is not easy to achieve with this method because of the alignment problem mentioned above.

In this paper, we propose a learning-based filter that allows us to convert a synthetic speech waveform into a natural speech waveform using cycle-consistent adversarial networks with a fully convolutional architecture. We adopt cycle-consistent adversarial networks because they do not require a dataset forcibly paired at the time-frame level and, as the name implies, they are trained with adversarial learning. In contrast to [7], which is also inspired by cycle-consistent adversarial networks [8] but converts acoustic features rather than the speech waveform, our modification is performed at the waveform level, so we expect the proposed method to make it possible to generate "vocoder-less" sounding speech even if the input speech is synthesized using a vocoder framework. Furthermore, we adopt a gated convolutional neural network (CNN) architecture [9], which is able to capture long- and short-term dependencies in the speech waveform. The experimental results demonstrate that our proposed method can 1) alleviate the over-smoothing effect of the acoustic features despite operating directly on the waveform and 2) greatly improve the naturalness of the generated speech sounds.

2. SEGAN: SPEECH ENHANCEMENT GENERATIVE ADVERSARIAL NETWORK

2.1.
Generative Adversarial Networks

Generative adversarial networks (GANs) [10] are generative models consisting of two neural networks. One is a generator G that learns to convert a sample z from a prior distribution P(z) into a target sample x from the distribution P_Data(x), i.e., a sample from the training data. The generator aims to learn a projection that imitates the true feature distribution and generates samples resembling the training data. The other is a discriminator D that learns the boundary between imitated features generated by the generator G and true features drawn from the training data.

¹ In this paper, we define alignment considering both the magnitude information and the phase information of speech, because we focus on modifying the speech waveform rather than the acoustic features.

The adversarial characteristic arises from the fact that the discriminator D tries to classify the instances x obtained from the true data distribution P_Data(x) as real and the candidates G(z) produced by the generator G as fake, while the generator G tries to make the discriminator D classify those G(z) as real. Through back-propagation, the generator G becomes able to generate better candidates G(z), and the discriminator D becomes better able to distinguish the generated samples G(z) from real data x. The objective function of this adversarial learning is formulated as the following minimax game between G and D:

\min_G \max_D \mathcal{L}_{gan}(G_{z \to x}, D_x) = \mathbb{E}_{x \sim P_{Data}(x)}[\log D_x(x)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D_x(G_{z \to x}(z)))].  (1)

Although GANs achieve state-of-the-art results in a variety of generative tasks [11, 12], the difficulty of training them is a well-known problem. For instance, the classic approach suffers from a vanishing gradient problem due to the sigmoid cross-entropy loss used for training.
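As an illustration of the minimax game in Eq. (1), the two losses can be sketched in numpy. This is our illustrative sketch, not code from the paper: the expectations are estimated over mini-batches, and the generator here minimizes the non-saturating variant −log D(G(z)) that is commonly used in practice instead of log(1 − D(G(z))), to avoid the vanishing-gradient issue mentioned above.

```python
import numpy as np

def gan_losses(d_real, d_fake):
    """Mini-batch estimates of the GAN objectives.

    d_real: discriminator outputs D(x) on real samples, values in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, in (0, 1)
    Returns (discriminator_loss, generator_loss). The discriminator
    maximizes Eq. (1), so we minimize its negation here.
    """
    eps = 1e-12  # numerical guard for log(0)
    d_loss = -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))
    # Non-saturating generator loss (common practical variant):
    g_loss = -np.mean(np.log(d_fake + eps))
    return d_loss, g_loss
```

For a discriminator that is fooled completely, D(·) = 0.5 everywhere, the discriminator loss equals 2 log 2, the value at the saddle point of the minimax game.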
Several adversarial training techniques have been proposed to overcome this difficulty. The least-squares GAN (LSGAN) approach [13] stabilizes the training process by replacing the cross-entropy loss in Eq. 1 with least-squares functions as follows:

\min_D \mathcal{L}_{lsgan}(D_x) = \frac{1}{2} \mathbb{E}_{x \sim P_{Data}(x)}[(D_x(x) - 1)^2] + \frac{1}{2} \mathbb{E}_{z \sim P_z(z)}[D_x(G_{z \to x}(z))^2],  (2)

\min_G \mathcal{L}_{lsgan}(G_{z \to x}) = \frac{1}{2} \mathbb{E}_{z \sim P_z(z)}[(D_x(G_{z \to x}(z)) - 1)^2].  (3)

2.2. GANs for Speech Enhancement

To retain the linguistic information of speech samples, [6] adopts a conditioned version of the GAN that provides extra information to G and D to perform the mapping and classification. As shown in Fig. 2, the generator G has a structure similar to an auto-encoder: a noisy speech signal x_n, the input of the G network, is encoded as x_c. After concatenating a random vector z with the encoded vector x_c, which is treated as a conditional vector, the decoding part of the G network is performed with transposed convolutions (a.k.a. deconvolutions or fractionally strided convolutions) to obtain the enhanced signal.

[Fig. 2. Generator network for speech enhancement reported in [6]: downsampling and convolution of the input speech signal (e.g., noisy speech), concatenation of a random variable with features related to the input, then upsampling and convolution to produce the enhanced speech signal (e.g., clean speech). The structure is similar to an auto-encoder.]

To achieve the generation of speech samples that are closer to clean speech, a secondary component is added to the loss of G. [6] adopts the L1 norm, as it has been proven effective in the image manipulation domain [14, 15]. In this way, the adversarial component is encouraged to produce more fine-grained and realistic results. A new hyper-parameter λ_SEGAN controls the magnitude of the L1 norm.
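The least-squares objectives of Eqs. (2) and (3) can be sketched in numpy as mini-batch estimates (our illustration; `d_real` and `d_fake` stand for the discriminator outputs on real and generated batches):

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    # Eq. (2): push D(x) toward 1 and D(G(z)) toward 0
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    # Eq. (3): push D(G(z)) toward 1
    return 0.5 * np.mean((d_fake - 1.0) ** 2)
```

Unlike the sigmoid cross-entropy in Eq. (1), these quadratic penalties keep a nonzero gradient even for confidently classified samples, which is the stabilization argument of [13].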
Finally, the loss function of the generator G becomes

\min_G \mathcal{L}_{lsgan}(G_{x_n, z \to x}) = \frac{1}{2} \mathbb{E}_{z \sim P_z(z), x_n \sim P_{Data}(x_n)}[(D_x(G_{x_n, z \to x}(x_n, z)) - 1)^2] + \lambda_{SEGAN} \| G_{x_n, z \to x}(x_n, z) - x \|_1.  (4)

3. SYNTHETIC-TO-NATURAL SPEECH WAVEFORM CONVERSION USING CYCLE-CONSISTENT ADVERSARIAL NETWORKS

3.1. Concept

In preliminary experiments, we found that SEGAN [6] could not easily be applied to the conversion of a synthetic speech waveform into a natural speech waveform. One possible reason is that the misalignment caused by the different lengths and generation processes of synthetic and natural speech makes it difficult to ensure that the generator G operates as a bijective function. Specifically, the phase information of a speech waveform synthesized with the vocoder framework is very far from that of natural speech, even if the magnitude information of the synthetic speech is close to that of natural speech. We assume that these factors induce "mode collapse," a well-known problem when training GANs, and that SEGAN does not guarantee that an individual input and output are paired up in a meaningful way. Generally speaking, all input speech signals then map to the same output speech signals, and the optimization fails to make progress [10].

[Fig. 3. Training procedures of cycle-consistent adversarial networks: a) forward-inverse mapping to consider forward cycle consistency and b) inverse-forward mapping to consider backward cycle consistency; each direction combines an adversarial loss with a cycle-consistency loss.]

To solve this problem, we focus on cycle-consistent adversarial networks [8]. This approach introduces a "cycle-consistent" property, which ensures a return to the original sample [16]. Mathematically, if we have a converter G_{x→y} and another converter G_{y→x}, then G_{x→y} and G_{y→x} should be
the inverse of each other, and both mappings should be bijections. We incorporate this property into SEGAN by training the mapping functions G_{x→y} and G_{y→x} simultaneously and adding a cycle-consistency loss [17] that encourages G_{y→x}(G_{x→y}(x)) ≈ x and G_{x→y}(G_{y→x}(y)) ≈ y. Combining the cycle-consistency loss with the adversarial losses defines our full objective for a training procedure that does not rely on perfect alignment.

Furthermore, we focus on a convolutional architecture called the gated CNN, which has recently been shown to be powerful for modeling long-term sequential data. It was originally introduced for language modeling and was shown to outperform long short-term memory (LSTM) language models trained in a similar setting [9]. We previously applied a gated CNN architecture to acoustic feature sequence modeling, and its effectiveness has already been confirmed [18, 19]. With a gated CNN, the output of a hidden layer of the network is described as a linear projection modulated by an output gate. Similar to an LSTM [20] and a gated recurrent unit (GRU) [21], the output gate controls what information should be propagated through the hierarchy of layers and allows the capture of long-term structure.

3.2. Cycle-Consistent Adversarial Networks

For each speech sample x, the speech waveform conversion cycle shown in Fig. 3 a) constrains the sample x to return to the original speech through a target domain corresponding to the samples y: x → G_{x→y}(x) → G_{y→x}(G_{x→y}(x)) ≈ x. This cycle consistency is called forward cycle consistency. Similarly, as shown in Fig. 3 b), for each speech waveform y, G_{x→y} and G_{y→x} are constrained by backward cycle consistency: y → G_{y→x}(y) → G_{x→y}(G_{y→x}(y)) ≈ y.
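The forward and backward cycles above are typically enforced with L1 penalties, as in the cycle-consistency loss of Eq. (5). A minimal numpy sketch with toy generators follows; the names `G_xy` and `G_yx` are our placeholders for G_{x→y} and G_{y→x}, and the sample means stand in for the expectations:

```python
import numpy as np

def l1(a, b):
    # Sample mean of the L1 distance, standing in for the expectations
    return np.mean(np.abs(a - b))

def cycle_consistency_loss(x, y, G_xy, G_yx):
    """Forward cycle:  x -> G_xy(x) -> G_yx(G_xy(x)) ~ x.
    Backward cycle: y -> G_yx(y) -> G_xy(G_yx(y)) ~ y."""
    forward = l1(G_yx(G_xy(x)), x)
    backward = l1(G_xy(G_yx(y)), y)
    return forward + backward
```

With mutually inverse toy mappings, e.g. `G_xy = lambda a: a + 1.0` and `G_yx = lambda a: a - 1.0`, the loss is zero, while pairs that are not inverses of each other are penalized.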
Therefore, these constraints are expressed as the following cycle-consistency loss:

\mathcal{L}_{cyc} = \mathbb{E}_{x \sim P_{Data}(x)}[\| G_{y \to x}(G_{x \to y}(x)) - x \|_1] + \mathbb{E}_{y \sim P_{Data}(y)}[\| G_{x \to y}(G_{y \to x}(y)) - y \|_1].  (5)

Finally, the objective function is

\mathcal{L}_{full} = \mathcal{L}_{gan}(G_{x \to y}, D_y) + \mathcal{L}_{gan}(G_{y \to x}, D_x) + \lambda_{cyc} \mathcal{L}_{cyc},  (6)

where λ_cyc is a hyper-parameter that controls the cycle-consistency loss.

3.3. Identity-Mapping Loss

The cycle-consistency loss reduces the space of possible mapping functions by constraining their structure. However, in a waveform modification task, the linguistic information is not always preserved by incorporating only the cycle-consistency loss. The identity-mapping loss reported in [22] preserves the composition of the input samples in the converted samples; [8] applied this approach to color preservation and demonstrated its effectiveness. Note that the secondary component of Eq. 4 is also an identity-mapping loss. To encourage the generators G_{x→y} and G_{y→x} to preserve linguistic information, we also incorporate this property as follows:

\mathcal{L}_{id} = \mathbb{E}_{x \sim P_{Data}(x)}[\| G_{y \to x}(x) - x \|_1] + \mathbb{E}_{y \sim P_{Data}(y)}[\| G_{x \to y}(y) - y \|_1].  (7)

In practice, the weighted loss λ_id L_id, with a hyper-parameter λ_id that controls the identity-mapping loss, is added to Eq. 6.

3.4. Sequential Modeling with Gated CNNs

To capture long- and short-term dependencies in speech waveforms, we use a gated CNN [9] to construct both the generator and discriminator networks of the GAN. Gated CNNs are CNNs equipped with gated linear units (GLUs) as activation functions instead of the regular rectified linear units (ReLUs) [23] or Tanh activations.
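The output-gate mechanism of the GLU can be sketched in numpy as follows. This is our simplified illustration of the gating computation formalized in Eq. (8): dense matrix products stand in for the convolutions of the actual architecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu_layer(H, W, b, V, c):
    """Gated linear unit: a linear projection modulated element-wise
    by a sigmoid output gate (dense projections standing in for the
    convolutions of a gated CNN layer)."""
    return (H @ W + b) * sigmoid(H @ V + c)
```

The sigmoid gate lies in (0, 1), so each element of the linear projection is scaled by how much the gate "opens" for it, which is what lets the network decide, in a data-driven manner, what to propagate upward.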
The output of the l-th hidden layer of a gated CNN is described as a linear projection H_{l−1} ∗ W_l + b_l modulated by an output gate σ(H_{l−1} ∗ V_l + c_l):

H_l = (H_{l-1} * W_l + b_l) \otimes \sigma(H_{l-1} * V_l + c_l),  (8)

where W_l, V_l, b_l, and c_l are the network parameters to be trained, σ is the sigmoid function, and ⊗ denotes the element-wise product. Similar to LSTMs, the output gate multiplies each element of H_{l−1} ∗ W_l + b_l and controls what information should be propagated through the hierarchy of layers in a data-driven manner.

4. EXPERIMENTAL EVALUATION

4.1. Experimental Conditions

Datasets (Natural): We used a Japanese speech dataset consisting of utterances by one professional female narrator. To evaluate performance, we used 30 sentences (speech sections of 5.3 minutes). To train the models, we used about 6,500 sentences for the baseline system and 400 sentences (speech sections of 1.2 hours) for the conventional and proposed methods. The sampling rate of the speech signals was 22.05 kHz. Audio samples can be accessed on our web page².

Baseline system (Baseline): We used a DNN-based statistical parametric speech synthesis method [1] as the baseline. From the speech data, 40 mel-cepstral coefficients, logarithmic F0, and 5-band aperiodicities were extracted every 5 ms with the STRAIGHT analysis system [24, 25]. The contextual features used as the input were 506-dimensional linguistic features including phonemes and mora positions. The output consisted of 40 mel-cepstral coefficients, log F0, 5-band aperiodicities, their delta and delta-delta features, and a voiced/unvoiced binary value. The DNN architectures were feed-forward networks with 5 hidden layers of 1,024 units each.

Conventional method (GANv): As a conventional approach, we used a GAN-based postfilter [5] applied not to the speech waveform but to the acoustic features.
The system setting was the same as the reported setting, except for the excitation signals. Although [5] used the excitation signals of natural speech, we used the excitation signals generated by vocoding when evaluating all of the synthetic speech. We applied the conventional method only to voiced segments.

Our proposed method (Proposed): We designed a network based on recent successes in image modeling [26]. Figure 4 shows the network architectures of our proposed model. The network includes downsampling layers, residual blocks [27], and upsampling layers. We used instance normalization (IN) [28] instead of batch normalization [29]. We used a pixel shuffler (PS) for upsampling, whose effectiveness was demonstrated in high-resolution image generation [26]. We normalized the speech waveforms to zero mean and unit variance using the training sets. To stabilize training, we used a least-squares GAN [13]. We set λ_cyc to 10. To guide the learning process, we set λ_id to 5 for the first 20k iterations and linearly decayed it to 0 over the next 20k iterations. We optimized the model parameters using the Adam optimizer [30] with a mini-batch size of 32. The learning rate α was set to 0.0001 for the discriminators and 0.0002 for the generators. We used the same learning rate for the first 250k iterations and linearly decayed it to 0 over the next 250k iterations. The other Adam parameters, β1 and β2, were set to 0.5 and 0.99, respectively. Note that since the generators are fully convolutional, they can handle an arbitrary-length input.
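The λ_id and learning-rate schedules described above share the same shape (hold the initial value, then decay linearly to zero), so they can be sketched with one small helper. The function name and signature are ours, not from the paper:

```python
def hold_then_linear_decay(step, hold_steps, decay_steps, initial):
    """Return `initial` for the first `hold_steps` iterations, then
    decay linearly to 0 over the next `decay_steps` iterations."""
    if step < hold_steps:
        return initial
    frac = min(1.0, (step - hold_steps) / float(decay_steps))
    return initial * (1.0 - frac)

# lambda_id: 5 for the first 20k iterations, then decayed to 0 over 20k more
lambda_id = lambda step: hold_then_linear_decay(step, 20_000, 20_000, 5.0)
# generator learning rate: 2e-4 for 250k iterations, then decayed over 250k more
lr_g = lambda step: hold_then_linear_decay(step, 250_000, 250_000, 2e-4)
```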
² http://www.kecl.ntt.co.jp/people/tanaka.ko/projects/s2n/s2n_speech_waveform_conversion.html

[Fig. 4. Network architectures of generator G and discriminator D (1D CNNs). "Conv", "GLU", "IN", "PS", "FC", and "Sigmoid" denote convolution, gated linear unit, instance normalization, pixel shuffler, fully connected, and sigmoid layers, respectively. In an input or output layer, w and c represent the width and number of channels, respectively. In each convolutional layer, k, c, and s denote the kernel size, number of channels, and stride size, respectively. The generator consists of downsampling blocks, 6 residual blocks, and upsampling blocks; the discriminator consists of downsampling blocks followed by a fully connected layer with a sigmoid real/fake output.]

4.2. Modulation Spectrum over Acoustic Features

To confirm the alleviation of the over-smoothing effect on the acoustic features, we applied the conventional and proposed methods to speech synthesized by the baseline system and obtained the modulation spectra of the mel-cepstrum sequences for each system. Although the modulation spectrum is traditionally defined as a value calculated using the Fourier transform of the parameter sequence [31], this paper defines the modulation spectrum as its logarithmic power spectrum. We used 8,192 FFT points.

The average modulation spectra of the first 1k indices for the 10th, 20th, 30th, and 40th mel-cepstral coefficient sequences are shown in Fig. 5. We found that Baseline suffered more from the over-smoothing effect than GANv and Natural. On the other hand, GANv and Proposed are close to Natural.
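The modulation spectrum as defined here (the logarithmic power spectrum of one parameter trajectory, computed with 8,192 FFT points) can be sketched in numpy. This is our illustration; in particular, removing the sequence mean before the FFT is our choice to suppress the DC component and is not specified in the text:

```python
import numpy as np

def modulation_spectrum(seq, n_fft=8192):
    """Log power spectrum of one mel-cepstral coefficient trajectory.
    `seq` holds the coefficient value per frame (e.g. every 5 ms);
    the mean is removed to suppress the DC component (our choice)."""
    spec = np.fft.rfft(np.asarray(seq) - np.mean(seq), n=n_fft)
    return 10.0 * np.log10(np.abs(spec) ** 2 + 1e-12)
```

A slowly varying (over-smoothed) trajectory concentrates its energy at low modulation-frequency indices, which is the effect being compared across systems in Fig. 5.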
As with the GAN-based postfilter for the acoustic features (GANv), the result demonstrates that our proposed method for the speech waveform (Proposed) successfully alleviated the over-smoothing effect caused by the statistical parametric speech synthesis process.

4.3. Subjective Evaluation of Naturalness

We conducted a subjective 5-scale mean opinion score test regarding the naturalness of the generated speech. 10 listeners participated, and each listener evaluated 120 speech samples (30 speech samples × 4 systems). We applied the conventional and proposed methods to the same speech waveforms from Baseline as in Sec. 4.2.

Figure 6 shows that our proposed method (Proposed) achieved a significant improvement in the naturalness of the generated speech compared with Baseline and GANv. This result indicates that our approach is more effective than using postfilters on the acoustic features, because it can address both the over-smoothing problem and the vocoding error. Furthermore, with Proposed, the listeners commented that the "buzzy" sound peculiar to vocoding was sufficiently improved. However, there is still a gap between Proposed and Natural. One possible reason is that the listeners found Proposed distinguishable from Natural because it sometimes had a "hoarse" sound.

5. CONCLUSION

In this paper, to realize a synthetic-to-natural speech filter, we proposed a learning-based filter that allows us to convert a synthetic speech waveform into a natural speech waveform using cycle-consistent adversarial networks. Since our process is applied after the synthesis part of statistical parametric speech synthesis, we expected that our approach would be able to address not only the over-smoothing problem but also the vocoding error.
The experimental results demonstrated that our proposed method 1) alleviated the over-smoothing effect of the acoustic features despite operating directly on the waveform and 2) dramatically improved the naturalness of the generated speech sounds. In the future, we will further close the gap between natural speech and synthetic speech by considering auditory properties.

[Fig. 5. Average modulation spectra of the first 1k modulation frequency indices for the a) 10th, b) 20th, c) 30th, and d) 40th mel-cepstral coefficient sequences (Natural, Proposed, GANv, Baseline).]

[Fig. 6. Subjective 5-scale mean opinion scores regarding naturalness, with 95% confidence intervals: Natural 4.53, Proposed 4.18, GANv 2.23, Baseline 1.71.]

Acknowledgment: This work was supported by JSPS KAKENHI 17H01763.

6. REFERENCES

[1] Heiga Zen, Andrew Senior, and Mike Schuster, "Statistical parametric speech synthesis using deep neural networks," in Proc. ICASSP 2013, pp. 7962–7966.

[2] Tomoki Toda, Alan W. Black, and Keiichi Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222–2235, 2007.

[3] Heiga Zen, Keiichi Tokuda, and Alan W. Black, "Statistical parametric speech synthesis," Speech Communication, vol. 51, no. 11, pp. 1039–1064, 2009.
[4] Shinnosuke Takamichi, Tomoki Toda, Graham Neubig, Sakriani Sakti, and Satoshi Nakamura, "A postfilter to modify the modulation spectrum in HMM-based speech synthesis," in Proc. ICASSP 2014, pp. 290–294.

[5] Takuhiro Kaneko, Hirokazu Kameoka, Nobukatsu Hojo, Yusuke Ijima, Kaoru Hiramatsu, and Kunio Kashino, "Generative adversarial network-based postfilter for statistical parametric speech synthesis," in Proc. ICASSP 2017, pp. 4910–4914.

[6] Santiago Pascual, Antonio Bonafonte, and Joan Serrà, "SEGAN: Speech enhancement generative adversarial network," arXiv preprint arXiv:1703.09452, 2017.

[7] Takuhiro Kaneko and Hirokazu Kameoka, "Parallel-data-free voice conversion using cycle-consistent adversarial networks," arXiv preprint arXiv:1711.11293, 2017.

[8] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," arXiv preprint arXiv:1703.10593, 2017.

[9] Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier, "Language modeling with gated convolutional networks," arXiv preprint arXiv:1612.08083, 2016.

[10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.

[11] Alec Radford, Luke Metz, and Soumith Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv preprint arXiv:1511.06434, 2015.
[12] Yuki Saito, Shinnosuke Takamichi, and Hiroshi Saruwatari, "Statistical parametric speech synthesis incorporating generative adversarial networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 1, pp. 84–96, 2018.

[13] Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, and Stephen Paul Smolley, "Least squares generative adversarial networks," in Proc. ICCV 2017, pp. 2813–2821.

[14] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros, "Image-to-image translation with conditional adversarial networks," arXiv preprint, 2017.

[15] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros, "Context encoders: Feature learning by inpainting," in Proc. CVPR 2016, pp. 2536–2544.

[16] Richard W. Brislin, "Back-translation for cross-cultural research," Journal of Cross-Cultural Psychology, vol. 1, no. 3, pp. 185–216, 1970.

[17] Tinghui Zhou, Philipp Krähenbühl, Mathieu Aubry, Qixing Huang, and Alexei A. Efros, "Learning dense correspondence via 3D-guided cycle consistency," in Proc. CVPR 2016, pp. 117–126.

[18] Takuhiro Kaneko, Hirokazu Kameoka, Kaoru Hiramatsu, and Kunio Kashino, "Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks," in Proc. Interspeech 2017, pp. 1283–1287.

[19] Kou Tanaka, Hirokazu Kameoka, and Kazuho Morikawa, "VAE-SPACE: Deep generative model of voice fundamental frequency contours," in Proc. ICASSP 2018, 2018.
[20] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[21] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.

[22] Yaniv Taigman, Adam Polyak, and Lior Wolf, "Unsupervised cross-domain image generation," arXiv preprint arXiv:1611.02200, 2016.

[23] Vinod Nair and Geoffrey E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. ICML 2010, pp. 807–814.

[24] Hideki Kawahara, Ikuyo Masuda-Katsuse, and Alain de Cheveigné, "Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3, pp. 187–207, 1999.

[25] Hideki Kawahara, Jo Estill, and Osamu Fujimura, "Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT," in Second International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications, 2001.

[26] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al., "Photo-realistic single image super-resolution using a generative adversarial network," arXiv preprint, 2016.
[27] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proc. CVPR 2016, pp. 770–778.

[28] Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky, "Instance normalization: The missing ingredient for fast stylization," CoRR, vol. abs/1607.08022, 2016.

[29] Sergey Ioffe and Christian Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. ICML 2015, pp. 448–456.

[30] Diederik Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[31] Les Atlas and Shihab A. Shamma, "Joint acoustic and modulation frequency," EURASIP Journal on Advances in Signal Processing, vol. 2003, no. 7, pp. 310290, June 2003.