ConvS2S-VC: Fully Convolutional Sequence-to-Sequence Voice Conversion
Hirokazu Kameoka, Kou Tanaka, Damian Kwaśny, Takuhiro Kaneko, and Nobukatsu Hojo

Abstract—This paper proposes a voice conversion (VC) method using sequence-to-sequence (seq2seq or S2S) learning, which flexibly converts not only the voice characteristics but also the pitch contour and duration of input speech. The proposed method, called ConvS2S-VC, has three key features. First, it uses a model with a fully convolutional architecture. This is particularly advantageous in that it is suitable for parallel computations using GPUs. It is also beneficial since it enables effective normalization techniques such as batch normalization to be used for all the hidden layers in the networks. Second, it achieves many-to-many conversion by simultaneously learning mappings among multiple speakers using only a single model instead of separately learning mappings between each speaker pair using a different model. This enables the model to fully utilize available training data collected from multiple speakers by capturing common latent features that can be shared across different speakers. Owing to this structure, our model works reasonably well even without source speaker information, thus making it able to handle any-to-many conversion tasks. Third, we introduce a mechanism, called conditional batch normalization, that switches batch normalization layers in accordance with the target speaker. This particular mechanism has been found to be extremely effective for our many-to-many conversion model. We conducted speaker identity conversion experiments and found that ConvS2S-VC obtained higher sound quality and speaker similarity than baseline methods.
We also found from audio examples that it could perform well in various tasks including emotional expression conversion, electrolaryngeal speech enhancement, and English accent conversion.

Index Terms—Voice conversion (VC), sequence-to-sequence learning, attention, fully convolutional model, many-to-many VC.

I. INTRODUCTION

Voice conversion (VC) is a technique for converting para/non-linguistic information contained in a given utterance, such as the perceived identity of a speaker, while preserving linguistic information. Potential applications of this technique include speaker-identity modification [1], speaking aids [2], [3], speech enhancement [4]–[6], and accent conversion [7]. Many conventional VC methods are designed to use parallel utterances of source and target speech to train acoustic models for feature mapping. A typical pipeline of the training process consists of extracting acoustic features from source and target utterances, performing dynamic time warping (DTW) to obtain time-aligned parallel data, and training an acoustic model that maps the source features to the target features frame-by-frame.

H. Kameoka, K. Tanaka, D. Kwaśny, T. Kaneko, and N. Hojo are with NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation, Atsugi, Kanagawa, 243-0198 Japan (e-mail: hirokazu.kameoka.uh@hco.ntt.co.jp). This work was supported by JSPS KAKENHI 17H01763 and JST CREST Grant Number JPMJCR19A3, Japan.

Examples of the acoustic model include Gaussian mixture models (GMMs) [8]–[10] and deep neural networks (DNNs) [11]–[15]. Some attempts have also been made to develop methods that require no parallel utterances, transcriptions, or time alignment procedures.
Recently, deep generative models such as variational autoencoders (VAEs), cycle-consistent generative adversarial networks (CycleGANs), and star generative adversarial networks (StarGANs) have been used with notable success for non-parallel VC tasks [16]–[20].

One limitation of conventional methods, including those mentioned above, is that they focus mainly on learning to convert only the local spectral features and less on converting prosodic features such as the fundamental frequency (F0) contour, duration, and rhythm of the input speech. This is because the acoustic models in these methods are designed to describe mappings between local features only, which prevents a model from discovering word-level or sentence-level suprasegmental conversion rules. In most methods, the entire F0 contour is simply adjusted using a linear transformation in the logarithmic domain, while the duration and rhythm are usually kept unchanged. However, since these features play as important a role as local spectral features in characterizing speaker identities and speaking styles, it would be desirable if they could also be converted more flexibly.

To overcome this limitation, we need a model that can learn to convert entire feature sequences by capturing and utilizing long-term dependencies in source and target speech. To this end, we adopt a sequence-to-sequence (seq2seq or S2S) learning approach. S2S learning offers a general and powerful framework for transforming one sequence into another variable-length sequence [21], [22]. This is made possible by using encoder and decoder networks, where the encoder encodes an input sequence into an internal representation and the decoder generates an output sequence in accordance with that internal representation.
The original S2S model employs recurrent neural networks (RNNs) to model the encoder and decoder networks, where common choices for the RNN architectures involve long short-term memory (LSTM) networks and gated recurrent units (GRUs). This approach has attracted a lot of attention in recent years after being introduced and applied with notable success in various tasks such as machine translation, automatic speech recognition (ASR) [22], and text-to-speech (TTS) [23]–[29]. The original S2S model suffers from the constraint that all input sequences are forced to be encoded into a fixed-length internal vector. This limits the ability of the model, especially when it comes to long input sequences, such as long sentences in text translation problems. To overcome this limitation, a mechanism called "attention" [30] has been introduced, which enables the network to learn where to pay attention in the input sequence for each item in the output sequence.

While RNNs are a natural choice for modeling long sequential data, recent work has shown that convolutional neural networks (CNNs) with gating mechanisms also have excellent potential for capturing long-term dependencies [31], [32]. In addition, unlike RNNs, they are suitable for parallel computations using GPUs. To exploit this advantage of CNNs, an S2S model was recently proposed that adopts a fully convolutional architecture [33]. With this model, the decoder is designed using causal convolutions so that it can generate an output sequence autoregressively. This model with an attention mechanism is called the ConvS2S model and has already been applied successfully to machine translation [33] and TTS [27], [28]. Inspired by its success in these tasks, we propose a VC method based on the ConvS2S model, which we call ConvS2S-VC, along with an architecture tailored for use with VC.
In a wide sense, VC is a task of converting the domain of speech. Here, the types of domain include speaker identities, emotional expressions, speaking styles, and accents, but for concreteness, we will restrict our attention to speaker identity conversion tasks in the following. When we are interested in converting speech among multiple speakers, one naive way of applying the S2S model is to prepare and train a model for each speaker pair. However, this can be inefficient since the model for one pair of speakers fails to use the training data of the other speakers, even though there must be a common set of latent features that can be shared across different speakers, especially when the languages are the same. To fully utilize available training data collected from multiple speakers, we further propose an extension of the ConvS2S model that allows for "many-to-many" VC, which can learn mappings among multiple speakers using only a single model.

One important advantage of using fully convolutional networks is that they enable the use of batch normalization in all the hidden layers. This is practically beneficial since batch normalization is known to be significantly effective in not only accelerating training but also improving the generalization ability of the resulting models. Indeed, as described later, it also positively affected our pairwise model. However, for the many-to-many model, the distributions of the layer inputs can change depending on the source and target speakers, which may affect model training. To stabilize layer input distributions, we introduce a mechanism, called conditional batch normalization, that switches batch normalization layers in accordance with the source and target speakers. This particular mechanism was experimentally found to work very well.

II.
RELATED WORK

Some attempts have recently been made to apply S2S models to VC problems, including the ones we proposed previously [34], [35]. Although most S2S models typically require sufficiently large parallel corpora for training, collecting a sufficient number of parallel utterances is not always feasible. Thus, particularly in VC tasks, one challenge is how best to train S2S models using a limited amount of training data.

One idea involves using text labels as auxiliary information for model training, assuming they are readily available. For example, Miyoshi et al. proposed combining acoustic models for ASR and TTS with an S2S model [36], where an S2S model is used to convert the context posterior probability sequence produced by the ASR model and the TTS model is finally used to generate a target speech feature sequence. Zhang et al. also proposed an S2S model-based VC method guided by an ASR system, which augments inputs with bottleneck features obtained from a pretrained ASR system [37]. Subsequently, Zhang et al. proposed a shared model for TTS and VC tasks, which enables joint training of the TTS and VC functions [38]. Recently, Biadsy et al. proposed an end-to-end VC system called Parrotron, which is designed to train the encoder and decoder along with an ASR model on the basis of a multitask learning strategy [39]. Our method differs from these methods in that our model does not rely on ASR or TTS models and requires no text annotations for model training. Instead, we introduce several techniques to stabilize training and test prediction.

Haque et al. proposed a method that enables many-to-many VC similar to ours [40]. As detailed in Subsection IV-C, our many-to-many model differs in that it does not necessarily require source speaker information for the encoder, thus enabling it to also handle any-to-many VC tasks.
In addition, our method differs from all the methods mentioned above in that it adopts a fully convolutional model, which can be potentially advantageous in several ways, as already mentioned.

III. CONVS2S-VC

In this section, we start by describing a pairwise one-to-one conversion model and then present its multi-speaker extension that enables many-to-many VC. The overall architecture of the pairwise conversion model is illustrated in Fig. 1.

Fig. 1. Overall structure of the pairwise ConvS2S model.

A. Feature extraction and normalization

First, we define the acoustic features to be converted. Although one interesting option would be to consider directly converting time-domain signals, given the recent significant advances in high-quality neural vocoder systems [32], [41]–[50], we find it reasonable to consider converting acoustic features such as the mel-cepstral coefficients (MCCs) [51] and log F0, since we would expect to generate high-fidelity signals using a neural vocoder if we could obtain a sufficient set of acoustic features. In such systems, the model size of the converter can be made small enough to enable the system to work well even when a limited amount of training data is available. Hence, in this paper we choose to use the MCCs, log F0, aperiodicity, and voiced/unvoiced indicator of speech as acoustic features, as detailed below.

We first use the WORLD analyzer [52] to extract the spectral envelope, the log F0, the coded aperiodicity, and the voiced/unvoiced indicator within each time frame of a speech utterance, then compute I MCCs from the extracted spectral envelope, and finally construct an acoustic feature vector by stacking the MCCs, the log F0, the coded aperiodicity, and the voiced/unvoiced indicator. Thus, each acoustic feature vector consists of I + 3 elements.
Here, the log F0 contour is assumed to be filled with smoothly interpolated values in unvoiced segments. At training time, we normalize each element x_{i,n} (i = 1, ..., I) of the MCCs and the log F0 x_{I+1,n} at frame n to x_{i,n} ← (x_{i,n} − μ_i)/σ_i, where i, μ_i, and σ_i denote the feature index, the mean, and the standard deviation of the i-th feature within all the voiced segments of the training samples of the same speaker.

To accelerate and stabilize training and inference, we have found it useful to use a trick similar to the one introduced by Wang et al. [53]. Specifically, we divide the acoustic feature sequence obtained above into non-overlapping segments of equal length r and use the stack of the acoustic feature vectors in each segment as a new feature vector, so that the new feature sequence becomes r times shorter than the original one. Furthermore, we add the sinusoidal position encodings [54] to the reshaped version of the feature sequence before feeding it into the model.

B. Model

We hereafter use X^(s) = [x^(s)_1, ..., x^(s)_{N_s}] ∈ R^{D×N_s} and X^(t) = [x^(t)_1, ..., x^(t)_{N_t}] ∈ R^{D×N_t} to denote the source and target speech feature sequences of non-aligned parallel utterances, where N_s and N_t denote the lengths of the two sequences and D denotes the feature dimension. We consider an S2S model that aims to map X^(s) to X^(t). Our pairwise conversion model is inspired by and built upon the models presented by Vaswani et al. [54] and Tachibana et al. [27], with the difference being that it involves an additional network, called a target reconstructor. This network plays an important role in ensuring that the encoders preserve contextual information about the source and target speech, as explained below.
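To make the pre-processing of Subsection III-A concrete, the following numpy sketch shows the mean-variance normalization, the stacking of r consecutive frames into one vector, and the sinusoidal position encodings of [54]. The function names and the exact interleaving of sine and cosine channels are our own illustrative choices, not taken from the paper; the WORLD analysis itself is omitted.

```python
import numpy as np

def normalize(X, mu, sigma):
    # per-feature mean-variance normalization: x <- (x - mu) / sigma
    return (X - mu[:, None]) / sigma[:, None]

def stack_frames(X, r):
    # X: (D, N) feature sequence -> (D*r, N // r); each new column stacks
    # r consecutive frames, shortening the sequence by a factor of r
    D, N = X.shape
    n = N // r
    return X[:, :n * r].T.reshape(n, r * D).T

def positional_encoding(D, N):
    # sinusoidal position encodings: even channels use sin, odd use cos
    pos = np.arange(N)[None, :]
    i = np.arange(D)[:, None]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / D)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))
```

With I = 28 MCCs plus the 3 prosodic features and r = 3, for instance, a (31, N) sequence becomes a (93, N // 3) sequence, to which the position encodings are added.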
Our model thus consists of four networks: source and target encoders, a target decoder, and a target reconstructor. As with many S2S models, our model has an encoder-decoder structure (Fig. 1). The source and target encoders are expected to extract contextual information from source and target speech. Given the contextual vector sequence pair produced by the encoders, we can compute a contextual similarity matrix between the source and target speech, which can be used to warp the time axis of the source speech. We can then generate the feature sequence of the target speech by letting the target decoder transform each element of the time-warped version of the contextual vector sequence of the source speech.

This idea can be formulated as follows. The source encoder takes X^(s) as the input and produces two internal vector sequences K, V ∈ R^{D'×N_s}, and the target encoder takes X^(t) as the input and produces an internal vector sequence Q ∈ R^{D'×N_t}:

[K; V] = SrcEnc(X^(s)),  (1)
Q = TrgEnc(X^(t)),  (2)

where [;] denotes vertical concatenation of matrices (or vectors) with compatible sizes and D' denotes the dimension of the internal vectors. K, V, and Q can be metaphorically interpreted as the key-value pairs and the queries in a hash table. By using the query and key pair, we can define an attention matrix A ∈ R^{N_s×N_t} as

A = softmax(K^T Q / √D'),  (3)

where softmax denotes a softmax operation performed on the first axis. A can be seen as a similarity matrix, where the (n, m)-th element indicates the similarity between the n-th and m-th frames of the source and target speech. The peak trajectory of A can therefore be interpreted as a time-warping function that associates the frames of the source speech with those of the target speech.
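Eq. (3) is an ordinary scaled dot-product attention with the softmax taken over source frames. A minimal numpy sketch, assuming the encoder outputs K and Q are already given (the function name is ours):

```python
import numpy as np

def attention_matrix(K, Q):
    # K: (Dp, Ns), Q: (Dp, Nt). Softmax over the first (source) axis, so
    # each column of A is a distribution over source frames, as in Eq. (3).
    Dp = K.shape[0]
    logits = K.T @ Q / np.sqrt(Dp)               # (Ns, Nt)
    logits -= logits.max(axis=0, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=0, keepdims=True)
```

The time-warped value sequence of Eq. (4) is then simply `V @ A`.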
The time-warped version of the value vector sequence V is thus given as

R = V A,  (4)

which is passed to the target decoder to generate an output sequence:

Y = TrgDec(R).  (5)

Since the target speech feature sequence X^(t) is of course not accessible at test time, we want to use a feature vector that the target decoder has generated as the input to the target encoder for the next time step, so that feature vectors can be generated one-by-one recursively. To enable the model to behave in this way, first, we must ensure that the target encoder and decoder do not use future information when producing an output vector at each time step. This can be ensured by simply constraining the convolution layers in the target encoder and decoder to be causal. Note that causal convolution can be easily implemented by padding the input by δ(κ − 1) elements on both the left and right sides with zero vectors and removing δ(κ − 1) elements from the end of the convolution output, where κ is the kernel size and δ is the dilation factor. Second, the output sequence Y must correspond to a time-shifted version of X^(t), so that at each time step the decoder will be able to predict the target speech feature vector that is likely to be generated at the next time step. To this end, we include an L1 loss

L_dec = (1/N_t) ‖Y_{:,1:N_t−1} − X^(t)_{:,2:N_t}‖_1,  (6)

in the training loss to be minimized, where we have used the colon operator : to specify the range of indices of the elements in a matrix or a vector we wish to extract. For ease of notation, we use : itself to represent all elements along an axis. For example, X^(t)_{:,2:N_t} denotes a submatrix consisting of the elements in all the rows and columns 2, ..., N_t of X^(t). Third, the first column of X^(t) must correspond to an initial vector with which the recursion is assumed to start.
We thus assume that it is always set at an all-zero vector.

The source and target encoders are free to ignore the information contained in the feature vector inputs when finding a time alignment between source and target speech. One natural way to ensure that K, V, and Q contain the information necessary for finding an appropriate time alignment is to encourage them to preserve sufficient information for reconstructing the input feature sequences. To this end, we introduce a target reconstructor that aims to reconstruct the feature sequence of target speech X^(t) from K, V, and Q:

Ỹ = TrgRec(R),  (7)

and include a reconstruction loss

L_rec = (1/M) ‖Ỹ − X^(t)‖_1,  (8)

in the training loss to be minimized. This idea was introduced in our previous work [35]. We call Eq. (8) the context preservation loss. Although the reconstructor and the decoder may appear to have similar roles, the difference is that the reconstructor is only responsible for making each column of R contain sufficient information about the current value of the target feature sequence, so that the decoder can concentrate on predicting the future value using that information.

As detailed in Subsection V-B, all the networks are designed using fully convolutional architectures built from gated linear units (GLUs) [31] with residual connections. The output of the GLU block used in the present model is defined as GLU(X) = B_1(L_1(X)) ⊙ sigmoid(B_2(L_2(X))), where X is the layer input, L_1 and L_2 are dilated convolution layers, B_1 and B_2 are batch normalization layers, and sigmoid is a sigmoid gate function. Similar to LSTMs, GLUs can reduce the vanishing gradient problem for deep architectures by providing a linear path for the gradients while retaining non-linear capabilities.
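As an illustration, the causal dilated convolution trick and the GLU block just described can be sketched for a single 1-D channel as follows. This is a simplified sketch, not the actual network code: the batch normalizations B_1 and B_2 are omitted for brevity (their omission is the main simplification), and the residual connection is included as in the paper's architecture.

```python
import numpy as np

def causal_conv1d(x, w, delta=1):
    # Pad delta*(kappa-1) zeros on both sides, take a "valid" dilated
    # convolution, then drop delta*(kappa-1) samples from the end, so
    # that output[n] depends only on x[:n+1] (the trick in Sec. III-B).
    kappa = len(w)
    p = delta * (kappa - 1)
    xp = np.pad(x, (p, p))
    y = np.array([xp[n:n + p + 1:delta] @ w for n in range(len(xp) - p)])
    return y[:len(x)]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu_block(x, w1, w2, delta=1):
    # GLU(X) = L1(X) * sigmoid(L2(X)) with a residual connection;
    # the batch normalizations are left out of this sketch
    return x + causal_conv1d(x, w1, delta) * sigmoid(causal_conv1d(x, w2, delta))
```

The padding-and-trimming scheme guarantees that changing a future input sample never affects past outputs, which is exactly the property needed for autoregressive generation.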
C. Constraints on Attention Matrix

It would be natural to assume that the time alignment between parallel utterances is usually monotonic and nearly linear. This implies that the diagonal region in the attention matrix A should always be dominant. We expect that imposing such restrictions on A can significantly reduce the training effort, since the search space for A can be greatly reduced. To penalize A for not having a diagonally dominant structure, we introduce a diagonal attention loss (DAL) [27]:

L_dal = (1/(N_s N_t)) ‖W_{N_s×N_t}(ν) ⊙ A‖_1,  (9)

where ⊙ is the elementwise product and W_{N_s×N_t}(ν) ∈ R^{N_s×N_t} is a non-negative weight matrix whose (n, m)-th element w_{n,m} is defined as w_{n,m} = 1 − exp(−(n/N_s − m/N_t)²/(2ν²)). Fig. 2 shows plots of W_{N_s×N_t}(ν).

Fig. 2. Plots of W_{N_s×N_t}(0.3) (left) and W_{N_s×N_t}(0.1) (right), where the lengths of the source and target speech are 2.4 s and 3.0 s, respectively.

Each time point of the target feature sequence must correspond to only one or at most a few time points of the source feature sequence. This implies that two different columns in A must be as orthogonal as possible. Although the DAL with a sufficiently small ν value can induce orthogonality, it may also lead to undesirable situations where the time alignment between the two sequences is forced to be always strictly linear. Thus, ν must not be set to too small a value, so that reasonably flexible time alignments remain possible. To achieve orthogonality while allowing ν to take a moderately greater value, we propose introducing another loss to constrain A, which we call the orthogonal attention loss (OAL):

L_oal = (1/N_s²) ‖W_{N_s×N_s}(ρ) ⊙ (A A^T)‖_1.  (10)
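A minimal numpy sketch of the two attention penalties, Eqs. (9) and (10), using the weight matrix w_{n,m} defined above (function names are ours):

```python
import numpy as np

def attention_weight(Ns, Nt, nu):
    # w[n, m] = 1 - exp(-(n/Ns - m/Nt)^2 / (2 nu^2)): close to 0 near the
    # diagonal and approaching 1 away from it (cf. Fig. 2)
    n = np.arange(Ns)[:, None] / Ns
    m = np.arange(Nt)[None, :] / Nt
    return 1.0 - np.exp(-(n - m) ** 2 / (2.0 * nu ** 2))

def diagonal_attention_loss(A, nu=0.3):
    # Eq. (9): penalize attention mass far from the diagonal
    Ns, Nt = A.shape
    return np.abs(attention_weight(Ns, Nt, nu) * A).sum() / (Ns * Nt)

def orthogonal_attention_loss(A, rho=0.1):
    # Eq. (10): penalize off-diagonal mass in A A^T, pushing different
    # columns of A toward orthogonality
    Ns = A.shape[0]
    return np.abs(attention_weight(Ns, Ns, rho) * (A @ A.T)).sum() / Ns ** 2
```

A strictly diagonal alignment (A = I) incurs zero loss under both penalties, while a uniform ("blurry") A is penalized by both.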
D. Training loss

Given examples of parallel utterances, the total training loss for the ConvS2S-VC model to be minimized is given as

L = E_{X^(s),X^(t)} {L_dec + λ_r L_rec + λ_d L_dal + λ_o L_oal},  (11)

where E_{X^(s),X^(t)} {·} is the sample mean over all the training examples and λ_r ≥ 0, λ_d ≥ 0, and λ_o ≥ 0 are regularization parameters, which weigh the importances of L_rec, L_dal, and L_oal relative to L_dec.

E. Conversion process

At test time, we can convert a source speech feature sequence X via the following recursion:

  [K; V] = SrcEnc(X)
  Y ← 0
  for m = 1 to M' do
    Q = TrgEnc(Y)
    A = softmax(K^T Q / √D')
    R = V A
    Y = TrgDec(R)
    Y ← [0, Y]
  end for
  return Y

However, as Fig. 3 shows, it transpired that with this algorithm the attended time point does not always move forward monotonically and continuously at test time; it can occasionally become stuck at the same time point or suddenly jump to a distant time point, even though the diagonal and orthogonal losses are considered in training.

Fig. 3. Attention matrices predicted without (left) and with (right) forward attention.

To assist the attended point to move forward monotonically and continuously, we limit the paths through which the attended point is allowed to move by forcing the attentions to the time points distant from the peak of the attention distribution obtained at the previous time step to zeros. This can be implemented, for instance, as follows:

  [K; V] = SrcEnc(X)
  Y ← 0
  for m = 1 to M do
    Q = TrgEnc(Y)
    A = softmax(K^T Q / √D')
    if m > 1 then
      a = A_{1:N,m}
      a_{1:max(1, n̂−N_0)} = 0,  a_{min(n̂+N_1, N):N} = 0
      a ← a / sum(a)
      A ← [A_{1:N,1:m−1}, a]
    end if
    n̂ = argmax_n A_{n,m}
    R = V A
    Y = TrgDec(R)
    Y ← [0, Y]
  end for
  return Y

where sum(·) denotes the sum of all the elements in a vector.
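The windowing step inside the forward-attention loop above can be sketched as follows; n_hat is the attention peak from the previous output frame, and N0 and N1 are the backward and forward window extents. Indexing is zero-based here, and the function name is ours:

```python
import numpy as np

def forward_attention_column(a, n_hat, N0, N1):
    # Zero the attention outside the window [n_hat - N0, n_hat + N1)
    # around the previous peak, then renormalize to sum to one.
    a = a.copy()
    a[:max(0, n_hat - N0)] = 0.0
    a[min(n_hat + N1, len(a)):] = 0.0
    return a / a.sum()
```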
Note that we set N_0 and N_1 at the nearest integers that correspond to 160 ms and 320 ms, respectively. Fig. 3 shows an example of how different the attention matrices look when this procedure has been undertaken.

After we obtain R with the above algorithm, we can use the target reconstructor to compute Ỹ = TrgRec(R) and use it instead of Y as the feature sequence of the converted speech. Once Y or Ỹ has been obtained, we adjust the mean and variance of the generated feature sequence so that they match the pretrained mean and variance of the feature vectors of the target speaker. We can then generate a time-domain signal using the WORLD vocoder or any recently developed neural vocoder [32], [41]–[50]. Note that in the following experiments, we chose to use Ỹ for final waveform generation, as it resulted in better-sounding speech.

F. Real-Time System Design

Real-time requirements must be considered when building VC systems. If we want our model to work in real time, first, we must not allow the source encoder to use future information, as with the target encoder and decoder during training. This requirement can easily be implemented by constraining the convolution layers in the source encoder (and the target reconstructor, if we assume it is used to generate the converted feature sequence) to be causal. Another point we must consider is that the speaking rate and rhythm of input speech cannot be changed drastically at test time. One simple way of keeping them unchanged is to set A to an identity matrix. In this way, the autoregressive recursion is no longer needed and the conversion can be performed in a sliding-window fashion as:

  [K; V] = SrcEnc(X)
  Y = TrgDec(V)  or  Y = TrgRec(V)
  return Y

We will show later how these modifications can affect the VC performance.
Note that even under this setting, the ability to learn and apply conversion rules that capture long-term dependencies is still effective.

G. Impact of Batch Normalization

As mentioned earlier, using fully convolutional architectures allows the use of batch normalization for all the hidden layers in the networks, which is not straightforward for architectures including recurrent modules. One benefit of using batch normalization layers is that they enable the networks to use a higher learning rate without vanishing or exploding gradients. Batch normalization is also believed to help regularize the networks so that they generalize more easily and overfitting is mitigated. The effect of batch normalization will be verified experimentally in Section V.

IV. MANY-TO-MANY CONVS2S-VC

A. Model and Training Loss

We now describe an extension of the ConvS2S model that enables many-to-many VC. Here, the idea is to use a single model to achieve mappings among multiple speakers. The model consists of the same set of networks as the pairwise model. The only difference is that each network takes a speaker index as an additional input. Let X^(1), ..., X^(K) be examples of the acoustic feature sequences of speech of different speakers reading the same sentence. Given a single pair of parallel utterances X^(k) and X^(k'), where k and k' denote the source and target speaker indices (integers), the source encoder takes X^(k) and the source speaker index k as the inputs and produces two internal vector sequences K^(k), V^(k), whereas the target encoder takes X^(k') and the target speaker index k' as the inputs and produces an internal vector sequence Q^(k'):

[K^(k); V^(k)] = SrcEnc(X^(k), k),  (12)
Q^(k') = TrgEnc(X^(k'), k').  (13)
The attention matrix A^(k,k') and the time-warped version of V^(k) are then computed using K^(k) and Q^(k'):

A^(k,k') = softmax((K^(k))^T Q^(k') / √D'),  (14)
R^(k,k') = V^(k) A^(k,k').  (15)

Fig. 4. Examples of the attention matrices predicted from test input female speech using the many-to-many model: with batch normalization (left) and with conditional batch normalization (right).

The outputs of the reconstructor and decoder given the input R^(k,k') with target speaker conditioning are finally given as

Ỹ^(k,k') = TrgRec(R^(k,k'), k'),  (16)
Y^(k,k') = TrgDec(R^(k,k'), k').  (17)

The loss functions to be minimized given this single training example are given as

L^(k,k')_dec = (1/N_{k'}) ‖Y^(k,k')_{:,1:N_{k'}−1} − X^(k')_{:,2:N_{k'}}‖_1,  (18)
L^(k,k')_dal = (1/(N_k N_{k'})) ‖W_{N_k×N_{k'}}(ν) ⊙ A^(k,k')‖_1,  (19)
L^(k,k')_oal = (1/N_k²) ‖W_{N_k×N_k}(ρ) ⊙ (A^(k,k') (A^(k,k'))^T)‖_1,  (20)
L^(k,k')_rec = (1/N_{k'}) ‖Ỹ^(k,k') − X^(k')‖_1.  (21)

With the above model, it would also be reasonable to consider the case where k = k'. Minimizing the corresponding loss amounts to ensuring that the input feature sequence X^(k) remains unchanged when the source and target speakers are the same. We call this loss the "identity mapping loss (IML)". The effect given by this loss will be shown later. Hence, the total training loss to be minimized becomes

L = Σ_{k, k'≠k} E_{X^(k),X^(k')} [L^(k,k')_all] + λ_i Σ_k E_{X^(k)} [L^(k,k)_all],
L^(k,k')_all = L^(k,k')_dec + λ_r L^(k,k')_rec + λ_d L^(k,k')_dal + λ_o L^(k,k')_oal,  (22)

where E_{X^(k),X^(k')} [·] and E_{X^(k)} [·] denote the sample means over all the training examples of parallel utterances of speakers k and k', and λ_i ≥ 0 is a regularization parameter, which weighs the importance of the IML.
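Assuming the four per-pair losses of Eqs. (18)–(21) are already computed, the aggregation in Eq. (22) is a weighted sum in which the identity-mapping terms (k = k') are scaled by λ_i. A minimal sketch (the dictionary layout and function name are ours):

```python
def total_loss(pair_losses, lam_r, lam_d, lam_o, lam_i):
    # pair_losses maps (k, k') to a dict with the four per-pair losses of
    # Eqs. (18)-(21); identity pairs (k == k') are weighted by lam_i
    total = 0.0
    for (k, kp), l in pair_losses.items():
        l_all = (l['dec'] + lam_r * l['rec']
                 + lam_d * l['dal'] + lam_o * l['oal'])
        total += lam_i * l_all if k == kp else l_all
    return total
```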
B. Conditional Batch Normalization

The left panel in Fig. 4 shows the attention matrix predicted from input female speech using the many-to-many model with regular batch normalization layers. As this example shows, attention matrices predicted by the many-to-many model tended to become blurry, mostly resulting in unintelligible speech. We conjecture that this was caused by the fact that the distributions of the inputs to the hidden layers can change in accordance with the source and/or target speakers. To normalize layer input distributions on a speaker-dependent basis, we propose using conditional batch normalization layers for the many-to-many model. Each element y_{b,d,n} of the output of a regular batch normalization layer Y = B(X) is defined as

y_{b,d,n} = γ_d (x_{b,d,n} − μ_d(X))/σ_d(X) + β_d,

where X denotes the layer input given by a three-way array with batch, channel, and time axes, x_{b,d,n} denotes its (b, d, n)-th element, μ_d(X) and σ_d(X) denote the mean and standard deviation of the d-th channel components of X computed along the batch and time axes, and γ = [γ_1, ..., γ_D] and β = [β_1, ..., β_D] denote the parameters to be learned. In contrast, the output of a conditional batch normalization layer Y = B_k(X) is defined as

y_{b,d,n} = γ^k_d (x_{b,d,n} − μ_d(X))/σ_d(X) + β^k_d,

where the only difference is that the parameters γ^k = [γ^k_1, ..., γ^k_D] and β^k = [β^k_1, ..., β^k_D] are conditioned on speaker k. Note that a similar idea, called conditional instance normalization, has been introduced to modify the instance normalization process for image style transfer [55] and non-parallel VC [56].

C. Any-to-Many Conversion

With the models presented above, the source speaker must be known and specified during both training and inference.
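As a concrete illustration of the conditional batch normalization layer defined in Subsection IV-B, the following numpy sketch shows the forward computation; gamma and beta hold one (γ^k, β^k) row per speaker, and only these affine parameters switch with the speaker index k. This is a simplified sketch (training-time statistics only, no running averages), and the signature is ours:

```python
import numpy as np

def conditional_batch_norm(X, k, gamma, beta, eps=1e-5):
    # X: (batch, channel, time) array; k: speaker index;
    # gamma, beta: (num_speakers, D) learned per-speaker scale/offset.
    # Per-channel statistics are computed along the batch and time axes.
    mu = X.mean(axis=(0, 2), keepdims=True)
    sd = X.std(axis=(0, 2), keepdims=True)
    Xn = (X - mu) / (sd + eps)
    return gamma[k][None, :, None] * Xn + beta[k][None, :, None]
```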
However, there can be certain situations where the source speaker is unknown or arbitrary. We call VC tasks in such scenarios any-to-one or any-to-many VC. Our many-to-many model can be modified to handle any-to-many VC tasks by not allowing the source encoder to take the source speaker index k as an input at either training or test time. The modified version can be formulated by simply replacing Eq. (12) in the many-to-many model with

[K^{(k)}; V^{(k)}] = \mathrm{SrcEnc}(X^{(k)}).   (23)

V. EXPERIMENTS

A. Experimental Settings

To evaluate the effects of the ideas presented in Sections III and IV, we conducted objective and subjective evaluation experiments involving a speaker identity conversion task. For the experiments, we used the CMU Arctic database [57], which consists of recordings of 1,132 phonetically balanced English utterances spoken by four US English speakers. We used all the speakers (clb (female), bdl (male), slt (female), and rms (male)) for training and evaluation. Thus, in total there were 12 different combinations of source and target speakers. The audio files for each speaker were manually divided into 1,000 and 132 files, which were provided as training and evaluation sets, respectively. All the speech signals were sampled at 16 kHz. As already detailed in Subsection III-A, for each utterance, the spectral envelope, log F0, coded aperiodicity, and voiced/unvoiced information were extracted every 8 ms using the WORLD analyzer [52]. Then, 28 MCCs were extracted from each spectral envelope using the Speech Processing Toolkit (SPTK) [58]. The reduction factor r was set to 3. Hence, the dimension of the acoustic feature vector was D = (28 + 3) × 3 = 93.

B. Network Architectures

We use the notations in Tab. I to describe the network architectures. The architectures of all the networks in the pairwise and many-to-many models are detailed in Tab. II.
Note that in Tab. II the layer index is omitted for simplicity of notation, and each layer has a different set of free parameters even when the same symbol is used.

TABLE I
NOTATIONS FOR NETWORK ARCHITECTURE DESCRIPTIONS

Notation | Meaning
X (bold symbol) | two-way array with channel and time axes
(f ∘ g)(X) | function composition f(g(X))
(∘_{n=0}^{N} f_n)(X) | multiple compositions (f_N ∘ · · · ∘ f_1 ∘ f_0)(X)
RL^{o←i}_{κ⋆δ}(X) | 1D regular (non-causal) convolution (κ: kernel size, δ: dilation factor, i: input channel size, o: output channel size)
CL^{o←i}_{κ⋆δ}(X) | 1D causal convolution (κ: kernel size, δ: dilation factor, i: input channel size, o: output channel size)
B(X) | normalization (BN or IN)
B^k(X) | conditional normalization (CBN or CIN) conditioned on speaker k
ResRGLU^{o←i}_{κ⋆δ}(X) | B(RL^{o←i}_{κ⋆δ}(X)) ⊙ sigmoid(B(RL^{o←i}_{κ⋆δ}(X))) + X_{1:o,:}
ResCGLU^{o←i}_{κ⋆δ}(X) | B(CL^{o←i}_{κ⋆δ}(X)) ⊙ sigmoid(B(CL^{o←i}_{κ⋆δ}(X))) + X_{1:o,:}
ResRGLU^{k,o←i}_{κ⋆δ}(X) | B^k(RL^{o←i}_{κ⋆δ}(X)) ⊙ sigmoid(B^k(RL^{o←i}_{κ⋆δ}(X))) + X_{1:o,:}
ResCGLU^{k,o←i}_{κ⋆δ}(X) | B^k(CL^{o←i}_{κ⋆δ}(X)) ⊙ sigmoid(B^k(CL^{o←i}_{κ⋆δ}(X))) + X_{1:o,:}
drop(X) | dropout with ratio 0.1
embed_o(k) | retrieval of an o-dimensional embedding vector for an integer k
A_Z(X) | appending a broadcast version of Z (expanded along the time axis) to X along the channel dimension

TABLE II
NETWORK ARCHITECTURES OF PAIRWISE AND MANY-TO-MANY MODELS

pairwise:
SrcEnc(X) = (RL^{512←256}_{1⋆1} ∘ (∘_{l'=0}^{2}(∘_{l=0}^{3} ResRGLU^{256←256}_{5⋆3^l})) ∘ B ∘ RL^{256←93}_{1⋆1} ∘ drop)(X)
TrgEnc(Y) = (CL^{256←256}_{1⋆1} ∘ (∘_{l'=0}^{2}(∘_{l=0}^{3} ResCGLU^{256←256}_{3⋆3^l})) ∘ B ∘ CL^{256←93}_{1⋆1} ∘ drop)(Y)
TrgRec(R) = (RL^{93←256}_{1⋆1} ∘ (∘_{l'=0}^{2}(∘_{l=0}^{3} ResRGLU^{256←256}_{5⋆3^l})) ∘ B ∘ RL^{256←256}_{1⋆1} ∘ drop)(R)
TrgDec(R) = (CL^{93←256}_{1⋆1} ∘ (∘_{l'=0}^{2}(∘_{l=0}^{3} ResCGLU^{256←256}_{3⋆3^l})) ∘ B ∘ CL^{256←256}_{1⋆1} ∘ drop)(R)

many-to-many (batch):
SrcEnc(X, k) = (RL^{1024←544}_{1⋆1} ∘ A_Z ∘ (∘_{l'=0}^{2}(∘_{l=0}^{3}(ResRGLU^{k,512←544}_{5⋆3^l} ∘ A_Z))) ∘ B^k ∘ RL^{512←125}_{1⋆1} ∘ A_Z ∘ drop)(X)
TrgEnc(Y, k') = (CL^{512←544}_{1⋆1} ∘ A_{Z'} ∘ (∘_{l'=0}^{2}(∘_{l=0}^{3}(ResCGLU^{k',512←544}_{3⋆3^l} ∘ A_{Z'}))) ∘ B^{k'} ∘ CL^{512←125}_{1⋆1} ∘ A_{Z'} ∘ drop)(Y)
TrgRec(R, k') = (RL^{93←544}_{1⋆1} ∘ A_{Z'} ∘ (∘_{l'=0}^{2}(∘_{l=0}^{3}(ResRGLU^{k',512←544}_{5⋆3^l} ∘ A_{Z'}))) ∘ B^{k'} ∘ RL^{512←544}_{1⋆1} ∘ A_{Z'} ∘ drop)(R)
TrgDec(R, k') = (CL^{93←544}_{1⋆1} ∘ A_{Z'} ∘ (∘_{l'=0}^{2}(∘_{l=0}^{3}(ResCGLU^{k',512←544}_{3⋆3^l} ∘ A_{Z'}))) ∘ B^{k'} ∘ CL^{512←544}_{1⋆1} ∘ A_{Z'} ∘ drop)(R)
where Z = embed_32(k) and Z' = embed_32(k')

any-to-many:
SrcEnc(X) = (RL^{1024←512}_{1⋆1} ∘ (∘_{l'=0}^{2}(∘_{l=0}^{3} ResRGLU^{512←512}_{5⋆3^l})) ∘ B ∘ RL^{512←93}_{1⋆1} ∘ drop)(X)
others: same as above

many-to-many (real-time):
SrcEnc(X, k) = (CL^{1024←544}_{1⋆1} ∘ A_Z ∘ (∘_{l'=0}^{2}(∘_{l=0}^{3}(ResCGLU^{k,512←544}_{3⋆3^l} ∘ A_Z))) ∘ B^k ∘ CL^{512←125}_{1⋆1} ∘ A_Z ∘ drop)(X)
TrgRec(R, k') = (CL^{93←544}_{1⋆1} ∘ A_{Z'} ∘ (∘_{l'=0}^{2}(∘_{l=0}^{3}(ResCGLU^{k',512←544}_{3⋆3^l} ∘ A_{Z'}))) ∘ B^{k'} ∘ CL^{512←544}_{1⋆1} ∘ A_{Z'} ∘ drop)(R)
TrgDec(R, k') = same as above

C. Hyperparameter Settings

\lambda_r, \lambda_d, \lambda_o, and \lambda_i were set at 1, 2000, 2000, and 1, respectively. \nu and \rho were both set at 0.3 for the pairwise and many-to-many models. The L1 norm \| X \|_1 used in (6), (8), (18), and (21) was defined as the weighted norm

\| X \|_1 = \sum_{n=1}^{N} \frac{1}{r} \sum_{j=1}^{r} \sum_{i=1}^{31} \alpha_i | x_{ij,n} |,

where x_{1j,n}, \ldots, x_{28j,n}, x_{29j,n}, x_{30j,n}, and x_{31j,n} denote the entries of X corresponding to the 28 MCCs, log F0, coded aperiodicity, and voiced/unvoiced indicator at time n, and the weights were set at \alpha_1 = \cdots = \alpha_{28} = 1/28, \alpha_{29} = 1/10, and \alpha_{30} = \alpha_{31} = 1/50, respectively. All the networks were trained simultaneously with random initialization. Adam optimization [59] was used for model training, where the mini-batch size was 16 and 25,000 iterations were run. The learning rate and the exponential decay rate for the first moment of Adam were set at 0.00015 and 0.9.

D. Objective Performance Measures

The test dataset consists of speech samples of each speaker reading the same sentences. Thus, the quality of a converted feature sequence can be assessed by comparing it with the feature sequence of the reference utterance.

1) Mel-cepstral distortion (MCD): Given two mel-cepstra, \hat{x} = [\hat{x}_1, \ldots, \hat{x}_{28}]^{\mathsf{T}} and x = [x_1, \ldots, x_{28}]^{\mathsf{T}}, we can use the mel-cepstral distortion (MCD)

\mathrm{MCD\,[dB]} = \frac{10}{\ln 10} \sqrt{ 2 \sum_{i=2}^{28} (\hat{x}_i - x_i)^2 },   (24)

to measure their difference. Here, we used the average of the MCDs taken along the DTW path between the converted and reference feature sequences as the objective performance measure for each test utterance.
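As a concrete reference, the per-frame distortion of Eq. (24) can be sketched in NumPy as follows. This is a minimal illustration only; the DTW-path averaging over frames described above is omitted, and the function name is ours, not the authors'.

```python
import numpy as np

def mcd_db(x_hat, x):
    """Mel-cepstral distortion of Eq. (24) between two 28-dimensional MCC
    vectors; the first coefficient (i = 1) is excluded from the sum."""
    diff = np.asarray(x_hat, float)[1:] - np.asarray(x, float)[1:]
    return (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2))
```

In practice, this value would be computed for every aligned frame pair along the DTW path and then averaged per utterance.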
2) Log F0 correlation coefficient (LFC): To evaluate the F0 contour of converted speech, we used the correlation coefficient between the predicted and target log F0 contours [60] as the objective performance measure. Since the converted and reference utterances were not necessarily aligned in time, we computed the correlation coefficient after properly aligning them. Here, we used the MCC sequences \hat{X}_{1:28,1:N} and X_{1:28,1:M} of the converted and reference utterances to find a phoneme-based alignment, assuming that the predicted and reference MCCs at corresponding frames are sufficiently close. Given the log F0 contours \hat{X}_{29,1:N}, X_{29,1:M} and the voiced/unvoiced indicator sequences \hat{X}_{31,1:N}, X_{31,1:M} of the converted and reference utterances, we first warp the time axis of \hat{X}_{29,1:N} and \hat{X}_{31,1:N} in accordance with the DTW path between the MCC sequences of the two utterances and obtain their time-warped versions, \tilde{X}_{29,1:M} and \tilde{X}_{31,1:M}. We then extract the elements of \tilde{X}_{29,1:M} and X_{29,1:M} at all the time points corresponding to the voiced segments, namely \{ m \mid \tilde{X}_{31,m} = X_{31,m} = 1 \}. If we use \tilde{y} = [\tilde{y}_1, \ldots, \tilde{y}_{M'}] and y = [y_1, \ldots, y_{M'}] to denote the vectors consisting of the elements extracted from \tilde{X}_{29,1:M} and X_{29,1:M}, we can use the correlation coefficient between \tilde{y} and y,

R = \frac{ \sum_{m'=1}^{M'} (\tilde{y}_{m'} - \tilde{\varphi})(y_{m'} - \varphi) }{ \sqrt{ \sum_{m'=1}^{M'} (\tilde{y}_{m'} - \tilde{\varphi})^2 } \sqrt{ \sum_{m'=1}^{M'} (y_{m'} - \varphi)^2 } },   (25)

where \tilde{\varphi} = \frac{1}{M'} \sum_{m'=1}^{M'} \tilde{y}_{m'} and \varphi = \frac{1}{M'} \sum_{m'=1}^{M'} y_{m'}, to measure the similarity between the two log F0 contours. In the current experiment, we used the average of the correlation coefficients taken over all the test utterances as the objective performance measure for log F0 prediction. Thus, the closer it is to 1, the better the performance.
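Assuming log-F0 contours that have already been time-warped onto common frames as described above, the voiced-frame correlation of Eq. (25) can be sketched as follows (function and variable names are illustrative, not from the authors' code):

```python
import numpy as np

def lfc(logf0_conv, vuv_conv, logf0_ref, vuv_ref):
    """Correlation coefficient of Eq. (25), computed only over frames that
    are voiced in both the converted and reference utterances."""
    voiced = (np.asarray(vuv_conv) == 1) & (np.asarray(vuv_ref) == 1)
    y_tilde = np.asarray(logf0_conv, float)[voiced]
    y = np.asarray(logf0_ref, float)[voiced]
    # center both contours and take the normalized inner product
    y_tilde = y_tilde - y_tilde.mean()
    y = y - y.mean()
    return float(np.sum(y_tilde * y) /
                 (np.sqrt(np.sum(y_tilde ** 2)) * np.sqrt(np.sum(y ** 2))))
```

The DTW warping itself is assumed to have been applied beforehand, as in the measure defined above.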
We call this measure the log F0 correlation coefficient (LFC).

3) Local duration ratio (LDR): To evaluate the speaking rate and the rhythm of converted speech, we used the local slopes of the DTW path between the converted and reference utterances to define the objective performance measure. If the speaking rate and the rhythm of the two utterances are exactly the same, all the local slopes should be 1. Hence, the better the conversion, the closer the local slopes become to 1. To compute the local slopes, we undertook the following process. Given the MCC sequences \hat{X}_{1:28,1:N} and X_{1:28,1:M} of the converted and reference utterances, we first performed DTW on \hat{X}_{1:28,1:N} and X_{1:28,1:M}. If we use (p_1, q_1), \ldots, (p_j, q_j), \ldots, (p_J, q_J) to denote the obtained DTW path, where (p_1, q_1) = (1, 1) and (p_J, q_J) = (M, N), we computed the slope of the regression line fitted to the 33 consecutive local points around each j:

s_j = \frac{ \sum_{j'=j-16}^{j+16} (p_{j'} - \bar{p}_j)(q_{j'} - \bar{q}_j) }{ \sum_{j'=j-16}^{j+16} (p_{j'} - \bar{p}_j)^2 },   (26)

where \bar{p}_j = \frac{1}{33} \sum_{j'=j-16}^{j+16} p_{j'} and \bar{q}_j = \frac{1}{33} \sum_{j'=j-16}^{j+16} q_{j'}, and then computed the median of s_1, \ldots, s_J. We call this measure the local duration ratio (LDR). The greater this ratio, the longer the duration of the converted utterance relative to the reference utterance. In the following, we use the mean absolute difference between the LDRs and 1 (in percentages) as the overall measure for the LDRs. Thus, the closer it is to zero, the better the performance. For example, if the converted speech is 2 times faster than the reference speech, the LDR will be 0.5 everywhere, and so its mean absolute difference from 1 will be 50%.

E. Baseline Methods

1) sprocket: We chose the open-source VC system called sprocket [61] for comparison in our experiments.
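The local-slope computation of Eq. (26) underlying the LDR can be sketched as follows. This is a minimal NumPy illustration under the assumption that the DTW path (p, q) is already given; boundary points without a full 33-point window are skipped for simplicity.

```python
import numpy as np

def local_duration_ratio(p, q, half=16):
    """Median of the local slopes of a DTW path (Eq. (26)): for each interior
    path point, fit a regression line to the 33 surrounding points."""
    p = np.asarray(p, float)
    q = np.asarray(q, float)
    slopes = []
    for j in range(half, len(p) - half):
        pw = p[j - half:j + half + 1]   # 33-point window on the reference axis
        qw = q[j - half:j + half + 1]   # corresponding converted-axis values
        pc = pw - pw.mean()
        qc = qw - qw.mean()
        slopes.append(np.sum(pc * qc) / np.sum(pc ** 2))
    return float(np.median(slopes))
```

A ratio above 1 indicates that the converted utterance is longer than the reference, matching the interpretation given in the text.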
To run this method, we used the source code provided by its author [62]. Note that this system was used as a baseline system in the Voice Conversion Challenge (VCC) 2018 [63].

2) RNN-S2S-VC: To evaluate the effect of the fully convolutional architecture adopted in ConvS2S-VC, we implemented its recurrent counterpart [35], inspired by the architecture introduced in an S2S model-based TTS system called Tacotron [23], and considered it as another baseline. Although the original Tacotron used mel-spectra as the acoustic features, the baseline system was designed to use the same acoustic features as our system. The architecture was specifically designed as follows. The encoder consisted of a bottleneck fully-connected prenet followed by a stack of 1 × 1 1D GLU convolutions and a bi-directional LSTM layer. The decoder was an autoregressive content-based attention network, consisting of a bottleneck fully-connected prenet followed by a stateful LSTM layer producing the attention query, which was then passed to a stack of two uni-directional residual LSTM layers, followed by a linear projection to generate the features. Note that we replaced all rectified linear unit (ReLU) activations with GLUs, as in our model. We also designed and implemented a many-to-many extension of the above RNN-based model.

F. Objective Evaluations

1) Effect of regularization: First, we evaluated the individual effects of the regularization techniques presented in Subsections III-B and III-C on both the pairwise and many-to-many models. Tabs. III, IV, and V show the average MCDs (with 95% confidence intervals), LFCs, and LDR deviations of the converted speech obtained using the pairwise and many-to-many models under the different (\lambda_r, \lambda_o) settings (0, 0), (1, 0), and (1, 2000) for the pairwise conversion model, and the different (\lambda_r, \lambda_i, \lambda_o) settings (0, 0, 0), (1, 0, 0), (1, 1, 0), and (1, 1, 2000) for the many-to-many model. Owing to the limited amount of training data, the models trained without DAL did not successfully produce recognizable speech; thus, we omit the results obtained when \lambda_d = 0. As the results show, although there are a few exceptions, both the pairwise and many-to-many models performed better for most speaker pairs in terms of the MCD measure when all the regularization terms were simultaneously taken into account during training. We also found that the effects of L_rec and L_oal on the LFC and LDR measures were less significant than on the MCD measure. Fig. 5 shows examples of how each of the regularization techniques affects the prediction of the attention matrices by the many-to-many model at test time. As these examples show, the CPL tended to have a notable effect on promoting monotonicity and continuity of the attention prediction. However, it also had a negative effect of blurring the predicted attention distributions. The OAL and IML contributed to counteracting this negative effect by sharpening the attention matrices while keeping them monotonic and continuous.

2) Comparison of normalization methods: For normalization, there are several choices, including instance normalization (IN) [64], weight normalization (WN) [65], and batch normalization (BN) [66]. For our many-to-many model, other choices include conditional IN (CIN) and conditional BN (CBN).
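The conditional batch normalization introduced in Subsection IV-B can be sketched as follows. This is a minimal NumPy illustration with per-speaker parameter tables, assumed shapes, and our own function name; a real implementation would use learnable parameters, a learned epsilon placement, and running statistics for inference.

```python
import numpy as np

def conditional_batch_norm(X, k, gamma, beta, eps=1e-5):
    """Conditional batch normalization sketch: X has shape (batch, channels,
    time); the mean/std are computed per channel over the batch and time axes,
    while the scale and shift are looked up per speaker index k."""
    mu = X.mean(axis=(0, 2), keepdims=True)
    sigma = X.std(axis=(0, 2), keepdims=True)
    X_hat = (X - mu) / (sigma + eps)
    # gamma, beta: (num_speakers, channels) tables; only gamma[k], beta[k]
    # differ between a regular BN layer and this conditional variant
    return gamma[k][None, :, None] * X_hat + beta[k][None, :, None]
```

Switching k thus changes only the affine part of the layer, which is exactly the difference between B(X) and B^k(X) in the text.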
We compared the effects of these normalization methods on both the pairwise and many-to-many models on the basis of the MCD, LFC, and LDR measures. Note that all the normalization layers in Tab. II are excluded in the WN counterparts.

TABLE III
AVERAGE MCDS [DB] OBTAINED WITH THE PAIRWISE AND MANY-TO-MANY MODELS TRAINED WITH AND WITHOUT REGULARIZATION

source | target | pairwise: λr=λo=0 | λo=0 | full | many-to-many: λr=λi=λo=0 | λi=λo=0 | λo=0 | full
clb | bdl | 7.00±.09 | 6.84±.09 | 6.93±.10 | 6.98±.10 | 6.76±.12 | 6.67±.14 | 6.57±.12
clb | slt | 6.37±.09 | 6.34±.10 | 6.37±.08 | 6.34±.08 | 6.11±.05 | 6.01±.09 | 5.98±.08
clb | rms | 6.54±.10 | 6.52±.13 | 6.53±.16 | 6.27±.05 | 6.23±.06 | 6.44±.12 | 6.11±.08
bdl | clb | 6.30±.06 | 6.09±.08 | 6.03±.09 | 6.03±.07 | 6.05±.11 | 5.91±.08 | 5.86±.09
bdl | slt | 6.61±.10 | 6.66±.10 | 6.51±.12 | 6.58±.12 | 6.47±.06 | 6.25±.09 | 6.19±.11
bdl | rms | 6.75±.11 | 6.68±.13 | 6.79±.16 | 6.65±.14 | 6.45±.07 | 6.44±.11 | 6.24±.12
slt | clb | 6.21±.12 | 6.12±.08 | 6.03±.06 | 6.08±.07 | 6.14±.09 | 5.79±.09 | 5.78±.16
slt | bdl | 7.25±.16 | 7.21±.18 | 7.07±.16 | 7.10±.14 | 6.99±.14 | 6.80±.14 | 6.72±.14
slt | rms | 6.61±.07 | 6.56±.08 | 6.61±.09 | 6.50±.09 | 6.44±.12 | 6.51±.10 | 6.27±.08
rms | clb | 6.42±.14 | 6.37±.18 | 6.30±.11 | 6.05±.06 | 6.10±.11 | 5.96±.09 | 5.93±.11
rms | bdl | 7.16±.10 | 7.14±.14 | 7.08±.15 | 7.07±.10 | 6.91±.11 | 6.74±.11 | 6.67±.12
rms | slt | 6.78±.19 | 6.72±.23 | 6.50±.07 | 6.49±.11 | 6.28±.11 | 6.35±.16 | 6.32±.16
All pairs | | 6.67±.04 | 6.61±.05 | 6.56±.05 | 6.51±.04 | 6.41±.04 | 6.32±.04 | 6.22±.04

TABLE IV
AVERAGE LFCS OBTAINED WITH THE PAIRWISE AND MANY-TO-MANY MODELS TRAINED WITH AND WITHOUT REGULARIZATION

source | target | pairwise: λr=λo=0 | λo=0 | full | many-to-many: λr=λi=λo=0 | λi=λo=0 | λo=0 | full
clb | bdl | 0.852 | 0.856 | 0.869 | 0.876 | 0.874 | 0.851 | 0.845
clb | slt | 0.846 | 0.835 | 0.833 | 0.834 | 0.852 | 0.831 | 0.841
clb | rms | 0.829 | 0.805 | 0.771 | 0.835 | 0.741 | 0.751 | 0.811
bdl | clb | 0.844 | 0.815 | 0.810 | 0.846 | 0.835 | 0.831 | 0.823
bdl | slt | 0.768 | 0.799 | 0.815 | 0.775 | 0.792 | 0.875 | 0.864
bdl | rms | 0.805 | 0.750 | 0.800 | 0.759 | 0.742 | 0.788 | 0.855
slt | clb | 0.838 | 0.796 | 0.861 | 0.795 | 0.770 | 0.830 | 0.812
slt | bdl | 0.821 | 0.832 | 0.850 | 0.860 | 0.859 | 0.861 | 0.850
slt | rms | 0.804 | 0.797 | 0.785 | 0.789 | 0.783 | 0.759 | 0.799
rms | clb | 0.850 | 0.830 | 0.821 | 0.833 | 0.794 | 0.834 | 0.819
rms | bdl | 0.854 | 0.853 | 0.825 | 0.864 | 0.845 | 0.877 | 0.870
rms | slt | 0.792 | 0.825 | 0.809 | 0.776 | 0.794 | 0.866 | 0.826
All pairs | | 0.829 | 0.818 | 0.826 | 0.833 | 0.822 | 0.834 | 0.838

TABLE V
AVERAGE LDR DEVIATIONS (%) OBTAINED WITH THE PAIRWISE AND MANY-TO-MANY MODELS TRAINED WITH AND WITHOUT REGULARIZATION

source | target | pairwise: λr=λo=0 | λo=0 | full | many-to-many: λr=λi=λo=0 | λi=λo=0 | λo=0 | full
clb | bdl | 5.28 | 3.54 | 5.16 | 3.96 | 3.48 | 6.14 | 8.58
clb | slt | 2.42 | 3.76 | 1.96 | 2.61 | 3.43 | 0.55 | 5.85
clb | rms | 2.59 | 4.25 | 5.86 | 2.34 | 10.44 | 3.01 | 4.20
bdl | clb | 4.96 | 3.87 | 1.97 | 5.66 | 5.20 | 3.78 | 3.47
bdl | slt | 6.16 | 4.30 | 6.78 | 6.43 | 4.09 | 3.56 | 5.53
bdl | rms | 4.22 | 6.41 | 0.79 | 2.45 | 5.43 | 5.25 | 6.06
slt | clb | 2.59 | 0.60 | 0.81 | 0.31 | 0.75 | 2.45 | 0.66
slt | bdl | 7.36 | 5.59 | 6.70 | 1.73 | 3.64 | 4.25 | 5.49
slt | rms | 4.49 | 4.30 | 4.08 | 4.17 | 11.40 | 2.76 | 4.66
rms | clb | 1.04 | 3.07 | 1.99 | 5.40 | 4.83 | 2.94 | 3.55
rms | bdl | 2.87 | 5.10 | 6.87 | 2.59 | 3.63 | 5.13 | 10.85
rms | slt | 2.88 | 7.16 | 3.29 | 6.93 | 4.62 | 3.38 | 2.77
All pairs | | 4.17 | 4.21 | 3.47 | 3.51 | 4.48 | 4.01 | 4.65

Fig. 5. Examples of the attention matrices predicted from test input female speech using the many-to-many model trained under four settings of (λr, λo, λi): (0, 0, 0), (1, 0, 0), (1, 2000, 0), and (1, 2000, 1) (from left to right).

TABLE VI
MCD [DB] COMPARISON OF NORMALIZATION METHODS

source | target | pairwise: IN | WN | BN | many-to-many: IN | CIN | WN | BN | CBN
clb | bdl | 7.76±.19 | 7.10±.11 | 6.93±.10 | 9.04±.21 | 9.00±.25 | 6.83±.13 | 7.42±.10 | 6.57±.12
clb | slt | 6.79±.10 | 6.63±.10 | 6.37±.08 | 8.16±.15 | 7.96±.17 | 6.16±.07 | 6.96±.09 | 5.98±.08
clb | rms | 7.57±.25 | 6.71±.12 | 6.53±.16 | 7.72±.23 | 7.96±.21 | 6.20±.08 | 8.01±.05 | 6.11±.08
bdl | clb | 6.85±.12 | 6.38±.07 | 6.03±.09 | 7.93±.26 | 7.52±.30 | 5.85±.07 | 8.04±.20 | 5.86±.09
bdl | slt | 7.32±.20 | 6.69±.06 | 6.51±.12 | 7.97±.18 | 7.90±.21 | 6.36±.08 | 8.03±.25 | 6.19±.11
bdl | rms | 7.49±.16 | 6.93±.12 | 6.79±.16 | 7.86±.18 | 7.69±.24 | 6.31±.09 | 8.06±.13 | 6.24±.12
slt | clb | 6.64±.15 | 6.24±.05 | 6.03±.06 | 7.68±.28 | 8.26±.26 | 5.82±.09 | 8.06±.21 | 5.78±.16
slt | bdl | 7.95±.24 | 7.28±.16 | 7.07±.16 | 9.14±.23 | 9.61±.26 | 6.95±.13 | 7.73±.20 | 6.72±.14
slt | rms | 7.32±.19 | 6.76±.12 | 6.61±.09 | 7.60±.14 | 7.69±.14 | 6.42±.14 | 8.54±.09 | 6.27±.08
rms | clb | 6.96±.15 | 6.43±.12 | 6.30±.11 | 7.69±.20 | 7.99±.26 | 6.04±.13 | 7.93±.20 | 5.93±.11
rms | bdl | 8.04±.19 | 7.25±.12 | 7.08±.15 | 8.97±.26 | 9.16±.28 | 6.93±.11 | 7.88±.39 | 6.67±.12
rms | slt | 7.48±.34 | 6.92±.12 | 6.50±.07 | 8.32±.21 | 8.15±.25 | 6.47±.17 | 8.43±.62 | 6.32±.16
All pairs | | 7.34±.07 | 6.78±.05 | 6.56±.05 | 8.17±.08 | 8.24±.90 | 6.36±.05 | 7.92±.08 | 6.22±.04

TABLE VII
LFC COMPARISON OF NORMALIZATION METHODS

source | target | pairwise: IN | WN | BN | many-to-many: IN | CIN | WN | BN | CBN
clb | bdl | 0.827 | 0.840 | 0.869 | 0.556 | 0.456 | 0.873 | 0.769 | 0.845
clb | slt | 0.819 | 0.813 | 0.833 | 0.547 | 0.794 | 0.830 | 0.815 | 0.841
clb | rms | 0.657 | 0.770 | 0.771 | 0.475 | 0.338 | 0.831 | 0.494 | 0.811
bdl | clb | 0.781 | 0.779 | 0.810 | 0.766 | 0.725 | 0.829 | 0.652 | 0.823
bdl | slt | 0.746 | 0.831 | 0.815 | 0.787 | 0.693 | 0.789 | 0.617 | 0.864
bdl | rms | 0.634 | 0.738 | 0.800 | 0.588 | 0.608 | 0.763 | 0.444 | 0.855
slt | clb | 0.811 | 0.835 | 0.861 | 0.594 | 0.647 | 0.803 | 0.745 | 0.812
slt | bdl | 0.792 | 0.819 | 0.850 | 0.537 | 0.356 | 0.852 | 0.748 | 0.850
slt | rms | 0.680 | 0.729 | 0.785 | 0.504 | 0.327 | 0.783 | 0.293 | 0.799
rms | clb | 0.738 | 0.761 | 0.821 | 0.530 | 0.471 | 0.794 | 0.723 | 0.819
rms | bdl | 0.841 | 0.787 | 0.825 | 0.675 | 0.479 | 0.853 | 0.709 | 0.870
rms | slt | 0.678 | 0.768 | 0.809 | 0.525 | 0.657 | 0.781 | 0.554 | 0.826
All pairs | | 0.772 | 0.800 | 0.826 | 0.582 | 0.572 | 0.818 | 0.672 | 0.838

TABLE VIII
LDR DEVIATION (%) COMPARISON OF NORMALIZATION METHODS

source | target | pairwise: IN | WN | BN | many-to-many: IN | CIN | WN | BN | CBN
clb | bdl | 2.10 | 5.72 | 5.16 | 9.69 | 15.10 | 4.76 | 6.91 | 8.58
clb | slt | 3.46 | 1.36 | 1.96 | 13.38 | 13.56 | 2.04 | 3.68 | 5.85
clb | rms | 4.20 | 2.69 | 5.86 | 15.22 | 23.06 | 6.01 | 5.62 | 4.20
bdl | clb | 4.61 | 4.36 | 1.97 | 15.59 | 25.97 | 4.83 | 7.91 | 3.47
bdl | slt | 9.36 | 3.72 | 6.78 | 15.49 | 13.45 | 4.14 | 2.24 | 5.53
bdl | rms | 3.09 | 2.62 | 0.79 | 19.81 | 26.39 | 5.37 | 8.64 | 6.06
slt | clb | 2.01 | 1.52 | 0.81 | 10.67 | 15.91 | 4.13 | 3.46 | 0.66
slt | bdl | 3.91 | 8.31 | 6.70 | 16.96 | 19.57 | 3.12 | 7.78 | 5.49
slt | rms | 6.16 | 4.14 | 4.08 | 18.91 | 24.02 | 4.68 | 0.01 | 4.66
rms | clb | 2.22 | 2.51 | 1.99 | 14.70 | 20.74 | 2.48 | 7.44 | 3.55
rms | bdl | 5.23 | 4.38 | 6.87 | 16.41 | 13.97 | 5.26 | 9.68 | 10.85
rms | slt | 3.25 | 5.35 | 3.29 | 10.76 | 13.63 | 4.31 | 5.23 | 2.77
All pairs | | 3.87 | 3.87 | 3.47 | 13.77 | 18.03 | 4.46 | 4.77 | 4.65

The average MCDs, LFCs, and LDR deviations obtained using these normalization methods are shown in Tabs. VI, VII, and VIII. As the results show, BN worked better than IN and WN when applied to the pairwise conversion model, especially in terms of the MCD and LFC measures. However, naively applying it directly to the many-to-many model did not work satisfactorily, as anticipated in Subsection IV-B. This was also the case with IN. Although CIN was found to perform poorly, CBN worked significantly better.

3) Comparisons with baseline methods: Tabs. IX, X, and XI show the average MCDs, LFCs, and LDRs obtained with the proposed and baseline methods. As Tabs. IX and X show, the pairwise versions of ConvS2S-VC and RNN-S2S-VC performed comparably to each other and significantly better than sprocket. The effect of the many-to-many extension was noticeable for both ConvS2S-VC and RNN-S2S-VC, revealing the advantage of exploiting the training data of all the speakers. The many-to-many ConvS2S-VC performed better than its RNN counterpart, demonstrating the effect of the convolutional architecture. Since sprocket is designed to keep the speaking rate and rhythm of input speech unchanged, the performance gains over sprocket in terms of the LDR measure show how well the competing methods are able to predict the speaking rate and rhythm of target speech. As Tab. XI shows, both the pairwise and many-to-many versions of RNN-S2S-VC and ConvS2S-VC obtained LDR deviations closer to 0 than sprocket.

TABLE IX
AVERAGE MCDS (DB) WITH 95% CONFIDENCE INTERVALS OBTAINED WITH THE BASELINE AND PROPOSED METHODS

source | target | sprocket | RNN-S2S pairwise | RNN-S2S many-to-many | ConvS2S pairwise | ConvS2S many-to-many
clb | bdl | 6.98±.10 | 6.87±.09 | 6.94±.15 | 6.93±.10 | 6.57±.12
clb | slt | 6.34±.06 | 6.22±.07 | 6.26±.09 | 6.37±.08 | 5.98±.08
clb | rms | 6.84±.07 | 6.45±.09 | 6.23±.06 | 6.53±.16 | 6.11±.08
bdl | clb | 6.44±.10 | 6.21±.13 | 6.02±.12 | 6.03±.09 | 5.86±.09
bdl | slt | 6.46±.04 | 6.68±.11 | 6.38±.14 | 6.51±.12 | 6.19±.11
bdl | rms | 7.24±.12 | 6.69±.21 | 6.35±.09 | 6.79±.16 | 6.24±.12
slt | clb | 6.21±.06 | 6.13±.10 | 6.03±.10 | 6.03±.06 | 5.78±.16
slt | bdl | 6.80±.05 | 7.08±.11 | 7.09±.12 | 7.07±.16 | 6.72±.14
slt | rms | 6.87±.10 | 6.64±.13 | 6.38±.07 | 6.61±.09 | 6.27±.08
rms | clb | 6.43±.06 | 6.26±.14 | 6.23±.12 | 6.30±.11 | 5.93±.11
rms | bdl | 7.40±.15 | 7.11±.16 | 7.22±.16 | 7.08±.15 | 6.67±.12
rms | slt | 6.76±.09 | 6.53±.11 | 6.41±.12 | 6.50±.07 | 6.32±.16
All pairs | | 6.73±.03 | 6.57±.05 | 6.46±.05 | 6.56±.05 | 6.22±.04

TABLE X
LFCS OBTAINED WITH THE BASELINE AND PROPOSED METHODS

source | target | sprocket | RNN-S2S pairwise | RNN-S2S many-to-many | ConvS2S pairwise | ConvS2S many-to-many
clb | bdl | 0.643 | 0.851 | 0.875 | 0.869 | 0.847
clb | slt | 0.790 | 0.765 | 0.815 | 0.833 | 0.845
clb | rms | 0.556 | 0.784 | 0.787 | 0.771 | 0.795
bdl | clb | 0.642 | 0.748 | 0.840 | 0.810 | 0.826
bdl | slt | 0.632 | 0.738 | 0.797 | 0.815 | 0.863
bdl | rms | 0.467 | 0.719 | 0.715 | 0.800 | 0.829
slt | clb | 0.820 | 0.847 | 0.776 | 0.861 | 0.827
slt | bdl | 0.663 | 0.812 | 0.834 | 0.850 | 0.852
slt | rms | 0.611 | 0.753 | 0.773 | 0.785 | 0.806
rms | clb | 0.632 | 0.753 | 0.818 | 0.821 | 0.796
rms | bdl | 0.648 | 0.817 | 0.854 | 0.825 | 0.877
rms | slt | 0.674 | 0.783 | 0.785 | 0.809 | 0.838
All pairs | | 0.653 | 0.798 | 0.808 | 0.826 | 0.836

TABLE XI
LDR DEVIATIONS (%) OBTAINED WITH THE BASELINE AND PROPOSED METHODS

source | target | sprocket | RNN-S2S pairwise | RNN-S2S many-to-many | ConvS2S pairwise | ConvS2S many-to-many
clb | bdl | 17.66 | 0.52 | 1.30 | 5.16 | 8.58
clb | slt | 9.74 | 2.95 | 1.24 | 1.96 | 5.85
clb | rms | 3.24 | 2.27 | 4.92 | 5.86 | 4.20
bdl | clb | 16.65 | 3.52 | 4.94 | 1.97 | 3.47
bdl | slt | 4.58 | 7.76 | 7.18 | 6.78 | 5.53
bdl | rms | 15.20 | 2.65 | 3.72 | 0.79 | 6.06
slt | clb | 9.25 | 2.63 | 3.49 | 0.81 | 0.66
slt | bdl | 5.52 | 4.61 | 0.01 | 6.70 | 5.49
slt | rms | 11.46 | 3.36 | 3.92 | 4.08 | 4.66
rms | clb | 2.84 | 2.80 | 5.40 | 1.99 | 3.55
rms | bdl | 17.76 | 4.53 | 3.19 | 6.87 | 10.85
rms | slt | 11.95 | 6.84 | 4.15 | 3.29 | 2.77
All pairs | | 10.60 | 3.62 | 3.56 | 3.47 | 4.65

As mentioned earlier, one important advantage of the proposed model over its RNN counterpart is that it can be trained efficiently thanks to the nature of the convolutional architectures. In fact, whereas the pairwise and many-to-many versions of the RNN-based model took about 30 and 50 hours to train, the two versions of the proposed model took only about 4 and 7 hours under the current experimental settings. We implemented all the algorithms in PyTorch and used a single Tesla V100 GPU with 32.0 GB memory for training each model.

4) Performance of the any-to-many setting: The modifications described in Subsection IV-C make it possible to handle any-to-many VC tasks. We evaluated how these modifications affected the performance. Tab. XII shows the average MCDs, LFCs, and LDR deviations obtained with the any-to-many setting under a closed-set condition, where the speaker of the input speech is unknown but is seen in the training data.

TABLE XII
AVERAGE MCDS (DB), LFCS, AND LDR DEVIATIONS (%) OBTAINED WITH THE ANY-TO-MANY SETTING UNDER A CLOSED-SET CONDITION
Speak er pair Measures source target MCD (dB) LFC LDR (% ) clb bdl 6 . 81 ± . 10 0 . 907 11 . 33 slt 6 . 27 ± . 06 0 . 842 3 . 70 rms 6 . 26 ± . 07 0 . 794 7 . 54 bdl clb 6 . 10 ± . 10 0 . 849 3 . 68 slt 6 . 55 ± . 13 0 . 826 8 . 59 rms 6 . 49 ± . 13 0 . 811 2 . 48 slt clb 6 . 02 ± . 09 0 . 813 1 . 90 bdl 7 . 03 ± . 13 0 . 878 5 . 87 rms 6 . 44 ± . 07 0 . 810 4 . 17 rms clb 6 . 33 ± . 13 0 . 815 4 . 73 bdl 6 . 95 ± . 09 0 . 825 10 . 77 slt 6 . 51 ± . 13 0 . 868 3 . 62 All pairs 6 . 48 ± . 04 0 . 829 4 . 67 T ABLE XIII A V E R A G E M C D S ( D B ) , L F C S , AN D L D R D E V I ATI O N S ( % ) O B TA I N E D W I T H T H E A N Y - T O - M A N Y S E T T I N G U N D E R A N O P E N - S E T C O N D I T I O N . Speak er pair Measures source target MCD (dB) LFC LDR (% ) lnh clb 6 . 29 ± . 12 0 . 778 2 . 74 bdl 7 . 12 ± . 12 0 . 803 9 . 06 slt 6 . 44 ± . 07 0 . 694 0 . 45 rms 6 . 67 ± . 09 0 . 720 4 . 44 All pairs 6 . 63 ± . 07 0 . 747 4 . 36 Whereas th e pairwise a nd th e default m any-to-many version s must be inf o rmed about the speaker of each input utterance at test time, the any-to-many version requires no in formation . This can be co n venien t in practical scena r ios of VC applica- tions, but because of the disad vantage in the test con dition, the prob lem be comes m o re ch allenging. As the resu lts show , the MCDs an d LFCs ob tained with the any-to-many version were only slightly worse than those o btained with the default many-to-ma ny model despite th is disadvantage. It is also worth noting that they were better than th ose obtained with sprocket and the pairwise versions of ConvS2S-VC and RNN-S2S, all of which were trained und er a speaker-depend ent closed-set condition . W e furthe r ev aluated the performa nce of the any-to-many model und er an ope n -set cond ition wher e the speaker of the test u tter ances is u nseen in the training data. W e u sed the utteranc es of th e speaker lnh (fe male) as th e test input speech. 
The results are shown in T ab . XIII. For comp arison, T ab. XIV shows results of spro cket perfor m ed on the same speaker p airs under a speaker-dependent closed-set condition . As these results show , th e propo sed mo del with the open-set any-to-many setting still perform ed better than sprocket, even though sprocket had an advantage in b oth th e training and test condition s. 5) P e rfo rma nce with r eal-time system settings: W e e valu- ated the MCDs and LFCs obtaine d with the ma ny-to-many model under the real-time system setting described in Subsec- tion II I-F. The results are shown in T ab . XV. As the results show , the MCDs and LFCs were only slightly worse than tho se obtained with the default setting despite the disadvantage of using ca u sal conv olu tions fo r all the networks and fo rcing T ABLE XIV A V E R A G E M C D S ( D B ) , L F C S , AN D L D R D E V I ATI O N S ( % ) O B TA I N E D W I T H S P RO C K E T U N D E R A S P E A K E R - D E P E N D E N T C O N D I T I O N . Speak er pair Measures source target MCD (dB) LFC LDR % lnh clb 6 . 76 ± . 08 0 . 716 6 . 61 bdl 8 . 26 ± . 35 0 . 523 13 . 38 slt 6 . 62 ± . 11 0 . 771 5 . 72 rms 7 . 22 ± . 10 0 . 480 4 . 87 All pairs 7 . 21 ± . 14 0 . 579 7 . 61 T ABLE XV M C D S A N D L F C S O B TA I N E D W I T H T H E R E A L - T I M E S Y S T E M S E T T I N G S . Speak er pair Measures source target MCD (dB) LFC clb bdl 6 . 77 ± . 10 0 . 821 slt 5 . 95 ± . 08 0 . 829 rms 6 . 35 ± . 10 0 . 764 bdl clb 5 . 99 ± . 10 0 . 820 slt 6 . 32 ± . 10 0 . 821 rms 6 . 54 ± . 15 0 . 797 slt clb 5 . 84 ± . 10 0 . 818 bdl 6 . 92 ± . 16 0 . 810 rms 6 . 48 ± . 09 0 . 789 rms clb 6 . 17 ± . 11 0 . 789 bdl 6 . 74 ± . 13 0 . 856 slt 6 . 42 ± . 12 0 . 833 All pairs 6 . 37 ± . 04 0 . 812 attention matrices to be exactly diago nal (instead of having them be predicted ). A comp arison of T ab. XV with th e r esults obtained with sprocket in T abs. 
IX and X may also show how well the proposed method can perform with the real-time system setting.

G. Subjective Listening Tests

We conducted mean opinion score (MOS) tests to compare the sound quality and speaker similarity of the converted speech samples obtained with the proposed and baseline methods. In the sound quality test, we included speech samples synthesized in the same way as in the proposed and baseline methods (namely, with the WORLD synthesizer) using acoustic features directly extracted from real speech samples. Hence, the scores of these samples are expected to show the upper limit of the performance. We also included speech samples produced by the pairwise and many-to-many versions of RNN-S2S-VC and by sprocket in the stimuli. Speech samples were presented in random order to eliminate bias regarding the order of the stimuli. Ten listeners participated in our listening tests. Each listener was presented 6 × 10 utterances and asked to evaluate the naturalness of each utterance by selecting 5: Excellent, 4: Good, 3: Fair, 2: Poor, or 1: Bad.

The results are shown in Fig. 6. As the results show, the pairwise ConvS2S-VC performed slightly better than sprocket and significantly better than the two versions of RNN-S2S-VC. The many-to-many ConvS2S-VC performed better than all the other methods, revealing the effect of the many-to-many extension, and came close to the upper limit obtained with the analysis-and-synthesis technique.

Fig. 6. Results of the MOS test for sound quality.
Fig. 7. Results of the MOS test for speaker similarity.

In the speaker similarity test, each subject was given a converted speech sample and a real speech sample of the corresponding target speaker and was asked to evaluate how likely they were to have been produced by the same speaker by selecting 5: Definitely, 4: Likely, 3: Fair, 2: Not very likely, or 1: Unlikely.
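For either rating scale, the per-system MOS is simply the mean of the collected 1-5 ratings, usually reported with a confidence interval. A minimal stdlib sketch (our illustration, not the authors' analysis code; the rating values are made up) under a normal approximation:

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score and approximate 95% confidence-interval
    half-width for a list of 1-5 ratings (normal approximation)."""
    mean = statistics.mean(ratings)
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * sem

# e.g., ten hypothetical listener ratings for one system
mos, half = mos_with_ci([4, 5, 3, 4, 4, 5, 4, 3, 4, 4])
```

Error bars like those in Figs. 6 and 7 are then drawn as mos ± half per system.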
We used converted speech samples generated by the pairwise and many-to-many versions of RNN-S2S-VC and by sprocket for comparison, as in the sound quality test. Each listener was presented 5 × 10 pairs of utterances. As the results in Fig. 7 show, both the pairwise and many-to-many versions of ConvS2S-VC performed better than all the other methods.

H. Audio Examples of Various Conversion Tasks

Although we only considered a speaker identity conversion task in the above experiments, ConvS2S-VC can also be applied to other tasks. Audio samples of ConvS2S-VC tested on several tasks, including speaker identity conversion, emotional expression conversion, electrolaryngeal speech enhancement, and English accent conversion, are provided at [67]. From these examples, we can expect that ConvS2S-VC can also perform reasonably well in various tasks other than speaker identity conversion.

VI. CONCLUSIONS

This paper proposed a voice conversion (VC) method based on the ConvS2S learning framework. The proposed method provides a natural way of converting the F0 contour, speaking rate, and rhythm as well as the voice characteristics of input speech, along with the flexibility of handling many-to-many, any-to-many, and real-time VC tasks without relying on automatic speech recognition (ASR) models or text annotations. Through ablation studies, we demonstrated the individual effect of each of the ideas introduced in the proposed method. Objective and subjective evaluation experiments on a speaker identity conversion task showed that the proposed method could perform better than baseline methods. Furthermore, audio examples showed the potential of the proposed method to perform well in various tasks including emotional expression conversion, electrolaryngeal speech enhancement, and English accent conversion.
ACKNOWLEDGMENTS

This work was supported by JSPS KAKENHI 17H01763 and JST CREST Grant Number JPMJCR19A3, Japan.

REFERENCES

[1] A. Kain and M. W. Macon, "Spectral voice conversion for text-to-speech synthesis," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1998, pp. 285–288.
[2] A. B. Kain, J.-P. Hosom, X. Niu, J. P. van Santen, M. Fried-Oken, and J. Staehely, "Improving the intelligibility of dysarthric speech," Speech Communication, vol. 49, no. 9, pp. 743–759, 2007.
[3] K. Nakamura, T. Toda, H. Saruwatari, and K. Shikano, "Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech," Speech Communication, vol. 54, no. 1, pp. 134–146, 2012.
[4] Z. Inanoglu and S. Young, "Data-driven emotion conversion in spoken English," Speech Communication, vol. 51, no. 3, pp. 268–283, 2009.
[5] O. Türk and M. Schröder, "Evaluation of expressive speech synthesis with voice conversion and copy resynthesis techniques," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 965–973, 2010.
[6] T. Toda, M. Nakagiri, and K. Shikano, "Statistical voice conversion techniques for body-conducted unvoiced speech enhancement," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 9, pp. 2505–2517, 2012.
[7] D. Felps, H. Bortfeld, and R. Gutierrez-Osuna, "Foreign accent conversion in computer assisted pronunciation training," Speech Communication, vol. 51, no. 10, pp. 920–932, 2009.
[8] Y. Stylianou, O. Cappé, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 131–142, 1998.
[9] T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222–2235, 2007.
[10] E. Helander, T. Virtanen, J. Nurminen, and M. Gabbouj, "Voice conversion using partial least squares regression," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 912–921, 2010.
[11] S. Desai, A. W. Black, B. Yegnanarayana, and K. Prahallad, "Spectral mapping using artificial neural networks for voice conversion," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 954–964, 2010.
[12] S. H. Mohammadi and A. Kain, "Voice conversion using deep neural networks with speaker-independent pre-training," in Proc. IEEE Spoken Language Technology Workshop (SLT), 2014, pp. 19–23.
[13] Y. Saito, S. Takamichi, and H. Saruwatari, "Voice conversion using input-to-output highway networks," IEICE Transactions on Information and Systems, vol. E100-D, no. 8, pp. 1925–1928, 2017.
[14] L. Sun, S. Kang, K. Li, and H. Meng, "Voice conversion using deep bidirectional long short-term memory based recurrent neural networks," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2015, pp. 4869–4873.
[15] T. Kaneko, H. Kameoka, K. Hiramatsu, and K. Kashino, "Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks," in Proc. Annual Conference of the International Speech Communication Association (Interspeech), 2017, pp. 1283–1287.
[16] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, "Voice conversion from non-parallel corpora using variational auto-encoder," in Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2016, pp. 1–6.
[17] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, "Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks," in Proc. Annual Conference of the International Speech Communication Association (Interspeech), 2017, pp. 3364–3368.
[18] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "ACVAE-VC: Non-parallel voice conversion with auxiliary classifier variational autoencoder," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 9, pp. 1432–1443, 2019.
[19] T. Kaneko and H. Kameoka, "Parallel-data-free voice conversion using cycle-consistent adversarial networks," arXiv:1711.11293 [stat.ML], Nov. 2017.
[20] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks," in Proc. IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 266–273.
[21] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Adv. Neural Information Processing Systems (NIPS), 2014, pp. 3104–3112.
[22] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in Adv. Neural Information Processing Systems (NIPS), 2015, pp. 577–585.
[23] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, "Tacotron: Towards end-to-end speech synthesis," in Proc. Annual Conference of the International Speech Communication Association (Interspeech), 2017, pp. 4006–4010.
[24] S. O. Arık, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, A. Ng, J. Raiman, S. Sengupta, and M. Shoeybi, "Deep Voice: Real-time neural text-to-speech," in Proc. International Conference on Machine Learning (ICML), 2017.
[25] S. O. Arık, G. Diamos, A. Gibiansky, J. Miller, K. Peng, W. Ping, J. Raiman, and Y. Zhou, "Deep Voice 2: Multi-speaker neural text-to-speech," in Proc. Neural Information Processing Systems (NIPS), 2017.
[26] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, and Y. Bengio, "Char2Wav: End-to-end speech synthesis," in Proc. International Conference on Learning Representations (ICLR), 2017.
[27] H. Tachibana, K. Uenoyama, and S. Aihara, "Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018, pp. 4784–4788.
[28] W. Ping, K. Peng, A. Gibiansky, S. O. Arık, A. Kannan, S. Narang, J. Raiman, and J. Miller, "Deep Voice 3: Scaling text-to-speech with convolutional sequence learning," in Proc. International Conference on Learning Representations (ICLR), 2018.
[29] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018, pp. 4779–4783.
[30] M.-T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," in Proc. EMNLP, 2015.
[31] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, "Language modeling with gated convolutional networks," in Proc. International Conference on Machine Learning (ICML), 2017, pp. 933–941.
[32] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv:1609.03499 [cs.SD], Sep. 2016.
[33] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, "Convolutional sequence to sequence learning," in Proc. International Conference on Machine Learning (ICML), 2017.
[34] H. Kameoka, K. Tanaka, T. Kaneko, and N. Hojo, "ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion," arXiv, Nov. 2018.
[35] K. Tanaka, H. Kameoka, T. Kaneko, and N. Hojo, "AttS2S-VC: Sequence-to-sequence voice conversion with attention and context preservation mechanisms," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019, pp. 6805–6809.
[36] H. Miyoshi, Y. Saito, S. Takamichi, and H. Saruwatari, "Voice conversion using sequence-to-sequence learning of context posterior probabilities," in Proc. Annual Conference of the International Speech Communication Association (Interspeech), 2017, pp. 1268–1272.
[37] J.-X. Zhang, Z.-H. Ling, L.-J. Liu, Y. Jiang, and L.-R. Dai, "Sequence-to-sequence acoustic modeling for voice conversion," arXiv [cs.SD], Oct. 2018.
[38] M. Zhang, X. Wang, F. Fang, H. Li, and J. Yamagishi, "Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet," in Proc. Annual Conference of the International Speech Communication Association (Interspeech), 2019, pp. 1298–1302.
[39] F. Biadsy, R. J. Weiss, P. J. Moreno, D. Kanevsky, and Y. Jia, "Parrotron: An end-to-end speech-to-speech conversion model and its applications to hearing-impaired speech and speech separation," in Proc. Annual Conference of the International Speech Communication Association (Interspeech), 2019, pp. 4115–4119.
[40] A. Haque, M. Guo, and P. Verma, "Conditional end-to-end audio transforms," in Proc. Annual Conference of the International Speech Communication Association (Interspeech), 2018, pp. 2295–2299.
[41] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, "Speaker-dependent WaveNet vocoder," in Proc. Annual Conference of the International Speech Communication Association (Interspeech), 2017, pp. 1118–1122.
[42] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, "Efficient neural audio synthesis," arXiv [cs.SD], Feb. 2018.
[43] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio, "SampleRNN: An unconditional end-to-end neural audio generation model," arXiv:1612.07837 [cs.SD], Dec. 2016.
[44] Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu, "FFTNet: A real-time speaker-dependent neural vocoder," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018, pp. 2251–2255.
[45] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van den Driessche, E. Lockhart, L. C. Cobo, F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen, N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, and D. Hassabis, "Parallel WaveNet: Fast high-fidelity speech synthesis," arXiv:1711.10433 [cs.LG], Nov. 2017.
[46] W. Ping, K. Peng, and J. Chen, "ClariNet: Parallel wave generation in end-to-end text-to-speech," arXiv:1807.07281 [cs.CL], Feb. 2019.
[47] R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," arXiv:1811.00002 [cs.SD], Oct. 2018.
[48] S. Kim, S. Lee, J. Song, and S. Yoon, "FloWaveNet: A generative flow for raw audio," arXiv:1811.02155 [cs.SD], Nov. 2018.
[49] X. Wang, S. Takaki, and J. Yamagishi, "Neural source-filter-based waveform model for statistical parametric speech synthesis," arXiv:1810.11946 [eess.AS], Oct. 2018.
[50] K. Tanaka, T. Kaneko, N. Hojo, and H. Kameoka, "Synthetic-to-natural speech waveform conversion using cycle-consistent adversarial networks," in Proc. IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 632–639.
[51] T. Fukada, K. Tokuda, T. Kobayashi, and S. Imai, "An adaptive algorithm for mel-cepstral analysis of speech," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1992, pp. 137–140.
[52] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. E99-D, no. 7, pp. 1877–1884, 2016.
[53] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, "Tacotron: Towards end-to-end speech synthesis," in Proc. Annual Conference of the International Speech Communication Association (Interspeech), 2017, pp. 4006–4010.
[54] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Adv. Neural Information Processing Systems (NIPS), 2017.
[55] V. Dumoulin, J. Shlens, and M. Kudlur, "A learned representation for artistic style," in Proc. International Conference on Learning Representations (ICLR), 2017.
[56] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, "StarGAN-VC2: Rethinking conditional methods for StarGAN-based voice conversion," in Proc. Annual Conference of the International Speech Communication Association (Interspeech), 2019, pp. 679–683.
[57] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in Proc. ISCA Speech Synthesis Workshop (SSW), 2004, pp. 223–224.
[58] https://github.com/r9y9/pysptk
[59] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. International Conference on Learning Representations (ICLR), 2015.
[60] D. J. Hermes, "Measuring the perceptual similarity of pitch contours," J. Speech Lang. Hear. Res., vol. 41, no. 1, pp. 73–82, 1998.
[61] K. Kobayashi and T. Toda, "sprocket: Open-source voice conversion software," in Proc. Odyssey, 2018, pp. 203–210.
[62] https://github.com/k2kobayashi/sprocket (Accessed on 01/28/2019).
[63] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, "The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods," arXiv:1804.04262 [eess.AS], Apr. 2018.
[64] D. Ulyanov, A. Vedaldi, and V. Lempitsky, "Instance normalization: The missing ingredient for fast stylization," arXiv:1607.08022 [cs.CV], Jul. 2016.
[65] T. Salimans and D. P. Kingma, "Weight normalization: A simple reparameterization to accelerate training of deep neural networks," in Adv. Neural Information Processing Systems (NIPS), 2016, pp. 901–909.
[66] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. International Conference on Machine Learning (ICML), 2015, pp. 448–456.
[67] http://www.kecl.ntt.co.jp/people/kameoka.hirokazu/Demos/convs2s-vc/