WGANSing: A Multi-Voice Singing Voice Synthesizer Based on the Wasserstein-GAN


Authors: Pritish Chandna, Merlijn Blaauw, Jordi Bonada, Emilia Gómez

Pritish Chandna*, Merlijn Blaauw*, Jordi Bonada* and Emilia Gómez*†
*Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
†Joint Research Centre, European Commission, Seville, Spain
Email: pritish.chandna@upf.edu, merlijn.blaauw@upf.edu, jordi.bonada@upf.edu, emilia.gomez@upf.edu

Abstract—We present a deep neural network based singing voice synthesizer, inspired by the Deep Convolutional Generative Adversarial Networks (DCGAN) architecture and optimized using the Wasserstein-GAN algorithm. We use vocoder parameters for acoustic modelling, to separate the influence of pitch and timbre. This facilitates the modelling of the large variability of pitch in the singing voice. Our network takes a block of consecutive frame-wise linguistic and fundamental frequency features, along with global singer identity, as input, and outputs the vocoder features corresponding to that block. This block-wise approach, along with the training methodology, allows us to model temporal dependencies within the features of the input block. For inference, sequential blocks are concatenated using an overlap-add procedure. We show that the performance of our model is competitive with regards to the state-of-the-art and the original sample, using objective metrics and a subjective listening test. We also present examples of the synthesis on a supplementary website and the source code via GitHub.

Index Terms—Wasserstein-GAN, DCGAN, WORLD vocoder, Singing Voice Synthesis, Block-wise Predictions

I. INTRODUCTION

Singing voice synthesis and Text-To-Speech (TTS) synthesis are related but distinct research fields. While both fields try to generate signals mimicking the human voice, singing voice synthesis models a much wider range of pitches and vowel durations.
In addition, while speech synthesis is controlled primarily by textual information such as words or syllables, singing voice synthesis is additionally guided by a score, which puts constraints on pitch and timing. These constraints and differences also cause singing voice synthesis models to deviate somewhat from their speech counterparts. Historically, both speech and singing voice synthesis have been based on concatenative methods, which involve the transformation and concatenation of waveforms from a large corpus of specialized recordings. Recently, several machine learning based methods have been proposed in both fields, most of which also require a large amount of data for training.

In terms of quality, the field of TTS has seen a revolution in the last few years, with the introduction of the WaveNet [1] autoregressive framework, capable of synthesizing speech virtually indistinguishable from a real voice recording. This architecture inspired the Neural Parametric Singing Synthesizer (NPSS) [2], a deep learning based singing voice synthesis method which is trained on a dataset of annotated natural singing and produces high quality synthesis. The WaveNet [1] directly generates the waveform given local linguistic and global speaker identity conditions. While a high quality synthesis is generated, the drawback of this model is that it requires a large amount of annotated data. As such, some subsequent works, like Tacotron 2 [3], use the WaveNet as a vocoder for converting acoustic features to a waveform and use a separate architecture for modelling these acoustic features from the linguistic input. The WaveNet vocoder architecture, trained on unlabeled data, is also capable of synthesizing high-quality speech from an intermediate feature representation. The task that we focus on in this paper is generating acoustic features given an input of linguistic features.
Various acoustic feature representations have been proposed for speech synthesis, including the mel-spectrogram [3], which is a compressed version of the linear spectrogram. However, for the singing voice, a good option is to use vocoder features, as they separate the pitch of the signal from its timbre. This is ideal for the singing voice, as the pitch range of the voice while singing is much wider than while speaking normally. Modelling the timbre independently of the pitch has been shown to be an effective methodology [2]. We note that the use of a vocoder for direct synthesis can lead to a degradation of sound quality, but this degradation can be mitigated by the use of a WaveNet vocoder trained to synthesize the waveform from the parametric vocoder features. As such, for the scope of this study, we limit the upper bound of the performance of the model to that of the vocoder.

Like autoregressive networks, Generative Adversarial Networks (GANs) [4]–[6] are a family of generative frameworks for deep learning, which includes the Wasserstein-GAN [7] variant. While the methodology has provided exceptional results in fields related to computer vision, it has seen only a few adaptations in the audio domain, and indeed in TTS, which we discuss in the following sections. We adapt the Wasserstein-GAN model for singing voice synthesis.

In this paper, we present a novel block-wise generative model for singing voice synthesis, trained using the Wasserstein-GAN framework.¹ The block-wise nature of the model allows us to model temporal dependencies among features, much like the inter-pixel dependencies modelled by the GAN architecture in image generation.

¹ The code for this model, in the TensorFlow framework, is available at https://github.com/MTG/WGANSing. Audio examples, including synthesis with voice change and without the reconstruction loss, can be heard at https://pc2752.github.io/sing_synth_examples/
For this study, we use the original fundamental frequency for synthesis, leading to a performance-driven synthesis. As a result, we only model the timbre of the singing voice, and not the expression via the f0 curve or the timing deviations of the singers.

The rest of the paper is organized as follows. Section II provides a brief overview of the GAN and Wasserstein-GAN generative frameworks. Section III discusses the state-of-the-art singing voice synthesis model that we use as a baseline in this paper and some of the recent applications of GANs in the field of TTS and, in general, in the audio domain. Sections IV and V present our model for singing voice synthesis, followed by a brief discussion of the dataset used and the hyperparameters of the model in Sections VI and VII, respectively. We then present an evaluation of the model, compared to the baseline, in Section IX, before wrapping up with the conclusions of the paper and a discussion of our future direction in Section X.

II. GANS AND WASSERSTEIN-GANS

Generative Adversarial Networks (GANs) have been extensively used for various applications in computer vision since their introduction. GANs can be viewed as a network optimization methodology based on a two-player non-cooperative training that tries to minimize the divergence between a parameterized generated distribution P_g and a real data distribution P_r. The framework consists of two networks, a generator G and a discriminator D, which are trained simultaneously. The discriminator is trained to distinguish between a real input and a synthetic input output by the generator, while the generator is trained to fool the discriminator. The loss function for the network is shown in equation 1.
L_GAN = min_G max_D E_{y~P_r}[log(D(y))] + E_{x~P_x}[log(1 - D(G(x)))]    (1)

where y is a sample from the real distribution and x is the input to the generator, which may be noise or conditioning, as in the Conditional GAN [5], and is drawn from a distribution of such inputs, P_x.

While GANs have been shown to produce realistic images, there are difficulties in training them, including vanishing gradients, mode collapse and instability. To mitigate these difficulties, the Wasserstein-GAN [7] has been proposed, which optimizes an efficient approximation of the Earth-Mover (EM) distance between the generated and real distributions and has been shown to produce realistic images. The loss function for the WGAN is shown in equation 2. In this version of the GAN, the discriminator network is replaced by a network termed the critic, also represented by D, which can be trained to optimality and does not saturate, converging to a linear function.

L_WGAN = min_G max_D E_{y~P_r}[D(y)] - E_{x~P_x}[D(G(x))]    (2)

We use a conditional version of the model, which generates a distribution parametrized by the network and conditioned on a conditioning vector, described in Section V, and follow the training algorithm proposed in the original paper [7], with the same training hyperparameters.

III. RELATED WORK

GANs have been adapted for TTS in several variations over recent years. The work closest to ours is that of Zhao et al. [8], which uses a Wasserstein-GAN framework, followed by a WaveNet vocoder and a complementary waveform-based loss. Yang et al. [9] use the mean squared error (MSE) and a variational autoencoder (VAE) to enable the GAN optimization process in a multi-task learning framework. A BLSTM based GAN framework complemented with a style and reconstruction loss is used by Ma et al. [10].
While these models use recurrent networks for frame-wise sequential prediction, we propose a convolutional network based system to directly predict a block of vocoder features, based on an input conditioning of the same size in the time dimension. Other examples of the application of GANs to speech synthesis include the work of Kaneko et al. [11], [12], which uses GANs as a post-filter for acoustic models to overcome the oversmoothing associated with those models. GANs have also been adapted to synthesize waveforms directly: WaveGAN [13] is an example of the use of GANs to synthesize spoken instances of numerical digits, as well as other audio examples. GANSynth [14] has also been proposed to synthesize high quality musical audio using GANs.

For singing voice synthesis, Hono et al. [15] use a GAN-based architecture to model frame-wise vocoder features. This models the inter-feature dependencies within a frame of the output. In contrast, our model directly models a block of consecutive audio frames via the Wasserstein-GAN framework. This allows us to model temporal dependencies between features within that block. This temporal dependence is modelled via autoregression in the Neural Parametric Singing Synthesizer (NPSS) [2] model, which we use as a baseline in our study.

The NPSS uses an autoregressive architecture, inspired by the WaveNet [1], to make frame-wise predictions of vocoder features, using a mixture density output. The network models the frame-wise distribution as a sum of Gaussians, the parameters of which are estimated by the network. In contrast, the use of adversarial networks for estimation imposes no explicit constraints on the output distribution. The NPSS model has been shown to generate high quality singing voice synthesis, comparable to or exceeding state-of-the-art concatenative methods. A multi-singer variation of the NPSS model has also been proposed recently [16], and is used as the baseline for our study.
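Before describing the proposed system, the Wasserstein-GAN objective of equation 2 can be made concrete. The following is a minimal NumPy sketch, not the paper's TensorFlow implementation; `critic` and `generator` are placeholder callables standing in for the actual networks:

```python
import numpy as np

def wgan_losses(critic, generator, y_real, x_cond):
    """WGAN objectives (equation 2): the critic maximizes
    E[D(y)] - E[D(G(x))] and the generator maximizes E[D(G(x))];
    both are written here as losses to be minimized."""
    y_fake = generator(x_cond)
    critic_loss = -(np.mean(critic(y_real)) - np.mean(critic(y_fake)))
    generator_loss = -np.mean(critic(y_fake))
    return critic_loss, generator_loss

# Toy check with linear stand-ins for the two networks.
rng = np.random.default_rng(0)
c_loss, g_loss = wgan_losses(
    critic=lambda y: y.sum(axis=-1),   # placeholder critic
    generator=lambda x: 0.5 * x,       # placeholder generator
    y_real=rng.normal(size=(4, 8)),
    x_cond=rng.normal(size=(4, 8)),
)
```

Note that, unlike the log-loss of equation 1, the critic outputs unbounded scores, so the difference of expectations approximates the Earth-Mover distance when the critic is kept Lipschitz, e.g. via the weight clipping of the original algorithm [7].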
Fig. 1: The framework for the proposed model. A conditioning vector, consisting of frame-wise phoneme and f0 annotations along with speaker identity, is passed to the generator. The critic is trained to distinguish between the generated sample and a real sample. (Figure labels: continuous f0 contour, singer identity, phoneme as a one-hot vector, noise, generator, generated sample, real sample, critic.)

IV. PROPOSED SYSTEM

We adopt an architecture similar to the DCGAN [17], which was used for the original WGAN. For the generator, we use an encoder-decoder schema, shown in figure 2, wherein both the encoder and decoder consist of 5 convolutional layers with filter size 3. Connections between the corresponding layers of the encoder and decoder, as in the U-Net [18] architecture, are also implemented.

Fig. 2: The architecture for the generator of the proposed network. The generator consists of an encoder and a decoder, based on the U-Net architecture [18]. (Figure labels: conv layer, size=1, stride=1; conv layer, size=3, stride=2; upsample and conv layer, size=3; channel depths 64, 32, 64, 128, 256, 512 in the encoder and 256, 128, 64, 32, 64, 64 in the decoder.)

While convolutional networks are capable of modelling arbitrary length sequences, the critic in our model takes a fixed length block as input, thereby modelling the dependencies within the block. Furthermore, the encoder-decoder schema leads to conditional dependence between the features of the generator output, within the predicted block of features. This approach implies implicit dependence between vocoder features within a single block, but not between blocks. As a result, for inference, we use overlap-add of consecutive blocks of output vocoder features, as shown in figure 3. An overlap of 50% was used, with a triangular window across features. This approach is similar to that used by Chandna et al. [19].

Fig. 3: The overlap-add process for the generated features.
As shown, predicted features from time t to time t+N are overlap-added with features from time t+N/2 to t+3N/2, where t is the start time of the process in view. A triangular window, applied across each of the features, is used for the adding process.

As proposed by Radford et al. [17], we use strided convolutions in the encoder instead of deterministic pooling functions for downsampling. For the decoder, we use linear interpolation followed by normal convolution for upsampling, instead of transposed convolutions, as this has been shown to avoid the high frequency artifacts which can be introduced by the latter [20]. Blocks of N consecutive frames are passed as input to the network, and the output has the same size. Like the DCGAN, we use ReLU activations for all layers in the generator, except the final layer, which uses a tanh activation. We found that the use of batch normalization did not affect the performance much.

We also found it helpful to guide the WGAN training by adding a reconstruction loss, as shown in equations 3 and 4. Such a reconstruction loss is often used in conditional image generation models, as in Lee et al. [21].

L_recon = min_G E_{x,y} ||G(x) - y||    (3)

L_total = L_WGAN + λ_recon L_recon    (4)

where λ_recon is the weight given to the reconstruction loss. The networks are optimized following the scheme described in Arjovsky et al. [7]. The critic for our system uses an architecture similar to the encoder part of the generator, but uses LeakyReLU activations instead of ReLU, as in Radford et al. [17].

Convolutional neural networks offer translation invariance across the dimensions convolved, making them highly useful in image modelling. However, for audio signals, this invariance is useful only across the time dimension and undesirable across the frequency dimension. As such, we follow the approach of NPSS [2], representing the features as a 1D signal with multiple channels.
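The overlap-add inference procedure described above can be sketched as follows. This is a minimal NumPy illustration, not the released code; the function name and the explicit renormalization of the window sum are our own choices:

```python
import numpy as np

def overlap_add(blocks, hop):
    """Overlap-add consecutive blocks of frame-wise features using a
    triangular window over time. With hop == N // 2 this is the 50%
    overlap described in the paper. blocks: (num_blocks, N, channels)."""
    num_blocks, block_len, channels = blocks.shape
    window = np.bartlett(block_len)[:, None]      # triangular window, (N, 1)
    out_len = hop * (num_blocks - 1) + block_len
    out = np.zeros((out_len, channels))
    norm = np.zeros((out_len, 1))
    for i, block in enumerate(blocks):
        start = i * hop
        out[start:start + block_len] += window * block
        norm[start:start + block_len] += window
    return out / np.maximum(norm, 1e-8)           # renormalize the window sum

# With constant features, interior samples are reconstructed exactly.
blocks = np.ones((3, 8, 2))
features = overlap_add(blocks, hop=4)
```

At 50% overlap the triangular windows of neighbouring blocks sum to a constant in the interior, so the renormalization mainly matters at the sequence edges, where only one block contributes.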
V. LINGUISTIC AND VOCODER FEATURES

The input conditioning to our system consists of frame-wise phoneme annotations, represented as a one-hot vector, and the continuous fundamental frequency, extracted by the spectral autocorrelation (SAC) algorithm. This conditioning is similar to the one used in NPSS. In addition, we condition the system on the singer identity, as a one-hot vector broadcast throughout the time dimension. This approach is similar to that used in the WaveNet [1]. The three conditioning vectors are then passed through a 1×1 convolution, concatenated together along with noise sampled from a uniform distribution, and passed to the generator as input.

We use the WORLD vocoder [22] for acoustic modelling of the singing voice. The system decomposes a speech signal into a harmonic spectral envelope and an aperiodicity envelope, based on the fundamental frequency f0. We apply dimensionality reduction to the vocoder features, similar to that used in [2].

VI. DATASET

We use the NUS-48E corpus [23], which consists of 48 popular English songs, sung by 12 male and female singers. The singers are all non-professional and non-native English speakers. Each singer sings 4 different songs from a set of 20 songs, leading to a total of 169 min of recordings, with 25,474 phoneme annotations. We train the system using all but 2 of the song instances, which are used for evaluation.

VII. HYPERPARAMETERS

A hop time of 5 ms was used for extracting the vocoder features and the conditioning. We tried different block sizes, but empirically found that N = 128 frames, corresponding to 640 ms, produced the best results. We used a weight of λ_recon = 0.0005 for L_recon and trained the network for 3000 epochs. As suggested in [7], we used RMSProp for network optimization, with a learning rate of 0.0001.
After dimensionality reduction, we use 60 harmonic and 4 aperiodic features per frame, leading to a total of 64 vocoder features.

VIII. EVALUATION METHODOLOGY

For objective evaluation, we use the Mel-Cepstral Distortion (MCD) metric, presented in table I. For subjective evaluation, we used an online AB test wherein the participants were asked to choose between two presented 5 s to 7 s examples², representing phrases from the songs. The participant's choice was based on the criteria of Intelligibility and Audio Quality. We compare our system to the NPSS, trained on the same dataset. Along with the NPSS, we use a re-synthesis with the WORLD vocoder as a baseline, as this is the upper limit of the performance of our system. We compared 3 pairs for this evaluation:

• WGANSing - Original song re-synthesized with WORLD vocoder.
• WGANSing - NPSS.
• WGANSing, original singer - WGANSing, sample with a different singer.

² We found that WGANSing without the reconstruction loss as a guide did not produce very pleasant results, and did not include it in the evaluation. However, examples can be heard at https://pc2752.github.io/sing_synth_examples/

Fig. 4: Subjective test results for the WGANSing-NPSS pair (Intelligibility: WGANSing 27%, No Preference 20%, NPSS 53%; Audio Quality: WGANSing 25%, No Preference 22%, NPSS 53%).

For the synthesis with a changed singer, we included samples with singers both of the same gender as the original singer and of a different gender. The input f0 to the system was adjusted by an octave to account for the different ranges of the genders. For each criterion, the participants were presented with 5 questions for each of the pairs, leading to a total of 15 questions per criterion and 30 questions overall³.
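The paper does not spell out its MCD formula, so we assume the conventional definition, MCD = (10/ln 10)·sqrt(2·Σ_d (c_d − c'_d)²) averaged over frames, with the 0th (energy) coefficient excluded. A minimal sketch under that assumption:

```python
import numpy as np

def mel_cepstral_distortion(c_ref, c_syn):
    """Frame-averaged Mel-Cepstral Distortion in dB between two
    (frames, coeffs) arrays of mel-cepstral coefficients; the 0th
    (energy) coefficient is excluded, as is conventional."""
    diff = c_ref[:, 1:] - c_syn[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

# Identical reference and synthesis give zero distortion.
c = np.random.default_rng(1).normal(size=(100, 64))
```

Lower values indicate synthesized cepstra closer to the reference; identical inputs yield 0 dB.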
IX. RESULTS

There were a total of 27 participants in our study, from over 10 nationalities, including native English speaking countries like the USA and England, and with ages ranging from 18 to 37. The results of the tests are shown in figures 4, 5 and 6.

Fig. 5: Subjective test results for the WGANSing-Original pair (Intelligibility: WGANSing 21%, No Preference 13%, Original 66%; Audio Quality: WGANSing 25%, No Preference 17%, Original 58%).

From the first two figures, it can be seen that our model is qualitatively competitive with regards to both the original baseline and the NPSS, even though a preference is observed for the latter. This result is supported by the objective measures, seen in table I, which show parity between WGANSing and the NPSS models. Figure 6 shows that the perceived intelligibility of the audio is preserved even after speaker change, even though there is a slight compromise on the audio quality.

Fig. 6: Subjective test results for the WGANSing-WGANSing Voice Change pair (Intelligibility: WGANSing 42%, No Preference 31%, Voice Change 27%; Audio Quality: WGANSing 51%, No Preference 30%, Voice Change 19%).

Variability in the observed results can be attributed to the subjective nature of the listening test, the diversity of the participants and the dataset used, which comprises non-native, non-professional singers. We note that there is room for improvement in the quality of the system, as discussed in the next section.

³ The subjective listening test used in our study can be found at https://trompa-mtg.upf.edu/synth_eval/

TABLE I: The MCD metric for the two songs used for validation of the model. The three models compared are the NPSS [2] and the WGANSing model with and without the reconstruction loss.

Song              | WGAN + L_recon | WGAN    | NPSS
Song 1 (JLEE 05)  | 5.36 dB        | 9.70 dB | 5.62 dB
Song 2 (MCUR 04)  | 5.40 dB        | 9.63 dB | 5.79 dB
X. CONCLUSIONS AND DISCUSSION

We have presented a multi-singer singing voice synthesizer based on a block-wise prediction topology. This block-wise methodology, inspired by work done in image generation, allows the framework to model intra-block feature dependencies, while long-term dependencies are handled via an overlap-add procedure. We believe that synthesis quality can be further improved by adding the previously predicted block of features as further conditioning for the current block of features to be predicted. Further contextual conditioning, such as phoneme duration, which is currently implicitly modelled, can also be added to improve synthesis. Synthesis quality can also be improved through post-processing techniques such as the use of the WaveNet vocoder on the generated features.

We also note that the synthesis was greatly helped by the addition of the reconstruction loss, while the use of batch normalization did not affect performance either way. The synthesis quality of the model was evaluated to be comparable to that of state-of-the-art synthesis systems; however, we note that variability in subjective measures like intelligibility and quality is introduced during the listening test, owing to the diversity of the participating population. This variability can be reduced through a targeted listening test with expert participants. The generative methodology used allows for potential exploration in expressive singing synthesis. Furthermore, the fully convolutional nature of the model leads to faster inference than autoregressive or recurrent network based models.

ACKNOWLEDGMENTS

This work is partially supported by the European Commission under the TROMPA project (H2020 770376). The TITAN X used for this research was donated by the NVIDIA Corporation.

REFERENCES

[1] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," in SSW, 2016.
[2] M. Blaauw and J. Bonada, "A neural parametric singing synthesizer," in INTERSPEECH, 2017.
[3] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4779–4783.
[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014.
[5] M. Mirza and S. Osindero, "Conditional generative adversarial nets," CoRR, vol. abs/1411.1784, 2014.
[6] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training GANs," in Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.
[7] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein generative adversarial networks," in International Conference on Machine Learning, 2017, pp. 214–223.
[8] Y. Zhao, S. Takaki, H.-T. Luong, J. Yamagishi, D. Saito, and N. Minematsu, "Wasserstein GAN and waveform loss-based acoustic model training for multi-speaker text-to-speech synthesis systems using a WaveNet vocoder," IEEE Access, vol. 6, pp. 60478–60488, 2018.
[9] S. Yang, L. Xie, X. Chen, X. Lou, X. Zhu, D. Huang, and H. Li, "Statistical parametric speech synthesis using generative adversarial networks under a multi-task learning framework," in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp. 685–691.
[10] S. Ma, D. McDuff, and Y. Song, "A generative adversarial network for style modeling in a text-to-speech system," in International Conference on Learning Representations, 2019.
[11] T. Kaneko, H. Kameoka, N. Hojo, Y. Ijima, K. Hiramatsu, and K. Kashino, "Generative adversarial network-based postfilter for statistical parametric speech synthesis," in Proc. ICASSP, 2017, pp. 4910–4914.
[12] T. Kaneko, S. Takaki, H. Kameoka, and J. Yamagishi, "Generative adversarial network-based postfilter for STFT spectrograms," in Proceedings of Interspeech, 2017.
[13] C. Donahue, J. McAuley, and M. Puckette, "Adversarial audio synthesis," in International Conference on Learning Representations, 2019.
[14] J. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, C. Donahue, and A. Roberts, "GANSynth: Adversarial neural audio synthesis," in International Conference on Learning Representations, 2018.
[15] Y. Hono, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, "Singing voice synthesis based on generative adversarial networks," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6955–6959.
[16] M. Blaauw, J. Bonada, and R. Daido, "Data efficient voice cloning for neural singing synthesis," in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
[17] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," CoRR, vol. abs/1511.06434, 2016.
[18] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[19] P. Chandna, M. Miron, J. Janer, and E. Gómez, "Monoaural audio source separation using deep convolutional neural networks," in International Conference on Latent Variable Analysis and Signal Separation. Springer, 2017, pp. 258–266.
[20] D. Stoller, S. Ewert, and S. Dixon, "Wave-U-Net: A multi-scale neural network for end-to-end audio source separation," in ISMIR, 2018.
[21] S. Lee, J. Ha, and G. Kim, "Harmonizing maximum likelihood with GANs for multimodal conditional generation," in International Conference on Learning Representations, 2019.
[22] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.
[23] Z. Duan, H. Fang, B. Li, K. C. Sim, and Y. Wang, "The NUS sung and spoken lyrics corpus: A quantitative comparison of singing and speech," in 2013 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference. IEEE, 2013, pp. 1–9.
