ON USING BACKPROPAGATION FOR SPEECH TEXTURE GENERATION AND VOICE CONVERSION

Jan Chorowski, Ron J. Weiss, Rif A. Saurous, Samy Bengio
Google Brain
{chorowski,ronw,rif,bengio}@google.com

ABSTRACT

Inspired by recent work on neural network image generation which relies on backpropagation towards the network inputs, we present a proof-of-concept system for speech texture synthesis and voice conversion based on two mechanisms: approximate inversion of the representation learned by a speech recognition neural network, and matching statistics of neuron activations between different source and target utterances. Similar to image texture synthesis and neural style transfer, the system works by optimizing a cost function with respect to the input waveform samples. To this end we use a differentiable mel-filterbank feature extraction pipeline and train a convolutional CTC speech recognition network. Our system is able to extract speaker characteristics from very limited amounts of target speaker data, as little as a few seconds, and can be used to generate realistic speech babble or reconstruct an utterance in a different voice.

Index Terms — Texture synthesis, voice conversion, style transfer, deep neural networks, convolutional networks, CTC

1. INTRODUCTION

Deep neural networks are a family of flexible and powerful machine learning models. Trained discriminatively, they have become the technique of choice in many applications, including image recognition [1], speech recognition [2, 3], and machine translation [4, 5, 6]. Additionally, neural networks can be used to generate new data, having been applied to speech synthesis [7, 8], image generation [9], and image inpainting and superresolution [10].

The representation learned by a discriminatively trained deep neural network can be approximately inverted, turning a classification model into a generator.
While exact inversion is impossible, the backpropagation algorithm can be used to find inputs which activate the network in the desired manner. This technique has been applied in the computer vision domain to gain insights into network operation [11], find adversarial examples which make imperceptible modifications to image inputs in order to change the network's predictions [12], synthesize textures [13], and regenerate an image according to the style (essentially matching the low-level texture) of another, referred to as style transfer [14].

In this work we investigate the possibility of converting a discriminatively trained CTC speech recognition network into a generator. In particular, we investigate: (i) generating waveforms based solely on the activations of selected network layers, giving insights into the nature of the network's internal representations, (ii) speech texture synthesis by generating waveforms which result in neuron activations in shallow layers whose statistics are similar to those of real speech, and (iii) voice conversion, the speech analog of image style transfer, where the previous two methods are combined to generate waveforms which match the high-level network activations from a content utterance while simultaneously matching low-level statistics computed from lower-level activations from a style (identity) utterance.

2. BACKGROUND

2.1. Texture synthesis based on matching statistics

Julesz [15] proposed that visual texture discrimination is a function of an image's low-level statistical properties. McDermott et al. [16, 17] applied the same idea to sound, showing that perception of sound textures relies on matching certain low-level signal statistics.
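To make the inversion-by-backpropagation idea concrete, the toy sketch below optimizes an input by gradient descent so that its activations match a target pattern. The one-layer ReLU model, its shapes, and the learning rate are hypothetical stand-ins, not the network used in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))               # weights of a frozen "layer"
relu = lambda z: np.maximum(z, 0.0)

x_ref = rng.normal(size=8)                 # hidden reference input
a_ref = relu(W @ x_ref)                    # target activation pattern

loss = lambda x: np.sum((relu(W @ x) - a_ref) ** 2)

x = 0.1 * rng.normal(size=8)               # start from a near-silent input
loss0 = loss(x)
lr = 0.01
for _ in range(2000):
    z = W @ x
    # Backpropagate the activation-matching error towards the input.
    grad = W.T @ ((relu(z) - a_ref) * (z > 0))
    x -= lr * grad

final_loss = loss(x)
```

As in the paper's setting, exact inversion is not guaranteed (the problem is nonconvex), but gradient descent drives the activations close to the target.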
Furthermore, following earlier work on image texture synthesis [18], they demonstrated that simple sound textures, such as rain or fire, can be synthesized using a gradient-based optimization procedure to iteratively update a white noise signal to match the statistics of observed texture signals.

Recently, Gatys et al. [13] proposed a similar statistic-matching algorithm to synthesize visual textures. However, instead of manually designing the relevant statistics as a function of the image pixels, they utilized a deep convolutional neural network discriminatively trained on an image classification task. Specifically, they proposed to match uncentered correlations between neuron activations in a selected network layer. Formally, let C^{(n)} ∈ R^{W×H×D} denote the activations of the n-th convolutional layer, where W is the width of the layer, H is its height, and D is the number of filters. The Gram matrix of uncentered correlations G^{(n)} ∈ R^{D×D} is defined as:

    G^{(n)}_{i,j} = \frac{1}{WH} \sum_{w=1}^{W} \sum_{h=1}^{H} C^{(n)}_{whi} C^{(n)}_{whj}.    (1)

Gatys et al. demonstrated that realistic visual textures can be synthesized by matching the Gram matrices. In other words, the statistics necessary for texture synthesis are the correlations between the values of two convolutional filters taken over all the pixels in a given convolutional filter map. We note that the Gram features in equation (1) are averaged over all image pixels, and therefore are stationary with respect to pixel location.

2.2. Style transfer

Approximate network inversion and statistic-matching texture synthesis both generate images by minimizing a loss function with backpropagation towards the inputs. These two approaches can be combined to sample images whose content is similar to a seed image, and whose texture is similar to another one [14].
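Equation (1) is straightforward to compute by flattening the two spatial dimensions; a small sketch with hypothetical shapes:

```python
import numpy as np

def gram_matrix(C):
    """Uncentered correlations of equation (1); C has shape (W, H, D)."""
    W, H, D = C.shape
    flat = C.reshape(W * H, D)       # merge the two spatial dimensions
    return flat.T @ flat / (W * H)   # (D, D) Gram matrix

C = np.random.default_rng(1).normal(size=(4, 5, 3))
G = gram_matrix(C)
```

By construction G is symmetric and independent of pixel positions, which is what makes it a stationary texture descriptor.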
This approach to style transfer is attractive because it leverages a pretrained neural network which has learned the distribution of natural images, and therefore does not require a large dataset at generation time – a single image of a given style is all that is required, and it need not be related to the images used to train the network.

3. SPEECH RECOGNITION INPUT RECONSTRUCTION

3.1. Network architecture

To apply the texture generation and stylization techniques to speech we train a fully convolutional speech recognition network following [19] on the Wall Street Journal dataset. The network is trained to predict character sequences in an end-to-end fashion using the CTC [20] criterion. We use parameters typical for a speech recognition network: waveforms sampled at 16 kHz are segmented into 25 ms windows taken every 10 ms. From each window we extract 80 log-mel filterbank features augmented with deltas and delta-deltas. The 13-layer network architecture is derived from [19]:

C0: 128-dimensional 5×5 convolution with 2×2 max-pooling,
C1: 128-dimensional 5×5 convolution with 1×2 max-pooling,
C2: 128-dimensional 5×3 convolution,
C3: 256-dimensional 5×3 convolution with 1×2 max-pooling,
C4–9: six blocks of 256-dimensional 5×3 convolution,
FC0–1: two 1024-dimensional fully connected layers,
CTC: a fully connected layer and CTC cost over characters,

where filter and pooling window sizes are specified in time × frequency. All layers use batch normalization, ReLU activations, and dropout regularization. Convolutional layers C0–9 use dropout keep probability 0.75, and fully connected layers use keep probability 0.9. The network is trained using 10 asynchronous workers with the Adam [21] optimizer using β1 = 0.9, β2 = 0.999, ε = 10^-6 and learning rate annealing from 10^-3 to 10^-6. We also use L2 weight decay of 10^-6.
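From this layer list one can estimate how much temporal context each layer sees. The sketch below makes simplifying assumptions: convolutions use stride 1 with 5-frame time kernels (as listed), only C0 subsamples time (its 2×2 pooling has time stride 2), and pooling kernel widths are ignored in the count:

```python
# Temporal receptive field (in 10 ms frames) of each conv layer,
# counting only the 5-frame convolution kernels and treating
# pooling as pure time subsampling (an approximation).
layers = [("C0", 5, 2), ("C1", 5, 1), ("C2", 5, 1), ("C3", 5, 1)] \
         + [(f"C{i}", 5, 1) for i in range(4, 10)]

rf, jump = 1, 1
fields = {}
for name, kernel, time_stride in layers:
    rf += (kernel - 1) * jump   # widen by the kernel, scaled by current step
    jump *= time_stride         # subsampling increases the step between outputs
    fields[name] = rf
```

Under these assumptions C9 sees 77 frames, i.e. roughly 0.8 s of signal, consistent with the "tens of frames" behind the babble-like textures discussed later.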
When decoded using the extended trigram language model from the Kaldi WSJ S5 recipe [22], the model reaches a word error rate (WER) of 7.8% on eval92. While our network does not reach state-of-the-art accuracy on this dataset, it has reasonable performance and is easily amenable to backpropagation towards inputs. Even though the network was trained on the WSJ corpus (footnote 1), we use the VCTK dataset (footnote 2) for all subsequent experiments.

Footnote 1: https://catalog.ldc.upenn.edu/ldc93s6a, https://catalog.ldc.upenn.edu/ldc94s13a
Footnote 2: http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html

3.2. Waveform sample reconstruction

Our goal is to generate waveforms that will result in a particular neuron activation pattern when processed by a deep network. Ideally, waveform samples would be optimized directly using the backpropagation algorithm. One possibility is to train networks that operate on raw waveforms as in [23]. However, it is also possible to implement the typical speech feature pipeline in a differentiable way. We follow the second approach, which is facilitated by readily available TensorFlow implementations of signal processing routines [24]:

1. Waveform framing and Hamming window application.
2. DFT computation, which multiplies waveform frames by a complex-valued DFT matrix.
3. Smooth approximate modulus computation, implemented as abs(x) ≈ sqrt(ε + re(x)^2 + im(x)^2), with ε = 10^-3.
4. Filterbank (footnote 3) feature computation, which can be implemented as a matrix multiplication.
5. Taking the elementwise logarithm of the filterbank features.
6. Computing deltas and delta-deltas using convolution over time.

Footnote 3: We use a mixed linear and mel scale, where frequencies below 1 kHz are copied from the STFT and higher frequencies are compressed using the mel scale. We employ this scaling because below 1 kHz the mel scale allocates too many bands to low frequencies, some of which are always zero when using 80 mel bands and 256 FFT bins, which was found to be optimal for recognition.

This feature extraction pipeline facilitates two methods for reconstructing waveform samples: (i) gradient-based optimization with backpropagation directly to the waveform, or (ii) gradient-based optimization of the linear spectrogram, followed by Griffin-Lim [25] phase reconstruction. We find that a dual strategy works best: we first perform spectrogram reconstruction, then invert the spectrogram to yield an initial waveform which is further optimized directly. We use the L-BFGS optimizer [26] for both optimization stages.

3.3. Speech reconstruction from network activations

We implement waveform reconstruction based on network activations following the ReLU non-linearity in a specified layer. Figure 1 shows the spectrograms of waveform reconstructions for speakers p225 and p226 from the VCTK dataset (footnote 4). We have qualitatively established that waveforms reconstructed from shallow network layers are intelligible and the speaker can be clearly identified. Audible phase artifacts are introduced in reconstructions from layer C3 and above, after the final pooling operation over time. While the speech quality degrades, many speaker characteristics are preserved in the reconstructions up to the fully connected layers. Listening to reconstructions from layer C9 it remains possible to recognize the speaker's gender.

Fig. 1. Mel spectrograms of waveforms reconstructed from layers C0 and FC1 of speakers p225 (female) and p226 (male) from the VCTK dataset. The reconstructions from C0 are nearly exact, while the reconstructions from FC1 are very noisy and barely intelligible.

In order to reconstruct the waveforms from activations in the fully connected layers FC0 and FC1, we find that the reconstruction cost must be extended with a term penalizing differences between the total energy in each feature frame of the reference and reconstruction. We hypothesize that the network's representation in deeper layers has learned a degree of invariance to the signal magnitude, which hampers reconstruction of realistic signals. For example, the network reliably predicts the CTC blank symbol both for silence and white noise at different amplitudes. The addition of this energy matching penalty enables the network to correctly reconstruct silent segments. However, even with this additional penalty, reconstructions from layers FC0 and FC1 are highly distorted. The words are only intelligible with difficulty and the speaker identity is lost.

Footnote 4: Sound samples are available at https://google.github.io/speech_style_transfer/samples.html

Fig. 2. MDS embeddings of speaker vectors computed on original VCTK recordings, reconstructions from the network, synthesized waveforms, and voice-converted waveforms. The synthesized and voice-converted utterances are close to the original utterances and reconstructions from early layers. Reconstructions from deep layers converge to a single point, indicating that the speaker identity is lost.

To evaluate how well reconstructions based on different layers capture characteristics of different speakers, we visualize embedding vectors computed using an internal speaker identification system that uses a Resnet-50 architecture [27] trained on LibriVox (footnote 5) using a triplet loss [28]. Nearest neighbor classification using these embeddings obtains nearly perfect accuracy on the original VCTK signals. Figure 2 shows a two-dimensional MDS [29] embedding of these vectors. In reconstructions from early layers, signals from each speaker cluster together with no overlap.
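The differentiable pipeline of steps 1–5 can be sketched in NumPy (the paper uses TensorFlow ops and a mixed linear/mel scale; the plain triangular mel filterbank below is a simplified stand-in, and the deltas of step 6 are omitted):

```python
import numpy as np

def frame(signal, win=400, hop=160):
    """Step 1: slice into 25 ms Hamming windows every 10 ms (16 kHz)."""
    n = 1 + (len(signal) - win) // hop
    idx = np.arange(win) + hop * np.arange(n)[:, None]
    return signal[idx] * np.hamming(win)

def smooth_modulus(spec, eps=1e-3):
    """Step 3: differentiable |x| ≈ sqrt(eps + re^2 + im^2)."""
    return np.sqrt(eps + spec.real ** 2 + spec.imag ** 2)

def mel_matrix(n_bins, n_mels, sr):
    """Step 4: triangular mel filters; a simplified stand-in for the
    paper's mixed linear/mel scale."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = mel2hz(np.linspace(0.0, hz2mel(sr / 2), n_mels + 2))
    bins = np.floor((n_bins - 1) * 2 * pts / sr).astype(int)
    M = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for b in range(l, c):
            M[i, b] = (b - l) / max(c - l, 1)
        for b in range(c, r):
            M[i, b] = (r - b) / max(r - c, 1)
    return M

def log_mel(signal, n_fft=512, n_mels=80, sr=16000):
    frames = frame(signal)
    spec = np.fft.rfft(frames, n=n_fft)            # step 2: DFT
    mag = smooth_modulus(spec)                      # step 3
    mel = mel_matrix(n_fft // 2 + 1, n_mels, sr)    # step 4
    return np.log(mag @ mel.T + 1e-6)               # step 5

signal = np.sin(2 * np.pi * 440.0 * np.arange(16000) / 16000.0)  # 1 s tone
features = log_mel(signal)
```

Every operation here is a smooth function of the waveform samples (the ε inside the modulus avoids the non-differentiable point at zero), so gradients can flow all the way back to the input.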
As the depth increases, the embeddings for all speakers begin to converge on a single point, indicating that the speakers become progressively more difficult to recognize. From this we conclude that the network's internal representation becomes progressively more speaker invariant with increasing depth, a desirable property for speaker-independent speech recognition.

3.4. Speech texture synthesis

Unlike image textures, whose statistics can be assumed to be stationary across both spatial dimensions, the two dimensions of speech spectrogram features, i.e. time and frequency, have different semantics and should be treated differently. Sound textures are stationary over time but nonstationary across frequency. This suggests that features extracted from layer activations should involve correlations over time alone. Let C^{(n)} ∈ R^{T×F×D} be the tensor of activations of the n-th layer of the network, which consists of D filters computed for T frames and F frequencies. The temporally stationary Gram tensor G^{(n)} ∈ R^{F×F×D×D} can be written as:

    G^{(n)}_{ijkl} = \frac{1}{T} \sum_{t=1}^{T} C^{(n)}_{tik} C^{(n)}_{tjl}.    (2)

We demonstrate that these Gram tensors capture speaker identity by using them as features in a simple nearest neighbor speaker identification system. Figure 3 shows the speaker identification accuracy of this system over the first 15 utterances of the first 30 speakers of the VCTK dataset. Using the lower network layers (up to C3) yields an accuracy close to 95%, whereas using similar Gram tensors of raw mel-spectrograms extended with deltas and delta-deltas yields only 65% accuracy. Deeper layers of the network become progressively less speaker sensitive, mirroring our observations from Figure 2.

Footnote 5: https://librivox.org/

Fig. 3. Accuracy of nearest-neighbor speaker classification using Gram tensors extracted from different network layers.

We also observe that network training is crucial for Gram features to become speaker-selective and for the texture synthesis to work. After a random initialization the network behaves differently than it does after training: the Gram tensors computed on shallow layers of the untrained network are less sensitive to speaker identity than the corresponding layers in the trained network, while their deeper layers do not exhibit as dramatic a decrease in speaker sensitivity. In contrast, image texture synthesis and style transfer have been reported to work with randomly initialized networks [30].

Figure 4 shows spectrograms of generated speech textures based on speech from the VCTK dataset and a male native Polish speaker. The Gram tensor computed on first layer activations captures the fundamental frequency and harmonics but yields a fairly uniform temporal structure. When features computed on deeper layers are used, longer-term phonemic structure can be seen, although the overall speech is not intelligible. This is a consequence of the increased temporal receptive field of filters in deeper layers, where a single activation is a function of structure spanning tens of frames, enabling the reconstruction of realistic speech babble sounds.

Fig. 4. Mel spectrograms of textures synthesized from Gram matrices computed on 20 utterances from VCTK speakers p225 (female) and p226 (male), as well as a short (1 s) utterance in Polish (male). Panels: "p225, C0"; "p226, C0–C5"; "p225, C0–C5"; "Polish speaker, C0–C5". When deeper layers are used, the generated sound captures more temporal structure. Intuitively, listening to "p225, C0" it is hard to discern words, whereas one can hear word boundaries in "p225, C0–C5". One can also see the characteristic lower pitch in the synthesized male voices.

3.5. Voice conversion

The methods described in the previous two sections can be combined to produce the speech analog of image style transfer: a voice conversion system. Specifically, we reconstruct the deep-layer activations of a content utterance, and the shallow-layer Gram features of identity (style) utterances. Listening to the converted samples, we found that a good tradeoff between matching the target speaker's voice and sound quality occurs when optimizing a loss that spans all layers, with layers C0–C5 matched to style utterances using Gram features, and layers C6–FC1 matched to the content utterance. We normalize the contribution of each layer to the cost by dividing the squared difference between the Gram or activation matrices by their dimensionality. Furthermore, the Gram features of style layers C0–C5 use a weight of 10^5, activations of content layers C6–C9 use weight 0.2, and activations of layers FC0–FC1 use weight 10, to base the reconstruction on the deepest layers but provide some signal from those in the middle, which are responsible for final voice quality. While the speaker remains identifiable in reconstructions from layers C6–C9 as described in Section 3.3, we find that including these layers in the content loss leads to more natural sounding synthesis. The speaker identity still changes when the Gram feature weight is sufficiently large.

Spectrograms of utterances generated using this procedure are shown in Figure 5. From the spectrograms one can see that the converted utterances contain very different pitch, consistent with the opposite gender.

Fig. 5. Mel spectrograms of voice conversion, mapping VCTK utterance 004 between speakers p225 (female) and p226 (male), performed by matching neuron activations to those from a content utterance and Gram features to those computed from 19 speaker identity utterances (about 2 minutes).
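The combined objective can be sketched as follows. The layer names, tensor shapes, and the two-layer style subset below are illustrative placeholders for the full C0–C5 / C6–FC1 configuration; dimensionality normalization is done with a mean over elements:

```python
import numpy as np

def temporal_gram(C):
    """Temporally stationary Gram tensor of equation (2);
    C has shape (T, F, D), result has shape (F, F, D, D)."""
    T = C.shape[0]
    return np.einsum('tik,tjl->ijkl', C, C) / T

def conversion_loss(acts, style_acts, content_acts,
                    style_layers=("C0", "C1"), content_layers=("C6",),
                    w_style=1e5, w_content=0.2):
    """Layer-weighted loss combining Gram matching (style/identity)
    with direct activation matching (content)."""
    loss = 0.0
    for name in style_layers:
        d = temporal_gram(acts[name]) - temporal_gram(style_acts[name])
        loss += w_style * np.mean(d ** 2)    # normalize by dimensionality
    for name in content_layers:
        d = acts[name] - content_acts[name]
        loss += w_content * np.mean(d ** 2)
    return loss

rng = np.random.default_rng(2)
acts = {k: rng.normal(size=(10, 4, 3)) for k in ("C0", "C1", "C6")}
shifted = {k: v + 1.0 for k, v in acts.items()}
zero_loss = conversion_loss(acts, acts, acts)
pos_loss = conversion_loss(acts, shifted, acts)
```

In the actual system this scalar loss would be minimized with respect to the waveform samples; here the random activation dictionaries merely verify that the loss vanishes only when both the Gram statistics and the content activations match.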
However, because the content loss is applied directly to neuron activations, the exact temporal structure of the content utterance is retained. This highlights a limitation of this approach: the fixed temporal alignment to the content utterance means that it cannot model temporal variation characteristic of different speakers, such as changes in speaking rate.

4. RELATED WORK

The success of neural image style transfer has prompted a few attempts to apply it to audio. Roberts et al. [31] trained audio clip embeddings using a convolutional network applied directly to raw waveforms and attempted to generate waveforms by maximizing activations of neurons in selected layers. The authors report noisy results and attribute them to the low quality of the learned filters. Ulyanov et al. [32] used an untrained single-layer network to synthesize simple audio textures such as keyboard and machine gun sounds, and attempted audio style transfer between different musical pieces. The recent work of Wyse [33] is most similar to ours. He examines the application of pretrained convolutional networks for image recognition and for environmental sound classification. An example of style transfer from human speech to a crowing rooster demonstrates the importance of using a network that has been trained on audio features, which is in line with our findings. To the best of our knowledge, our work is the first to demonstrate that style transfer techniques applied to speech recognition networks can be used for voice conversion.

Speech babble sounds have previously been generated using an unconditioned WaveNet [8] model trained to synthesize speech waveforms. In contrast, we demonstrate that such complex sound textures can be generated from a speech recognition network, using very limited amounts of data from the target speaker.
Typical voice conversion systems rely on advanced speech representations, such as STRAIGHT [34], and use a dedicated conversion function trained on aligned, parallel corpora of different speakers. An overview of the state of the art in this area can be found in the recent Voice Conversion Challenge [35]. While our system produces samples of inferior quality, it operates on a different and novel principle: rather than learning a frame-to-frame conversion, it uses a speech recognition network to define a speaker similarity cost that can be optimized to change the perceived identity of the speaker.

5. LIMITATIONS AND FUTURE WORK

We demonstrate a proof-of-concept speech texture synthesis and voice conversion system that derives a statistical description of the target voice from the activations of a deep convolutional neural network trained to perform speech recognition. The main benefit of the proposed approach is the ability to utilize very limited amounts of data from the target speaker. Leveraging the distribution of natural speech captured by the pretrained network, a few seconds of speech are sufficient to synthesize recognizable characteristics of the target voice. However, the proposed approach is also quite slow, requiring several thousand gradient descent steps. In addition, the synthesized utterances are of relatively low quality.

The proposed approach can be extended in many ways. First, analogously to fast image style transfer algorithms [36, 37, 38], the Gram tensor loss can be used as additional supervision for a speech synthesis neural network such as WaveNet [8] or Tacotron [39]. For example, it might be feasible to use the style loss to extend a neural speech synthesis system to a wide set of speakers given only a few seconds of recorded speech from each one. Second, the method depends on a pretrained speech recognition network.
In this work we used a fairly basic network with feature extraction parameters tuned for speech recognition. Synthesis quality could probably be improved by using higher sampling rates, increasing the window overlap, and running the network on linear rather than mel-filterbank features.

6. ACKNOWLEDGMENTS

The authors thank Yoram Singer, Colin Raffel, Matt Hoffman, Joseph Antognini, and Navdeep Jaitly for helpful discussions and inspirations, RJ Skerry-Ryan for signal processing in TF, and Aren Jansen and Sourish Chaudhuri for help with the speaker identification system.

7. REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012.
[2] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in Proc. ICML, 2014, pp. 1764–1772.
[3] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems, 2015.
[4] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," 2014.
[5] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[6] Y. Wu, M. Schuster, Z. Chen, et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," 2016.
[7] H. Zen, Y. Agiomyrgiannakis, N. Egberts, F. Henderson, and P. Szczepaniak, "Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices," arXiv:1606.06061, 2016.
[8] A. v. d. Oord, S. Dieleman, H. Zen, et al., "WaveNet: A generative model for raw audio," 2016.
[9] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," 2015.
[10] C. Dong, C. C. Loy, K. He, and X. Tang, "Learning a deep convolutional network for image super-resolution," in European Conference on Computer Vision. Springer, 2014, pp. 184–199.
[11] K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep inside convolutional networks: Visualising image classification models and saliency maps," 2013.
[12] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," 2014.
[13] L. Gatys, A. S. Ecker, and M. Bethge, "Texture synthesis using convolutional neural networks," in Advances in Neural Information Processing Systems, 2015, pp. 262–270.
[14] L. A. Gatys, A. S. Ecker, and M. Bethge, "A neural algorithm of artistic style," 2015.
[15] B. Julesz, "Visual pattern discrimination," IRE Transactions on Information Theory, vol. 8, no. 2, pp. 84–92, 1962.
[16] J. H. McDermott, A. J. Oxenham, and E. P. Simoncelli, "Sound texture synthesis via filter statistics," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2009.
[17] J. H. McDermott and E. P. Simoncelli, "Sound texture perception via statistics of the auditory periphery: evidence from sound synthesis," Neuron, vol. 71, no. 5, pp. 926–940, 2011.
[18] J. Portilla and E. P. Simoncelli, "A parametric texture model based on joint statistics of complex wavelet coefficients," International Journal of Computer Vision, vol. 40, no. 1, 2000.
[19] Y. Zhang, M. Pezeshki, P. Brakel, et al., "Towards end-to-end speech recognition with deep convolutional neural networks," arXiv:1701.02720, 2017.
[20] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proc. ICML. ACM, 2006, pp. 369–376.
[21] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014.
[22] D. Povey, A. Ghoshal, G. Boulianne, et al., "The Kaldi speech recognition toolkit," in Proc. ASRU, 2011.
[23] T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals, "Learning the speech front-end with raw waveform CLDNNs," in Proc. Interspeech, 2015.
[24] M. Abadi, A. Agarwal, P. Barham, et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," arXiv:1603.04467, 2016.
[25] D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.
[26] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal, "Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization," ACM Transactions on Mathematical Software, vol. 23, no. 4, pp. 550–560, 1997.
[27] S. Hershey, S. Chaudhuri, D. P. Ellis, et al., "CNN architectures for large-scale audio classification," in Proc. ICASSP. IEEE, 2017, pp. 131–135.
[28] H. Bredin, "TristouNet: triplet loss for speaker turn embedding," in Proc. ICASSP, 2017, pp. 5430–5434.
[29] J. B. Kruskal and M. Wish, Multidimensional Scaling, vol. 11, Sage, 1978.
[30] K. He, Y. Wang, and J. Hopcroft, "A powerful generative model using random weights for the deep image representation," in Advances in Neural Information Processing Systems, 2016.
[31] A. Roberts, C. Resnick, D. Ardila, and D. Eck, "Audio deepdream: Optimizing raw audio with convolutional networks," in Proc. ISMIR, 2016.
[32] D. Ulyanov and V. Lebedev, "Audio texture synthesis and style transfer," https://dmitryulyanov.github.io/audio-texture-synthesis-and-style-transfer/, 2016.
[33] L. Wyse, "Audio spectrogram representations for processing with convolutional neural networks," 2017.
[34] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3, pp. 187–207, 1999.
[35] T. Toda, L.-H. Chen, D. Saito, et al., "The Voice Conversion Challenge 2016," in Proc. Interspeech, 2016, pp. 1632–1636.
[36] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. S. Lempitsky, "Texture networks: Feed-forward synthesis of textures and stylized images," in Proc. ICML, 2016, pp. 1349–1357.
[37] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," in European Conference on Computer Vision. Springer, 2016, pp. 694–711.
[38] V. Dumoulin, J. Shlens, and M. Kudlur, "A learned representation for artistic style," in Proc. ICLR, 2017.
[39] Y. Wang, R. Skerry-Ryan, D. Stanton, et al., "Tacotron: A fully end-to-end text-to-speech synthesis model," in Proc. Interspeech, 2017.