Learning Disentangled Representations of Timbre and Pitch for Musical Instrument Sounds Using Gaussian Mixture Variational Autoencoders
Yin-Jyun Luo 1,2, Kat Agres 2,3, Dorien Herremans 1,2
1 Singapore University of Technology and Design
2 Institute of High Performance Computing, A*STAR, Singapore
3 Yong Siew Toh Conservatory of Music, National University of Singapore
yinjyun_luo@mymail.sutd.edu.sg, kat_agres@ihpc.astar.edu.sg, dorien_herremans@sutd.edu.sg

ABSTRACT

In this paper, we learn disentangled representations of timbre and pitch for musical instrument sounds. We adapt a framework based on variational autoencoders with Gaussian mixture latent distributions. Specifically, we use two separate encoders to learn distinct latent spaces for timbre and pitch, which form Gaussian mixture components representing instrument identity and pitch, respectively. For reconstruction, latent variables of timbre and pitch are sampled from the corresponding mixture components and concatenated as the input to a decoder. We show the model's efficacy using latent space visualization, and a quantitative analysis indicates the discriminability of these spaces, even with a limited number of instrument labels for training. The model allows for controllable synthesis of selected instrument sounds by sampling from the latent spaces. To evaluate this, we trained instrument and pitch classifiers using the original labeled data. These classifiers achieve high F-scores when tested on our synthesized sounds, which verifies the model's performance in controllable, realistic timbre/pitch synthesis. Our model also enables timbre transfer between multiple instruments, with a single encoder-decoder architecture, which is evaluated by measuring the shift in the posterior of instrument classification. Our in-depth evaluation confirms the model's ability to successfully disentangle timbre and pitch.¹

¹ Example audio files and code at http://bit.ly/2Dbyt9j

© Yin-Jyun Luo, Kat Agres, Dorien Herremans. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Yin-Jyun Luo, Kat Agres, Dorien Herremans. "Learning Disentangled Representations of Timbre and Pitch for Musical Instrument Sounds Using Gaussian Mixture Variational Autoencoders", 20th International Society for Music Information Retrieval Conference, Delft, The Netherlands, 2019.

1. INTRODUCTION

A disentangled feature representation is defined as having disjoint subsets of feature dimensions that are only sensitive to changes in the corresponding factors of variation in the observed data [2, 27, 32]. Deep generative models [13, 19, 25, 33] have been exploited to learn disentangled representations in different domains. In the visual domain, studies have focused on learning independent representations of data generative factors such as identity and azimuth [5, 14, 26]. In natural language generation, efforts have been made to generate text with controlled sentiment [10, 18, 36]. In the speech domain, we have witnessed successful attempts at controllable speech synthesis by disentangling factors such as speaker identity, speed of speech, emotion, and noise level [15, 17, 35]. There has been relatively little research, however, on learning disentangled representations for music.

In this paper, we disentangle the pitch and timbre of musical instrument sound recordings. Pitch and timbre are essential properties of musical sounds.
Given that one pitch can be played by different instruments, we assume the two factors can be separated. From the perspective of music analysis, disentangled representations of pitch and timbre can be regarded as timbre- and pitch-invariant features, which could be exploited for downstream tasks [29, 30]. From the synthesis point of view, disentangled representations enable the generation of musical notes with identical pitches (timbres) and different timbres (pitches).

Recently, Hung et al. presented the first attempt to learn disentangled representations of pitch and timbre for synthesized music by using frame-level instrument and pitch labels with encoder-decoder networks [21]. Even though the authors managed to change instrumentation to some extent without affecting pitch structure, the approach was restrictive, as it worked with MIDI-synthesized audio and relied on clean frame-level labels, which are scarce. Disentangled representations allow for several applications, including music style transfer. Brunner et al. proposed a model based on variational autoencoders (VAEs) [25] to generate music with controllable attributes [4]. While genre was factorized by an auxiliary classifier, other musical properties remained entangled. Besides the aforementioned models based on MIDI, research on audio has focused on translating between different domains of instrumentation [3, 7, 20, 28]. None of these works, however, has addressed learning disentangled latent variables of both pitch and timbre.

This research distinguishes itself from others by disentangling instrument sounds into distinct sets of latent variables (i.e., pitch and timbre), with a framework based on Gaussian Mixture VAEs (GMVAEs).

Figure 1. The proposed framework includes separate encoders for pitch and timbre, and a shared decoder.

We model the generative process of an isolated musical note by independently sampling pitch and timbre (instrument) categorical variables. Note that the two factors are in fact dependent, in the sense that pitch range is instrument-dependent; however, we verify the model's capability to disentangle them under this simplified assumption of independence. Conditioned on these categorical variables, Gaussian-distributed latent variables are then sampled that characterize variation within the sampled pitch and instrument, respectively. Finally, the data are generated conditioned on the two latent variables. We favor the proposed framework over vanilla VAEs [8, 9] for its more flexible latent distribution compared to a standard Gaussian. In addition, it allows for unsupervised or semi-supervised clustering, which can learn interpretable mixture components and the corresponding Gaussian parameters. More importantly, such a framework facilitates the applications in this research: controllable synthesis of instrument sounds, and many-to-many transfer of instrument timbres. Our proposed framework differs from previous studies on timbre transfer in that we achieve transfer between multiple instruments without training a domain-specific decoder for each instrument (e.g. [28]), and we infer both the pitch and timbre latent variables without requiring categorical conditions of source pitch and instrument as in [3]. We evaluate our model by visualizing both the latent space and the synthesized spectrograms, and explore the classification F-scores of classifiers trained in an end-to-end fashion.
The results confirm the model's ability to learn disentangled pitch and timbre representations.

The rest of the paper is organized as follows: Section 2 discusses the proposed framework, and Section 3 describes the dataset and experimental setup. Experiments and results are reported in Section 4. We conclude our work and provide future directions in Section 5.

2. PROPOSED FRAMEWORK

In this section, we briefly describe VAEs and GMVAEs, and elaborate on the proposed framework and architecture.

2.1 Gaussian Mixture Variational Autoencoders

VAEs [25] are unsupervised generative models that combine latent variable models and deep learning [12]. We denote the observed data and the latent variables by X and z, respectively. A graphical model, corresponding to z → X, is trained by maximizing a lower bound on the log marginal likelihood p(X). The intractable posterior p(z|X) is approximated by introducing a variational distribution q(z|X) parameterized with neural networks. In regular VAEs, a common choice for the prior distribution p(z) is an isotropic Gaussian, which encourages each dimension of the latent variables to capture an independent factor of variation in the data and results in a disentangled representation [14]. Such a unimodal prior, however, does not allow for multi-modal representations. GMVAEs [6, 22, 24] extend the prior to a mixture of Gaussians and assume the observed data are generated by first determining the mode from which they are generated, which corresponds to learning a graphical model y → z → X. This introduces a categorical variable y and a distribution q(y|X), which infers the class of the data. This enables semi-supervised learning [24] and unsupervised clustering [6, 22] in deep generative models. In the speech domain, Hsu et al. used two mixture distributions to separately model supervised speaker attributes and unsupervised utterance attributes, which allowed for extra flexibility in conditional speech generation [17]. We build upon this idea to learn separate latent distributions representing the pitch and timbre of musical instrument sounds. More importantly, to facilitate downstream creative applications such as controllable synthesis and instrument timbre transfer in music, we propose to model supervised pitch representations and semi-supervised timbre representations, using labels of pitch and instrument identity. As such, the mixture components in the pitch and timbre latent spaces can be clearly interpreted as the classes, i.e., pitch and instrument identity.

2.2 Model Formulation

The latent variables of pitch and timbre for an isolated musical note X are denoted as z_p (pitch code) and z_t (timbre code), respectively. To represent Gaussian mixture latent distributions, two categorical variables are introduced: an M-way categorical variable y_p for pitch, where M is the number of recorded pitches in the dataset, and a K-way categorical variable y_t for timbre, where K is the number of instrument classes. We consider y_p to be observed (fully supervised), which assumes the availability of pitch labels during training and is reasonable as we model isolated instrument sounds in this research. For y_t, we investigate both unsupervised and semi-supervised learning, i.e., using varying numbers of instrument labels for training. As shown in Section 4, our model can efficiently leverage a limited number of labels.
Without loss of generality, we denote y_t as unobserved (unsupervised), as in [17]. The joint probability of X, y_t, z_t and z_p is written as:

$$p(X, y_t, z_t, z_p \mid y_p) = p(X \mid z_p, z_t)\, p(z_p \mid y_p)\, p(z_t \mid y_t)\, p(y_t), \qquad (1)$$

where p(y_t) is uniformly distributed, i.e., we do not assume to know the instrument distribution in the dataset. Both conditional distributions p(z_p | y_p) = N(μ_{y_p}, diag(σ_{y_p})) and p(z_t | y_t) = N(μ_{y_t}, diag(σ_{y_t})) are assumed to be diagonal-covariance Gaussians with learnable means and constant variances. This amounts to both marginal priors p(z_p) and p(z_t) being Gaussian mixture models (GMMs) with diagonal covariances. Ideally, each mixture component in the former (pitch space) uniquely represents the pitch of X, while each component in the latter (timbre space) is interpreted as the instrument identity. As we will see in Section 4.1, however, moderate supervision is essential to learn a timbre space that groups instruments perfectly.

For creative applications such as synthesis and timbre transfer of instrument sounds, the proposed model has several merits: 1) the learnt representations are not restricted to be unimodal, which offers a more discriminative timbre space than regular VAEs (Sections 4.1 and 4.2); 2) direct and intuitive sampling from the pitch and timbre spaces allows for consistent and controllable synthesis of instrument sounds, because the Gaussian parameters of each interpretable mixture component are readily available after training (Section 4.3); and 3) simple arithmetic manipulations between the means of mixture components facilitate many-to-many transfer between instrument timbres (Section 4.4).

For the training objective, we closely follow the derivation in [17] and train the model by maximizing the evidence lower bound (ELBO):

$$\begin{aligned}
\mathcal{L}(p, q; X, y_p) ={}& \mathbb{E}_{q(z_p \mid X)\, q(z_t \mid X)}\big[\log p(X \mid z_p, z_t)\big] \\
&- D_{\mathrm{KL}}\big(q(z_p \mid X) \,\|\, p(z_p \mid y_p)\big) \\
&- \mathbb{E}_{q(y_t \mid X)}\big[D_{\mathrm{KL}}\big(q(z_t \mid X) \,\|\, p(z_t \mid y_t)\big)\big] \\
&- D_{\mathrm{KL}}\big(q(y_t \mid X) \,\|\, p(y_t)\big), \qquad (2)
\end{aligned}$$

where p(X | z_p, z_t), q(z_p | X), and q(z_t | X) are parameterized with neural networks, referred to as the decoder, pitch encoder², and timbre encoder, respectively. Instead of using another neural network, we approximate q(y_t | X) by E_{q(z_t|X)}[p(y_t | z_t)]. Readers interested in the detailed derivation are referred to Appendix A in [17].

2.3 Architecture

Our model is composed of a shared decoder and separate encoders for pitch and timbre, as illustrated in Figure 1. Specifically, we reshape the T-by-F spectrogram to have C = F channels, each of which is a T-by-1 vector, where T and F refer to time and frequency. Each encoder contains two one-dimensional convolutional layers, each with 512 filters of shape 3 × 1, and a fully connected layer with 512 units. A Gaussian parametric layer follows and outputs two L-dimensional vectors which represent the mean and log variance. z_p and z_t are sampled from the Gaussian layer with the reparameterization trick [25], which enables stochastic gradient descent, and are then concatenated for the decoder to reconstruct the input. The architecture of the decoder is symmetric to that of the encoder. Batch normalization followed by the ReLU activation function is used for every layer except the Gaussian and output layers. We use the tanh activation function for the output layer, as we normalize the data to [−1, 1].
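To make the objective in Eqn. (2) concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: the class and helper names (Encoder, gaussian_kl, elbo), the simplified MLP decoder (the paper's decoder mirrors the encoder), and the unit-variance Gaussian reconstruction term are illustrative assumptions; the hyperparameter values follow Section 3.2.

```python
# Minimal sketch of the GMVAE objective in Eqn. (2); names and the MLP decoder are assumptions.
import math
import torch
import torch.nn as nn

T, FREQ, L, M, K = 43, 256, 16, 82, 12        # frames, mel bins, latent dim, #pitches, #instruments

class Encoder(nn.Module):
    """Two 1-D conv layers + FC layer, outputting the mean and log-variance of q(z|X)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(FREQ, 512, kernel_size=3, padding=1), nn.BatchNorm1d(512), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, padding=1), nn.BatchNorm1d(512), nn.ReLU())
        self.fc = nn.Sequential(nn.Linear(512 * T, 512), nn.ReLU())
        self.mu, self.logvar = nn.Linear(512, L), nn.Linear(512, L)

    def forward(self, x):                      # x: (batch, FREQ, T)
        h = self.fc(self.conv(x).flatten(1))
        return self.mu(h), self.logvar(h)

def reparameterize(mu, logvar):
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

def gaussian_kl(mu_q, lv_q, mu_p, lv_p):
    """KL(N(mu_q, exp(lv_q)) || N(mu_p, exp(lv_p))), summed over the latent dims."""
    return 0.5 * torch.sum(lv_p - lv_q + (lv_q.exp() + (mu_q - mu_p) ** 2) / lv_p.exp() - 1.0, dim=-1)

def log_normal(z, mu, lv):
    """log N(z; mu, exp(lv)), summed over the latent dims."""
    return -0.5 * torch.sum(lv + (z - mu) ** 2 / lv.exp() + math.log(2 * math.pi), dim=-1)

# Learnable component means (Xavier init) and fixed standard deviations (Sec. 3.2).
mu_pitch  = nn.Parameter(nn.init.xavier_uniform_(torch.empty(M, L)))
mu_timbre = nn.Parameter(nn.init.xavier_uniform_(torch.empty(K, L)))
lv_pitch_prior  = torch.full((L,), -4.0)       # sigma = e^-2  ->  log-variance = -4
lv_timbre_prior = torch.zeros(L)               # sigma = e^0   ->  log-variance = 0

pitch_enc, timbre_enc = Encoder(), Encoder()
decoder = nn.Sequential(nn.Linear(2 * L, 512), nn.ReLU(), nn.Linear(512, FREQ * T), nn.Tanh())

def elbo(x, y_pitch):                          # x: (batch, FREQ, T); y_pitch: (batch,) int64
    mu_zp, lv_zp = pitch_enc(x)
    mu_zt, lv_zt = timbre_enc(x)
    z_p, z_t = reparameterize(mu_zp, lv_zp), reparameterize(mu_zt, lv_zt)
    x_hat = decoder(torch.cat([z_t, z_p], dim=-1)).view_as(x)

    recon = -((x_hat - x) ** 2).flatten(1).sum(-1)      # Gaussian log-likelihood up to a constant
    kl_p = gaussian_kl(mu_zp, lv_zp, mu_pitch[y_pitch], lv_pitch_prior)

    # q(y_t|X) ~= E_q(z_t|X)[p(y_t|z_t)], one-sample estimate with a uniform p(y_t).
    q_yt = torch.softmax(log_normal(z_t.unsqueeze(1), mu_timbre.unsqueeze(0), lv_timbre_prior), dim=-1)
    kl_t = (q_yt * gaussian_kl(mu_zt.unsqueeze(1), lv_zt.unsqueeze(1),
                               mu_timbre.unsqueeze(0), lv_timbre_prior)).sum(-1)
    kl_y = (q_yt * (torch.log(q_yt + 1e-8) + math.log(K))).sum(-1)    # KL against Uniform(K)

    return (recon - kl_p - kl_t - kl_y).mean()
```

Maximizing elbo(x, y_pitch) over all parameters with Adam (learning rate 10⁻⁴, batch size 128, Section 3.2) would correspond to the unsupervised case N = 0; the semi-supervised variant adds the cross-entropy term sketched after Section 3.3 below.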
² A common alternative is conditioning the model on categorical pitch labels such that one does not have to train a pitch encoder [3, 7]. This, however, requires the pitch of the inputs to be known a priori to perform tasks such as timbre transfer [3], and it also prevents the model from extracting pitch features for downstream tasks. By training this extra encoder, we also demonstrate how the model can be extended to learn multiple interpretable latent variables.

3. EXPERIMENTAL SETUP

In this section, we describe the experimental setup, including details of the dataset, input representation, and model configurations.

3.1 Dataset

Inspired by Esling et al. [8], we use a subset of Studio-On-Line (SOL) [1], a database of instrument note recordings.³ The dataset contains 12 instruments (sample counts in parentheses): piano (Pno, 246), violin (Vn, 138), cello (Vc, 147), English horn (Ehn, 128), French horn (Fhn, 214), tenor trombone (Trtb, 63), trumpet (Trop, 194), saxophone (Sax, 99), bassoon (Bn, 251), clarinet (Clr, 180), flute (Fl, 118), and oboe (Ob, 107). There are 1,885 samples in total.

All recordings are resampled to 22,050 Hz, and only the first 500 ms segment (T = 43) of each recording is considered. We extract Mel-spectrograms with 256 filterbanks (F = 256), derived from the power magnitude spectrum of the short-time Fourier transform (STFT). To compute the STFT, we use a Hann window with a window size of 92 ms and a hop size of 11 ms. As a result, the input representation is a 43-by-256 Mel-spectrogram. The dataset is split into a training (90%) and validation set (10%), each containing the same distribution of instruments. The magnitude of the Mel-spectrogram is scaled logarithmically, and the minimum and maximum values in the training set are used to normalize the magnitude to [−1, 1] in a corpus-wide fashion, so as to preserve differences in dynamics.

3.2 Hyperparameters

In order to train the GMMs in both the pitch and timbre spaces, we initialize the means of the mixture components using Xavier initialization [11]. We set constant standard deviations, rather than trainable ones, for the pitch and timbre spaces. For the pitch space, σ_{y_p} = e^{−2} for all mixture components, which is relatively small: each mixture component represents a pitch, and we do not expect a large variance over recordings that play the same pitch. For the timbre space, we let σ_{y_t} = e^{0} for all mixture components, which captures the timbre variation within each mixture component, i.e., instrument identity. The dimensionality of the latent space is L = 16, and the numbers of mixture components are M = 82 and K = 12, equal to the numbers of pitch and instrument classes, respectively. For all experiments, a batch size of 128 is used, model parameters are initialized with Xavier initialization, and models are trained using the Adam optimizer [23] with a learning rate of 10^{−4}.

In addition to the proposed model (M_GMVAE), we consider a baseline (M_VAE) that substitutes the timbre space with an isotropic Gaussian as in regular VAEs. Training such a model amounts to optimizing Eqn. (2) with the last two terms replaced by D_KL(q(z_t|X) || p(z_t)), where p(z_t) = N(0, I). The experimental results in Sections 4.1 and 4.2 show that M_GMVAE learns a more discriminative and disentangled timbre space than M_VAE.
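For reference, the input pipeline of Section 3.1 can be approximated with standard tools. The librosa-based sketch below converts the stated window and hop sizes from milliseconds to samples; the exact STFT settings, rounding, and normalization constants of the original code may differ.

```python
# Sketch of the input representation in Sec. 3.1; parameter conversions are assumptions,
# not the authors' exact preprocessing code.
import numpy as np
import librosa

SR = 22050
SEGMENT_S = 0.5                          # first 500 ms of each note
N_FFT = int(0.092 * SR)                  # ~92 ms Hann window
HOP = int(0.011 * SR)                    # ~11 ms hop (the paper reports T = 43 frames)
N_MELS = 256

def log_mel_segment(path):
    """Load a note, keep the first 500 ms, and return a log-scaled Mel-spectrogram."""
    y, _ = librosa.load(path, sr=SR)
    y = y[: int(SEGMENT_S * SR)]
    mel = librosa.feature.melspectrogram(
        y=y, sr=SR, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS,
        window="hann", power=2.0)        # power magnitude spectrum
    return np.log(mel + 1e-6)            # (n_mels, n_frames)

def normalize(spec, train_min, train_max):
    """Corpus-wide min-max scaling to [-1, 1]; train_min/train_max come from the training set."""
    return 2.0 * (spec - train_min) / (train_max - train_min) - 1.0
```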
³ Access to the dataset was requested from [8].

Figure 2. Timbre space visualization of M_VAE (top) and M_GMVAE (bottom). From left to right: models trained with 0, 25, 50, 75, or 100% of instrument labels, respectively.

N (%) | Instrument: CNN | M_VAE z_t / z_p | M_GMVAE z_t / z_p | Pitch: CNN | M_VAE z_t / z_p | M_GMVAE z_t / z_p
0     | -     | 0.960 / 0.163 | 0.937 / 0.175 | -     | 0.112 / 0.966 | 0.146 / 0.960
25    | 0.920 | 0.960 / 0.192 | 0.971 / 0.180 | -     | 0.169 / 0.966 | 0.084 / 0.977
50    | 0.983 | 0.971 / 0.169 | 0.988 / 0.186 | -     | 0.158 / 0.977 | 0.079 / 0.977
75    | 1.000 | 0.971 / 0.169 | 1.000 / 0.163 | -     | 0.079 / 0.971 | 0.045 / 0.977
100   | 1.000 | 0.937 / 0.158 | 1.000 / 0.197 | 0.983 | 0.039 / 0.983 | 0.028 / 0.966

Table 1. The F-scores of instrument and pitch prediction by linear classifiers and CNNs. N (%) refers to the percentage of instrument labels used to train the models. Columns z_t and z_p refer to the F-scores obtained using the learned timbre and pitch codes, respectively, to train the downstream linear classifier.

3.3 Semi-Supervised Learning

We exploit a moderate number of instrument labels to learn a timbre space in which the clusters clearly represent instrument identity. Similar to Kingma et al. [24], in the semi-supervised training of M_GMVAE we guide the inference of instrument labels q(y_t|X) by leveraging limited amounts of supervision. This is done by adding an additional loss term which measures the cross entropy between the inferred and true instrument labels. Because we do not infer y_t in M_VAE, we use z_t to train an auxiliary classifier to predict y_t; it has two 128-unit fully connected layers and is jointly optimized with M_VAE. We consider varying numbers of instrument labels, N = 0 (unsupervised), 25, 50, 75, and 100% (fully supervised) of the total number. We sample the labels randomly such that the label distribution matches the distribution of instruments.
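The label guidance in Section 3.3 amounts to a cross-entropy penalty on the inferred q(y_t|X) over the labelled subset. A minimal sketch follows, reusing the q_yt responsibilities from the ELBO sketch above; the function name, mask convention, and term weighting are assumptions rather than the authors' code.

```python
# Sketch of the semi-supervised term in Sec. 3.3; the weighting of the term is an assumption.
import torch
import torch.nn.functional as F

def instrument_label_loss(q_yt, y_inst, labelled_mask, weight=1.0):
    """Cross entropy between the inferred q(y_t|X) and the true instrument labels,
    applied only to the labelled fraction N of the batch.

    q_yt:          (batch, K) mixture responsibilities (approximating q(y_t|X))
    y_inst:        (batch,)   instrument indices; entries are ignored where unlabelled
    labelled_mask: (batch,)   1.0 for labelled samples, 0.0 otherwise
    """
    ce = F.nll_loss(torch.log(q_yt + 1e-8), y_inst, reduction="none")
    return weight * (ce * labelled_mask).sum() / labelled_mask.sum().clamp(min=1.0)

# Total loss for a batch: -elbo(x, y_pitch) + instrument_label_loss(q_yt, y_inst, mask)
```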
4. EXPERIMENTS AND RESULTS

The experiments and results are presented in this section. We first visualize the timbre space and quantitatively evaluate the disentangled representations. We then demonstrate the applications of controllable synthesis and many-to-many timbre transfer. Finally, we identify a particular latent dimension that is sensitive to the distribution of the spectral centroid, which allows for finer timbre control.

4.1 Visualization

Figure 2 visualizes the timbre space using t-distributed stochastic neighbor embedding (t-SNE) [34], a technique that projects vectors from a high- to a low-dimensional space. We first observe that M_GMVAE learns a Gaussian-mixture-distributed timbre space, with the means of the mixture components marked as crosses in the figure. Second, thanks to the pitch encoder, which accounts for pitch variation, both M_VAE and M_GMVAE are able to form clusters of instrument identity even without being trained with instrument labels (the leftmost column). We observe that the wind family (e.g., saxophone, clarinet and flute) forms an ambiguous cluster. This ambiguity remains in M_VAE even with increased N, while it is less present in the M_GMVAE latent space, owing to the multi-modal prior distribution. As we will confirm in Section 4.2, M_GMVAE outperforms M_VAE in learning a more discriminative and disentangled timbre space. Note that in M_GMVAE, p(y_t) is assumed to be uniformly distributed over the 12 instrument classes, i.e., the mixture components are equally weighted. As a result, instruments with larger within-class variances (e.g., bassoon and trumpet) are assigned to more than one cluster when N = 0. In future work we aim to improve the performance of the unsupervised clustering of instruments.

4.2 Pitch and Instrument Disentanglement

A disentangled pitch (timbre) representation should be discriminative for pitch (instrument identity), and at the same time non-informative of instrument identity (pitch). Therefore, we evaluate z_p and z_t by means of classification. We train linear classifiers, each a single fully connected layer, to map z_p and z_t to both pitch and instrument labels. For comparison, we train an end-to-end convolutional neural network (CNN), whose architecture is the same as the encoder and serves as a strong baseline, to map the original input Mel-spectrograms to either pitch or instrument labels.

Table 1 shows the results. The CNN achieves high F-scores on both instrument and pitch classification; note that N is the percentage of instrument labels used for supervision, and we always use all pitch labels to train the models, which is reasonable as we model isolated notes in this work.

Figure 3. The F-scores for predicting instrument (left) and pitch (right) labels from the synthesized spectrograms.

In instrument classification, using z_t as the feature representation outperforms z_p by a large margin, as expected. Specifically, in both models, the z_t learned with unsupervised learning (N = 0) is already discriminative enough to predict instruments with linear classifiers. While the F-score of M_GMVAE improves with increased N, that of M_VAE does not. Moreover, the linear classifier trained on z_t outperforms the CNN when N < 75. The timbre space of M_GMVAE displays the most discriminative power among the models. We attribute the F-scores of instrument classification attained by z_p to the fact that the piano covers all possible pitches in the dataset, while the other instruments cover a smaller pitch range; as a result, z_p of notes recorded only by the piano is correctly classified. Future work could decorrelate particular pitches and instruments by data augmentation and adversarial training as in [16]. In pitch classification, z_p outperforms z_t as expected, and both models achieve comparable results. More importantly, M_GMVAE performs better than M_VAE in terms of disentanglement, as its z_t yields lower F-scores when predicting pitch as N increases.

4.3 Controllable Synthesis of Instrument Sounds

As shown in Figure 2, M_GMVAE learns a timbre space p(z_t) whose mixture components are clearly interpretable as instrument identities when trained with moderate supervision. Meanwhile, the mixture components in p(z_p) represent pitch. As the Gaussian parameters are readily available after training, we can achieve controllable sound synthesis by sampling from p(z|y). To synthesize the target pitch y_m and instrument y_k, we first sample z_p ∼ N(μ_{y_m}, w · diag(σ_{y_m})) and z_t ∼ N(μ_{y_k}, w · diag(σ_{y_k})), where the multiplier w ∈ {0, 0.5, 1.0} serves to examine the effect of sampling latent variables that deviate from the modes. The decoder then synthesizes the Mel-spectrogram from [z_t, z_p]. For evaluation, the CNNs (trained on the original dataset) are used to test whether the synthesized spectrograms are still recognized as belonging to the desired instrument and pitch.
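Because the component means and the fixed standard deviations are available after training, the sampling step of Section 4.3 reduces to a few lines; the sketch below is illustrative (the function and argument names are ours), with w scaling the spread around the chosen components.

```python
# Sketch of controllable synthesis (Sec. 4.3); argument names are illustrative assumptions.
import torch

def synthesize(decoder, mu_pitch, mu_timbre, sigma_pitch, sigma_timbre,
               pitch_id, instrument_id, w=0.5):
    """Sample z_p and z_t around the chosen mixture components, scaled by w,
    and decode the concatenation [z_t, z_p] into a Mel-spectrogram."""
    z_p = mu_pitch[pitch_id] + w * sigma_pitch * torch.randn_like(mu_pitch[pitch_id])
    z_t = mu_timbre[instrument_id] + w * sigma_timbre * torch.randn_like(mu_timbre[instrument_id])
    return decoder(torch.cat([z_t, z_p], dim=-1))
```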
High F-scores therefore indicate high controllability of the model in sound synthesis. We use the sound samples in the validation set as the targets to synthesize, and repeat the sampling 30 times for each target.

The F-scores for pitch and instrument classification are reported in Figure 3. We first note that increasing w degrades classification performance. This is expected, as a sample synthesized from a latent variable far from the corresponding mixture-component mean deviates more from the intended instrument or pitch distribution. Moreover, the fact that the CNN was trained on the original samples but tested on the synthesized ones also contributes to the lower performance. Second, increasing N improves instrument classification performance. Finally, the high F-scores across all N when w ∈ {0, 0.5} indicate accurate and consistent synthesis of instrument sounds with the intended pitches and instruments, even with a timbre space trained using a limited number of instrument labels. This implies that M_GMVAE efficiently exploits the instrument labels and learns a discriminative mixture distribution of timbre, which is consistent with the visualization in Figure 2 (bottom row, N ≥ 25). We do not explore the timbre space resulting from unsupervised learning (N = 0) in this experiment, as the instrument identity of each mixture component is not directly available. We can, however, infer the instrument identity of each mixture component by sampling and synthesis, and expect reasonably good performance for controllable synthesis if the clustering of instruments shown in the bottom left of Figure 2 is improved. This will be explored in future work.

Figure 4. Many-to-many timbre transfer. The i-th sample of the Fhn is transferred to the Pno, with vector arithmetic in the (partially shown) timbre space.

4.4 Many-to-Many Transfer of Timbre

In this experiment, we demonstrate many-to-many transfer of instrument timbre. In Mor et al., a domain-specific decoder was trained for each target [28]. To achieve timbre transfer with a single encoder-decoder architecture, Bitton et al. proposed to use a conditional layer [31] which takes both instrument and pitch labels as inputs [3]. Our model, on the other hand, infers z_t and z_p, and only uses a single joint decoder. As illustrated in Figure 4, timbre transfer is achieved by decoding [z_transfer, z_p], i.e., transferring timbre while keeping pitch unchanged, where z_transfer = z_source + α μ_{source→target}, μ_{source→target} = μ_target − μ_source, and α ∈ [0, 1]. Once again, we rely on the trained CNNs from Table 1 for evaluation. More specifically, we examine the posterior shift in the instrument prediction of the CNN before and after transferring from source to target instruments, with α ∈ {0, 0.25, 0.5, 0.75, 1.0}. For simplicity, the most frequent instruments of the four families (i.e., French horn, piano, cello, and bassoon) are selected as representatives, and we perform timbre transfer using the samples in the validation set as the sources. For example, consider Fhn as the source and Pno as the target, as shown in Figure 4. We modify the timbre code as z^i_{Fhn→Pno} = z^i_{Fhn} + α μ_{Fhn→Pno}, where z^i_{Fhn} is the timbre code of the i-th Fhn sample, and i ∈ {1, 2, . . . , N_Fhn}.
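The transfer step itself is plain vector arithmetic between component means in the timbre space. A sketch follows; the function and argument names are ours, and using the posterior means as the codes is an assumption where the paper does not specify the sampling.

```python
# Sketch of many-to-many timbre transfer (Sec. 4.4); posterior means are used as codes,
# which is an assumption.
import torch

def transfer_timbre(decoder, pitch_enc, timbre_enc, x, mu_timbre,
                    source_id, target_id, alpha=0.5):
    """Shift the inferred timbre code toward the target instrument component
    while keeping the inferred pitch code unchanged."""
    z_t, _ = timbre_enc(x)                                   # posterior mean of q(z_t|X)
    z_p, _ = pitch_enc(x)                                    # posterior mean of q(z_p|X)
    z_transfer = z_t + alpha * (mu_timbre[target_id] - mu_timbre[source_id])
    return decoder(torch.cat([z_transfer, z_p], dim=-1))
```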
Figure 5. The averaged posterior (color) shift in instrument prediction of the CNN, caused by timbre transfer.

We decode as described earlier and report the averaged posterior (over N_Fhn) of the instrument prediction of the CNN. For simplicity, Figure 5 reports the results of the source-target pairs Fhn→Pno, Pno→Vc, Vc→Bn and Bn→Fhn. Each subfigure refers to a source-target pair and represents the averaged posterior shift of the instrument classification of the CNN with varying α. For all pairs, the biggest posterior shift (hence the prediction change) happens at α = 0.5. This also applies to the rest of the possible instrument pairs not shown in the figure. Meanwhile, using pitch classification, we examine whether the pitches are the same before and after timbre transfer, with the original pitch labels as ground truth. We find that, except when the source is piano, all source-target pairs attain a perfect F-score in terms of pitch. This confirms the ability of the model to successfully perform many-to-many timbre transfer.

A special case arises when piano is the source. The F-scores before transfer, and after transfer to French horn, to cello, and to bassoon, are 0.958, 0.750, 0.791, and 0.791, respectively. As described earlier in Section 4.2, the lower F-scores can be attributed to the fact that the range of the piano is much larger than that of the target instruments, so that the classifier fails on synthesized samples with unseen combinations of pitch and instrument. Another possible reason is that the model falls short in generalization. Nevertheless, this only happens in some cases when the source is piano; as demonstrated in Figure 6, the model is able to transfer Pno G6 to cello (the first row), which is an example of generalizing to an out-of-range pitch for the target instrument. In the first and third rows, the high-frequency components appear with increased α, and the energy is distributed over the segment without decay. The model, however, falls short in generalizing to the higher pitch, i.e., Pno C7 (the second row), where the energy remains focused at the onset and the high-frequency components are smeared. In the future, we could improve the generalizability of the model by performing data augmentation and adversarial training as in [16].

Figure 6. Examples of timbre transfer Pno → Vc. The top two rows are tones outside of the cello range.

4.5 Spectral Centroid Disentanglement

A diagonal-covariance Gaussian prior encourages the model to learn disentangled latent dimensions [14]. This applies to all mixture components in our model. In particular, we identify a latent dimension that correlates with the spectral centroid: we modify the 13th dimension of z_t, z^13_t, of each sound sample in the validation set by ±2σ_{y_k}, where σ_{y_k} = e^0 for all instruments, and then synthesize the spectrograms, for which we calculate the spectral centroid. Figure 7 shows the distributions of the spectral centroid before and after the modifications. A two-tailed t-test indicates significant differences (p < 0.05) between −2σ_{y_k} and +2σ_{y_k} for all instruments. As demonstrated in Figure 8, we observe that increasing z^13_t reduces the energy of the high-frequency components and results in lower spectral centroid values. In future research, we will further investigate disentangling specific acoustic features for finer control of sound synthesis beyond pitch and instrument.

Figure 7. Spectral centroid values in response to z^13_t.

Figure 8. Latent dimension traversal of z^13_t.
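The probe in Section 4.5 is a latent traversal: shift one timbre dimension, re-synthesize, and measure the spectral centroid. A sketch follows; the decoder output convention and computing the centroid directly on the Mel bins are simplifying assumptions, since the paper does not spell out these steps.

```python
# Sketch of the latent traversal in Sec. 4.5; the decoder is assumed to return a
# (n_mels, n_frames) spectrogram in [-1, 1], and the centroid is computed on Mel bins,
# both of which are simplifying assumptions.
import numpy as np
import torch
import librosa

def centroid_after_traversal(decoder, z_t, z_p, dim=13, sigma=1.0, sr=22050):
    """Shift one timbre dimension by -2*sigma, 0, +2*sigma, decode, and return
    the mean spectral centroid of each synthesized spectrogram."""
    centroids = {}
    for shift in (-2.0 * sigma, 0.0, 2.0 * sigma):
        z_mod = z_t.clone()
        z_mod[..., dim] = z_mod[..., dim] + shift
        mel = decoder(torch.cat([z_mod, z_p], dim=-1))
        mag = (mel.detach().cpu().numpy() + 1.0) / 2.0       # back to non-negative magnitudes
        centroids[shift] = float(np.mean(librosa.feature.spectral_centroid(S=mag, sr=sr)))
    return centroids
```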
5. CONCLUSIONS AND FUTURE WORK

We have proposed a framework based on GMVAEs to learn disentangled timbre and pitch representations of musical instrument sounds, which is verified by our experiments. We demonstrate its applicability to controllable sound synthesis and many-to-many timbre transfer. In future work, we plan to conduct listening tests for a more comprehensive evaluation of the applications, and to further disentangle both low-level (e.g., acoustic features) and high-level (e.g., playing techniques) sound attributes, enabling finer control of synthesized timbres. By combining supervised and unsupervised learning in a deep generative model, the framework can easily be adapted to learn interpretable mixtures such as singer identity, music style, emotion, etc., which facilitates music representation learning and creative applications.

6. ACKNOWLEDGMENTS

We would like to thank the anonymous reviewers for their constructive reviews. This work is supported by a Singapore International Graduate Award (SINGA) provided by the Agency for Science, Technology and Research (A*STAR), under reference number SING-2018-01-1270.

7. REFERENCES

[1] G. Ballet, R. Borghesi, P. Hoffmann, and F. Levy. Studio online 3.0: An internet "killer application" for remote access to IRCAM sounds and processing tools. Journee Informatique Musicale, 1999.

[2] Y. Bengio. Deep learning of representations: Looking forward. In International Conference on Statistical Language and Speech Processing, pages 1–37. Springer, 2013.

[3] A. Bitton, P. Esling, and A. Chemla-Romeu-Santos. Modulated variational auto-encoders for many-to-many musical timbre transfer. arXiv preprint arXiv:1810.00222, 2018.

[4] G. Brunner, A. Konrad, Y. Wang, and R. Wattenhofer. MIDI-VAE: Modeling dynamics and instrumentation of music with applications to style transfer. In Proc. of the International Society for Music Information Retrieval Conference, pages 23–27, 2018.

[5] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.

[6] N. Dilokthanakul, P. A. M. Mediano, M. Garnelo, M. C.-H. Lee, H. Salimbeni, K. Arulkumaran, and M. Shanahan. Deep unsupervised clustering with Gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648, 2016.

[7] J. Engel, C. Resnick, A. Roberts, S. Dieleman, M. Norouzi, D. Eck, and K. Simonyan. Neural audio synthesis of musical notes with WaveNet autoencoders. In Proc. of the International Conference on Machine Learning, pages 1068–1077, 2017.

[8] P. Esling and A. Bitton. Bridging audio analysis, perception and synthesis with perceptually-regularized variational timbre spaces. In Proc. of the International Society for Music Information Retrieval Conference, 2018.

[9] P. Esling, A. Chemla-Romeu-Santos, and A. Bitton. Generative timbre spaces with variational audio synthesis. In Proc. of the International Conference on Digital Audio Effects, 2018.

[10] Z. Fu, X. Tan, N. Peng, D. Zhao, and R. Yan. Style transfer in text: Exploration and evaluation. In AAAI, 2017.

[11] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AAAI, pages 249–256, 2010.

[12] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 2014.

[14] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, M. Shakir, and A. Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.

[15] W.-N. Hsu, Y. Zhang, and J. Glass. Unsupervised learning of disentangled and interpretable representations from sequential data. In Advances in Neural Information Processing Systems, pages 1878–1889, 2017.

[16] W.-N. Hsu, Y. Zhang, R. J. Weiss, Y.-A. Chung, Y. Wang, Y. Wu, and J. Glass. Disentangling correlated speaker and noise for speech synthesis via data augmentation and adversarial factorization. In Advances in Neural Information Processing Systems, 2018.

[17] W.-N. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Wang, Y. Cao, Y. Jia, Z. Chen, J. Shen, P. Nguyen, and R. Pang. Hierarchical generative modeling for controllable speech synthesis. In International Conference on Learning Representations, 2019.

[18] Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing. Toward controlled generation of text. In International Conference on Machine Learning, pages 1587–1596, 2017.

[19] Z. Hu, Z. Yang, R. Salakhutdinov, and E. P. Xing. On unifying deep generative models. arXiv preprint arXiv:1706.00550, 2017.

[20] S. Huang, Q. Li, C. Anil, X. Bao, S. Oore, and R. B. Grosse. TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) pipeline for musical timbre transfer. arXiv preprint arXiv:1811.09620, 2018.

[21] Y.-N. Hung, Y.-A. Chen, and Y.-H. Yang. Learning disentangled representations for timbre and pitch in music audio. arXiv preprint arXiv:1811.03271, 2018.

[22] Z. Jiang, Y. Zheng, H. Tan, B. Tang, and H. Zhou. Variational deep embedding: An unsupervised and generative approach to clustering. In International Joint Conference on Artificial Intelligence, 2017.

[23] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint, 2014.

[24] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.

[25] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. International Conference on Learning Representations, 2014.

[26] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2530–2538, 2015.

[27] A. Kumar, P. Sattigeri, and A. Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. International Conference on Learning Representations, 2018.

[28] N. Mor, L. Wolf, A. Polyak, and Y. Taigman. A universal music translation network. arXiv preprint arXiv:1805.07848, 2018.

[29] M. Muller, D. P. W. Ellis, A. Klapuri, and G. Richard. Signal processing for music analysis. IEEE Journal of Selected Topics in Signal Processing, 5(6):1088–1110, 2011.

[30] M. Muller and S. Ewert. Towards timbre-invariant audio features for harmony-based music. IEEE Transactions on Audio, Speech, and Language Processing, 18(3):649–662, 2010.
[31] E. Perez, F. Strub, H. D. Vries, V. Dumoulin, and A. Courville. FiLM: Visual reasoning with a general conditioning layer. In AAAI, 2018.

[32] K. Ridgeway. A survey of inductive biases for factorial representation-learning. arXiv preprint arXiv:1612.05299, 2016.

[33] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

[34] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[35] Y. Wang, D. Stanton, Y. Zhang, RJ Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, F. Ren, Y. Jia, and R. A. Saurous. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. arXiv preprint arXiv:1803.09017, 2018.

[36] H. Zhou, M. Huang, T. Zhang, X. Zhi, and B. Liu. Emotional chatting machine: Emotional conversation generation with internal and external memory. In AAAI, 2017.