MIDI-VAE: MODELING DYNAMICS AND INSTRUMENTATION OF MUSIC WITH APPLICATIONS TO STYLE TRANSFER

Gino Brunner, Andres Konrad, Yuyi Wang, Roger Wattenhofer
Department of Electrical Engineering and Information Technology, ETH Zurich, Switzerland
{brunnegi,konradan,yuwang,wattenhofer}@ethz.ch

ABSTRACT

We introduce MIDI-VAE, a neural network model based on Variational Autoencoders that is capable of handling polyphonic music with multiple instrument tracks, as well as modeling the dynamics of music by incorporating note durations and velocities. We show that MIDI-VAE can perform style transfer on symbolic music by automatically changing pitches, dynamics and instruments of a music piece from, e.g., a Classical to a Jazz style. We evaluate the efficacy of the style transfer by training separate style validation classifiers. Our model can also interpolate between short pieces of music, produce medleys and create mixtures of entire songs. The interpolations smoothly change pitches, dynamics and instrumentation to create a harmonic bridge between two music pieces. To the best of our knowledge, this work represents the first successful attempt at applying neural style transfer to complete musical compositions.

1. INTRODUCTION

Deep generative models do not just allow us to generate new data, but also to change properties of existing data in principled ways, and even transfer properties between data samples. Have you ever wanted to be able to create paintings like Van Gogh or Monet? No problem! Just take a picture with your phone, run it through a neural network, and out comes your personal masterpiece. Being able to generate new data samples and perform style transfer requires models to obtain a deep understanding of the data.
Thus, advancing the state-of-the-art in deep generative models and neural style transfer is not just important for transforming horses into zebras,¹ but lies at the very core of Deep (Representation) Learning research [2]. While neural style transfer has produced astonishing results, especially in the visual domain [21, 37], the progress for sequential data, and in particular music, has been slower. We can already transfer sentiment between restaurant reviews [30, 36], or even change the instrument with which a melody is played [33], but we have no way of knowing how our favorite pop song would have sounded if it were written by a composer who lived in the classical epoch, or how a group of jazz musicians would play the Overture of Mozart's Don Giovanni. In this work we take a step towards this ambitious goal. To the best of our knowledge, this paper presents the first successful application of unaligned style transfer to musical compositions. Our proposed model architecture consists of parallel Variational Autoencoders (VAE) with a shared latent space and an additional style classifier. The style classifier forces the model to encode style information in the shared latent space, which then allows us to manipulate existing songs and effectively change their style, e.g., from Classic to Jazz. Our model is capable of producing harmonic polyphonic music with multiple instruments. It also learns the dynamics of music by incorporating note durations and velocities.

¹ https://junyanz.github.io/CycleGAN/

© Gino Brunner, Andres Konrad, Yuyi Wang, Roger Wattenhofer. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Gino Brunner, Andres Konrad, Yuyi Wang, Roger Wattenhofer. "MIDI-VAE: Modeling Dynamics and Instrumentation of Music with Applications to Style Transfer", 19th International Society for Music Information Retrieval Conference, Paris, France, 2018.
2. RELATED WORK

Gatys et al. [14] introduce the concept of neural style transfer and show that pre-trained CNNs can be used to merge the style and content of two images. Since then, more powerful approaches have been developed [21, 37]; these allow, for example, rendering an image taken in summer to look like it was shot in winter. For sequential data, autoencoder-based methods [30, 36] have been proposed to change the sentiment or content of sentences. Van den Oord et al. [33] introduce a VAE model with a discrete latent space that is able to perform speaker voice transfer on raw audio data. Mor et al. [26] develop a system based on a WaveNet autoencoder [12] that can translate music across instruments, genres and styles, and even create music from whistling. Malik et al. [23] train a model to add note velocities (loudness) to sheet music, resulting in more realistic sounding playback. Their model is trained in a supervised manner, with the target being a human-like performance of a music piece in MIDI format, and the input being the same piece but with all note velocities set to the same value. While their model can indeed play music in a more human-like manner, it can only change note velocities, and does not learn the characteristics of different musical styles/genres. Our model is trained on unaligned songs from different musical styles. It can not only change the dynamics of a music piece from one style to another, but also automatically adapt the instrumentation and even the note pitches themselves. Apart from style transfer, our model can also generate short pieces of music, medleys, interpolations and song mixtures. At the core of our model thus lies the capability to produce music. In the following we will therefore discuss related work in the domains of symbolic and raw audio generation. For a more comprehensive overview we refer the interested readers to these surveys: [4, 13, 16].
People have been trying to compose music with the help of computers for decades. One of the most famous early examples is "Experiments in Musical Intelligence" [9], a semi-automatic system based on Markov models that is able to create music in the style of a certain composer. Soon after, the first attempts at music composition with artificial neural networks were made. Most notably, Todd [31], Mozer [27] and Eck et al. [11] all used Recurrent Neural Networks (RNN). More recently, Boulanger-Lewandowski et al. [3] combined long short-term memory networks (LSTMs) and Restricted Boltzmann Machines to simultaneously model the temporal structure of music, as well as the harmony between notes that are played at the same time, thus being capable of generating polyphonic music. Chu et al. [7] use domain knowledge to model a hierarchical RNN architecture that produces multi-track polyphonic music. Brunner et al. [5] combine a hierarchical LSTM model with learned chord embeddings that form the Circle of Fifths, showing that even simple LSTMs are capable of learning music theory concepts from data. Hadjeres et al. [15] introduce an LSTM-based system that can harmonize melodies by composing accompanying voices in the style of Bach Chorales, which is considered a very difficult task even for professionals. Johnson et al. [18] use parallel LSTMs with shared weights to achieve transposition-invariance (similar to the translation-invariance of CNNs). Chuan et al. [8] investigate the use of an image-based Tonnetz representation of music, and apply a hybrid LSTM/CNN model to music generation. Generative models such as the Variational Autoencoder (VAE) and Generative Adversarial Networks (GANs) have been increasingly successful at modeling music. Roberts et al. introduce MusicVAE [29], a hierarchical VAE model that can capture long-term structure in polyphonic music and exhibits high interpolation and reconstruction performance.
GANs, while very powerful, are notoriously difficult to train and have generally not been applied to sequential data. However, Mogren [25], Yang et al. [34] and Dong et al. [10] have recently shown the efficacy of CNN-based GANs for music composition. Yu et al. [35] were the first to successfully apply RNN-based GANs to music by incorporating reinforcement learning techniques. Researchers have also worked on generating raw audio waves. Van den Oord et al. [32] introduce WaveNet, a CNN-based model for the conditional generation of speech. The authors also show that it can be used to generate pleasing sounding piano music. More recently, Engel et al. [12] incorporated WaveNet into an autoencoder structure to generate musical notes and different instrument sounds. Mehri et al. [24] developed SampleRNN, an RNN-based model for unconditional generation of raw audio. While these models are impressive, the domain of raw audio is very high dimensional, and it is much more difficult to generate pleasing sounding music. Thus most existing work on music generation uses symbolic music representations (see, e.g., [3, 5, 7–10, 15, 18, 23, 25, 27, 29, 31, 34, 35]).

3. MODEL ARCHITECTURE

Our model is based on the Variational Autoencoder [20] (VAE) and operates on a symbolic music representation that is extracted from MIDI [1] files. We extend the standard piano roll representation of note pitches with velocity and instrument rolls, modeling the most important information contained in MIDI files. Thus, we term our model MIDI-VAE. MIDI-VAE uses separate recurrent encoder/decoder pairs that share a latent space. A style classifier is attached to parts of the latent space to make sure the encoder learns a compact latent style label that we can then use to perform style transfer. The architecture of MIDI-VAE is shown in Figure 1, and will be explained in more detail in the following.
3.1 Symbolic Music Representation

We use music files in the MIDI format, which is a symbolic representation of music that resembles sheet music. MIDI files have multiple tracks. At each time step, a track can either play a note with a certain pitch and velocity, hold a note over multiple time steps, or be silent. Additionally, an instrument is assigned to each track. To feed the note pitches into the model we represent them as a tensor P ∈ {0, 1}^(n_P × n_B × n_T) (commonly known as piano roll and henceforth referred to as pitch roll), where n_P is the number of possible pitch values, n_B is the number of beats and n_T is the number of tracks. Thus, each song in the dataset is split into pieces of length n_B. We choose n_B such that each piece corresponds to one bar. We include a "silent" note pitch to indicate when no note is played at a time step. The note velocities are encoded as a tensor V ∈ [0, 1]^(n_P × n_B × n_T) (velocity roll). Velocity values between 0.5 and 1 signify a note being played for the first time, whereas a value below 0.5 means that either no note is being played, or that the note from the last time step is being held. The note velocity range defined by MIDI (0 to 127) is mapped to the interval [0.5, 1]. We model the assignment of instruments to tracks as a matrix I ∈ {0, 1}^(n_T × n_I) (instrument roll), where n_I is the number of possible instruments. The instrument assignment is a global property and thus remains constant over the duration of one song. Finally, each song in our dataset belongs to a certain style, designated by the style label S ∈ {Classic, Jazz, Pop, Bach, Mozart}.
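To make the representation concrete, the following is a minimal sketch, not the authors' code, of how one bar of note-onset events could be turned into the pitch and velocity rolls described above, including the mapping of MIDI velocities into [0.5, 1]. The constants and the event format are illustrative assumptions; held notes (values below 0.5 in the velocity roll) are not handled here.

```python
import numpy as np

N_PITCHES = 61   # 60 usable pitch values plus one artificial "silent" symbol
N_BEATS = 16     # one bar of 16th notes in 4/4
N_TRACKS = 4
SILENT = 60      # index of the "silent" pitch
LOW_PITCH = 24   # lowest MIDI pitch kept (C1)

def make_rolls(notes):
    """Build pitch and velocity rolls from (track, beat, midi_pitch, midi_velocity)
    onset events. The pitch roll is one-hot over pitches per (beat, track); the
    velocity roll maps MIDI velocity 0..127 into [0.5, 1] for onsets."""
    pitch = np.zeros((N_PITCHES, N_BEATS, N_TRACKS), dtype=np.int8)
    vel = np.zeros((N_PITCHES, N_BEATS, N_TRACKS), dtype=np.float32)
    pitch[SILENT, :, :] = 1                         # default: silence everywhere
    for track, beat, p, v in notes:
        idx = p - LOW_PITCH                         # shift MIDI pitch into 0..59
        pitch[SILENT, beat, track] = 0
        pitch[idx, beat, track] = 1
        vel[idx, beat, track] = 0.5 + 0.5 * v / 127.0   # onset values lie in [0.5, 1]
    return pitch, vel
```

In practice the note events themselves would come from a parsed MIDI file, e.g. via the pretty_midi library mentioned below.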
In order to generate harmonic polyphonic music it is important to model the joint probability of simultaneously played notes. A standard recurrent neural network model already models the joint distribution of the sequence through time. If there are multiple outputs to be produced per time step, a common approach is to sample each output independently. In the case of polyphonic music, this can lead to dissonant and generally "wrong" sounding note combinations. However, by unrolling the piano rolls in time we can let the RNN learn the joint distribution of simultaneous notes as well. Basically, instead of one n_T-hot vector for each beat, we input n_T 1-hot vectors per beat to the RNN. This is a simple but effective way of modeling the joint distribution of notes. The drawback is that the RNN needs to model longer sequences. We use the pretty_midi [28] Python library to extract information from MIDI files and convert them to piano rolls.

[Figure 1: MIDI-VAE architecture. The pitch, instrument and velocity rolls are encoded by separate GRUs into a shared latent space z (with style dimensions z_style and a style classifier attached), from which parallel GRU decoders reconstruct the three rolls. GRU stands for Gated Recurrent Unit [6].]

3.2 Parallel VAE with Shared Latent Space

MIDI-VAE is based on the standard VAE [20] with a hyperparameter β to weigh the Kullback-Leibler divergence in the loss function (as in [17]). A VAE consists of an encoder q_θ(z|x), a decoder p_φ(x|z) and a latent variable z, where q and p are usually implemented as neural networks parameterized by θ and φ. In addition to minimizing the standard autoencoder reconstruction loss, VAEs also impose a prior distribution p(z) on the latent variables. Having a known prior distribution enables generation of new latent vectors by sampling from that distribution.
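The beat-wise unrolling described in Section 3.1 — feeding n_T consecutive 1-hot vectors per beat instead of one n_T-hot vector — can be sketched as follows. This is an illustrative helper, assuming a binary pitch roll of shape (n_P, n_B, n_T):

```python
import numpy as np

def unroll(pitch_roll):
    """Unroll a (n_pitches, n_beats, n_tracks) pitch roll in time.

    Instead of one n_tracks-hot vector per beat, the RNN sees n_tracks
    consecutive 1-hot vectors per beat, so simultaneously played notes
    become adjacent sequence steps and their joint distribution can be
    modeled autoregressively. The sequence becomes n_tracks times longer.
    """
    n_p, n_b, n_t = pitch_roll.shape
    # (beats, tracks, pitches) -> flatten beat and track into one time axis
    return pitch_roll.transpose(1, 2, 0).reshape(n_b * n_t, n_p)
```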
Furthermore, the model will only "use" a new dimension, i.e., deviate from the prior distribution, if it significantly lowers the reconstruction error. This encourages disentanglement of latent dimensions and helps learning a compact hidden representation. The VAE loss function is

L_VAE = E_{q_θ(z|x)}[log p_φ(x|z)] − β D_KL(q_θ(z|x) || p(z)),

where the first term corresponds to the reconstruction loss, and the second term forces the distribution of latent variables to be close to a chosen prior. D_KL is the Kullback-Leibler divergence, which gives a measure of how similar two probability distributions are. As is common practice, we use an isotropic Gaussian distribution with unit variance as our prior, i.e., p(z) = N(0, I). Thus, both q_θ(z|x) and p(z) are (isotropic) Gaussian distributions and the KL divergence can be computed in closed form. As described in Section 3.1, we represent multi-track music as a combination of note pitches, note velocities and an assignment of instruments to tracks. In order to generate harmonic multi-track music, we need to model a joint distribution of these input features instead of three marginal distributions. Thus, our model consists of three encoder/decoder pairs with a shared latent space that captures the joint distribution. For each input sample (i.e., a piece of length n_B beats), the pitch, velocity and instrument rolls are passed through their respective encoders, implemented as RNNs. The output of the three encoders is concatenated and passed through several fully connected layers, which then predict σ_z and µ_z, the parameters of the approximate posterior q_θ(z|x) = N(µ_z, σ_z).² Using the reparameterization trick [20], a latent vector z is sampled from this distribution as z = µ_z + σ_z ∗ ε, where ∗ stands for element-wise multiplication.
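A sketch of this sampling step and of the closed-form Gaussian KL term, written in NumPy for illustration rather than taken from the authors' implementation. For simplicity, sigma is treated here as a standard-deviation vector, and eps_sigma stands in for the hyperparameter that scales the noise ε (discussed in Section 4.2):

```python
import numpy as np

def kl_gaussian(mu, sigma):
    """Closed-form D_KL( N(mu, diag(sigma^2)) || N(0, I) ),
    with sigma taken to be the per-dimension standard deviation."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2))

def reparameterize(mu, sigma, eps_sigma=1.0, rng=None):
    """Sample z = mu + sigma * eps with eps ~ N(0, eps_sigma^2 I).

    Writing the sample as a deterministic function of (mu, sigma) plus
    external noise is what makes the sampling step differentiable
    with respect to the encoder outputs."""
    rng = np.random.default_rng(rng)
    eps = rng.normal(0.0, eps_sigma, size=mu.shape)
    return mu + sigma * eps
```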
The reparameterization is necessary because it is generally not possible to backpropagate gradients through a random sampling operation, since it is not differentiable. ε is sampled from an isotropic Gaussian distribution N(0, σ ∗ I), where we treat σ as a hyperparameter (see Section 4.2 for more details). This shared latent vector is then fed into three parallel fully connected layers, from which the three decoders try to reconstruct the pitch, velocity and instrument rolls. The note pitch and instrument decoders are trained with cross entropy losses, whereas for the velocity decoder we use MSE.

² We use notation σ for both a variance vector and the corresponding diagonal variance matrix.

3.3 Style Classifier

Having a disentangled latent space might enable some control over the style of a song. If for example one dimension in the latent space encodes the dynamics of the music, then we could easily change an existing piece by only varying this dimension. Choosing a high value for β (the weight of the KL term in the VAE loss function) has been shown to increase disentanglement of the latent space in the visual domain [17]. However, increasing β has a negative effect on the reconstruction performance. Therefore, we introduce additional structure into the latent space by attaching a softmax style classifier to the top k dimensions of the latent space (z_style), where k equals the number of different styles in our dataset. This forces the encoder to write a "latent style label" into the latent space. Using only k dimensions and a weak classifier encourages the encoder to learn a compact encoding of the style.

Table 1. Properties of our dataset.

Dataset   #Songs   #Bars   Artists
Classic   477      60523   Beethoven, Clementi, ...
Jazz      554      72190   Sinatra, Coltrane, ...
Pop       659      65697   ABBA, Bruno Mars, ...
Bach      156      16213   Bach
Mozart    143      17198   Mozart
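A minimal sketch of this idea: a softmax classifier that only sees the top k latent dimensions, together with the dimension swap that Section 3.3 uses for style transfer. The weights W and b are illustrative placeholders, not trained parameters from the paper:

```python
import numpy as np

K = 2  # number of styles per model; the experiments in Section 5 use two

def style_probs(z, W, b):
    """Softmax style classifier attached to z_style = z[:K] only,
    which pushes the encoder to write a compact 'latent style label'
    into those dimensions. W has shape (K, n_styles), b shape (n_styles,)."""
    logits = z[:K] @ W + b
    e = np.exp(logits - logits.max())   # subtract max for numerical stability
    return e / e.sum()

def swap_style(z, i, j):
    """Style transfer from S_i to S_j: swap the values of the two
    style dimensions of the latent vector before decoding."""
    z = z.copy()
    z[i], z[j] = z[j], z[i]
    return z
```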
In order to change a song's style from S_i to S_j, we pass the song through the encoder to get z, swap the values of dimensions z_style^i and z_style^j, and pass the modified latent vector through the decoder. As style we choose the music genre (e.g., Jazz, Pop or Classic) or individual composers (Bach or Mozart).

3.4 Full Loss Function

Putting all parts together, we get the full loss function of our model as

L_tot = λ_P H(P, P̂) + λ_I H(I, Î) + λ_V MSE(V, V̂) + λ_S H(S, Ŝ) − β D_KL(q || p),   (1)

where H(·,·), MSE(·,·) and D_KL(·||·) stand for cross entropy, mean squared error and KL divergence respectively. The hats denote the predicted/reconstructed values. The weights λ and β can be used to balance the individual terms of the loss function.

4. IMPLEMENTATION

In this section we describe our dataset and pre-processing steps. We also give some insight into the training of our model and justification for hyperparameter choices.

4.1 Dataset and Pre-Processing

Our dataset contains songs from the genres Classic, Jazz and Pop. The songs were gathered from various online sources;³ a summary of the properties is shown in Table 1. Note that we excluded symphonies from our Classic, Bach and Mozart datasets due to their complexity and high number of simultaneously playing instruments. We use a train/test split of 90/10. Each song in the dataset can contain multiple instrument tracks, and each track can have multiple notes played at the same time. Unless stated otherwise, we select n_T = 4 instrument tracks from each song by first picking the tracks with the highest number of played notes, and from each track we choose the highest voice, meaning picking the highest notes per time step.

³ Pop: www.midiworld.com / Jazz: http://midkar.com/jazz/jazz_01.html / Classic (including Bach, Mozart): www.reddit.com/r/WeAreTheMusicMakers/comments/3ajwe4/
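The highest-voice selection just described can be sketched as follows, on a single track's binary piano roll. This is illustrative NumPy, not the authors' preprocessing pipeline:

```python
import numpy as np

def highest_voice(track_roll):
    """Reduce a polyphonic track to its highest voice.

    track_roll: (n_pitches, n_steps) binary piano roll for one track.
    Returns a roll of the same shape keeping only the highest sounding
    pitch per time step; silent steps stay silent.
    """
    out = np.zeros_like(track_roll)
    for t in range(track_roll.shape[1]):
        sounding = np.nonzero(track_roll[:, t])[0]
        if sounding.size:
            out[sounding.max(), t] = 1
    return out
```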
If a song has fewer than n_T instrument tracks, we pick additional voices from the tracks until we have n_T voices in total. We exclude drum tracks, since they do not have a pitch value. We choose the 16th note as the smallest unit. In the most widely used time signature, 4/4, there are 16 16th notes in a bar. 91% of Jazz and Pop songs in our dataset are in 4/4, whereas for Classic the fraction is 34%. For songs with time signatures other than 4/4 we still designate 16 16th notes as one bar. All songs are split into samples of one bar and our model auto-encodes one sample at a time. During training we shuffle the songs for each epoch, but keep the bars of a song in the correct order and do not reset the RNN states between samples. Thus, our model is trained on a proper sequence of bars, instead of being confused by random bar progressions. There are 128 possible pitches in MIDI. Since very low and high pitches are rare and often do not sound pleasing, we only use n_P = 60 pitch values ranging from 24 (C1) to 84 (C6).

4.2 Model (Hyper-)Parameters

Our model is generally not sensitive to most hyperparameters. Nevertheless, we continuously performed local hyperparameter searches based on good baseline models, only varying one hyperparameter at a time. We use the reconstruction accuracy of the pitch roll decoder as evaluation metric. Using Gated Recurrent Units (GRUs) [6] instead of LSTMs increases performance significantly. Using bidirectional GRUs did not improve the results. The pitch roll encoder/decoder uses two GRU layers, whereas the rest uses only one layer. All GRU state sizes as well as the size of the latent space z are set to 256. We use the ADAM optimizer [19] with an initial learning rate of 0.0002. For most layers in our architecture, we found tanh to work better than sigmoid or rectified linear units. We train on batches of size 256.
The loss function weights λ_P, λ_I, λ_V and λ_S were set to 1.0, 0.1, 1.0 and 0.1 respectively. λ_P was set to 1.0 to favor high quality note pitch reconstructions over the rest. λ_V was also set to 1.0 because the MSE magnitude is much smaller than the cross entropy loss values. During our experiments, we realized that high values of β generally lead to very poor performance. We further found that setting the variance of ε to σ = 1, as done in all previous work using VAEs, also has a negative effect. Therefore we decided to treat σ as a hyperparameter as well. Figure 2 shows the results of the grid search. σ is the variance of the distribution from which the values for the reparameterization trick are sampled, and is thus usually set to the same value as the variance of the prior. However, especially at the beginning of learning, this introduces a lot of noise that the decoder needs to handle, since the values for µ_z and σ_z, output by the encoder, are small compared to ε. We found that by reducing σ, we can improve the performance of our model significantly, while being able to use higher values for β. An annealing strategy for both β and σ might produce better results, but we did not test this. In the final models we use β = 0.1 and σ = 0.01.

[Figure 2: Test reconstruction accuracy of the pitch roll for different β (0.0001 to 10.0) and σ (10⁻³ to 10⁰).]

Table 2. Train and test performance of our final models. The velocity column shows MSE loss values, whereas the rest are accuracies.

      Pitch          Instrument     Style          Velocity
      Train   Test   Train   Test   Train   Test   Train   Test
CvJ   0.90    0.85   0.99    0.87   0.98    0.92   0.008   0.029
CvP   0.96    0.88   0.99    0.89   0.96    0.91   0.017   0.036
JvP   0.88    0.80   0.99    0.86   0.94    0.69   0.043   0.048
BvM   0.91    0.75   0.99    0.82   0.94    0.74   0.010   0.033
Note that during generation we sample z from N(0, σ̂_z), where σ̂_z is the empirical variance obtained by feeding the entire training dataset through the encoder. The empirical mean µ̂_z is very close to zero.

4.3 Training

All models are trained on single GPUs (GTX 1080) until the pitch roll decoder converges. This corresponds to around 400 epochs, or 48 hours. We train one model for each genre/composer pair to make learning easier. This results in four models that we henceforth call CvJ (trained on Classic and Jazz), CvP (Classic and Pop), JvP (Jazz and Pop) and BvM (Bach and Mozart). The train/test accuracies/losses of all final models are shown in Table 2. The columns correspond to the terms in our model's full loss function (Equation 1).

5. EXPERIMENTAL RESULTS

In this section we evaluate the capabilities of MIDI-VAE. Wherever mentioned, corresponding audio samples can be found on YouTube.⁴

5.1 Style Transfer

To evaluate the effectiveness of MIDI-VAE's style transfer, we train three separate style evaluation classifiers. The input features are the pitch, velocity and instrument rolls respectively. The three style classifiers are also combined to output a voting-based ensemble prediction. The accuracy of the classifiers is computed as the fraction of correctly predicted styles per bar in a song. We predict the likelihood of the source style before and after the style change. If the style transfer works, the predicted likelihood of the source style decreases. The larger the difference, the stronger the effect of the style transfer.

⁴ https://goo.gl/vb8Yrh

Table 3. Style transfer performance (ensemble classifier accuracies before and after) between all style pairs.

      Train Songs               Test Songs
      Before   After   Diff.    Before   After   Diff.
CvJ   0.92     0.38    0.54     0.87     0.39    0.48
CvP   0.94     0.43    0.51     0.92     0.45    0.47
JvP   0.72     0.60    0.12     0.72     0.62    0.10
BvM   0.77     0.45    0.32     0.66     0.47    0.19
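The voting-based ensemble prediction described above can be sketched as a simple per-bar majority vote over the three classifiers' outputs. This is an illustration of the idea, not the authors' exact aggregation:

```python
import numpy as np

def ensemble_vote(pitch_pred, vel_pred, inst_pred):
    """Majority vote over the three per-bar style predictions
    (pitch-, velocity- and instrument-based classifiers).
    Inputs are integer arrays of predicted style indices, one per bar."""
    votes = np.stack([pitch_pred, vel_pred, inst_pred])
    # per bar, pick the most frequently predicted style index
    return np.array([np.bincount(votes[:, b]).argmax()
                     for b in range(votes.shape[1])])
```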
Note that for all experiments presented in this paper we set the number of styles k = 2; that is, one MIDI-VAE model is trained on two styles, e.g., Classic vs. Jazz. Therefore, the style classifier is binary, and a reduction in probability of the source style is equivalent to an increase in probability of the target style of the same magnitude. All style classifiers use two-layer GRUs with a state size of 256. Table 3 shows the performance of MIDI-VAE's style transfer when measured by the ensemble style classifier. We trained a separate MIDI-VAE for each style pair. For each pair of styles we perform a style change on all songs in both directions and average the results. The style transfer works for all models, albeit to varying degrees. In all cases except for JvP, the predictor is even skewed below 0.5, meaning that the target style is now considered more likely than the source style.

Table 4 shows the style transfer results measured by each individual style classifier. We can see that pitch and velocity contribute equally to the style change, whereas instrumentation seems to correlate most with the style. For CvJ and CvP, switching the style heavily changes the instrumentation. Figure 3 illustrates how the instruments of all songs in our Jazz test set are changed when switching the style to Classic. Only few instruments are rarely changed (piano, ensemble, reed), whereas most others are mapped to one or multiple different instruments. The instrument switch between genres with highly overlapping instrumentation (JvP, BvM) is much less pronounced. Classifying style based on the note pitches and velocities of one bar is more difficult, as shown by the "before" accuracies in Table 4, which are generally lower than those of the instrument roll based classifier. Nevertheless, the style transfer changes pitch and velocity towards the target style.
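The before/after evaluation protocol reduces to comparing the mean per-bar source-style probability before and after the latent swap; a small sketch, with hypothetical input arrays:

```python
import numpy as np

def transfer_effect(src_prob_before, src_prob_after):
    """Per-bar probabilities assigned to the *source* style by the
    ensemble classifier, before and after the style swap. With k = 2
    styles the classifier is binary, so a drop in source-style
    probability is an equal rise in target-style probability.
    Returns (before, after, difference); larger difference means a
    stronger style-transfer effect (cf. Table 3)."""
    before = float(np.mean(src_prob_before))
    after = float(np.mean(src_prob_after))
    return before, after, before - after
```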
MIDI-VAE retains most of the original melody, while often changing accompanying instruments to suit the target style. This is generally desirable, since we do not want to change the pitches so thoroughly that the original song cannot be recognized anymore. We provide examples of style transfers on a range of songs from our training and test sets on YouTube (see Style transfer songs).

[Figure 3: The matrix visualizes how the instruments (piano, percussion, organs, guitar, bass, strings, ensemble, brass, reed, pipe, synth lead, synth pad, synth effects, ethnic, percussive, sound effects) are changed when switching from Jazz to Classic, averaged over all Jazz songs in the test set.]

Table 4. Average before and after classifier accuracies for all classifiers (pitch/instrument/velocity) for the test set.

           Pitch          Velocity       Instrument
           Bf.    Af.     Bf.    Af.     Bf.    Af.
CvJ Test   0.77   0.66    0.67   0.57    0.90   0.20
CvP Test   0.77   0.67    0.71   0.60    0.91   0.27
JvP Test   0.65   0.63    0.67   0.64    0.67   0.55
BvM Test   0.55   0.47    0.60   0.49    0.64   0.47

5.2 Latent Space Evaluation

Figure 4 shows a t-SNE [22] plot of the latent vectors for all bars of 20 Jazz and 20 Classic pieces. The darker the color, the more "jazzy" or "classical" a song is according to the ensemble style classifier. The genres are well separated, and most songs have all their bars clustered closely together (likely thanks to the instrument roll being constant). Some classical pieces are bleeding over into the Jazz region and vice versa. As can be seen from the light color, the ensemble style classifier did not confidently assign these pieces to either style. We further perform a sweep over all 256 latent dimensions on randomly sampled bars to check whether changing one dimension has a measurable effect on the generated music.
We define 27 metrics, among which are the total number of (held) notes, mean/max/min/range of (specific or all) pitches/velocities, and style changes. Besides the obvious dimensions where the style classifier is attached, we find that some dimensions correlate with the total number of notes played in a song, the highest pitch in a bar, or the occurrence of a specific pitch. The changes can be seen when plotting the pitches, but are difficult to hear. Furthermore, the dimensions are very entangled, and changing one dimension has multiple effects. Higher values for β ∈ {1, 2, 3} slightly improve the disentanglement of latent dimensions, but strongly reduce reconstruction accuracy (see Figure 2). We added samples to YouTube to show the results of manipulating individual latent variables.

[Figure 4: t-SNE plot of latent vectors for bars from 20 Jazz (+) and 20 Classic (o) songs. Bars from the same song were given the same color. Lighter colors mean that the ensemble style classifier was less certain in its prediction.]

5.3 Generation and Interpolation

MIDI-VAE is capable of producing smooth interpolations between bars. This allows us to generate medleys by connecting short pieces from our dataset. The interpolated bars form a musically consistent bridge between the pieces, meaning that, e.g., pitch ranges and velocities increase when the target bar has higher pitch or velocity values. We can also merge entire songs together by linearly interpolating the latent vectors for two bar progressions, producing interesting mixes that are surprisingly fun to listen to. The original songs can sometimes still be identified in the mixtures, and the resulting music sounds harmonic. We again uploaded several audio samples to YouTube (see Medleys, Interpolations and Mixtures).

6. CONCLUSION

We introduce MIDI-VAE, a simple but effective model for performing style transfer between musical compositions.
We show the effectiveness of our method on several different datasets and provide audio examples. Unlike most existing models, MIDI-VAE incorporates both the dynamics (velocity and note durations) and instrumentation of music. In the future we plan to integrate our method into a hierarchical model in order to capture style features over longer time scales and allow the generation of larger pieces of music. To facilitate future research on style transfer for symbolic music, and sequence tasks in general, we make our code and data publicly available.⁵

⁵ https://github.com/brunnergino/MIDI-VAE

7. REFERENCES

[1] MIDI Association, the official MIDI specifications. https://www.midi.org/specifications. Accessed: 01-06-2018.

[2] Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828, 2013.

[3] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012, 2012.

[4] Jean-Pierre Briot, Gaëtan Hadjeres, and François Pachet. Deep learning techniques for music generation - a survey. arXiv preprint arXiv:1709.01620, 2017.

[5] Gino Brunner, Yuyi Wang, Roger Wattenhofer, and Jonas Wiesendanger. JamBot: Music theory aware chord based generation of polyphonic music with LSTMs. In 29th International Conference on Tools with Artificial Intelligence (ICTAI), 2017.

[6] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation.
In Pr oceedings of the 2014 Conference on Empirical Methods in Natural Language Pr ocessing, EMNLP 2014, October 25-29, 2014, Doha, Qatar , A meeting of SIGDA T , a Special Inter est Group of the A CL , pages 1724–1734, 2014. [7] Hang Chu, Raquel Urtasun, and Sanja Fidler . Song from PI: A musically plausible network for pop music generation. CoRR , abs/1611.03477, 2016. [8] Ching-Hua Chuan and Dorien Herremans. Modeling temporal tonal relations in polyphonic music through deep networks with a nov el image-based representa- tion. 2018. [9] Da vid Cope. Experiments in music intelligence (EMI). In Proceedings of the 1987 International Computer Music Confer ence, ICMC 1987, Champaign/Urbana, Illinois, USA, August 23-26, 1987 , 1987. [10] Hao-W en Dong, W en-Y i Hsiao, Li-Chia Y ang, and Y i- Hsuan Y ang. Musegan: Multi-track sequential gener- ativ e adversarial networks for symbolic music genera- tion and accompaniment. In Pr oceedings of the Thirty- Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, F ebruary 2-7, 2018 , 2018. [11] Douglas Eck and Juergen Schmidhuber . A first look at music composition using lstm recurrent neural net- works. Istituto Dalle Molle Di Studi Sull Intelligenza Artificiale , 103, 2002. [12] Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Mohammad Norouzi, Douglas Eck, and Karen Simonyan. Neural audio synthesis of musical notes with wav enet autoencoders. In Pr oceedings of the 34th International Conference on Machine Learn- ing, ICML 2017, Sydney , NSW , Austr alia, 6-11 August 2017 , pages 1068–1077, 2017. [13] Jose D. Fern ´ andez and Francisco J. V ico. AI methods in algorithmic composition: A comprehensiv e survey . J. Artif . Intell. Res. , 48:513–582, 2013. [14] Leon A Gatys, Alexander S Ecker , and Matthias Bethge. Image style transfer using con volutional neural networks. In Computer V ision and P attern Reco gnition (CVPR), 2016 IEEE Confer ence on , pages 2414–2423. IEEE, 2016. 
[15] Ga ¨ etan Hadjeres, Franc ¸ ois Pachet, and Frank Nielsen. Deepbach: a steerable model for bach chorales gener- ation. In Pr oceedings of the 34th International Confer - ence on Machine Learning, ICML 2017, Sydne y , NSW , Austr alia, 6-11 August 2017 , pages 1362–1371, 2017. [16] Dorien Herremans, Ching-Hua Chuan, and Elaine Chew . A functional taxonomy of music generation sys- tems. ACM Comput. Surv . , 50(5):69:1–69:30, 2017. [17] Irina Higgins, Loic Matthey , Arka Pal, Christopher Burgess, Xavier Glorot, Matthe w Botvinick, Shakir Mohamed, and Alexander Lerchner . beta-vae: Learn- ing basic visual concepts with a constrained v ariational framew ork. 2016. [18] Daniel D. Johnson. Generating polyphonic music us- ing tied parallel networks. In Computational Intel- ligence in Music, Sound, Art and Design - 6th In- ternational Conference , EvoMUSART 2017, Amster- dam, The Netherlands, April 19-21, 2017, Pr oceed- ings , pages 128–143, 2017. [19] Diederik P . Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR , abs/1412.6980, 2014. [20] Diederik P . Kingma and Max W elling. Auto-encoding variational bayes. CoRR , abs/1312.6114, 2013. [21] Y ijun Li, Chen Fang, Jimei Y ang, Zhao wen W ang, Xin Lu, and Ming-Hsuan Y ang. Univ ersal style transfer via feature transforms. In Advances in Neural Information Pr ocessing Systems 30: Annual Confer ence on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA , pages 385–395, 2017. [22] Laurens van der Maaten and Geof frey Hinton. V isu- alizing data using t-sne. Journal of machine learning r esearc h , 9(Nov):2579–2605, 2008. [23] Iman Malik and Carl Henrik Ek. Neural translation of musical style. CoRR , abs/1708.03535, 2017. [24] Soroush Mehri, Kundan Kumar , Ishaan Gulrajani, Rithesh Kumar , Shubham Jain, Jose Sotelo, Aaron C. Courville, and Y oshua Bengio. Samplernn: An un- conditional end-to-end neural audio generation model. CoRR , abs/1612.07837, 2016. 
[25] Olof Mogren. C-RNN-GAN: continuous recurrent neural networks with adversarial training. CoRR , abs/1611.09904, 2016. [26] Noam Mor , Lior W olf, Adam Polyak, and Y aniv T aig- man. A univ ersal music translation network. CoRR , abs/1805.07848, 2018. [27] Michael C. Mozer . Neural network music composi- tion by prediction: Exploring the benefits of psychoa- coustic constraints and multi-scale processing. Con- nect. Sci. , 6(2-3):247–280, 1994. [28] Colin Raffel and Daniel PW Ellis. Intuiti ve anal- ysis, creation and manipulation of midi data with pretty midi. In 15th International Society for Music Information Retrie val Confer ence Late Br eaking and Demo P apers , pages 84–93, 2014. [29] Adam Roberts, Jesse Engel, and Douglas Eck. Hier- archical v ariational autoencoders for music. In NIPS W orkshop on Mac hine Learning for Cr eativity and De- sign , 2017. [30] T ianxiao Shen, T ao Lei, Regina Barzilay , and T ommi Jaakkola. Style transfer from non-parallel text by cross- alignment. In Advances in Neural Information Pr ocess- ing Systems , pages 6833–6844, 2017. [31] Peter M T odd. A connectionist approach to algorith- mic composition. Computer Music Journal , 13(4):27– 43, 1989. [32] A ¨ aron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol V inyals, Alex Grav es, Nal Kalchbrenner, Andrew W . Senior, and K oray Kavukcuoglu. W av enet: A generativ e model for raw audio. In The 9th ISCA Speech Synthesis W orkshop, Sunnyvale, CA, USA, 13-15 September 2016 , page 125, 2016. [33] A ¨ aron v an den Oord, Oriol V inyals, and Koray Kavukcuoglu. Neural discrete representation learning. In Advances in Neural Information Pr ocessing Sys- tems 30: Annual Confer ence on Neural Information Pr ocessing Systems 2017, 4-9 December 2017, Long Beach, CA, USA , pages 6309–6318, 2017. [34] Li-Chia Y ang, Szu-Y u Chou, and Y i-Hsuan Y ang. Midinet: A con volutional generative adversarial net- work for symbolic-domain music generation. 
In Pr o- ceedings of the 18th International Society for Music In- formation Retrieval Conference , ISMIR 2017, Suzhou, China, October 23-27, 2017 , pages 324–331, 2017. [35] Lantao Y u, W einan Zhang, Jun W ang, and Y ong Y u. Seqgan: Sequence generativ e adversarial nets with pol- icy gradient. In Pr oceedings of the Thirty-F irst AAAI Confer ence on Artificial Intelligence, F ebruary 4-9, 2017, San F rancisco, California, USA. , pages 2852– 2858, 2017. [36] Junbo Jake Zhao, Y oon Kim, Kelly Zhang, Alexan- der M. Rush, and Y ann LeCun. Adversarially regu- larized autoencoders for generating discrete structures. CoRR , abs/1706.04223, 2017. [37] Jun-Y an Zhu, T aesung Park, Phillip Isola, and Alex ei A. Efros. Unpaired image-to-image translation using c ycle-consistent adv ersarial networks. In IEEE International Confer ence on Computer V ision, ICCV 2017, V enice, Italy , October 22-29, 2017 , pages 2242– 2251, 2017.
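The interpolation described in Section 5.3 is, at its core, a convex combination of two encoder outputs, decoded step by step. A minimal sketch of this idea follows, assuming latent codes are plain NumPy vectors; the function name, the latent dimensionality, and the stand-in vectors are illustrative and not taken from the released code.

```python
import numpy as np

def interpolate_latents(z_a, z_b, num_steps):
    """Return num_steps latent vectors moving linearly from z_a to z_b.

    Each step is the convex combination (1 - t) * z_a + t * z_b for t
    evenly spaced in [0, 1]; decoding every step would yield the bridge
    of bars between the two pieces.
    """
    ts = np.linspace(0.0, 1.0, num_steps)
    return [(1.0 - t) * z_a + t * z_b for t in ts]

# Illustrative stand-ins for the encoder outputs of two bars.
z_a = np.zeros(64)
z_b = np.ones(64)
path = interpolate_latents(z_a, z_b, num_steps=5)
# path[0] equals z_a, path[-1] equals z_b, and path[2] is their midpoint.
```

Decoding each vector in `path` with the trained decoder would produce the gradually changing bars used for medleys; mixing two entire songs corresponds to applying the same combination bar by bar along both songs' latent sequences.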