Musical Composition Style Transfer via Disentangled Timbre Representations


Authors: Yun-Ning Hung, I-Tung Chiang, Yi-An Chen, Yi-Hsuan Yang

Yun-Ning Hung¹, I-Tung Chiang¹, Yi-An Chen², Yi-Hsuan Yang¹
¹Research Center for IT Innovation, Academia Sinica, Taiwan
²KKBOX Inc., Taiwan
{biboamy, itungchiang}@citi.sinica.edu, annchen@kkbox.com, yang@citi.sinica.edu

Abstract

Music creation involves not only composing the different parts (e.g., melody, chords) of a musical work but also arranging/selecting the instruments to play the different parts. While the former has received increasing attention, the latter has not been much investigated. This paper presents, to the best of our knowledge, the first deep learning models for rearranging music of arbitrary genres. Specifically, we build encoders and decoders that take a piece of polyphonic musical audio as input, and predict as output its musical score. We investigate disentanglement techniques such as adversarial training to separate latent factors that are related to the musical content (pitch) of different parts of the piece, and that are related to the instrumentation (timbre) of the parts per short-time segment. By disentangling pitch and timbre, our models have an idea of how each piece was composed and arranged. Moreover, the models can realize "composition style transfer" by rearranging a musical piece without much affecting its pitch content. We validate the effectiveness of the models by experiments on instrument activity detection and composition style transfer. To facilitate follow-up research, we open source our code at https://github.com/biboamy/instrument-disentangle.

1 Introduction

Music generation has long been considered an important task for AI, possibly due to the complicated mental and creative process that is involved, and its wide applications in our daily lives.
Even if machines cannot reach the same professional level as well-trained composers or songwriters, machines can cooperate with humans to make music generation easier or more accessible to everyone.

Along with the rapid development of deep generative models such as generative adversarial networks (GANs) [Goodfellow and others, 2014], the research on music generation is experiencing a new momentum. While a musical work is typically composed of melody, chord, and rhythm (beats), a major trend of recent research focuses on creating original content for all or some of these three parts [Roberts et al., 2017; Dong et al., 2018a]. That is, we expect machines to learn the complex musical structure and compose music like human composers. This has been technically referred to as music composition [Briot et al., 2017].

Another trend is to generate variations of existing musical material. By preserving the pitch content of a musical piece, we expect machines to modify style-related attributes of music, such as genre and instrumentation. This has been technically referred to as music style transfer [Dai et al., 2018]. A famous example of style transfer, though not fully automatic, is the "Ode to Joy" project presented by [Pachet, 2016]. The project aims to rearrange (or reorchestrate) Beethoven's Ode to Joy according to the stylistic conventions of seven other different musical genres such as Jazz and Bossa Nova ([Online] https://www.youtube.com/watch?v=buXqNqBFd6E).

In this paper, we are in particular interested in such a music rearrangement task. The work of [Pachet, 2016] is inspiring, but as the target styles are by nature very different, they had to tailor different machine learning models, one for each style. Moreover, they chose to use non-neural-network-based models such as Markov models, possibly due to the need to inject musical knowledge manually to ensure the final production quality.

As a result, the solution presented by [Pachet, 2016] is not general and it is not easy to make extensions. In contrast, we would like to investigate a universal neural network model with which we can easily change the instrumentation of a given music piece fully automatically, without much affecting its pitch content. While we do not think such a style-agnostic approach can work well for all possible musical styles, it can be treated as a starting point from which style-specific elements can be added later on.

Music rearrangement is closely related to the art of music arrangement [Corozine, 2002]. For creating music, it involves not only composing the different parts (e.g., melody, chords) of a musical work but also arranging/selecting the instruments to play the different parts. For example, chord and melody can be played by different instruments. Setting the instrumentation is like coloring the music. Therefore, music arrangement is an "advanced" topic, and is needed towards generating fully-fledged music material. Compared to music arrangement, music rearrangement is a style transfer task and is like re-coloring an existing music piece. By contributing to music rearrangement, we also contribute indirectly to the more challenging automatic music arrangement task, which has been seldom addressed in the literature as well.

Music rearrangement is not an easy task even for human beings and requires years of professional training. In order to convincingly perform music style transfer between two specific styles, a musician must first be familiar with the annotation, instrument usage, tonality, harmonization and theories of the two styles. The task gets harder when we consider more styles.
For machines to rearrange a musical piece, we have to consider not only the pitch content of the given music, but also the pitch range of each individual instrument as well as the relation between different instruments (e.g., some instrument combinations create more harmonic sounds). Moreover, it is hard to find paired data demonstrating different instrumentations of the same music pieces [Crestel et al., 2017]. A neural network may not learn well when the data size is small.

To address these challenges, we propose to train music transcription neural networks that take as input an audio file and generate as output the corresponding musical score. In addition to predicting which notes are played in the given audio file, the model also has to predict the instrument that plays each note. As a result, the model learns both pitch- and timbre-related representations from the musical audio. While the model is not trained to perform music rearrangement, we investigate "disentanglement techniques" such as adversarial training [Liu et al., 2018b; Lee et al., 2018] to separate latent factors of the model that are pitch- and timbre-related. By disentangling pitch and timbre, the model has an idea of how a musical piece was composed and arranged. If the disentanglement is well done, we expect that we can rearrange a musical piece by holding its pitch-related latent factors fixed while manipulating the timbre-related ones.

We propose and investigate two models in this paper. They share the following two advantages. First, they require only pairs of audio and MIDI files, which are relatively easier to collect than different arrangements of the same music pieces. Second, they can take any musical piece as input and rearrange it, as long as we have its audio recording.
In sum, the main contributions of this work include:

- To the best of our knowledge, this work presents the first deep learning models that realize music rearrangement, a composition style transfer task [Dai et al., 2018] (see Section 2.1), for polyphonic music of arbitrary genres.
- While visual attribute disentanglement has been much studied recently, attribute disentanglement for musical audio has only been studied in our prior work [Hung et al., 2018], to our knowledge. Extending from that work, we report a comprehensive evaluation of the models, emphasizing their application to rearrangement.

Musical composition style transfer is still at its early stage. We hope this work can call for more attention toward approaching this task with computational means.

2 Related Work

2.1 Music Style Transfer

According to the categorization of [Dai et al., 2018], there are three types of music style transfer: timbre style transfer, performance style transfer, and composition style transfer. To our knowledge, timbre style transfer has received more attention than the other two lately. The task is concerned with altering the timbre of a musical piece in the audio domain. For example, [Engel et al., 2017] applied WaveNet-based autoencoders to audio waveforms to model the musical timbre of short single notes, and [Lu et al., 2019] employed a GAN-based network on spectrograms to achieve long-structure timbre style transfer. However, for the latter two types of music style transfer, little effort has been made.

Music rearrangement can be considered a composition style transfer task, which is defined as a task that "preserve[s] the identifiable melody contour (and the underlying structural functions of harmony) while altering some other score features in a meaningful way" [Dai et al., 2018]. The recent works on composition style transfer we are aware of are those by [Pati, 2018] and [Kaliakatsos-Papakostas et al., 2017], which used neural networks and rules, respectively, to deal with melody, rhythm, and chord modification. Our work is different in that we aim to modify the instrumentation.

2.2 Disentanglement for Style Transfer

Neural style transfer was first introduced for image style transfer [Gatys et al., 2016]. In that work, the authors proposed to learn an image representation that separates and recombines content and style. Since then, several methods have been proposed to modify the style of an image. [Liu et al., 2018b] trained an autoencoder model supervised by content ground-truth labels to learn a content-invariant representation, which can be modified to reconstruct a new image. [Lee et al., 2018] showed that it is possible to learn a disentangled representation from unpaired data by using a cross-cycle consistency loss and adversarial training, without needing the ground-truth labels.

Disentangled representation learning for music is still new and has only been studied by [Brunner et al., 2018] for symbolic data. By learning a style representation through a style classifier, their model is able to modify the pitch, instrumentation and velocity given different style codes. Our model is different in that we deal with audio input rather than MIDIs.

3 Proposed Models

Figure 1 shows the architecture of the proposed models. We opt for using encoder/decoder structures as the backbone of our models, as such networks learn latent representations (a.k.a. latent codes) of the input data. With some labeled data, it is possible to train an encoder/decoder network in a way that different parts of the latent code correspond to different data attributes, such as pitch and timbre. Both models use encoders/decoders and adversarial training to learn music representations, but the second model additionally uses skip connections to deal with the pitch information. We note that both models perform polyphonic music transcription.
We present the details of these models below.

Figure 1: The two proposed encoder/decoder architectures for learning disentangled timbre representations (i.e., Z_t) for musical audio: (a) DuoED uses separate encoders/decoders for timbre and pitch; (b) UnetED uses skip connections to process pitch. The dashed lines indicate the adversarial training components. (Notations: CQT—a time-frequency representation of audio; roll—multi-track pianoroll; E—encoder; D—decoder; Z—latent code; t—timbre; p—pitch; skip—skip connections.)

Figure 2: Different symbolic representations of music: (a) pianoroll, (b) instrument roll, (c) pitch roll.

3.1 Input/Output Data Representation

The input to our models is an audio waveform of arbitrary length. To facilitate pitch and timbre analysis, we first convert the waveform into a time-frequency representation that shows the energy distribution across different frequency bins for each short-time frame. Instead of using the short-time Fourier transform, we use the constant-Q transform (CQT), for it adopts a logarithmic frequency scale that better aligns with our perception of pitch [Bittner et al., 2017]. CQT also provides better frequency resolution in the low-frequency part, which helps detect the fundamental frequencies.

As will be described in Section 3.5, our encoders and decoders are designed to be fully-convolutional [Oquab et al., 2015; Liu et al., 2019], so that our models can deal with input of any length at testing time. However, for the convenience of training the models with mini-batches, in the training stage we divide the waveforms in our training set into 10-second chunks (without overlaps) and use these chunks as the model input, leading to a matrix X_cqt ∈ R^{F×T} of fixed size for each input. In our implementation, we compute CQT with the librosa library [McFee et al., 2015], with a 16,000 Hz sampling rate and 512-sample window size, again with no overlaps.
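With these settings, the fixed dimensions of each training input follow from simple arithmetic; a quick sanity check in plain Python (no audio needed), matching the F = 88 and T = 312 stated in the paper:

```python
# CQT settings described above: 16 kHz audio, 512-sample windows, no overlap
sr = 16000        # sampling rate (Hz)
hop = 512         # window size in samples (also the hop, since there is no overlap)
seconds = 10      # length of one training chunk

n_bins = 88       # one frequency bin per piano note (12 bins per octave)
T = sr * seconds // hop   # number of short-time frames per 10-second chunk

print((n_bins, T))  # -> (88, 312)
```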
We use a frequency scale of 88 bins, with 12 bins per octave, to represent each note. Hence, F = 88 (bins) and T = 312 (frames).

We use pianorolls [Dong et al., 2018b] (see Figure 2a) as the target output of our models. A pianoroll is a binary-valued tensor that records the presence of notes (88 notes here) across time for each track (i.e., instrument). When we consider M instruments, the target model output would be X_roll ∈ {0,1}^{F×T×M}. X_roll and X_cqt are temporally aligned, since we use MIDIs that are time-aligned with the audio clips to derive the pianorolls, as discussed in Section 3.5. As shown in Figures 1 and 2, besides asking our models to generate X_roll, we use the instrument roll X_t ∈ {0,1}^{M×T} and pitch roll X_p ∈ {0,1}^{F×T} as supervisory signals to help learn the timbre representation. Both X_t and X_p can be computed from X_roll by summing along a certain dimension.

3.2 The DuoED Model

The architecture of DuoED is illustrated in Figure 1a. The design is inspired by [Liu et al., 2018b], but we adapt the model to encode music. Specifically, we train two encoders E_t and E_p to respectively convert X_cqt into the timbre code Z_t = E_t(X_cqt) ∈ R^{κ×τ} and the pitch code Z_p = E_p(X_cqt) ∈ R^{κ×τ}. We note that, unlike in the case of image representation learning, here the latent codes are matrices, and we require that the second dimensions (i.e., τ) represent time. This way, each column of Z_t and Z_p is a κ-dimensional representation of a temporal segment of the input. For abstraction, we require κτ < FT.

DuoED also contains three decoders D_roll, D_t and D_p. The encoders and decoders are trained such that we can use D_roll([Z_t^T, Z_p^T]^T) to predict X_roll, D_t(Z_t) to predict X_t, and D_p(Z_p) to predict X_p. The prediction error is measured by the binary cross entropy between the ground truth and the prediction.
For example, for the timbre classifier D_t, it is:

  L_t = −Σ [ X_t · ln σ(X̂_t) + (1 − X_t) · ln(1 − σ(X̂_t)) ],   (1)

where X̂_t = D_t(Z_t), '·' denotes the element-wise product, the sum runs over all entries, and σ is the sigmoid function that scales its input to [0, 1]. We can similarly define L_roll and L_p.

In each training epoch, we optimize both the encoders and the decoders by minimizing L_roll, L_t and L_p for the given training batch. We refer to this training scheme as "temporal supervision," since to minimize the loss terms L_roll, L_t and L_p, we have to make an accurate prediction for each of the T time frames.

When the adversarial training strategy is employed (i.e., the components marked by dashed lines in Figure 1a), we additionally consider the following two loss terms:

  L_t^n = −Σ [ ln(1 − σ(X̂_t^n)) ],   (2)
  L_p^n = −Σ [ ln(1 − σ(X̂_p^n)) ],   (3)

where X̂_t^n = D_t(Z_p) and X̂_p^n = D_p(Z_t), meaning that we purposefully feed the 'wrong' input to D_t and D_p. These losses are minimized when the outputs are close to zero: given the wrong input, we expect D_t and D_p to output nothing (i.e., all zeros), since, e.g., Z_t is supposed not to contain any pitch-related information. Please note that, in adversarial training, we use L_t^n and L_p^n to update the encoders only. This is to preserve the function of the decoders in making accurate predictions.

3.3 The UnetED Model

The architecture of UnetED is depicted in Figure 1b, which has a U-shape similar to [Ronneberger et al., 2015]. In UnetED, we learn only one encoder E_cqt to get a single latent representation Z_t of the input X_cqt. We add skip connections between E_cqt and D_roll, which enable lower-layer information of E (closer to the input) to be passed directly to the higher layers of D (closer to the output), making it easier to train deeper models.
The model learns E_cqt and D_roll by minimizing L_roll, the cross entropy between D_roll(Z_t) and the pianoroll X_roll. Moreover, we promote timbre information in Z_t by refining E_cqt and learning a classifier D_t by minimizing L_t (see Eq. (1)). When the adversarial training strategy is adopted, we design a GAN-like structure (i.e., unlike the case in DuoED) to dispel pitch information from the timbre representation. That is, the model additionally learns a pitch classifier D_p to minimize the loss L_p between D_p(Z_t) and X_p; this loss updates only D_p. Meanwhile, the encoder E_cqt tries to fool D_p, for which the ground-truth matrices should be zero matrices. That is, the encoder should minimize L_p^n, and this loss affects only the encoder.

The design of UnetED is based on the following intuitions. First, since Z_t is supposed not to carry pitch information, the only way to obtain the pitch information needed to predict X_roll is from the skip connections. Second, it is fine to assume that the skip connections can pass over the pitch information, due to the nice one-to-one time-frequency correspondence between X_cqt and each frontal slice of X_roll.² Moreover, in X_cqt pitch only affects the lowest partial of a harmonic series created by a musical note, while timbre affects all the partials. If we view pitch as the "boundary" outlining an object, and timbre as the "texture" of that object, detecting the pitches may be analogous to image segmentation, for which U-nets have been shown effective [Ronneberger et al., 2015].

We note that the major difference between DuoED and UnetED is that there is a pitch code Z_p in DuoED. Although the pitch representation may benefit other tasks, for composition style transfer we care more about the timbre code Z_t.
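The supervised loss of Eq. (1) and the adversarial terms of Eqs. (2)-(3) used by both models are plain summed binary cross-entropies; a minimal numpy sketch (the function names and the small epsilon for numerical safety are our additions, not from the paper):

```python
import numpy as np

def sigmoid(z):
    """Logistic function, squashing real-valued logits into [0, 1]."""
    return 1.0 / (1.0 + np.exp(-z))

def bce_sum(X, logits, eps=1e-12):
    """Eq. (1): summed binary cross entropy between targets X and sigmoid(logits)."""
    p = sigmoid(logits)
    return -np.sum(X * np.log(p + eps) + (1 - X) * np.log(1 - p + eps))

def adv_sum(logits, eps=1e-12):
    """Eqs. (2)-(3): loss for the 'wrong' latent code; minimized as sigmoid(logits) -> 0."""
    return -np.sum(np.log(1 - sigmoid(logits) + eps))
```

Note that Eqs. (2)-(3) are exactly Eq. (1) with an all-zero target, which is why minimizing them pushes the fooled classifier toward predicting silence.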
3.4 Composition Style Transfer

While both DuoED and UnetED are not trained to perform music rearrangement, they can be applied to it thanks to the built-in pitch/timbre disentanglement. Specifically, for style transfer, we are given a source clip A and a target clip B, both audio files. We can realize style transfer by using A as the input to the encoder of DuoED or UnetED to get the content information (i.e., through the pitch code Z_p^(A) or the skip connections), but then combining it with the timbre code Z_t^(B) obtained from B to generate a new pianoroll X_roll^(A→B), from which we can create synthesized audio.

3.5 Implementation Details

Since the input and output are both matrices, we use convolutional layers in all the encoders and decoders of DuoED and UnetED. To accommodate input of variable length, we adopt a fully-convolutional design [Oquab et al., 2015], meaning that we do not use pooling layers at all. All the encoders in our models are composed of four residual blocks. Each block contains three convolution layers and two batch normalization layers. The decoder D_roll has the same structure, but uses transposed convolution layers to do the upsampling. There are in total twelve layers in both the encoder and the pianoroll decoder D_roll. For the pitch and timbre decoders, we use three transposed convolution layers to reconstruct the pitch roll and instrument roll from the latent representation. Moreover, we use leaky ReLU as the activation function for all layers but the last one, where we use the sigmoid function. Both DuoED and UnetED are trained using stochastic gradient descent with momentum 0.9. The initial learning rate is set to 0.005.

² That is, both X_cqt(i, j) and X_roll(i, j, m) refer to the activity of the same musical note i for the same time frame j.

We use the newly released MuseScore dataset [Hung et al., 2019] to train the proposed models.
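Returning to the rearrangement recipe of Section 3.4, the data flow can be sketched with stand-in functions; only the shapes and the flow below are meant literally, while the encoder/decoder bodies are hypothetical placeholders for the trained convolutional networks:

```python
import numpy as np

F, T, M, K = 88, 312, 5, 32   # pitch bins, frames, instruments; K is a stand-in latent size

def encode_pitch(X_cqt):       # placeholder for E_p (or the skip connections in UnetED)
    return np.tanh(X_cqt[:K, :])

def encode_timbre(X_cqt):      # placeholder for E_t / E_cqt
    return np.tanh(X_cqt[-K:, :])

def decode_roll(Z_p, Z_t):     # placeholder for D_roll([Z_p^T, Z_t^T]^T), Sec. 3.2
    mix = np.concatenate([Z_p, Z_t], axis=0)        # stacked codes, shape (2K, T)
    W = np.ones((F * M, 2 * K)) / (2 * K)           # stand-in weights
    probs = 1.0 / (1.0 + np.exp(-(W @ mix)))        # sigmoid outputs, shape (F*M, T)
    return (probs > 0.5).astype(np.uint8).reshape(F, M, T).transpose(0, 2, 1)

# style transfer: pitch content from clip A, instrumentation from clip B
rng = np.random.default_rng(0)
X_cqt_A, X_cqt_B = rng.random((F, T)), rng.random((F, T))
Z_p_A = encode_pitch(X_cqt_A)          # content code of the source clip A
Z_t_B = encode_timbre(X_cqt_B)         # timbre code of the target clip B
X_roll_AB = decode_roll(Z_p_A, Z_t_B)  # rearranged score X_roll^(A->B), shape (F, T, M)
```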
This dataset contains 344,166 paired MIDI and MP3 files. Most MP3 files were synthesized from the MIDIs with the MuseScore synthesizer by the uploaders. Hence, the audio and MIDIs are already time-aligned. We further ensure temporal alignment by using the method proposed by [Raffel and Ellis, 2016]. We then convert the time-aligned MIDIs to pianorolls with the pypianoroll package [Dong et al., 2018b]. The dataset contains music of different genres and 128 different instrument categories as defined by the MIDI spec.

4 Experiment

As automatic music rearrangement remains a new task, there are no standard metrics to evaluate our models. We propose to first evaluate our models objectively with the surrogate task of instrument activity detection (IAD) [Gururani and Lerch, 2017] (i.e., detecting the activity of different instruments for each short-time audio segment), to validate the effectiveness of the obtained instrument codes Z_t. After that, we evaluate the performance of music rearrangement subjectively, by means of a user study.

4.1 Evaluation on Instrument Activity Detection

We evaluate the performance for IAD using the 'MedleyDB + Mixing Secret' (M&M) dataset proposed by [Gururani and Lerch, 2017]. The dataset contains real-world music recordings (i.e., not synthesized ones) of various genres. By evaluating our model on this dataset, we can compare the result with quite a few recent works on IAD. Following the setup of [Hung et al., 2019], we evaluate the result for per-second IAD in terms of the area under the curve (AUC). We compute the AUC for each instrument and report the per-instrument AUC as well as the average AUC across the instruments.

IAD is essentially concerned with the prediction of the instrument roll shown in Figure 2b.
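The AUC metric used throughout this section can be computed directly from ranks, since it equals the normalized Mann-Whitney U statistic; a dependency-free sketch (the function name is our own; any standard implementation, e.g. scikit-learn's roc_auc_score, computes the same quantity):

```python
import numpy as np

def auc_score(y_true, y_score):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) formulation."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    order = np.argsort(y_score, kind="mergesort")
    ranks = np.empty(len(y_score))
    ranks[order] = np.arange(1, len(y_score) + 1)
    # average the ranks of tied scores
    s = y_score[order]
    i = 0
    while i < len(s):
        j = i
        while j + 1 < len(s) and s[j + 1] == s[i]:
            j += 1
        ranks[order[i:j + 1]] = ranks[order[i:j + 1]].mean()
        i = j + 1
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    u = ranks[y_true].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)
```

An AUC of 1.0 means the classifier ranks every active segment above every inactive one; 0.5 is chance level.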
As our DuoED and UnetED models are pre-trained on the synthetic MuseScore dataset, we train additional instrument classifiers D't with the pre-defined training split of the M&M dataset (200 songs) [Hung et al., 2019]. D't takes as input the timbre code Z_t computed by the pre-trained models. Following [Gururani et al., 2018], we use a network of four convolution layers and two fully-connected layers for the instrument classifiers D't.³ We use the estimate of D't as the predicted instrument roll.

Table 1: Average AUC scores of per-second instrument activity detection on the test split of the 'MedleyDB + Mixing Secret' (M&M) dataset [Gururani and Lerch, 2017; Hung et al., 2019], for five instruments. For the latter four methods, we pre-train ('pre') the models on the MuseScore dataset [Hung et al., 2019] and then use the training split of M&M for updating the associated timbre classifiers.

Method | training set | Piano | Guitar | Violin | Cello | Flute | Avg
[Hung and Yang, 2018] | MuseScore training split | 0.690 | 0.660 | 0.697 | 0.774 | 0.860 | 0.736
[Liu et al., 2019] | YouTube-8M | 0.766 | 0.780 | 0.787 | 0.755 | 0.708 | 0.759
[Hung et al., 2019] | MuseScore training split | 0.718 | 0.819 | 0.682 | 0.812 | 0.961 | 0.798
[Gururani et al., 2018] | M&M training split | 0.733 | 0.783 | 0.857 | 0.860 | 0.851 | 0.817
DuoED (updated) | MuseScore (pre), M&M training split | 0.721 | 0.790 | 0.865 | 0.810 | 0.912 | 0.815
UnetED (updated) | MuseScore (pre), M&M training split | 0.781 | 0.835 | 0.885 | 0.807 | 0.832 | 0.829
[Hadad et al., 2018] (updated) | MuseScore (pre), M&M training split | 0.745 | 0.807 | 0.816 | 0.769 | 0.883 | 0.804
[Liu et al., 2018a] (updated) | MuseScore (pre), M&M training split | 0.808 | 0.844 | 0.789 | 0.766 | 0.710 | 0.793

Table 2: The same evaluation task as that in Table 1. We compare the models trained on MuseScore only here, with ('w/') or without ('w/o') adversarial training ('Adv'). We evaluate both the instrument roll predicted by the timbre decoder D_t, and the instrument roll obtained by summing up the pianoroll predicted by D_roll.

evaluated roll | model | Avg AUC
Instrument roll | DuoED w/o Adv | 0.731
Instrument roll | DuoED w/ Adv | 0.741
Instrument roll | UnetED w/o Adv | 0.733
Instrument roll | UnetED w/ Adv | 0.754
Sum up from pianoroll | DuoED w/o Adv | 0.778
Sum up from pianoroll | DuoED w/ Adv | 0.781
Sum up from pianoroll | UnetED w/o Adv | 0.778
Sum up from pianoroll | UnetED w/ Adv | 0.783

Table 1 shows the evaluation result on the pre-defined test split of the M&M dataset (69 songs) [Hung et al., 2019] for four state-of-the-art models (the first four rows) and our models (the middle two rows), considering only the five most popular instruments as in [Hung et al., 2019]. Table 1 shows that the proposed UnetED model achieves better AUC scores for most instruments and achieves the highest average AUC (0.829) overall, while the performance of DuoED is on par with [Gururani et al., 2018]. We also conducted a paired t-test and found that the performance difference between Gururani's method and the proposed UnetED method is statistically significant (p-value < 0.05). This result demonstrates the effectiveness of the learned disentangled timbre representation Z_t for IAD, compared to conventional representations such as the log mel-spectrogram used by [Gururani et al., 2018]. Moreover, we note that these prior arts on IAD cannot work for composition style transfer, while our models can.

The last two rows of Table 1 show the results of two existing disentanglement methods originally developed for images [Liu et al., 2018a; Hadad et al., 2018].⁴ We use encoder/decoder structures similar to our models and the same strategy to pre-train on MuseScore and then optimize the timbre classifier on the M&M training split. We see that the proposed models still outperform these two models. This can be expected, as their models were not designed for music.

Table 2 reports an ablation study where we do not use the M&M training split to learn D't but use the original D_t trained on MuseScore for IAD. In addition, we compare the cases with and without adversarial training, and the case where we get the instrument roll estimate from the output of D_roll [Hung et al., 2019], which considers both pitch and timbre. We see that adversarial training improves IAD, and that adding pitch information helps. From Tables 1 and 2, we also see that it is helpful to update the instrument classifier with the training split of M&M, possibly due to the acoustic difference between synthetic and real-world audio recordings.

4.2 Evaluation on Music Rearrangement

Figure 3: Result of our user study on music rearrangement.

To evaluate the result of music rearrangement, we choose four types of popular composition style in our experiment: strings composition (i.e., violin and cello), piano, acoustic composition (i.e., guitar, or guitar and melody), and band composition (i.e., electric guitar, bass, and melody); the melody can be played by any instrument. We randomly choose four clips of each type from the MuseScore dataset, and then transfer them into the other three types following the approach described in Section 3.4. We compare the results of four methods.

³ We do not use the M&M training split to fine-tune the original instrument classifiers D_t of our models but train a new one, D't, so as to make sure that our instrument classifier has the same architecture as that used by [Gururani et al., 2018]. In this way, the major difference between our models and theirs is the input feature (the timbre code Z_t for ours, and the log mel-spectrogram for theirs), despite some minor differences in, for example, the employed filter sizes.

⁴ We adapt the two methods for music as follows. We employ the pianoroll as the target output for both. And, for [Liu et al., 2018a], we treat 'S' as timbre and 'Z' as pitch; for [Hadad et al., 2018], we replace 'style' with pitch and 'class' with timbre.
Figure 4: Demonstration of music rearrangement by UnetED (best viewed in color), for four types of styles: strings, piano, acoustic, and band (see Section 4.2 for definitions). The source clips are those marked by bounding boxes, and the generated ones are those to the right of them. [Purple: flute, Red: piano, Black: bass, Green: acoustic guitar, Yellow: electric guitar, Blue: cello, Brown: violin.]

We compare UnetED, DuoED, UnetED w/o Adv, and the method proposed by [Hadad et al., 2018]. We treat the last method as the baseline, as it is designed for image style transfer, not for music. We invite human subjects to listen to and rate the rearrangement results. Each time, a subject is given the source clip and the four rearranged results produced by the different models, all aiming to convert the style of that clip to one of the other three. The subject rates the rearranged results along the following three dimensions on a four-point Likert scale:

- whether the composition sounds rhythmic;
- whether the composition sounds harmonic;
- and the overall quality of the rearrangement.

The higher the scores, the better. Since there is no ground truth for music style transfer, the rating task is by nature subjective. This process is repeated 4 × 3 = 12 times until all the source-target pairs are evaluated. Each time, the ordering of the results of the four models is random.

We distributed the online survey through emails and social media to solicit voluntary, non-paid participation. 40 subjects participated in the study, 18 of whom have experience in composing music. Figure 3 shows the average result across the clips and the subjects. It shows that, while DuoED and UnetED perform similarly for IAD, UnetED performs better in all three metrics for music rearrangement. Moreover, with the adversarial training, UnetED can generate more rhythmic and harmonic music than its ablated version. And, [Hadad et al., 2018] does not work well at all.⁵ Moreover, we found that there is not much difference between the ratings from people with and without a music background; the only difference is that people with a music background tend on average to give slightly lower ratings.

Figure 4 shows examples of the rearrangement results by UnetED. We can see that when rearranging the music to be played by a band, UnetED finds the low-pitched notes for the bass to play and the melody notes for the flute or violin to play. This demonstrates that UnetED has the ability to choose suitable instruments to play certain note combinations. This result is also shown when arranging for string instruments. More demo results of UnetED can be found at https://biboamy.github.io/disentangle_demo/.

⁵ The poor result of [Hadad et al., 2018] can be expected, since the pre-trained instrument decoder can only guarantee that Z_t contains timbre information, but cannot restrict Z_t from carrying any pitch information. We observe that this method tends to lose much pitch information in the rearranged results and to create a lot of empty tracks.

5 Conclusion

In this paper, we have presented two models for learning disentangled timbre representations. This is done by using the instrument and pitch labels from MIDI files. Adversarial training is also employed to disentangle the timbre-related information from the pitch-related one. Experiments show that the learned timbre representations lead to state-of-the-art accuracy in instrument activity detection. Furthermore, by modifying the timbre representations, we can generate new score rearrangements from audio input. In the future, we plan to extend the instrument categories and also include the singing voice to cope with more music genres. We want to test other combinations of instruments to evaluate the performance of our models. We also want to further improve our models by re-synthesizing the MP3 files of the MuseScore data with more realistic sound fonts.
References

[Bittner et al., 2017] Rachel M. Bittner, Brian McFee, Justin Salamon, Peter Li, and Juan P. Bello. Deep salience representations for f0 tracking in polyphonic music. In Proc. Int. Soc. Music Information Retrieval Conf., 2017.

[Briot et al., 2017] Jean-Pierre Briot, Gaëtan Hadjeres, and François Pachet. Deep learning techniques for music generation: A survey. arXiv preprint, 2017.

[Brunner et al., 2018] Gino Brunner, Andres Konrad, Yuyi Wang, and Roger Wattenhofer. MIDI-VAE: Modeling dynamics and instrumentation of music with applications to style transfer. arXiv preprint, 2018.

[Corozine, 2002] V. Corozine. Arranging Music for the Real World: Classical and Commercial Aspects. Pacific, MO: Mel Bay. ISBN 0-7866-4961-5. OCLC 50470629, 2002.

[Crestel et al., 2017] Léopold Crestel, Philippe Esling, Lena Heng, and Stephen McAdams. A database linking piano and orchestral MIDI scores with application to automatic projective orchestration. In Proc. Int. Soc. Music Information Retrieval Conf., pages 592–598, 2017.

[Dai et al., 2018] Shuqi Dai, Zheng Zhang, and Gus Xia. Music style transfer issues: A position paper. arXiv preprint arXiv:1803.06841, 2018.

[Dong et al., 2018a] Hao-Wen Dong, Wen-Yi Hsiao, Li-Chia Yang, and Yi-Hsuan Yang. MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment. In Proc. AAAI Conf. Artificial Intelligence, 2018.

[Dong et al., 2018b] Hao-Wen Dong, Wen-Yi Hsiao, and Yi-Hsuan Yang. Pypianoroll: Open source Python package for handling multitrack pianoroll. In Proc. Int. Soc. Music Information Retrieval Conf., 2018. Late-breaking paper.

[Engel et al., 2017] Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Mohammad Norouzi, Douglas Eck, and Karen Simonyan. Neural audio synthesis of musical notes with WaveNet autoencoders. In Proc. Int. Conf. Machine Learning, pages 1068–1077, 2017.

[Gatys et al., 2016] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 2414–2423, 2016.

[Goodfellow and others, 2014] Ian Goodfellow et al. Generative adversarial nets. In Proc. Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[Gururani and Lerch, 2017] Siddharth Gururani and Alexander Lerch. Mixing Secrets: A multi-track dataset for instrument recognition in polyphonic music. In Proc. Int. Soc. Music Information Retrieval Conf., 2017. Late-breaking paper.

[Gururani et al., 2018] Siddharth Gururani, Cameron Summers, and Alexander Lerch. Instrument activity detection in polyphonic music using deep neural networks. In Proc. Int. Soc. Music Information Retrieval Conf., 2018.

[Hadad et al., 2018] Naama Hadad, Lior Wolf, and Moni Shahar. A two-step disentanglement method. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 772–780, 2018.

[Hung and Yang, 2018] Yun-Ning Hung and Yi-Hsuan Yang. Frame-level instrument recognition by timbre and pitch. In Proc. Int. Soc. Music Information Retrieval Conf., pages 135–142, 2018.

[Hung et al., 2018] Yun-Ning Hung, Yi-An Chen, and Yi-Hsuan Yang. Learning disentangled representations for timbre and pitch in music audio. arXiv preprint arXiv:1811.03271, 2018.

[Hung et al., 2019] Yun-Ning Hung, Yi-An Chen, and Yi-Hsuan Yang. Multitask learning for frame-level instrument recognition. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 2019.

[Kaliakatsos-Papakostas et al., 2017] Maximos Kaliakatsos-Papakostas, Marcelo Queiroz, Costas Tsougras, and Emilios Cambouropoulos. Conceptual blending of harmonic spaces for creative melodic harmonisation. J. New Music Research, 46(4):305–328, 2017.

[Lee et al., 2018] Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Diverse image-to-image translation via disentangled representations. In Proc. European Conf. Computer Vision, 2018.

[Liu et al., 2018a] Yang Liu, Zhaowen Wang, Hailin Jin, and Ian Wassell. Multi-task adversarial network for disentangled feature learning. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 3743–3751, 2018.

[Liu et al., 2018b] Yu Liu, Fangyin Wei, Jing Shao, Lu Sheng, Junjie Yan, and Xiaogang Wang. Exploring disentangled feature representation beyond face identification. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 2080–2089, 2018.

[Liu et al., 2019] Jen-Yu Liu, Yi-Hsuan Yang, and Shyh-Kang Jeng. Weakly-supervised visual instrument-playing action detection in videos. IEEE Trans. Multimedia, 2019.

[Lu et al., 2019] Chien-Yu Lu, Min-Xin Xue, Chia-Che Chang, Che-Rung Lee, and Li Su. Play as you like: Timbre-enhanced multi-modal music style transfer. In Proc. AAAI Conf. Artificial Intelligence, 2019.

[McFee et al., 2015] Brian McFee, Colin Raffel, Dawen Liang, Daniel P. W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in Python. In Proc. Python in Science Conf., 2015.

[Oquab et al., 2015] Maxime Oquab, Léon Bottou, Ivan Laptev, and Josef Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 685–694, 2015.

[Pachet, 2016] François Pachet. A joyful ode to automatic orchestration. ACM Trans. Intell. Syst. Technol., 8(2):18:1–18:13, 2016.

[Pati, 2018] Ashis Pati. Neural style transfer for musical melodies, 2018. [Online] https://ashispati.github.io/style-transfer/.

[Raffel and Ellis, 2016] Colin Raffel and Daniel P. W. Ellis. Optimizing DTW-based audio-to-MIDI alignment and matching. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, pages 81–85, 2016.

[Roberts et al., 2017] Adam Roberts, Jesse Engel, and Douglas Eck. Hierarchical variational autoencoders for music. In Proc. NIPS Workshop on Machine Learning for Creativity and Design, 2017.

[Ronneberger et al., 2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Proc. Int. Conf. Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
