HIGH-LEVEL CONTROL OF DRUM TRACK GENERATION USING LEARNED PATTERNS OF RHYTHMIC INTERACTION

Stefan Lattner¹, Maarten Grachten²
¹ Sony Computer Science Laboratories (CSL), Paris, France
² Contractor for Sony CSL, Paris, France

2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, October 20-23, 2019, New Paltz, NY

ABSTRACT

Spurred by the potential of deep learning, computational music generation has gained renewed academic interest. A crucial issue in music generation is that of user control, especially in scenarios where the music generation process is conditioned on existing musical material. Here we propose a model for conditional kick drum track generation that takes existing musical material as input, in addition to a low-dimensional code that encodes the desired relation between the existing material and the new material to be generated. These relational codes are learned in an unsupervised manner from a music dataset. We show that codes can be sampled to create a variety of musically plausible kick drum tracks and that the model can be used to transfer kick drum patterns from one song to another. Lastly, we demonstrate that the learned codes are largely invariant to tempo and time-shift.

1. INTRODUCTION

A crucial issue in music generation is that of user control. Especially for problems where musical material is to be generated conditioned on existing musical material (so-called conditional generation), it is not desirable for a system to produce its output deterministically. Typically there are multiple valid ways to complement existing material with new material, and a music generation system should reflect that degree of freedom, either by modeling it as a predictive distribution from which samples can be drawn and evaluated by the user, or by letting the generated material depend on some form of user input in addition to the existing material.
An intuitive way to address this requirement is to learn a latent space, for example by means of a variational autoencoder (VAE). This approach has been successfully applied to music generation [1, 2], and allows for both generation and manipulation of musical material by sampling from the latent prior, manual exploration of the latent space, or some form of local neighborhood search or interpolation.

In this paper we also take a latent space learning approach to address the issue of control over music generation. More specifically, we propose a model architecture to learn a latent space that encodes rhythmic interactions of the kick drum vs. bass and snare patterns. The architecture is a convolutional variant of a Gated Autoencoder (GAE, see Section 3). This architecture can be thought of as a feed-forward neural network where the weights are modulated by learned mapping codes [3]. Each mapping code captures local relations between kick vs. bass and snare inputs, such that an entire track is associated with a sequence of mapping codes.

Since we want mapping codes to capture rhythmic patterns rather than just the instantaneous presence or absence of onsets in the tracks, during training we enforce invariance of mapping codes to (moderate) time shifts and tempo changes in the inputs. The resulting mapping codes remain largely constant throughout sections with a stable rhythm. This provides high-level control over the generated material in the sense that different kick drum patterns for some section can be realized simply by selecting a different mapping code (either by sampling it or by inferring it from another section or song), and applying it throughout the section.

To our knowledge this is a novel approach to music generation.
It reconciles the notion of user control with the presence of conditioning material in a musically meaningful way: rather than controlling the characteristics of the generated material directly, it offers control over how the generated material relates to the conditioning material.

Apart from quantitative experiments to show the basic validity of our approach, we validate our model by way of a set of sound examples and visualized outputs. We focus on three scenarios specifically. Firstly, we demonstrate the ability to create a variety of plausible kick drum tracks for a given snare and bass track pair by sampling from a standard multivariate normal distribution in the mapping space. Secondly, we test the possibility of style transfer, by applying rhythmic interaction patterns inferred from one song to induce similar patterns in other songs. Finally, we show that the semantics of the mapping space is invariant under changes in tempo.

In the following we present related work (Section 2), describe the proposed model architecture and data representations (Section 3), and validate the approach (Section 4). Section 5 provides concluding remarks and future work.

2. RELATED WORK

In addition to the VAE-based methods for control over music generation processes mentioned above, a number of other studies have applied deep learning methods to address the problem of music generation in general, as reviewed in [4]. Drum track generation has been tackled using recurrent architectures [5, 6], Restricted Boltzmann Machines [7], and Generative Adversarial Networks (GANs) [8].
Approaches to control the generation process may rely on sampling from some latent representation of the material to be generated [1, 2], possibly in an incremental fashion [9], or conditioning on user-provided information (such as a style label [10], unary [11], or structural [12] constraints). [13] demonstrates style transfer for audio. GANs are used in [8, 14], where the output of the generation process is determined by providing some (time-varying) noise, in combination with conditioning on existing material. Similar to our study, [15] uses a GAE to model relations between musical material in an autoregressive prediction task. To our knowledge this is the first use of GAEs for conditional music generation.

3. METHOD

A schematic overview of the proposed model architecture is shown in Figure 1. For time series modeling, we adapt the common dense GAE architecture to 1D convolution in time, yielding a Convolutional Gated Autoencoder (CGAE). We aim to model the rhythmic interactions between input signals x ∈ R^{M×T} and a target signal y ∈ R^{1×T}. More precisely, x represents M 1D signals of length T indicating onset functions of instrument tracks and beat and downbeat information of a song, while y represents the onset function of a target instrument. Then the rhythmic interactions (henceforth referred to as mappings) between x and y are defined as

    m = W ∗ (U ∗ x · V ∗ y),    (1)

where m ∈ R^{Q×T}, and U ∈ R^{K×M×R}, V ∈ R^{K×1×R} represent respectively K convolution kernels for M input maps and kernel size R, and W ∈ R^{Q×K×1} represents Q convolution kernels for K input maps and kernel size 1. Furthermore, ∗ is the convolution operator and · is the Hadamard product. For brevity, the notation above assumes a CGAE architecture with only one mapping layer and one layer for input and target.
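The mapping computation of Eq. (1) can be sketched in plain NumPy. The `conv1d` helper below is a hypothetical minimal stand-in for a framework's 1D convolution (causal left-padding is assumed so the output length stays T); it is an illustration of the gating structure, not the authors' implementation:

```python
import numpy as np

def conv1d(w, x):
    # w: (out_ch, in_ch, R) kernel stack; x: (in_ch, T) signal.
    # Causal left-padding keeps the output length equal to T.
    out_ch, in_ch, R = w.shape
    T = x.shape[1]
    xp = np.pad(x, ((0, 0), (R - 1, 0)))
    y = np.zeros((out_ch, T))
    for r in range(R):
        y += w[:, :, r] @ xp[:, r:r + T]
    return y

def mappings(W, U, V, x, y):
    # Eq. (1): m = W * (U * x  .  V * y), "." being the Hadamard product
    return conv1d(W, conv1d(U, x) * conv1d(V, y))
```

Note how the Hadamard product gates the filtered context (U ∗ x) by the filtered target (V ∗ y) before the 1×1 convolution W projects the K gated maps down to the Q mapping dimensions.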
In practice we use several convolutional layers, as described in Section 3.1.

Given the rhythmic interactions m and the rhythmic context x, the target onset function is reconstructed as

    ỹ = Vᵀ ∗ (U ∗ x · Wᵀ ∗ m),    (2)

where the transposed kernels Vᵀ and Wᵀ result in a deconvolution. The model parameters are trained by minimizing the mean squared error L_mse(y, ỹ) between the target signal y and its reconstruction ỹ.

In order to draw samples from the model, we want to impose a Gaussian prior over m, resulting in p(m) = N(0, I). Additionally, m should apply to any input x, and should therefore not contain any information about the content of x. These conditions are imposed using adversarial training [16]: a discriminator D(·) estimates whether its input is drawn from a Gaussian distribution and contains no information about x. To that end, we concatenate (U ∗ x) with either actual mappings m or noise drawn from an independent Gaussian distribution η ∼ N(0, I), η ∈ R^{Q×T}. This results in D(m, (U ∗ x)) and D(η, (U ∗ x)). In adversarial training, the discriminator D(·) learns to distinguish between the input containing m and the input containing η. If there is mutual information between (U ∗ x) and m, the discriminator can exploit this for its classification task. This causes the encoding pathways to remove any information about x from m. Also, we obtain m ∼ N(0, I). Accordingly, the discriminator is trained to minimize the loss

    L_advers = (1/T) Σ_t [ D(m, (U ∗ x))_t − D(η, (U ∗ x))_t ],    (3)

with D(·)_t being the output of the discriminator at time t.

[Figure 1: The proposed model architecture.]

To make the mappings more constant over time, an additional loss penalizes differences of successive mappings: L_const = (1/T) Σ_t (m_t − m_{t+1})², with m_t ∈ R^Q.
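The smoothness penalty L_const is straightforward to state in code. This is a minimal sketch (summing the squared difference over the Q mapping dimensions at each time step, then averaging over time); the function name is ours:

```python
import numpy as np

def l_const(m):
    # m: (Q, T) mappings over time.
    # Mean over time of the squared norm of successive-mapping differences,
    # encouraging codes that stay constant within a rhythmic section.
    return np.mean(np.sum((m[:, 1:] - m[:, :-1]) ** 2, axis=0))
```

A mapping sequence that is constant over time incurs zero penalty, which is exactly the regime the model is pushed toward within rhythmically stable sections.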
A further loss that constrains each map m_q ∈ R^T to have zero mean and unit variance over time and instances in a batch considerably improves the learning of the CGAE:

    L_std = (1/Q) Σ_q [ ( (1/N) Σ_i (m_{q,i} − µ_q)² − 1 )² + µ_q² ],    (4)

where m_{q,i} are the observations of convolutional map m_q over all time steps and instances in a batch, and µ_q is the mean of m_{q,i}.

Optimization is performed in two steps per mini-batch. First, the discriminator is trained to minimize L_advers; then the CGAE is trained to minimize L_mse(y, ỹ) + L_const + L_std − L_advers.

3.1. Architecture and training details

As mentioned above, the weight matrices W, U and V in Eqs. 1 and 2 act as placeholders for several convolutional layers. For U and V, 8 convolutional layers are defined, with {32, 32, 64, 64, 64, 128, 128, 256} output units, kernel size 2, and dilations which double for each layer (i.e., 1, 2, 4, 8, ...). The first 5 layers keep the 4 inputs (onset strength snare, onset strength bass, beats, downbeats) separated (i.e., their units are separated into 4 groups, where each group only pools over 1/4th of the input maps), and the information is combined only in the two top-most layers. W is a placeholder for 6 layers, with sizes {128, 128, 64, 32, 32, 16} and kernel size 1. The discriminator D(·) consists of 5 layers with {256, 128, 64, 64, 1} maps and kernel size 1. All stacks described above have no non-linearity in the output and SELU non-linearity [17] between layers (also for the deconvolution passes). The model is trained for 2500 epochs with batch size 100, using 50% dropout on the inputs x.

During training, a data-augmentation-based regularization method is used to make the mappings invariant to time shift and tempo change.
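Eq. (4) can be sketched directly in NumPy. In this hedged sketch, the observations of each map q are pooled over the batch and time axes before computing the mean and variance penalties; the function name and the (N, Q, T) layout are our assumptions:

```python
import numpy as np

def l_std(m_batch):
    # m_batch: (N, Q, T) mappings for a mini-batch of N instances.
    # Eq. (4): push each map q toward zero mean and unit variance,
    # measured over all time steps and instances in the batch.
    N, Q, T = m_batch.shape
    obs = m_batch.transpose(1, 0, 2).reshape(Q, -1)     # (Q, N*T) observations
    mu = obs.mean(axis=1)                               # per-map mean
    var = ((obs - mu[:, None]) ** 2).mean(axis=1)       # per-map variance
    return np.mean((var - 1.0) ** 2 + mu ** 2)
```

The loss is zero exactly when every map is already standardized, which complements the Gaussian prior that the adversarial loss imposes on m.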
To that end we define a transformation function ψ_θ(z) that shifts and scales a signal z in the time dimension, with random shifts between −150 and +150 time steps (±1.75 s) and random scale factors between 0.8 and 1.2. Training is then performed as follows. First, the mappings m are inferred according to Eq. 1. Then, the input signals are modified using ψ_θ(·), resulting in an altered Eq. 2:

    ỹ_{ψ_θ} = Vᵀ ∗ (U ∗ ψ_θ(x) · Wᵀ ∗ m).    (5)

Finally, the mean squared error between the reconstruction obtained this way and the transformed target is minimized as L_mse(ψ_θ(y), ỹ_{ψ_θ}). This approach was first proposed in [18]. Due to the gating mechanism · (activating only pathways with appropriate tempo and time-shift), a CGAE is particularly suited for learning such invariances. By imposing time-shift invariance, we assume that rhythmic interaction patterns (and the respective mappings) in the training data are locally constant. Even if this method introduces some error at positions where rhythmic patterns change, most of the time the assumption of a locally constant rhythm is valid.

3.2. Data representation

The training/validation sets consist of 665/193 pop/rock/electro songs where the rhythm instruments bass, kick and snare are available as separate 44.1 kHz audio tracks. The context signals x consist of two 1D input maps for beat and downbeat probabilities, and two 1D input maps for the onset functions of snare and bass. The target signal y consists of a 1D onset function of the kick drum. The onset functions are extracted using the ComplexDomainOnsetDetection feature of the publicly available Yaafe library¹ with a block size of 1024, a step size of 512, and a Hann window function. For the downbeat functions we use the downbeat estimation RNN of the madmom library². Input signals are individually standardized to zero mean and unit variance over each song.
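One plausible realization of ψ_θ for a 1D signal is linear interpolation with zero-padding outside the signal; the paper does not specify the interpolation scheme, so the following is a hypothetical sketch:

```python
import numpy as np

def psi(z, shift, scale):
    # Hypothetical sketch of the transformation psi_theta:
    # time-shift `z` by `shift` samples and time-scale it by `scale`,
    # via linear interpolation, zero outside the original signal.
    T = z.shape[-1]
    src = np.arange(T) / scale - shift      # source position per output step
    return np.interp(src, np.arange(T), z, left=0.0, right=0.0)
```

During training, `shift` would be drawn uniformly from [−150, 150] and `scale` from [0.8, 1.2], per the ranges stated above.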
¹ http://yaafe.sourceforge.net
² https://github.com/CPJKU/madmom

3.3. Rendering Audio

We create an actual kick drum track from an onset strength curve y using salient peak picking. First, we remove less salient peaks from y by zero-phase filtering with a low-pass Butterworth filter of order two and a critical frequency of half the Nyquist frequency. The local maxima of the smoothed curve are then thresholded, discarding all maxima below a certain proportion (see below) of the maximum of the standardized onset strength curve. The remaining peaks are selected as onset positions. Finally, we render an audio file by placing a "one-shot" drum sample on all remaining peaks after thresholding. We introduce dynamics by choosing the volume of the sample from 70% for peaks at the threshold to 100% for peaks with the maximum value. For the qualitative experiments in the following section, we manually choose the threshold between 15% and 50%. For the quantitative results in Table 1 we fix the threshold at 25%, but values of 20% and 30% yield similar figures.

4. EXPERIMENTS

For the qualitative experiments we use four songs, Gipsy Love, Orgs Waltz, Miss You and Drehscheibe, produced by the first author. We encourage the reader to listen to the results on the accompanying web page³. Three scenarios are chosen to show the effectiveness of the proposed approach:

Conditional Generation of Drum Patterns. To generate a kick drum track, we sample only one mapping code m_t (from a 16-dimensional standard Gaussian), repeat it across the time dimension, and reconstruct y given the resulting m, as well as x. Subsequently, we render 20 audio files as described in Section 3.3 and pick the 10 which together constitute the most varied set. Figure 2 shows some results of the generation task: randomly generated kick drum tracks conditioned on the song Drehscheibe (the sound examples are available online).
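The peak-picking step of Section 3.3 can be sketched with SciPy's standard filter-design and peak-finding routines; this is an approximation of the described procedure (the exact implementation is not published), with the threshold defaulting to the 25% used for Table 1:

```python
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

def pick_onsets(y, threshold=0.25):
    # Zero-phase low-pass filtering (Butterworth, order 2, critical
    # frequency at half the Nyquist frequency), then keep local maxima
    # above `threshold` times the maximum of the smoothed curve.
    b, a = butter(2, 0.5)
    smooth = filtfilt(b, a, y)
    peaks, _ = find_peaks(smooth)
    return peaks[smooth[peaks] >= threshold * smooth.max()]
```

`filtfilt` applies the filter forward and backward, giving the zero-phase behavior the text requires (the smoothed maxima stay aligned with the original onset positions).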
It is clear from these screenshots that the model generates a wide variety of different rhythmic patterns which adapt to the local context, even though the sampled mapping code is constant (repeated) over time.

Style Transfer. First, for a given song, we infer m from x and y. Second, k-means clustering is performed over all m_t, using the Davies-Bouldin score [19] for determining the optimal number of clusters (typically yielding an optimal k between 5 and 8). Then we use the cluster center of the largest cluster found as the mapping code, again repeat it over time, and use it for another song onto which the style should be transferred. Again, the results are available on the accompanying web page (see above).

Tempo-Invariance. We use the WSOLA time stretching algorithm [20], as implemented in sox, to create four time-stretched versions of each song, at 80%, 90%, 110% and 120% of the original tempo, respectively. Then, for a given song in its original tempo, we determine a prototypical mapping code with the k-means clustering method described above. We repeat that code throughout the time-stretched versions of the song and reconstruct y given m and x. Figure 3 shows generated kick drum tracks in the five different tempos (four time-stretched versions plus the original tempo).

³ https://sites.google.com/view/drum-generation

[Figure 2: Conditionally sampled kick drum tracks for the song Drehscheibe. Each track is the result of a sampled "transfer function" m_t which is held constant over time.]

[Figure 3: Generated drum tracks for different tempos (80%, 90%, 100%, 110% and 120% of the original tempo) for the song Orgs Waltz, using the mapping code of the original tempo. Top: the overall song; Bottom: close-up of the first onsets, manually aligned for the purpose of visualization.]
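Given cluster assignments from a k-means run (e.g. scikit-learn's `KMeans`, with k chosen by the Davies-Bouldin score), the prototypical-code selection reduces to taking the centroid of the most populated cluster. A minimal sketch, with our own function name and a (T, Q) code layout assumed:

```python
import numpy as np

def largest_cluster_centroid(m, labels):
    # m: (T, Q) mapping codes over time; labels: (T,) cluster assignment
    # per time step. Returns the centroid of the most populated cluster,
    # to be repeated over time as the style-transfer mapping code.
    counts = np.bincount(labels)
    biggest = counts.argmax()
    return m[labels == biggest].mean(axis=0)
```

Repeating this single Q-dimensional vector across the time axis yields the constant mapping m used for transfer.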
It is clear from these screenshots that the drum pattern adjusts to the different tempos and does not change its style.

Although it is not obvious how to evaluate the output of the model other than by listening, we can check the validity of basic assumptions about the behavior of the model. One assumption is that the ground truth mappings for a song, as defined in Eq. (1), allow us to reconstruct the drum track relatively faithfully (Eq. (2)). To test the degree to which reconstruction may be sacrificed to satisfy other constraints (e.g., the adversarial loss), we compute the accuracy of the reconstruction. Given onset strength curves y and ỹ, we determine the onset positions as described in Section 3.3, and compute the precision, recall, and F-score using a 50 ms tolerance window, following MIREX onset detection evaluation criteria [21]. Table 1 (upper half) lists the results for the training and validation sets and shows that the mappings are specific enough to largely reconstruct the target onsets correctly.

                   Precision   Recall   F-Score
  Ground truth
    Training         0.946      0.812    0.865
    Validation       0.943      0.816    0.867
  Style transfer
    Training         0.774      0.696    0.712
    Validation       0.781      0.707    0.723

Table 1: Average precision, recall, and F-score for onset reconstruction using ground truth and style transfer mappings.

The reconstructions are not perfect, likely due to the model's invariance and the adversarial loss on the mappings. Note also that the accuracy for the validation set is similar to that for the training set, implying that no overfitting has occurred. The dominance of precision over recall is likely due to the typical "conservative" behavior of GAEs [22].

Furthermore, we test the validity of the heuristic of taking the largest cluster centroid as a constant mapping vector over time for style transfer (assuming time-invariance). To do so, we apply this heuristic to transfer the style of a song to itself.
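Onset evaluation with a tolerance window can be sketched as a greedy one-to-one matching of estimated to reference onsets; this is our own minimal implementation of the MIREX-style criterion (each reference onset may be matched at most once, tolerance ±50 ms):

```python
import numpy as np

def onset_prf(ref, est, tol=0.05):
    # ref, est: onset times in seconds; tol: tolerance window (50 ms).
    # Greedily match each estimated onset to the nearest unused reference.
    ref, est = np.sort(np.asarray(ref)), np.sort(np.asarray(est))
    used = np.zeros(len(ref), dtype=bool)
    tp = 0
    for e in est:
        d = np.abs(ref - e)
        d[used] = np.inf
        if len(ref) and d.min() <= tol:
            used[d.argmin()] = True
            tp += 1
    prec = tp / len(est) if len(est) else 0.0
    rec = tp / len(ref) if len(ref) else 0.0
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f
```

A spurious extra onset lowers precision only, while a missed onset lowers recall only, which is why the "conservative" behavior noted above shows up as precision exceeding recall.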
That is, to reconstruct the kick drum track we use the largest mode of the song in the mapping space as a constant through time, rather than the ground truth mapping, which is a trajectory through the mapping space. Unsurprisingly, this approximation affects the reconstruction of the original kick drum track negatively, but F-scores of over 0.7 still show that a substantial part of the tracks is reconstructed correctly.

5. CONCLUSIONS AND FUTURE WORK

We have presented a model for the conditional generation of kick drum tracks given snare and bass tracks in pop/rock/electro music. The model was trained on a dataset of multi-track recordings, using a custom objective function to capture the relationship between onset patterns in the tracks of the same song in mapping codes. We have shown that the mapping codes are largely tempo- and time-invariant, and that musically plausible kick drum tracks can be generated given a snare and bass track, either by sampling a mapping code or through style transfer, by inferring the mapping code from another song.

Importantly, two basic aspects of the chosen approach have been shown to be valid. Firstly, the ground-truth mapping codes are able to faithfully reconstruct the original kick drum track. Secondly, the style transfer heuristic of applying a constant mapping code through time was shown to be largely valid, by comparing the original kick drum track of a song to the result of applying the style of a song to itself.

Although the current work is limited in the sense that the model has only been demonstrated for kick drum track generation, we believe this approach is applicable to other content. We are currently applying the same approach to snare drum generation and f0-generation for bass tracks.
6. ACKNOWLEDGEMENTS

We thank Cyran Aouameur for his valuable support, as well as Adonis Storr, Tegan Koster, Stefan Weißenberger and Clemens Riedl for their contribution in producing the example tracks.

7. REFERENCES

[1] A. Roberts, J. Engel, C. Raffel, C. Hawthorne, and D. Eck, "A hierarchical latent vector model for learning long-term structure in music," in Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholm, Sweden, 2018, pp. 4361–4370.
[2] I. Simon, A. Roberts, C. Raffel, J. Engel, C. Hawthorne, and D. Eck, "Learning a latent space of multitrack measures," arXiv preprint, 2018.
[3] R. Memisevic, "Gradient-based learning of higher-order image features," in IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, 2011, pp. 1591–1598.
[4] J. Briot, G. Hadjeres, and F. Pachet, "Deep learning techniques for music generation - A survey," CoRR, vol. abs/1709.01620, 2017. [Online]. Available: http://arxiv.org/abs/1709.01620
[5] D. Makris, M. Kaliakatsos-Papakostas, I. Karydis, and K. L. Kermanidis, "Conditional neural sequence learners for generating drums' rhythms," Neural Computing and Applications, pp. 1–12, 2018.
[6] D. Makris, M. A. Kaliakatsos-Papakostas, I. Karydis, and K. L. Kermanidis, "Combining LSTM and feed forward neural networks for conditional rhythm composition," in EANN, ser. Communications in Computer and Information Science, vol. 744. Springer, 2017, pp. 570–582.
[7] R. Vogl and P. Knees, "An intelligent drum machine for electronic dance music production and performance," in 17th International Conference on New Interfaces for Musical Expression, NIME 2017, Copenhagen, Denmark, 2017, pp. 251–256. [Online]. Available: http://www.nime.org/proceedings/2017/nime2017_paper0047.pdf
[8] H.-W. Dong, W.-Y. Hsiao, L.-C. Yang, and Y.-H. Yang, "MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[9] G. Hadjeres, F. Pachet, and F. Nielsen, "DeepBach: A steerable model for Bach chorales generation," in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, Australia, 2017, pp. 1362–1371. [Online]. Available: http://proceedings.mlr.press/v70/hadjeres17a.html
[10] H. H. Mao, T. Shin, and G. Cottrell, "DeepJ: Style-specific music generation," in 2018 IEEE 12th International Conference on Semantic Computing (ICSC). IEEE, 2018, pp. 377–382.
[11] G. Hadjeres and F. Nielsen, "Anticipation-RNN: enforcing unary constraints in sequence generation, with application to interactive music generation," Neural Computing and Applications, pp. 1–11, 2018.
[12] S. Lattner, M. Grachten, and G. Widmer, "Imposing higher-level structure in polyphonic music generation using convolutional restricted Boltzmann machines and constraints," Journal of Creative Music Systems, vol. 3(1), 2018. [Online]. Available: http://jcms.org.uk/issues/Vol3Issue1/
[13] E. Grinstein, N. Q. K. Duong, A. Ozerov, and P. Pérez, "Audio style transfer," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018, Calgary, AB, Canada, 2018, pp. 586–590.
[14] H. Liu and Y. Yang, "Lead sheet generation and arrangement by conditional generative adversarial network," in 17th IEEE International Conference on Machine Learning and Applications (ICMLA), 2018, pp. 722–727.
[15] S. Lattner, M. Grachten, and G. Widmer, "A predictive model for music based on learned interval representations," in Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, Paris, France, 2018. [Online]. Available: http://ismir2018.ircam.fr/doc/pdfs/179_Paper.pdf
[16] A. Makhzani, J. Shlens, N. Jaitly, and I. J. Goodfellow, "Adversarial autoencoders," CoRR, vol. abs/1511.05644, 2015. [Online]. Available: http://arxiv.org/abs/1511.05644
[17] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, "Self-normalizing neural networks," in Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, 2017, pp. 972–981. [Online]. Available: http://papers.nips.cc/paper/6698-self-normalizing-neural-networks
[18] S. Lattner, M. Grachten, and G. Widmer, "Learning transposition-invariant interval features from symbolic music and audio," in Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, Paris, France, 2018. [Online]. Available: http://ismir2018.ircam.fr/doc/pdfs/172_Paper.pdf
[19] D. L. Davies and D. W. Bouldin, "A cluster separation measure," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, no. 2, pp. 224–227, 1979.
[20] W. Verhelst and M. Roelands, "An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech," in IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, April 1993, pp. 554–557.
[21] "MIREX Onset Detection Task," https://www.music-ir.org/mirex/wiki/2018:Audio_Onset_Detection, 2018.
[22] R. Memisevic, "On multi-view feature learning," in Proceedings of the 29th International Conference on Machine Learning (ICML-12), 2012, pp. 161–168.