AUDIO-TO-SCORE ALIGNMENT OF PIANO MUSIC USING RNN-BASED AUTOMATIC MUSIC TRANSCRIPTION

Taegyun Kwon, Dasaem Jeong, and Juhan Nam
Graduate School of Culture Technology, KAIST
{ilcobo2, jdasam, juhannam}@kaist.ac.kr

ABSTRACT

We propose a framework for audio-to-score alignment of piano performances that employs automatic music transcription (AMT) using neural networks. Even though the AMT result may contain some errors, the note prediction output can be regarded as a learned feature representation that is directly comparable to MIDI note or chroma representations. To this end, we employ two recurrent neural networks that serve as AMT-based feature extractors for the alignment algorithm. One predicts the presence of 88 notes or 12 chroma at the frame level, and the other detects note onsets in 12 chroma. We combine the two types of learned features for the audio-to-score alignment. For comparability, we apply dynamic time warping as the alignment algorithm without any additional post-processing. We evaluate the proposed framework on the MAPS dataset and compare it to previous work. The results show that the alignment framework with the learned features significantly improves the accuracy, achieving less than 10 ms mean onset error.

1. INTRODUCTION

Audio-to-score alignment (also known as score following) is the process of temporally fitting a music performance audio to its score. The task has been explored for quite a while and has been used mainly for interactive music applications, for example automatic page turning, computer-aided accompaniment, or interactive interfaces for active music listening [1, 2]. Another use case of audio-to-score alignment is performance analysis, which examines a performer's interpretation of music pieces in terms of tempo, dynamics, rhythm, and other musical expressions [3]. To this end, the alignment result must be sufficiently precise, with high temporal resolution. It was reported that the just-noticeable difference (JND) in time displacement of a tone presented in a metrical sequence is about 10 ms for short notes [4], which is beyond the accuracy of current automatic alignment algorithms. This challenge provided the motivation for our research.

There are two main components in audio-to-score alignment: the features used to compare audio and score, and the alignment algorithm between the two feature sequences. In this paper, we limit our scope to the feature part. A typical approach is converting the MIDI score to synthesized audio and comparing it to the performance audio using various audio features. The most common choices are time-frequency representations obtained through the short-time Fourier transform (STFT) [5] or auditory filter bank responses [6]. Others suggested chroma audio features, which are designed to minimize differences in acoustic quality between two piano recordings such as timbre, dynamics, and sustain effects [6]. However, designing such features by hand relies on trial and error, and is therefore time-consuming and sub-optimal.
Another approach to audio-to-score alignment is converting the performance audio to MIDI using an automatic music transcription (AMT) system and comparing the performance to the score in the MIDI domain [7]. The advantage of this approach is that the transcribed MIDI is robust to timbre and dynamics variations by the nature of the AMT system, provided that it predicts only the presence of notes. In addition, no synthesis step is required. However, the AMT system must have high performance to predict notes accurately, which is itself a challenging task.

In this paper, we follow the AMT-based approach for audio-to-score alignment. To this end, we build two AMT systems by adapting a state-of-the-art method using recurrent neural networks [8] with a few modifications. One system takes spectrograms as input and is trained in a supervised manner to predict a binary representation of MIDI as either 88 notes or 12 chroma. The prediction does not consider the intensities of notes, i.e., MIDI velocity. Using this system alone, however, does not provide precise alignment because onset frames and sustain frames are treated as equally important. To make up for this limitation, we use another AMT system that is trained to predict the onsets of MIDI notes in the chroma domain. This was inspired by the Decaying Locally-adaptive Normalized Chroma Onset (DLNCO) feature by Ewert et al. [6]. Following this idea, we employ decaying chroma note onset features, which turn out not only to offer temporally precise anchor points but also to make onset frames salient. Finally, we combine the two MIDI-domain features and run a dynamic time warping algorithm on the feature similarity matrix. The evaluation on the MAPS dataset shows that our proposed framework significantly improves the alignment accuracy compared to previous work.

2. SYSTEM DESCRIPTION

The proposed framework is illustrated in Figure 1. The left-hand side shows the two independent AMT systems that return either 88-note or chroma output and chroma onset output, respectively. The outputs are concatenated and aligned with the score MIDI through dynamic time warping (DTW). Since our main idea is not to improve the performance of the AMT system but rather to utilize a neural-network-based system that produces features for audio-to-score alignment, we borrowed the state-of-the-art AMT system proposed by Böck and Schedl [8]. However, we slightly modified the training setting for our purpose.

[Figure 1. Flow diagram of the proposed audio-to-score alignment system]

2.1 Pre-processing

As mentioned above, our AMT system is based on the existing model. Therefore, we used the same multi-resolution STFT with semitone-spaced logarithmic compression. The model first receives audio waveforms as input and computes two short-time Fourier transforms (STFT), one with a short window (2048 samples, 46.4 ms) and the other with a long window (8192 samples, 185.8 ms), both with the same hop size (441 samples, 10 ms). The STFT with the short window gives temporally sensitive output, while the one with the longer window offers better frequency resolution. A Hamming window was applied to the signal before the STFT. We take only the magnitude of the STFT, thereby obtaining spectrograms at 100 frames per second. To reflect the logarithmic characteristics of sound intensity, a log-like compression with a multiplication factor of 1000 is applied to the magnitude spectrograms. We then reduce the dimensionality of the input by filtering with semitone filterbanks. The center frequencies are distributed according to the frequencies of the 88 MIDI notes, and the filters are overlapping triangles. This process is not only effective for reducing the input size but also for suppressing variance in piano tuning by merging neighboring frequency bins. At low frequencies, some note bins become completely zero or a linear summation of neighboring notes due to the low frequency resolution of the spectrogram. We remove those dummy note bins, leaving 183 dimensions in total. We augmented the input by concatenating it with the first-order difference of the semitone-filtered spectrogram. We observed a significant increase in transcription performance with this addition.
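As a rough illustration of this front end, the following Python sketch computes the two log-compressed, semitone-filtered magnitude spectrograms with librosa and NumPy. It is a reconstruction under stated assumptions, not the authors' code: the triangular filters are assumed to span one semitone to either side of each note, the removal of dummy low-frequency bins is omitted (so the dimensionality differs from the 183 bins in the paper), and 'performance.wav' is a placeholder filename.

```python
import numpy as np
import librosa

SR = 44100   # sampling rate assumed from the 441-sample (10 ms) hop
HOP = 441    # 100 frames per second

def semitone_filterbank(n_fft, sr=SR):
    """Triangular filters centered on the 88 MIDI note frequencies."""
    midi = np.arange(21, 109)                      # A0..C8
    centers = 440.0 * 2.0 ** ((midi - 69) / 12.0)
    fft_freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    fb = np.zeros((len(midi), len(fft_freqs)))
    for i, f in enumerate(centers):
        lo, hi = f / 2 ** (1 / 12), f * 2 ** (1 / 12)   # neighboring semitones
        fb[i] = np.clip(np.minimum((fft_freqs - lo) / (f - lo),
                                   (hi - fft_freqs) / (hi - f)), 0.0, None)
    return fb

def frontend(y, n_fft):
    """Log-compressed, semitone-filtered magnitude spectrogram (100 fps)."""
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=HOP, window='hamming'))
    X = semitone_filterbank(n_fft) @ S              # (88, frames)
    return np.log10(1.0 + 1000.0 * X)               # compression, factor 1000

y, _ = librosa.load('performance.wav', sr=SR)       # placeholder input file
feat = np.vstack([frontend(y, 2048), frontend(y, 8192)])    # short + long window
# Augment with the first-order difference, as described above.
feat = np.vstack([feat, np.diff(feat, axis=1, prepend=feat[:, :1])])
```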
2.2 Neural Network

The Böck and Schedl model is a recurrent neural network (RNN) with the Long Short-Term Memory (LSTM) architecture. Compared to feedforward neural networks, RNNs are capable of learning the temporal dependencies of sequential data, a property found in music audio. The LSTM unit has a memory block that is updated only when an input or forget gate is open, so gradients can propagate through the memory cells without being multiplied at every time step. This property enables LSTMs to learn long-term dependencies. In our task, the LSTM is expected to learn the continuity of onset, sustain, and offset within a note, as well as the relations among notes. The LSTM units are also set to be bidirectional, meaning that the input sequence is presented not only in order but also in the opposite direction. With the forward and backward layers together, the network can access both the history and the future of a given time frame.

While the Böck and Schedl model used a single network that predicts 88 notes, we use two types of networks: one predicts 88 notes or 12 chroma, and the other predicts 12 chroma onsets. For the 88-note network, we reduced the size to two layers of 200 LSTM units, as this performed better in our experiments. For the 12-chroma network, we downsized it further, with 100 LSTM units in the first layer and 50 in the second. On top of the LSTM networks, a fully connected layer with sigmoid activation units is added as the output layer. Each output unit corresponds to one MIDI note or chroma (i.e., the pitch class of a MIDI note).
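A minimal sketch of such a network in PyTorch (not the authors' implementation) is given below. The layer sizes follow the 88-note variant described above; interpreting the 200 units as the per-direction width is our assumption, and the chroma variant with unequal layer sizes (100 and 50) would need two stacked LSTM modules instead of the num_layers argument.

```python
import torch
import torch.nn as nn

class TranscriptionRNN(nn.Module):
    """Bidirectional two-layer LSTM with frame-wise sigmoid outputs.

    Input: 183 semitone bins plus their first-order difference (366 dims).
    Output: per-frame probabilities for 88 notes (or 12 chroma / onsets).
    """
    def __init__(self, n_in=366, n_hidden=200, n_out=88):
        super().__init__()
        self.lstm = nn.LSTM(n_in, n_hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * n_hidden, n_out)  # 2x: forward + backward states

    def forward(self, x):                   # x: (batch, frames, features)
        h, _ = self.lstm(x)                 # h: (batch, frames, 2 * n_hidden)
        return torch.sigmoid(self.fc(h))    # frame-wise note probabilities
```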
2.2.1 Backpropagation

In theory, an LSTM can learn arbitrarily long temporal dependencies through backpropagation through time (BPTT) with the desired number of time steps. In practice, this requires large memory and heavy computation, because the entire network history within the backpropagation length must be stored and updated. To overcome this difficulty, truncated backpropagation [9] is usually applied to long sequences. In truncated backpropagation, input sequences are divided into shorter segments, and the last state of each segment is transferred to the next. Even though backpropagation is computed only within each segment, it thereby serves as an approximation to full-length backpropagation. For a bidirectional network, however, the backward pass requires computation over the entire future, so truncated backpropagation still demands large memory. To imitate the advantage of truncated backpropagation within our computational budget, we instead split the input sequence into relatively long segments and perform full-length backpropagation within each segment.

We conducted a grid search over segment lengths from 10 to 300 frames (100 to 3000 ms) and finally settled on 50 frames (500 ms). This was long enough to capture the continuity of individual notes while not being computationally expensive. We also conducted a comparative experiment between a unidirectional model with truncated backpropagation and a bidirectional model with non-transferred segmentation; the bidirectional model performed better.

To reduce the amount of computation, our model works in a sequence-to-sequence manner; that is, the output of the network is a sequence of the same length as the input segment. As a result, frames at the edges of a segment have only a one-sided context window. We observed that errors frequently occur on such frames, as shown in Figure 2b. To tackle this problem, we split the input sequence into 50%-overlapped segments and take only the middle part of the output of each segment. This procedure significantly improves the transcription result, as shown in Figure 2c.

[Figure 2. Examples of bidirectional LSTM networks that predict 12 chroma: (a) ground truth, (b) without overlapping segmentation, (c) with overlapping segmentation. Dotted lines indicate the boundaries of segments.]
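The overlapped decoding could be sketched as follows, reusing the TranscriptionRNN above. The exact split points of the "middle part" are our assumption (the central half of each segment, with the outer edges kept only for the first and last segments), and tail frames of a sequence whose length does not fit the hop grid are ignored for brevity.

```python
import numpy as np
import torch

def predict_overlapped(model, features, seg_len=50):
    """Run the model on 50%-overlapped segments, keeping only the central
    part of each output, where context is available on both sides.
    features: (frames, dims) float array at 100 fps."""
    hop = seg_len // 2                      # 50% overlap
    quarter = seg_len // 4
    n = len(features)
    out = np.zeros((n, model.fc.out_features), dtype=np.float32)
    x = torch.from_numpy(features).float()
    with torch.no_grad():
        for start in range(0, n - seg_len + 1, hop):
            y = model(x[None, start:start + seg_len])[0].numpy()
            lo = 0 if start == 0 else quarter                  # keep left edge of first segment
            hi = seg_len if start + seg_len + hop > n else seg_len - quarter  # and right edge of last
            out[start + lo:start + hi] = y[lo:hi]
    return out
```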
2.2.2 Network Training

To train the networks, we used audio files and aligned MIDI files. The MIDI data was converted into a piano-roll representation with the same frame rate as the input filterbank spectrogram (100 fps). For the 88-note and chroma labels, the elements of the piano-roll representation were set to 1 between note onset and offset, and to 0 otherwise. For the chroma onset labels, only the elements corresponding to note onsets were set to 1. The audio data was normalized to zero mean and unit standard deviation over each filter in the training set.

We used dropout with a ratio of 0.5 and weight regularization with a value of 10^-4 in each LSTM layer; this effectively improved the performance through better generalization. We trained the network with stochastic gradient descent to minimize the binary cross-entropy loss. The learning rate was initially set to 0.1 and iteratively decreased by a factor of 3 whenever no improvement in validation loss was observed for 10 epochs (i.e., early stopping). Training was stopped after six such iterations. Examples of the AMT outputs are presented in Figure 3.

[Figure 3. (a) An excerpt of the music score of Beethoven's 8th sonata. (b)-(d) The prediction outputs of the AMT systems: (b) 88 notes, (c) 12 chroma, (d) 12 chroma onsets.]

To verify the performance, the frame-wise transcription performance of the 88-note AMT system was measured on the test sets. We used a fixed threshold of 0.5 to predict note presence and measured the accuracy with the F-score to make the results comparable to those in [10, 11]. The resulting F-score was 0.7285 on average, which is better than the results of an RNN with basic units [11] and lower than those of fine-tuned frame-wise DNNs and CNNs [10].

2.3 Alignment

The AMT systems return two types of MIDI-level features. The chroma onset features are elongated over 10 frames (100 ms) with decaying weights 1, √0.9, √0.8, ..., √0.1, as proposed in [6]. The resulting features are concatenated with either the 88-note or the 12-chroma AMT output features. The corresponding score MIDI is likewise converted into 88-note (or chroma) representations, and its chroma onsets are elongated in the same manner before being combined. We used the Euclidean distance to measure the similarity between the two combined representations.

We then applied the FastDTW algorithm [12], an approximate method for dynamic time warping (DTW). FastDTW uses an iterative multi-level approach with window constraints to reduce the complexity. Because of the high frame rate of the features, a low-cost algorithm is necessary: while the original DTW algorithm has O(N^2) time and space complexity, FastDTW operates with O(N) complexity at almost the same accuracy. Müller et al. [13] examined a similar multi-level DTW for the audio-to-score alignment task and reported results similar to the original DTW. The radius parameter of the FastDTW algorithm, which defines the window size for finding an optimal path at each resolution refinement, was set to 10 in our experiments.
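Putting the pieces together, the following sketch expresses the feature combination and alignment stage with the fastdtw package. The decay weights follow the values given above; rendering the score MIDI into the same note and onset representations is assumed to happen elsewhere and is not shown.

```python
import numpy as np
from scipy.spatial.distance import euclidean
from fastdtw import fastdtw

DECAY = np.sqrt(np.linspace(1.0, 0.1, 10))   # 1, sqrt(0.9), ..., sqrt(0.1)

def elongate_onsets(onsets):
    """Spread each onset over 10 frames (100 ms) with decaying weights.
    onsets: (frames, 12) binary chroma-onset matrix."""
    out = np.zeros_like(onsets, dtype=float)
    for k, w in enumerate(DECAY):
        shifted = np.roll(onsets, k, axis=0) * w
        shifted[:k] = 0                       # undo the wrap-around of np.roll
        out = np.maximum(out, shifted)
    return out

def align(note_pred, onset_pred, score_notes, score_onsets):
    """Concatenate note and decayed-onset features, then align with FastDTW.
    All inputs are (frames, dims) arrays at 100 fps; the score-side arrays
    come from rendering the score MIDI (helper not shown here)."""
    perf = np.hstack([note_pred, elongate_onsets(onset_pred)])
    score = np.hstack([score_notes, elongate_onsets(score_onsets)])
    # Euclidean frame distance and radius=10, as described above.
    cost, path = fastdtw(perf, score, radius=10, dist=euclidean)
    return path    # list of (performance_frame, score_frame) pairs
```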
3. EXPERIMENTS

3.1 Dataset

We used the MAPS dataset [14], specifically the 'MUS' subset containing full pieces of piano music, for training and evaluation. Each piece consists of audio files and a ground-truth MIDI file. The audio files were rendered from the MIDI with nine different piano and recording-condition settings, which helped our model avoid overfitting to a specific piano tone. The MIDI files served as the ground-truth annotation of the corresponding audio, but some of them (ENSTDkCl and ENSTDkAm) are sometimes temporally inaccurate, by more than 65 ms as described in [15].

We conducted the experiments with 4-fold cross-validation using the training and test splits of configuration I in [11] (http://www.eecs.qmul.ac.uk/sss31/TASLP/info.html). For each fold, 43 pieces were detached from the training set and used for validation. As a result, each fold was composed of 173, 43, and 54 pieces for training, validation, and testing, respectively, as in [10].

3.2 Evaluation method

To evaluate the audio-to-score alignment task, we need another MIDI representation (typically score MIDI) apart from the performance MIDI aligned with the audio. We generated the separate MIDI by changing the intervals between successive sets of concurrent notes. Specifically, we multiplied each interval by a value selected at random between 0.7 and 1.3. This temporal distortion scheme prevents the alignment path from being trivial, and was also employed in previous work [6, 16, 17].

After obtaining the alignment path through DTW, we measured the absolute temporal errors between the estimated note onsets and the ground truth. For each piece in the test set, the mean of the temporal errors and the ratio of correctly aligned notes under varying thresholds were used to summarize the results.
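Both the distortion and the error summary are simple to express; below is a sketch with onset times in seconds and thresholds matching those reported in Section 4. It treats each distinct onset time once, so grouping chord notes onto a shared onset is assumed to be handled by the caller.

```python
import numpy as np

def distort_onsets(onsets, low=0.7, high=1.3, seed=0):
    """Scale each inter-onset interval by a random factor in [0.7, 1.3],
    as in the evaluation protocol described above.
    onsets: sorted 1-D array of onset times in seconds."""
    rng = np.random.default_rng(seed)
    intervals = np.diff(onsets, prepend=0.0)
    return np.cumsum(intervals * rng.uniform(low, high, size=len(intervals)))

def summarize_errors(estimated, reference, thresholds=(0.01, 0.03, 0.05, 0.1)):
    """Mean absolute onset error plus the ratio of notes aligned within
    each threshold (thresholds in seconds)."""
    err = np.abs(np.asarray(estimated) - np.asarray(reference))
    rates = {t: float(np.mean(err <= t)) for t in thresholds}
    return err.mean(), rates
```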
3.3 Compared Algorithms

For performance comparison, we reproduced two alignment algorithms proposed by Ewert et al. [6] and one by Carabias-Orti et al. [18]. We performed the experiments on the same test set using the FastDTW algorithm, without any post-processing. Ewert's algorithms use a hand-crafted chromagram and onset features based on audio filter bank responses. Carabias-Orti's algorithm employs non-negative matrix factorization to learn a spectral basis for each note combination from the spectrogram. The latter is designed only for audio-to-audio alignment, while the former can be applied to both audio-to-audio and audio-to-MIDI alignment. Therefore, we made an audio version of the distorted MIDI using a high-quality sample-based piano synthesizer and used it as input. We tested Ewert's algorithms in both the audio and MIDI cases. The temporal frame rate of the features was adjusted to 100 fps for both algorithms.

For the alignment with Ewert's algorithms, we used the same FastDTW algorithm. Since FastDTW cannot be directly applied to Carabias-Orti's algorithm due to its own distance calculation method, we applied a classic DTW algorithm that computes the entire frame-wise distance matrix. Because of memory limitations, when reproducing Carabias-Orti's algorithm we excluded the 35 test pieces longer than 400 seconds.

Note that even though the evaluation dataset is different, the results of the two reproduced algorithms were similar to those reported in their original works. The mean onset error of Ewert's algorithm on piano music was 19 ms, with a standard deviation of 26 ms [6]. The result in the original Carabias-Orti paper [18] differs rather strongly in terms of mean piecewise error, but we assume the difference is due to the change of test set: the align rates of the original result and of our reproduction were similar (50 ms: 74% vs. 69%; 100 ms: 90% vs. 92%; 200 ms: 95% vs. 96%). Hence, we consider our reproductions reliable for the comparison.

4. RESULTS AND DISCUSSION

4.1 Comparison with Others

Figure 4 shows the results of the audio-to-score alignment for the compared algorithms, as the ratio of correctly aligned onsets (precision) as a function of the error threshold. Typically, a tolerance window of 50 ms is used for evaluation. However, because most notes were aligned within a 50 ms threshold, we varied the width of the tolerance window from 0 ms to 200 ms in 10 ms steps.

[Figure 4. Ratio of correctly aligned onsets as a function of threshold. Each point represents the mean of piecewise precision. Data points with precision lower than 80% are not shown.]

                                 Mean    Median   Std      ≤10ms   ≤30ms   ≤50ms   ≤100ms
Proposed with onset (chroma)     12.83    6.40    56.22    92.01   97.44   98.31   98.98
Proposed with onset (88 note)     8.62    5.57    31.14    91.60   98.00   98.97   99.61
Proposed w/o onset (chroma)      48.01   27.96   152.06    60.66   84.65   89.36   93.72
Proposed w/o onset (88 note)     25.31   18.69    63.26    56.39   86.42   93.05   97.48
Ewert et al. (audio-to-MIDI)     16.44   13.64    32.52    71.78   91.38   95.50   98.03
Ewert et al. (audio-to-audio)    14.66   11.71    25.38    71.53   92.43   96.91   99.13
Carabias-Orti et al.            131.31   49.96   305.52    23.58   49.40   69.30   91.60

Table 1. Piecewise onset errors. Mean, median, and standard deviation of the errors are in milliseconds. The right columns are the ratios of notes (%) aligned within onset errors of 10 ms, 30 ms, 50 ms, and 100 ms, respectively.

Overall, our proposed framework with 88 notes combined with the chroma onsets achieved the best accuracy. Even at zero threshold, which means an exact match at the resolution of our system (10 ms), our proposed model with the 88-note output exactly aligned 52.55% of the notes; this ratio increased to 91.60% at a 10 ms threshold. The proposed framework using 12 chroma showed precision similar to the 88-note framework, but slightly lower accuracy. Compared to Ewert's algorithms with hand-crafted features, our method shows significantly better performance, especially at high resolution. Above a 100 ms threshold, our chroma framework and Ewert's method show similar precision, but the difference becomes significant at thresholds under 50 ms. Note that our framework is at a disadvantage against the audio-to-audio scenario of Ewert's algorithm, because the audio-to-audio approach benefits from identical note velocities; we suppose this is also why Ewert's algorithm performed better in the audio-to-audio scenario than in the audio-to-MIDI one. Carabias-Orti's algorithm shows lower precision than the others; we assume the difference mainly comes from the use of onset features.

For a fair reading of the results, we should note that our framework is heavily dependent on the training set, unlike the two compared methods. On the other hand, Carabias-Orti's algorithm focuses on handling various instruments and on an online alignment scenario, advantages that could not be fully appreciated in our experiment.

4.2 Effect of Chroma Onset Features

In the second experiment, we further investigated the effect of the chroma onset features. We removed the onset features from each model, compared the mean onset errors, and examined their distributions. As can be seen in Figure 5, the absence of onset features significantly degrades the performance. We thus conclude that the chroma onset features compensate for the limitations of the normalized transcription features. As stated in Section 1, the 88-note representation shows much better results than the chroma output features, especially without onsets.

[Figure 5. Comparison of mean onset errors between models with and without chroma onset features. Each point corresponds to the mean onset error of a piece. Outliers above 60 ms are omitted. The number above each box indicates the median value in ms.]

Table 1 shows the statistics of the piecewise onset errors. It confirms that the chroma onset feature is crucial in our proposed method: the median piecewise onset error decreased from 18.69 ms to 5.57 ms when the chroma onset features were added to the 88-note system. The importance of note onset features for aligning piano music was also examined in [6].

In addition to these experiments, we also aligned some real-world recordings with the trained system. Even though a quantitative evaluation is not presented here, the sonification of the aligned MIDI files shows promising results. Synchronized MIDI-audio examples are available on our demo website (http://mac.kaist.ac.kr/~ilcobo2/alignWithAMT).

5. CONCLUSIONS

In this paper, we proposed a framework for audio-to-score alignment of piano music using automatic music transcription. We built two AMT systems based on bidirectional LSTMs that predict note existence and chroma onsets. They provide MIDI-level features that can be compared directly with the score MIDI by the alignment algorithm. Our experiments on the MAPS dataset showed that the AMT-based features are effective for the alignment task and that our proposed system outperforms the compared approaches. The 88-note model with chroma onsets worked best. We also showed that the chroma onset features play a crucial role in improving the accuracy. Admittedly, the successful alignment performance may partly stem from using the same recording conditions for both training and test sets. Considering this issue, we will investigate the generalization capacity of our model by evaluating it on various datasets in the future.
We also plan to improve the AMT system by using other types of deep neural networks.

Acknowledgments

This work was supported by the Korea Advanced Institute of Science and Technology (project no. G04140049) and the Korea Creative Content Agency (project no. N04170044).

6. REFERENCES

[1] A. Arzt, G. Widmer, and S. Dixon, "Automatic page turning for musicians via real-time machine listening," in Proceedings of the 18th European Conference on Artificial Intelligence (ECAI), vol. 1, no. 1, 2008, pp. 241–245.

[2] R. B. Dannenberg and C. Raphael, "Music score alignment and computer accompaniment," Communications of the ACM, vol. 49, no. 8, pp. 38–43, 2006.

[3] G. Widmer, S. Dixon, W. Goebl, E. Pampalk, and A. Tobudic, "In search of the Horowitz factor," AI Magazine, vol. 24, no. 3, pp. 111–130, 2003.

[4] A. Friberg and J. Sundberg, "Perception of just-noticeable time displacement of a tone presented in a metrical sequence at different tempos," The Journal of the Acoustical Society of America, vol. 94, no. 3, pp. 1859–1859, 1993.

[5] S. Dixon and G. Widmer, "MATCH: a music alignment tool chest," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2005, pp. 492–497.

[6] S. Ewert, M. Müller, and P. Grosche, "High resolution audio synchronization using chroma onset features," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2009, pp. 1869–1872.

[7] A. Arzt, S. Böck, S. Flossmann, H. Frostel, M. Gasser, C. C. S. Liem, and G. Widmer, "The piano music companion," Frontiers in Artificial Intelligence and Applications, vol. 263, no. 1, pp. 1221–1222, 2014.

[8] S. Böck and M. Schedl, "Polyphonic piano note transcription with recurrent neural networks," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 121–124.

[9] R. J. Williams and J. Peng, "An efficient gradient-based algorithm for on-line training of recurrent network trajectories," Neural Computation, no. 2, pp. 490–501, 1990.

[10] R. Kelz, M. Dorfer, F. Korzeniowski, S. Böck, A. Arzt, and G. Widmer, "On the potential of simple framewise approaches to piano transcription," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2016, pp. 475–481.

[11] S. Sigtia, E. Benetos, and S. Dixon, "An end-to-end neural network for polyphonic piano music transcription," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 24, no. 5, pp. 927–939, 2016.

[12] S. Salvador and P. Chan, "FastDTW: toward accurate dynamic time warping in linear time and space," Intelligent Data Analysis, vol. 11, pp. 561–580, 2007.

[13] M. Müller, H. Mattes, and F. Kurth, "An efficient multiscale approach to audio synchronization," in Proc. International Conference on Music Information Retrieval (ISMIR), 2006, pp. 192–197.

[14] V. Emiya, R. Badeau, and B. David, "Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle," IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 6, pp. 1643–1654, 2010.

[15] S. Ewert and M. Sandler, "Piano transcription in the studio using an extensible alternating directions framework," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 11, pp. 1983–1997, 2016.

[16] M. Müller, H. Mattes, and F. Kurth, "An efficient multiscale approach to audio synchronization," in Proc. International Conference on Music Information Retrieval (ISMIR), 2006, pp. 192–197.
[17] C. Joder, S. Essid, and G. Richard, "A conditional random field framework for robust and scalable audio-to-score matching," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 8, pp. 2385–2397, 2011.

[18] J. J. Carabias-Orti, F. J. Rodríguez-Serrano, P. Vera-Candeas, N. Ruiz-Reyes, and F. J. Cañadas-Quesada, "An audio to score alignment framework using spectral factorization and dynamic time warping," in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2015, pp. 742–748.