Calibration of a two-state pitch-wise HMM method for note segmentation in Automatic Music Transcription systems


Authors: Dorian Cazau, Yuancheng Wang, Olivier Adam

Dorian Cazau (Lab-STICC, ENSTA Bretagne), Yuancheng Wang (Southeast University, China), Olivier Adam (Institut d'Alembert, UPMC), Qiao Wang (Southeast University, China), Grégory Nuel (LPMA, UPMC)
Contact: dorian.cazau@ensta-bretagne.fr

ABSTRACT

Many methods for automatic music transcription involve a multi-pitch estimation method that estimates an activity score for each pitch. A second processing step, called note segmentation, has to be performed for each pitch in order to identify the time intervals when the notes are played. In this study, a pitch-wise two-state on/off first-order Hidden Markov Model (HMM) is developed for note segmentation. A complete parametrization of the HMM sigmoid function is proposed, based on its original regression formulation, including a parameter α of slope smoothing and a parameter β of thresholding contrast. A comparative evaluation of different note segmentation strategies was performed, differentiated according to whether they use a fixed threshold, called "Hard Thresholding" (HT), or an HMM-based thresholding method, called "Soft Thresholding" (ST). This evaluation was done following MIREX standards and using the MAPS dataset. In addition, different transcription scenarios and recording conditions were tested using three units of the Degradation toolbox. Results show that note segmentation through HMM soft thresholding with a data-based optimization of the {α, β} parameter couple significantly enhances transcription performance.

1. INTRODUCTION

Work on Automatic Music Transcription (AMT) dates back more than 30 years [20], and has found numerous applications in the fields of music information retrieval, interactive computer systems, and automated musicological analysis [15].
Due to the difficulty of producing all the information required for a complete musical score, AMT is commonly defined as the computer-assisted process of analyzing an acoustic musical signal so as to write down the musical parameters of the sounds that occur in it, which are basically the pitch, onset time, and duration of each sound to be played. This task of "low-level" transcription, to which we will restrict ourselves in this study, has interested more and more researchers from different fields (e.g. library science, musicology, machine learning, cognition), and has been a very competitive task in the MIR (Music Information Retrieval) community [1] since the early 2000s. Despite this large enthusiasm for AMT challenges, and the several audio-to-MIDI converters available commercially, perfect polyphonic AMT systems are out of reach of today's technology.

The diversity of music practice, as well as of recording and diffusion supports, makes the task of AMT very challenging indeed. These variability sources can be partitioned into three broad classes: 1) instrument based, 2) music language model based, and 3) technology based. The first class covers variability from tonal instrument timbre. All instruments possess a specific acoustic signature that makes them recognizable among different instruments playing a same pitch. This timbre is defined by acoustic properties, both spectral and temporal, specific to each instrument.

(License note: this work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: "Calibration of a two-state pitch-wise HMM method for note segmentation in Automatic Music Transcription systems", 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017.)
The second class includes variability from the different ways an instrument can be played, which vary with the musical genre (e.g. tonality, tuning, rhythm), the playing techniques (e.g. dynamics, plucking modes), and the personal interpretations of a same piece. These first two classes induce a high complexity of note spectra over time, whose non-stationarity is determined both by the instrument and by the musician's playing characteristics. The third class includes variability from electromechanics (e.g. transmission channel, microphone), environment (e.g. background noise, room acoustics, distant microphone), and data quality (e.g. sampling rate, recording quality, audio codec/compression). For example, in ethnomusicological research, extensive sound datasets currently exist, with generally poor quality recordings made in the field, while a growing need for automatic analysis appears [8, 17, 19, 24].

Concerning AMT methods, many studies have used rank reduction and source separation methods, exploiting both the additive and oscillatory properties of audio signals. Among them, spectrogram factorization methods have become very popular, from the original Non-negative Matrix Factorization (NMF) to the recent developments of Probabilistic Latent Component Analysis (PLCA) [2, 5]. PLCA is a powerful method for Multi-Pitch Estimation (MPE), representing the spectra as a linear combination of vectors from a dictionary. Such models take advantage of the inherent low-rank nature of magnitude spectrograms to provide compact and informative descriptions. Their output generally takes the form of a pianoroll-like matrix showing the "activity" of each spectral basis against time, which is itself discretized into successive time frames of analysis (of the order of magnitude of 11 ms).
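The factorization idea behind such models can be illustrated with a minimal single-component decomposition, P(f, t) ≈ P(t) Σ_p P(f|p) P(p|t). The sketch below is an illustrative NumPy implementation written for this article, not the PLCA system used in the paper's experiments; the function name and the EM update form are our own assumptions.

```python
import numpy as np

def plca_single_component(V, n_pitches, n_iter=200, seed=0):
    """Illustrative single-component PLCA (not the authors' code):
    factor a magnitude spectrogram V (freq x time) as
    P(f, t) ~ P(t) * sum_p P(f|p) P(p|t), via EM updates."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    Vn = V / V.sum()                                 # normalized joint P(f, t)
    Pt = Vn.sum(axis=0)                              # frame probability P(t)
    W = rng.random((F, n_pitches)); W /= W.sum(axis=0)   # templates P(f|p)
    H = rng.random((n_pitches, T)); H /= H.sum(axis=0)   # activations P(p|t)
    for _ in range(n_iter):
        R = (W @ H) * Pt + 1e-12                     # model estimate of P(f, t)
        post = Vn / R                                # E-step ratio
        W_new = W * (post @ H.T)                     # M-step accumulators
        H_new = H * (W.T @ post)
        W = W_new / (W_new.sum(axis=0) + 1e-12)
        H = H_new / (H_new.sum(axis=0) + 1e-12)
    pitch_activity = H * Pt                          # P(p, t) = P(t) P(p|t)
    return W, H, np.log(pitch_activity + 1e-12)      # log pitch activity
```

The returned log pitch activity matrix plays the role of the input to the note segmentation strategies discussed next.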
From this activity matrix, the next processing step in view of AMT is note segmentation, which aims to identify for each pitch the time intervals when the notes are played. To perform this operation, most spectrogram factorization-based transcription methods [10, 14, 21] use a simple threshold-based detection of the note activations from the pitch activity matrix, followed by a minimum duration pruning. One of the main drawbacks of this PLCA method with a simple threshold is that all successive frames are processed independently from one another, and thus temporal correlation between successive frames is not modeled. One solution that has been proposed is to jointly learn spectral dictionaries as well as a Markov chain that describes the structure of changes between these dictionaries [5, 21, 22].

In this paper, we focus on the note segmentation stage, using a pitch-wise two-state on/off first-order HMM, initially proposed by Poliner et al. [23] for AMT. This HMM allows taking into account the dependence of pitch activation across time frames. We review the formalism of the model of [23], including a full parametrization of the sigmoid function used to map HMM observation probabilities into the [0, 1] interval, with a term α of slope smoothing and a term β of thresholding contrast. After demonstrating the relevance of an optimal adjustment of these parameters for note segmentation, a supervised approach to estimate the sigmoid parameters from a learning corpus is proposed. Also, the Degradation toolbox [18] was used to build three "degraded" sound datasets that have allowed us to evaluate transcription performance on real-life types of audio recordings, such as radio broadcast and MP3-compressed audio, which are almost never dealt with in transcription studies.

2. METHODS

2.1 Background on PLCA

PLCA is a probabilistic factorization method [25] based on the assumption that a suitably normalized magnitude spectrogram, V, can be modeled as a joint distribution over time and frequency, P(f, t), with f the log-frequency index and t = 1, . . . , T the time index, T being the number of time frames. This quantity can be factored into a frame probability P(t), which can be computed directly from the observed data (i.e. the energy spectrogram), and a conditional distribution over frequency bins P(f | t), as follows [7]:

P(f | t) = Σ_{p,m} P(f | p, m) P(m | p, t) P(p | t)    (1)

where P(f | p, m) are the spectral templates for pitch p = 1, . . . , N_p (with N_p the number of pitches) and playing mode m, P(m | p, t) is the playing mode activation, and P(p | t) is the pitch activation (i.e. the transcription). In this paper, the playing mode m will refer to different playing dynamics (i.e. note loudness). To estimate the model parameters P(m | p, t) and P(p | t), since there is usually no closed-form solution for the maximization of the log-likelihood or the posterior distributions, iterative update rules based on the Expectation-Maximization (EM) algorithm [9] are employed (see [4] for details). The pitch activity matrix P(p, t) is deduced from P(p | t) with Bayes' rule:

P(p, t) = P(t) P(p | t)    (2)

PLCA note templates are learned from pre-recorded isolated notes, using a one-component PLCA model (i.e. m = 1 in Eq. (1)). Three different note templates per pitch are used during MPE. In this paper, we use the PLCA-based AMT system developed by Benetos and Weyde [6] (code available at https://code.soundsoftware.ac.uk/projects/amt_mssiplca_fast). In the following, for p = 1, . . . , N_p and t = 1, . . .
, T, we define the logarithmic pitch activity matrix as

X_{p,t} = log(P(p, t))    (3)

2.2 Note segmentation strategies

2.2.1 HT: Hard Thresholding

The note segmentation strategy HT consists of a simple thresholding β_HT of the logarithmic pitch activity matrix X_{p,t}, as is most commonly done in spectrogram factorization-based transcription or pitch tracking systems, e.g. in [10, 14, 21]. This HT is sometimes combined with a minimum duration constraint, with typical post-filtering like "all runs of active pitch of length smaller than k are set to 0".

2.2.2 ST: Soft Thresholding

In this note segmentation strategy, initially proposed by Poliner and Ellis [23], each pitch p is modelled as a two-state on/off HMM, i.e. with underlying states q_t ∈ {0, 1} that denote pitch activity/inactivity. The state dynamics, transition matrix, and state priors are estimated from our "directly observed" state sequences, i.e. the training MIDI data, which are sampled at the precise times corresponding to the analysis frames of the activation matrix.

For each pitch p, we consider an independent HMM with observations X_{p,t}, which are actually observed, and a hidden binary Markov sequence Q = q_1, . . . , q_T, illustrated in Figure 1.

Figure 1. Graphical representation of the two-state on/off HMM. q_t ∈ {0, 1} are the underlying state labels at time t, and X_{p,t} the observations.

The Markov model then follows the law:

P(Q, X) ∝ P(q_1) Π_{t=2}^{T} P(q_t | q_{t−1}) Π_{t=1}^{T} P(q_t | X_{p,t})    (4)

where ∝ means "proportional to", as the probabilities do not sum to 1. For t = 1, . . .
, T, we assume that:

P(q_t = 0 | q_{t−1} = 0) = 1 − τ_0,   P(q_t = 1 | q_{t−1} = 0) = τ_0    (5)

P(q_t = 0 | q_{t−1} = 1) = τ_1,   P(q_t = 1 | q_{t−1} = 1) = 1 − τ_1    (6)

with τ_0, τ_1 ∈ [0, 1] the transition probabilities, and the convention that q_0 = 0 because all notes are inactive at the beginning of a recording. These transition probabilities cover the four state transitions: off/off, off/on, on/off, on/on. Parameter τ_0 (resp. τ_1) is directly related to the prior duration of inactivity (resp. activity) of pitch p. Without observation, the length of an inactivity run (resp. activity run) would be geometric with parameter τ_0 (resp. τ_1), with average length 1/τ_0 (resp. 1/τ_1).

The observation probabilities are defined as follows, using a sigmoid curve with the PLCA pitch activity matrix X_{p,t} as input:

P(q_t = 0 | X_{p,t}) = 1/Z    (7)

P(q_t = 1 | X_{p,t}) = exp[e^α (X_{p,t} − β)]/Z    (8)

with α, β ∈ R, and Z = 1 + exp[e^α (X_{p,t} − β)] the normalization constant ensuring that Σ_{q_t} P(q_t | X_{p,t}) = 1. The parameter set of the model is denoted θ = (τ, α, β), which includes the specific values for all pitches. The HMM is solved using the classical forward-backward recursions for all t = 1, . . . , T, i.e. P_θ(q_t = s | X_{p,t}) = η_s(t) ∝ F_t(s) B_t(s).

Note that the HMM definition combines the temporal dependence of pitch activity (the Markov model) with a PLCA generative model. As a result of this combination, the resulting model is defined up to a constant factor, but this is not a problem since we will exploit this model to compute posterior distributions. In contrast, one should note that in the initial model of [23], a similar model is suggested where the PLCA generative part is associated with so-called "virtual observations". We here preferred the fully generative formulation presented above, but both models are obviously totally equivalent.

Using logarithmic values, the parameters {α, β}, expressed in dB, are directly interpretable physically.
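To make this model concrete, the following sketch computes the posteriors η_s(t) for one pitch by scaled forward-backward recursions. It is a hypothetical illustration of Eqs. (4)-(8): the function name and the default transition values are our assumptions, not the values estimated from MIDI data in the paper.

```python
import numpy as np

def hmm_posteriors(X, alpha, beta, tau0=0.01, tau1=0.05):
    """Posterior P(q_t = 1 | X) for the two-state on/off HMM of
    Eqs. (4)-(8), via scaled forward-backward recursions.
    tau0/tau1 defaults are illustrative guesses."""
    T = len(X)
    A = np.array([[1 - tau0, tau0],      # off -> off, off -> on
                  [tau1, 1 - tau1]])     # on  -> off, on  -> on
    # observation probabilities, Eqs. (7)-(8): [1, exp(e^alpha (X - beta))] / Z
    g = np.exp(np.clip(np.exp(alpha) * (X - beta), -50.0, 50.0))
    B = np.stack([np.ones(T), g], axis=1)
    B /= B.sum(axis=1, keepdims=True)
    # forward pass (scaled), with the convention q_0 = 0
    F = np.zeros((T, 2))
    F[0] = A[0] * B[0]
    F[0] /= F[0].sum()
    for t in range(1, T):
        F[t] = (F[t - 1] @ A) * B[t]
        F[t] /= F[t].sum()
    # backward pass (scaled)
    Bk = np.ones((T, 2))
    for t in range(T - 2, -1, -1):
        Bk[t] = A @ (B[t + 1] * Bk[t + 1])
        Bk[t] /= Bk[t].sum()
    post = F * Bk                        # eta_s(t) proportional to F_t(s) B_t(s)
    post /= post.sum(axis=1, keepdims=True)
    return post[:, 1]
```

Note segmentation then labels frame t as active when this posterior exceeds 0.5.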
β is an offset thresholding parameter, which allows separating signal from noise (in other words, the higher its value, the more pitch candidates with low probability will be discarded), while α is a contrast parameter: a value greater than 0 yields a fast switch from noise to signal (i.e. a low degree of tolerance around the threshold), and a value lower than 0 a smoother switch. Figure 2 shows a sigmoid curve with different values of β and α. This parametrization {α, β} can therefore be seen as a generalization of the initial model of [23].

Figure 2. Effects of the parameters β (top) and α (bottom) on the theoretical sigmoid given by Eq. (8). On top, α is fixed to 0; on the bottom, β is fixed to -5.

For the note segmentation strategy ST, we use the set of parameters {α, β} = {0, β_HT}, as used in previous studies [5, 23].

2.2.3 OST: Optimized Soft Thresholding

The note segmentation strategy OST is based on the same HMM model as the ST strategy, although the parameters {α, β} are now optimized for each pitch. Given the ground truth of a test musical sequence, we use the Nelder-Mead optimizer of the R software to iteratively find the optimal {α, β} parameters that provide the best transcription performance measure. The Nelder-Mead method is a simplex-based multivariate optimizer known to be slow and imprecise, but generally robust and suitable for irregular and difficult problems. For optimization, we use the Least Mean Square Error (LMSE) metric, as it allows taking into account the precise shape of the activation profiles. Figure 3 provides an example of this optimization through the contour graph of the log10(LMSE) function. However, classical AMT error metrics (see Sec. 2.3.3) will be used as display variables for graphics, as they allow direct interpretation and comparison in terms of transcription performance.
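The {α, β} fit can be sketched as follows. This is a simplified stand-in, not the paper's code: scipy's Nelder-Mead replaces the R optimizer, and the objective scores a frame-wise sigmoid curve (the observation probability of Eq. (8)) rather than the full HMM posterior; the function names are our own.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid_posterior(X, alpha, beta):
    """Frame-wise P(q_t = 1 | X_{p,t}) from Eq. (8): a logistic curve
    with slope e^alpha and offset beta."""
    z = np.clip(np.exp(alpha) * (X - beta), -50.0, 50.0)
    return 1.0 / (1.0 + np.exp(-z))

def fit_alpha_beta(X, truth, x0=(0.0, -2.0)):
    """Nelder-Mead fit of {alpha, beta} for one pitch, minimizing the
    mean squared error (LMSE) between the posterior curve and the
    binary ground-truth roll. x0 is an illustrative starting point."""
    def lmse(params):
        a, b = params
        return np.mean((sigmoid_posterior(X, a, b) - truth) ** 2)
    res = minimize(lmse, x0=np.asarray(x0), method="Nelder-Mead")
    return res.x, res.fun
```

In the paper's OST strategy this optimization is run per pitch against MIDI-derived ground truth, and the fitted parameters are then reused on unseen pieces via cross-validation.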
In real-world scenarios of AMT, the ground truth of a musical piece is never known in advance. A common strategy to estimate model or prior knowledge parameters is to train them on a learning dataset that is somewhat similar to the musical piece to be transcribed. This was done in this study for the {α, β} parameters, through a cross-validation procedure with the LMSE optimization (see Sec. 2.3.2).

Figure 3. Example of a data-based optimization of the {α, β} parameters through the contour graph of the log10(LMSE) function, using the musical piece MAPS MUS-alb esp2 AkPnCGdD. The dashed white lines point to the local minimum.

2.3 Evaluation procedure

2.3.1 Sound dataset

To test and train the AMT systems, three different sound corpora are required: audio musical pieces from an instrument repertoire, the corresponding scores in the form of MIDI files, and a complete dataset of isolated notes for this instrument. Audio musical pieces and corresponding MIDI scores were extracted from the MAPS database [11], belonging to the solo classical piano repertoire. The 56 musical pieces of the two pianos labelled AkPnCGdD and ENSTDkCl were used, and constituted our evaluation sound dataset called Baseline. The first piano model is the virtual instrument Akoustik Piano (concert grand D piano) developed by Native Instruments. The second one is the real upright piano model Yamaha Disklavier Mark III. Three other sound datasets of musical pieces have then been defined as follows:

• MP3 dataset. It corresponds to the same musical pieces as the dataset Baseline, but modified with the Strong MP3 Compression degradation from the Degradation toolbox [18]. This degradation compresses the audio data to an MP3 file at a constant bit rate of 64 kbps using the Lame encoder;

• Smartphone dataset.
It corresponds to the same musical pieces as the dataset Baseline, but modified with the Smartphone Recording degradation from the Degradation toolbox [18]. This degradation simulates a user holding a phone in front of a speaker: 1. Apply Impulse Response, using the IR of a smartphone microphone ("Google Nexus One"); 2. Dynamic Range Compression, to simulate the phone's auto-gain; 3. Clipping, 3% of samples; 4. Add Noise, adding medium pink noise;

• Vinyl dataset. It corresponds to the same musical pieces as the dataset Baseline, but modified with the Vinyl degradation from the Degradation toolbox [18]. This degradation applies an Impulse Response, using a typical record player impulse response, adds record player crackle, applies a Wow Resample imitating wow-and-flutter with the wow frequency set to 33 rpm (the speed of Long Play records), and adds light pink noise.

For all datasets, isolated note samples were extracted from the RWC database (ref. 011, CD 1) [13].

2.3.2 Cross-validation

During a cross-validation procedure, the model is fit to a training dataset, and predictive accuracy is assessed using a test dataset. Two cross-validation procedures were used for training the {α, β} parameters of the OST strategy and testing separately the three thresholding strategies. The first one is the "leave-one-out" cross-validation procedure, using only one musical piece for parameter training and testing on all others. This process is iterated for each musical piece. The second one is a repeated random sub-sampling validation, also known as Monte Carlo cross-validation. At each iteration, the complete dataset of musical pieces is randomly split into training and test data according to a given training/test ratio. The results are then averaged over the splits. The advantage of this method (over k-fold cross-validation) is that the proportion of the training/test split does not depend on the number of iterations (folds). A number of 20 iterations was used during our simulations. We also tested different training/test ratios, ranging from 10/90% to 60/40%, in order to evaluate the influence of the training dataset size on transcription performance.

2.3.3 Evaluation metrics

For assessing the performance of our proposed transcription system, frame-based evaluations are made by comparing the transcribed output and the MIDI ground truth frame by frame, using a 10 ms scale as in the MIREX multiple-F0 estimation task [1]. We used the frame-based recall (TPR), precision (PPV), F-measure (FMeas) and overall accuracy (Acc):

TPR = Σ_{t=1}^{T} TP[t] / Σ_{t=1}^{T} (TP[t] + FN[t])    (9)

PPV = Σ_{t=1}^{T} TP[t] / Σ_{t=1}^{T} (TP[t] + FP[t])    (10)

FMeas = 2 · PPV · TPR / (PPV + TPR)    (11)

Acc = Σ_{t=1}^{T} TP[t] / Σ_{t=1}^{T} (TP[t] + FP[t] + FN[t])    (12)

where T is the total number of time frames, and TP[t], TN[t], FN[t] and FP[t] are the numbers of true positive, true negative, false negative and false positive pitches at frame t. The recall is the ratio between the number of relevant and original items; the precision is the ratio between the number of relevant and detected items; and the F-measure is the harmonic mean of precision and recall. For all these evaluation metrics, a value of 1 represents a perfect match between the estimated transcription and the reference one.

2.3.4 MPE algorithms on the benchmark

In this study, we tested the four following MPE algorithms:

• Tolonen2000: this algorithm² [26] is an efficient model for multipitch and periodicity analysis of complex audio signals.
The model essentially divides the signal into two channels, below and above 1000 Hz, computes a "generalized" autocorrelation of the low-channel signal and of the envelope of the high-channel signal, and sums the autocorrelation functions;

• Emiya2010: this algorithm³ [11] models the spectral envelope of the overtones of each note with a smooth autoregressive model. For the background noise, a moving-average model is used, and the combination of both tends to eliminate harmonic and sub-harmonic erroneous pitch estimations. This leads to a complete generative spectral model for simultaneous piano notes, which also explicitly includes the typical deviation from exact harmonicity in a piano overtone series. The pitch set which maximizes an approximate likelihood is selected from among a restricted number of possible pitch combinations;

• HALCA: the Harmonic Adaptive Latent Component Analysis algorithm⁴ [12] models each note in a constant-Q transform as a weighted sum of fixed narrowband harmonic spectra, spectrally convolved with some impulse that defines the pitch. All parameters are estimated by means of the EM algorithm, in the PLCA framework. This algorithm was evaluated by MIREX and obtained the 2nd best score in the Multiple Fundamental Frequency Estimation & Tracking task, 2009-2012 [1];

• Benetos2013: this PLCA-based AMT system⁵ [3] uses pre-fixed templates defined with real note samples, without updating them in the maximization step of the EM algorithm. It has been ranked first in the MIREX transcription tasks [1].

2.4 Setting the HT threshold value

We need to define the threshold value β_HT used in the note segmentation strategies HT and ST. Although most studies in the AMT literature [10, 14, 21] use this note segmentation strategy, threshold values are barely reported, and procedures to define them have not yet been standardized.
Most of the time, one threshold value is computed across each evaluation dataset, and it depends on various parameters of the experimental set-up, such as the evaluation metric used, the input time-frequency representation, and the normalization of the input waveform. In this paper, we use a similar empirical dataset-based approach to define the HT threshold value. ROC curves (True Positives against False Positives) are computed over the threshold range [-5; 0] dB so as to choose the value that maximizes True Positives and minimizes False Positives, i.e. that best increases transcription performance over each dataset.

² We used the source code implemented in the MIR toolbox [16], called mirpitch(..., 'Tolonen').
³ Source code courtesy of the primary author.
⁴ Source code is available at http://www.benoit-fuentes.fr/publications.html.
⁵ Source code is available at https://code.soundsoftware.ac.uk/projects/amt_mssiplca_fast.

3. RESULTS AND DISCUSSION

All following results on transcription performance have been obtained using the Benetos2013 AMT system, except for Figure 6, where all AMT systems are comparatively evaluated. Figure 4 represents the boxplots of the optimal {α, β} values obtained for each pitch. The "leave-one-out" cross-validation procedure has been applied to the different datasets, from top to bottom. For each dataset, we can see that the data-based pitch-wise optimization leads to β values drastically different from the threshold value β_HT used in the ST and HT thresholding strategies (represented by the horizontal red lines). Differences range from 0.5 to 2 dB, which has a significant impact on note segmentation. Slighter differences are observed in the values of α, although slightly positive values of α (around +1 dB) tend to contribute to reducing the LMSE metric used in optimization. Also, note that the optimal β_HT values are also dependent on the datasets, varying from -1.8 to -2.8 dB.
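The HT segmentation of Sec. 2.2.1 and the dataset-based threshold sweep of Sec. 2.4 can be sketched as follows. This is a simplified illustration: candidate thresholds are scored directly with the frame-based F-measure of Eq. (11) rather than through a full ROC analysis, and the minimum-duration pruning length is an assumed value.

```python
import numpy as np

def hard_threshold(X, beta, min_len=3):
    """HT note segmentation: threshold the log activity X_{p,t} at beta,
    then prune active runs shorter than min_len frames."""
    active = (X >= beta).astype(int)
    t, T = 0, len(active)
    while t < T:
        if active[t]:
            start = t
            while t < T and active[t]:
                t += 1
            if t - start < min_len:      # minimum duration pruning
                active[start:t] = 0
        else:
            t += 1
    return active

def f_measure(est, truth):
    """Frame-based F-measure from Eqs. (9)-(11)."""
    tp = np.sum((est == 1) & (truth == 1))
    fp = np.sum((est == 1) & (truth == 0))
    fn = np.sum((est == 0) & (truth == 1))
    if tp == 0:
        return 0.0
    ppv = tp / (tp + fp)                 # precision
    tpr = tp / (tp + fn)                 # recall
    return 2 * ppv * tpr / (ppv + tpr)

def best_threshold(X, truth, grid=np.linspace(-5, 0, 51)):
    """Sweep beta_HT over [-5, 0] dB on a development set and keep the
    value with the highest frame-based F-measure."""
    scores = [f_measure(hard_threshold(X, b), truth) for b in grid]
    return grid[int(np.argmax(scores))]
```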
Now, let us see how this optimization of {α, β} in the method OST impacts real transcription performance. Table 1 shows transcription results obtained with the "leave-one-out" cross-validation procedure, applied to the different thresholding strategies. In comparison to the methods HT and ST, significant gains in transcription performance are brought by the proposed method OST. These gains are the highest for the baseline dataset D1, in the order of magnitude of 5 to 8% for the two metrics Acc and FMeas. They remain systematically positive for the other datasets, with a minimum gain of 4% whatever the dataset, error metric and compared thresholding strategy. Altogether, these gains are very significant with regard to common gains in transcription performance reported in the literature, and demonstrate the validity of our proposed method.

Figure 4. Boxplots of the optimal {α, β} values obtained for each pitch, and for each evaluation dataset. The horizontal red lines in each boxplot represent the parameter values used in the ST and HT thresholding strategies.

Table 1. Averages of the error metrics Acc and FMeas obtained with the different thresholding strategies, i.e. HT, ST and OST, using a leave-one-out cross-validation procedure.

Dataset      Strategy   Acc (%)   FMeas (%)
Baseline     HT         54.9      53.3
             ST         57.6      55.3
             OST        62.3      59.2
MP3          HT         51.9      52.6
             ST         52.2      50.1
             OST        55.6      56.7
Smartphone   HT         52.2      51.9
             ST         53.1      51.3
             OST        58.4      56.5
Vinyl        HT         50.8      48.8
             ST         51.1      49.2
             OST        57.8      54.1

In Figure 5, we evaluated the dependency of transcription performance on the training dataset size, through a Monte Carlo cross-validation procedure with different training/test ratios, ranging from 10 to 60% of the complete dataset of musical pieces, plus the "leave-one-out" (labelled LOM) ratio. This figure shows that increasing the size of the training set directly induces average transcription gains from 0.5 to 6% of the metric FMeas with the OST method, in comparison to the HT method. We note that once the curves reach the 60/40% training/test ratio, all systems converge quickly to the gain ceiling achieved with the LOM ratio.

Figure 5. Difference between the F-measures obtained with the OST and HT note segmentation methods, using 20 iterations of the repeated random sub-sampling validation method with training/test ratios ranging from 10/90% to 60/40%, plus the "leave-one-out" (labelled LOM) ratio.

Eventually, we studied the dependency of the OST transcription performance on the AMT system used, in comparison to the method HT. Figure 6 shows the differences between the F-measures obtained with the methods OST and HT. We can observe that these differences are relatively small, i.e. below 2%. This demonstrates that the proposed OST method improves transcription performance in a rather universal way, independent of the characteristics of the activation matrices, as long as AMT-system-specific training datasets are used. Only the AMT system Tolonen2000 shows higher transcription gains brought by the OST method (especially for the datasets D3 and D4), as this system outputs the worst activation matrices.

Figure 6. Difference between the F-measures obtained with the OST and HT note segmentation methods, using different AMT systems.

4. CONCLUSION

In this study, an original method for the task of note segmentation was presented. This task is a crucial processing step in most systems of automatic music transcription. The presented method is based on a two-state pitch-wise Hidden Markov Model, augmented with two sigmoid parameters of contrast and slope smoothing that are trained on a learning dataset. This rather simple method has brought significant gains in transcription performance on music datasets with different characteristics.
It can also be used as a universal post-processing block after any pitch-wise activation matrix, showing great promise for future use.

5. REFERENCES

[1] MIREX (2007). Music Information Retrieval Evaluation eXchange (MIREX). Available at http://music-ir.org/mirexwiki/ (date last viewed January 9, 2015).

[2] V. Arora and L. Behera. Instrument identification using PLCA over stretched manifolds. In Communications (NCC), 2014 Twentieth National Conference on, pages 1-5, Feb 2014.

[3] E. Benetos, S. Cherla, and T. Weyde. An efficient shift-invariant model for polyphonic music transcription. In 6th Int. Workshop on Machine Learning and Music, Prague, Czech Republic, 2013.

[4] E. Benetos and S. Dixon. A shift-invariant latent variable model for automatic music transcription. Computer Music J., 36:81-84, 2012.

[5] E. Benetos and S. Dixon. Multiple-instrument polyphonic music transcription using a temporally constrained shift-invariant model. J. Acoust. Soc. Am., 133:1727-1741, 2013.

[6] E. Benetos and T. Weyde. An efficient temporally-constrained probabilistic model for multiple-instrument music transcription. In 16th International Society for Music Information Retrieval Conference, Malaga, Spain, pages 355-360, 2015.

[7] D. Cazau, O. Adam, J. T. Laitman, and J. S. Reidenberg. Understanding the intentional acoustic behavior of humpback whales: a production-based approach. J. Acoust. Soc. Am., 134:2268-2273, 2013.

[8] O. Cornelis, M. Lesaffre, D. Moelants, and M. Leman. Access to ethnic music: Advances and perspectives in content-based music information retrieval. Signal Proc., 90:1008-1031, 2010.

[9] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38, 1977.

[10] A. Dessein, A. Cont, and G. Lemaitre. Real-time polyphonic music transcription with nonnegative matrix factorization and beta-divergence. In 11th International Society for Music Information Retrieval Conference, Utrecht, Netherlands, pages 489-494, 2010.

[11] V. Emiya, R. Badeau, and G. Richard. Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle. IEEE Trans. on Audio, Speech, Lang. Proc., 18:1643-1654, 2010.

[12] B. Fuentes, R. Badeau, and G. Richard. Harmonic adaptive latent component analysis of audio and application to music transcription. IEEE Trans. on Audio, Speech, Lang. Processing, 21:1854-1866, 2013.

[13] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka. RWC Music Database: Popular, classical, and jazz music databases. In 3rd International Conference on Music Information Retrieval, Baltimore, MD, pages 287-288, 2003.

[14] G. Grindlay and D. P. W. Ellis. Transcribing multi-instrument polyphonic music with hierarchical eigeninstruments. IEEE J. Sel. Topics Signal Proc., 5:1159-1169, 2011.

[15] A. Klapuri. Automatic music transcription as we know it today. J. of New Music Research, 33:269-282, 2004.

[16] O. Lartillot and P. Toiviainen. A Matlab toolbox for musical feature extraction from audio. In Proc. of the 10th Int. Conference on Digital Audio Effects (DAFx-07), Bordeaux, France, September 10-15, 2007.

[17] T. Lidy, C. N. Silla, O. Cornelis, F. Gouyon, A. Rauber, C. A. A. Kaestner, and A. L. Koerich. On the suitability of state-of-the-art music information retrieval methods for analyzing, categorizing and accessing non-western and ethnic music collections. Signal Proc., 90:1032-1048, 2010.

[18] M. Mauch and S. Ewert. The Audio Degradation Toolbox and its application to robustness evaluation. In 14th International Society for Music Information Retrieval Conference, Curitiba, PR, Brazil, pages 83-88, 2013.

[19] D. Moelants, O. Cornelis, M. Leman, J. Gansemans, R. T. Caluwe, G. D. Tré, T. Matthé, and A. Hallez. The problems and opportunities of content-based analysis and description of ethnic music. International J. of Intangible Heritage, 2:59-67, 2007.

[20] J. A. Moorer. On the transcription of musical sound by computer. Computer Music Journal, 1:32-38, 1977.

[21] G. J. Mysore and P. Smaragdis. Relative pitch estimation of multiple instruments. In International Conference on Acoustics, Speech and Signal Processing, Taipei, Taiwan, pages 313-316, 2009.

[22] M. Nakano, J. Le Roux, H. Kameoka, O. Kitano, N. Ono, and S. Sagayama. Nonnegative matrix factorization with Markov-chained bases for modeling time-varying patterns in music spectrograms. In LVA/ICA 2010, LNCS 6365, V. Vigneron et al. (Eds.), pages 149-156, 2010.

[23] G. Poliner and D. Ellis. A discriminative model for polyphonic piano transcription. EURASIP J. on Advances in Signal Proc., 8:1-9, 2007.

[24] J. Six and O. Cornelis. Computer-assisted transcription of ethnic music. In 3rd International Workshop on Folk Music Analysis, Amsterdam, Netherlands, pages 71-72, 2013.

[25] P. Smaragdis, B. Raj, and M. Shashanka. A probabilistic latent variable model for acoustic modeling. In Neural Information Proc. Systems Workshop, Whistler, BC, Canada, 2006.

[26] T. Tolonen and M. Karjalainen. A computationally efficient multipitch analysis model. IEEE Trans. on Speech and Audio Processing, 8:708-716, 2000.
