Optimal spectral transportation with application to music transcription
Many spectral unmixing methods rely on the non-negative decomposition of spectral data onto a dictionary of spectral templates. In particular, state-of-the-art music transcription systems decompose the spectrogram of the input signal onto a dictionar…
Authors: Rémi Flamary, Cédric Févotte, Nicolas Courty, Valentin Emiya
Rémi Flamary, Université Côte d'Azur, CNRS, OCA, remi.flamary@unice.fr
Cédric Févotte, CNRS, IRIT, Toulouse, cedric.fevotte@irit.fr
Nicolas Courty, Université de Bretagne Sud, CNRS, IRISA, courty@univ-ubs.fr
Valentin Emiya, Aix-Marseille Université, CNRS, LIF, valentin.emiya@lif.univ-mrs.fr

Abstract

Many spectral unmixing methods rely on the non-negative decomposition of spectral data onto a dictionary of spectral templates. In particular, state-of-the-art music transcription systems decompose the spectrogram of the input signal onto a dictionary of representative note spectra. The typical measures of fit used to quantify the adequacy of the decomposition compare the data and template entries frequency-wise. As such, small displacements of energy from one frequency bin to another, as well as variations of timbre, can disproportionately harm the fit. We address these issues by means of optimal transportation and propose a new measure of fit that treats the frequency distributions of energy holistically as opposed to frequency-wise. Building on the harmonic nature of sound, the new measure is invariant to shifts of energy to harmonically-related frequencies, as well as to small and local displacements of energy. Equipped with this new measure of fit, the dictionary of note templates can be considerably simplified to a set of Dirac vectors located at the target fundamental frequencies (musical pitch values). This in turn gives rise to a very fast and simple decomposition algorithm that achieves state-of-the-art performance on real musical data.

1 Context

Many current spectral unmixing techniques rely on non-negative matrix decompositions. This concerns, for example, hyperspectral remote sensing (with applications in Earth observation, astronomy, chemistry, etc.) or audio signal processing.
The spectral sample $\mathbf{v}_n$ (the spectrum of light observed at a given pixel $n$, or the audio spectrum in a given time frame $n$) is decomposed onto a dictionary $\mathbf{W}$ of elementary spectral templates, characteristic of pure materials or sound objects, such that $\mathbf{v}_n \approx \mathbf{W}\mathbf{h}_n$. The composition of sample $n$ can be inferred from the non-negative expansion coefficients $\mathbf{h}_n$. This paradigm has led to state-of-the-art results for various tasks (recognition, classification, denoising, separation) in the aforementioned areas, and in particular in music transcription, the central application of this paper.

In state-of-the-art music transcription systems, the spectrogram $\mathbf{V}$ (with columns $\mathbf{v}_n$) of a musical signal is decomposed onto a dictionary of pure notes (in so-called multi-pitch estimation) or chords. $\mathbf{V}$ typically consists of (power-)magnitude values of a regular short-time Fourier transform (Smaragdis and Brown, 2003). It may also consist of an audio-specific spectral transform such as the Mel-frequency transform, as in (Vincent et al., 2010), or the constant-Q transform, as in (Oudre et al., 2011). The success of the transcription system depends of course on the adequacy of the time-frequency transform and of the dictionary to represent the data $\mathbf{V}$. In particular, the matrix $\mathbf{W}$ must be able to accurately represent a diversity of real notes. It may be trained with individual notes using annotated data (Boulanger-Lewandowski et al., 2012), have a parametric form (Rigaud et al., 2013) or be learnt from the data itself using a harmonic subspace constraint (Vincent et al., 2010).

29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

One important challenge of such methods lies in their ability to cope with the variability of real notes.
A simplistic dictionary model will assume that one note characterised by fundamental frequency $\nu_0$ (e.g., $\nu_0 = 440$ Hz for note A4) will be represented by a spectral template with non-zero coefficients placed at $\nu_0$ and at its multiples (the harmonic frequencies). In reality, many instruments, such as the piano, produce musical notes with either slight frequency misalignments (so-called inharmonicities) with respect to the theoretical values of the fundamental and harmonic frequencies, or amplitude variations at the harmonic frequencies with respect to recording conditions or played instrument (variations of timbre). Handling these variabilities by enlarging the dictionary with more templates is typically unrealistic, and adaptive dictionaries have been considered in (Vincent et al., 2010; Rigaud et al., 2013). In these papers, the spectral shape of the columns of $\mathbf{W}$ is adjusted to the data at hand, using specific time-invariant semi-parametric models. However, the note realisations may vary in time, something which is not handled by these approaches. This work presents a new spectral unmixing method based on optimal transportation (OT) that is fully flexible and remedies the latter difficulties. Note that Typke et al. (2004) have previously applied OT to notated music (e.g., score sheets) for search-by-query in databases, while we address here music transcription from audio spectral data.

2 A relevant baseline: PLCA

Before presenting our contributions, we start by introducing the PLCA method of Smaragdis et al. (2006), which is heavily used in audio signal processing. It is based on the Probabilistic Latent Semantic Analysis (PLSA) of Hofmann (2001) (used in text retrieval) and is a particular form of non-negative matrix factorisation (NMF). Simplifying a bit, in PLCA the columns of $\mathbf{V}$ are normalised to sum to one. Each vector $\mathbf{v}_n$ is then treated as a discrete probability distribution of “frequency quanta” and is approximated as $\mathbf{V} \approx \mathbf{W}\mathbf{H}$.
The matrices $\mathbf{W}$ and $\mathbf{H}$ are of size $M \times K$ and $K \times N$, respectively, and their columns are constrained to sum to one. As a result, the columns of the approximation $\hat{\mathbf{V}} = \mathbf{W}\mathbf{H}$ sum to one as well, and each distribution vector $\mathbf{v}_n$ is approximated by its counterpart distribution $\hat{\mathbf{v}}_n$ in $\hat{\mathbf{V}}$. Under the assumption that $\mathbf{W}$ is known, the approximation is found by solving the optimisation problem defined by

\[ \min_{\mathbf{H} \ge 0} D_{\mathrm{KL}}(\mathbf{V} | \mathbf{W}\mathbf{H}) \quad \text{s.t.} \quad \forall n,\ \|\mathbf{h}_n\|_1 = 1, \tag{1} \]

where $D_{\mathrm{KL}}(\mathbf{v} | \hat{\mathbf{v}}) = \sum_i v_i \log(v_i / \hat{v}_i)$ is the KL divergence between discrete distributions, and by extension $D_{\mathrm{KL}}(\mathbf{V} | \hat{\mathbf{V}}) = \sum_n D_{\mathrm{KL}}(\mathbf{v}_n | \hat{\mathbf{v}}_n)$.

An important characteristic of the KL divergence is its separability with respect to the entries of its arguments. It operates a frequency-wise comparison in the sense that, at every frame $n$, the spectral coefficient $v_{in}$ at frequency $i$ is compared to its counterpart $\hat{v}_{in}$, and the results of the comparisons are summed over $i$. In particular, a small displacement in the frequency support of one observation may disproportionately harm the divergence value. For example, if $\mathbf{v}_n$ is a pure note with fundamental frequency $\nu_0$, a small inharmonicity that shifts energy from $\nu_0$ to an adjacent frequency bin will unreasonably increase the divergence value when $\mathbf{v}_n$ is compared with a purely harmonic spectral template with fundamental frequency $\nu_0$. As explained in Section 1, such local displacements of frequency energy are very common when dealing with real data. A measure of fit invariant to small perturbations of the frequency support would be desirable in such a setting, and this is precisely what OT can bring.

3 Elements of optimal transportation

Given a discrete probability distribution $\mathbf{v}$ (a non-negative real-valued column vector of dimension $M$ and summing to one) and a target distribution $\hat{\mathbf{v}}$ (with the same properties), OT computes a transportation matrix $\mathbf{T}$ belonging to the set

\[ \Theta \stackrel{\text{def}}{=} \Big\{ \mathbf{T} \in \mathbb{R}_+^{M \times M} \ \Big|\ \forall i, j = 1, \ldots, M,\ \sum_{j=1}^{M} t_{ij} = v_i,\ \sum_{i=1}^{M} t_{ij} = \hat{v}_j \Big\}. \]

$\mathbf{T}$ establishes a bi-partite graph connecting the two distributions. In simple words, an amount (or, in typical OT parlance, a “mass”) of every coefficient of vector $\mathbf{v}$ is transported to an entry of $\hat{\mathbf{v}}$. The sum of transported amounts to the $j$th entry of $\hat{\mathbf{v}}$ must equal $\hat{v}_j$. The value of $t_{ij}$ is the amount transported from the $i$th entry of $\mathbf{v}$ to the $j$th entry of $\hat{\mathbf{v}}$. In our particular setting, the vector $\mathbf{v}$ is a distribution of spectral energies $v_1, \ldots, v_M$ at sampling frequencies $f_1, \ldots, f_M$.

Without additional constraints, the problem of finding a non-negative matrix $\mathbf{T} \in \Theta$ has an infinite number of solutions. As such, OT takes into account the cost of transporting an amount from the $i$th entry of $\mathbf{v}$ to the $j$th entry of $\hat{\mathbf{v}}$, denoted $c_{ij}$ (a non-negative real-valued number). Endowed with this cost function, OT involves solving the optimisation problem defined by

\[ \min_{\mathbf{T}} J(\mathbf{T} | \mathbf{v}, \hat{\mathbf{v}}, \mathbf{C}) = \sum_{ij} c_{ij} t_{ij} \quad \text{s.t.} \quad \mathbf{T} \in \Theta, \tag{2} \]

where $\mathbf{C}$ is the non-negative square matrix of size $M$ with elements $c_{ij}$. Eq. (2) defines a convex linear program. The value of the function $J(\mathbf{T} | \mathbf{v}, \hat{\mathbf{v}}, \mathbf{C})$ at its minimum is denoted $D_{\mathbf{C}}(\mathbf{v} | \hat{\mathbf{v}})$. When $\mathbf{C}$ is a symmetric matrix such that $c_{ij} = \|f_i - f_j\|_p^p$, where we recall that $f_i$ and $f_j$ are the frequencies in Hertz indexed by $i$ and $j$, $D_{\mathbf{C}}(\mathbf{v} | \hat{\mathbf{v}})$ defines a metric (i.e., a symmetric divergence that satisfies the triangle inequality) coined Wasserstein distance or earth mover's distance (Rubner et al., 1998; Villani, 2009). In other cases, in particular when the matrix $\mathbf{C}$ is not even symmetric as in the next section, $D_{\mathbf{C}}(\mathbf{v} | \hat{\mathbf{v}})$ is not a metric in general, but is still a valid measure of fit. For generality, we will refer to it as the “OT divergence”.

By construction, the OT divergence can explicitly embed a form of invariance to displacements of support, as defined by the transportation cost matrix $\mathbf{C}$.
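The contrast between the separable KL divergence of Section 2 and the OT divergence can be sketched numerically. The following toy example is an illustration only, not part of the original experiments: it compares a single spectral peak with a copy shifted by one bin, using the squared distance between bin indices as transportation cost. The helper names (`peak`, `kl_divergence`, `ot_quadratic_1d`) and the small "floor" added to keep the KL divergence finite are assumptions made for this sketch. For histograms on a line with a convex cost of the displacement, the monotone (north-west corner) coupling is optimal, which avoids the need for a linear programming solver here.

```python
import math

def kl_divergence(v, v_hat):
    """Separable KL divergence sum_i v_i * log(v_i / v_hat_i); zero entries of v contribute 0."""
    return sum(a * math.log(a / b) for a, b in zip(v, v_hat) if a > 0)

def ot_quadratic_1d(v, v_hat):
    """OT cost between two histograms on the same regular grid with c_ij = (i - j)^2.

    Uses the monotone (north-west corner) coupling, which is optimal on the
    line for convex costs of the displacement such as the squared distance.
    """
    a, b = list(v), list(v_hat)
    i = j = 0
    cost = 0.0
    while i < len(a) and j < len(b):
        moved = min(a[i], b[j])          # mass sent from bin i to bin j
        cost += moved * (i - j) ** 2
        a[i] -= moved
        b[j] -= moved
        if a[i] == 0.0:                  # source bin exhausted
            i += 1
        if b[j] == 0.0:                  # target bin filled
            j += 1
    return cost

def peak(position, floor, size=8):
    """Toy histogram (hypothetical helper): one dominant bin plus a small floor elsewhere."""
    h = [floor] * size
    h[position] = 1.0 - floor * (size - 1)
    return h

# One-bin shift of the peak, for two arbitrary values of the floor.
for floor in (1e-2, 1e-6):
    template, shifted = peak(3, floor), peak(4, floor)
    print(floor, kl_divergence(shifted, template), ot_quadratic_1d(shifted, template))
```

In this sketch the KL value grows without bound as the floor decreases, whereas the OT cost stays close to the transported mass times the squared one-bin displacement.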
For example, in the spectral decomposition setting, the matrix with entries of the form $c_{ij} = (f_i - f_j)^2$ will increasingly penalise frequency displacements as the distance between frequency bins increases. This precisely remedies the limitation of the separable KL divergence presented in Section 2. As such, the next section addresses variants of spectral unmixing based on the Wasserstein distance.

4 Optimal spectral transportation (OST)

Unmixing with OT. In light of the above discussion, a direct solution to the sensitivity of PLCA to small frequency displacements consists in replacing the KL divergence with the OT divergence. This amounts to solving the optimisation problem given by

\[ \min_{\mathbf{H} \ge 0} D_{\mathbf{C}}(\mathbf{V} | \mathbf{W}\mathbf{H}) \quad \text{s.t.} \quad \forall n,\ \|\mathbf{h}_n\|_1 = 1, \tag{3} \]

where $D_{\mathbf{C}}(\mathbf{V} | \hat{\mathbf{V}}) = \sum_n D_{\mathbf{C}}(\mathbf{v}_n | \hat{\mathbf{v}}_n)$, $\mathbf{W}$ is fixed and populated with pure note spectra, and $\mathbf{C}$ penalises large displacements of frequency support. This approach is a particular case of NMF with the Wasserstein distance, which has been considered in a face recognition setting by Sandler and Lindenbaum (2011), with subsequent developments by Zen et al. (2014) and Rolet et al. (2016). This approach is relevant to our spectral unmixing scenario but, as will be discussed in Section 5, is computationally intensive. It also requires the columns of $\mathbf{W}$ to be set to realistic note templates, which is still constraining. The next two sections describe a computationally friendlier approach which additionally removes the difficulty of choosing $\mathbf{W}$ appropriately.

Harmonic-invariant transportation cost. In the approach above, the harmonic modelling is conveyed by the dictionary $\mathbf{W}$ (consisting of comb-like pure note spectra) and the invariance to small frequency displacements is introduced via the matrix $\mathbf{C}$. In this section we propose to model both harmonicity and local invariance through the transportation cost matrix $\mathbf{C}$.
Loosely speaking, we want to define a class of equivalence between musical spectra that takes into account their inherent harmonic nature. As such, we essentially impose that a harmonic frequency (i.e., a close multiple of its fundamental) can be considered equivalent to its fundamental, the only target of multi-pitch estimation. We thus assume that a mass at one frequency can be transported to a divisor frequency with no cost. In other words, a mass at frequency $f_i$ can be transported with no cost to $f_i/2$, $f_i/3$, $f_i/4$, and so on until sampling resolution. One possible cost matrix that embeds this property is

\[ c_{ij} = \min_{q = 1, \ldots, q_{\max}} (f_i - q f_j)^2 + \epsilon\, \delta_{q \neq 1}, \tag{4} \]

where $q_{\max}$ is the ceiling of $f_i / f_j$ and $\epsilon$ is a small value. The term $\epsilon\, \delta_{q \neq 1}$ favours the discrimination of octaves. Indeed, it penalises the transportation of a note of fundamental frequency $2\nu_0$ or $\nu_0/2$ to the spectral template with fundamental frequency $\nu_0$, which would be costless without this additive term. Let us denote by $\mathbf{C}_h$ the transportation cost matrix defined by Eq. (4). Fig. 1 compares $\mathbf{C}_h$ to the more standard quadratic cost $\mathbf{C}_2$ defined by $c_{ij} = (f_i - f_j)^2$. With the quadratic cost, only local displacements are permissible. In contrast, the harmonic-invariant cost additionally permits larger displacements to divisor frequencies, improving robustness to variations of timbre in addition to inharmonicities.

[Figure 1: Comparison of transportation cost matrices $\mathbf{C}_2$ and $\mathbf{C}_h$ (full matrices and selected columns). Panels: quadratic cost $\mathbf{C}_2$ (log scale) and harmonic cost $\mathbf{C}_h$ (log scale) for $i, j = 1, \ldots, 100$; selected columns of each matrix for $i = 20, 25, 30, 35$.]

[Figure 2: Three example spectra $\mathbf{v}_n$ compared to a given template $\hat{\mathbf{v}}$ (left: one Dirac spectral template and three data samples $\mathbf{v}_1$, $\mathbf{v}_2$, $\mathbf{v}_3$) and computed divergences (right, reproduced below). The template is a mere Dirac vector placed at a particular frequency $\nu_0$. $D_{\ell_2}$ denotes the standard quadratic error $\|\mathbf{x} - \mathbf{y}\|_2^2$. By construction of $D_{\mathbf{C}_h}$, sample $\mathbf{v}_3$, which is harmonically related to the template, returns a very good fit with the latter OT divergence. Note that it does not make sense to compare output values of different divergences; only the relative comparison of output values of the same divergence for different input samples is meaningful.]

Measure of fit                          D_l2    D_KL    D_C2      D_Ch
D(v_1 | v_hat)                          1.13    72.92   145.00    134.32
D(v_2 | v_hat)                          1.13    5.42    10.00     10.00
D(v_3 | v_hat)                          0.91    2.02    1042.67   1.00

Dictionary of Dirac vectors. Having designed an OT divergence that encodes inherent properties of musical signals, we still need to choose a dictionary $\mathbf{W}$ that will encode the fundamental frequencies of the notes to identify. Typically, these will consist of the physical frequencies of the 12 notes of the chromatic scale (from note A to note G, including half-tones), over several octaves. As mentioned in Section 1, one possible strategy is to populate $\mathbf{W}$ with spectral note templates. However, as also discussed, the performance of the resulting unmixing method will be capped by the representativeness of the chosen set of templates. A most welcome consequence of using the OT divergence built on the harmonic-invariant cost matrix $\mathbf{C}_h$ is that we may use for $\mathbf{W}$ a mere set of Dirac vectors placed at the fundamental frequencies $\nu_1, \ldots, \nu_K$ of the notes to identify and separate. Indeed, under the proposed setting, a real note spectrum (composed of one fundamental and multiple harmonic frequencies) can be transported with no cost to its fundamental. Similarly, a spectral sample composed of several notes can be transported to a mixture of Dirac vectors placed at their fundamental frequencies.
This simply eliminates the problem of choosing a representative dictionary! This very appealing property is illustrated in Fig. 2. Furthermore, the particularly simple structure of the dictionary leads to a very efficient unmixing algorithm, as explained in the next section. In the following, the unmixing method consisting of the combined use of the harmonic-invariant cost matrix $\mathbf{C}_h$ and of the dictionary of Dirac vectors will be coined “optimal spectral transportation” (OST).

At this level, we assume for simplicity that the set of $K$ fundamental frequencies $\{\nu_1, \ldots, \nu_K\}$ is contained in the set of sampled frequencies $\{f_1, \ldots, f_M\}$. This means that $\mathbf{w}_k$ (the $k$th column of $\mathbf{W}$) is zero everywhere except at some entry $i$ such that $f_i = \nu_k$, where $w_{ik} = 1$. This is typically not the case in practice, where the sampled frequencies are fixed by the sampling rate, of the form $f_i = 0.5\,(i/T)\,f_s$, and where the fundamental frequencies $\nu_k$ are fixed by music theory. Our approach can actually deal with such a discrepancy and this will be explained later in Section 5.

5 Optimisation

OT unmixing with linear programming. We start by describing optimisation for the state-of-the-art OT unmixing problem described by Eq. (3) and proposed by Sandler and Lindenbaum (2011). First, since the objective function is separable with respect to samples, the optimisation problem decouples with respect to the activation columns $\mathbf{h}_n$. Dropping the sample index $n$ and combining Eqs. (2) and (3), optimisation thus reduces to solving for every sample a problem of the form

\[ \min_{\mathbf{h} \ge 0,\ \mathbf{T} \ge 0} \langle \mathbf{T}, \mathbf{C} \rangle = \sum_{ij} t_{ij} c_{ij} \quad \text{s.t.} \quad \mathbf{T}\mathbf{1}_M = \mathbf{v},\ \mathbf{T}^\top \mathbf{1}_M = \mathbf{W}\mathbf{h}, \tag{5} \]

where $\mathbf{1}_M$ is a vector of dimension $M$ containing only ones and $\langle \cdot, \cdot \rangle$ is the Frobenius inner product. Vectorising the variables $\mathbf{T}$ and $\mathbf{h}$ into a single vector of dimension $M^2 + K$, problem (5) can be turned into a canonical linear program.
Because of the large dimension of the variable (typically in the order of $10^5$), resolution can however be very demanding, as will be shown in experiments.

Optimisation for OST. We now assume that $\mathbf{W}$ is a set of Dirac vectors as explained at the end of Section 4. We also assume that $K < M$, which is the usual scenario. Indeed, $K$ is typically in the order of a few tens, while $M$ is in the order of a few hundreds. In such a setting $\hat{\mathbf{v}} = \mathbf{W}\mathbf{h}$ contains by design at most $K$ non-zero coefficients, located at the entries such that $f_i = \nu_k$. We denote this set of frequency indices by $S$. Hence, for $j \notin S$, we have $\hat{v}_j = 0$ and thus $\sum_i t_{ij} = 0$, by the second constraint of Eq. (5). Additionally, by the non-negativity of $\mathbf{T}$, this also implies that $\mathbf{T}$ has only $K$ non-zero columns, indexed by $j \in S$. Denoting by $\tilde{\mathbf{T}}$ this subset of columns, and by $\tilde{\mathbf{C}}$ the corresponding subset of columns of $\mathbf{C}$, problem (5) reduces to

\[ \min_{\mathbf{h} \ge 0,\ \tilde{\mathbf{T}} \ge 0} \langle \tilde{\mathbf{T}}, \tilde{\mathbf{C}} \rangle \quad \text{s.t.} \quad \tilde{\mathbf{T}}\mathbf{1}_K = \mathbf{v},\ \tilde{\mathbf{T}}^\top \mathbf{1}_M = \mathbf{h}. \tag{6} \]

This is an optimisation problem of significantly reduced dimension $(M + 1)K$. Even more appealing, the problem has a simple closed-form solution. Indeed, the variable $\mathbf{h}$ has a virtual role in problem (6). It only appears in the second constraint, which de facto becomes a free constraint. Thus problem (6) can be solved with respect to $\tilde{\mathbf{T}}$ regardless of $\mathbf{h}$, and $\mathbf{h}$ is then simply obtained by summing the columns of $\tilde{\mathbf{T}}^\top$ at the solution. Now, the problem

\[ \min_{\tilde{\mathbf{T}} \ge 0} \langle \tilde{\mathbf{T}}, \tilde{\mathbf{C}} \rangle \quad \text{s.t.} \quad \tilde{\mathbf{T}}\mathbf{1}_K = \mathbf{v} \tag{7} \]

decouples with respect to the rows $\tilde{\mathbf{t}}_i$ of $\tilde{\mathbf{T}}$, and becomes, $\forall i = 1, \ldots, M$,

\[ \min_{\tilde{\mathbf{t}}_i \ge 0} \sum_k \tilde{t}_{ik} \tilde{c}_{ik} \quad \text{s.t.} \quad \sum_k \tilde{t}_{ik} = v_i. \tag{8} \]

The solution is simply given by $\tilde{t}_{i k_i^\star} = v_i$ for $k_i^\star = \arg\min_k \{\tilde{c}_{ik}\}$, and $\tilde{t}_{ik} = 0$ for $k \neq k_i^\star$. Introducing the labelling matrix $\mathbf{L}$, which is everywhere zero except for indices $(i, k_i^\star)$ where it is equal to 1, the solution to OST is trivially given by $\hat{\mathbf{h}} = \mathbf{L}^\top \mathbf{v}$.
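The closed-form solution above can be sketched in a few lines. The code below is a minimal illustration and not the authors' implementation: the function names `harmonic_cost` and `ost`, the toy frequencies and the value of `eps` are assumptions chosen for the example. The cost follows the form of Eq. (4), evaluated between each observed frequency and each Dirac position, and the unmixing step assigns every $v_i$ to the column with the smallest cost before summing per note.

```python
import math

def harmonic_cost(f_i, f_j, eps):
    """Harmonic-invariant cost in the form of Eq. (4): min over q of (f_i - q f_j)^2 + eps * [q != 1]."""
    q_max = max(1, math.ceil(f_i / f_j))
    return min((f_i - q * f_j) ** 2 + (eps if q != 1 else 0.0)
               for q in range(1, q_max + 1))

def ost(v, cost):
    """Unregularised OST: send each v_i entirely to its cheapest Dirac, i.e. h = L^T v.

    v    : list of M non-negative spectral values summing to one
    cost : M x K nested list, cost[i][k] = cost of moving bin i to note k
    """
    n_notes = len(cost[0])
    h = [0.0] * n_notes
    for v_i, row in zip(v, cost):
        k_star = min(range(n_notes), key=row.__getitem__)  # arg min_k of row i
        h[k_star] += v_i
    return h

# Toy spectrum (hypothetical values): two notes at 110 Hz and 130 Hz with a few harmonics each.
freqs = [110.0, 130.0, 220.0, 260.0, 330.0]
v = [0.3, 0.3, 0.2, 0.1, 0.1]
notes = [110.0, 130.0]
cost = [[harmonic_cost(f, nu, eps=1e-3) for nu in notes] for f in freqs]
h = ost(v, cost)
print(h)
```

In this toy case the energy at 220 Hz and 330 Hz is assigned to the 110 Hz note and the energy at 260 Hz to the 130 Hz note. The cost matrix does not depend on the data, so in practice it (and the labels $k_i^\star$) would be computed once and reused for every frame.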
Thus, under the specific assumption that $\mathbf{W}$ is a set of Dirac vectors, the challenging problem (5) has been reduced to an effortless assignment problem to solve for $\mathbf{T}$ and a simple sum to solve for $\mathbf{h}$. Note that the algorithm is independent of the particular structure of $\mathbf{C}$. In the end, the complexity per frame of OST reduces to $O(M)$, which starkly contrasts with the complexity of PLCA, in the order of $O(KM)$ per iteration.

In Section 4, we assumed for simplicity that the set of fundamental frequencies $\{\nu_k\}_k$ was contained in the set of sampled frequencies $\{f_i\}_i$. As a matter of fact, this assumption can be trivially lifted in the proposed setting of OST. Indeed, we may construct the cost matrix $\tilde{\mathbf{C}}$ (of dimensions $M \times K$) by replacing the target frequencies $f_j$ in Eq. (4) by the theoretical fundamental frequencies $\nu_k$. Namely, we may simply set the coefficients of $\tilde{\mathbf{C}}$ to be $\tilde{c}_{ik} = \min_q (f_i - q \nu_k)^2 + \epsilon\, \delta_{q \neq 1}$ in the implementation. Then, the matrix $\tilde{\mathbf{T}}$ indicates how each sample $\mathbf{v}$ is transported to the Dirac vectors placed at fundamental frequencies $\{\nu_k\}_k$, without the need for the actual Dirac vectors themselves, which elegantly solves the frequency sampling problem.

OST with entropic regularisation (OST_e). The procedure described above leads to a winner-takes-all transportation of all of $v_i$ to its cost-minimum target entry $k_i^\star$. We found it useful in practice to relax this hard assignment and distribute energies more evenly by using the entropic regularisation of Cuturi (2013). It consists of penalising the fit $\langle \tilde{\mathbf{T}}, \tilde{\mathbf{C}} \rangle$ in Eq. (6) with an additional term $\Omega_e(\tilde{\mathbf{T}}) = \sum_{ik} \tilde{t}_{ik} \log(\tilde{t}_{ik})$, weighted by the hyper-parameter $\lambda_e$. The negentropic term $\Omega_e(\tilde{\mathbf{T}})$ promotes the transportation of $v_i$ to several entries, leading to a smoother estimate of $\tilde{\mathbf{T}}$.
As explained in the supplementary material, one can show that the negentropy-regularised problem is a Bregman projection (Benamou et al., 2015) and has again a closed-form solution $\hat{\mathbf{h}} = \mathbf{L}_e^\top \mathbf{v}$, where $\mathbf{L}_e$ is the $M \times K$ matrix with coefficients $l_{ik} = \exp(-\tilde{c}_{ik}/\lambda_e) / \sum_p \exp(-\tilde{c}_{ip}/\lambda_e)$. Limiting cases $\lambda_e = 0$ and $\lambda_e = \infty$ return the unregularised OST estimate and the maximum-entropy estimate $h_k = 1/K$, respectively. Because $\mathbf{L}_e$ becomes a full matrix, the complexity per frame of OST_e becomes $O(KM)$.

OST with group regularisation (OST_g). We have explained above that the transportation matrix $\mathbf{T}$ has a strong group structure in the sense that it contains by construction $M - K$ null columns, and that only the subset $\tilde{\mathbf{T}}$ needs to be considered. Because only a small number of the $K$ possible notes will be played at every time frame, the matrix $\tilde{\mathbf{T}}$ will additionally have a significant number of null columns. This heavily suggests using group-sparse regularisation in the estimation of $\tilde{\mathbf{T}}$. As such, we also consider problem (6) penalised by the additional term $\Omega_g(\tilde{\mathbf{T}}) = \sum_k \sqrt{\|\tilde{\mathbf{t}}_k\|_1}$, which promotes group-sparsity at column level (Huang et al., 2009). Unlike OST or OST_e, OST_g does not offer a closed-form solution. Following Courty et al. (2014), a majorisation-minimisation procedure based on the local linearisation of $\Omega_g(\tilde{\mathbf{T}})$ can be employed; the details are given in the supplementary material. The resulting algorithm consists in iteratively applying unregularised OST, as of Eq. (6), with the iteration-dependent transportation cost matrix $\tilde{\mathbf{C}}^{(\mathrm{iter})} = \tilde{\mathbf{C}} + \tilde{\mathbf{R}}^{(\mathrm{iter})}$, where $\tilde{\mathbf{R}}^{(\mathrm{iter})}$ is the $M \times K$ matrix with coefficients $\tilde{r}_{ik}^{(\mathrm{iter})} = \frac{1}{2} \|\tilde{\mathbf{t}}_k^{(\mathrm{iter})}\|_1^{-1/2}$. Note that the proposed group-regularisation of $\tilde{\mathbf{T}}$ corresponds to a sparse regularisation of $\mathbf{h}$. This is because $h_k = \|\tilde{\mathbf{t}}_k\|_1$ and thus $\Omega_g(\tilde{\mathbf{T}}) = \sum_k \sqrt{h_k}$.
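The two regularised variants described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of the supplementary material: the function names, the weight `lam_g` applied to the group penalty, the constant `tiny` that guards against division by zero, and the uniform initialisation of the activations are all choices made for this sketch. The entropic variant applies a row-wise softmax of $-\tilde{c}_{ik}/\lambda_e$; the group variant repeats the hard assignment with the cost augmented by $\frac{1}{2} h_k^{-1/2}$, using $h_k = \|\tilde{\mathbf{t}}_k\|_1$.

```python
import math

def ost_entropic(v, cost, lam_e):
    """OST_e sketch: h = L_e^T v with l_ik = exp(-c_ik / lam_e) / sum_p exp(-c_ip / lam_e)."""
    n_notes = len(cost[0])
    h = [0.0] * n_notes
    for v_i, row in zip(v, cost):
        c_min = min(row)  # shifting by the row minimum leaves the softmax unchanged
        weights = [math.exp(-(c - c_min) / lam_e) for c in row]
        total = sum(weights)
        for k in range(n_notes):
            h[k] += v_i * weights[k] / total
    return h

def ost_group(v, cost, lam_g=1.0, n_iter=10, tiny=1e-12):
    """OST_g sketch: repeated hard assignment with cost + R, r_ik = 0.5 * lam_g / sqrt(h_k).

    lam_g, tiny and the uniform starting point are assumptions of this sketch.
    """
    n_notes = len(cost[0])
    h = [1.0 / n_notes] * n_notes  # uniform start: the first pass equals plain OST
    for _ in range(n_iter):
        penalty = [0.5 * lam_g / math.sqrt(h_k + tiny) for h_k in h]
        new_h = [0.0] * n_notes
        for v_i, row in zip(v, cost):
            k_star = min(range(n_notes), key=lambda k: row[k] + penalty[k])
            new_h[k_star] += v_i
        h = new_h
    return h

# Toy cost matrix (hypothetical values) with M = 2 bins and K = 2 notes.
toy_cost = [[0.0, 0.1], [0.1, 0.0]]
print(ost_entropic([0.9, 0.1], toy_cost, lam_e=0.05))
print(ost_group([0.9, 0.1], toy_cost))
```

A small `lam_e` approaches the hard assignment and a large one approaches $h_k = 1/K$, matching the limiting cases stated above. In the group variant, notes with little accumulated mass receive a large penalty, so weakly activated notes tend to be emptied in favour of stronger ones.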
Finally, note that OST_e and OST_g can be implemented simultaneously, leading to OST_e+g, by considering the optimisation of the doubly-penalised objective function $\langle \tilde{\mathbf{T}}, \tilde{\mathbf{C}} \rangle + \lambda_e \Omega_e(\tilde{\mathbf{T}}) + \lambda_g \Omega_g(\tilde{\mathbf{T}})$, addressed in the supplementary material.

6 Experiments

Toy experiments with simulated data. In this section we illustrate the robustness, the flexibility and the efficiency of OST on two simulated examples. The top plots of Fig. 3 display a synthetic dictionary of 8 harmonic spectral templates, referred to as the “harmonic dictionary”. They have been generated as Gaussian kernels placed at a fundamental frequency and its multiples, and using exponential dampening of the amplitudes. As everywhere in the paper, the spectral templates are normalised to sum to one. Note that the 8th template is the upper octave of the first one. We compare the unmixing performance of five methods in two different scenarios. The five methods are as follows. PLCA is the method described in Section 2, where the dictionary $\mathbf{W}$ is the harmonic dictionary. Iterations are stopped when the relative difference of the objective function between two iterations falls below $10^{-5}$ or the number of iterations (per frame) exceeds 1000. OT_h is the unmixing method with the OT divergence, as in the first paragraph of Section 4, using the harmonic transportation cost matrix $\mathbf{C}_h$ and the harmonic dictionary. OST is like OT_h, but using a dictionary of Dirac vectors (placed at the 8 fundamental frequencies characterising the harmonic dictionary). OST_e, OST_g and OST_e+g are the regularised variants of OST, described at the end of Section 5. The iterative procedure in the group-regularised variants is run for 10 iterations (per frame).

In the first experimental scenario, reported in Fig. 3 (a), the data sample is generated by mixing the 1st and 4th elements of the harmonic dictionary, but introducing a small shift of the true fundamental frequencies (with the shift being propagated to the harmonic frequencies). This mimics the effect of possible inharmonicities or of an ill-tuned instrument. The middle plot of Fig. 3 (a) displays the generated sample, together with the “theoretical sample”, i.e., without the frequency shift. This shows how a slight shift of the fundamental frequencies can greatly impact the overall spectral distribution. The bottom plot displays the true activation vector and the estimates returned by the five methods. The table reports the value of the (arbitrary) error measure $\|\hat{\mathbf{h}} - \mathbf{h}_{\mathrm{true}}\|_1$ together with the run time (on an average desktop PC using a MATLAB implementation) for every method. The results show that group-regularised variants of OST lead to best performance with very light computational burden, and without using the true harmonic dictionary.

In the second experimental scenario, reported in Fig. 3 (b), the data sample is generated by mixing the 1st and 6th elements of the harmonic dictionary, with the right fundamental and harmonic frequencies, but where the spectral amplitudes at the latter do not follow the exponential dampening of the template dictionary (variation of timbre). Here again the group-regularised variants of OST outperform the state-of-the-art approaches, both in accuracy and run time.

[Figure 3: Unmixing under model misspecification. See text for details. The tables of both panels are reproduced below.]

(a) Unmixing with shifted fundamental frequencies
Method      PLCA    OT_h    OST     OST_g   OST_e   OST_e+g
l1 error    0.900   0.340   0.534   0.021   0.660   0.015
Time (s)    0.057   6.541   0.006   0.007   0.007   0.013

(b) Unmixing with wrong harmonic amplitudes
Method      PLCA    OT_h    OST     OST_g   OST_e   OST_e+g
l1 error    0.791   0.430   0.971   0.045   0.911   0.048
Time (s)    0.019   6.529   0.006   0.006   0.005   0.010

Transcription of real musical data.
We consider in this section the transcription of a selection of real piano recordings, obtained from the MAPS dataset (Emiya et al., 2010). The data comes with a ground-truth binary “piano-roll” which indicates the active notes at every time. The note fundamental frequencies are given in MIDI, a standard musical integer-valued frequency scale that matches the keys of a piano, with 12 half-tones (i.e., piano keys) per octave. The spectrogram of each recording is computed with a Hann window of size 93 ms and 50% overlap ($f_s = 44.1$ kHz). The columns (time frames) are then normalised to produce $\mathbf{V}$. Each recording is decomposed with PLCA, OST and OST_e, with $K = 60$ notes (5 octaves). Half of the recording is used for validation of the hyper-parameters and the other half is used as test data. For PLCA, we validated 4 and 3 values of the width and amplitude dampening of the Gaussian kernels used to synthesise the dictionary, respectively. For OST, we set $\epsilon = q \epsilon_0$ in Eq. (4), which was found to satisfactorily improve the discrimination of octaves increasingly with frequency, and validated 5 orders of magnitude of $\epsilon_0$. For OST_e, we additionally validated 4 orders of magnitude of $\lambda_e$.

Each of the three methods returns an estimate of $\mathbf{H}$. The estimate is turned into a 0/1 piano-roll by only retaining the support of its $P_n$ maximum entries at every frame $n$, where $P_n$ is the ground-truth number of notes played in frame $n$. The estimated piano-roll is then numerically compared to its ground truth using the F-measure, a global recognition measure which accounts both for precision and recall and which is bounded between 0 (critically wrong) and 1 (perfect recognition). Our evaluation framework follows standard practice in music transcription evaluation, see for example (Daniel et al., 2008).
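The thresholding and scoring steps described above can be sketched as follows. This is a minimal illustration, not the evaluation code used for the reported results: the function names and the toy values are assumptions, and the F-measure is taken here to be the usual harmonic mean of precision and recall computed from counts pooled over all frames.

```python
def threshold_frames(frames, counts):
    """Keep the P_n largest activations of each frame as a 0/1 piano-roll.

    frames : list of N frames, each a list of K activations
    counts : list of N ground-truth note counts P_n
    """
    roll = []
    for h, p in zip(frames, counts):
        ranked = sorted(range(len(h)), key=h.__getitem__, reverse=True)
        active = set(ranked[:p])
        roll.append([1 if k in active else 0 for k in range(len(h))])
    return roll

def f_measure(estimated, reference):
    """Harmonic mean of precision and recall over all (frame, note) entries."""
    tp = fp = fn = 0
    for est, ref in zip(estimated, reference):
        for e, r in zip(est, ref):
            if e and r:
                tp += 1
            elif e and not r:
                fp += 1
            elif r and not e:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Toy example (hypothetical values): 2 frames, 3 candidate notes.
activations = [[0.1, 0.7, 0.2], [0.5, 0.4, 0.1]]
truth = [[0, 1, 0], [1, 0, 1]]
note_counts = [sum(frame) for frame in truth]
estimate = threshold_frames(activations, note_counts)
print(estimate, f_measure(estimate, truth))
```

Because the number of retained notes per frame equals the ground-truth count, the estimated and reference rolls contain the same number of active entries, so precision and recall coincide in this setting.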
As detailed in the supplementary material, it can be shown that OST_g and OST_e+g do not change the location of the maximum entries in the estimates of $\mathbf{H}$ returned by OST and OST_e, respectively, but only their amplitude. As such, they lead to the same F-measures as OST and OST_e, and we did not include them in the experiments of this section.

We first illustrate the complexity of real-data spectra in Fig. 4, where the amplitudes of the first six partials (the components corresponding to the harmonic frequencies) of a single piano note are represented along time. Depending on the partial order $q$, the amplitude evolves with asynchronous beats and with various slopes. This behaviour is characteristic of piano sounds, in which each note comes from the vibration of up to three coupled strings. As a consequence, the spectral envelope of such notes cannot be well modelled by a fixed amplitude pattern. Fig. 4 shows that, thanks to its flexibility, OST_e can perfectly recover the true fundamental frequency (MIDI 50) while PLCA is prone to octave errors (confusions between MIDI 50 and MIDI 62).

[Figure 4: First 6 partials and transcription of a single piano note (note D3, $\nu_0 = 147$ Hz, MIDI 50). Panels: (a) thresholded OST_e transcription; (b) thresholded PLCA transcription. Axes: pitch (MIDI) versus time (s).]

Then, Table 1 reports the F-measures returned by the three competing approaches on seven 15-s extracts of pieces from Chopin, Beethoven, Mussorgsky and Mozart. For each of the three methods, we have also included a variant that incorporates a flat component in the dictionary that can account for noise or non-harmonic components. In PLCA, this merely consists in adding a constant vector $w_{f(K+1)} = 1/M$ to $\mathbf{W}$. In OST or OST_e this consists in adding a constant column to $\tilde{\mathbf{C}}$, whose amplitude has also been validated over 3 orders of magnitude. OST performs comparably or slightly inferiorly to PLCA but with an impressive gain in computational time (∼3000× speedup). Best overall performance is obtained with OST_e+noise, with an average ∼10% performance gain over PLCA and ∼750× speedup.

Table 1: Recognition performance (F-measure values) and average computational unmixing times.

MAPS dataset file IDs     PLCA     PLCA+noise   OST     OST+noise   OST_e   OST_e+noise
chpn_op25_e4_ENSTDkAm     0.679    0.671        0.566   0.564       0.695   0.695
mond_2_SptkBGAm           0.616    0.713        0.470   0.534       0.610   0.607
mond_2_SptkBGCl           0.645    0.687        0.583   0.676       0.695   0.730
muss_1_ENSTDkAm           0.613    0.478        0.513   0.550       0.671   0.667
muss_2_AkPnCGdD           0.587    0.574        0.531   0.611       0.667   0.675
mz_311_1_ENSTDkCl         0.561    0.593        0.580   0.628       0.625   0.665
mz_311_1_StbgTGd2         0.663    0.617        0.701   0.718       0.747   0.747
Average                   0.624    0.619        0.563   0.612       0.673   0.684
Time (s)                  14.861   15.420       0.004   0.005       0.210   0.202

A Python implementation of OST and a real-time demonstrator are available at https://github.com/rflamary/OST

7 Conclusions

In this paper we have introduced a new paradigm for spectral dictionary-based music transcription. As compared to state-of-the-art approaches, we have proposed a holistic measure of fit which is robust to local and harmonically-related displacements of frequency energies. It is based on a new form of transportation cost matrix that takes into account the inherent harmonic structure of musical signals.
The proposed transportation cost matrix in turn allows the use of a simplistic dictionary composed of Dirac vectors placed at the target fundamental frequencies, eliminating the problem of choosing a meaningful dictionary. Experimental results have shown the robustness and accuracy of the proposed approach, which strikingly does not come at the price of computational efficiency. Instead, the particular structure of the dictionary allows for a simple algorithm that is much faster than state-of-the-art NMF-like approaches. The proposed approach offers new foundations, with promising results and room for improvement. In particular, we believe exciting avenues of research concern the learning of $C_h$ from examples and extensions to other areas such as remote sensing, using application-specific forms of $C$.

Acknowledgments. This work is supported in part by the European Research Council (ERC) under the European Union's Horizon 2020 research & innovation programme (project FACTORY) and by the French ANR under the JCJC programme (project MAD). Many thanks to Antony Schutz for generating & providing some of the musical data.

References

J.-D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.

N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent. Discriminative non-negative matrix factorization for multiple pitch estimation. In Proc. International Society for Music Information Retrieval Conference (ISMIR), 2012.

N. Courty, R. Flamary, and D. Tuia. Domain adaptation with regularized optimal transport. In Proc. European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), 2014.

M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transportation. In Advances in Neural Information Processing Systems (NIPS), 2013.

A. Daniel, V. Emiya, and B. David. Perceptually-based evaluation of the errors usually made when automatically transcribing music. In Proc. International Society for Music Information Retrieval Conference (ISMIR), 2008.

V. Emiya, R. Badeau, and B. David. Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle. IEEE Trans. Audio, Speech, and Language Processing, 18(6):1643–1654, 2010.

T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1):177–196, 2001.

J. Huang, S. Ma, H. Xie, and C.-H. Zhang. A group bridge approach for variable selection. Biometrika, 96(2):339–355, 2009.

L. Oudre, Y. Grenier, and C. Févotte. Chord recognition by fitting rescaled chroma vectors to chord templates. IEEE Trans. Audio, Speech and Language Processing, 19(7):2222–2233, Sep. 2011.

F. Rigaud, B. David, and L. Daudet. A parametric model and estimation techniques for the inharmonicity and tuning of the piano. The Journal of the Acoustical Society of America, 133(5):3107–3118, 2013.

A. Rolet, M. Cuturi, and G. Peyré. Fast dictionary learning with a smoothed Wasserstein loss. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 630–638, 2016.

Y. Rubner, C. Tomasi, and L. Guibas. A metric for distributions with applications to image databases. In Proc. International Conference in Computer Vision (ICCV), 1998.

R. Sandler and M. Lindenbaum. Nonnegative matrix factorization with earth mover's distance metric for image analysis. IEEE Trans. Pattern Analysis and Machine Intelligence, 33(8):1590–1602, Aug. 2011.

P. Smaragdis and J. C. Brown. Non-negative matrix factorization for polyphonic music transcription. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2003.

P. Smaragdis, B. Raj, and M. V. Shashanka. A probabilistic latent variable model for acoustic modeling. In Proc. NIPS workshop on Advances in models for acoustic processing, 2006.

R. Typke, R. C. Veltkamp, and F. Wiering. Searching notated polyphonic music using transportation distances. In Proc. ACM International Conference on Multimedia, 2004.

C. Villani. Optimal transport: old and new. Springer, 2009.

E. Vincent, N. Bertin, and R. Badeau. Adaptive harmonic spectral decomposition for multiple pitch estimation. IEEE Trans. Audio, Speech and Language Processing, 18:528–537, 2010.

G. Zen, E. Ricci, and N. Sebe. Simultaneous ground metric learning and matrix factorization with earth mover's distance. In Proc. International Conference on Pattern Recognition (ICPR), 2014.