Particle Filtering for PLCA model with Application to Music Transcription

Cazau Dorian (a, 1), Guillaume Revillon (a), Yuancheng Wang (a) and Olivier Adam (a)

(a) Sorbonne Universités, UPMC University Paris 06/CNRS, UMR 7190, Institut Jean le Rond d'Alembert, F-75015, Paris, France
(1) Corresponding author e-mail: cazaudorian@outlook.fr

Abstract

Automatic Music Transcription (AMT) consists in automatically estimating the notes in an audio recording, through three attributes: onset time, duration and pitch. Probabilistic Latent Component Analysis (PLCA) has become very popular for this task. PLCA is a spectrogram factorization method, able to model a magnitude spectrogram as a linear combination of spectral vectors from a dictionary. Such methods use the Expectation-Maximization (EM) algorithm to estimate the parameters of the acoustic model. This algorithm presents well-known inherent shortcomings (local convergence, dependency on initialization), which limit the applications of EM-based systems to AMT, particularly with regard to the mathematical form and number of priors. To overcome such limits, we propose in this paper to employ a different estimation framework based on Particle Filtering (PF), which consists in sampling the posterior distribution over larger parameter ranges. This framework proves to be more robust in parameter estimation, and more flexible and unifying in the integration of prior knowledge in the system. Note-level transcription accuracies of 61.8 % and 59.5 % were achieved on evaluation sound datasets of two different instrument repertoires, including the classical piano (from the MAPS dataset) and the marovany zither, and direct comparisons to previous PLCA-based approaches are provided. Steps for further development are also outlined.

Cazau et al., Draft, p.
3

1 Introduction

1.1 Background on PLCA

Probabilistic Latent Component Analysis (PLCA) is a straightforward extension of Probabilistic Latent Semantic Indexing (Hofmann, 1999) which deals with an arbitrary number of dimensions and can exhibit various features such as sparsity or shift-invariance. The basic model is defined as

$$P(x) = \sum_{z} P(z) \prod_{j=1}^{J} P(x_j \mid z) \tag{1}$$

where $P(x)$ is a $J$-dimensional distribution of the random variable $x = (x_1, \ldots, x_J)$, $z$ is a latent variable and the $P(x_j \mid z)$ are one-dimensional distributions with $j \in \{1, \ldots, J\}$. Such a general model has been successfully applied to audio signals, with a theoretical framework developed by (Smaragdis et al., 2006). In particular, PLCA has proven to be an efficient probabilistic tool for non-negative data analysis, which offers a convenient way of designing spectrogram models. From its general formulation (eq. 1), and considering a spectrogram $S(f, t)$ as a probability distribution $P(f, t)$, a latent variable $z$ is introduced to model $P(f, t)$ as

$$P(f, t) = \sum_{z} P(z)\, P(f \mid z)\, P(t \mid z) = \sum_{z} P(z, t)\, P(f \mid z) \tag{2}$$

where $f$ and $t$ represent frequency and time respectively, and are conditionally independent given $z$, $P(f \mid z)$ are the spectral bases corresponding to component $z$, and $P(z, t)$ their time activations. Since there is usually no closed-form solution for the maximization of the log-likelihood or the posterior, iterative update rules based on the Expectation-Maximization (EM) algorithm are employed to estimate $P(f \mid z)$ and $P(z, t)$.

1.2 Limitations of current PLCA models

The major limitation of current PLCA models lies in the inherent problems of the EM algorithm. This algorithm was originally introduced by (Dempster et al., 1977) to overcome the difficulties in maximizing likelihoods of missing data models.
The main advantage of that method is its easy implementation, consisting of initializing the parameters and alternating expectation and maximization steps until convergence. Its major drawback, besides the requirement of convex likelihoods, lies in its sensitivity to initialization, which increases the risk of converging to a local maximum (Robert and Casella, 1999). That issue is exacerbated in the case of multimodal likelihoods. Indeed, the increase of the likelihood function at each step of the algorithm ensures its convergence to the maximum likelihood estimator in the case of unimodal likelihoods, but implies a dependence on initial conditions for multimodal likelihoods. Alternative techniques have also been proposed to improve the search for global maxima, such as running the algorithm a number of times with different, random starting points, or using variants of the basic EM algorithm such as the Deterministic Annealing EM (DAEM) algorithm (Ueda and Nakano, 1998). These theoretical issues have reached research fields working on audio signals. To tackle the problem of dependency on initialization, some authors (Grindlay and Ellis, 2010; Benetos and Dixon, 2013) perform a training of the instrument templates, which has proved to be an effective way to initialise the spectral bases. Indeed, by fixing them without data-driven updating, we obtain a stable output for the gain function, independent of its initialisation. However, when the model becomes more complex, for example with the introduction of different instrument variables, performing a robust initialization is more difficult. As for the local convergence problem, some works (Hoffman et al., 2009; Grindlay and Ellis, 2010; Cheng et al., 2013) have used the DAEM algorithm, which is based on a temperature parameter.
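As a concrete reference point for the EM procedure discussed above, the following is a minimal sketch of EM for the two-dimensional model of eq. 2, written in plain Python. The update rules are the standard ones for this mixture model, which the text does not spell out, and the function names `normalize` and `plca_em` are hypothetical. The recorded log-likelihood is non-decreasing across iterations, which is the property that makes the final value depend on the random starting point.

```python
import math
import random

def normalize(v):
    """Scale a list of positive numbers so that it sums to one."""
    s = sum(v)
    return [x / s for x in v]

def plca_em(V, n_z, n_iter=50, seed=0):
    """EM sketch for P(f,t) = sum_z P(z,t) P(f|z) on a non-negative matrix V[f][t].

    Assumption: standard PLCA updates; names and defaults are illustrative only.
    """
    rng = random.Random(seed)
    F, T = len(V), len(V[0])
    total = sum(map(sum, V))
    # Random start: P(f|z) sums to 1 over f, P(z,t) sums to 1 over all (z,t).
    Pf_z = [normalize([rng.random() + 1e-3 for _ in range(F)]) for _ in range(n_z)]
    flat = normalize([rng.random() + 1e-3 for _ in range(n_z * T)])
    Pzt = [flat[z * T:(z + 1) * T] for z in range(n_z)]
    loglik = []
    for _ in range(n_iter):
        new_f = [[0.0] * F for _ in range(n_z)]
        new_zt = [[0.0] * T for _ in range(n_z)]
        ll = 0.0
        for f in range(F):
            for t in range(T):
                if V[f][t] <= 0:
                    continue  # empty bins contribute nothing
                joint = [Pzt[z][t] * Pf_z[z][f] for z in range(n_z)]
                model = sum(joint)
                ll += V[f][t] * math.log(model)
                for z in range(n_z):
                    # E-step: posterior P(z|f,t), weighted by the observed energy
                    r = V[f][t] * joint[z] / model
                    new_f[z][f] += r
                    new_zt[z][t] += r
        loglik.append(ll)  # log-likelihood of the parameters before this update
        # M-step: renormalise the accumulated responsibilities
        Pf_z = [normalize(row) for row in new_f]
        Pzt = [[x / total for x in row] for row in new_zt]
    return Pf_z, Pzt, loglik
```

Calling `plca_em` on the same matrix with several values of `seed` and comparing the last entry of `loglik` is a simple way to observe the dependence on initialization described above.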
This limitation becomes particularly critical when integrating priors into the PLCA framework. Generally speaking, this integration introduces generic problems in the convergence of the optimization to global maxima, especially when the prior is multimodal. Indeed, when a prior is injected, the maximization step becomes a maximum a posteriori step and the log posterior probability needs to have the right properties for maximization. (Fuentes et al., 2013) used a numerical fixed point algorithm to solve the modified EM equations with a sparsity prior, whose convergence is only assumed theoretically, but "observed in practice" (although the sensitivity of the algorithm convergence to the evaluation sound dataset is not detailed). (Benetos and Dixon, 2013) favoured the use of pre-defined templates, which allows them to skip computing the EM update equation of the templates, and to apply only a sparsity constraint on the pitch activity matrix and the pitch-wise source contribution matrix. Also, the simultaneous use of several priors on the same model parameter leads to some difficulties in terms of mathematical calculation and increases convergence problems (Fuentes et al., 2013).

1.3 Particle filtering

In the framework of Bayesian variable selection, Markov Chain Monte Carlo or Particle Filtering (PF) type approaches have been proposed (Févotte and Godsill, 2006a; Févotte et al., 2008). These methods consist in sampling the posterior distribution over larger parameter ranges, which makes them more computationally demanding than their EM-like counterparts, but in return they offer increased robustness in convergence (i.e. reduced problems of convergence to local optima) and a complete Monte Carlo description of the parameter posterior density (Févotte and Godsill, 2006b,a; Févotte et al., 2008).
1.3.1 General Overview

Many problems in statistical signal processing (Fong et al., 2002; Andrieu et al., 2003; Vermaak et al., 2000) can be stated in a state space form as follows,

$$x_{t+1} \sim f(x_{t+1} \mid x_t) \tag{3}$$

$$y_{t+1} \sim g(y_{t+1} \mid x_{t+1}) \tag{4}$$

where $\{x_t\}$ are unobserved states of the system and $\{y_t\}$ are observations made over time $t$. $f(\cdot \mid \cdot)$ and $g(\cdot \mid \cdot)$ are pre-specified state evolution and observation densities. A primary concern in many state-space inference problems is the sequential estimation of the filtering distribution $p(x_t \mid y_{1:t})$, and the simulation of the entire smoothing distribution $p(x_{1:t} \mid y_{1:t})$, where $y_{1:t} = (y_1, y_2, \cdots, y_t)$ and $x_{1:t} = (x_1, x_2, \cdots, x_t)$. Updating of the filtering distribution can be achieved, in principle, using the standard filtering recursions (Robert and Casella, 1999)

$$p(x_{t+1} \mid y_{1:t}) = \int p(x_t \mid y_{1:t})\, f(x_{t+1} \mid x_t)\, dx_t \tag{5}$$

$$p(x_{t+1} \mid y_{1:t+1}) = \frac{g(y_{t+1} \mid x_{t+1})\, p(x_{t+1} \mid y_{1:t})}{p(y_{t+1} \mid y_{1:t})} \tag{6}$$

Smoothing can also be performed recursively backwards in time using the smoothing formula (Robert and Casella, 1999)

$$p(x_t \mid y_{1:T}) = \int p(x_{t+1} \mid y_{1:T})\, \frac{p(x_t \mid y_{1:t})\, f(x_{t+1} \mid x_t)}{p(x_{t+1} \mid y_{1:t})}\, dx_{t+1} \tag{7}$$

In practice, these filtering (eq. 5) and smoothing (eq. 7) computations can only be performed in closed form for linear Gaussian models using the Kalman filter/smoother, and for finite state-space hidden Markov models. In the case of non-linear non-Gaussian models, there is no general analytic expression for these density functions. As a consequence, an approximation strategy is required to estimate the filtering and smoothing densities, which is commonly performed with the PF method, also known as sequential Monte Carlo.
Within the PF framework, the filtering distribution is approximated with an empirical distribution formed from point masses, also called particles,

$$p(x_t \mid y_{1:t}) \approx \sum_{i=1}^{N} w_t^{(i)}\, \delta(x_t - x_t^{(i)}) \tag{8}$$

$$\sum_{i=1}^{N} w_t^{(i)} = 1, \qquad w_t^{(i)} \geq 0 \tag{9}$$

where $\delta(\cdot)$ is the Dirac delta function and $w_t^{(i)}$ is a weight attached to particle $x_t^{(i)}$. Given this particle approximation to the posterior distribution, we can estimate the expected value of any function $f$ with respect to this distribution, defined as $I(f_t) = \int f(x_t)\, p(x_t \mid y_{1:t})\, dx_t$, using the following Monte Carlo approximation

$$I(f_t) \approx \sum_{i=1}^{N} f(x_t^{(i)})\, w_t^{(i)} \tag{10}$$

Particle smoothers generate batched realisations of $p(x_{1:T} \mid y_{1:T})$ based on the forward PF results. In other words, particle smoothers are an efficient method for generating realisations from the entire smoothing density $p(x_{1:T} \mid y_{1:T})$ using the filtering approximation.

1.3.2 Filtering

We consider the filtering distribution $p(x_t \mid y_{1:t})$. Using Bayes' rule, this distribution can be rewritten as follows,

$$\begin{aligned}
p(x_t \mid y_{1:t}) &= p(x_t \mid y_t, y_{1:t-1}) && (11)\\
&\propto p(y_t \mid x_t, y_{1:t-1})\, p(x_t \mid y_{1:t-1}) && (12)\\
&\propto g(y_t \mid x_t)\, p(x_t \mid y_{1:t-1}) && (13)\\
&\propto \int g(y_t \mid x_t)\, f(x_t \mid x_{t-1})\, p(x_{1:t-1} \mid y_{1:t-1})\, dx_{1:t-1} && (14)
\end{aligned}$$

Assume that an unweighted particle approximation to $p(x_{1:t-1} \mid y_{1:t-1})$ has already been generated,

$$p(x_{1:t-1} \mid y_{1:t-1}) \approx \frac{1}{N} \sum_{i=1}^{N} \delta(x_{1:t-1} - x_{1:t-1}^{(i)}) \tag{15}$$

Then, assuming that $f(x_t \mid x_{t-1})$ and $g(y_t \mid x_t)$ can be evaluated pointwise, we generate, for each state trajectory $x_{1:t-1}^{(i)}$, a random sample $x_t^{(i)}$ from a proposal distribution $q(x_t \mid x_{1:t-1}^{(i)}, y_{1:t})$. Then, the weights $w_t$ of the filtering distribution (eq.
8) can be approximated, up to the normalization of eq. 9, by

$$w_t^{(i)} \propto \frac{g(y_t \mid x_t^{(i)})\, f(x_t^{(i)} \mid x_{t-1}^{(i)})}{q(x_t^{(i)} \mid x_{1:t-1}^{(i)}, y_{1:t})} \tag{16}$$

Finally, we perform a multinomial resampling step, such that the probability that $x_t^{(i)}$ is selected is proportional to $w_t^{(i)}$, to obtain an unweighted approximate random draw from the filtering distribution $p(x_t \mid y_{1:t})$. It is noteworthy that if the resampling step is omitted, a degeneracy phenomenon can occur: after a few iterations, all but one particle will have negligible weight. (Doucet et al., 2000) have shown that the variance of the importance weights can only increase over time, and thus it is impossible to avoid the degeneracy phenomenon. This degeneracy implies that a large computational effort is devoted to updating particles whose contribution is almost zero. As a result, a resampling step is needed to eliminate particles with small weights and generate a new set $\{x_t^{(i)}\}_i$, which is an i.i.d. (independent and identically distributed) sample from the approximate density $p(x_t \mid y_{1:t})$, with a resetting of the weights $\{w_t^{(i)}\}_i$ to $1/N$.

1.3.3 Smoothing

The entire smoothing density $p(x_{1:T} \mid y_{1:T})$ can be factorized as:

$$p(x_{1:T} \mid y_{1:T}) = p(x_T \mid y_{1:T}) \prod_{t=1}^{T-1} p(x_t \mid x_{t+1:T}, y_{1:T}) \tag{17}$$

Using the filter approximation (eq. 8) to $p(x_t \mid y_{1:t})$ and the Markovian assumptions of the model, we can write,

$$p(x_t \mid x_{t+1:T}, y_{1:T}) \propto p(x_t \mid y_{1:t})\, f(x_{t+1} \mid x_t) \approx \sum_{i=1}^{N} w_{t|t+1}^{(i)}\, \delta(x_t - x_t^{(i)}) \tag{18}$$

with the modified weights

$$w_{t|t+1}^{(i)} = \frac{w_t^{(i)}\, f(x_{t+1} \mid x_t^{(i)})}{\sum_{j=1}^{N} w_t^{(j)}\, f(x_{t+1} \mid x_t^{(j)})} \tag{19}$$

This revised particle distribution can be used to generate states successively in the reverse-time direction, conditioning upon future states.
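The filtering and smoothing steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the state model is a hypothetical one-dimensional Gaussian random walk observed in Gaussian noise (not the PLCA model), the proposal is taken equal to the transition density so that eq. 16 reduces to a weight proportional to $g(y_t \mid x_t)$, and all function names and default values are made up for the example.

```python
import math
import random

def gauss_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def bootstrap_filter(ys, n=500, q_std=0.5, r_std=1.0, seed=0):
    """Particle filter for x_t = x_{t-1} + N(0, q_std^2), y_t = x_t + N(0, r_std^2)."""
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(n)]  # assumed prior on the initial state
    history = []  # (particles, normalised weights) for every time step
    for y in ys:
        # Proposal = transition density f, so eq. 16 becomes w proportional to g(y_t | x_t)
        particles = [rng.gauss(x, q_std) for x in particles]
        w = [gauss_pdf(y, x, r_std) for x in particles]
        s = sum(w)
        w = [wi / s for wi in w]  # normalisation of eq. 9
        history.append((particles, w))
        # Multinomial resampling; the weights are implicitly reset to 1/N
        particles = rng.choices(particles, weights=w, k=n)
    return history

def filter_means(history):
    """Monte Carlo estimate of E[x_t | y_1:t] (eq. 10 with f the identity)."""
    return [sum(wi * xi for xi, wi in zip(p, w)) for p, w in history]

def backward_sample(history, q_std=0.5, seed=1):
    """Draw one trajectory from the smoothing density using eqs. 17-19."""
    rng = random.Random(seed)
    p, w = history[-1]
    x_next = rng.choices(p, weights=w, k=1)[0]
    traj = [x_next]
    for p, w in reversed(history[:-1]):
        # Modified weights of eq. 19; random.choices normalises them internally
        w_s = [wi * gauss_pdf(x_next, xi, q_std) for xi, wi in zip(p, w)]
        x_next = rng.choices(p, weights=w_s, k=1)[0]
        traj.append(x_next)
    traj.reverse()
    return traj
```

Calling `backward_sample` repeatedly with different seeds yields several realisations of the smoothing density from a single forward pass, which is the batched use of particle smoothers mentioned in Sec. 1.3.1.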
1.4 Our contributing work

The main objective of this paper is to propose an alternative formulation of current PLCA models applied to audio signals, replacing the EM algorithm by a more generic parameter estimation algorithm based on a PF method. We call this new algorithm PLCA-PF in the following. The main advantage expected from this new algorithm is the ability to scan the whole parameter space so as to take into account any features of the parameters, thus overcoming the limitations of current PLCA models underlined in our introduction. With regard to prior integration in particular, this new framework relaxes the constraints on the mathematical form and number of priors. This paves the way towards more complete modelings of the multi-faceted information carried by musical signals, covering both time (e.g. tempo and rhythm) and frequency (e.g. note spectra and chords) domains, and the different prior knowledge classes related to musicology, timbre and playing style.

2 Particle Filtering for PLCA

2.1 State space representation

Considering equations 1-2, the PLCA can be expressed as:

$$\begin{aligned}
P(x, t) &= \sum_{z} P(z, t)\, P(x \mid z) \\
&= \sum_{z_1, \ldots, z_K} P(z_1, \ldots, z_K, t) \prod_{j=1}^{J} P(x_j \mid z_1, \ldots, z_K) \\
&= \sum_{z_1, \ldots, z_K} P(z_K, t) \prod_{k=1}^{K-1} P(z_k \mid z_{k+1}, \ldots, z_K, t) \prod_{j=1}^{J} P(x_j \mid z_1, \ldots, z_K)
\end{aligned} \tag{20}$$

with:
• $z \in Z_1 \times \ldots \times Z_K$ is a vector of $K$ latent components $(z_1, \ldots, z_K)$, each $z_k$ taking values in a finite set $Z_k = \{1, \ldots, L_k\}$
• $t \in \{0, \ldots, T\}$ is the time variable
• $x \in X_1 \times \ldots \times X_J$ is a vector of $J$ features $(x_1, \ldots, x_J)$ where $X_j = \{1, \ldots, F_j\}$

In this decomposition, $P(z_K, t)$ can be seen as the activation distribution of the latent variable $z_K$, $P(z_k \mid z_{k+1}, \ldots, z_K, t)$ as the weight of the variable $z_k$ conditionally on $(z_{k+1}, \ldots, z_K)$ and P(x_j | z_1, ...
, z_K) as the J feature bases. To estimate the set of parameters $p_t = \{P(z_K, t),\, P(z_k \mid z_{k+1}, \ldots, z_K, t)\ \forall k \in \{1, \ldots, K-1\}\}$ at each time $t \in \{0, \ldots, T\}$, the model can be rearranged as a state space process

$$p_t \sim f(p_t \mid p_{t-1}) \tag{21}$$

$$y_t \sim g(y_t \mid p_t) \tag{22}$$

where $f$ is the state transition density function for $p_t$ defined above and $g$ the observation density of $y_t$.

2.2 Transition and observation densities

2.2.1 Transition density

Assuming that each latent variable $z_k \in Z_k$ is i.i.d., each marginal vector $P(z_K, t)$ and $P(z_k \mid z_{k+1}, \ldots, z_K, t)$ can be independently estimated. Recalling that at a given time $t$, $P(z_K, t)$ and $P(z_k \mid z_{k+1}, \ldots, z_K, t)$ represent distributions, Dirichlet priors are injected to ensure that their elements belong to $[0, 1]$ and sum to one, as follows, $\forall (z_2, \ldots, z_K) \in Z_2 \times \ldots \times Z_K$, $\forall k \in \{1, \ldots, K-1\}$

$$P(z_K, t) \sim \mathrm{Dir}(\theta_t^{1}, \ldots, \theta_t^{L_K}) \tag{23}$$

$$P(z_k \mid z_{k+1}, \ldots, z_K, t) \sim \mathrm{Dir}\big(\delta_k(z_k, \ldots, z_{K-1})_t^{1}, \ldots, \delta_k(z_k, \ldots, z_{K-1})_t^{L_K}\big) \tag{24}$$

where $\theta$ and $\delta_k$ are random variables representing the weight of each component of $z_K$ in $P(z_K, t)$ and $P(z_k \mid z_{k+1}, \ldots, z_K, t)$. That injection leads to the following hierarchical model

$$\begin{array}{ccc}
H_t & \rightarrow & H_{t+1} \\
\downarrow & & \downarrow \\
P_t & & P_{t+1} \\
\downarrow & & \downarrow \\
Y_t & & Y_{t+1}
\end{array} \tag{25}$$

with $H_t = (\Theta_t, \Delta_t)$ the new states defined by

$$\Theta_t = \{\theta_t^{z_K},\ \forall z_K \in Z_K\} \tag{26}$$

$$\Delta_t = \{\delta_k(z_k, \ldots, z_{K-1})_t^{z_K},\ \forall z_K \in Z_K\} \tag{27}$$

where we have defined

$$\theta_{t+1}^{z_K} = \theta_t^{z_K} \times \alpha_t^{z_K}, \qquad \alpha_t^{z_K} \sim \phi \tag{28}$$

$$\delta_k(z_k, \ldots, z_{K-1})_{t+1}^{z_K} = \delta_k(z_k, \ldots, z_{K-1})_t^{z_K} \times \gamma_t^{z_K}, \qquad \gamma_t^{z_K} \sim \psi_k \tag{29}$$

where $\phi$ and $\psi_k$ are distributions with positive support.

2.2.2 Observation density

$y_t$ has been defined as a representation of $x$ at time $t$.
In that state space approach, each component of $y_t$ is represented by the sum of the PLCA model and a white noise, $\forall x \in X_1 \times \ldots \times X_J$,

$$y_t(x) = P(x, t) + V_t = \sum_{z_1, \ldots, z_K} P_t(z_1, \ldots, z_K) \prod_{j=1}^{J} P_t(x_j \mid z_1, \ldots, z_K) + V_t \tag{30}$$

where $V_t \sim \mathcal{N}(0, \sigma^2)$. Denoting $\hat{y}_t$ the vector of components $P(x, t)$, the observation density $g$ follows a normal distribution, i.e. $g \sim \mathcal{N}(\hat{y}_t, \sigma^2)$.

2.3 Prior injection

To overcome limits of current music information retrieval systems, a practical engineering solution has been to use computational techniques from statistics and digital signal processing that allow the insertion of prior knowledge from different scientific disciplines (e.g. cognitive science, neuroscience, musicology, musical acoustics) (Engelmore and Morgan, 1988; Nawab and Lesser, 1992; Carver and Lesser, 1992; Ellis, 1996). Such systems perform a process of reconciliation between the observed acoustic features and the predictions of an internal model of the data-producing entities in the environment. This approach is close to the experience of human listeners, who perceive sound in view of different hierarchical levels of prior knowledge, using a collection of global properties, such as musical genre, tempo, and orchestration, as well as more instrument-specific properties, such as timbre. It has been widely applied in many music information retrieval tasks, such as genre recognition and automatic music transcription (Ellis, 1996; Godsmark and Brown, 1999; Bello, 2000; Ryynanen, 2004; Klapuri, 2004; Benetos et al., 2013b).

In mathematical terms, priors are used to sharpen the estimation of model parameters by emphasizing the most likely values in their distributions. This prior integration is performed during state generation, by re-weighting each particle value with a corresponding prior gain. Adding prior
Adding prior kno wledge on parameters p t leads up to sample from the p osterior distribu- tion, Cazau et al., Draft, p. 12 P ( p t | y t ) ∼ P ( y t | p t ) P ( p t ) (31) where P ( y t | p t ) identiﬁes to the observ ation densit y g and P ( p t ) the prior kno wledge. When the prior and the likelihoo d are conjugate, sampling from the p osterior distribution is rather straigh tforw ard. When the p osterior do es not ha v e a w ell kno wn form, as it is the case in most real-life applications, computational statistics metho ds can b e introduced to sample from the p os- terior. The Metropolis-Hasting algorithm ( Roberts et al. , 1997 ; Newman and Bark ema , 1999 ; Rob ert and Casella , 1999 ), based on Monte Carlo metho ds, brings a pow erful framew ork to tac kle that issue. This algorithm is a ran- dom w alk that uses an acceptance/rejection rule to conv erge to the sp eciﬁed target distribution, and pro ceeds as describ ed in the algorithm 1 . Algorithm 1 Metrop olis-Hasting algorithm for prior in tegration. 1: Draw a starting p oin t p 0 t , for whic h P ( p 0 t | y t ) > 0, from a starting distri- bution p 0 ( p t ) ; 2: for q = 1 , 2 , . . . do 3: Sample a proposal p ∗ t from a jumping distribution at iteration q , J q ( p ∗ t | p q − 1 t ) ; 4: Calculate the ratio of densities, r = P ( p ∗ t | y t ) /J q ( p ∗ t | p q − 1 t ) P ( p q − 1 t | y t ) /J q ( p q − 1 t | p ∗ t ) (32) 5: Set p q t =  p ∗ t with probabilit y min( r , 1) p q − 1 t otherwise (33) 6: end for Considering the ﬁltering particle framew ork deﬁned ab o v e, few remarks ab out the diﬀerent steps can b e highligh ted. First, the initial dra w is replaced b y the PF dra w w e w ant to interfere in. Ev en if the p osterior distribution is unkno wn, the ratio r can b e computed as the ratio of the pro duct of the lik eliho o d and the prior since the normalization constant is remo v ed in the ratio. Cazau et al., Draft, p. 
$$r = \frac{g(y_t \mid p_t^{*})\, P(p_t^{*}) \,/\, J_q(p_t^{*} \mid p_t^{q-1})}{g(y_t \mid p_t^{q-1})\, P(p_t^{q-1}) \,/\, J_q(p_t^{q-1} \mid p_t^{*})} \tag{34}$$

The jumping distribution $J_q$ is chosen as a normal distribution to simplify the ratio computation. Indeed, the symmetry of the normal distribution implies that $J_q$ cancels in eq. 34, giving

$$r = \frac{g(y_t \mid p_t^{*})\, P(p_t^{*})}{g(y_t \mid p_t^{q-1})\, P(p_t^{q-1})} \tag{35}$$

3 Application to AMT

We now propose an application of our PLCA-PF framework to the task of Automatic Music Transcription (AMT), and present evaluation results on this task with quantitative comparisons with other state-of-the-art methods.

3.1 Background on AMT

Work on AMT dates back more than 30 years, and has known numerous applications in the fields of music information retrieval, interactive computer systems, and automated musicological analysis (Klapuri, 2004). Due to the difficulty in producing all the information required for a complete musical score, AMT is commonly defined as the computer-assisted process of analyzing an acoustic musical signal so as to write down the musical parameters of the sounds that occur in it, which are basically the pitch, onset time, and duration of each sound to be played. This task of "low-level" transcription, to which we will restrict ourselves in this study, has interested more and more researchers from different fields (e.g. library science, musicology, machine learning, cognition), and has been a very competitive task in the Music Information Retrieval community (, 2007) since 2000. Despite this large enthusiasm for AMT challenges, and several audio-to-MIDI converters available commercially, perfect polyphonic AMT systems are out of reach of today's technology (Klapuri, 2004; Benetos et al., 2013b).
To overcome these limitations, a practical engineering solution has been to use computational techniques from statistics and digital signal processing that allow the insertion of prior knowledge from cognitive science, musicology and musical acoustics (Engelmore and Morgan, 1988; Ellis, 1996). This approach is close to human experience, in which the perception of sounds is embedded with prior knowledge, using a collection of global properties such as musical genre, tempo, and orchestration, as well as more specific properties, such as the timbre of a particular instrument.

3.2 Acoustic modeling

3.2.1 PLCA formalization

In the audio framework, PLCA views the input magnitude spectrogram of a sound source as a histogram of "sound quanta" across time and frequency, and models it as a linear combination of spectral vectors from a dictionary (Smaragdis et al., 2006). The PLCA method is thus based on the assumption that a suitably normalized magnitude spectrogram, $V$, can be modeled as a joint distribution over time and frequency, $P(f, t)$, where $f$ is the log-frequency index and $t$ the time index. This quantity can be factored into a frame probability $P(t)$, which can be computed directly from the observed data (i.e. energy spectrogram), and a conditional distribution over frequency bins $P(f \mid t)$, as follows

$$P(f, t) = P(t)\, P(f \mid t) \tag{36}$$

Spectrogram frames are then treated as repeated draws from an underlying random process characterized by $P(f \mid t)$.
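The factorization of eq. 36 can be illustrated with a short plain-Python sketch. The function name `factor_spectrogram` is hypothetical, and the uniform distribution assigned to silent frames is an assumption made here only to keep $P(f \mid t)$ well defined; the text does not specify how such frames are handled.

```python
def factor_spectrogram(V):
    """Split a non-negative spectrogram V[f][t] into P(t) and P(f|t) (eq. 36)."""
    F, T = len(V), len(V[0])
    total = sum(sum(row) for row in V)
    frame_energy = [sum(V[f][t] for f in range(F)) for t in range(T)]
    # Frame probability: share of the total energy carried by each frame
    P_t = [e / total for e in frame_energy]
    # Conditional distribution over frequency bins within each frame;
    # silent frames get a uniform distribution (assumption of this sketch)
    P_f_given_t = [
        [V[f][t] / frame_energy[t] if frame_energy[t] > 0 else 1.0 / F
         for t in range(T)]
        for f in range(F)
    ]
    return P_t, P_f_given_t
```

Multiplying `P_t[t]` by `P_f_given_t[f][t]` recovers the normalized spectrogram `V[f][t] / total` for every frame with non-zero energy.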
We can model this distribution with a mixture of latent factors related to polyphonic music transcription of single instruments as follows:

$$P(f \mid t) = \sum_{i, m} P(i \mid t)\, P(m \mid i, t)\, P(f \mid i, m) \tag{37}$$

where $P(f \mid i, m)$ are the spectral templates for pitch $i \in I$ (with $I$ the set of pitches, and $N_I$ the number of pitches) and playing mode $m \in M$ (with $M$ the set of playing modes, and $N_m$ the number of modes), $P(m \mid i, t)$ is the playing mode activation, and $P(i \mid t)$ is the pitch activation (i.e. the transcription). In this paper, the playing mode $m$ will refer to different dynamics of instrument playing (i.e. note loudness). (Smaragdis et al., 2008) extended the PLCA model of eq. 37 by exploiting the fact that in a constant-Q transform (CQT), a change of fundamental frequency is reflected by a simple frequency translation of its partials, resulting in a shift invariance over log-frequency. The proposed model, called Shift-Invariant PLCA (SIPLCA), then consists in shifting the templates $P(f \mid i, m)$ over the log-frequency range of the CQT, thus performing a multi-pitch detection with a frequency resolution higher than the MIDI scale. Eq. 37 is re-written as follows

$$P(f \mid t) = \sum_{i, \delta_f, m} P(i \mid t)\, P(m \mid i, t)\, P(\delta_f \mid i, t)\, P(f - \delta_f \mid i, m) \tag{38}$$

where $\delta_f$ is the pitch shifting factor. To constrain $\delta_f$ so that each sound state template is associated with a single pitch, the shifting occurs in a semitone range around the ideal position of each pitch. Thus, because we are using in this paper a log-frequency representation with a spectral resolution of 60 bins/octave, i.e. a 20 cent resolution, we have $\delta_f \in [-2:2]$, with $N_{\Delta f}$ the length of this set of values.

$$P(f, t) = \sum_{i, m, \delta_f} A_t(i)\, B_t(i, m)\, C_t(i, m, \delta_f)\, P(f - \delta_f \mid i, m) \tag{39}$$

In eq.
39, we also identify the different PF arguments where, at a given time $t$, $A_t$ is a vector of length $N_I$ representing the elements of the pitch activity matrix $P(i, t)$ (equal to $P(t)\, P(i \mid t)$, through Bayes' rule), $B_t$ is the $N_I \times N_m$ matrix whose coefficients $B_t(i, m)$ are the weights $P(m \mid i, t)$, and $C_t$ is the $N_I \times N_m \times N_{\Delta f}$ tensor corresponding to the spectral weight coefficients $P(\delta_f \mid i, t, m)$. The spectral shifted templates $P(f - \delta_f \mid i, m)$ are extracted from isolated note samples using a one-component PLCA, and are not updated. Finally, as in most spectrogram factorization-based transcription or pitch tracking methods (Grindlay and Ellis, 2011; Mysore and Smaragdis, 2009; Dessein et al., 2010), we use a simple threshold-based detection of the note activations from the pitch activity matrix $P(i, t)$, followed by a minimum duration pruning. The minimum duration threshold for pruning was set to 50 ms. The use of this simple thresholding method should allow one to better highlight the intrinsic differences between the AMT systems we will compare.

3.2.2 Particle filter argument

Since the spectral shifted templates $P(f - \delta_f \mid i, m)$ are learned, the set of unknown parameters is $\{A_t, B_t, C_t\}$, and $y_t = V_{ft}(\cdot, t) \in [0, 1]^F$ denotes the observations. As described in 2.2.1, each marginal vector $A_t$, $B_t(s)$ and $C_t(s, \delta_f)$ is independently forecasted through a Dirichlet distribution, as follows

$$A_t \sim \mathrm{Dir}(\theta_t^{1}, \ldots, \theta_t^{I}) \tag{40}$$

$$B_t(s) \sim \mathrm{Dir}(\delta_1(s)_t^{1}, \ldots, \delta_1(s)_t^{I}) \tag{41}$$

$$C_t(s, \delta_f) \sim \mathrm{Dir}(\delta_2(s, \delta_f)_t^{1}, \ldots, \delta_2(s, \delta_f)_t^{I}) \tag{42}$$

Concerning the $\phi$, $\psi_1$ and $\psi_2$ distributions producing the states $(\theta, \delta_1, \delta_2)$, we opt for a Gamma distribution to obtain a non-biased transition and to control the variance of the transition.
That choice leads to the following transition rules where, $\forall i, s, \delta_f$,

$$\theta_{t+1}^{i} = \theta_t^{i} \times \alpha_t^{i}, \qquad \alpha_t^{i} \sim \Gamma(a^{i}, b^{i}) \tag{43}$$

$$\delta_1(s)_{t+1}^{i} = \delta_1(s)_t^{i} \times \gamma_t^{i}, \qquad \gamma_t^{i} \sim \Gamma(c_s^{i}, d_s^{i}) \tag{44}$$

$$\delta_2(s, \delta_f)_{t+1}^{i} = \delta_2(s, \delta_f)_t^{i} \times \lambda_t^{i}, \qquad \lambda_t^{i} \sim \Gamma(e_{s,\delta_f}^{i}, f_{s,\delta_f}^{i}) \tag{45}$$

with hyperparameters $a^{i}$, $b^{i}$, $c_s^{i}$, $d_s^{i}$, $e_{s,\delta_f}^{i}$ and $f_{s,\delta_f}^{i}$, the second argument of $\Gamma$ being a rate. Conditionally on the current states $\theta_t$, $\delta_1(s)_t$ and $\delta_2(s, \delta_f)_t$,

$$\theta_{t+1}^{i} \sim \Gamma\!\left(a^{i}, \frac{b^{i}}{\theta_t^{i}}\right) \tag{46}$$

$$\delta_1(s)_{t+1}^{i} \sim \Gamma\!\left(c_s^{i}, \frac{d_s^{i}}{\delta_1(s)_t^{i}}\right) \tag{47}$$

$$\delta_2(s, \delta_f)_{t+1}^{i} \sim \Gamma\!\left(e_{s,\delta_f}^{i}, \frac{f_{s,\delta_f}^{i}}{\delta_2(s, \delta_f)_t^{i}}\right) \tag{48}$$

Recalling that the mean of $\theta_{t+1}^{i}$ is $a^{i} \theta_t^{i} / b^{i}$, a non-biased transition for $\theta$ requires $a^{i} = b^{i}$, since $E(\theta_{t+1}^{i} \mid \theta_t^{i}) = \theta_t^{i}$ is expected. By the same argument, $c_s^{i} = d_s^{i}$ and $e_{s,\delta_f}^{i} = f_{s,\delta_f}^{i}$. Figure 1 provides an example of piano-roll transcription output obtained with our PLCA-PF system.

3.3 Prior knowledge integration

A musical signal is highly structured, in both time and frequency domains. In the time domain, tempo and beat specify the range of likely note transition times. In the frequency domain, as audio signals are both additive and oscillatory (musical objects in polyphonic music superimpose and do not conceal each other), several notes played at the same time form chords, or polyphony (2), merging their respective spectral structures.

(2) Here polyphonic music refers to a signal where several sounds occur simultaneously, whereas in monophonic signals at most one note is sounding at a time.

When designing priors for an AMT system, one basically aims to help the system figure out "which notes are present at time t" and "by which ones they will be followed". These two types of information belong respectively to frame-wise spectral priors (e.g. sparseness, spectrum modeling including inharmonicity (Rigaud et al., 2013)) and to frame-to-frame temporal priors (e.g.
harmonic content transitions, smoothing of the spectrum envelope), and will both be developed in our PLCA-PF framework. In transcription systems with a general application (Emiya et al., 2010; Fuentes et al., 2013; Benetos and Dixon, 2013), prior knowledge is generally incorporated with no regard to its musical or physical meaning, but with the sole concern of optimizing convergence and enhancing transcription results on a specific musical corpus. As a consequence, priors mostly take the form of a single constant factor, set arbitrarily after simulation experiments. Here, we propose a more complex modeling of the musical signal with explicit music-related knowledge. To do so, we quantify relations of influence between the different pitches of the instrument pitch range, either within a frame (for spectral priors), or from one frame to the next (for temporal priors). These priors then take the form of a matrix $S$ of size $N_I \times N_I$, which quantifies average relations of influence between the $N_I \times N_I$ couples of pitches, and is defined as follows, $\forall (i, j) \in \{1, \cdots, N_I\}^2$,

$$S = \begin{pmatrix}
S(1, 1) & \cdots & S(1, N_I) \\
\vdots & S(i, j) & \vdots \\
S(N_I, 1) & \cdots & S(N_I, N_I)
\end{pmatrix} \tag{49}$$

The PLCA-PF framework allows a general insertion of this matrix through the term $P(p_t)$ of eq. 31, to which we can give the following form

$$P(p_t) \propto \exp(-p_t'\, S\, K_{p_t}) \tag{50}$$

where $p_t \in \{A_t, B_t(s), C_t(s, \delta_f)\}$, $\forall (s, \delta_f) \in \{1, \ldots, S\} \times \{1, \ldots, \Delta_f\}$, $p_t'$ denotes the transpose of $p_t$, and $K_{p_t}$ is a vector of length $N_I$ associated to $p_t$. It is noteworthy that simpler modelings of prior knowledge, such as a simple pitch-dependent vector, can also take the form of a diagonal matrix $S$ of size $N_I \times N_I$, with the vector values placed on the diagonal (the zero coefficients of $S$ provide a unitary prior value which does not affect particle weights).
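To make the role of the matrix prior concrete, the sketch below combines the prior of eq. 50 with the Gaussian observation density of Sec. 2.2.2 inside one Metropolis-Hastings move using the simplified ratio of eq. 35. Several choices are assumptions made only for this illustration and are not taken from the text: the vector $K_{p_t}$ is set equal to $p_t$ itself, the model output is a plain linear combination `W p` of fixed templates, $p_t$ is treated as a vector of non-negative activations, and proposals with a negative entry are rejected outright (zero prior density outside the support), which keeps the Gaussian jump symmetric. All function names are hypothetical.

```python
import math
import random

def log_prior(p, S, k):
    """Logarithm of eq. 50 up to an additive constant: -(p' S k)."""
    n = len(p)
    return -sum(p[i] * S[i][j] * k[j] for i in range(n) for j in range(n))

def log_lik(y, p, W, sigma):
    """Gaussian observation log-density (up to a constant) with mean W p."""
    y_hat = [sum(W[f][i] * p[i] for i in range(len(p))) for f in range(len(y))]
    return -sum((a - b) ** 2 for a, b in zip(y, y_hat)) / (2 * sigma ** 2)

def mh_step(p, y, W, S, sigma, step, rng):
    """One Metropolis-Hastings move with a symmetric Gaussian jump (eq. 35)."""
    prop = [x + rng.gauss(0.0, step) for x in p]
    if min(prop) < 0.0:
        return p, False  # outside the support: the target density is zero
    # Assumption: K_{p_t} is taken equal to the parameter vector itself
    log_r = (log_lik(y, prop, W, sigma) + log_prior(prop, S, prop)
             - log_lik(y, p, W, sigma) - log_prior(p, S, p))
    if rng.random() < math.exp(min(log_r, 0.0)):
        return prop, True
    return p, False
```

With an all-zero `S`, `log_prior` returns zero for any vector, so the acceptance ratio reduces to a pure likelihood ratio, in line with the remark above that zero coefficients give a unitary prior value.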
18 3.3.1 Sparse priors During the multi-pitc h estimation step of an AMT pro cess, a to o muc h large n um b er of non-zero activ ation scores is often observed, making the op eration of “ﬁnding the right notes” more diﬃcult. In order to ov ercome this ﬂaw, a sparseness prior can reduce the num b er of active notes p er frame in selecting the most salient ones. Previous works mostly use pitch-independent sparse prior in PLCA-EM algorithms. ( F uentes et al. , 2013 ) compute a sparseness prior P ( A t ) to constrain the impulse distribution A t , as follo ws P ( A t ) ∝ exp  − 2 β √ J || A t ) || 1 / 2  (51) with || A t || 1 / 2 = P i p A t ( i ) and β a p ositiv e h yp erparameter indicating the strength of the prior. With this prior, a n umerical ﬁxed p oin t algorithm is required to obtain a solution with the EM algorithm. Other works ( Grindla y and Ellis , 2011 ; Benetos and Dixon , 2011 , 2013 ) imp ose sparsity on the pitch activit y matrix and the pitch-wise source con tribution matrix b y mo difying EM equations. The common p oint to all these EM-based sparse priors is that they are pitch-independent, and rely on h yp er-parameters, whic h are either ar- bitrary set and/or optimized on a given sound dataset. In this pap er, w e deﬁne sparse priors informed by explicit m usical acoustics related knowl- edge. Musically , the o ccurrence of simultaneous notes can result either from “acoustic p olyphon y”, or from “musical polyphony”. ”Acoustics p olyphon y” is strongly related to the timbre of the instrument, and more precisely to the ph ysical phenomena of m utual resonances and note p ersistence. Although this t yp e of p olyphon y is an integral part of instrument tim bre, it represents a noise signal added to the actual pla y ed note from the p oin t of view of m usic transcription. F or what concerns “m usical p olyphony”, it corresp onds to the note combinations pla yed b y the musician and intended b y a comp oser with a prop er p olyphonic writing. 
It directly provides useful information about which notes are commonly played simultaneously in a musical piece. Prior knowledge can be learned from both of these polyphonic origins for an instrument repertoire, by studying respectively the timbre of the instrument or the frame-wise musical characteristics of the repertoire.

A first sparse prior $P_{spa_1}$ on note mixture likelihood has then been defined. Following the form of matrix $S$ (eq. 49), each coefficient is computed by a frame-wise counting of the pitches $j$ played simultaneously with pitch $i$ in our training MIDI transcripts (see Sec. 3.5.2 for details on the sound database).

We propose a second sparse prior $P_{spa_2}$ on mutual resonances. For the strings of bowed, plucked, or hammered instruments, mutual resonances result from sympathetic strings, which vibrate (and thereby sound a note) in sympathetic resonance with the note sounded near them by some other agent. Here, to compute each coefficient $S(i,j)$, we used two datasets of isolated notes: a first one composed of free-resonating notes, and a second one in which all strings were muted except the played one. For each note sample of these two datasets, the spectrum was computed with an FFT using a 4096-sample Hamming window after the onset, unit-normalized and labelled $X^{(d,i)}$ for pitch $i$ and dataset $d$ (equal to 1 or 2). We then used Algorithm 2 to get the scores $S(i,j)$.

Algorithm 2: Computation of the coefficients S(i,j)
1: for each pitch $i \in I$ do
2:   $\tilde{X}_i = \| X^{(1,i)} - X^{(2,i)} \|$
3:   Binary thresholding of $\tilde{X}_i$, i.e. $\tilde{X}_i(f) = 1$ for $f = \arg(\tilde{X}_i \geq 0.5)$, and $0$ otherwise
4:   for each pitch $j \in I$, $j \neq i$ do
5:     $S(i,j) = X^{(2,j)} \cdot \tilde{X}_i$, with $[\cdot]$ the element-wise product
6:   end for
7: end for

These two priors $P_{spa_1}$ and $P_{spa_2}$ are represented in figure 2 through their respective matrices $S_{spa_1}$ and $S_{spa_2}$.
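As an illustration only, Algorithm 2 could be sketched as follows in plain Python. Several points are assumptions made for this sketch and are not stated in the text: the norm in step 2 is read as a bin-wise absolute difference, the element-wise product of step 5 is summed over frequency bins to obtain a scalar score, and the function name `resonance_scores` and its arguments are hypothetical.

```python
def resonance_scores(free_spectra, muted_spectra, threshold=0.5):
    """Sketch of Algorithm 2.

    free_spectra[i]  : unit-normalised magnitude spectrum X^(1,i) (free-resonating strings)
    muted_spectra[i] : unit-normalised magnitude spectrum X^(2,i) (all other strings muted)
    Returns the matrix S as a list of lists; diagonal entries are left at 0.
    """
    n_pitches = len(free_spectra)
    scores = [[0.0] * n_pitches for _ in range(n_pitches)]
    for i in range(n_pitches):
        # Step 2 (assumed reading): bin-wise absolute difference between the two recordings.
        diff = [abs(a - b) for a, b in zip(free_spectra[i], muted_spectra[i])]
        # Step 3: binary mask of the bins where sympathetic energy is significant.
        mask = [1.0 if d >= threshold else 0.0 for d in diff]
        for j in range(n_pitches):
            if j == i:
                continue
            # Step 5 (assumption): element-wise product, summed over bins to give a scalar.
            scores[i][j] = sum(x * m for x, m in zip(muted_spectra[j], mask))
    return scores


if __name__ == "__main__":
    # Two hypothetical pitches described by four frequency bins each.
    free = [[1.0, 0.0, 0.8, 0.0], [0.0, 0.2, 1.0, 0.1]]
    muted = [[1.0, 0.0, 0.1, 0.0], [0.0, 0.2, 1.0, 0.0]]
    print(resonance_scores(free, muted))
```

In this toy input, playing pitch 0 with free strings adds energy in the bin where pitch 1 has its main partial, so only the entry S(0,1) is non-zero.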
Finally, for this sparse type of prior, the set of prior parameters $p_t$ in eq. 50 is equal to $A_t$ and $K_{p_t}$ is set to $A_t$, which gives

$$P_{spa}(A_t) \propto \exp(-A_t'\, S_{spa}\, A_t) \tag{52}$$

For prior $P_{spa_1}$, before injecting it in eq. 52, we normalized its matrix $S_{spa_1}$ with the operator $\Pi$ defined as

$$\Pi(x) = 1 - \frac{x}{\max(x)} \tag{53}$$

as we need knowledge rejecting the hypothesis of certain pitch combinations.

3.3.2 Sequential priors on harmonic transitions

Many previous works on AMT (Poliner and Ellis, 2007; Grindlay and Ellis, 2011; Benetos and Dixon, 2013) have used sequential priors to model the activity/inactivity phases of each pitch, using a two-state on/off HMM for each of them during a post-processing stage. This operation performs a time filtering of the note detection decision, which mainly avoids many isolated miss errors and smooths note boundaries. Musically, however, the information is very restricted, as it consists only in knowing how long a given pitch remains active, which can result from both the playing techniques of the musician and the vibratory properties of the instrument.

The sequential prior we present in this paper is defined as the probability of switching between two successive mixtures of notes in a musical piece. These transition probabilities are determined by sampling the training MIDI transcripts at the precise times corresponding to the analysis frames of the activation matrix, and simply checking for the presence of a note in each frame. These probabilities give us a global view of the usual and unusual harmonic transitions of an instrument repertoire. Previously unseen mixtures, i.e. those not encountered during the training phase, get a likelihood weighted according to the Witten-Bell discounting algorithm (Witten and Bell, 1991). This prior $P_{tra}$ is represented in figure 2 through its matrix $S_{tra}$.
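The quadratic-form priors of this section (eq. 50, with eq. 52 as the sparse case) and the normalisation operator of eq. 53 can be sketched in a few lines of plain Python. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, the maximum in eq. 53 is assumed to be taken over the whole matrix, and the multiplicative update of particle weights in `reweight` is an assumption about how the prior term enters the filter.

```python
import math


def pi_normalise(matrix):
    """Eq. 53 applied entry-wise, 1 - x / max(x), with the max over the whole matrix (assumed)."""
    peak = max(max(row) for row in matrix)
    if peak <= 0:
        raise ValueError("matrix needs at least one positive entry")
    return [[1.0 - value / peak for value in row] for row in matrix]


def log_prior(p, matrix, k):
    """Unnormalised log of eq. 50: -(p' S k). With k = p this is the sparse prior of eq. 52."""
    return -sum(p[i] * matrix[i][j] * k[j]
                for i in range(len(p)) for j in range(len(k)))


def reweight(weights, log_priors):
    """Multiply particle weights by exp(log prior) and renormalise (assumed usage)."""
    shift = max(log_priors)  # subtracting the max avoids underflow in exp
    scaled = [w * math.exp(lp - shift) for w, lp in zip(weights, log_priors)]
    total = sum(scaled)
    return [s / total for s in scaled]


if __name__ == "__main__":
    # Hypothetical co-occurrence counts for two pitches.
    counts = [[10.0, 5.0], [5.0, 0.0]]
    s_spa = pi_normalise(counts)
    activations = [0.5, 0.5]
    print(s_spa, log_prior(activations, s_spa, activations))
```

After normalisation, frequently observed pitch pairs receive coefficients close to 0 (no penalty), while rare pairs receive coefficients close to 1 and are penalised by the exponential.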
Finally, for this sequential type of prior, $p_t$ in eq. 50 is equal to $A_t$ and $K_{p_t}$ is set to $A_{t-1}$, which leads to

$$P_{tra}(A_t) \propto \exp(-A_t'\, S_{tra}\, A_{t-1}) \tag{54}$$

3.3.3 Prior combination

The PLCA-PF framework offers an easy-to-implement, unifying way of integrating priors from both the time and frequency domains. In this framework, priors are injected during the filtering process through eq. 31, and modify the parameters without disturbing their generation. In the set of parameters $p_t$, the independence between the parameters $p^n_t$ leads to

$$P(p_t) \propto \prod_n P(p^n_t) \tag{55}$$

Within a given parameter $p^n_t$, the general prior $P(p^n_t)$ can be seen as the product of the different priors $P^n_{prior}$ associated to $p^n_t$:

$$P(p^n_t) \propto \prod_{prior} P^n_{prior}(p^n_t) \tag{56}$$

Using equations 50, 55 and 56, we combine the different priors, characterized by their respective matrices $S_{prior}$, as follows

$$P(p_t) \propto \exp\Big(-\sum_n \sum_{prior} {p^n_t}'\, S^n_{prior}\, K^n_{prior}\Big) \tag{57}$$

3.4 Practical implementation

As a time-frequency representation, all sampled input signals undergo a constant-Q transform with 60 bins/octave, a window size of 23 ms (1024 coefficients at a 44.1-kHz sampling rate) and a 50 % hop, which is adequate for the tonal part of the signal. We now present the different algorithms implemented in our PF framework.

3.4.1 Filtering algorithm

As in (Fong et al., 2002), we develop a generic PF algorithm assuming that the proposal distribution is $q(x_t | x^{(i)}_{1:t-1}, y_{1:t}) = f(x_t | x^{(i)}_{t-1})$, as detailed in Algorithm 3. The resampling step is processed as detailed in Algorithm 4.

Algorithm 3: Generic particle filtering with multinomial resampling
1: Let $f(x_1 | x_0) = f(x_1)$ be the state prior distribution. Then for $t = 1$ to $T$:
2: $\forall i \in \{1, \cdots, N\}$, generate $N$ samples from the proposal $q(x_t | x^{(i)}_{1:t-1}, y_{1:t}) = f(x_t | x^{(i)}_{t-1})$: $x^{(i)}_t \sim f(x_t | x^{(i)}_{t-1})$
3: $\forall i \in \{1, \cdots, N\}$, evaluate the importance weights and normalise: $w^{(i)}_t \propto g(y_t | x^{(i)}_t)$, $\sum_{i=1}^{N} w^{(i)}_t = 1$
4: Resample $\{x^{(i)}_t ; i = 1, \cdots, N\}$ $N$ times with replacement.

Algorithm 4: Multinomial resampling
1: Initialize the CDF: $c_1 = 0$
2: $\forall i \in \{2, \cdots, N\}$
   • Construct the CDF: $c_i = c_{i-1} + w^{(i)}_t$
3: Start at the bottom of the CDF: $i = 1$
4: Draw a starting point: $u_1 \sim U[0, \frac{1}{N}]$
5: $\forall j \in \{1, \cdots, N\}$
   • Move along the CDF: $u_j = u_1 + \frac{j-1}{N}$
   • While $u_j > c_i$, $i = i + 1$
   • Assign sample: $x^{(j)*}_t = x^{(i)}_t$
   • Assign weight: $w^{(j)}_t = 1/N$
6: Return $\{x^{(k)*}_t, w^{k}_t\}_{k=1}^{N}$.

3.4.2 Smoothing algorithm

After having generated weighted particles $\{x^{(i)}_t, w^{(i)}_t ; i = 1, \cdots, N, t = 1, \cdots, T\}$ from the PF, the smoothing algorithm proceeds as detailed in Algorithm 5.

Algorithm 5: Generic particle smoother
1: Choose $\tilde{x}_T = x^{(i)}_T$ with probability $w^{(i)}_T$.
2: For $t = T-1$ to $1$:
   • Calculate $w^{(i)}_{t|t+1} \propto w^{(i)}_t f(\tilde{x}_{t+1} | x^{(i)}_t)$ for each $i = 1, \cdots, N$;
   • Choose $\tilde{x}_t = x^{(i)}_t$ with probability $w^{(i)}_{t|t+1}$.
3: $\tilde{x}_{1:T} = (\tilde{x}_1, \tilde{x}_2, \cdots, \tilde{x}_T)$ is an approximate realisation from $p(x_{1:T} | y_{1:T})$.

3.5 Evaluation procedure

3.5.1 Evaluated AMT systems

To comparatively evaluate the transcription performance of our PLCA-PF algorithm, we tested different algorithms on the same test dataset. Table 1 provides an overview of these algorithms. HALCA is short for the Harmonic Adaptive Latent Component Analysis algorithm (Fuentes et al., 2013; see footnote 3).
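For readers who want to experiment with the filtering step of Section 3.4.1, the sketch below combines Algorithm 3 with the single-draw resampling scheme of Algorithm 4 (often called systematic resampling in the literature) in plain Python. It is a minimal illustration under stated assumptions: the CDF is built as the plain cumulative sum of the normalised weights, states are scalars, and the toy model (uniform prior, multiplicative gamma transition with unit mean, Gaussian likelihood, and all constants) is hypothetical and is not the PLCA model of this paper.

```python
import math
import random


def systematic_resample(weights, rng):
    """Ancestor indices drawn with one uniform draw and N evenly spaced points."""
    n = len(weights)
    cdf, running = [], 0.0
    for w in weights:  # cumulative sums of the normalised weights
        running += w
        cdf.append(running)
    u1 = rng.uniform(0.0, 1.0 / n)
    indices, i = [], 0
    for j in range(n):
        u = u1 + j / n
        # Move along the CDF; the bound on i guards against round-off at the top.
        while i < n - 1 and u > cdf[i]:
            i += 1
        indices.append(i)
    return indices


def bootstrap_filter(observations, n_particles, sample_prior, sample_transition,
                     likelihood, rng):
    """Particle filter using the transition density as proposal; returns filtered means."""
    particles = [sample_prior(rng) for _ in range(n_particles)]
    means = []
    for t, y in enumerate(observations):
        if t > 0:
            particles = [sample_transition(x, rng) for x in particles]
        weights = [likelihood(y, x) for x in particles]
        total = sum(weights)
        if total <= 0.0:
            raise ValueError("all particle weights are zero")
        weights = [w / total for w in weights]
        means.append(sum(w * x for w, x in zip(weights, particles)))
        particles = [particles[k] for k in systematic_resample(weights, rng)]
    return means


# Hypothetical toy model, chosen only to make the sketch runnable.
SHAPE = 50.0
NOISE_STD = 0.3


def toy_prior(rng):
    return rng.uniform(0.1, 5.0)


def toy_transition(x, rng):
    # gammavariate takes (shape, scale); scale = 1 / shape gives a unit-mean multiplier.
    return x * rng.gammavariate(SHAPE, 1.0 / SHAPE)


def toy_likelihood(y, x):
    return math.exp(-0.5 * ((y - x) / NOISE_STD) ** 2)


if __name__ == "__main__":
    estimates = bootstrap_filter([2.0] * 20, 500, toy_prior, toy_transition,
                                 toy_likelihood, random.Random(1))
    print(round(estimates[-1], 2))
```

The cost of one filtering step in this sketch grows with the number of particles, since every particle is propagated, weighted and resampled once per frame.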
In HALCA, each note in a constant-Q transform is locally modeled as a weighted sum of fixed narrowband harmonic spectra, spectrally convolved with some impulse that defines the pitch. All parameters are estimated by means of the EM algorithm, in the PLCA framework. This algorithm was recently evaluated in MIREX and obtained the second-best scores in the Multi-Pitch Estimation task, 2009-2012 (MIREX, 2007). The algorithm PLCA-EM is the one proposed by (Benetos et al., 2013a), whose main characteristic is its use of pre-defined templates, which avoids updating the templates in the maximization step of the EM algorithm. This algorithm is also state-of-the-art (best scores in the Multi-Pitch Estimation task, 2009-2012; MIREX, 2007). PLCA-DAEM is the same as PLCA-EM, only replacing the EM algorithm by a deterministic annealing EM (DAEM) algorithm; (Cheng et al., 2013) observed significant improvements in their transcriptions through this modification.

Method name         References                 Parameter estimation   Priors
------------------  -------------------------  ---------------------  ---------------------------------
HALCA               (Fuentes et al., 2013)     EM                     Sparsity + Continuity + Unimodal
PLCA-EM             (Benetos et al., 2013a)    EM                     Sparsity
PLCA-DAEM           (Cheng et al., 2013)       DAEM                   Sparsity
PLCA-PF             Proposed                   PF                     none
PLCA-PF + priors    Proposed                   PF                     P_spa1, P_spa2, P_tra

Table 1 – Summary of the AMT systems tested in our simulation experiments.

3.5.2 Musical corpus

To test our AMT system and train the proposed sparse priors, we need three different sound corpora: audio musical pieces of an instrument repertoire, the corresponding scores in the form of MIDI files, and a complete dataset of isolated notes for this instrument. We use two different decay instruments for evaluation, namely the classical piano and the marovany zither from Madagascar. For the piano, audio data was extracted from the MAPS database (Emiya et al., 2010), which is composed of high-quality note samples and recordings from a real upright piano, whose MIDI scores have been automatically compiled using the Disklavier technology. For the marovany, sound templates and musical pieces were extracted from personal recordings made in our laboratory. Pieces were transcribed with an original multi-sensor retrieval system (Cazau et al., 2013).

Footnote 3: Codes are available at http://www.benoit-fuentes.fr/.

From these sound databases, we extracted different sets of training and test data, as our priors must be trained using automatically generated knowledge from MIDI files and template datasets. To do so, we first divided each musical piece into 15-second sequences, which provided us with a total of 1.2 and 0.83 hours of audio for the piano and marovany datasets, respectively. Within each dataset, the musical sequences were randomly split into training and testing sequences, using by default 30 % of the sequences for testing and the remaining 70 % for training. In our simulation experiments, this procedure is repeated five times, and an average is computed over the resulting scores. To prevent any overfitting, we carefully distinguished between training and test data. In particular, the sound templates used in the PLCA model were extracted from an instrument model different from the one used in the recordings. Also, sequences used to train the priors were not used for evaluation.

3.5.3 Error metrics

To assess the performance of our proposed transcription system, we adopt a note-oriented approach, according to which a note event is assumed to be correct if its onset is within a 50 ms range of a ground-truth onset (i.e. the standard tolerance commonly used (Bello et al., 2005; MIREX, 2007)).
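The note-level matching rule just described, together with the recall, precision and F-measure defined in eqs. 58-60, can be sketched as follows in plain Python. This is an illustrative sketch rather than the evaluation code used for the experiments: the function names are hypothetical, a match is additionally required to have the same pitch (an assumption of this sketch), and the one-to-one matching is a simple greedy pass rather than an optimal assignment.

```python
def match_notes(reference, estimated, tolerance=0.05):
    """Greedy one-to-one matching of (onset_seconds, pitch) pairs.

    Returns the counts (true positives, false positives, false negatives).
    """
    unmatched = sorted(reference)
    true_pos = 0
    for onset, pitch in sorted(estimated):
        for k, (ref_onset, ref_pitch) in enumerate(unmatched):
            # Assumption: same pitch, and onset within the tolerance window.
            if ref_pitch == pitch and abs(ref_onset - onset) <= tolerance:
                del unmatched[k]  # each reference note can be matched only once
                true_pos += 1
                break
    false_pos = len(estimated) - true_pos
    false_neg = len(unmatched)
    return true_pos, false_pos, false_neg


def recall_precision_f(true_pos, false_pos, false_neg):
    """Recall (TPR), precision (PPV) and their harmonic mean (F-measure)."""
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    if precision + recall == 0.0:
        return recall, precision, 0.0
    return recall, precision, 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical notes: (onset in seconds, MIDI pitch).
    reference = [(0.00, 60), (0.50, 64), (1.00, 67)]
    estimated = [(0.02, 60), (0.58, 64), (1.01, 67), (1.50, 72)]
    counts = match_notes(reference, estimated)
    print(counts, recall_precision_f(*counts))
```

In the toy example, the second estimated note is 80 ms late and therefore counts both as a false positive and, for its reference note, as a false negative.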
Such a tolerance level is considered to be a fair margin for an accurate transcription, although it is far more tolerant than human ears, which are able to distinguish between two onsets as close as 10 ms apart (Moore, 1997).

Evaluation metrics are defined by equations 58-60 (MIREX, 2007), resulting in the note-based recall (TPR), precision (PPV) and F-measure (the harmonic mean of precision and recall):

$$\mathrm{TPR} = \frac{\sum_{n=1}^{N} \mathrm{TP}[n]}{\sum_{n=1}^{N} \mathrm{TP}[n] + \mathrm{FN}[n]} \tag{58}$$

$$\mathrm{PPV} = \frac{\sum_{n=1}^{N} \mathrm{TP}[n]}{\sum_{n=1}^{N} \mathrm{TP}[n] + \mathrm{FP}[n]} \tag{59}$$

$$\text{F-measure} = \frac{2 \cdot \mathrm{PPV} \cdot \mathrm{TPR}}{\mathrm{PPV} + \mathrm{TPR}} \tag{60}$$

where $N$ is the total number of notes, and TP, FP and FN stand for the well-known True Positive, False Positive and False Negative detections. The recall is the ratio between the number of relevant and original items, and the precision is the ratio between the number of relevant and detected items. For all these evaluation metrics, a value of 1 represents a perfect match between the estimated transcription and the reference one.

3.6 Results and discussion

In the following, we present simulation experiments on parameter initialization dependency and transcription performance, comparing our proposed PF-based algorithm with three other state-of-the-art algorithms (see table 1).

3.6.1 Computational time

Figure 3 shows the computational time of our PLCA-PF system on a 15-s test musical sequence against the number of particles. Both the computational time and the transcription performance increase with the number of particles used. It is well known that particle filtering is very demanding in computation time, which increases exponentially with the number of particles.
In its current form, our system is not very efficient computationally, as it produces a transcription in about 50 times real time on a PC (e.g., a 15-s recording requires 12.5 min) with 2000 particles. Increasing the number of particles also increases transcription accuracy, raising the average F-measure by as much as 14 % between 10 and 1000 particles; this gain begins to stagnate after 5000 particles. This tendency is observed regardless of the instrument repertoire. A trade-off between computation time and transcription precision must therefore be considered.

3.6.2 Parameter initialization dependency

Figure 4 compares the dependency of transcription outputs on parameter initialization. These results are computed from 40 simulation trials on one test sequence with a randomized parameter initialization. Black bars indicate the reference scores of each system, obtained with a uniform initialization of parameters, as is commonly done by default (e.g. Fuentes et al., 2013, Paragraph V.A.1). We observe that the proposed systems including particle filtering are globally more robust to parameter initialization than the other EM- or DAEM-based systems, which present an important variability in the average F-measure (e.g., ±2.7 % for the PLCA-EM model on the marovany repertoire). Depending on the sound dataset under evaluation, the set of initialization parameters may thus be sub-optimal, with performance losses as high as 5 % in the average F-measure. Ultimately, making AMT systems less dependent on data should favour their generalization to the diversity of music.

3.6.3 Transcription performance

Table 2 compares the transcription performance of our different AMT systems through the different error metrics, for the piano and the marovany repertoires, respectively. Both of them present a complex polyphonic structure.
The marovany repertoire is characterized by fast arpeggios, with an ample halo-like sound with rich overtones due to the complex resonating behavior of the instrument. The classical piano repertoire presents more complex and richer chord transitions, with different playing techniques and dynamics which continuously affect the timbre of the instrument. For both of these repertoires, the improvements brought by PF-based systems in transcription performance are likely related to the EM limitations mentioned in our Introduction, which make EM-based algorithms less efficient at finding active notes in complex polyphonic signals because of convergence to local maxima. Also, the standard deviations show that our proposed algorithm presents the minimum value (2.3 %), which implies a higher robustness when transcribing polyphonic signals with different musical features.

                      Piano                        Marovany
Methods               TPR    PPV    F-measure      TPR    PPV    F-measure
--------------------  -----  -----  ---------      -----  -----  ---------
HALCA                 55.2   59.7   57.3           55.1   57.6   56.3
PLCA-EM               57.2   62.8   59.8           55.6   59.7   57.6
PLCA-DAEM             59.1   62.9   60.9           56.1   58.3   57.2
PLCA-PF               57.9   62.2   59.9           57.7   60.1   58.8
PLCA-PF + priors      61.2   62.5   61.8           58.2   60.9   59.5

Table 2 – Mean transcription error metrics (in %) for the piano and marovany recordings with our different AMT systems.

4 Conclusion

Current PLCA-based systems for AMT use the well-known EM algorithm to estimate the model parameters. This algorithm presents well-known inherent drawbacks (local convergence, initialization dependency), making EM-based systems limited in their applications to AMT, particularly with regard to the mathematical form and number of priors. To overcome such limits, we have developed in this paper a different estimation framework based on Particle Filtering, which consists in sampling the posterior distribution over larger parameter ranges.
This framework proves to be more robust in parameter estimation, and more flexible and unifying in the integration of prior knowledge in the system. It makes it possible to inject more complex musicological knowledge, as well as to combine simultaneously a theoretically infinite number of priors. Our proposed Particle-Filtering systems achieve promising rankings in terms of accuracy rate, and further experiments will be necessary to confirm these preliminary results.

Acknowledgements

At the risk of omitting some relevant names, the authors would like to especially thank Marc Chemillier (CAMS-EHESS) for his help in recording the marovany, and Laurent Quartier (LAM-UPMC) for technical support.

References

MIREX (2007). "Music information retrieval evaluation exchange (MIREX)" (2011). Available at http://music-ir.org/mirexwiki/ (date last viewed January 9, 2015).

Andrieu, C., Davy, M., and Doucet, A. (2003). "Efficient particle filtering for jump Markov systems. Application to time-varying autoregressions." IEEE Trans. on Signal Proc., 51, 1762–1770.

Bello, J.P., Daudet, L., Abdallah, S., Duxbury, C., Davies, M., and Sandler, M.B. (2005). "A tutorial on onset detection in music signals." IEEE Trans. on Speech and Audio Proc., 13, 1035–1047.

Bello, J.P. and Sandler, M.B. (2000). "Blackboard systems and top down processing for the transcription of simple polyphonic music." In Proceedings of the International Conference on Digital Audio Effects (DAFx).

Benetos, E., Cherla, S., and Weyde, T. (2013a). "An efficient shift-invariant model for polyphonic music transcription." In 6th Int. Workshop on Machine Learning and Music, Prague, Czech Republic.

Benetos, E. and Dixon, S. (2011). "Multiple-instrument polyphonic music transcription using a convolutive probabilistic model." In Proc. 8th Sound and Music Computing Conf. pp. 19–24.

Benetos, E. and Dixon, S. (2013). "Multiple-instrument polyphonic music transcription using a temporally constrained shift-invariant model." J. Acoust. Soc. Am., 133, 1727–1741.

Benetos, E., Dixon, S., Giannoulis, D., Kirchhoff, H., and Klapuri, A. (2013b). "Automatic music transcription: Challenges and future directions." J. of Intelligent Information Systems, 41, 407–434.

Carver, N. and Lesser, V. (1992). Symbolic and Knowledge-Based Signal Processing (New York: Prentice Hall), chap. Blackboard systems for knowledge-based signal understanding.

Cazau, D., Chemillier, M., and Adam, O. (2013). "Information retrieval of marovany zither music with an original optical-based system." In Proceedings of DAFx 2013, Maynooth, Ireland. pp. 1–6.

Cheng, T., Dixon, S., and Mauch, M. (2013). "A deterministic annealing algorithm for automatic music transcription." In 14th International Society for Music Information Retrieval Conference, Curitiba, PR, Brazil.

Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977). "Maximum likelihood from incomplete data via the EM algorithm." Journal of the Royal Statistical Society, Series B, 39, 1–38.

Dessein, A., Cont, A., and Lemaitre, G. (2010). "Real-time polyphonic music transcription with nonnegative matrix factorization and beta-divergence." In 11th International Society for Music Information Retrieval Conference, Utrecht, Netherlands. pp. 489–494.

Doucet, A., Godsill, S., and Andrieu, C. (2000). "On sequential Monte Carlo sampling methods for Bayesian filtering." Statistics and Computing, 10, 197–208.

Ellis, D.P.W. (1996). "Prediction-driven computational auditory scene analysis." Ph.D. thesis, Massachusetts Institute of Technology.

Emiya, V., Badeau, R., and Richard, G. (2010). "Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle." IEEE Trans. on Audio, Speech, Lang. Proc., 18, 1643–1654.

Engelmore, R. and Morgan, A., eds. (1988). Blackboard Systems (Addison-Wesley Longman Publishing (Boston, MA, USA)), 602 pp.

Févotte, C. and Godsill, S. (2006a). "A Bayesian approach for blind separation of sparse sources." IEEE Trans. Audio Speech Language Processing, 14, 2174–2188.

Févotte, C. and Godsill, S. (2006b). "Sparse linear regression in unions of bases via Bayesian variable selection." IEEE Signal Processing Letters, 13, 441–444.

Févotte, C., Torrésani, B., Daudet, L., and Godsill, S. (2008). "Sparse linear regression with structured priors and application to denoising of musical audio." IEEE Trans. Audio Speech Language Processing, 16, 174–185.

Fong, W., Godsill, S., Doucet, A., and West, M. (2002). "Monte Carlo smoothing with application to audio signal enhancement." IEEE Transactions on Signal Processing, 50, 438–449.

Fuentes, B., Badeau, R., and Richard, G. (2013). "Harmonic adaptive latent component analysis of audio and application to music transcription." IEEE Trans. on Audio Speech Lang. Processing, 21, 1854–1866.

Godsmark, D. and Brown, G.J. (1999). "A blackboard architecture for computational auditory scene analysis." Speech Communication, 27, 351–366.

Grindlay, G. and Ellis, D.P.W. (2010). "A probabilistic subspace model for multi-instrument polyphonic transcription." In 11th International Society for Music Information Retrieval Conference, Utrecht, Netherlands.

Grindlay, G. and Ellis, D.P.W. (2011). "Transcribing multi-instrument polyphonic music with hierarchical eigeninstruments." IEEE J. Sel. Topics Signal Proc., 5, 1159–1169.

Hoffman, M.D., Blei, D.M., and Cook, P.R. (2009). "Finding latent sources in recorded music with a shift-invariant HDP." In 12th Int. Conference on Digital Audio Effects (DAFx-09), Como, Italy.

Hofmann, T. (1999). "Probabilistic latent semantic indexing." In 22nd Annual International SIGIR Conference on Research and Development in Information Retrieval.

Klapuri, A. (2004). "Automatic music transcription as we know it today." J. of New Music Research, 33, 269–282.

Moore, B.C.J. (1997). An Introduction to the Psychology of Hearing (New York: Academic), 441 pp.

Mysore, G.J. and Smaragdis, P. (2009). "Relative pitch estimation of multiple instruments." In International Conference on Acoustical Speech and Signal Processing, Taipei, Taiwan. pp. 313–316.

Nawab, S.H. and Lesser, V. (1992). Symbolic and Knowledge-Based Signal Processing (New York: Prentice Hall), chap. Integrated processing and understanding of signals.

Newman, M.E. and Barkema, G.T. (1999). Monte Carlo Methods in Statistical Physics (USA: Oxford University Press).

Poliner, G. and Ellis, D. (2007). "A discriminative model for polyphonic piano transcription." J. on Advances in Signal Proc., 8, 1–9.

Rigaud, F., Falaize, A., David, B., and Daudet, L. (2013). "Does inharmonicity improve an NMF-based piano transcription model?" In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 11–15.

Robert, C.P. and Casella, G. (1999). Monte Carlo Statistical Methods (Springer Science+Business Media New York).

Roberts, G.O., Gelman, A., and Gilks, W.R. (1997). "Weak convergence and optimal scaling of random walk Metropolis algorithms." Ann. Appl. Probab., 7(1), 110–120.

Ryynanen, M. (2004). "Probabilistic modelling of note events in the transcription of monophonic melodies." Master's thesis, Tampere University.

Smaragdis, P., Raj, B., and Shashanka, M. (2006). "A probabilistic latent variable model for acoustic modeling." In Neural Information Proc. Systems Workshop, Whistler, BC, Canada.
Smaragdis, P., Raj, B., and Shashanka, M. (2008). "Sparse and shift-invariant feature extraction from non-negative data." In International Conference Acoustical Speech and Signal Processing, Las Vegas, NV. pp. 2069–2072.

Ueda, N. and Nakano, R. (1998). "Deterministic annealing EM algorithm." Neural Networks, 11, 271–282.

Vermaak, J., Andrieu, C., and Doucet, A. (2000). "Particle filtering for non-stationary speech modelling and enhancement." In 6th International Conference on Spoken Language Processing. pp. 594–597.

Witten, I.H. and Bell, T.C. (1991). "The zero-frequency problem: estimating the probabilities of novel events in adaptive text compression." IEEE Transactions on Information Theory, 37, 1085–1094.

Figure 1 – Illustration of different stages of our PF-PLCA system on a test musical sequence, with from top to bottom: ground truth, pitch activity matrix P(i,t) and piano-roll transcription output.

Figure 2 – Illustration of the inter-pitch influence matrix S for chord content (on the left) and sympathetic resonances (on the right).

[Figure 3 plot. Horizontal axis: Number of Particles (10, 100, 1000, 5000, 10000). Left vertical axis: Computational Time (x 15 s), 0 to 500. Right vertical axis: Gain in the Average F-measure (%), 0 to 20.]

Figure 3 – Computational time of our PLCA-PF system on a 15-s test musical sequence against the number of particles.

[Figure 4 plots. Vertical axis: Average F-measure (%), 50 to 65. Systems: HALCA, PLCA-EM, PLCA-DAEM, PLCA-PF, PLCA-PF+priors.]

Figure 4 – Variances in transcription performance using random initialization of system parameters, for the piano (top graph) and the marovany (bottom graph) repertoires. Black bars indicate the reference scores of each system, obtained with a uniform initialization of parameters, as is commonly done by default (e.g. Fuentes et al., 2013, Paragraph V.A.1).
