Complex ISNMF: a Phase-Aware Model for Monaural Audio Source Separation

1 Comple x ISNMF: a Phase-A wa re Model for Monaural Audio Source Se paration Paul Magron, T uomas V irtanen, Senior Member , IEEE Abstract —This paper introduces a phase-awar e p robabilistic model fo r aud io source separation. Classical source models in the short-time F ourier transf orm domain use circularly-symmetric Gaussian or Poisson random va riables. This is equivalent to assuming that the ph ase of each source is uniform ly distributed, which is n ot suitabl e for exploiting the und erlying structure of the phase. Drawing o n preliminary wo rks, we introduce her e a Bayesian anisotropic Gaussian source model in which th e phase is no longer un iform . Such a model permits us to fav or a ph ase value that originates from a signal mo del through a Markov chain prior structure. The v ariance of the latent variables are structured with nonnegativ e matrix factorization (NMF). The resulting model is called complex Itaku ra-Saito NMF (ISNM F) since it generalizes th e ISNMF model to t he case of non-isotropic variables. It combin es the advantages of IS NMF , wh ich u ses a distortion measure adapted to audio and yield s a set of estimates which preserv e the ov erall energy of the mixture, and of complex NMF , which enables one to account for some p hase constrain ts. W e d eriv e a generalized expectation-maximization algorithm to estimate th e model parameters. Experiments cond ucted on a musical source separation task in a semi-in fo rmed setting show that the p roposed approach outperfor ms state-of-the-art phase- aware separation techniqu es. Index T erms —Nonnegative matrix factorization (N MF), com- plex NMF , anisotropic Gaussian model, Itakura-Saito divergence, Bayesian in ference, phase recov ery , audio source separation. I . I N T RO D U C T I O N T HE goal of audio source separation [1] is to extract underly ing sour ces that add up to form an o bservable audio mixtur e . In this paper , we add ress the p r oblem of monaural so urce separation, which means that the observed audio signal h as been recor d ed th rough a single microph one. T o tackle this issue, m a ny techn iques act on a time- frequen cy (TF) representa tio n of the data, such as the short- time Fourier transform (ST FT ) , since the structure of audio signals is more promin ent in th a t do main. In particular, nonnegative matrix factorization (NMF) [2] techniques h av e shown successful fo r audio source separation [3], [4]. NMF is a rank-red uction meth od u sed for obtaining part-based decomp o sitions of nonnegative data. The NMF prob lem is expressed as follows: given a matrix V o f dimension s F × T with nonn egative entr ies, ﬁnd a factorization V ≈ WH wher e W and H are non negativ e m atrices of dimen sions F × K an d K × T re sp ectiv ely . T o redu ce the dimen sionality o f the d ata, the rank K is ge nerally chosen so that K ( F + T ) ≪ F T . In audio applications V is usu a lly a m agnitude or po wer P . Magron a nd T . V irtanen are wit h the Labora tory of Si gnal Processing, T ampere Uni versi ty of T echnolo gy , Finland (e-mail: ﬁrst- name.lastna me@tut.ﬁ). The work of P . Magron was partly supported by the Academy of Finland , project no. 290190. spectrogra m, and one can interpret W as a diction ary of spectral templates and H as a matrix of tem p oral a cti vations. Such a factorization is generally obtained by m inimizing a co st functio n that p enalizes the error between V and WH . Popular choices are the E uclidean distance or Kullback-Leibler (KL) [2] and Itakur a-Saito (IS) div ergences [4]. NMF m ay often be fram ed in a probab ilistic fr a m ew ork, wh ere the cost function app ears as the negativ e log-likelihood of the data [4]– [7], an d wher e the model stru ctures the dispersion parameter of the un derlying proba b ility distribution rath er than its observed realizations. For instance, in add itive Gaussian mixtures [8] where the NMF models the variance o f the sources, maxi- mum likelihood estimatio n is equ iv alent to an NMF with IS div ergence (ISNMF) of the power spectrog r am [4]. Once the NMF model has been estimated, the com plex- valued STFT s are retriev ed by means of a W iener-like ﬁl- ter [9]. T h is soft-maskin g of the complex-valued m ixture’ s STFT assigns the phase of th e origin al mixture to each extracted source. However , even if this ﬁlter yields qu ite satisfactory soundin g estimates in p r actice [3], [4], it h as been pointed out [10] that wh en sources overlap in the TF do main, it is respo nsible f or residual interfe r ence and artifacts in th e separated signa ls. This is a consequence of assuming th a t the phase is uniformly distributed [11], and ther efore of not exploiting its under ly ing structur e. T o alle viate this issue, the complex NMF (CNMF) model [12] h as been pr oposed. I t consists in directly de- composin g the complex-valued mixture’ s STFT into a sum of ran k-1 comp onents whose mag nitudes are stru ctured by means of an NM F . This mod el allows fo r jointly estimating the magnitu de an d the phase of each source. It is estimated by minimizing the Euclidean distance between the model and the data, to which can be added some regular ization terms, su c h as a spar sity p enalty [12]. It was later imp roved by mean s of adding a consistency co nstraint [13], that is, to account for th e redund ancy of the STFT which intr oduces some d ependen cies between adjacent TF bins [14], [ 15]. Alternatively , impr oved recovery can b e achieved by using phase constrain ts that originate fro m a signal model. For instance, th e mod el of sums o f sinuso ids [16] leads to explicit constraints between the phases of adjacent TF b ins [17], [18]. Such an approa c h has been exploited i n speech enha n ce- ment [1 9], [2 0], audio restoration [21] and for a time-stretch in g application in the ph ase vocode r algo r ithm [2 2]. It h as also been inco rporated into some phase- c o nstrained CNMF m odels for au dio sou r ce separation [23]–[25]. Tho se developments have shown p romising re su lts in terms o f interference re- jection, though th ey suffer from two drawbacks. Firstly , the 2 CNMF model is estimated by minimizing a Euclid ean d is- tance, which does not proper ly chara c terize the properties of audio (such as its large d ynamic rang e ) , where alternative div ergences (such as K L or IS) are preferre d [26]. Secon dly , the set of estimated s our ces does not pre ser ve the overall energy of the mixtu re, wh ich lea d s to artifacts in the sepa r ated signals. Drawing on tho se observations, we p r oposed in a prelimi- nary work [27] to mo del the sources with anisotro pic Gaussian (A G) v ariables, i.e., where the phase is no long e r unifo rm. In such a model, one can promo te a phase v alue wh ich is obtained by exploiting th e sinu so idal mod e l. Estimation in a minimum mean square error sense results in an an iso tropic W iener ﬁlter , which optimally combin es the m ixture phase and the und erlying p h ase mode l. W e f urther intro duced in [28] a general Bayesian f ramework in which both mag nitudes and phases wer e m odeled as ran dom variables, and the sinusoidal model was promo ted thr o ugh a Markov ch a in prior structure on the pha se locatio n p arameter . Howe ver , in tho se pr eliminary approa c h es, the v ariance parameter s were left u nconstraine d and there f ore either assum ed known or estimated befo rehand . In this paper , we intro d uce a Bayesian A G model that overcomes th e lim itations of those approa c hes. W e structure the variance parame ters o f the sources by m eans of an NMF model, so we c a n join tly estimate the magn itudes and th e phases in a un iﬁed framework. This m o del, called comple x ISNMF , comb in es th e beneﬁts of both I SNMF and CNMF: 1) It is phase- aware; 2) The set of estimators is conservative , i. e . , the ir sum is equal to the observed mixture; 3) The estimatio n is based on th e min imization o f an IS- like diver gence, which is appro p riate fo r au d io [29]. In order to infer the p arameters o f the mod e l, we deriv e a generalized expectation-m aximization (EM) algorithm. This model is app lied to a musical source sep a r ation task in a semi- informe d setting. It outperfor ms both the traditiona l phase- unaware ISNMF a n d the phase-con strained CNMF mode l [25]. This d e monstrates the usefuln ess of such a phase-aware Bayesian AG model to pe rform the joint estimation o f mag - nitudes and p hases for aud io source sep aration. The rest of this paper is organized as follows. Section II introdu c es the comp lex ISNMF mo del. Section III d etails the inferen ce p r ocedur e. Sec tion IV exper imentally validates the po tential of this method. Finally , Sectio n V draws some conclud in g remark s. I I . C O M P L E X I S N M F Let X ∈ C F × T be the STFT o f a single-ch annel audio signal, wh ere F and T are the nu m bers of fre q uency channels and time frames. X is th e linear a nd instantane ous mixture of J sources S j ∈ C F × T , such tha t for all TF b ins f t , x f t = J X j =1 s j,f t . (1) Since all TF bin s are treated similarly , we remove th e ind ice s f t wh en appro priate for m ore c la r ity . φ 0 2 π p( φ | µ , κ ) 0 0.2 0.4 0.6 0.8 1 1.2 1.4 µ κ =0.1 κ =2 κ =5 κ =10 Fig. 1. Density of the VM distributio n. A. Modelin g magnitu de an d p hase Let us consider a co mplex-valued r andom variable s = r e i φ whose magn itude and phase ar e assumed indep endent and denoted r and φ . Drawing on [28], we propose to m odel r as a Rayleigh random v ariable R ( v ) , wh ich is the distribution of th e modulus of a cir cularly-sym metric co m plex normal distribution w ith variance v . Besides, as in [27], we con sider that the phase should be d istributed around some fav ored value µ and that the relativ e importan ce of th is value should be adjusted by mean s of a con centration p arameter κ ∈ [0 , + ∞ [ : the higher κ , th e mor e fav ored µ . Sev eral no n-unif orm periodic distributions exist (such as the wrapped Gau ssian [ 30] or wrap ped Cauchy distributions) but the von Mises ( VM) [31] distribution comes as a natur al candidate [32], [3 3], sinc e its den sity is easily exp ressed by : p ( φ | µ, κ ) = e κ cos( φ − µ ) 2 π I 0 ( κ ) , (2) where I n is the m odiﬁed Bessel fu nction of the ﬁrst k ind of order n [34], µ ∈ [0; 2 π [ is a lo cation parameter and κ ∈ [0; + ∞ [ is a conc entration pa rameter . In p articular, if κ = 0 , the VM distrib ution becomes uniform . Contrarily , if κ → + ∞ , it b ecomes equivalent to a Dirac delta f unction center e d at µ . It is illustrated in Fig. 1. This methodolo gy r esults in a mod el called Rayleigh + von Mises (R VM ) , in wh ich one can promo te some fa vored phase values (see Section II- C). Such an ap proach has been orig inally used in [32], [3 3] fo r a speech enhanc e m ent application in a speech plus noise model. Howev er, in the p resent case, since we con sid e r any num ber of sou rces J , the R VM m odel is no longer tractable becau se th e d ensity of the mixtur e d o es not admit a closed-f orm exp r ession. Therefore it is not suitable for sou rce separation, wher e we aim to estimate the model parameters. Nonetheless, we can co mpute the m oments of s = re i φ which will b e used later in this work. If φ ∼ V M ( µ, κ ) , the n -th circular m oment is, ∀ n ∈ Z ( cf. [3 1]) : E ( e i nφ ) = I | n | ( κ ) I 0 ( κ ) e i nµ . (3) Besides, if m agnitude r ∼ R ( v ) , we hav e: E ( r ) = r π 4 v and E ( r 2 ) = v . (4) 3 This lead to the expression of the mean of s : m = E ( r e i φ ) = E ( r ) E ( e i φ ) = λ √ v e i µ , (5) and its variance γ = E ( | s − m | 2 ) : γ = E ( | r e i φ | 2 ) − | m | 2 = (1 − λ 2 ) v , (6) and the r elation term c = E (( s − m ) 2 ) , which measures the joint variability of a variable and its complex conju gate: c = E ( r 2 ) E ( e i2 φ ) − m 2 = ρv e i2 µ , (7) where λ = √ π 2 I 1 ( κ ) I 0 ( κ ) and ρ = I 2 ( κ ) I 0 ( κ ) − λ 2 . (8) This relation term c is no t comm only intro duced in statistical models o f aud io signals in th e TF domain because it is usually a ssum ed to be null [35]. Ind eed, most models [4], [9], [36] assume the secon d-ord er circular ity (o r isotr opy ) o f the variables, that is, with the same distrib ution in th e complex plane regardless of the orien tation. Since this is eq u iv alent to assuming that th e pha se is unifor mly distributed, we propo se instead to explicitly consider this relation term as non-zero in general: it enab les us to pro mote the no n-circular ity of th e variable, and th e r efore the non -unifor mity of the phase. B. Anisotr opic Gau ssian sour ces T o alleviate the non-trac ta b ility issue of the R VM model, we pro pose to appro x imate it by a Gaussian model 1 in which the mom ents of the variables are the same on es as in the original R VM model. This appro a ch enables us to keep th e phase depen dencies in a mod el wh ich is fully tr actable. Therefo re, we assume that each so urce s j follows a comp lex normal distribution: s j ∼ N ( m j , Γ j ) , where m j = E ( s j ) ∈ C is the m e an of s j and Γ j is its covariance matrix : Γ j =  γ j c j ¯ c j γ j  , (9) where γ j = E ( | s j − m j | 2 ) ∈ R + and c j = E (( s j − m j ) 2 ) ∈ C are the variance and r elation term of s j , and ¯ z d enotes the complex co njugate of z . T he density of such a distribution is: p ( x | m, Γ) = 1 π p | Γ | e − 1 2 ( x − m ) H Γ − 1 ( x − m ) , (10) where x =  x ¯ x  T , and where T and H denote the transp ose and conju gate transpo se. Many previous studies model the sources as circularly- symmetric (or isotr opic ) variables [4], [38] (i.e., such that m j = c j = 0 ), which is equ iv alent to assumin g th at th e phase o f each source is unifo rmly distributed. The ke ystone of ou r appr o ach is that, in order to promo te a fav ored phase value, the mome nts are the same on es as in th e o riginal R VM model. Ther efore, we use the expressions given by (5), (6) and (7) to e stima te the moments which are the n used to design the Gaussian m odel, as illustra te d in Fig. 2. The ma in characteristic of this m o del is tha t the relation terms c j are 1 This strategy is reminisc ent of [37], where the m ixture model was a sum of random va riables with phase priors. Moments Rayleigh+V on Mises Anisotropic Gaussian Fig. 2. Design of the AG model. W e ﬁrst model the magnitudes and phases as Rayle igh and von Mises rando m varia bles. The moments in this model are then used to deﬁne the equiv ale nt AG model. Rayleigh + Von Mises -0.5 0 0.5 1 1.5 Real part 0 0.5 1 1.5 2 Imaginary part Anisotropic Gaussian -0.5 0 0.5 1 1.5 Real part 0 0.5 1 1.5 2 Imaginary part Fig. 3. 2-D histograms of 10000 samples generated from the R VM model (left) and A G model (right), with v = 1 , µ = π / 3 and κ = 50 . The intersec tion between the dashed lines represent s the mean of the samples. non-ze r o in g eneral, which con veys the pro perty of a nisotr opy of the correspond ing Gau ssian distribution: this is why we refer to it a s the aniso tropic Gaussian (AG) model. The additive pr operty of the Gaussian distrib ution family then implies that x ∼ N ( m x , Γ x ) with: m x = X j m j , γ x = X j γ j , c x = X j c j , Γ x = X j Γ j . ( 11) Remark : If κ = 0 , then λ = ρ = 0 and co nsequently m = c = 0 and γ = v : th e R VM and A G models are then equi valent since th ey both b ecome isotro pic Gaussian. Contrarily , f or imp ortant values of κ , the models still remain quite alike, as illu strated in Fig. 3 for κ = 50 . C. Phase mod el The non-unif ormity of the phase is taken into accou n t in the A G model throu gh the location parame ter µ . Howev er, in order to o btain go od quality ph ase estimates, th is model ca n beneﬁt from incorp orating some prior kn owledge about th e phase, fo r instance by accountin g for its structu r e in time or frequen cy . W e p ropose to exploit some inf ormation a bout the phase by exploitin g the sinusoidal model, wh ich is wid ely used for representin g audio signals [19], [23]. E a ch source in th e time domain is m o deled as a sum of sinusoids. Let u s assume that there is at m ost on e sinusoid (whose no rmalized frequency is de noted ν j,f t ) per freq uency ch annel. It can be shown [ 2 1] that the p h ase µ j follows the unwrapp ing equation: µ j,f t ≈ µ j,f t − 1 + 2 π l ν j,f t , (12) 4 where l is the h op size of the STFT . As in [2 8], we pro pose to enfo rce this prope r ty by means of a Markov chain prior structure. W e have, fo r each source : p ( µ j ) = F − 1 Y f =0 p ( µ j,f 0 ) T − 1 Y t =1 p ( µ j,f t | µ j,f t − 1 ) . (13) W e then pr o pose th e following cho ice, fo r t > 0 : µ j,f t | µ j,f t − 1 ∼ V M ( µ j,f t − 1 + 2 π l ν j,f t , τ ) , (14) and the initial distribution in each frequen cy ch annel p ( µ j,f 0 ) is Jeffrey’ s non-in formative pr ior . I n this way , we enforce the ph ase location parameter to approxima tely follo w the sinusoidal mod e l (12). The parameter τ ∈ R + adjusts the relativ e importan c e of th is prior . Once again , we choose a VM distribution f or mod eling th e phase lo cation parame ter , since it is a natural c a ndidate for acco unting for the periodic ity of this variable. However , u nlike pr eviously , we do not n e e d he r e to approximate this distrib ution: since the prior ( 1 4) applies indepen d ently to each sou rce, it is straigh tforward to explicitly obtain the log -prior : log( p ( µ )) c = τ X j,f ,t ℜ  e i µ j,f t e − i µ j,f t − 1 − 2i π lν j,f t  , (15) where c = denotes equality u p to an additi ve con stan t an d ℜ is the real part. The mo del therefo re depends on two concentr a tion par ameters that ha ve a different ro le: κ quantiﬁes the no n-unif o rmity of the phase in the AG model (i.e., h ow concentr a ted abou t a loca tio n p arameter the phase is), wh ile τ qu antiﬁes how close to the sinusoidal model this location parameter is. D. Complex ISNMF For practical separation application s, it is n ecessary to constrain the v ariance param eters of the so u rces V j . W e propo se to stru cture it by mean s of an NMF model: V j = W j H j , (16) where W j and H j are nonn egati ve-valued matrices of dimen- sions F × K j and K j × T respectively . T herefor e, th e moments in the AG mo del b ecome: m j,f t = λ q [ W j H j ] f t e i µ j,f t , γ j,f t = (1 − λ 2 )[ W j H j ] f t , (17) c j,f t = ρ [ W j H j ] f t e i2 µ j,f t , where [ W j H j ] f t denotes the ( f , t ) -th en tr y o f the matrix W j H j . In par ticular, if κ = 0 , th en m j = c j = 0 and γ j = W j H j : the model becom es equiv alent to ISNMF . Thus, since the proposed model gen eralizes ISNMF wh ile allowing us to accou nt for some ph ase con straint, we call it complex I S NMF . The whole model is r epresented as a Bayesian network in Fig. 4 Fig. 4. Bayesian network corresponding to the complex ISNMF model. Latent (resp. observed) variabl es are represente d with empty (resp. shaded) ellipses. The sub-graph contained in each rectangl e is repeat ed according to the index ( k or j ) indica ted in the bottom-right corner of the rectangle. The vertic al dashed lines m ark the limits betwee n successi ve time frames. E. Relatio n to other models The A G model alo ng with th e NMF v ariance s tructu re results in a phase- aware extension of ISNMF , as po inted out in Section I I -D. Howe ver , other m o dels can be seen as particular cases of this general f r amew ork . In deed, in Section II-B we approx imated the R VM mod el with a n A G model by eq uating their moments. As illustrated in Fig. 2, we ch ose to equate all the momen ts (mean, variance and relation term) , b ut other approa c h es are po ssible. Firstly , it is possible to set the mean and relatio n term to 0 , in which case the sources f o llow a circularly -symmetric Gaussian distribution: s j ∼ N (0 , γ j I ) , where I is the id entity matrix. Alon g with an NMF variance, this resu lts in the ISNMF mo del [4]. Th is is therefore another way of seeing the prop o sed AG mode l as an extension of ISNMF . Alternatively , one can on ly p reserve the mean informatio n from th e R VM model, a n d set the cov ariance matrix to b e diagona l with a constant variance σ : s j ∼ N ( m j , σI ) . Th is is the un derlying statistical mo d el from CNMF [ 12]. Th e refore, this A G framework brid ges the gap between ISNMF and CNMF since it generalizes bo th of th em in a uniﬁed mo d el. Finally , other approx imations are po ssible. For in stan ce, one can only preserve the secon d-ord er statistics from the R VM mo del and set the mean v alue at 0 ( s j ∼ N (0 , Γ j ) ). Instead, on e can set the relation terms at 0 and keep the ph ase depend encies only th rough the m ean ( s j ∼ N ( m j , γ j I ) ). T his leads to alternative versions of Complex ISNMF that simplify the estimation of the NMF parameters ( cf . Section III-C) or the phase param eters ( cf . Section III -D). Those will be discussed in the co rrespond ing sections. H owever , in ord er to keep the scope of this paper broad enou gh, we will in fer th e mod el in the genera l case described in Section II-D. I I I . I N F E R E N C E The model parameter s Θ = { { W j } j , { H j } j , { µ j } j } are estimated in a maxim um a p osteriori sense, which consists in 5 maximizing the log -posterior distribution: C MAP (Θ) = log p ( X | Θ) + log p (Θ) , (18) where p ( X | Θ) is the likelihood of the d ata an d p (Θ) the prio rs on the parameter s. In this work, we only exp loit the Markov prior infor m ation a b out the ph ase, therefore log p (Θ) is given by (15). However , this fr amew ork is very general an d it cou ld be possible to further enforc e som e desirable property such as h a rmonicity [ 39] thro ugh pr iors on th e colu m ns of W j or temporal contin u ity [3] throug h priors on the rows o f H j . A. EM framework Since the d ir ect maximization of th e c r iterion (18) is more in volved than in classical isotrop ic models [4], we pr opose to adopt an EM [40] strategy which co nsists in maximizing a lower bou nd o f the log-p osterior distribution, given by: Q MAP (Θ , Θ ( i − 1) ) = Q ML (Θ , Θ ( i − 1) ) + log p (Θ) , (19) where i is a step index, Θ ( i − 1) contains the curre n t set of estimated p a r ameters (i.e., the para m eters estimated at the previous step i − 1 ) and Q ML is the c o nditional expec tation of the com plete-data log -likelihood: Q ML (Θ , Θ ( i − 1) ) = Z p ( Z | X ; Θ ( i − 1) ) log p ( X , Z ; Θ) d Z , (20) where Z d enotes a set of latent ( h idden) variables. Due to the mixing constraint (1), we use, as in [38], [41], a reduced set of J ′ = J − 1 free variables Z = S = { s f t } f t , where we note s f t = [ s 1 ,f t , ..., s J ′ ,f t ] T . Therefo re, s J,f t = x f t − P J ′ j =1 s j,f t . The EM algo r ithm consists in alternatively com puting th e function al Q MAP giv en the cur rent set of parameters Θ ( i − 1) (E-step) a n d max im izing it with r espect to Θ ( M-step). Th is is pr oven [40] to increase the value of th e c r iterion (18). Howe ver , wh en the maximization of Q MAP is too in volved, it may b e pre f erable to solely increase its value at the M-step. This has also been proved [40] to lead to a lo cal maximum of (1 8), and the corresp onding proced ure is called generalized EM. This is the app r oach we are ado pting her eafter . B. E-step Since all { s j,f t } J ′ j =1 are in d epende n t Gau ssian variables, s f t is a Gau ssian vector . It can be shown [3 5] that S | X fo llows a multiv ariate co mplex no rmal d istribution N ( m ′ f t , Ξ f t ) . The posterior means of the sources are given by aniso tr opic W ien er ﬁltering [27]: m ′ j,f t = m ( i − 1) j,f t + Γ ( i − 1) j,f t  Γ ( i − 1) x,f t  − 1 ( x f t − m ( i − 1) x,f t ) . ( 21) Note th at, g i ven the mixing constraint (1), this expression is also valid fo r the last so urce f or which j = J . Th e p osterior covariance matrix Ξ f t is given by [41]: Ξ f t =     Γ ( i − 1) 1 ,f t 0 0 0 . . . 0 0 0 Γ ( i − 1) J ′ ,f t     −     Γ ( i − 1) 1 ,f t . . . Γ ( i − 1) J ′ ,f t      Γ ( i − 1) x,f t  − 1     Γ ( i − 1) 1 ,f t . . . Γ ( i − 1) J ′ ,f t     T . (22) In particular, the d iagonal blocks in the p osterior covariance matrix provid e the po sterior covariance fo r each source: Γ ′ j,f t = Γ ( i − 1) j,f t − Γ ( i − 1) j,f t  Γ ( i − 1) x,f t  − 1 Γ ( i − 1) j,f t . (23) Thanks to ( 21) and (2 3), we can comp ute the posterior mean, variance and relation term of the sour ces, respectively , deno ted by m ′ j , γ ′ j and c ′ j . The compu tation of (20) is detailed in the append ix and results in: Q ML (Θ , Θ ( i − 1) ) c = − X f ,t J X j =1 log( q | Γ j,f t | ) + 1 | Γ j,f t |  γ j,f t ( | m ′ j,f t − m j,f t | 2 + γ ′ j,f t )  (24) − 1 | Γ j,f t |  ℜ (¯ c j,f t (( m ′ j,f t − m j,f t ) 2 + c ′ j,f t ))  , where | Γ j,f t | = γ 2 j,f ,t − | c j,f t | 2 is the d eterminant o f Γ j,f t . C. M-step: NMF parameters 1) NMF function al: Let us ﬁr st rewrite Q ML by r e moving the ter ms that do not d epend on the NMF parameters. Us- ing (24) and ( 17), we have: Q ML (Θ | Θ ( i − 1) ) c = − J X j =1 X f ,t log([ W j H j ] f t ) + p j,f t [ W j H j ] f t − q j,f t p [ W j H j ] f t , (25) with: p = (1 − λ 2 )  γ ′ + | m ′ | 2  − ρ ℜ  e − 2i µ ( c ′ + m ′ 2 )  (1 − λ 2 ) 2 − ρ 2 , (26) and: q = 2 λ 1 − λ 2 + ρ ℜ  e − i µ m ′  , (27) where we removed the indices j, f t for brevity . This highlig hts two novel quan tities p and q on which Q ML depend s. First, from th e derivation conducted in the appe ndix we re m ark that: p j,f t [ W j H j ] f t = E S | X ;Θ ( i − 1)  s H j,f t Γ − 1 j,f t s j,f t  . ( 28) In particular, when κ = 0 , p j,f t = γ ′ j,f t + | m ′ j,f t | 2 , which is the posterior power of s j,f t . Therefor e , in the general case, we c all the quantity p in (28) the ph ase-corr ected po sterior power of the sour ces. Note that sinc e Γ is po siti ve-deﬁnite, p is n ecessarily non n egati ve. This quantity is in teresting becau se it acco u nts for the phase while b eing no nnegative: therefo r e, 6 estimating the NMF m odel from this qua n tity leads to a p hase- aware decom position of the data. On th e oth er hand , the physical meaning o f the q uantity q is not fully clear . I n particular, it ha s the same sign as ℜ  e − i µ m ′  , that is, th e same sign as cos( µ − ∠ m ′ ) . Ac- counting for the mixture ’ s phase when computing the posterior mean (21) lead s to a d eviation of ∠ m ′ from the lo cation parameter µ . Howe ver , our intuition is that the posterior m e an angle will stay relativ ely close to the loc a tion parameter µ . If this angle d ifference remains r e lati vely small (that is, | µ − ∠ m ′ | < π/ 2 ), then its co sine (and consequ ently q ) is nonnegative. Then, q has the dim e nsion o f a mag nitude, and can therefo re be seen as a ph ase-corr ected posterior magni- tude . E ven though we were not able to fo r mally d emonstrate that th is in tuition ho lds, we observed experimentally that q was alw ays non negativ e. Theref ore, we will assume in w h at follows that q is nonnegative, and we leav e to f u ture work a more in-d epth analy sis of those quantities. 2) Majo rize-min imization ap pr oach: Since Q MAP is e q ual to Q ML up to th e log- prior on the p hase, which do es no t depend on the NMF parameters, th e problem then becomes that of m inimizing the following f unction, for all sources j : H (Θ) = X f ,t log( X k w f k h kt ) + p f t P k w f k h kt − q f t p P k w f k h kt . (29) T o do so, we pro pose to adopt a majo rize-minimiza tion approa c h [42]. The core idea of this strategy is to ﬁn d an auxiliary func tio n G w h ich majorizes H : ∀ (Θ , e Θ) , H (Θ) ≤ G (Θ , e Θ) , and H ( e Θ) = G ( e Θ , e Θ) . (30) Giv en so me current par ameter e Θ , minimizing G (Θ , e Θ) with respect to Θ provides an upd ate on Θ . T his approa c h guaran- tees that the co st function H is non-incre a sing over iter ations. Let us deri ve the up date on W j . W e introdu ce auxiliar y parameters e w f k and we den ote e v f t = P k e w f k h kt . In a similar fashion as in [4 3]–[45], we decomp ose the function H in to its conv ex and con cav e p arts. Since p is no nnegative, the term in (29) inv olving p is con - vex. Therefore it is major ized by using the Jensen inequality: p f t P k w f k h kt ≤ X k e w 2 f k w f k p f t h kt e v 2 f t . (31) Besides, since we assumed that q is negative, the term in (29) in volving q is conc av e, so it is majo rized by its ta n gent: − q f t p P k w f k h kt ≤ X k w f k h kt q f t e v 3 / 2 f t . (32) Finally , th e ﬁrst term in (29) is m ajorized as in [44]: log( X k w f k h kt ) ≤ X k w f k h kt e v f t . (33) Combining ( 31), (3 2) and (33) results into the fo llowing auxiliary func tio n fo r H : G (Θ , e Θ) = X f ,k e w 2 f k w f k X t p f t h kt e v 2 f t + w f k X t h kt ( 1 e v f t + q f t e v 3 / 2 f t ) . (34) 3) Upd ate rules: Setting the d eriv ativ e o f G with respect to w f k at zero an d solving leads to th e following update: w f k = e w f k v u u u u u u u t X t p f t h kt e v 2 f t X t h kt 1 e v f t + q f t e v 3 / 2 f t ! . (35) W e ca n rewrite th is update rule on to matrix for m as: W j ← W j ⊙ ( P j ⊙ V ⊙− 2 j ) H T j ( V ⊙− 1 j + Q j ⊙ V ⊙− 3 / 2 j ) H T j ! ⊙ 1 / 2 , (36) where ⊙ , ⊙ and the fr a ction b a r denote element-wise matrix multiplication, power and di vision respectiv ely , and where P j and Q j are the matrices who se entr ies are the p j,f t and q j,f t deﬁned in ( 2 6) and (27). By applyin g exactly the same methodo logy , we o b tain the upd ate on H : H j ← H j ⊙ W T j ( P j ⊙ V ⊙− 2 j ) W T j ( V ⊙− 1 j + Q j ⊙ V ⊙− 3 / 2 j ) ! ⊙ 1 / 2 . (3 7) 4) Rela tion to other app r o aches: W e rem ark that if κ = 0 , then λ = ρ = 0 : ther e fore, q j,f t = 0 and p j,f t becomes the posterio r power of s j,f t , as mention e d in Sec tio n III- C1. Then, we recog nize in ( 25) the IS divergence between P j and W j H j , as in the EM algorithm for ISNMF [46]. Con- sequently , the updates r ules ( 36) and (3 7) are similar to those obtain ed in such a scen a r io [46], up to an addition al power 1 / 2 , which is common when apply in g the major ize - minimization metho dology fo r estimating ISNMF [ 44]. Besides, o n e can consider an alternative A G m odel as described in Section II-E. If one considers tha t the sources are center e d ( s j ∼ N (0 , Γ j ) ), then Q j = 0 : we reco gnize in (25) the IS diver gence b etween the NMF model and the phase-cor rected p osterior power . The derivation of the up date rules is th en easier than in the g eneral case, since it eliminates the n eed fo r the majorize-min imization method: on e can apply the co mmonly -used heuristic meth o d d escribed in [2] to obtain alternative multip lica tive upda te rules. Th is appr o ach is described in m ore d etails in [ 4 7]. D. M-step: ph ase parameters Let us now de riv e the updates on the p h ase p arameters. W e rewrite the function al (24) by removing the term s tha t d o no t depend on th e p hase parame ter s, which lea d s to: Q ML (Θ | Θ ( i − 1) ) c = J X j =1 X f ,t ℜ  α j,f t e − 2i µ j,f t + β j,f t e − i µ j,f t  , (38) with: α j,f t = ρ ((1 − λ 2 ) 2 − ρ 2 )[ W j H j ] f t ( c ′ j,f t + m ′ 2 j,f t ) , (39) and: β j,f t = 2 λ (1 − λ 2 − ρ ) ((1 − λ 2 ) 2 − ρ 2 ) p [ W j H j ] f t m ′ j,f t . (40) 7 Therefo re, adding th e log - prior over the phase parame ters (15) leads to max imizing the fo llowing functionals: g j,f t ( µ j,f t ) = ℜ  α j,f t e − 2i µ j,f t + ˜ β j,f t e − i µ j,f t  , ( 41) with respec t to µ j,f t , and where : ˜ β j,f t = β j,f t + τ  e i µ j,f t − 1 +2i π lν j,f t + e i µ j,f t +1 − 2i π lν j,f t +1  . (42) Let us remove the indexes j, f t in what fo llows for more clarity . W e th en seek to m aximize: g ( µ ) = ℜ  αe − 2i µ + ˜ β e − i µ  (43) = | α | cos(2 µ − ∠ α ) + | ˜ β | cos( µ − ∠ ˜ β ) , (44) which leads to ﬁnding the ro ots of: g ′ ( µ ) = − 2 | α | sin(2 µ − ∠ α ) − | ˜ β | sin( µ − ∠ ˜ β ) . (45) Unfortu n ately , it is not straigh tforward to write the solution s of this problem in closed-form. Besides, it r equires further operation s to d etermine which root m aximizes g , leading to a quite co mputation ally intensiv e p rocedur e. Instead, drawing on [ 28], since we experimentally observed that | α | << | β | , we pro pose to a p proxim ate (44) b y: ˜ g ( µ ) = ℜ  ˜ β e − iµ  = | ˜ β | cos( µ − ∠ ˜ β ) , (46) which is easily max imized b y µ = ∠ ˜ β . This u pdate d epends on the v alues of th e phase param eter in frames t − 1 and t + 1 , so it has to be ap plied sequentially over time frames (which is co mmon wh en usin g Markov chain priors such as in [39]) . T o assess the v alidity o f this update scheme, we applied both procedures ( maximization of the exact functional (44) and its approximation (46)) on the learning dataset used in the experimental ev aluation (see Section IV -A ). Th e average relativ e difference between the ph ases obtained with those two app r oaches was of approx imately 10 − 5 . Con sequently , we prop ose to u se the approx imate u p date scheme, since it yields very similar estimates while bein g signiﬁcantly faster than perf o rming th e exact m a x imization. Finally , if one consider an altern a ti ve AG mod el with n ull relation terms ( cf . Section I I-E), then α = 0 , which eliminates the nee d for this simp lifying assumption. It also mod iﬁes the values of β , p and q , therefor e lead ing to a different procedu re, which will be in vestigated in future work. E. Full pr ocedu r e The EM procedu re is summa r ized in Algorithm 1. The phase locatio n parameters µ j are initialized by assigning the mixture phase to each source. The initialization of the NMF matrices is d iscussed in Section s IV -A2 and I V -B. The fr equencies ν are pr ovided as inpu ts of the algorithm. W e estimate them by mean s of a qu adratic interpo lated FFT (QIFFT) [4 8] on the lo g-spectra of th e initial variance es- timates V j . This estimation is performed locally ( at each time fram e ) in order to a c c ount for slow variations o f the frequen cies. The frequ ency range is then deco m posed into r e gions of inﬂuen c e [21] to ensur e tha t th e phase in a given channel is u nwrapp ed with the a p propr iate freque n cy . Algorithm 1: E M algorithm for co mplex ISNM F 1 Inputs : Mixture X ∈ C F × T , 2 Phase parameter s κ an d τ ∈ R + , 3 In itial NMF matrices ∀ j , W j ∈ R F × K j + , H j ∈ R K j × T + , 4 In itial p hases ∀ j , µ j ∈ [0 , 2 π [ F × T , 5 Nor malized frequen cies ∀ j , ν j ∈ R × F × T . 6 Anisotropy parameters : 7 Comp ute λ and ρ with (8). 8 while stopping criterion not reached do 9 % E-step 10 Update m , γ a nd c with ( 1 7), 11 Update m x , γ x and c x with (11), 12 Update m ′ with (21), 13 Update γ ′ and c ′ with (23), 14 % M-step: NMF 15 Update p with (26) and q with (27). 16 ∀ j , Up date W j with (36) an d H j with (37), 17 Normalize W and H . 18 % M-step: ph a se 19 Update β with (40). 20 for t = 1 to T − 2 do 21 ∀ ( j, f ) , update ˜ β j,f t with (42), 22 µ j,f t = ∠ ˜ β j,f t . 23 end 24 end 25 Upd ate m , γ and c with ( 1 7), 26 Upd ate m x , γ x and c x with (11), 27 Upd ate m ′ with (21). 28 Outputs : m ′ ∈ C J × F × T . This algorithm inclu des a norm alization step after upd ating W j and H j , which elimina tes trivial scale indetermina cies and avoids numerical instabilities. W e impose a unitary ℓ 2 - norm on each column o f W j and scale H j accordin g ly , so that the co st functio n is n ot affected. Finally , one ﬁnal E-step is p erforme d after looping in order to estimate the sources with the most up-to - date p arameters. I V . E X P E R I M E N TA L E V A L U A T I O N In this section, we e xper imentally assess the potential of the p roposed comp lex ISNMF model f or a task o f mon a ural musical source separation . Sound excerpts c a n be foun d on the companion website fo r this paper [49]. I n the spir it of reprod ucible r esearch, the cod e of this exper im ental study is av ailable online 2 . A. Pr otocol 1) Data set: W e consider 10 0 mu sic song excerpts fro m the DSD100 database, a semi-p rofessionally mixed set of music songs used for th e SiSEC 201 6 campaign [50]. Eac h excerpt is 10 seco nds long and is made up of J = 4 sources: bass , dr um , vo cals and othe r . The datab ase is split in to two subsets of 50 songs: a learning set, o n 2 https:/ /github . com/magronp/complex - isnmf 8 which the meta-parameters of the algorithms are tuned an d the initialization strategies are in vestigated, an d a test set, on which the separation ben c hmark is perfo rmed. The signals are sampled at 4 4100 Hz and th e STFT is c o mputed with a 92 ms long Hann window and 75 % overlap. The resulting STFTs are therefo re matr ices o f dime n sions 20 49 × 433 . 2) Sep aration scen ario: In co ding-b a sed informed sour ce separation [51], we assume some side-infor mation c a n be computed fro m the isolated sources (the encoding stage) and then u sed to perfo rm separation (the deco ding stage ). A co mmon appr oach consists of computing a no nnegative matrix or tensor factor ization [ 52]–[54] on the isolated source spectrogra ms and then using th e correspo n ding decompo sition to estimate a W iener ﬁlter at the d ecoding stage. Here , we consider a semi-in fo rmed scenario, in which the dictionaries W j are estimated on the isolated sour ces and the activation matrices H j computed f rom the mixtu re. This setting is less restrictiv e than a f ully-info rmed setting since we only transmit the dictiona r ies instead of both NMF matr ices. Note than sin c e we use a learnin g d ataset for tunin g so m e parameters, this setting is actually supervised semi-inform ed, but we ref er to it as semi-in formed for brevity . Dictionaries are learned with 20 0 iteratio n s o f ISNMF ap- plied to each isolated spectro gram, using multiplicative update rules [ 4], ran dom initial matrices and a rank of factoriza tion K j = 50 , which cor responds to an 8 -fold compression ratio. The diction a ries are th en ﬁxed at the sep aration stage, since we experim e n tally ob served that it lead s to b etter results th an further upd ating th em on the mix ture. 3) Compa rison r efer ences: As baselin e s, we test the con- sistent anisotrop ic Wiener (CA W) ﬁlter [41] which combine s the consistent [38] and anisotropic [27] W ien er ﬁlters, and we also con sider th e phase-con strained CNMF [2 3]– [25]. In order to make the comp a rison fair, we implemen ted a version of CNMF known as CNMF with intra-sourc e additivity [55]: it consists in mod e ling the phase φ j of each so u rce in stead of the p hase of each NMF c o mpone nt, as in the classical CNMF mod el [ 12]. This signiﬁcantly redu ces th e n u mber of parameters o f the mo del, thus it lowers both th e memory and computatio n time requ ired for th e estimation of the model, at the cost o f a m oderate drop in terms of separation qua lity [55]. Source sep a r ation quality is measured with th e signal-to- distortion, signal-to-interf erence, and signal-to - artifact ratios (SDR, SIR, and SAR) [ 5 6] expressed in dB, where o n ly a rescaling (no t a reﬁltering) of th e r eference is allowed. B. Initia lizatio n strate gy W e brieﬂy investigate here o n the best strategy fo r initial- izing the com plex ISNMF algo rithm at the sep aration stage, once the diction aries are learned. A ﬁrst app roach is to provid e a warm start to the algo rithm thanks to 50 iterations of ISNMF computed on th e mixtu re, whose activ ation matrix is rando mly initialized. Besides, it is necessary to h av e a ﬁrst estimate o f the variances in o rder to comp ute the freq uencies, which are needed as inputs o f Algorithm 1. On top of that initialization , we r un 1 50 iter ations of comp lex ISNMF . Alternativ ely , we run 200 itera tio ns of complex ISNMF on top of a random 0 50 100 150 Iterations 5.2 5.3 5.4 5.5 5.6 5.7 SDR (dB) ISNMF initialization 0 50 100 150 200 Iterations -2 -1 0 1 2 3 4 Random initialization Fig. 5. SDR ov er iterat ions for an ISNMF (left) and random (right) initia lizati on. SDR (dB) 0 0.1 0.5 1 5 0 0.1 0.5 1 4 4.5 5 5.5 SIR (dB) 0 0.1 0.5 1 5 14.2 14.4 14.6 14.8 15 15.2 Fig. 6. Inﬂuenc e of the phase parameters κ and τ on the source separation qualit y (SDR and SAR are similar). The range is limite d to [0 , 1] and [0 , 5] for κ and τ respecti vel y for clarity purpose, since the performanc e decre ases outside of these ranges. initialization ( though we still use th e freq uencies as comp uted before) , so th e total numb er of iterations is the same in both scenarios. W e present the SDR ov er iter ations in Fig. 5 (results are av eraged over the learnin g set) for κ = τ = 0 . 5 : similar conclusion s can be d rawn from oth er values of th e p arameters and from the SIR and SAR. W e o bserve that initializing complex ISNMF with I SNM F provides better results than a random initializa tio n. Consequ ently , in the following exper- iments, we will retain th is ISNMF-initialization strategy in order to bootstrap the comp lex ISNMF alg orithm, wh ich will use 10 0 iteratio n s. C. Phase pa rameters inﬂu e nce W e run the different m ethods on the 50 songs that fo rm th e learning set in order to lear n the optima l p hase parameter s. 1) Complex IS NMF: The results pr e sented in Fig 6 show that for no n-null values of the phase parameters, the p roposed approa c h can outperf o rm a phase-un aware ap proach (for which κ = τ = 0 ) according to the SDR, SIR and SAR. W e found that κ = 0 . 5 and τ = 5 provides a quite goo d compromise between the different indicator s. 2) Ph a se-constrained CNMF: This method d epends on a weight parame ter σ u which prom otes the sinusoidal model phase constra in t. The separation work ﬂow is the same as for complex I SNMF , except we use here an NMF with Euclid ean distance [2] for both dictio n ary learnin g and in itialization on the m ixture. Indeed , since CNMF is based on the Euclidean distance, learning I S-based dictio naries would not be consis- tent with the d istortion metric in CNMF . The value σ u = 10 − 2 appears as the b est candidate, since the SDR is sligh tly reduced ( − 0 . 2 dB) compar ed to th e uncon stra in ed baseline (for which 9 T ABLE I S O U R C E S E PA R A T I O N P E R F O R M A N C E F OR E AC H I N S T R U M E N T ( S D R , S I R A N D S A R I N D B ) A V E R A G E D OV E R T H E D S D 1 0 0 T ES T D AT A S E T . Bass Drums Other V ocals SDR SIR SAR SDR SIR SAR SDR SIR SAR SDR SIR SAR W iener 2 . 6 7 . 9 4 . 4 4 . 7 17 . 4 5 . 1 3 . 7 12 . 9 4 . 4 7 . 6 18 . 1 8 . 1 A W 2 . 6 8 . 1 4 . 3 4 . 4 18.5 4 . 7 3 . 6 13.1 4 . 2 7 . 5 18.9 7 . 9 CA W 2 . 8 8 . 1 4.5 4 . 8 17 . 6 5 . 1 3 . 8 12 . 9 4 . 4 7 . 0 16 . 7 7 . 5 CNMF 2 . 3 6 . 9 4 . 5 3 . 7 12 . 8 4 . 4 2 . 6 10 . 1 3 . 7 5 . 9 15 . 7 6 . 5 Comple x ISNMF 3.0 10.1 4 . 1 5.4 15 . 9 5.9 3.8 12 . 4 4.6 7.7 18 . 4 8.2 T ABLE II S O U R C E S E PA R A T I O N P E R F O R M A N C E A V E RA G E D O V E R I N S T RU M E N T S : M E A N P L U S / M I N U S S T A ND A R D D E V I A T I O N O V E R T H E D ATAS E T . SDR SIR SAR W iener 4 . 7 ± 1 . 6 14 . 1 ± 2 . 9 5 . 5 ± 1 . 5 A W 4 . 5 ± 1 . 7 14.6 ± 2 . 8 5 . 3 ± 1 . 5 CA W 4 . 6 ± 2 . 0 13 . 8 ± 2 . 7 5 . 4 ± 2 . 0 CNMF 3 . 6 ± 1 . 7 11 . 4 ± 2 . 3 4 . 8 ± 1 . 6 Comple x ISNMF 5.0 ± 1 . 7 14 . 2 ± 2 . 8 5.7 ± 1 . 6 σ u = 0 ) , but it allows for mo re interferen ce redu ction ( +1 . 4 dB in SIR). V alues of σ u greater than 10 − 2 still increase the SIR, but at the co st of a sig n iﬁcant drop in SDR. 3) W iener ﬁlters: CA W [41] dep ends on two parameter s κ and δ which r espectively p r omote anisotropy an d consistency . W e ﬁrst estimate the variances with 1 50 iter a tions of ISNMF on the mixture, a n d then we apply the ﬁlter . W e pro pose th e following sets of values: • For κ = 1 an d δ = 0 , the SIR is improved by +0 . 6 dB at the cost of a slight d ecrease in SDR ( − 0 . 1 dB) comp ared to th e baselin e Wiener ﬁltering (for which κ = δ = 0 ). W e simp ly refer to it as A W since the co nsistency weight is null in this setting. • For κ = 0 . 1 and δ = 10 − 3 , the SIR is very slightly reduced c o mpared to the baseline ( − 0 . 0 2 dB) while the SDR is in c reased b y 0 . 05 dB. W e refer to it as CA W . One may chose other values for the p arameters in or d er to have th e best possible SDR (o r SIR/SAR), but the propo sed settings yield an overall comprom ise which d o es n ot exces- si vely fav or one indica to r over th e other s. D. Results o f the ben chma rk W e now consider the 5 0 songs that fo rm the test set and run the comp ared meth o ds. The results f o r each instrumen tal source are p resented in T able I, and the results av eraged over instruments are p resented in T able II. W e observe that the prop osed complex ISNMF approach yields the best results in terms of SDR and SAR for all instruments and among all the compared techniques, except for the bass tr ack in terms of SAR. It also outperfo rms the phase-u n aware W iener ﬁltering and the ph ase-constrain e d CNMF in terms of average SIR. Th is demo nstrates the in terest of exploiting some ph ase in formatio n in a p robab ilistic model to overcome the limitations of those baseline app roaches, as stressed in the introdu ction of this paper . The comp lex ISNMF estimates co n tain slightly mo re inter- ference than the A W estimates ( a 0 . 4 dB dif feren ce in SIR on average), but less artifacts ( a 0 . 4 dB difference in SAR on av erage ), which lead s to a greater SDR. Therefo re, it is overall pr eferable to e m ploy this method than our preliminary approa c h es [ 27], [4 1] to perfor m a join t estimation of magni- tude and p hase. Let us no te that the metrics do not vary much from on e technique to another . I ndeed, th e main dif ference between them is the phase recovery technique, which h as less im pact on the SDR, SIR and SAR than d ifferences in terms of m agnitude estimation strategy . An info rmal perceptu al evaluation is consistent with those results (sounds excerpts ar e a vailable at [ 49]). In particular, CNMF introd uces smearing artifacts in the separ a te d sources, and the bass an d drum track s estimated with the Wiener ﬁlters ar e strongly co rrupted by musical noise. In com p arison, the pr oposed comp lex ISNMF method yield s bass estimates which contain fewer a rtifacts and interference, and drums estimates with n eater attack s. E. F itting the data Finally , we inv estigate on the capability of the AG mod el to represent audio d ata, that is to say , to assess that the mixture variables x f t are well-repr e sented by A G distributions. T o do so, we need to norm a lize the v ariables x f t so tha t all TF en tries become iden tica lly distributed, which allows us to compu te their histogra m , and therefore to com pare their empirical and theoretical densities. Sin ce x f t ∼ N ( m x,f t , Γ x,f t ) , it can be shown that: y f t = ( x f t − m x,f t ) H Γ − 1 x,f t ( x f t − m x,f t ) (47) follows a chi-squ ared d istribution with 2 degrees of free- dom [3 5]. Then, once the mo del is estimated, we comp ute the norma lize d variable Y from the m ixture X accord ing to (4 7), and a ll the entries of Y are expec te d to be identi- cally chi-squa red distributed. Finally , even if there are some depend encies between the x f t because o f the NMF and phase models, they are conditiona lly independen t g iven the model param eters, which are estimated beforehand in ord er to com pute the y f t with (47). Th e r esulting variables y f t are then ind ependen t a n d iden tica lly distributed, th us it b ecomes possible to p lot their histogr a m. The setting is the same as in th e p revious experiments, but we set τ at 0 and we initialize Algorithm 1 with the true phase v alues for µ j . Indeed, a ﬁtting erro r can be due to a mismatch between the m odel and th e observed data, but also to an estima tion er r or . In this way , we only in vestigate o n th e accuracy of the mod el to rep resent the data, not on the ph ase 10 0 5 10 15 y 10 -4 10 -3 10 -2 10 -1 10 0 p(y) 0 0.01 0.05 0.1 0.3 0.5 0.7 0.9 Fig. 7. Empirical densities of the normalized data for se veral val ues of κ (solid lines) and referenc e chi-square d density (dashed line). estimation itself. The complex ISNMF algorithm is run on one song (similar results ar e obtained for the o ther songs) for se veral values of κ . The re su lts are presente d in Fig. 7. W e ob ser ve that small values of κ lead to empirical densities that app roach the theoretical o n e from ab ove for small values of x and fr om below for greater values of x . For greater values of κ , this trend is in verted. In particular, the v alue κ = 0 . 5 leads to a good ﬁt on a verage, which may explain why this value leads to th e best results in terms o f separation quality (see Section I V -C). Overall, a be tter ﬁt can b e obtained with no n-null v alues of κ , which d emonstrates the interest of A G distributions over isotropic variables to represent au dio data in th e STFT dom ain. V . C O N C L U S I O N In this paper , we introd uced comp lex ISNMF , a probab ilistic model b ased o n th e A G distribution. It consists of m odeling the sour ces with anisotrop ic random variables, wh ich makes it possible to enforc e som e desirab le phase prope rties, wh ile classical circula r ly-symmetr ic variables d o not allow one to fa vor a phase model. T herefor e, it combines th e a d vantages of ISNMF and CNMF , that is, using a d istortion metric well adapted to audio and phase-awareness. W e experimen ta lly showed that it o utperfo rms those two approach es, and th us appears as a go od candidate for phase-aware audio sourc e sep- aration in semi-in formed settings. This mo del is also suitable for supervised ap plications wh ere som e training material is av ailable, but th e n it is required to account for the po tential mismatch between tr aining and test ma terials [57], [5 8]. An inter esting direction for fu ture work is th e in vestigation of alternative phase-aware p r obabilistic models, in order to extend CNMF to o th er beta-divergences, as ﬁr st attemp ted in [59]. Altern a ti vely , one can exploit the family of multiv ari- ate stable distributions [ 60] with an aniso tr opic sha p e matrix in o rder to combine p hase-awareness and robust mag nitude modeling [6 1]. Finally , we could inco rporate deep neu ral net- works in this Bayesian framework for e stima tin g the variances instead of using an NMF m odel, as it was done in a multicha n - nel scen ario with isotr opic Gaussian variables [62]. Indeed , deep lear ning m ethods have shown remarkab ly go od results for musical source separation [63], but there is still so m e room for improvement, notably in term s o f ph ase recovery , since th o se methods usually explo it a p hase-unaware Wiener -like m ask to estimate the com plex-valued so urces. A P P E N D I X In this appendix, we detail the E-step of the proposed algorithm , which consists in computing the functional given by (20), whic h we recall her eafter: Q ML (Θ , Θ ( i − 1) ) = Z p ( S | X ; Θ ( i − 1) ) log p ( X , S ; Θ) d S . The comp lete data log-likelihood is given by: log p ( X , S ; Θ ) = X f ,t log p ( x f t | s f t ; Θ) + J ′ X j =1 log p ( s j,f t ; Θ) c = − 1 2 X f ,t log( | Γ J,f t | ) + B f t + J ′ X j =1 log( | Γ j,f t | ) + A j,f t , where: A j,f t = ( s j,f t − m j,f t ) H Γ − 1 j,f t ( s j,f t − m j,f t ) , and B f t = ( x f t − m J,f t − J ′ X j =1 s j,f t ) H Γ − 1 J,f t ( x f t − m J,f t − J ′ X j =1 s j,f t ) . Therefo re, (20) rewrites: Q ML (Θ , Θ ( i − 1) ) c = − 1 2 X f ,t J X j =1 log( | Γ j,f t | ) + X f ,t J ′ X j =1 E S | X ;Θ ( i − 1) ( A j,f t ) + E S | X ;Θ ( i − 1) ( B f t ) . ( 4 8) Firstly , let us comp u te the expectation E S | X ;Θ ( i − 1) ( A j,f t ) . W e remove the indices j, f t and the sub script S | X ; Θ ( i − 1) for clarity . W e have, th anks to the tr ace identity: E ( A ) = E  ( s − m ) H Γ − 1 ( s − m )  = ( m ′ − m ) H Γ − 1 ( m ′ − m ) + T r (Γ − 1 Γ ′ ) . Besides, T r (Γ − 1 Γ ′ ) = 1 | Γ | ( γ γ ′ − ℜ ( ¯ cc ′ )) , then: E ( A ) = 2 | Γ |  γ ( | m ′ − m | 2 + γ ′ ) − ℜ (¯ c (( m ′ − m ) 2 + c ′ ))  . Now , let us comp ute E ( B ) . W e use, on ce again, the trace identity , which lead s to: E ( B ) = E   ( x − m J − J ′ X j =1 s j ) H Γ − 1 J ( x − m J − J ′ X j =1 s j )   = ( x − m J − J ′ X j =1 m ′ j ) H Γ − 1 ( x − m J − J ′ X j =1 m ′ j ) + Tr (Γ − 1 J Γ ′ J ) . 11 Thanks to th e conservati ve prope rty o f the anisotropic Wiener ﬁltering (21), we have P J ′ j =1 m ′ j = x − m ′ J , so: E ( B ) = ( m ′ J − m J ) H Γ − 1 J ( m ′ J − m J ) + Tr (Γ − 1 J Γ ′ J ) . Then, E ( B ) is similar to E ( A ) , but applied to the last sou rce J . Finally , inco rporatin g the expression s of E ( A ) and E ( B ) into (48) le a ds to the expression of Q ML : Q ML (Θ , Θ ( i − 1) ) c = − X f ,t J X j =1 log( q | Γ j,f t | ) + 1 | Γ j,f t |  γ j,f t ( | m ′ j,f t − m j,f t | 2 + γ ′ j,f t )  − 1 | Γ j,f t |  ℜ (¯ c j,f t (( m ′ j,f t − m j,f t ) 2 + c ′ j,f t ))  . R E F E R E N C E S [1] P . Comon and C. Jutten, Handbook of blind sour ce separatio n: inde- pendent component analysis and applications . Acade mic press, 2010. [2] D. D. L ee and H. S. Seung, “Learning the parts of objects by non- nega ti ve matrix facto rization , ” Nature , vol . 401, no. 6755, pp. 788–791, 1999. [3] T . Vi rtanen, “Monaural s ound source separation by nonnegati ve matrix fac torizati on with tempo ral continuity and sparseness criteria, ” IEEE T ran sactions on Audio, Speech, and Languag e Proc essing , vol. 15, no. 3, pp. 1066–1074, March 2007. [4] C. F ´ evot te, N. Bertin, and J.-L . Durrieu, “Nonnega ti ve matrix fact or- izati on with the Itakura-Saito di ver gence: W ith applic ation to music analysi s, ” Neur al computation , vol. 21, no. 3, pp. 793–830, March 2009. [5] T . V irtane n, A. T . Cemgil, and S. Godsill, “B ayesian e xtensions to non-ne gati ve matrix factor isation for audio signal modelling, ” in Pr oc. of IEEE International Confer ence on A coustic s, Speec h and Signal Pr ocessing (ICASSP) , May 2008, pp. 1825–1828. [6] A. Liutkus, D. Fitzgeral d, and R. Badeau, “Cau chy nonnegati ve matrix fac torizati on, ” in Pr oc. of IE EE W orkshop on Applications of Signal Pr ocessing to Audio and Acoustics (W ASP AA) , October 2015, pp. 1–5. [7] U. Simsekli, A. Liutkus, and A. T . Cemgil, “ Alpha-stable matrix fac torizati on, ” IEEE Signal Proce ssing Letters , vol. 22, no. 12, pp. 2289– 2293, December 2015. [8] C. Fev otte and J. F . Cardoso, “Maximum likelih ood approach for blind audio source separati on using time-frequenc y Gaussian source m odels, ” in P roc. of IEEE W orkshop on Applications of Signal Pro cessing to Audio and Acoustics (W ASP AA) , October 2005, pp. 78–81. [9] A. Liutkus and R. Badeau, “Genera lized Wiene r ﬁltering with fractiona l po wer spectrograms, ” in Proc. of IEEE International Confer ence on Acoustics, Speec h and Signal Proc essing (ICASSP) , April 2015, pp. 266– 270. [10] P . Magron, R. Badeau, and B. David, “Phase recov ery in NMF for audio source separation: an insightful benchmark, ” in Pr oc. of IEE E Internati onal Confer ence on A coustic s, Speec h and Signal Proce ssing (ICASSP) , April 2015, pp. 81–85. [11] R. M. Parry and I. E ssa, “Incorporati ng phase informati on for source separat ion via spectrogram factoriza tion, ” in Proc . of IEEE Internationa l Confer ence on Acoustics, Speech and Signal Pr ocessing (ICASSP) , April 2007, pp. II–661II–664. [12] H. Kameoka, N. Ono, K. Kashino, and S. Sagayama, “Complex NMF: A ne w sparse representat ion for acoustic signals, ” in Pr oc. of IEE E Internati onal Confer ence on A coustic s, Speec h and Signal Proce ssing (ICASSP) , April 2009, p. 34373440. [13] J. L e Roux, H . Kameoka, E . Vi ncent, N. Ono, K. Kashino, and S. Saga yama, “Comple x NMF under spe ctrogram consistenc y con- straints, ” in Proc. of Acoustica l Societ y of Japan Aut umn Meeting , September 2009. [14] D. Grifﬁn and J. S. Lim, “Signal estimation from m odiﬁed short-time Fourier transform, ” IEEE T ransac tions on Acoustics, Speech and Sign al Pr ocessing , vol. 32, no. 2, pp. 236–243, April 1984. [15] J. Le Roux, N. Ono, and S. Sagay ama, “Explic it consistency constr aints for STFT spectrograms and their applica tion to phase recon struction, ” in Pr oc. of ISCA W orkshop on Statistical and P er ceptual Auditi on (SAP A ) , September 2008, pp. 23–28. [16] R. J. McAul ey and T . F . Quatieri, “Speech anal ysis/Synthesis based on a sinusoidal represen tation, ” IEE E T ransac tions on Acoustics, Speec h and Signal Pr ocessing , vol. 34, no. 4, pp. 744–754, August 1986. [17] M. K rawczyk and T . Gerkmann, “STFT phase improv ement for single channe l speech enhance ment, ” in Pr oc. of International W orkshop on Acoustic Signal Enhancement (IW AENC) , September 2012, pp. 1–4. [18] P . Magron, R. Badeau, and B. David , “Model-based STFT phase re- cov ery for audio source separati on, ” IEEE/ACM T ransac tions on Audio, Speec h and Language Proce ssing , vol . 26, no. 6, pp. 1095–1105, June 2018. [19] M. Krawczyk and T . Gerkmann , “STFT phase recon struction in voic ed speech for an improved single-chann el speech enhance ment, ” IEEE/ACM T ransac tions on Audio, Speec h, and Language Pr ocessing , vol. 22, no. 12, pp. 1931–1940, December 2014. [20] P . Mowla ee and J. Kulmer , “Harmonic phase esti mation in single - channe l speech enhan cement using phase decomposition and SNR informati on, ” IEEE/ACM T ransac tions on Audio, Speech, and Languag e Pr ocessing , vol. 23, no. 9, pp. 1521–1532, September 2015. [21] P . Magron, R. Badeau, and B. David, “Phase reconstruction of spectro- grams with line ar unwrapping : applicati on to audio signal restoratio n, ” in P r oc. of Eur opean Signal Proc essing Confer ence (EUSIPCO) , August 2015, pp. 1–5. [22] J. Laroche and M. Dolson, “Improve d phase vocoder time-scale m odi- ﬁcation of audio, ” IEEE T ransac tions on Speec h and Audio Pr ocessing , vol. 7, no. 3, pp. 323–332, May 1999. [23] J. Bronson and P . Depalle, “Phase constrain ed complex NMF: Separating ov erlapping partials in mixtures of harmonic musical sources, ” in Pr oc. of IEEE International Confer ence on A coustic s, Spee ch and Signal Pr ocessing (ICASSP) , May 2014, pp. 7475–7479. [24] F . J. Rodriguez-Serra no, S. Ewert, P . V era-Can deas, and M. Sandler , “ A score-info rmed shift in v ariant extensio n of comple x matrix factorisa tion for improvi ng the s eparat ion of overl apped partials in music recordings, ” in Pr oc. of IEE E International Confer ence on Acoustics, Speech and Signal Pr ocessing (ICASSP) , March 2016, pp. 61–65. [25] P . Magron, R. Badeau, and B. Da vid, “Comple x NMF under phase constrai nts b ased on signal modeling: applica tion to audio source separat ion, ” in P roc. of IEEE Internationa l Confer ence on Acoustics, Speec h and Signal P r ocessing (ICASSP) , March 2016, pp. 46–50. [26] R. Gray , A. Buzo, A. Gray , and Y . Matsuyama, “Distortion measures for speech processing, ” IEEE T ransac tions on Acoustics, Speech , and Signal Pr ocessing , vol. 28, no. 4, pp. 367–376, August 1980. [27] P . Magron, R. Badeau, and B. David, “Phase-depende nt anisot ropic Gaussian model for audio source sepa ration, ” in Pr oc. of IEEE In- ternati onal Confe renc e on Acoust ics, Speec h and Sign al Proce ssing (ICASSP) , March 2017, pp. 513–535. [28] P . Magron and T . V irtanen , “Bayesian anisotropic Gaussian model for audio source separation, ” in Proc . of IEEE Internation al Conferen ce on Acoustics, Speech and Signal Pr ocessing (ICASSP) , April 2018, pp. 166 – 170. [29] B. King, C. F ´ ev otte, and P . Smaragdis, “Opt imal cost function and magnitude power for NMF-based speech separation and m usic interpo- latio n, ” in P r oc. of IEEE Internati onal W orkshop on Machine Learning for Signal Pro cessing (MLSP) , September 2012, pp. 1–6. [30] Y . Agiomyr giannak is and Y . Stylianou, “Wrapped Gaussian mixture models for modeling and high-rate quantizat ion of phase data of speech, ” IEEE T ransaction s on Audio, Speech, and Language Pro cessing , vol. 17, no. 4, pp. 775–786, May 2009. [31] K. V . Mardia and P . J. Zemroch, “Algorithm AS 86: T he von Mises distrib ution function, ” Journal of the R oyal Statistic al Society . Series C (Applied Statisti cs) , vol. 24, no. 2, pp. 268–272, 1975. [32] T . Gerkmann, “MMSE-optimal enhance ment of complex speech coef- ﬁcients with uncerta in prior knowled ge of the clean speech phase, ” in Pr oc. of IEEE Internati onal Confer ence on Acoustics, Speec h and Signal Pr ocessing (ICASSP) , May 2014, pp. 4478–4482. [33] ——, “Baye sian estimation of clea n speech spectral coef ﬁcien ts giv en a priori kno wledge of the phase, ” IEEE T ransacti ons on Signal P r ocessing , vol. 62, no. 16, pp. 4199–4208, August 2014. [34] G. N. W atson, A treati se on the theory of Bessel functions . C ambridge uni versit y press, 1995. [35] B. Picinbon o, “Second-order complex random vecto rs and normal dis- trib utions, ” IE E E T ransac tions on Signal Proc essing , vol. 44, no. 10, pp. 2637–2640, Octobe r 1996. [36] A. Liutk us, C. Ro hlﬁng, and A. Delefor ge, “ Audio source separa- tion with magnitude priors: the BEADS model, ” in Pr oc. of IEEE Internati onal Confer ence on A coustic s, Speec h and Signal Proce ssing (ICASSP) , April 2018, pp. 56 – 60. 12 [37] P . Beckmann, “Statisti cal distribut ion of the amplitude and phase of a multiply scatte red ﬁeld, ” Jo urnal of Researc h of the National Bureau of Standar ds , vol. 66D, no. 3, pp. 231–240, May-June 1962. [38] J. Le Roux and E. V incent, “Consistent W iener ﬁltering for audio source separat ion, ” IEEE Signal Pro cessing L etters , vol. 20, no. 3, pp. 217–220, March 2013. [39] N. Bertin, R. Badeau, and E. V incent, “Enforcing harmonicity and smoothness in Bayesian non-ne gati ve matrix factori zation applied to polyphoni c m usic transcription, ” IEE E T ransac tions on Audio, Speec h and Languag e Pr ocessing , vol. 18, no. 3, pp. 538–549, March 2010. [40] A. P . Dempster , N. M. Laird, and D. B. Rubin, “Maximum lik elihood from incomplet e data via the EM algorit hm, ” J ournal of the r oyal statisti cal society . Series B (methodolo gical) , vol. 39, no. 1, pp. 1–38, 1977. [41] P . Magron, J. L e Roux, and T . V irtane n, “Consistent anisotrop ic Wiene r ﬁlterin g for audio source separat ion, ” in P r oc. of IEEE W orkshop on Applicat ions of Signal Proc essing to Audio and A coustics (W ASP AA) , October 2017, pp. 269–273. [42] D. R. Hunter and K. Lange, “A tutori al on MM algorithms, ” The American Statistic ian , vol. 58, no. 1, pp. 30–37, 2004. [43] C. F ´ ev otte and J . Idier , “Algorithms for nonnegati ve matrix factoriza tion with the beta-di ve rgence , ” Neura l Computation , vol. 23, no. 9, pp. 2421– 2456, September 2011. [44] C. F ´ evo tte, “Majoriz ation-mini mization al gorithm for smooth Itakura- Saito nonnegat iv e matrix factoriza tion, ” in Pr oc. of IE EE Internatio nal Confer ence on Acoustics, Spee ch and Signal Pr ocessing (ICASSP) , May 2011, pp. 1980–1983. [45] A. Lef ` evre , F . Bach, and C. F ´ ev otte, “Itakura-Saito nonne gati ve m atrix fac torizati on with group sparsity , ” in Proc . of IEE E Internati onal Confer- ence on Acoustics, Spee ch and Signal Proc essing (ICASSP) , May 2011, pp. 21–24. [46] P . Magron and T . V irtane n, “Expecta tion-maximiz ation algorithms for Itakura -Saito nonne gati ve matrix fact orizatio n, ” in Proc . of Interspeec h , September 2018, pp. 856–860. [47] ——, “T o wards complex nonnegat iv e matrix factoriz ation with the beta- di ver gence, ” in Pr oc. of the International W orkshop on Acoustic Signal Enhancement (iW AE NC) , September 2018. [48] M. Abe and J. O. Smith, “Design criteria for simple sinusoidal parameter estimati on based on quadrat ic interpolati on of FFT magnitude peaks, ” in Audio Engineering Society Con vention 117 , May 2004. [49] http://www .cs.tut.ﬁ/ ∼ magron/demos/ demo CISNMF .html. [50] A. Liutkus, F .-R. St ¨ oter , Z. Raﬁi, D. Kitamura , B. Ri ve t, N. Ito, N. Ono, and J. Fontec ave , “The 2016 signal separatio n ev aluat ion campaign, ” in Pr oc. of Internati onal Confer ence on Latent V ariable A nalysis and Signal Separa tion (L VA/IC A) , February 2017, pp. 323–332. [51] A. Liutkus, J. Pinel, R. Badeau, L. Girin, and G. Richard, “Informed source separatio n through spectrogram coding and data embeddi ng, ” Signal Pr ocessing , vol. 92, no. 8, pp. 1937–1949, 2012. [52] C. Rohlﬁng, J. M. B ecker , and M. W ien, “NMF-based informed source separation, ” in Pro f. IEEE Internati onal Conferen ce on Acoustics, Speec h and Signal Proce ssing (ICASSP) , March 2016, pp. 474–478. [53] C. Rohlﬁng, J. E. Cohen, and A. Liutkus, “V ery low bitrat e spatia l audio coding with dimensional ity reduction, ” in Pr of. IEEE International Confer ence on Acoust ics, Speec h and Signal Pr ocessing (ICASSP) , March 2017, pp. 741–745. [54] A. Ozerov , A. Liutk us, R. Badeau , and G. Richard, “Codin g-based informed s ource separat ion: Nonnegati ve tensor f actoriza tion approach , ” IEEE T ransaction s on Audio, Speech, and Languag e Proce ssing , vol. 21, no. 8, pp. 1699–1712, Aug 2013. [55] B. J. King, “New m ethods of complex matrix factor ization for single- channe l source separation and analysis, ” Ph.D. dissertation, Univ ersity of W ashington, 2012. [56] E. V incent, R. Gribon v al, and C. F ´ evotte , “Performance m easurement in blind audio source separation, ” IEEE T ransactions on Speec h and Aud io Pr ocessing , vol. 14, no. 4, pp. 1462–1469, July 2006. [57] T . V irtanen and A. T . Cemgil, “Mixtures of gamma priors for non- nega ti ve matrix factori zation based speech separati on, ” in P r oc. of Inter- national Confer ence on Latent V ariable Analysis and Signal Separatio n (L V A/ICA) , March 2009, pp. 646–653. [58] D. Kita mura, H. S aruwatari, K. Shikano, K. Kondo, and Y . T akahashi, “Music signal separa tion by supervi sed nonne gati ve matrix factoriz ation with basis deformation, ” in Pr oc. of International Conf eren ce on Digital Signal Pr ocessing (DSP) , July 2013, pp. 1–6. [59] H. Kameoka, H. Kagami, and M. Y ukawa, “Complex NMF with the general ized Kullback -Leibler div ergenc e, ” in Proc. of IEEE International Confer ence on Acoust ics, Speec h and Signal Pr ocessing (ICASSP) , March 2017, pp. 56–60. [60] S. Leglai ve, U. Simsekli, A. Liutku s, R. Badeau, and G. Richard, “ Alpha-stabl e multichanne l audio source separati on, ” in Pr oc. of IEE E Internati onal Confer ence on A coustic s, Speec h and Signal Proce ssing (ICASSP) , March 2017, pp. 576–580. [61] P . Magron, R. Badeau, and A. Liutkus, “L ´ evy NMF for robust nonneg- ati ve source separatio n, ” in Proc . of IEEE W orkshop on A pplicat ions of Signal Proc essing to Audio and Acousti cs (W ASP A A) , October 2017, pp. 259–263. [62] A. A. Nugrah a, A. Liutkus, and E. V incent, “Mult ichannel audio source separat ion with deep neur al ne tworks, ” IEEE/ACM T ransactions on Audio, Speec h, and Language Proc essing , vol. 24, no. 9, pp. 1652–1664, September 2016. [63] N. T akahashi and Y . Mitsufuji, “Multi-scal e multi-band DenseNets for audio source separation , ” in P roc. of IEEE W orkshop on Applicati ons of Signal P r ocessing to Audio and Acoustics (W ASP AA ) , October 2017, pp. 21–25.

Complex ISNMF: a Phase-Aware Model for Monaural Audio Source Separation

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment