Under-determined reverberant audio source separation using a full-rank spatial covariance model

This article addresses the modeling of reverberant recording environments in the context of under-determined convolutive blind source separation. We model the contribution of each source to all mixture channels in the time-frequency domain as a zero-…

Authors: Ngoc Duong (INRIA - Irisa), Emmanuel Vincent (INRIA - Irisa), Remi Gribonval (INRIA - Irisa)

INRIA Research Report (Rapport de recherche) No. 7116, December 2009, 19 pages. ISSN 0249-6399, ISRN INRIA/RR--7116--FR+ENG. Thème COM (Systèmes communicants), Équipe-Projet METISS. Institut National de Recherche en Informatique et en Automatique, Centre de recherche INRIA Rennes – Bretagne Atlantique, IRISA, Campus universitaire de Beaulieu, 35042 Rennes Cedex. Telephone: +33 2 99 84 71 00. Fax: +33 2 99 84 71 71.

Ngoc Q.K. Duong (qduong@irisa.fr), Emmanuel Vincent (emmanuel.vincent@inria.fr), Rémi Gribonval (remi.gribonval@inria.fr)

Abstract: This article addresses the modeling of reverberant recording environments in the context of under-determined convolutive blind source separation. We model the contribution of each source to all mixture channels in the time-frequency domain as a zero-mean Gaussian random variable whose covariance encodes the spatial characteristics of the source. We then consider four specific covariance models, including a full-rank unconstrained model. We derive a family of iterative expectation-maximization (EM) algorithms to estimate the parameters of each model and propose suitable procedures to initialize the parameters and to align the order of the estimated sources across all frequency bins based on their estimated directions of arrival (DOA). Experimental results over reverberant synthetic mixtures and live recordings of speech data show the effectiveness of the proposed approach.

Key-words: Convolutive blind source separation, under-determined mixtures, spatial covariance models, EM algorithm, permutation problem.

French title (translated): Separation of under-determined reverberant audio mixtures using a full-rank spatial covariance model

Résumé (translated): This article deals with the modeling of reverberant recording environments in the context of under-determined source separation. We model the contribution of each source to all mixture channels in the time-frequency domain as a zero-mean Gaussian random vector whose covariance encodes the spatial characteristics of the source. We consider four specific covariance models, including an unconstrained full-rank model. We derive a family of expectation-maximization (EM) algorithms for estimating the parameters of each model, and we propose suitable procedures for initializing the parameters and for matching the order of the sources across frequencies from their directions of arrival. Experimental results on synthetic and recorded reverberant mixtures show the relevance of the proposed approach.

Mots-clés (translated): Convolutive source separation, under-determined mixtures, spatial covariance models, EM algorithm, permutation problem.

1 Introduction

In blind source separation (BSS), audio signals are generally mixtures of several sound sources such as speech, music, and background noise. The recorded multichannel signal $\mathbf{x}(t)$ is therefore expressed as

$$\mathbf{x}(t) = \sum_{j=1}^{J} \mathbf{c}_j(t) \tag{1}$$

where $\mathbf{c}_j(t)$ is the spatial image of the $j$th source, that is the contribution of this source to all mixture channels.
For a point source in a reverberant environment, $\mathbf{c}_j(t)$ can be expressed via the convolutive mixing process

$$\mathbf{c}_j(t) = \sum_{\tau} \mathbf{h}_j(\tau)\, s_j(t-\tau) \tag{2}$$

where $s_j(t)$ is the $j$th source signal and $\mathbf{h}_j(\tau)$ the vector of filter coefficients modeling the acoustic path from this source to all microphones. Source separation consists in recovering either the $J$ original source signals or their spatial images given the $I$ mixture channels. In the following, we focus on the separation of under-determined mixtures, i.e. such that $I < J$.

Most existing approaches operate in the time-frequency domain using the short-time Fourier transform (STFT) and rely on the narrowband approximation of the convolutive mixture (2) by complex-valued multiplication in each frequency bin $f$ and time frame $n$ as

$$\mathbf{c}_j(n,f) \approx \mathbf{h}_j(f)\, s_j(n,f) \tag{3}$$

where the mixing vector $\mathbf{h}_j(f)$ is the Fourier transform of $\mathbf{h}_j(\tau)$, $s_j(n,f)$ are the STFT coefficients of the sources $s_j(t)$ and $\mathbf{c}_j(n,f)$ the STFT coefficients of their spatial images $\mathbf{c}_j(t)$. The sources are typically estimated under the assumption that they are sparse in the STFT domain. For instance, the degenerate unmixing estimation technique (DUET) [1] uses binary masking to extract the predominant source in each time-frequency bin. Another popular technique known as $\ell_1$-norm minimization extracts on the order of $I$ sources per time-frequency bin by solving a constrained $\ell_1$-minimization problem [2, 3]. The separation performance achievable by these techniques remains limited in reverberant environments [4], due in particular to the fact that the narrowband approximation does not hold because the mixing filters are much longer than the window length of the STFT.
Recently, a distinct framework has emerged whereby the STFT coefficients of the source images $\mathbf{c}_j(n,f)$ are modeled by a phase-invariant multivariate distribution whose parameters are functions of $(n,f)$ [5]. One instance of this framework consists in modeling $\mathbf{c}_j(n,f)$ as a zero-mean Gaussian random variable with covariance matrix

$$\mathbf{R}_{\mathbf{c}_j}(n,f) = v_j(n,f)\, \mathbf{R}_j(f) \tag{4}$$

where $v_j(n,f)$ are scalar time-varying variances encoding the spectro-temporal power of the sources and $\mathbf{R}_j(f)$ are time-invariant spatial covariance matrices encoding their spatial position and spatial spread [6]. The model parameters can then be estimated in the maximum likelihood (ML) sense and used to estimate the spatial images of all sources by Wiener filtering.

This framework was first applied to the separation of instantaneous audio mixtures in [7, 8] and shown to provide better separation performance than $\ell_1$-norm minimization. The instantaneous mixing process then translated into a rank-1 spatial covariance matrix for each source. In our preliminary paper [6], we extended this approach to convolutive mixtures and proposed to consider full-rank spatial covariance matrices modeling the spatial spread of the sources and circumventing the narrowband approximation. This approach was shown to improve the separation performance of reverberant mixtures both in an oracle context, where all model parameters are known, and in a semi-blind context, where the spatial covariance matrices of all sources are known but their variances are blindly estimated from the mixture.

In this article we extend this work to blind estimation of the model parameters for BSS application.
While the general expectation-maximization (EM) algorithm is well known as an appropriate choice for parameter estimation of Gaussian models [9, 10, 11, 12], it is very sensitive to the initialization [13], so that an effective parameter initialization scheme is necessary. Moreover, the well-known source permutation problem arises when the model parameters are independently estimated at different frequencies [14]. In the following, we address these two issues for the proposed models and evaluate these models together with state-of-the-art techniques on a considerably larger set of mixtures.

The structure of the rest of the article is as follows. We introduce the general framework under study as well as four specific spatial covariance models in Section 2. We then address the blind estimation of all model parameters from the observed mixture in Section 3. We compare the source separation performance achieved by each model to that of state-of-the-art techniques in various experimental settings in Section 4. Finally we conclude and discuss further research directions in Section 5.

2 General framework and spatial covariance models

We start by describing the general probabilistic modeling framework adopted from now on. We then define four models with different degrees of flexibility resulting in rank-1 or full-rank spatial covariance matrices.

2.1 General framework

Let us assume that the vector $\mathbf{c}_j(n,f)$ of STFT coefficients of the spatial image of the $j$th source follows a zero-mean Gaussian distribution whose covariance matrix factors as in (4). Under the classical assumption that the sources are uncorrelated, the vector $\mathbf{x}(n,f)$ of STFT coefficients of the mixture signal is also zero-mean Gaussian with covariance matrix

$$\mathbf{R}_{\mathbf{x}}(n,f) = \sum_{j=1}^{J} v_j(n,f)\, \mathbf{R}_j(f). \tag{5}$$

In other words, the likelihood of the set of observed mixture STFT coefficients $\mathbf{x} = \{\mathbf{x}(n,f)\}_{n,f}$ given the set of variance parameters $v = \{v_j(n,f)\}_{j,n,f}$ and that of spatial covariance matrices $\mathbf{R} = \{\mathbf{R}_j(f)\}_{j,f}$ is given by

$$P(\mathbf{x} \mid v, \mathbf{R}) = \prod_{n,f} \frac{1}{\det\left(\pi \mathbf{R}_{\mathbf{x}}(n,f)\right)}\, e^{-\mathbf{x}^H(n,f)\, \mathbf{R}_{\mathbf{x}}^{-1}(n,f)\, \mathbf{x}(n,f)} \tag{6}$$

where $^H$ denotes matrix conjugate transposition and $\mathbf{R}_{\mathbf{x}}(n,f)$ implicitly depends on $v$ and $\mathbf{R}$ according to (5). The covariance matrices are typically modeled by higher-level spatial parameters, as we shall see in the following.

Under this model, source separation can be achieved in two steps. The variance parameters $v$ and the spatial parameters underlying $\mathbf{R}$ are first estimated in the ML sense. The spatial images of all sources are then obtained in the minimum mean square error (MMSE) sense by multichannel Wiener filtering

$$\widehat{\mathbf{c}}_j(n,f) = v_j(n,f)\, \mathbf{R}_j(f)\, \mathbf{R}_{\mathbf{x}}^{-1}(n,f)\, \mathbf{x}(n,f). \tag{7}$$

2.2 Rank-1 convolutive model

Most existing approaches to audio source separation rely on the narrowband approximation of the convolutive mixing process (2) by the complex-valued multiplication (3). The covariance matrix of $\mathbf{c}_j(n,f)$ is then given by (4) where $v_j(n,f)$ is the variance of $s_j(n,f)$ and $\mathbf{R}_j(f)$ is equal to the rank-1 matrix

$$\mathbf{R}_j(f) = \mathbf{h}_j(f)\, \mathbf{h}_j^H(f) \tag{8}$$

with $\mathbf{h}_j(f)$ denoting the Fourier transform of the mixing filters $\mathbf{h}_j(\tau)$. This rank-1 convolutive model of the spatial covariance matrices has recently been exploited in [13] together with a different model of the source variances.
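As an illustration of the two-step procedure of Section 2.1, the following NumPy sketch applies (5) and (7) in a single frequency bin once the parameters are available. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `wiener_spatial_images` and the array layout (frames along the first axis of `x`, sources along the first axis of `v` and `R`) are hypothetical choices made for this example.

```python
import numpy as np

def wiener_spatial_images(x, v, R):
    """Estimate the spatial images of all sources in one frequency bin.

    x: (N, I) complex mixture STFT coefficients over N time frames
    v: (J, N) nonnegative source variances v_j(n, f)
    R: (J, I, I) spatial covariance matrices R_j(f)
    Returns an array of shape (J, N, I) holding the estimates of c_j(n, f).
    """
    # Source image covariances v_j(n,f) R_j(f) as in (4), shape (J, N, I, I)
    Rc = v[:, :, None, None] * R[:, None, :, :]
    # Mixture covariance R_x(n,f) as in (5), shape (N, I, I)
    Rx = Rc.sum(axis=0)
    # Solve R_x y = x in each frame rather than forming the inverse explicitly
    y = np.linalg.solve(Rx, x[:, :, None])        # (N, I, 1)
    # Multichannel Wiener filter (7)
    return (Rc @ y[None])[..., 0]                 # (J, N, I)
```

Because the Wiener gains $v_j \mathbf{R}_j \mathbf{R}_{\mathbf{x}}^{-1}$ sum to the identity over $j$, the estimated spatial images add up to the mixture, which gives a simple sanity check on any implementation.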
2.3 Rank-1 anechoic model

In an anechoic recording environment without reverberation, each mixing filter boils down to the combination of a delay $\tau_{ij}$ and a gain $\kappa_{ij}$ specified by the distance $r_{ij}$ from the $j$th source to the $i$th microphone [15]

$$\tau_{ij} = \frac{r_{ij}}{c} \quad \text{and} \quad \kappa_{ij} = \frac{1}{\sqrt{4\pi}\, r_{ij}} \tag{9}$$

where $c$ is the sound velocity. The spatial covariance matrix of the $j$th source is hence given by the rank-1 anechoic model

$$\mathbf{R}_j(f) = \mathbf{a}_j(f)\, \mathbf{a}_j^H(f) \tag{10}$$

where the Fourier transform $\mathbf{a}_j(f)$ of the mixing filters is now parameterized as

$$\mathbf{a}_j(f) = \begin{pmatrix} \kappa_{1,j}\, e^{-2i\pi f \tau_{1,j}} \\ \vdots \\ \kappa_{I,j}\, e^{-2i\pi f \tau_{I,j}} \end{pmatrix}. \tag{11}$$

2.4 Full-rank direct+diffuse model

One possible interpretation of the narrowband approximation is that the sound of each source as recorded on the microphones comes from a single spatial position at each frequency $f$, as specified by $\mathbf{h}_j(f)$ or $\mathbf{a}_j(f)$. This approximation is not valid in a reverberant environment, since reverberation induces some spatial spread of each source, due to echoes at many different positions on the walls of the recording room. This spread translates into full-rank spatial covariance matrices.

The theory of statistical room acoustics assumes that the spatial image of each source is composed of two uncorrelated parts: a direct part modeled by $\mathbf{a}_j(f)$ in (11) and a reverberant part. The spatial covariance $\mathbf{R}_j(f)$ of each source is then a full-rank matrix defined as the sum of the covariance of its direct part and the covariance of its reverberant part such that

$$\mathbf{R}_j(f) = \mathbf{a}_j(f)\, \mathbf{a}_j^H(f) + \sigma_{\mathrm{rev}}^2\, \mathbf{\Psi}(f) \tag{12}$$

where $\sigma_{\mathrm{rev}}^2$ is the variance of the reverberant part and $\Psi_{il}(f)$ is a function of the distance $d_{il}$ between the $i$th and the $l$th microphone such that $\Psi_{ii}(f) = 1$.
This model assumes that the reverberation recorded at all microphones has the same power but is correlated as characterized by $\Psi(d_{il}, f)$. This model has been employed for single source localization in [15] but not for source separation yet.

Assuming that the reverberant part is diffuse, i.e. its intensity is uniformly distributed over all possible directions, its normalized cross-correlation can be shown to be real-valued and equal to [16]

$$\Psi_{il}(f) = \frac{\sin(2\pi f d_{il}/c)}{2\pi f d_{il}/c}. \tag{13}$$

Moreover, the power of the reverberant part within a parallelepipedic room with dimensions $L_x$, $L_y$, $L_z$ is given by

$$\sigma_{\mathrm{rev}}^2 = \frac{4\beta^2}{A\,(1-\beta^2)} \tag{14}$$

where $A$ is the total wall area and $\beta$ the wall reflection coefficient computed from the room reverberation time $T_{60}$ via Eyring's formula [15]

$$\beta = \exp\left\{ -\frac{13.82}{\left(\frac{1}{L_x} + \frac{1}{L_y} + \frac{1}{L_z}\right) c\, T_{60}} \right\}. \tag{15}$$

2.5 Full-rank unconstrained model

In practice, the assumption that the reverberant part is diffuse is rarely satisfied. Indeed, early echoes containing more energy are not uniformly distributed on the walls of the recording room, but located at certain positions depending on the position of the source and the microphones. When performing some simulations in a rectangular room, we observed that (13) is valid on average when considering a large number of sources at different positions, but generally not valid for each source considered independently.

Therefore, we also investigate the modeling of each source via an unconstrained spatial covariance matrix $\mathbf{R}_j(f)$ whose coefficients are not related a priori. Since this model is more general than (8) and (12), it allows more flexible modeling of the mixing process and hence potentially improves the separation performance of real-world convolutive mixtures.
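To make the parametric models of Sections 2.3 and 2.4 concrete, the following NumPy sketch evaluates (9), (11), (13), (14), (15) and assembles the direct+diffuse covariance (12) for one source at one frequency. It is only a sketch under assumptions: all function names are hypothetical, the default velocity of 334 m/s is borrowed from the experimental settings, and the total wall area $A$ is taken to be the full surface of the six faces of the room, which the text does not spell out.

```python
import numpy as np

def anechoic_steering(f, r, c=334.0):
    """a_j(f) from (9) and (11); r holds the I source-to-microphone distances in m."""
    tau = r / c                                   # delays, (9)
    kappa = 1.0 / (np.sqrt(4 * np.pi) * r)        # gains, (9)
    return kappa * np.exp(-2j * np.pi * f * tau)  # (11)

def diffuse_coherence(f, d, c=334.0):
    """Psi(f) from (13); d is the (I, I) matrix of inter-microphone distances in m."""
    # np.sinc(u) = sin(pi u) / (pi u), so u = 2 f d / c reproduces (13)
    return np.sinc(2 * f * d / c)

def reverb_power(Lx, Ly, Lz, T60, c=334.0):
    """sigma_rev^2 from (14), with beta given by Eyring's formula (15)."""
    beta = np.exp(-13.82 / ((1 / Lx + 1 / Ly + 1 / Lz) * c * T60))
    # Assumption: A is the surface of all six faces of the room
    A = 2 * (Lx * Ly + Ly * Lz + Lx * Lz)
    return 4 * beta**2 / (A * (1 - beta**2))

def direct_diffuse_covariance(f, r, d, sigma2_rev, c=334.0):
    """Full-rank direct+diffuse spatial covariance (12)."""
    a = anechoic_steering(f, r, c)
    return np.outer(a, a.conj()) + sigma2_rev * diffuse_coherence(f, d, c)
```

Dropping the second term of `direct_diffuse_covariance` gives the rank-1 anechoic model (10), so the same helper covers both parametric models.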
3 Blind estimation of the model parameters

In order to use the above models for BSS, we now need to estimate their parameters from the observed mixture signal only. In our preliminary paper [6], we used a quasi-Newton algorithm for semi-blind separation that converged in a very small number of iterations. However, due to the complexity of each iteration, we later found out that the EM algorithm provided faster convergence in practice despite a larger number of iterations. We hence choose EM for blind separation in the following.

More precisely, we adopt the following three-step procedure: initialization of $\mathbf{h}_j(f)$ or $\mathbf{R}_j(f)$ by hierarchical clustering, iterative ML estimation of all model parameters via EM, and permutation alignment. The latter step is needed only for the rank-1 convolutive model and the full-rank unconstrained model, whose parameters are estimated independently in each frequency bin. The overall procedure is depicted in Fig. 1.

Figure 1: Flow of the proposed blind source separation approach.

3.1 Initialization by hierarchical clustering

Preliminary experiments showed that the initialization of the model parameters greatly affects the separation performance resulting from the EM algorithm. In the following, we propose a hierarchical clustering-based initialization scheme inspired by the algorithm in [2].

This scheme relies on the assumption that the sound from each source comes from a certain region of space at each frequency $f$, which is different for all sources. The vectors $\mathbf{x}(n,f)$ of mixture STFT coefficients are then likely to cluster around the direction of the associated mixing vector $\mathbf{h}_j(f)$ in the time frames $n$ where the $j$th source is predominant.
In order to estimate these clusters, we first normalize the vectors of mixture STFT coefficients as

$$\bar{\mathbf{x}}(n,f) \leftarrow \frac{\mathbf{x}(n,f)}{\|\mathbf{x}(n,f)\|_2}\, e^{-i \arg(x_1(n,f))} \tag{16}$$

where $\arg(\cdot)$ denotes the phase of a complex number and $\|\cdot\|_2$ the Euclidean norm. We then define the distance between two clusters $C_1$ and $C_2$ by the average distance between the associated normalized mixture STFT coefficients

$$d(C_1, C_2) = \frac{1}{|C_1||C_2|} \sum_{\bar{\mathbf{x}}_1 \in C_1} \sum_{\bar{\mathbf{x}}_2 \in C_2} \|\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2\|_2. \tag{17}$$

In a given frequency bin, the vectors of mixture STFT coefficients on all time frames are first considered as clusters containing a single item. The distance between each pair of clusters is computed and the two clusters with the smallest distance are merged. This "bottom up" process called linking is repeated until the number of clusters is smaller than a predetermined threshold $K$. This threshold is usually much larger than the number of sources $J$ [2], so as to eliminate outliers. We finally choose the $J$ clusters with the largest number of samples. The initial mixing vector and spatial covariance matrix for each source are then computed as

$$\mathbf{h}_j^{\mathrm{init}}(f) = \frac{1}{|C_j|} \sum_{\bar{\mathbf{x}}(n,f) \in C_j} \tilde{\mathbf{x}}(n,f) \tag{18}$$

$$\mathbf{R}_j^{\mathrm{init}}(f) = \frac{1}{|C_j|} \sum_{\bar{\mathbf{x}}(n,f) \in C_j} \tilde{\mathbf{x}}(n,f)\, \tilde{\mathbf{x}}(n,f)^H \tag{19}$$

where $\tilde{\mathbf{x}}(n,f) = \mathbf{x}(n,f)\, e^{-i \arg(x_1(n,f))}$. Note that, contrary to the algorithm in [2], we define the distance between clusters as the average distance between the normalized mixture STFT coefficients instead of the minimum distance between them. Besides, the mixing vector $\mathbf{h}_j^{\mathrm{init}}(f)$ is computed from the phase-normalized mixture STFT coefficients $\tilde{\mathbf{x}}(n,f)$ instead of the both phase- and amplitude-normalized coefficients $\bar{\mathbf{x}}(n,f)$. These modifications were found to provide a better initial approximation of the mixing parameters in our experiments.
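The clustering-based initialization of (16) to (19) can be sketched in NumPy as follows for a single frequency bin. This is an illustrative sketch rather than the authors' code: the function name `init_by_clustering` is hypothetical, frames with zero norm are assumed to have been discarded beforehand, and the average-linkage distance (17) is maintained with the standard size-weighted update of the merged cluster's distances, which yields the same averages as recomputing (17) from scratch.

```python
import numpy as np

def init_by_clustering(x, J, K=30):
    """Initialize h_j(f) and R_j(f) in one frequency bin.

    x: (N, I) complex mixture STFT coefficients (all frames assumed nonzero)
    Returns h_init of shape (J, I) and R_init of shape (J, I, I).
    """
    # Phase normalization w.r.t. the first channel, then amplitude normalization (16)
    x_tilde = x * np.exp(-1j * np.angle(x[:, :1]))
    x_bar = x_tilde / np.linalg.norm(x, axis=1, keepdims=True)

    # Start from singleton clusters; D holds the cluster distances (17)
    clusters = [[n] for n in range(len(x))]
    D = np.linalg.norm(x_bar[:, None, :] - x_bar[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)

    # Merge the two closest clusters until fewer than K clusters remain
    while len(clusters) >= K:
        a, b = sorted(int(k) for k in np.unravel_index(np.argmin(D), D.shape))
        na, nb = len(clusters[a]), len(clusters[b])
        # Average distance from the merged cluster to every other cluster
        merged = (na * D[a] + nb * D[b]) / (na + nb)
        D[a, :] = merged
        D[:, a] = merged
        D[a, a] = np.inf
        D = np.delete(np.delete(D, b, axis=0), b, axis=1)
        clusters[a].extend(clusters.pop(b))

    # Keep the J most populated clusters and apply (18) and (19)
    clusters.sort(key=len, reverse=True)
    h_init = np.stack([x_tilde[c].mean(axis=0) for c in clusters[:J]])
    R_init = np.stack([
        (x_tilde[c][:, :, None] * x_tilde[c][:, None, :].conj()).mean(axis=0)
        for c in clusters[:J]
    ])
    return h_init, R_init
```

The pairwise distance matrix makes this sketch quadratic in memory in the number of frames, which is acceptable for the few hundred frames of a 10-second recording but would need a more economical linkage routine for long signals.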
We also tested random initialization and direction-of-arrival (DOA) based initialization, i.e. where the mixing vectors $\mathbf{h}_j^{\mathrm{init}}(f)$ are derived from known source and microphone positions assuming no reverberation. Both schemes were found to result in slower convergence and poorer separation performance than the proposed scheme.

3.2 EM updates for the rank-1 convolutive model

The derivation of the EM parameter estimation algorithm for the rank-1 convolutive model is strongly inspired by the study in [13], which relies on the same model of spatial covariance matrices but on a distinct model of source variances. Similarly to [13], EM cannot be directly applied to the mixture model (1) since the estimated mixing vectors remain fixed to their initial value. This issue can be addressed by considering the noisy mixture model

$$\mathbf{x}(n,f) = \mathbf{H}(f)\, \mathbf{s}(n,f) + \mathbf{b}(n,f) \tag{20}$$

where $\mathbf{H}(f)$ is the mixing matrix whose $j$th column is the mixing vector $\mathbf{h}_j(f)$, $\mathbf{s}(n,f)$ is the vector of source STFT coefficients $s_j(n,f)$ and $\mathbf{b}(n,f)$ some additive zero-mean Gaussian noise. We denote by $\mathbf{R}_{\mathbf{s}}(n,f)$ the diagonal covariance matrix of $\mathbf{s}(n,f)$. Following [13], we assume that $\mathbf{b}(n,f)$ is stationary and spatially uncorrelated and denote by $\mathbf{R}_{\mathbf{b}}(f)$ its time-invariant diagonal covariance matrix. This matrix is initialized to a small value related to the average accuracy of the mixing vector initialization procedure.

EM is separately derived for each frequency bin $f$ for the complete data $\{\mathbf{x}(n,f), s_j(n,f)\}_{j,n}$, that is the set of mixture and source STFT coefficients of all time frames. The details of one iteration are as follows.
In the E-step, the Wiener filter $\mathbf{W}(n,f)$ and the conditional mean $\widehat{\mathbf{s}}(n,f)$ and covariance $\widehat{\mathbf{R}}_{\mathbf{ss}}(n,f)$ of the sources are computed as

$$\mathbf{R}_{\mathbf{s}}(n,f) = \operatorname{diag}(v_1(n,f), \ldots, v_J(n,f)) \tag{21}$$

$$\mathbf{R}_{\mathbf{x}}(n,f) = \mathbf{H}(f)\, \mathbf{R}_{\mathbf{s}}(n,f)\, \mathbf{H}^H(f) + \mathbf{R}_{\mathbf{b}}(f) \tag{22}$$

$$\mathbf{W}(n,f) = \mathbf{R}_{\mathbf{s}}(n,f)\, \mathbf{H}^H(f)\, \mathbf{R}_{\mathbf{x}}^{-1}(n,f) \tag{23}$$

$$\widehat{\mathbf{s}}(n,f) = \mathbf{W}(n,f)\, \mathbf{x}(n,f) \tag{24}$$

$$\widehat{\mathbf{R}}_{\mathbf{ss}}(n,f) = \widehat{\mathbf{s}}(n,f)\, \widehat{\mathbf{s}}^H(n,f) + (\mathbf{I} - \mathbf{W}(n,f)\, \mathbf{H}(f))\, \mathbf{R}_{\mathbf{s}}(n,f) \tag{25}$$

where $\mathbf{I}$ denotes the identity matrix (here of size $J \times J$, since $\mathbf{W}(n,f)\, \mathbf{H}(f)$ is $J \times J$) and $\operatorname{diag}(\cdot)$ the diagonal matrix whose entries are given by its arguments. Conditional expectations of multichannel statistics are also computed by averaging over all $N$ time frames as

$$\widehat{\mathbf{R}}_{\mathbf{ss}}(f) = \frac{1}{N} \sum_{n=1}^{N} \widehat{\mathbf{R}}_{\mathbf{ss}}(n,f) \tag{26}$$

$$\widehat{\mathbf{R}}_{\mathbf{xs}}(f) = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}(n,f)\, \widehat{\mathbf{s}}^H(n,f) \tag{27}$$

$$\widehat{\mathbf{R}}_{\mathbf{xx}}(f) = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}(n,f)\, \mathbf{x}^H(n,f). \tag{28}$$

In the M-step, the source variances, the mixing matrix and the noise covariance are updated via

$$v_j(n,f) = \left[\widehat{\mathbf{R}}_{\mathbf{ss}}(n,f)\right]_{jj} \tag{29}$$

$$\mathbf{H}(f) = \widehat{\mathbf{R}}_{\mathbf{xs}}(f)\, \widehat{\mathbf{R}}_{\mathbf{ss}}^{-1}(f) \tag{30}$$

$$\mathbf{R}_{\mathbf{b}}(f) = \operatorname{Diag}\left(\widehat{\mathbf{R}}_{\mathbf{xx}}(f) - \mathbf{H}(f)\, \widehat{\mathbf{R}}_{\mathbf{xs}}^H(f) - \widehat{\mathbf{R}}_{\mathbf{xs}}(f)\, \mathbf{H}^H(f) + \mathbf{H}(f)\, \widehat{\mathbf{R}}_{\mathbf{ss}}(f)\, \mathbf{H}^H(f)\right) \tag{31}$$

where $\operatorname{Diag}(\cdot)$ projects a matrix onto its diagonal.

3.3 EM updates for the full-rank unconstrained model

The derivation of EM for the full-rank unconstrained model is much easier since the above issue does not arise. We hence stick with the exact mixture model (1), which can be seen as an advantage of full-rank vs. rank-1 models. EM is again separately derived for each frequency bin $f$. Since the mixture can be recovered from the spatial images of all sources, the complete data reduces to $\{\mathbf{c}_j(n,f)\}_{n,f}$, that is the set of STFT coefficients of the spatial images of all sources on all time frames. The details of one iteration are as follows.
In the E-step, the Wiener filter $\mathbf{W}_j(n,f)$ and the conditional mean $\widehat{\mathbf{c}}_j(n,f)$ and covariance $\widehat{\mathbf{R}}_{\mathbf{c}_j}(n,f)$ of the spatial image of the $j$th source are computed as

$$\mathbf{W}_j(n,f) = \mathbf{R}_{\mathbf{c}_j}(n,f)\, \mathbf{R}_{\mathbf{x}}^{-1}(n,f) \tag{32}$$

$$\widehat{\mathbf{c}}_j(n,f) = \mathbf{W}_j(n,f)\, \mathbf{x}(n,f) \tag{33}$$

$$\widehat{\mathbf{R}}_{\mathbf{c}_j}(n,f) = \widehat{\mathbf{c}}_j(n,f)\, \widehat{\mathbf{c}}_j^H(n,f) + (\mathbf{I} - \mathbf{W}_j(n,f))\, \mathbf{R}_{\mathbf{c}_j}(n,f) \tag{34}$$

where $\mathbf{R}_{\mathbf{c}_j}(n,f)$ is defined in (4) and $\mathbf{R}_{\mathbf{x}}(n,f)$ in (5). In the M-step, the variance and the spatial covariance of the $j$th source are updated via

$$v_j(n,f) = \frac{1}{I} \operatorname{tr}\left(\mathbf{R}_j^{-1}(f)\, \widehat{\mathbf{R}}_{\mathbf{c}_j}(n,f)\right) \tag{35}$$

$$\mathbf{R}_j(f) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{v_j(n,f)}\, \widehat{\mathbf{R}}_{\mathbf{c}_j}(n,f) \tag{36}$$

where $\operatorname{tr}(\cdot)$ denotes the trace of a square matrix. Note that, strictly speaking, this algorithm is a generalized form of EM [17], since the M-step increases but does not maximize the likelihood of the complete data due to the interleaving of (35) and (36).

3.4 EM updates for the rank-1 anechoic model and the full-rank direct+diffuse model

The derivation of EM for the two remaining models is more complex since the M-step cannot be expressed in closed form. The complete data and the E-step for the rank-1 anechoic model and the full-rank direct+diffuse model are identical to those for the rank-1 convolutive model and the full-rank unconstrained model, respectively. The M-step, which consists of maximizing the likelihood of the complete data given their natural statistics computed in the E-step, could be addressed e.g. via a quasi-Newton technique or by sampling possible parameter values from a grid [12]. In the following, we do not attempt to derive the details of these algorithms since these two models appear to provide lower performance than the rank-1 convolutive model and the full-rank unconstrained model in a semi-blind context, as discussed in Section 4.2.
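For the full-rank unconstrained model of Section 3.3, one iteration of the updates (32) to (36) in a single frequency bin can be sketched in NumPy as below. This is a sketch under assumptions, not the reference implementation: the function names and array layout are hypothetical, and `log_likelihood` simply evaluates the logarithm of (6) for one bin so that convergence can be monitored.

```python
import numpy as np

def log_likelihood(x, v, R):
    """Logarithm of (6) restricted to one frequency bin."""
    Rx = (v[:, :, None, None] * R[:, None]).sum(axis=0)           # (5), (N, I, I)
    _, logdet = np.linalg.slogdet(np.pi * Rx)
    quad = np.einsum('na,nab,nb->n', x.conj(), np.linalg.inv(Rx), x).real
    return float(-(logdet + quad).sum())

def em_iteration_full_rank(x, v, R):
    """One generalized EM iteration for the full-rank unconstrained model.

    x: (N, I) mixture STFT coefficients; v: (J, N) variances; R: (J, I, I).
    Returns the updated (v, R).
    """
    N, I = x.shape
    Rc = v[:, :, None, None] * R[:, None]                         # (4), (J, N, I, I)
    Rx = Rc.sum(axis=0)                                           # (5)
    # E-step
    W = Rc @ np.linalg.inv(Rx)                                    # (32)
    c_hat = (W @ x[None, :, :, None])[..., 0]                     # (33)
    Rc_hat = (c_hat[..., :, None] * c_hat[..., None, :].conj()
              + (np.eye(I) - W) @ Rc)                             # (34)
    # M-step: variances first with the old R_j(f), then R_j(f) with the new variances
    R_inv = np.linalg.inv(R)
    v_new = np.einsum('jab,jnba->jn', R_inv, Rc_hat).real / I     # (35)
    R_new = (Rc_hat / v_new[:, :, None, None]).mean(axis=1)       # (36)
    return v_new, R_new
```

Since each of the two M-step updates increases the complete-data likelihood, the value returned by `log_likelihood` should not decrease from one iteration to the next, which is a convenient check when experimenting with the updates. Note that $v_j(n,f)$ and $\mathbf{R}_j(f)$ are only defined up to a reciprocal scaling, so their individual scales may drift without affecting the separation.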
3.5 Permutation alignment

Since the parameters of the rank-1 convolutive model and the full-rank unconstrained model are estimated independently in each frequency bin $f$, they should be ordered so as to correspond to the same source across all frequency bins. In order to solve this so-called permutation problem, we apply the DOA-based algorithm described in [18] for the rank-1 model. Given the geometry of the microphone array, this algorithm computes the DOAs of all sources and permutes the model parameters by clustering the estimated mixing vectors $\mathbf{h}_j(f)$ normalized as in (16).

Regarding the full-rank model, we first apply principal component analysis (PCA) to summarize the spatial covariance matrix $\mathbf{R}_j(f)$ of each source in each frequency bin by its first principal component $\mathbf{w}_j(f)$ that points to the direction of maximum variance. This vector is conceptually equivalent to the mixing vector $\mathbf{h}_j(f)$ of the rank-1 model. Thus, we can apply the same procedure to solve the permutation problem. Fig. 2 depicts the phase of the second entry $w_{2j}(f)$ of $\mathbf{w}_j(f)$ before and after solving the permutation for a real-world stereo recording of three female speech sources with room reverberation time $T_{60} = 250$ ms, where $\mathbf{w}_j(f)$ has been normalized as in (16). This phase is unambiguously related to the source DOAs below 5 kHz [18]. Above that frequency, spatial aliasing [18] occurs. Nevertheless, we can see that the source order is globally aligned for most frequency bins after solving the permutation.

Figure 2: Normalized argument of $w_{2j}(f)$ before and after permutation alignment from a real-world stereo recording of three sources with RT60 = 250 ms.
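The PCA step that summarizes a full-rank spatial covariance matrix by a single direction, as described in Section 3.5, can be sketched in NumPy as follows. The function name `principal_direction` is hypothetical, and the sketch covers only the extraction and normalization of $\mathbf{w}_j(f)$; the DOA-based clustering of [18] that actually resolves the permutation is not reproduced here.

```python
import numpy as np

def principal_direction(R):
    """First principal component w_j(f) of a Hermitian spatial covariance R_j(f).

    The result has unit norm and a real nonnegative first entry, i.e. it is
    normalized in the same way as the mixture coefficients in (16).
    """
    # eigh returns eigenvalues in ascending order, so the last column of the
    # eigenvector matrix points to the direction of maximum variance
    _, eigvecs = np.linalg.eigh(R)
    w = eigvecs[:, -1]
    return w / np.linalg.norm(w) * np.exp(-1j * np.angle(w[0]))
```

For a stereo mixture, the phase of the second entry of the returned vector, `np.angle(w[1])`, is the quantity plotted in Fig. 2.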
4 Experimental evaluation

We evaluate the above models and algorithms under three different experimental settings. Firstly, we compare all four models in a semi-blind setting so as to estimate an upper bound of their separation performance. Based on these results, we select two models for further study, namely the rank-1 convolutive model and the full-rank unconstrained model. Secondly, we evaluate these models in a blind setting over synthetic reverberant speech mixtures and compare them to state-of-the-art algorithms over the real-world speech mixtures of the 2008 Signal Separation Evaluation Campaign (SiSEC 2008) [4]. Finally, we assess the robustness of these two models to source movements in a semi-blind setting.

4.1 Common parameter settings and performance criteria

The common parameter settings for all experiments are summarized in Table 1. In order to evaluate the separation performance of the algorithms, we use the signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), signal-to-artifact ratio (SAR) and source image-to-spatial distortion ratio (ISR) criteria expressed in decibels (dB), as defined in [19]. These criteria account respectively for overall distortion of the target source, residual crosstalk from other sources, musical noise and spatial or filtering distortion of the target.

| Parameter | Value |
|---|---|
| Signal duration | 10 seconds |
| Number of channels | I = 2 |
| Sampling rate | 16 kHz |
| Window type | sine window |
| STFT frame size | 2048 |
| STFT frame shift | 1024 |
| Propagation velocity | 334 m/s |
| Number of EM iterations | 10 |
| Cluster threshold | K = 30 |

Table 1: Common experimental parameter settings.

4.2 Potential source separation performance of all models

The first experiment is devoted to the investigation of the potential source separation performance achievable by each model in a semi-blind context, i.e. assuming knowledge of the true spatial covariance matrices. We generated three stereo synthetic mixtures of three speech sources by convolving different sets of speech signals, i.e. male voices, female voices, and mixed male and female voices, with room impulse responses simulated via the source image method. The positions of the sources and the microphones are illustrated in Fig. 3. The distance from each source to the center of the microphone pair was 120 cm and the microphone spacing was 20 cm. The reverberation time was set to RT60 = 250 ms.

Figure 3: Room geometry setting for synthetic convolutive mixtures.

The true spatial covariance matrices $\mathbf{R}_j(f)$ of all sources were computed either from the positions of the sources and the microphones and other room parameters or from the mixing filters. More precisely, we used the equations in Sections 2.2, 2.3 and 2.4 for the rank-1 models and the full-rank direct+diffuse model, and ML estimation from the spatial images of the true sources for the full-rank unconstrained model. The source variances were then estimated from the mixture using the quasi-Newton technique in [6], for which an efficient initialization exists when the spatial covariance matrices are fixed. Binary masking and $\ell_1$-norm minimization were also evaluated for comparison using the same mixing vectors as the rank-1 convolutive model with the reference software in [4]. The results are averaged over all sources and all sets of mixtures and shown in Table 2.
| Covariance models | Number of spatial parameters | SDR (dB) | SIR (dB) | SAR (dB) | ISR (dB) |
|---|---|---|---|---|---|
| Rank-1 anechoic | 6 | 0.8 | 2.4 | 7.9 | 5.0 |
| Rank-1 convolutive | 3078 | 3.8 | 7.5 | 5.3 | 9.3 |
| Full-rank direct+diffuse | 8 | 3.2 | 6.9 | 5.4 | 7.9 |
| Full-rank unconstrained | 6156 | 5.6 | 10.7 | 7.3 | 11.0 |
| Binary masking | 3078 | 3.3 | 11.1 | 2.4 | 8.4 |
| ℓ1-norm minimization | 3078 | 2.7 | 7.7 | 3.4 | 8.6 |

Table 2: Average potential source separation performance in a semi-blind setting over stereo mixtures of three sources with RT60 = 250 ms.

The rank-1 anechoic model has the lowest performance because it only accounts for the direct path. By contrast, the full-rank unconstrained model has the highest performance and improves the SDR by 1.8 dB, 2.3 dB, and 2.9 dB when compared to the rank-1 convolutive model, binary masking, and $\ell_1$-norm minimization respectively. The full-rank direct+diffuse model results in an SDR decrease of 0.6 dB compared to the rank-1 convolutive model. This decrease appears surprisingly small when considering the fact that the former involves only 8 spatial parameters (6 distances $r_{ij}$, plus $\sigma_{\mathrm{rev}}^2$ and $d$) instead of 3078 parameters (6 mixing coefficients per frequency bin) for the latter. Nevertheless, we focus on the two best models, namely the rank-1 convolutive model and the full-rank unconstrained model, in subsequent experiments.

4.3 Blind source separation performance as a function of the reverberation time

The second experiment aims to investigate the blind source separation performance achieved via these two models and via binary masking and $\ell_1$-norm minimization in different reverberant conditions. Synthetic speech mixtures were generated in the same way as in the first experiment, except that the microphone spacing was changed to 5 cm and the distance from the sources to the microphones to 50 cm. The reverberation time was varied in the range from 50 to 500 ms.
The resulting source separation performance in terms of SDR, SIR, SAR, and ISR is depicted in Fig. 4. We observe that in a low reverberant environment, i.e. $T_{60} = 50$ ms, the rank-1 convolutive model provides the best SDR and SAR. This is consistent with the fact that the direct part contains most of the energy received at the microphones, so that the rank-1 spatial covariance matrix provides similar modeling accuracy to the full-rank model with fewer parameters. However, in an environment with realistic reverberation time, i.e. $T_{60} \geq 130$ ms, the full-rank unconstrained model outperforms both the rank-1 model and binary masking in terms of SDR and SAR and results in an SIR very close to that of binary masking. For instance, with $T_{60} = 500$ ms, the SDR achieved via the full-rank unconstrained model is 2.0 dB, 1.2 dB and 2.3 dB larger than that of the rank-1 convolutive model, binary masking, and $\ell_1$-norm minimization respectively. These results confirm the effectiveness of our proposed model parameter estimation scheme and also show that full-rank spatial covariance matrices better approximate the mixing process in a reverberant room.

Figure 4: Average blind source separation performance over stereo mixtures of three sources as a function of the reverberation time.

4.4 Blind source separation with the SiSEC 2008 test data

We conducted a third experiment to compare the proposed full-rank unconstrained model-based algorithm with state-of-the-art BSS algorithms submitted for evaluation to SiSEC 2008 over real-world mixtures of 3 or 4 speech sources. Two mixtures were recorded for each given number of sources, using either male or female speech signals. The room reverberation time was 250 ms and the microphone spacing 5 cm [4].
The average SDR achieved by each algorithm is listed in Table 3. The SDR figures of all algorithms except ours were taken from the website of SiSEC 2008¹.

| Algorithms | 3-source mixtures, SDR (dB) | 4-source mixtures, SDR (dB) |
|---|---|---|
| full-rank unconstrained | 3.8 | 2.0 |
| M. Cobos [20] | 2.2 | 1.0 |
| M. Mandel [21] | 0.8 | 1.0 |
| R. Weiss [22] | 2.3 | 1.5 |
| S. Araki [23] | 3.7 | - |
| Z. El Chami [24] | 3.1 | 1.4 |

Table 3: Average SDR over the real-world test data of SiSEC 2008 with $T_{60} = 250$ ms and 5 cm microphone spacing.

For three-source mixtures, our algorithm provides a 0.1 dB SDR improvement compared to the best current result given by Araki's algorithm [23]. For four-source mixtures, it provides an even higher SDR improvement of 0.5 dB compared to the best current result given by Weiss's algorithm [22].

4.5 Investigation of the robustness to small source movements

Our last experiment aims to examine the robustness of the rank-1 convolutive model and the full-rank unconstrained model to small source movements. We made several recordings of three speech sources $s_1$, $s_2$, $s_3$ in a meeting room with 250 ms reverberation time using omnidirectional microphones spaced by 5 cm. The distance from the sources to the microphones was 50 cm. For each recording, the spatial images of all sources were separately recorded and then added together to obtain a test mixture. After the first recording, we kept the same positions for $s_1$ and $s_2$ and successively moved $s_3$ by 5° and 10° both clockwise and counter-clockwise, resulting in 4 new positions of $s_3$. We then applied the same procedure to $s_2$ while the positions of $s_1$ and $s_3$ remained identical to those in the first recording.
Overall, we collected nine mixtures: one from the first recording, four mixtures with 5° movement of either $s_2$ or $s_3$, and four mixtures with 10° movement of either $s_2$ or $s_3$. We performed source separation in a semi-blind setting: the source spatial covariance matrices were estimated from the spatial images of all sources recorded in the first recording while the source variances were estimated from the nine mixtures using the same algorithm as in Section 4.2. The average SDR and SIR obtained for the first mixture and for the mixtures with 5° and 10° source movement are depicted in Fig. 5 and Fig. 6, respectively. This procedure simulates errors encountered by on-line source separation algorithms in moving source environments, where the source separation parameters learnt at a given time are not applicable anymore at a later time.

¹ http://sisec2008.wiki.irisa.fr/tiki-index.php?page=Under-determined+speech+and+music+mixtures

Figure 5: SDR results in the small source movement scenarios.

Figure 6: SIR results in the small source movement scenarios.

The separation performance of the rank-1 convolutive model degrades more than that of the full-rank unconstrained model both with 5° and 10° source rotation. For instance, the SDR drops by 0.6 dB for the full-rank unconstrained model-based algorithm when a source moves by 5° while the corresponding drop for the rank-1 convolutive model equals 1 dB. This result can be explained when considering the fact that the full-rank model accounts for the spatial spread of each source as well as its spatial direction. Therefore, small source movements remaining in the range of the spatial spread do not affect separation performance much.
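To make the distinction between spatial direction and spatial spread more concrete, the sketch below builds a rank-1 spatial covariance matrix from a single mixing vector and a full-rank one by adding a second component, then compares their eigenvalues. The far-field steering vector, the 30° direction, the 1 kHz frequency, the speed of sound, and the weight of the added term are made-up values for illustration, and the identity matrix used for the added term is a simplification rather than the diffuse model of the article.

```python
import numpy as np

def steering_vector(angle_deg, freq, spacing=0.05, c=343.0):
    """Far-field phase differences for a two-microphone pair (anechoic assumption)."""
    delay = spacing * np.sin(np.deg2rad(angle_deg)) / c
    return np.array([1.0, np.exp(-2j * np.pi * freq * delay)])

h = steering_vector(30.0, 1000.0)
R_rank1 = np.outer(h, h.conj())         # h h^H: encodes a direction only
R_full = R_rank1 + 0.5 * np.eye(2)      # extra term stands in for spatial spread

# eigvalsh returns the real eigenvalues of a Hermitian matrix in ascending order
eig_rank1 = np.linalg.eigvalsh(R_rank1)  # approximately [0, 2]
eig_full = np.linalg.eigvalsh(R_full)    # approximately [0.5, 2.5]
```

The rank-1 matrix places all of its energy along one vector, so any mismatch between that vector and the actual source position goes unmodeled, whereas the full-rank matrix keeps some energy in the orthogonal direction.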
This result indicates that, besides its numerous advantages presented in the previous experiments, this model could also offer a promising approach to the separation of moving sources due to its greater robustness to parameter estimation errors.

5 Conclusion and discussion

In this article, we presented a general probabilistic framework for convolutive source separation based on the notion of spatial covariance matrix. We proposed four specific models, including rank-1 models based on the narrowband approximation and full-rank models that overcome this approximation, and derived an efficient algorithm to estimate their parameters from the mixture. Experimental results indicate that the proposed full-rank unconstrained spatial covariance model better accounts for reverberation and therefore improves separation performance compared to rank-1 models and state-of-the-art algorithms in realistic reverberant environments.

Let us now mention several further research directions. Short-term work will be dedicated to the modeling and separation of diffuse and semi-diffuse sources or background noise via the full-rank unconstrained model. Contrary to the rank-1 model in [13] which involves an explicit spatially uncorrelated noise component, this model implicitly represents noise as any other source and can account for multiple noise sources as well as spatially correlated noises with various spatial spreads. A further goal is to complete the probabilistic framework by defining a prior distribution for the model parameters across all frequency bins so as to improve the robustness of parameter estimation with small amounts of data and to address the permutation problem in a probabilistically relevant fashion.
Finally, a promising way to improve source separation performance is to combine the spatial covariance models investigated in this article with models of the source spectra such as Gaussian mixture models [11] or nonnegative matrix factorization [13].

References

[1] O. Yılmaz and S. T. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Trans. on Signal Processing, vol. 52, no. 7, pp. 1830–1847, July 2004.

[2] S. Winter, W. Kellermann, H. Sawada, and S. Makino, "MAP-based underdetermined blind source separation of convolutive mixtures by hierarchical clustering and $\ell_1$-norm minimization," EURASIP Journal on Advances in Signal Processing, vol. 2007, 2007, article ID 24717.

[3] P. Bofill, "Underdetermined blind separation of delayed sound sources in the frequency domain," Neurocomputing, vol. 55, no. 3-4, pp. 627–641, 2003.

[4] E. Vincent, S. Araki, and P. Bofill, "The 2008 signal separation evaluation campaign: A community-based approach to large-scale evaluation," in Proc. Int. Conf. on Independent Component Analysis and Signal Separation (ICA), 2009, pp. 734–741.

[5] E. Vincent, M. G. Jafari, S. A. Abdallah, M. D. Plumbley, and M. E. Davies, "Probabilistic modeling paradigms for audio source separation," in Machine Audition: Principles, Algorithms and Systems, W. Wang, Ed. IGI Global, to appear.

[6] N. Q. K. Duong, E. Vincent, and R. Gribonval, "Spatial covariance models for under-determined reverberant audio source separation," in Proc. 2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2009, pp. 129–132.

[7] C. Févotte and J.-F. Cardoso, "Maximum likelihood approach for blind audio source separation using time-frequency Gaussian models," in Proc. 2005 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2005, pp. 78–81.

[8] E. Vincent, S. Arberet, and R. Gribonval, "Underdetermined instantaneous audio source separation via local Gaussian modeling," in Proc. Int. Conf. on Independent Component Analysis and Signal Separation (ICA), 2009, pp. 775–782.

[9] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society. Series B, vol. 39, pp. 1–38, 1977.

[10] J.-F. Cardoso, H. Snoussi, J. Delabrouille, and G. Patanchon, "Blind separation of noisy Gaussian stationary sources. Application to cosmic microwave background imaging," in Proc. European Signal Processing Conference (EUSIPCO), vol. 1, 2002, pp. 561–564.

[11] S. Arberet, A. Ozerov, R. Gribonval, and F. Bimbot, "Blind spectral-GMM estimation for underdetermined instantaneous audio source separation," in Proc. Int. Conf. on Independent Component Analysis and Signal Separation (ICA), 2009, pp. 751–758.

[12] Y. Izumi, N. Ono, and S. Sagayama, "Sparseness-based 2CH BSS using the EM algorithm in reverberant environment," in Proc. 2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2007, pp. 147–150.

[13] A. Ozerov and C. Févotte, "Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation," IEEE Trans. on Audio, Speech and Language Processing, to appear.

[14] H. Sawada, S. Araki, R. Mukai, and S. Makino, "Grouping separated frequency components by estimating propagation model parameters in frequency-domain blind source separation," IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1592–1604, July 2007.

[15] T. Gustafsson, B. D. Rao, and M. Trivedi, "Source localization in reverberant environments: Modeling and statistical analysis," IEEE Trans. on Speech and Audio Processing, vol. 11, pp. 791–803, Nov 2003.

[16] H. Kuttruff, Room Acoustics, 4th ed. New York: Spon Press, 2000.

[17] G. McLachlan and T. Krishnan, The EM algorithm and extensions. New York, NY: Wiley, 1997.

[18] H. Sawada, S. Araki, R. Mukai, and S. Makino, "Solving the permutation problem of frequency-domain BSS when spatial aliasing occurs with wide sensor spacing," in Proc. 2006 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2006, pp. 77–80.

[19] E. Vincent, H. Sawada, P. Bofill, S. Makino, and J. Rosca, "First stereo audio source separation evaluation campaign: data, algorithms and results," in Proc. Int. Conf. on Independent Component Analysis and Signal Separation (ICA), 2007, pp. 552–559.

[20] M. Cobos and J. López, "Blind separation of underdetermined speech mixtures based on DOA segmentation," IEEE Trans. on Audio, Speech, and Language Processing, submitted.

[21] M. Mandel and D. Ellis, "EM localization and separation using interaural level and phase cues," in Proc. 2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2007, pp. 275–278.

[22] R. Weiss and D. Ellis, "Speech separation using speaker-adapted eigenvoice speech models," Computer Speech and Language, vol. 24, no. 1, pp. 16–20, Jan 2010.

[23] S. Araki, T. Nakatani, H. Sawada, and S. Makino, "Stereo source separation and source counting with MAP estimation with Dirichlet prior considering spatial aliasing problem," in Proc. Int. Conf. on Independent Component Analysis and Signal Separation (ICA), 2009, pp. 742–750.

[24] Z. El Chami, D. T. Pham, C. Servière, and A. Guerin, "A new model based underdetermined source separation," in Proc. Int. Workshop on Acoustic Echo and Noise Control (IWAENC), 2008.
