Efficient Full-Rank Spatial Covariance Estimation Using Independent Low-Rank Matrix Analysis for Blind Source Separation

In this paper, we propose a new algorithm that efficiently separates a directional source and diffuse background noise based on independent low-rank matrix analysis (ILRMA). ILRMA is one of the state-of-the-art techniques of blind source separation (…

Authors: Yuki Kubo, Norihiro Takamune, Daichi Kitamura, Hiroshi Saruwatari

Yuki Kubo†, Norihiro Takamune†, Daichi Kitamura‡, Hiroshi Saruwatari†
† The University of Tokyo, Graduate School of Information Science and Technology, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
‡ National Institute of Technology, Kagawa College, 355 Chokushi-cho, Takamatsu, Kagawa 761-8058, Japan

Abstract: In this paper, we propose a new algorithm that efficiently separates a directional source and diffuse background noise based on independent low-rank matrix analysis (ILRMA). ILRMA is one of the state-of-the-art techniques of blind source separation (BSS) and is based on a rank-1 spatial model. Although such a model does not hold for diffuse noise, ILRMA can accurately estimate the spatial parameters of the directional source. Motivated by this fact, we utilize these estimates to restore the lost spatial basis of diffuse noise, which can be considered as an efficient full-rank spatial covariance estimation. BSS experiments show the efficacy of the proposed method in terms of the computational cost and separation performance.

Index Terms: Blind source separation, independent low-rank matrix analysis, full-rank spatial covariance model, diffuse noise

I. INTRODUCTION

Blind source separation (BSS) is a technique for separating an observed multichannel signal, which is a mixture of multiple sources, into the individual sources without any prior information about the sources or the mixing system. In a determined or overdetermined situation (number of sensors ≥ number of sources), frequency-domain independent component analysis (FDICA) [1], [2], independent vector analysis (IVA) [3], [4], and independent low-rank matrix analysis (ILRMA) [5], [6] have been proposed for audio BSS problems.
In particular, ILRMA assumes low-rankness for the power spectrogram of each source using nonnegative matrix factorization (NMF) [7], [8] in addition to statistical independence between sources, and achieves efficient and accurate separation [5]. These methods assume a rank-1 spatial model; that is, the frequency-wise acoustic path of each source can be represented by a single time-invariant spatial basis, which is often called a steering vector. Under this assumption, the determined BSS problem reduces to the estimation of a demixing matrix for each frequency.

However, the assumption of the rank-1 spatial model often does not hold in practice. For instance, when a target source (directional source) and diffuse noise that arrives from all directions are mixed, FDICA, IVA, and ILRMA cannot extract only the target source in principle [9], and the estimated target source includes residual diffuse noise. Multichannel NMF (MNMF) [10], [11] is theoretically equivalent to ILRMA except for the mixing model; namely, MNMF employs a full-rank spatial covariance matrix [12]. This model can represent not only the acoustic path but also the spatial spread of each source or diffuse noise, but its optimization has a huge computational cost and lacks robustness against the initialization [5]. To accelerate the parameter estimation, FastMNMF has been proposed [13], [14]. It assumes jointly diagonalizable spatial covariance matrices to greatly reduce the computational cost of the update algorithm, although its performance still depends on the initial values of the parameters. To increase the stability of its performance, ILRMA-based initialization was utilized for MNMF in [15]. However, the improvement is still limited because of the complexity of optimization with a large number of parameters.

In this paper, we treat the BSS problem with one directional target source and diffuse background noise, where two or more microphones are available.
In this case, the target source can be expressed using the rank-1 spatial covariance (one steering vector), but diffuse noise requires the full-rank spatial covariance because of its spatial spread. To achieve robust and computationally efficient BSS in this situation, we propose a new approach based on ILRMA: (a) the rank-1 target covariance and rank-$(M-1)$ diffuse noise covariance matrices are simultaneously estimated by ILRMA, where $M$ is the number of microphones, (b) one lost spatial basis for diffuse noise is restored to obtain the rank-$M$ (full-rank) noise covariance via the expectation-maximization (EM) algorithm, and (c) a multichannel Wiener filter is applied to enhance only the target source. The efficacy of the proposed method is confirmed through BSS experiments using a mixture of speech and diffuse noise.

Regarding its relation to prior works, the proposed method can be considered as a spatial model extension of FDICA, IVA, and ILRMA, which are the conventional independence-based BSS algorithms utilizing the rank-1 spatial model. Compared with conventional MNMF and FastMNMF based on the full-rank spatial model, the proposed method can be regarded as a computationally efficient algorithm with higher separation accuracy.

II. INDEPENDENT LOW-RANK MATRIX ANALYSIS

A. Formulation

Let us denote a multichannel observed signal as $\boldsymbol{x}_{ij} = (x_{ij,1}, \ldots, x_{ij,m}, \ldots, x_{ij,M})^{\mathsf{T}} \in \mathbb{C}^{M}$ that is obtained via a short-time Fourier transform (STFT), where $i = 1, \ldots, I$, $j = 1, \ldots, J$, and $m = 1, \ldots, M$ are the indices of the frequency bins, time frames, and microphones, respectively, and $^{\mathsf{T}}$ denotes the transpose. Also, source signals (dry sources) are denoted as $\boldsymbol{s}_{ij} = (s_{ij,1}, \ldots, s_{ij,n}, \ldots, s_{ij,N})^{\mathsf{T}} \in \mathbb{C}^{N}$, where $n = 1, \ldots, N$ is the index of the sources and $N$ is the number of sources.
If each source in $\boldsymbol{x}_{ij}$ can be represented by a time-invariant steering vector $\boldsymbol{a}_{i,n} \in \mathbb{C}^{M}$, the following mixing system holds:

$$\boldsymbol{x}_{ij} = \boldsymbol{A}_i \boldsymbol{s}_{ij}, \tag{1}$$

where $\boldsymbol{A}_i = (\boldsymbol{a}_{i,1} \cdots \boldsymbol{a}_{i,N})$ is called a mixing matrix. If $M = N$ and $\boldsymbol{A}_i$ is invertible, the separated signal $\boldsymbol{y}_{ij} = (y_{ij,1}, \ldots, y_{ij,N})^{\mathsf{T}} \in \mathbb{C}^{N}$ can be obtained by estimating the demixing matrix $\boldsymbol{W}_i = (\boldsymbol{w}_{i,1} \cdots \boldsymbol{w}_{i,N})^{\mathsf{H}} = \boldsymbol{A}_i^{-1}$ as

$$\boldsymbol{y}_{ij} = \boldsymbol{W}_i \boldsymbol{x}_{ij}, \tag{2}$$

where $^{\mathsf{H}}$ denotes the Hermitian transpose.

B. Generative Model and Update Rules

In ILRMA, as the generative model of source signals, the following complex Gaussian distribution is assumed:

$$s_{ij,n} \sim \mathcal{N}_c(0, r_{ij,n}), \tag{3}$$

where $r_{ij,n}$ is the time-frequency-varying variance (power spectrogram model of $s_{ij,n}$). Also, $r_{ij,n}$ is modeled by NMF [16] as $r_{ij,n} = \sum_l t_{il,n} v_{lj,n}$, where $t_{il,n} \geq 0$ and $v_{lj,n} \geq 0$ are the NMF variables, $l = 1, \ldots, L$ is the index of the NMF bases, and $L$ is the number of bases. From (1) and (3), the generative model of the observed signal becomes

$$\boldsymbol{x}_{ij} \sim \mathcal{N}_c\!\left(\boldsymbol{0}, \sum_n r_{ij,n} \boldsymbol{a}_{i,n} \boldsymbol{a}_{i,n}^{\mathsf{H}}\right). \tag{4}$$

Since the mixing system (1) is assumed in ILRMA, the spatial covariance is represented by a rank-1 matrix as $\boldsymbol{a}_{i,n} \boldsymbol{a}_{i,n}^{\mathsf{H}}$, which is called the rank-1 spatial model.

The cost function in ILRMA is defined as the negative log-likelihood function of (4) as

$$\mathcal{L} = -2J \sum_i \log |\det \boldsymbol{W}_i| + \sum_{i,j,n} \left[ \frac{|y_{ij,n}|^2}{r_{ij,n}} + \log r_{ij,n} \right], \tag{5}$$

where $y_{ij,n} = \boldsymbol{w}_{i,n}^{\mathsf{H}} \boldsymbol{x}_{ij}$. Both the separation filter $\boldsymbol{w}_{i,n}$ and the NMF variables $t_{il,n}$ and $v_{lj,n}$ can be optimized in the maximum likelihood sense (minimization of (5)) by iterating the following update rules [5]:

$$\boldsymbol{G}_{i,n} = \frac{1}{J} \sum_j \frac{1}{r_{ij,n}} \boldsymbol{x}_{ij} \boldsymbol{x}_{ij}^{\mathsf{H}}, \tag{6}$$

$$\boldsymbol{w}_{i,n} \leftarrow (\boldsymbol{W}_i \boldsymbol{G}_{i,n})^{-1} \boldsymbol{e}_n, \tag{7}$$

$$\boldsymbol{w}_{i,n} \leftarrow \boldsymbol{w}_{i,n} \left(\boldsymbol{w}_{i,n}^{\mathsf{H}} \boldsymbol{G}_{i,n} \boldsymbol{w}_{i,n}\right)^{-\frac{1}{2}}, \tag{8}$$

where $\boldsymbol{e}_n$ denotes the unit vector with the $n$th element equal to unity.

[Fig. 1. SIR improvement for directional speech and diffuse noise. Axis: SIR improvement [dB], 0 to 20; bars: diffuse noise, target speech.]
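As an illustration of the update rules (6)-(8), the following is a minimal NumPy sketch of one sweep over all sources at a single frequency bin. It is not the authors' implementation: the function name `ip_update` and the array layout (frames along the first axis, rows of `W` holding $\boldsymbol{w}_{i,n}^{\mathsf{H}}$) are assumptions made for this example, and the source variances are treated as given.

```python
import numpy as np

def ip_update(X, W, R):
    """One sweep of the updates (6)-(8) at one frequency bin (illustrative sketch).

    X: (J, M) complex observations x_ij for a fixed i
    W: (N, M) demixing matrix whose n-th row is w_n^H (N == M assumed)
    R: (J, N) positive source variances r_ij,n
    """
    J, M = X.shape
    N = W.shape[0]
    W = np.array(W, dtype=complex)          # work on a complex copy
    for n in range(N):
        # Eq. (6): G = (1/J) sum_j x_j x_j^H / r_j
        G = (X.T / R[:, n]) @ X.conj() / J
        # Eq. (7): w = (W G)^{-1} e_n
        w = np.linalg.solve(W @ G, np.eye(N)[:, n])
        # Eq. (8): scale so that w^H G w = 1
        w = w / np.sqrt(np.real(w.conj() @ G @ w))
        W[n, :] = w.conj()                  # store w^H as the n-th row
    return W
```

After a sweep, each updated filter satisfies $\boldsymbol{w}_{i,n}^{\mathsf{H}} \boldsymbol{G}_{i,n} \boldsymbol{w}_{i,n} = 1$, which is a convenient sanity check when experimenting with this sketch.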
The update rules for $\boldsymbol{w}_{i,n}$ are called iterative projection [17], which provides efficient optimization with guaranteed convergence. Also, we can update $t_{il,n}$ and $v_{lj,n}$ by minimizing the Itakura–Saito divergence between $|y_{ij,n}|^2$ and $r_{ij,n} = \sum_l t_{il,n} v_{lj,n}$ (see [5] for details).

III. PROPOSED METHOD

A. Motivation and Strategy

In this paper, we deal with a mixture signal that includes one directional target source and diffuse background noise. Since diffuse noise cannot be expressed by the rank-1 spatial model (one steering vector), BSS based on a full-rank covariance model, such as MNMF, should be applied in this situation. However, estimation of the full-rank covariance has a huge computational cost, and its performance is less stable than that of ILRMA [5] because of the large number of spatial parameters, $INM^2$, which can be reduced to $INM$ using the rank-1 spatial model (ILRMA). For this reason, to achieve efficient and stable BSS, we propose a new ILRMA-based full-rank covariance estimation using two or more microphones.

Although the sources are categorized into two groups (target and noise), we assume that one target source and $M-1$ noise components are mixed ($N = M$). This assumption allows us to model the diffuse noise using $M-1$ spatial bases (rank-$(M-1)$ spatial covariance). The extraction of the target source in this manner is still difficult because noise components exist even in the same direction as the target source. However, FDICA or ILRMA can separate the diffuse noise with high accuracy even if one spatial basis for diffuse noise is lacking. Figure 1 shows an example of the separation performance (source-to-interference ratio (SIR) [18]) obtained by ILRMA, where directional speech and diffuse noise are mixed and the experimental conditions are described in Sect. IV.
It can be seen that the diffuse noise, which is modeled using the rank-$(M-1)$ spatial covariance, is estimated more accurately than the target speech (almost perfectly, with an accuracy of more than 20 dB). This is because the demixing filters for the diffuse noise can precisely cancel the target speech, which is a point source [19], meaning that the steering vector of the directional source $\boldsymbol{a}_{i,n_h}$ can be estimated by ILRMA with high accuracy, where $n_h$ denotes the index of the target source. This implies that we can fix some spatial parameters in the full-rank spatial model for diffuse noise by utilizing the estimates obtained by ILRMA in advance.

On the basis of the above motivation, we propose the following new estimation method for the full-rank spatial covariance of diffuse noise: (a) the rank-1 spatial covariance for the target source, $\boldsymbol{a}_{i,n_h} \boldsymbol{a}_{i,n_h}^{\mathsf{H}}$, and the rank-$(M-1)$ covariance for diffuse noise, $\sum_{n \neq n_h} \boldsymbol{a}_{i,n} \boldsymbol{a}_{i,n}^{\mathsf{H}}$, are estimated by ILRMA, (b) the lost spatial basis for diffuse noise is restored via the EM algorithm to estimate the noise components in the direction of the target source, and (c) a multichannel Wiener filter is applied to suppress the noise components remaining in the separated target source.

B. Model of Target Source and Diffuse Noise

The observed signal $\boldsymbol{x}_{ij}$ is assumed to be the sum of two components, as

$$\boldsymbol{x}_{ij} = \boldsymbol{h}_{ij} + \boldsymbol{u}_{ij}, \tag{9}$$

where $\boldsymbol{h}_{ij} = (h_{ij,1}, \ldots, h_{ij,M})^{\mathsf{T}} \in \mathbb{C}^{M}$ is the spatial image of the target source and $\boldsymbol{u}_{ij} = (u_{ij,1}, \ldots, u_{ij,M})^{\mathsf{T}} \in \mathbb{C}^{M}$ is that of the diffuse noise. The target source $\boldsymbol{h}_{ij}$ is modeled as

$$\boldsymbol{h}_{ij} = \boldsymbol{a}_i^{(h)} s_{ij}^{(h)}, \tag{10}$$

$$s_{ij}^{(h)} \sim \mathcal{N}_c\big(0, r_{ij}^{(h)}\big), \tag{11}$$

where $\boldsymbol{a}_i^{(h)}$, $s_{ij}^{(h)}$, and $r_{ij}^{(h)}$ are the $n_h$th steering vector $\boldsymbol{a}_{i,n_h}$, the dry source component, and the power spectrogram of the $n_h$th source, respectively. As mentioned in Sect. III-A, $\boldsymbol{a}_i^{(h)}$ can be accurately estimated by ILRMA.
Thus, we hereafter consider $\boldsymbol{a}_i^{(h)}$ as a given and fixed parameter in the following processes. In addition to (11), to improve the estimation performance, we introduce an a priori distribution for the variance $r_{ij}^{(h)}$ using the inverse gamma distribution,

$$p\big(r_{ij}^{(h)}; \alpha, \beta\big) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \big(r_{ij}^{(h)}\big)^{-\alpha-1} \exp\!\left(-\frac{\beta}{r_{ij}^{(h)}}\right), \tag{12}$$

where $\alpha > 0$ and $\beta > 0$ are the shape and scale parameters, respectively, and a large $\alpha$ with a small $\beta$ induces the sparseness of $r_{ij}^{(h)}$.

Since diffuse noise should have a full-rank spatial covariance, the generative model of $\boldsymbol{u}_{ij}$ is expressed by a multivariate complex Gaussian distribution as

$$\boldsymbol{u}_{ij} \sim \mathcal{N}_c\big(\boldsymbol{0}, r_{ij}^{(u)} \boldsymbol{R}_i^{(u)}\big), \tag{13}$$

where $r_{ij}^{(u)}$ and $\boldsymbol{R}_i^{(u)}$ are the variance and spatial covariance of the diffuse noise, respectively. From the estimated demixing filter $\boldsymbol{w}_{i,n}$ obtained by ILRMA, we can model the full-rank spatial covariance of the diffuse noise as follows:

$$\boldsymbol{R}_i^{(u)} = \boldsymbol{R}_i^{\prime(u)} + \lambda_i \boldsymbol{b}_i \boldsymbol{b}_i^{\mathsf{H}}, \tag{14}$$

$$\boldsymbol{R}_i^{\prime(u)} = \frac{1}{J} \sum_j \boldsymbol{W}_i^{-1} \operatorname{diag}\!\Big( |\boldsymbol{w}_{i,1}^{\mathsf{H}} \boldsymbol{x}_{ij}|^2, \ldots, |\boldsymbol{w}_{i,n_h-1}^{\mathsf{H}} \boldsymbol{x}_{ij}|^2, 0, |\boldsymbol{w}_{i,n_h+1}^{\mathsf{H}} \boldsymbol{x}_{ij}|^2, \ldots, |\boldsymbol{w}_{i,N}^{\mathsf{H}} \boldsymbol{x}_{ij}|^2 \Big) \big(\boldsymbol{W}_i^{-1}\big)^{\mathsf{H}}, \tag{15}$$

where $\boldsymbol{b}_i$ is the unit eigenvector of $\boldsymbol{R}_i^{\prime(u)}$ that corresponds to the zero eigenvalue and $\lambda_i$ is a scalar weight used to complement the lost spatial basis, namely, the direction of the target source. Note that (15) includes a back-projection operation to compensate for the scales of the signals [20]. Since $\boldsymbol{R}_i^{\prime(u)}$ consists of $M-1$ noise estimates, its rank is $M-1$. Therefore, to restore the lost spatial basis in $\boldsymbol{R}_i^{\prime(u)}$, we must simultaneously estimate the eigenvalue $\lambda_i$, the variance of the target source $r_{ij}^{(h)}$, and the variance of the diffuse noise $r_{ij}^{(u)}$ with $\boldsymbol{a}_i^{(h)}$ and the rank-$(M-1)$ spatial covariance $\boldsymbol{R}_i^{\prime(u)}$ fixed.
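To make (14) and (15) concrete, the sketch below computes the rank-$(M-1)$ covariance and the unit eigenvector $\boldsymbol{b}_i$ for one frequency bin with NumPy. The function name `noise_covariance_basis` and the array layout are hypothetical choices for this example, and the demixing matrix is assumed to have already been estimated (rows holding $\boldsymbol{w}_{i,n}^{\mathsf{H}}$).

```python
import numpy as np

def noise_covariance_basis(X, W, n_h):
    """Rank-(M-1) noise covariance of Eq. (15) and the unit vector b (sketch).

    X: (J, M) observations at one frequency bin
    W: (M, M) demixing matrix whose n-th row is w_n^H
    n_h: index of the target source (0-based in this sketch)
    """
    Y = X @ W.T                       # Y[j, n] = w_n^H x_j
    P = np.abs(Y) ** 2                # separated powers, shape (J, N)
    P[:, n_h] = 0.0                   # drop the target component
    A = np.linalg.inv(W)              # back-projection matrix W^{-1}
    d = P.mean(axis=0)                # time average over the J frames
    R0 = (A * d) @ A.conj().T         # W^{-1} diag(d) W^{-H}
    # eigh returns eigenvalues in ascending order, so column 0 belongs
    # to the (numerically) zero eigenvalue of the rank-deficient R0.
    _, eigvec = np.linalg.eigh(R0)
    b = eigvec[:, 0]
    return R0, b
```

Because $\boldsymbol{W}_i \boldsymbol{W}_i^{-1}$ is the identity, the null direction of this matrix is parallel to the target demixing filter $\boldsymbol{w}_{i,n_h}$; comparing `b` with the normalized `W[n_h].conj()` is therefore a quick consistency check. The full-rank model (14) is then `R0 + lam * np.outer(b, b.conj())` for a scalar weight `lam`.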
In summary, the number of spatial parameters to be estimated in the proposed method is $INM$ (for ILRMA) $+\, I$ (for $\lambda_i$), i.e., $I(NM+1)$, which is much less than that of MNMF ($INM^2$) and FastMNMF ($IM^2 + INM$).

C. Update Rules Based on EM Algorithm

The parameters $\lambda_i$, $r_{ij}^{(h)}$, and $r_{ij}^{(u)}$ are optimized by maximum a posteriori estimation based on the EM algorithm. The Q function is defined as the expected value of the complete-data log-likelihood w.r.t. $p\big(s_{ij}^{(h)}, \boldsymbol{u}_{ij} \,|\, \boldsymbol{x}_{ij}; \tilde{\Theta}\big)$ as

$$Q(\Theta; \tilde{\Theta}) = \sum_{i,j} \left[ -(\alpha + 2) \log r_{ij}^{(h)} - \frac{\hat{r}_{ij}^{(h)} + \beta}{r_{ij}^{(h)}} - M \log r_{ij}^{(u)} - \log \det \boldsymbol{R}_i^{(u)} - \frac{\operatorname{tr}\!\big( (\boldsymbol{R}_i^{(u)})^{-1} \hat{\boldsymbol{R}}_{ij}^{(u)} \big)}{r_{ij}^{(u)}} \right] + \mathrm{const.}, \tag{16}$$

where const. includes the constant terms that do not depend on the parameters, $\Theta = \{ r_{ij}^{(h)}, r_{ij}^{(u)}, \lambda_i \}$ is the set of parameters to be updated, $\tilde{\Theta} = \{ \tilde{r}_{ij}^{(h)}, \tilde{r}_{ij}^{(u)}, \tilde{\lambda}_i \}$ is the set of up-to-date parameters, and $\hat{r}_{ij}^{(h)}$ and $\hat{\boldsymbol{R}}_{ij}^{(u)}$ are the sufficient statistics obtained in the E-step. The update rules in the E-step are as follows:

$$\tilde{\boldsymbol{R}}_i^{(u)} = \boldsymbol{R}_i^{\prime(u)} + \tilde{\lambda}_i \boldsymbol{b}_i \boldsymbol{b}_i^{\mathsf{H}}, \tag{17}$$

$$\boldsymbol{R}_{ij}^{(x)} = \tilde{r}_{ij}^{(h)} \boldsymbol{a}_i^{(h)} \big(\boldsymbol{a}_i^{(h)}\big)^{\mathsf{H}} + \tilde{r}_{ij}^{(u)} \tilde{\boldsymbol{R}}_i^{(u)}, \tag{18}$$

$$\hat{r}_{ij}^{(h)} = \tilde{r}_{ij}^{(h)} - \big(\tilde{r}_{ij}^{(h)}\big)^2 \big(\boldsymbol{a}_i^{(h)}\big)^{\mathsf{H}} \big(\boldsymbol{R}_{ij}^{(x)}\big)^{-1} \boldsymbol{a}_i^{(h)} + \Big| \tilde{r}_{ij}^{(h)} \boldsymbol{x}_{ij}^{\mathsf{H}} \big(\boldsymbol{R}_{ij}^{(x)}\big)^{-1} \boldsymbol{a}_i^{(h)} \Big|^2, \tag{19}$$

$$\hat{\boldsymbol{R}}_{ij}^{(u)} = \tilde{r}_{ij}^{(u)} \tilde{\boldsymbol{R}}_i^{(u)} - \big(\tilde{r}_{ij}^{(u)}\big)^2 \tilde{\boldsymbol{R}}_i^{(u)} \big(\boldsymbol{R}_{ij}^{(x)}\big)^{-1} \tilde{\boldsymbol{R}}_i^{(u)} + \big(\tilde{r}_{ij}^{(u)}\big)^2 \tilde{\boldsymbol{R}}_i^{(u)} \big(\boldsymbol{R}_{ij}^{(x)}\big)^{-1} \boldsymbol{x}_{ij} \boldsymbol{x}_{ij}^{\mathsf{H}} \big(\boldsymbol{R}_{ij}^{(x)}\big)^{-1} \tilde{\boldsymbol{R}}_i^{(u)}. \tag{20}$$

In the M-step, we apply a coordinate ascent algorithm to the Q function. The update rules are as follows:

$$r_{ij}^{(h)} \leftarrow \frac{\hat{r}_{ij}^{(h)} + \beta}{\alpha + 2}, \tag{21}$$

$$\boldsymbol{K}_i = \frac{1}{J} \sum_j \frac{1}{\tilde{r}_{ij}^{(u)}} \hat{\boldsymbol{R}}_{ij}^{(u)}, \tag{22}$$

$$\lambda_i \leftarrow \boldsymbol{b}_i^{\mathsf{H}} \boldsymbol{K}_i \boldsymbol{b}_i, \tag{23}$$

$$\boldsymbol{R}_i^{(u)} \leftarrow \boldsymbol{R}_i^{\prime(u)} + \lambda_i \boldsymbol{b}_i \boldsymbol{b}_i^{\mathsf{H}}, \tag{24}$$

$$r_{ij}^{(u)} \leftarrow \frac{1}{M} \operatorname{tr}\!\Big( \big(\boldsymbol{R}_i^{(u)}\big)^{-1} \hat{\boldsymbol{R}}_{ij}^{(u)} \Big). \tag{25}$$

TABLE I
EXPERIMENTAL CONDITIONS

Sampling frequency          16 kHz
STFT                        256-ms-long Hamming window with 128 ms shift
Number of NMF bases L       10 for source model
Number of iterations        50 in ILRMA
Number of iterations        200 in methods except ILRMA

D. Multichannel Wiener Filter

After the estimation of all the parameters, the following multichannel Wiener filter is employed:

$$\hat{\boldsymbol{h}}_{ij} = r_{ij}^{(h)} \boldsymbol{a}_i^{(h)} \big(\boldsymbol{a}_i^{(h)}\big)^{\mathsf{H}} \big(\boldsymbol{R}_{ij}^{(x)}\big)^{-1} \boldsymbol{x}_{ij}, \tag{26}$$

$$\hat{\boldsymbol{u}}_{ij} = r_{ij}^{(u)} \boldsymbol{R}_i^{(u)} \big(\boldsymbol{R}_{ij}^{(x)}\big)^{-1} \boldsymbol{x}_{ij}. \tag{27}$$

E. Initialization of Source Variances

Since the EM algorithm strongly depends on the initial values of the parameters, we employ the ILRMA estimates to initialize the source variances $r_{ij}^{(h)}$ and $r_{ij}^{(u)}$ as follows, to avoid being trapped at a poor local solution:

$$r_{ij}^{(h)} = \sum_l t_{il,n_h} v_{lj,n_h}, \tag{28}$$

$$r_{ij}^{(u)} = \frac{1}{M} \big(\hat{\boldsymbol{y}}_{ij}^{(u)}\big)^{\mathsf{H}} \big(\boldsymbol{R}_i^{\prime(u)}\big)^{+} \hat{\boldsymbol{y}}_{ij}^{(u)}, \tag{29}$$

where $t_{il,n_h}$ and $v_{lj,n_h}$ are the low-rank source model of the target source obtained by ILRMA, $^{+}$ denotes the pseudoinverse, and $\hat{\boldsymbol{y}}_{ij}^{(u)}$ is the scale-fixed source image of diffuse noise obtained as $\sum_{n \neq n_h} \boldsymbol{W}_i^{-1} (0, \ldots, 0, \boldsymbol{w}_{i,n}^{\mathsf{H}} \boldsymbol{x}_{ij}, 0, \ldots, 0)^{\mathsf{T}}$. Also, $\lambda_i$ is initialized by the minimum nonzero eigenvalue of $\boldsymbol{R}_i^{\prime(u)}$.

IV. EXPERIMENTS

A. Experimental Conditions

To confirm the efficacy of the proposed method, we conducted a BSS experiment using a simulated mixture of a target speech source and diffuse noise. We compared seven methods, namely, ILRMA [5], BSSA [19], the original MNMF [11], MNMF initialized by ILRMA (ILRMA+MNMF) [5], [15], the original FastMNMF [14], FastMNMF initialized by ILRMA (ILRMA+FastMNMF), and the proposed method ($\alpha = 0.7$ and $\beta = 10^{-16}$ were selected experimentally). In ILRMA, the observation $\boldsymbol{x}_{ij}$ was preprocessed via a sphering transformation using PCA. For BSSA, we replaced FDICA in [19] with ILRMA and set the oversubtraction and flooring parameters to 1.4 and 0, respectively.
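For readers who want to reproduce the proposed method with the hyperparameters quoted above ($\alpha = 0.7$, $\beta = 10^{-16}$), one EM iteration (17)-(25) and the Wiener filter (26)-(27) can be sketched in NumPy as follows. This is a per-frequency-bin illustration under assumed conventions, not the authors' code: the function names `em_step` and `wiener_filter`, the argument order, and the array shapes are hypothetical.

```python
import numpy as np

def em_step(X, a, R0, b, r_h, r_u, lam, alpha=0.7, beta=1e-16):
    """One EM iteration, Eqs. (17)-(25), at one frequency bin (sketch).

    X: (J, M) observations; a: (M,) target steering vector a^(h)
    R0: (M, M) rank-(M-1) noise covariance; b: (M,) unit vector
    r_h, r_u: (J,) target / noise variances; lam: scalar weight
    """
    J, M = X.shape
    aaH = np.outer(a, a.conj())
    bbH = np.outer(b, b.conj())
    Ru = R0 + lam * bbH                                        # Eq. (17)
    Rx = r_h[:, None, None] * aaH + r_u[:, None, None] * Ru    # Eq. (18)
    Rx_inv = np.linalg.inv(Rx)                                 # (J, M, M)
    # E-step
    Ria = Rx_inv @ a                                           # Rx^{-1} a, (J, M)
    quad = np.real(Ria @ a.conj())                             # a^H Rx^{-1} a
    proj = np.einsum('jm,jm->j', X.conj(), Ria)                # x^H Rx^{-1} a
    r_h_hat = r_h - r_h**2 * quad + np.abs(r_h * proj)**2      # Eq. (19)
    Rix = np.einsum('jmn,jn->jm', Rx_inv, X)                   # Rx^{-1} x
    S = Rx_inv - np.einsum('jm,jn->jmn', Rix, Rix.conj())
    Ru_hat = (r_u[:, None, None] * Ru
              - (r_u**2)[:, None, None] * (Ru @ S @ Ru))       # Eq. (20)
    # M-step (coordinate ascent)
    r_h_new = (r_h_hat + beta) / (alpha + 2.0)                 # Eq. (21)
    K = np.mean(Ru_hat / r_u[:, None, None], axis=0)           # Eq. (22)
    lam_new = float(np.real(b.conj() @ K @ b))                 # Eq. (23)
    Ru_new_inv = np.linalg.inv(R0 + lam_new * bbH)             # Eq. (24)
    r_u_new = np.real(np.einsum('mn,jnm->j', Ru_new_inv, Ru_hat)) / M  # Eq. (25)
    return r_h_new, r_u_new, lam_new

def wiener_filter(X, a, Ru, r_h, r_u):
    """Multichannel Wiener filter, Eqs. (26)-(27) (sketch)."""
    Rx = r_h[:, None, None] * np.outer(a, a.conj()) + r_u[:, None, None] * Ru
    Rix = np.linalg.solve(Rx, X[:, :, None])[:, :, 0]          # Rx^{-1} x
    h_hat = r_h[:, None] * (Rix @ a.conj())[:, None] * a[None, :]
    u_hat = r_u[:, None] * (Rix @ Ru.T)
    return h_hat, u_hat
```

A typical use would call `em_step` a few times starting from ILRMA-based initial values and then pass `R0 + lam * np.outer(b, b.conj())` to `wiener_filter`. By construction the two outputs of the filter sum to the observation, which gives a simple check on any implementation.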
For ILRMA, the original MNMF, and the original FastMNMF, all the NMF variables were initialized with nonnegative random values. The demixing matrix $\boldsymbol{W}_i$ in ILRMA and the spatial covariance matrices in the original MNMF and the original FastMNMF were initialized with the identity matrix $\boldsymbol{I}$. For ILRMA+MNMF and ILRMA+FastMNMF, the NMF variables were taken from ILRMA. Also, the spatial covariance matrix was initialized using $\boldsymbol{a}_{i,n} \boldsymbol{a}_{i,n}^{\mathsf{H}} + \varepsilon \boldsymbol{I}$ for ILRMA+MNMF and $\boldsymbol{a}_{i,n} \boldsymbol{a}_{i,n}^{\mathsf{H}} + \varepsilon \sum_{n' \neq n} \boldsymbol{a}_{i,n'} \boldsymbol{a}_{i,n'}^{\mathsf{H}}$ for ILRMA+FastMNMF, where $\boldsymbol{a}_{i,n}$ was estimated by ILRMA and $\varepsilon$ was set to $10^{-5}$.

We used speech signals obtained from the JNAS speech corpus [21] to produce the target speech source and diffuse babble noise. The station and traffic noise signals were obtained from DEMAND [22]. These dry sources were convolved with the impulse responses shown in Fig. 2 to simulate the mixture. The target source was located at 30°, 20°, 10°, or 0° clockwise from the normal to the microphone array, and the 18 loudspeakers used to simulate diffuse noise were arranged at intervals of 10° except in the target source direction. The size of the recording room for these impulse responses was 3.9 m × 3.9 m, and its reverberation time was about 200 ms. Note that the diffuse babble noise was produced by convolving the signals of 18 independent speakers with the respective impulse responses, and the diffuse station and traffic noises were produced by splitting the dry source into 18 short-time periods and convolving them with the respective impulse responses. The speech-to-noise ratio was set to 0 dB. The other conditions are shown in Table I.

B. Results

The source-to-distortion ratio (SDR) [18] is used as a total evaluation score in terms of separation performance and sound distortion. The SDR behaviors of the methods, averaged over 10 parameter-initialization random seeds and four target directions, are shown in Fig. 3, where the ILRMA iterations used for initialization are excluded from the curves of the ILRMA-initialized methods. The proposed method outperformed the other methods. In particular, the full-rank spatial model in the proposed method showed an improvement of more than 3 dB compared with the rank-1 spatial model in ILRMA, and the efficacy of the proposed spatial model extension was confirmed. Moreover, even with the assistance of ILRMA-based initialization, the SDRs of the conventional MNMFs and FastMNMFs with the full-rank spatial model did not reach that of the proposed method.

As regards the optimization cost, the EM algorithm in the proposed method converged within five iterations, far fewer than the number of iterations required for MNMFs and FastMNMFs. In addition, the actual computational times per iteration of MNMF, FastMNMF, and the proposed EM algorithm were 10.18 s, 0.87 s, and 0.005 s, respectively, further illustrating the advantage of the proposed method.

On the other hand, the unbiased sample standard deviations of the SDR improvements just after 200 iterations of ILRMA, original MNMF, ILRMA+MNMF, original FastMNMF, ILRMA+FastMNMF, and the proposed method are 0.19, 2.37, 0.38, 5.75, 0.26, and 0.22, respectively. This means that the proposed method is more stable than MNMF and FastMNMF in terms of initialization dependency.

V. CONCLUSION

We proposed a new algorithm that accurately and efficiently extracts a directional target source in diffuse background noise.
The proposed method is based on ILRMA and restores the lost spatial basis by using the EM algorithm to extend the spatial covariance of the noise from a rank-$(M-1)$ matrix to the full-rank matrix. In an experiment, we confirmed that the proposed method outperforms the conventional methods in terms of accuracy and computational efficiency.

[Fig. 2. Recording conditions of impulse responses (when the target source is located at 30°), where the reverberation time T60 is 200 ms. Labels in the figure: 6.45 cm, 10°, 1.5 m, 1.0 m, target speech, noise sources.]

[Fig. 3. SDR behaviors averaged over 10 parameter-initialization random seeds and four target directions in the separation of target speech and diffuse (a) babble, (b) station, and (c) traffic noises, where the speech-to-noise ratio is 0 dB. Axes: number of iterations (0 to 200) versus SDR improvement [dB] (0 to 10). Legend: ILRMA, BSSA, original MNMF, ILRMA+MNMF, original FastMNMF, ILRMA+FastMNMF, proposed method.]

ACKNOWLEDGMENT

This work was partly supported by SECOM Science and Technology Foundation and JSPS KAKENHI Grant Numbers 17H06101, 19H01116, and 19K20306.

REFERENCES

[1] P. Smaragdis, “Blind separation of convolved mixtures in the frequency domain,” Neurocomputing, vol. 22, no. 1, pp. 21–34, 1998.
[2] H. Saruwatari et al., “Blind source separation based on a fast-convergence algorithm combining ICA and beamforming,” IEEE Trans. ASLP, vol. 14, no. 2, pp. 666–678, 2006.
[3] A. Hiroe, “Solution of permutation problem in frequency domain ICA using multivariate probability density functions,” in Proc. ICA, 2006, pp. 601–608.
[4] T. Kim et al., “Blind source separation exploiting higher-order frequency dependencies,” IEEE Trans. ASLP, vol. 15, no. 1, pp. 70–79, 2007.
[5] D. Kitamura et al., “Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization,” IEEE/ACM Trans. ASLP, vol. 24, no. 9, pp. 1626–1641, 2016.
[6] D. Kitamura et al., “Determined blind source separation with independent low-rank matrix analysis,” in Audio Source Separation, S. Makino, Ed., pp. 125–155. Springer, Cham, 2018.
[7] D. D. Lee, H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[8] D. D. Lee, H. S. Seung, “Algorithms for non-negative matrix factorization,” in Proc. NIPS, 2000, pp. 556–562.
[9] S. Araki et al., “Equivalence between frequency-domain blind source separation and frequency-domain adaptive beamforming for convolutive mixtures,” EURASIP JASP, vol. 2003, no. 11, pp. 1–10, 2003.
[10] A. Ozerov, C. Févotte, “Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation,” IEEE Trans. ASLP, vol. 18, no. 3, pp. 550–563, 2010.
[11] H. Sawada et al., “Multichannel extensions of non-negative matrix factorization with complex-valued data,” IEEE Trans. ASLP, vol. 21, no. 5, pp. 971–982, 2013.
[12] N. Q. K. Duong et al., “Under-determined reverberant audio source separation using a full-rank spatial covariance model,” IEEE Trans. ASLP, vol. 18, no. 7, pp. 1830–1840, 2010.
[13] N. Ito, T. Nakatani, “FastMNMF: Joint diagonalization based accelerated algorithms for multichannel nonnegative matrix factorization,” in Proc. ICASSP, 2019, pp. 371–375.
[14] K. Sekiguchi et al., “Fast multichannel source separation based on jointly diagonalizable spatial covariance matrices,” CoRR, vol. abs/1903.03237, 2019.
[15] K. Shimada et al., “Unsupervised beamforming based on multichannel nonnegative matrix factorization for noisy speech recognition,” in Proc. ICASSP, 2018, pp. 5734–5738.
[16] C. Févotte et al., “Nonnegative matrix factorization with the Itakura–Saito divergence: With application to music analysis,” Neural Comput., vol. 21, no. 3, pp. 793–830, 2009.
[17] N. Ono, “Stable and fast update rules for independent vector analysis based on auxiliary function technique,” in Proc. WASPAA, 2011, pp. 189–192.
[18] E. Vincent et al., “Performance measurement in blind audio source separation,” IEEE Trans. ASLP, vol. 14, no. 4, pp. 1462–1469, 2006.
[19] Y. Takahashi et al., “Blind spatial subtraction array for speech enhancement in noisy environment,” IEEE Trans. ASLP, vol. 17, no. 4, pp. 650–664, 2009.
[20] N. Murata et al., “An approach to blind source separation based on temporal structure of speech signals,” Neurocomputing, vol. 41, no. 1–4, pp. 1–24, 2001.
[21] K. Itou et al., “JNAS: Japanese speech corpus for large vocabulary continuous speech recognition research,” J. Acoust. Soc. Jpn. (E), vol. 20, no. 3, pp. 199–206, 1999.
[22] J. Thiemann et al., “DEMAND: a collection of multi-channel recordings of acoustic noise in diverse environments,” June 2013, Supported by Inria under the Associate Team Program VERSAMUS.
