Separation of Moving Sound Sources Using Multichannel NMF and Acoustic Tracking


Authors: Joonas Nikunen, Aleksandr Diment, and Tuomas Virtanen

Joonas Nikunen, Aleksandr Diment, and Tuomas Virtanen, Senior Member, IEEE

Abstract—In this paper we propose a method for the separation of moving sound sources. The method is based on first tracking the sources and then estimating the source spectrograms using multichannel non-negative matrix factorization (NMF) and extracting the sources from the mixture by single-channel Wiener filtering. We propose a novel multichannel NMF model with time-varying mixing of the sources denoted by spatial covariance matrices (SCM) and provide update equations for optimizing the model parameters by minimizing the squared Frobenius norm. The SCMs of the model are obtained from the estimated directions of arrival of the tracked sources at each time frame. The evaluation is based on established objective separation criteria and uses real recordings of two and three simultaneous moving sound sources. The compared methods include conventional beamforming and ideal ratio mask separation. The proposed method is shown to exceed the separation quality of the other evaluated blind approaches according to all measured quantities. Additionally, we evaluate the method's susceptibility towards tracking errors by comparing against the separation quality achieved using annotated ground truth source trajectories.

Index Terms—Sound source separation, moving sources, time-varying mixing model, microphone arrays, acoustic source tracking

I. INTRODUCTION

Separation of sound sources with time-varying mixing properties, caused by the movement of the sources, is a relevant research problem for enabling intelligent audio applications in realistic operating conditions. These applications include, for example, speech enhancement and separation for automatic speech recognition [1], especially when using voice-commanded smart devices from afar [2].
Another emerging application field is immersive audio for augmented reality [3], which requires modification of the observed sound scene, for example by removing sound sources and replacing them with augmented content. Separation of non-speech sources can also be used to improve sound event detection in multi-source noisy environments [4]. Most existing work on sound separation assumes stationary sources, and few blind methods have targeted the problem of moving sound sources despite its high relevance in realistic conditions.

The problem of sound source separation, from either single- or multi-channel recordings, has been tackled with various methods over the years. Methods maximizing the statistical independence of non-Gaussian sources, such as independent component analysis (ICA), have been used for unmixing the sources in the frequency domain [5], [6]. The concept of binary clustering of time-frequency blocks based on inter-channel cues, namely the level and time difference, has resulted in a class of separation methods based on time-frequency masking [7], [8]. The use of binary masks requires assuming that sound sources occupy disjoint time-frequency blocks [9]. More recently, single-channel speech enhancement and separation has been performed with the aid of machine learning, specifically by using deep neural networks (DNNs) [10], [11] to predict the time-frequency masks for separation. Combining DNN-based prediction of the source spectrogram with spatial information in the form of source covariance matrices has been proposed in [12], [13]. Another machine learning tool for masking-based source separation is spectrogram factorization by non-negative matrix factorization (NMF) and non-negative matrix deconvolution (NMD), both of which have been widely utilized for speech separation and enhancement [14], [15].
The NMF model decomposes the mixture magnitude spectrogram into spectral templates and their time-dependent activations. In the case of single-channel mixtures, separation is achieved by learning noise- or speaker-dependent spectral templates from isolated sources in a training stage. The NMF model can be extended to multichannel mixtures by incorporating spatial covariance matrix (SCM) estimation for the NMF components, as in [16], [17], [18], [19], [20]. The analysis and introduction of spatial properties for NMF components allows separation based on spatial information, i.e., NMF components with similar spatial properties are considered to originate from the same sound source. These models require operating with complex-valued data, and hereafter we refer to these extensions as multichannel NMF.

A recording of a realistic auditory scene often contains sound sources that are moving with respect to the recording device, and conventional separation approaches assuming time-invariant mixing are not suitable for such a task. However, moving sound sources can be considered stationary within a short time block, where the mixing can be assumed to be time-invariant. Using the block-wise approach for separation of moving sound sources requires merging the separated sources across individually processed blocks. For example, in block-wise ICA [21], [22], [23] this is done by propagating the mixing matrix from the previous block, thus slowly adapting the mixing and preserving the source ordering in consecutive blocks. A recent generalization of the multichannel NMF model [16] for time-varying mixing and separation of moving sound sources was proposed in [24].

(J. Nikunen, A. Diment and T. Virtanen are with the Department of Signal Processing, Tampere University of Technology, Tampere, Finland; email: firstname.lastname@tut.fi. This research was supported by Nokia Technologies.)
The reported results are promising; however, the proposed algorithm requires using another state-of-the-art source separation method in a blind setting for initialization. Alternatively, separation of moving sources can be achieved by tracking the spatial position or direction of arrival (DOA) of the sources and using spatial filtering (beamforming or a separation mask) to extract the signal originating from the estimated position or direction at each time instance. In [25] the problem of DOA tracking and separation mask estimation is formulated jointly; in this paper, however, we consider a two-stage approach where the acoustic tracking is done first and the separation masks are estimated in a separate (offline) stage. Also, the separation masks in [25] are binary, which leads to compromised subjective separation quality even if oracle masks are used.

Acoustic localization with microphone arrays can be achieved by transforming the time-difference of arrival (TDOA) obtained using generalized cross-correlation (GCC) into source position estimates [26]. Methods for estimating the trajectories of moving sound sources are based on Kalman filtering and its non-linear extensions [27], [28] for estimating the underlying state (the position of the sound source) from the TDOA measurements. Alternatively, sequential Monte Carlo methods, i.e., particle filtering [29], [30], have been applied for tracking the position of the source based on TDOA measurements. For the even more difficult case of multiple-target tracking with a data association problem, Rao-Blackwellized particle filtering (RBPF) was proposed in [31], [32] and applied to acoustic tracking of multiple speakers in [33]. Additionally, the use of directional statistics and quantities wrapped on a unit circle or sphere, such as the inter-channel phase difference, has recently been considered for speaker tracking [34], [35].
Fig. 1: The block diagram of the proposed processing, consisting of source tracking and multichannel NMF for separation of the detected and tracked sound sources. Processing blocks: STFT, SRP-PHAT, WGMM, source tracking (particle filtering), sample covariance, multichannel NMF, single-channel Wiener filter, beamforming, inverse STFT; input is the microphone array signal (channels 1...M), output the separated sources (1...S).

In this paper we propose a separation method for moving sound sources based on acoustic tracking and estimation of the source spectrograms from the tracked directions using multichannel NMF with a time-varying SCM model. The main contributions of this paper are: the formulation of a multichannel NMF model for time-varying mixing (moving sound sources); the integration of the spatial model with acoustic tracking to define the spatial properties of the sources in the form of SCMs at each time frame; and the update equations for optimizing the multichannel NMF model parameters by minimizing the squared Frobenius norm. The parametrization of the source DOAs with directional statistics and the use of the tracker uncertainty for defining the SCMs of the sources is a novel approach for representing the spatial location and spread of sound sources in multichannel NMF. The acoustic tracker realization is a combination of existing work on wrapped Gaussian mixture models [36] and particle filtering [37], but it is presented in detail because its output statistics are utilized in the proposed time-varying SCM model. The evaluation of the proposed separation algorithm is based on objective separation criteria [38], [39], [40] and testing with mixtures of two and three simultaneous moving speakers. Additionally, hand-annotated source DOA trajectories are used for evaluating the performance of the acoustic tracker realization and for studying the susceptibility of the proposed separation algorithm towards tracking errors.
The compared separation methods include conventional beamforming (DSB and MVDR), and an upper reference is obtained by ideal ratio mask (IRM) separation [41]. The proposed method achieves superior separation performance, and the use of annotated trajectories shows no significant increase in separation performance, demonstrating that the proposed method is suitable for realistic operation in a blind setting.

The paper is organized as follows. First, the problem of separating moving sound sources and an overview of the proposed processing are given in Section II. Next, we introduce directional statistics and describe the acoustic source tracker realization in Section III. In Section IV the multichannel NMF separation model for sources with time-varying mixing is proposed, and the utilization of the tracker output within the separation model is explained. In Section V the tracking and separation performance of the proposed algorithm is evaluated using real recorded test material captured with a compact four-element microphone array. The work is concluded in Section VII.

II. PROBLEM STATEMENT AND ALGORITHM OVERVIEW

A. Mixing model of moving sound sources

A microphone array composed of microphones (m = 1, ..., M) observes a mixture of p = 1, ..., P source signals s_p(t) sampled at discrete time instances indexed by t. The sources are moving and have time-varying mixing properties, denoted by a room impulse response (RIR) h_{pmt}(\tau) for each time index t. The resulting mixture signal is given as

x_m(t) = \sum_{p=1}^{P} \sum_{\tau} s_p(t - \tau) h_{pmt}(\tau).    (1)

In sound source separation the aim is to estimate the source signals s_p and their mixing h_{pmt}(\tau) by observing only x_m(t). In this paper, audio is processed in the frequency domain obtained using the short-time Fourier transform (STFT).
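The time-varying convolutive mixture of Equation (1) can be sketched numerically. The following is a minimal illustration (not the paper's implementation); the array shapes, function name and random test data are our own assumptions:

```python
import numpy as np

def mix_moving_sources(sources, rirs):
    """Time-varying convolutive mixture, Eq. (1).
    sources: (P, T) source signals s_p(t)
    rirs:    (P, M, T, L) impulse responses h_pmt(tau), one per time index t
    returns: (M, T) mixture x_m(t)
    """
    P, T = sources.shape
    _, M, _, L = rirs.shape
    x = np.zeros((M, T))
    for m in range(M):
        for p in range(P):
            for t in range(T):
                # convolve with the RIR that is valid at this time index t
                for tau in range(min(L, t + 1)):
                    x[m, t] += sources[p, t - tau] * rirs[p, m, t, tau]
    return x

rng = np.random.default_rng(0)
s = rng.standard_normal((2, 64))          # P = 2 sources
h = rng.standard_normal((2, 3, 64, 4))    # M = 3 mics, L = 4 taps, varying over t
x = mix_moving_sources(s, h)
print(x.shape)  # (3, 64)
```

When the RIRs are constant over t, this reduces to the familiar time-invariant convolutive mixing, which is the stationary special case the paper generalizes.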
The STFT of a time-domain mixture signal is calculated by dividing the signal into short overlapping frames, applying a window function and taking the discrete Fourier transform (DFT) of each windowed frame. The mixing properties denoted by the time-dependent RIRs h_{pmt}(\tau) change slowly over time, and in practice the difference between adjacent time indices t is small, so the mixing can be considered constant within a small time window. This allows approximating the time-dependent mixing (1) in the time-frequency (TF) domain as

x_{fn} \approx \sum_{p=1}^{P} h_{fn,p} s_{fn,p} = \sum_{p=1}^{P} y_{fn,p}.    (2)

The STFT of the mixture signal is denoted by x_{fn} = [x_{fn1}, ..., x_{fnM}]^T for each TF-point (f, n) of each input channel (m = 1, ..., M). The single-channel STFT of each source p is denoted by s_{fn,p}, and the frequency-domain RIRs (fixed within each time frame n) are denoted by h_{fn,p} = [h_{fn1}, ..., h_{fnM}]^T. The source signals convolved with the impulse responses are denoted by y_{fn,p}.

B. Overview of the processing

The proposed method consists of source spectrogram estimation based on the DOAs of the sources of interest in each time frame; the estimated spectrograms are used for separation mask generation by a generalized Wiener filter. The processing is based on two stages: the acoustic tracker and the offline separation mask estimation by multichannel NMF. The block diagram of the method is illustrated in Figure 1. The source tracking branch operates frame-by-frame and can be thought of as an online algorithm, while the parameters of the multichannel NMF model are estimated from the entire signal at once (offline). First, the STFT of the input signals is calculated. The tracking branch starts by calculating the steered response power (SRP) of the signal under analysis. The SRP denotes the spatial energy as a function of DOA for each time frame.
A wrapped Gaussian mixture model (WGMM) [36] of the SRP function is estimated in each time frame, which converts the spatial energy histogram (i.e., the SRP) into DOA measurements. The WGMM parameters are used as measurements in acoustic tracking, which is implemented using particle filtering [37]. The multi-target tracker detects the births and deaths of sources, solves the data associations of measurements belonging to existing sources, and predicts the source trajectories. In the second stage, a spatial covariance matrix model (SCM model) [19] parameterized by DOA is defined based on the acoustic tracker output (the source DOAs at each time frame). The obtained SCMs denote the spatial behavior of the sources over time, and a spectral model of the sources originating from the tracked directions is estimated using multichannel NMF. The multichannel source signals are reconstructed using a single-channel Wiener filter based on the estimated spectrogram of each source, and single-channel signals are obtained by applying delay-and-sum beamforming to the separated multichannel signals. Finally, time-domain signals are reconstructed by applying the inverse STFT and overlap-add.

III. SOURCE TRAJECTORY ESTIMATION

The goal of the first part of the proposed algorithm is to estimate the DOA trajectories of the sound sources that are to be separated. The process consists of three consecutive steps: calculating the spatial energy emitted from all directions (Section III-A), converting the discrete spatial distribution into DOA measurements (Sections III-B and III-C), and multi-target tracking consisting of source detection, data association and source trajectory estimation (Section III-D).

A. Time-difference of arrival and steered response power

Spatial signal processing with spaced microphone arrays is based on observing time delays between the array elements.
In far-field propagation, the wavefront direction of arrival corresponds to a set of TDOA values between each microphone pair. We start by defining a unit direction vector k \in R^3, ||k|| = 1, originating from the geometric center of the array p = [0, 0, 0]^T and pointing towards the direction parametrized by azimuth \theta \in [0, 2\pi] and elevation \varphi \in [0, \pi]. Given a microphone array consisting of two microphones m_1 and m_2 at locations m_1 \in R^3, m_2 \in R^3, the TDOA between them for a sound source at direction k is obtained as

\tau(m_1, m_2) = -k^T (m_1 - m_2) / v,    (3)

where v is the speed of sound. The above TDOA corresponds to a phase difference of \exp(-j \omega_f \tau(m_1, m_2)) in the frequency domain, where \omega_f = 2\pi (f - 1) F_s / N (F_s is the sampling frequency and N is the STFT window length).

Fig. 2: Illustration of a sparse grid of direction vectors (lines with a dot at the end) and an example array enclosure with the enclosed microphones (circles).

From now on we operate with a set of different directions indexed by d = 1, ..., D; the direction vector corresponding to the d-th direction is defined as k_d, resulting in a TDOA of \tau_d(m_1, m_2). The spatial energy originating from the direction [\theta_d, \varphi_d] at each time frame n can be calculated using the steered response power (SRP) with PHAT weighting [42], defined as

S_{dn} = \sum_{m_1=1}^{M-1} \sum_{m_2=m_1+1}^{M} \sum_{f=1}^{F} \frac{x_{fnm_1} x^*_{fnm_2}}{|x_{fnm_1} x^*_{fnm_2}|} \exp(j \omega_f \tau_d(m_1, m_2)),    (4)

where * denotes the complex conjugate and the term \exp(j \omega_f \tau_d(m_1, m_2)) is responsible for time-aligning the microphone signals. The SRP denotes the spatial distribution of the mixture, consisting of spatial evidence from multiple sources, and searching for multiple local maxima of the SRP function at a single time frame n corresponds to DOA estimation of the sources present in that time frame.
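Equations (3) and (4) can be sketched for a single frame as follows. This is an illustrative single-frame implementation under our own assumptions (taking the real part of the steered sum, a fixed speed of sound, a small numerical floor in the PHAT weighting); it is not the paper's code:

```python
import numpy as np

def srp_phat(X, mic_pos, dirs, fs):
    """SRP with PHAT weighting, Eq. (4), for one STFT frame.
    X:       (F, M) complex STFT values (positive frequencies, N = 2*(F-1))
    mic_pos: (M, 3) microphone positions [m]
    dirs:    (D, 3) unit direction vectors k_d
    fs:      sampling frequency [Hz]
    returns: (D,) SRP values S_d for this frame
    """
    F, M = X.shape
    N = 2 * (F - 1)
    v = 343.0                                   # speed of sound [m/s], assumed
    omega = 2 * np.pi * np.arange(F) * fs / N   # omega_f
    S = np.zeros(len(dirs))
    for d, k in enumerate(dirs):
        acc = 0.0
        for m1 in range(M - 1):
            for m2 in range(m1 + 1, M):
                tau = -k @ (mic_pos[m1] - mic_pos[m2]) / v       # Eq. (3)
                cc = X[:, m1] * np.conj(X[:, m2])
                cc = cc / np.maximum(np.abs(cc), 1e-12)          # PHAT weighting
                acc += np.real(np.sum(cc * np.exp(1j * omega * tau)))
        S[d] = acc
    return S
```

For a source whose inter-microphone delay matches \tau_d, the phase terms align and S_d peaks; steering away from the source leaves oscillating phases that largely cancel.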
Repeating the peak picking for all time frames of the SRP would result in DOA measurements that are permuted over time; subsequently, in the tracking stage, the permuted DOA measurements are associated to multiple sources over time. In the general case the directions d = 1, ..., D would uniformly sample a unit sphere around the array, but in this paper we only consider the zero-elevation plane, i.e., \varphi_d = 0 for all d. We assume that the sources of interest lie approximately on the xy-plane with respect to the microphone array, and the directional statistics used in tracking the sources simplify to the univariate case. A sparse grid of direction vectors with adjacent azimuths spaced by \pi/12 is illustrated in Figure 2, along with the array casing and microphones corresponding to the actual compact array used in the evaluations.

B. Wrapped Gaussian mixture model

Instead of searching for peaks in the SRP (4), we propose to model the mixture spatial distribution using a wrapped Gaussian mixture model (WGMM) estimated separately for each time frame of the SRP. Estimating the parameters of the WGMM converts the discrete spatial distribution obtained by the SRP into multiple DOA measurements with mean, variance and weight. The individual wrapped Gaussians model the spatial evidence caused by the actual sources, while some of them may model the noise or phantom peaks in the SRP caused by sound reflecting from boundaries. The use of the WGMM alleviates the effect of noise, since the width of each peak, denoted by the variance of each wrapped Gaussian, can be used to express measurement uncertainty in the acoustic tracking stage.
The probability density function (PDF) of the univariate wrapped Gaussian distribution [36], [43], [44] with mean \mu and variance \sigma^2 can be defined as

P(\theta; \mu, \sigma^2) = \sum_{l=-\infty}^{\infty} N(\theta; \mu + 2\pi l, \sigma^2) = \sum_{l=-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(\theta - \mu + 2\pi l)^2 / (2\sigma^2)},    (5)

where N(\theta; \mu, \sigma^2) is the PDF of a regular Gaussian distribution, l is the wrapping index of 2\pi multiples, and \theta \in [-\pi, \pi]. The multivariate version of the wrapped Gaussian distribution is given in [36], but it is not of interest in this paper. The WGMM with a weight a_k for each wrapped Gaussian k is defined as

P(\theta; a, \mu, \sigma^2) = \sum_{k=1}^{K} a_k \sum_{l=-\infty}^{\infty} N(\theta; \mu_k + 2\pi l, \sigma^2_k),    (6)

Fig. 3: The observed SRP S_{dn} for a single time frame n and the WGMM with K = 3 estimated from it. The three components have means \mu = -98.0, 96.2 and 71.3 degrees, variances \sigma^2 = 13.6, 15.4 and 117.2, and weights a = 0.0059, 0.0068 and 0.0222.

Algorithm 1: EM algorithm for estimating the WGMM parameters from a histogram s_d (a single frame of the entire SRP S_{dn}).
Input: histogram data s_d and initial values for a_k, \mu_k and \sigma^2_k.
E-step:
  \eta_{dkl} = N(\theta_d; \mu_k + 2\pi l, \sigma^2_k) a_k / ( \sum_{k=1}^{K} \sum_{l=-\infty}^{\infty} N(\theta_d; \mu_k + 2\pi l, \sigma^2_k) a_k )
M-step:
  \mu_k = \sum_{d=1}^{D} \sum_{l=-\infty}^{\infty} (\theta_d - 2\pi l) \eta_{dkl} s_d / ( \sum_{d=1}^{D} \sum_{l=-\infty}^{\infty} \eta_{dkl} s_d )
  \sigma^2_k = \sum_{d=1}^{D} \sum_{l=-\infty}^{\infty} (\theta_d - \mu_k - 2\pi l)^2 \eta_{dkl} s_d / ( \sum_{d=1}^{D} \sum_{l=-\infty}^{\infty} \eta_{dkl} s_d )
  a_k = ( \sum_{d=1}^{D} \sum_{l=-\infty}^{\infty} \eta_{dkl} s_d ) / \sum_d s_d

where K is the total number of wrapped Gaussians in the model. An EM algorithm for estimating the parameters {a, \mu, \sigma^2} that maximize the log-likelihood

\log L = \sum_{d=1}^{D} \log \sum_{k=1}^{K} a_k \sum_{l=-\infty}^{\infty} N(\theta_d; \mu_k + 2\pi l, \sigma^2_k),    (7)

is given in [36], [44]. The parameter \theta_d denotes the azimuth angle of the direction index d = 1, ..., D used to calculate the SRP in Equation (4).
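Equations (5) and (6) are straightforward to evaluate once the infinite sum over the wrapping index l is truncated; a few terms suffice for moderate variances. A minimal sketch (function names and the truncation limit are our own choices):

```python
import numpy as np

def wrapped_gauss_pdf(theta, mu, var, l_max=3):
    """Univariate wrapped Gaussian PDF, Eq. (5), with the infinite sum
    over the wrapping index l truncated to |l| <= l_max."""
    theta = np.asarray(theta, dtype=float)
    p = np.zeros_like(theta)
    for l in range(-l_max, l_max + 1):
        p += np.exp(-(theta - mu + 2 * np.pi * l) ** 2 / (2 * var)) \
             / np.sqrt(2 * np.pi * var)
    return p

def wgmm_pdf(theta, a, mu, var, l_max=3):
    """WGMM, Eq. (6): weighted sum of K wrapped Gaussians."""
    return sum(a_k * wrapped_gauss_pdf(theta, mu_k, v_k, l_max)
               for a_k, mu_k, v_k in zip(a, mu, var))
```

Because of the wrapping, a component with mean near +\pi also places probability mass near -\pi, which is exactly the behavior needed for azimuth data.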
The EM algorithm for the WGMM as presented in [36], [44] requires observing data points generated by the underlying distribution, whereas S_{dn} for a single frame n is effectively a histogram denoting the spatial energy emitted from each scanned direction indexed by d. Estimating the WGMM parameters based on the histogram requires modifying the algorithm presented in [36], [44] to account for inputs consisting of a discretely sampled mixture distribution, i.e., the histogram bin values. The modification results in Algorithm 1, where the SRP of a single frame n is denoted by s_d and the updates are iterated until convergence of \eta_{dkl}. Prior knowledge can be used to set the initial values for a_k, \mu_k and \sigma^2_k, or they can be initialized randomly. An example of the SRP S_{dn} of a single time frame and the three-component WGMM estimated from it is illustrated in Figure 3.

C. DOA measurements by WGMM

Algorithm 1 is applied individually to each time frame n = 1, ..., N of S_{dn}. A mixture of k = 1, ..., K wrapped Gaussians for each time frame is obtained, and the resulting means \mu_{n,k} with variances \sigma_{n,k} and weights a_{n,k} are considered as permuted DOA measurements. At this point of the algorithm it is unknown which of the measurements k = 1, ..., K in each frame n are caused by actual sources and which correspond to noise. Also, the detection of sources and the association of different measurements k to sources p over time is unknown, i.e., the k-th measurement in adjacent frames may be caused by different sources. The source detection, data association and actual source trajectory estimation are solved using the Rao-Blackwellized particle filtering introduced in Section III-D. Random initialization of Algorithm 1 could be used in each time frame. However, initial values close to the optimal ones speed up convergence, and in practice the estimates from the previous frame can be used as initialization for the subsequent frame.
Please note that this initialization strategy does not guarantee preserving any association between the k-th wrapped Gaussians in adjacent frames, and the ordering needs to be considered as permuted. The WGMM parameters {\mu_{n,k}, \sigma_{n,k}, a_{n,k}} are hereafter referred to as DOA measurements in the context of acoustic tracking: \mu_{n,k} are the measurement means, \sigma_{n,k} denote the measurement reliability, and a_{n,k} are the proportional weights of the measurements. Given that a WGMM with K components is estimated for each time frame, not all WGMM components are caused by actual spatial evidence; some merely model the noise floor and phantom peaks in the SRP. The situation is illustrated in Figure 3, where the third WGMM component with mean \mu = 71 degrees has a very high variance (\sigma^2 = 117) compared to the actual observable peaks. By inspecting the variance and weight of each WGMM component, the false measurements can be efficiently removed before applying the actual tracking algorithm. The means \mu_{n,k} for each time frame of an arbitrary test signal, after removing measurements with \sigma_{n,k} > 35 degrees or a_{n,k} < 0.15, are illustrated in Figure 4. The removal of false measurements reveals two distinct observable trajectories; however, the data association between frames is still unknown at this stage. The thresholds for measurement removal can be set globally (independent of the signal and capturing environment), and their choice is discussed in more detail in Section V.

D. Acoustic tracking of multiple sound sources

The problem setting in tracking multiple sound sources is as follows. Multiple DOA measurements are obtained in each time frame, and the task is to decide whether each new measurement is 1) associated to an existing source, 2) identified as clutter, or 3) evidence of a new source (birth), and finally 4) to determine possible deaths of existing sources.
After the data-association step, the dynamic state of the active sources is updated; in particular, it is required to preserve the source statistics over short inactive segments (pauses between words in speech) by prediction based on the previous state of the source (location, velocity and acceleration). In the following, we briefly review the Rao-Blackwellized particle filter (RBPF) framework proposed in [31] for the problem of multi-target tracking; we use its freely available implementation(1) documented in [37]. We give the state-space representation of the dynamical system being tracked but do not go into any further details of the RBPF. The algorithm proposed in [31] and the associated implementation have been used in [45] for tracking the azimuth angle of speakers in a similar setting.

Multi-target tracking by RBPF is essentially based on dividing the entire problem into two parts: estimation of the data association and tracking of single targets. This can be done with the Rao-Blackwellization procedure [32], where the posterior distribution of the data associations is estimated first, and then the single-target tracking sub-problem is solved conditioned on the data associations. Adding the estimation of an unknown number of targets [31] using a probabilistic model results in an RBPF framework that solves the entire problem of tracking an unknown number of targets. The benefit of Rao-Blackwellization is that conditioning on the data association allows calculating the filtering equations in closed form instead of using particle filtering and data sampling techniques for all steps, which generally leads to better results.

1) State-space model for speaker DOA tracking: In the RBPF framework, single-target tracking consists of Bayesian filtering, which requires defining the dynamic model and the measurement model of the problem.
For the time being we omit the WGMM component index k and the source index p, and define the state-space model equations for the single-target tracking sub-problem. The goal is to estimate the state of the dynamical system at each time instance n; the state in our case is defined as a 2-D point (x, y) on the unit circle with velocities along both axes:

s_n = [x_n, y_n, \dot{x}_n, \dot{y}_n]^T.    (8)

The angle of the x-y coordinate (x_n, y_n) represents the DOA of the source; this representation avoids dealing with the 2\pi ambiguity of 1-D DOA variables in the dynamic model. The dynamic model that predicts the target state based on the previous time step is defined as

s_n = A_{n-1} s_{n-1} + q_{n-1},    (9)

where A_{n-1} is the state transition matrix and q_{n-1} ~ N(0, \lambda^2 I) is the process noise. With the above definition of the state s_n, the transition matrix becomes linear and is defined as

A_{n-1} = [ 1 0 \Delta t 0 ; 0 1 0 \Delta t ; 0 0 1 0 ; 0 0 0 1 ],    (10)

where \Delta t is the time difference between consecutive time steps. The resulting dynamic model can be described as follows: the predicted DOA at the current time step n is the DOA of the previous time step in x-y coordinates, plus its velocity at the previous time step multiplied by the time constant, i.e., the time between consecutive processing frames. For the measurement representation we use the rotating vector model [46], which converts the wrapped 1-D angle measurements \mu \in [0, 2\pi] to a 2-D point on the unit circle, resulting in the measurement vector

m_n = [\cos(\mu), \sin(\mu)]^T.    (11)

(1) http://becs.aalto.fi/en/research/bayes/rbmcda/

Fig. 4: The upper panel illustrates all estimated WGMM means \mu_{n,k} for each time frame n. The lower panel illustrates the WGMM means after removing measurements with \sigma_{n,k} > 0.6 rad (approximately 35 degrees) or a_{n,k} < 0.15.

The measurement model is defined as

m_n = B_n s_n + r_n,    (12)

where B_n is the measurement model matrix and r_n ~ N(0, \sigma I) is the measurement noise. The measurement model matrix B_n converts the state s_n into the measurement m_n (x-y coordinates) simply by omitting the velocities, and is defined as

B_n = [ 1 0 0 0 ; 0 1 0 0 ].    (13)

The above definitions result in linear dynamic and measurement model matrices (10) and (13), allowing the use of regular Kalman filter equations to update and predict the state of the particles in the RBPF framework [37]. We acknowledge that the dynamic system used here is theoretically imperfect in that it uses 2-D quantities while the state and the measurements are truly 1-D, leading to additional noise in the system, as pointed out in [34]. However, during the implementation of the acoustic tracker the chosen linear models were found to perform better than the non-linear alternatives. Alternatively, the problem of tracking wrapped quantities using a 1-D state could be addressed via wrapped Kalman filtering as proposed in [34].

2) Multi-target DOA tracking implementation: For the actual RBPF implementation we now reintroduce the WGMM component index k and the source index p. The state vector is defined individually for each detected source p = 1, ..., P and is denoted hereafter as s_n^(p). Similarly, the multiple measurements at the same time step obtained from the WGMM model are denoted by m_n^(k) = [\cos(\mu_{n,k}), \sin(\mu_{n,k})]^T. The RBPF implementation [37] is applied to the measurements m_n^(k) with measurement noise r_n^(k) ~ N(0, \sigma_{n,k} I). The multi-target tracker detects the sources and makes the association of the k-th WGMM measurement to one of the sources p = 1, ..., P. Alternatively, if none of the active source particle distributions indicates a probability higher than the clutter prior probability, the current measurement is regarded as clutter.
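The single-target core of the tracker, Equations (8)-(13), is an ordinary linear Kalman filter. The sketch below shows one predict-update step for a single target with a known measurement association; the full RBPF additionally handles births, deaths and clutter, which are not reproduced here. Function names and noise parameter values are our own assumptions:

```python
import numpy as np

def kalman_doa_step(s, P, mu_meas, dt, lam2, sig2):
    """One predict+update step of the linear model, Eqs. (9)-(13).
    s: state [x, y, vx, vy] (Eq. 8), P: state covariance,
    mu_meas: DOA measurement [rad], dt: frame interval,
    lam2 / sig2: process / measurement noise variances."""
    A = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1,  0],
                  [0, 0, 0,  1]], dtype=float)     # Eq. (10)
    B = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)      # Eq. (13)
    # predict
    s = A @ s
    P = A @ P @ A.T + lam2 * np.eye(4)
    # rotating-vector measurement, Eq. (11)
    m = np.array([np.cos(mu_meas), np.sin(mu_meas)])
    # update
    S = B @ P @ B.T + sig2 * np.eye(2)
    K = P @ B.T @ np.linalg.inv(S)
    s = s + K @ (m - B @ s)
    P = (np.eye(4) - K @ B) @ P
    return s, P

def state_to_doa(s):
    """Eq. (14): DOA from the tracked 2-D state."""
    return np.arctan2(s[1], s[0])
```

Feeding a sequence of measurements of a fixed angle makes the state converge to the corresponding point on the unit circle, from which the DOA is read off with atan2.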
The clutter prior probability is a fixed, pre-set value that acts as the minimum threshold for linking an observed measurement to an existing source. The output of the tracker is the state of each source p at each time frame, denoted by s_n^(p). Extracting the DOA from the tracked source state requires calculating the angle of the vector defined by the 2-D coordinates, and thus the resulting DOA trajectories are obtained as

\hat{\mu}_{n,p} = atan2(s_{n,2}^(p), s_{n,1}^(p)).    (14)

Fig. 5: The acoustic tracking result for two intersecting sources, with ground truth annotations shown where each source is active (voice activity detection by energy thresholding using the signal from a close-field microphone). MAE = 5.44 degrees, recall rate = 0.88.

The tracking result for one test signal is illustrated in Figure 5, where the input of the acoustic tracking is the one depicted in the bottom panel of Figure 4. The test signal is chosen such that it exhibits two problematic cases: the sources start from the same position and intersect at 8 seconds, moving in opposite directions. The tracking result indicates that the second source is detected 2 seconds from the start, just when the sources have traveled far enough from each other, resulting in approximately the first word of the second speaker being missed. The tracker is able to maintain the source association and correctly track the trajectories of the intersecting sources.

IV. SEPARATION MODEL

A. Mixture in the spatial covariance domain

For the separation part we represent the microphone array signal using mixture SCMs X_{fn} \in C^{M x M}. We use a magnitude square-rooted version of the mixture STFT, obtained as

\hat{x}_{fn} = [|x_{fn1}|^{1/2} sign(x_{fn1}), ..., |x_{fnM}|^{1/2} sign(x_{fnM})]^T,    (15)

where sign(z) = z / |z| is the signum function for complex numbers.
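The square-rooted representation of Equation (15) and the resulting rank-1 covariance can be sketched for a single TF-point as follows (an illustrative helper under our own naming; the small floor guarding the division is an implementation assumption):

```python
import numpy as np

def mixture_scm(x_fn):
    """Mixture SCM of one multichannel STFT bin.
    x_fn: (M,) complex STFT values of one TF-point.
    Applies the magnitude square-root of Eq. (15) and returns
    the M x M matrix X_fn = x_hat x_hat^H."""
    mag = np.abs(x_fn)
    sgn = np.where(mag > 0, x_fn / np.maximum(mag, 1e-12), 1.0)  # sign(z) = z/|z|
    x_hat = np.sqrt(mag) * sgn
    return np.outer(x_hat, np.conj(x_hat))
```

With this scaling the diagonal of the SCM carries the per-channel magnitude spectrogram, while the phase of each off-diagonal entry is the inter-channel phase difference, matching the interpretation given in the text.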
The mixture SCM is calculated as X_fn = x̂_fn x̂_fn^H for each TF-point (f, n). The diagonal of each X_fn contains the magnitude spectrogram of each input channel. The argument and absolute value of [X_fn]_{m1,m2} (off-diagonal values) represent the phase difference and magnitude correlation, respectively, between microphones (m1, m2) for a TF-point (f, n). The TF-domain mixing in Equation (2) can be approximated using mixture SCMs as

X_fn ≈ X̂_fn = Σ_{p=1}^P H_fn,p ŝ_fn,p,  (16)

where ŝ_fn,p = sqrt(s_fn,p s*_fn,p) is the positive real-valued magnitude spectrogram of source p and H_fn,p = h_fn,p h_fn,p^H / ||h_fn,p h_fn,p^H||_F are the SCMs of the frequency-domain RIRs h_fn,p. The mixing Equation (16) is hereafter referred to as spatial covariance domain mixing.

B. Multichannel NMF model with time-variant mixing

The proposed algorithm uses multichannel NMF for source spectrogram estimation, and it is based on alternating estimation of the source magnitude spectrogram ŝ_fn,p and its associated spatial properties in the form of the SCMs H_fn,p. In all previous works [16], [18], [20] the problem definition has been simplified to stationary sound sources, with the SCMs fixed for all STFT frames n within the analyzed audio segment. Here we present a novel extension of the multichannel NMF model for time-variant mixing. In multichannel NMF the model for the magnitude spectrogram is equivalent to conventional NMF, which is composed of fixed spectral bases and their time-dependent activations. The SCMs can be unconstrained [47] or, as proposed in the earlier works of the authors [19], [20], based on a model that represents SCMs as a weighted sum of entities called "DOA kernels", each containing the phase difference caused by a single direction vector. This ensures that the SCMs comply with the array geometry and match the time delays the chosen microphone placement allows.
The NMF magnitude model for the source magnitude spectrogram is given as

ŝ_fn,p ≈ Σ_{q=1}^Q b_q,p t_fq v_qn,   b_q,p, t_fq, v_qn ≥ 0.  (17)

Parameters t_fq over all frequency indices f = 1, ..., F represent the magnitude spectrum of a single NMF component q, and v_qn denotes the component gain in each frame n. One NMF component represents a single spectrally repetitive event estimated from the mixture, and one source is modeled as a sum of multiple components. Parameter b_q,p ∈ [0, 1] represents a weight associating NMF component q with source p. The soft-valued b_q,p is motivated by different types of sound sources requiring different numbers of spectral templates for accurate modeling; the optimal division of components is learned through the parameter updates. For example, stationary noise can be represented using only a few NMF components, whereas the spectrum of speech varies over time and requires many spectral templates to be modeled precisely. A similar strategy for b_q,p is used for example in [18]. Typically the final values of b_q,p are mostly binary and only a few NMF components are shared among sources. The multichannel NMF model with time-variant mixing can be trivially derived from the SCM mixing Equation (16) by substituting the above-defined NMF model (17) into it, resulting in

X_fn ≈ X̂_fn = Σ_{p=1}^P H_fn,p ( Σ_{q=1}^Q b_q,p t_fq v_qn ),  (18)

where the parenthesized sum approximates ŝ_fn,p.

C. Direction of arrival based SCM model

In order to constrain the spatial behavior of sources over time using information from the acoustic tracking approach or other prior information, the SCMs need to be interpreted in terms of the spatial location of each source in each time frame. The SCM model proposed in [19] parametrizes stationary SCMs based on DOA, and here we extend the model to time-variant SCMs H_fn,p.
Converting the TDOA in Equation (3) to a phase difference results in DOA kernels W_fd ∈ C^{M×M}, defined for a microphone pair (m1, m2) as

[W_fd]_{m1,m2} = exp(j ω_f τ_d(m1, m2)),  (19)

where τ_d(m1, m2) denotes the time delay caused by a source in direction k_d. A linear combination of DOA kernels gives the model for time-varying source SCMs, defined as

H_fn,p = Σ_{d=1}^D W_fd z_nd,p.  (20)

The direction weights z_nd,p denote the spatial location and spread of source p at each time frame n, and can be interpreted as probabilities of source p originating from each direction d. In anechoic conditions only one of the direction weights z_nd,p in each time frame would be non-zero, i.e., the direct path would explain all spatial information of the source; in reverberant conditions, however, several of the direction weights are active. The spatial weights corresponding to tracked sources and their DOA trajectories are set using the wrapped Gaussian distribution as

z_nd,p = N_w(θ_d; μ̂_n,p, σ̂²_n,p),  (21)

where μ̂_n,p is obtained as specified in Equation (14) and the variance σ̂²_n,p is used to control the width of the spatial window of source p in the separation model. The goal is to have a large spatial window when the tracker is certain, and the output suppressed (a small spatial window) when no new measurements from the predicted source position are observed and the tracker state indicates high uncertainty. This strategy ensures that small untracked deviations in source movement do not cause the spatial focus of the separation to veer off momentarily from the target source and lead to suppression of the desired source content. In experimental tests the source spectrogram estimation by multichannel NMF proved to be sensitive to small tracking errors if the spatial window used was very small.
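A sketch of Eq. (21) evaluated on a direction grid, with the weights scaled to unit l1-norm over directions; the grid size and the truncation of the wrapped sum are illustrative assumptions:

```python
import math

def wrapped_gaussian(theta, mu, var, n_wraps=3):
    """Wrapped Gaussian density N_w(theta; mu, var): a Gaussian summed over
    2*pi-shifted copies (truncated to a few wraps here for illustration)."""
    return sum(
        math.exp(-((theta - mu + 2.0 * math.pi * k) ** 2) / (2.0 * var))
        / math.sqrt(2.0 * math.pi * var)
        for k in range(-n_wraps, n_wraps + 1)
    )

def spatial_weights(directions, mu, var):
    """Direction weights z_nd,p (Eq. 21) for one source and frame,
    normalized to unit l1-norm over the direction grid."""
    z = [wrapped_gaussian(th, mu, var) for th in directions]
    total = sum(z)
    return [w / total for w in z]
```

The wrapping matters near ±180°: a source tracked at μ = π places its largest weight on the grid point at −π, which is the same physical direction.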
The small spatial window in the case of no source activity can be motivated from a similar perspective: a source spectrogram estimated from a very constrained spatial window is less likely to capture spectrogram details of other sources close to the target trajectory. The acoustic tracker output variance, denoted as σ²_n,p at each time step n, indicates the uncertainty of the source being present at its respective predicted direction μ̂_n,p. The above-specified strategy can be obtained from the tracker output variance by specifying σ̂²_n,p = c − σ²_n,p with a constant c = max_n,p(σ²_n,p) + min_n,p(σ²_n,p). This operation maps the maximum output variance to the smallest spatial window and vice versa. In practice the value range of σ²_n,p is restricted to avoid specifying extremely wide or narrow spatial windows, and thus the value of the constant c is set in advance. The limits for σ²_n,p are discussed in more detail in Section V-C. The direction weights for each source at each time frame are scaled to unit l1-norm (Σ_{d=1}^D z_nd,p = 1). This is done to restrict H_fn,p to modeling only the spatial behavior of the sources, without affecting the modeling of their overall energy. Additionally, when a source is considered inactive, i.e., before its birth or after its death, all the direction weights in the corresponding time frames are set to zero. The spatial weights z_nd,p corresponding to the tracking result in Figure 5 are illustrated in the two top panels of Figure 6. By comparing to Figure 5, it can be seen that when no new measurements are observed the tracker output state variance is high and the spatial weights are concentrated tightly around the mean, whereas in the case of high certainty the spatial spread is wider. In order to model the background noise and diffuse sources, an additional background source is added with direction weights set to one at indices where Σ_p z_nd,p < T and zero otherwise.
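The inversion of the tracker variance into the spatial-window variance, σ̂²_n,p = c − σ²_n,p with a restricted value range, can be sketched as follows (the limits 0.025 and 0.3 are those quoted later in Section V-C; the helper name is hypothetical):

```python
def window_variance(sigma2, lo=0.025, hi=0.3):
    """Map tracker output variance to spatial-window variance: clip to [lo, hi]
    and flip via c - sigma2 with c = hi + lo, so maximum tracker uncertainty
    yields the narrowest spatial window and vice versa."""
    clipped = min(max(sigma2, lo), hi)
    return (hi + lo) - clipped
```

With these limits c = 0.325, so the most uncertain frames get a window variance of 0.025 and the most certain ones get 0.3.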
The threshold T is set to allow the detected and tracked sources to capture all spatial evidence within approximately ±30 degrees of their estimated DOAs when the certainty for the given source is high. With the chosen background modeling strategy the tracked sources have exclusive priority for modeling signals originating from the tracked DOAs, with the exception of two DOA trajectories intersecting. An example of the background spatial weights is illustrated in the bottom panel of Figure 6. Note that the differently colored regions at different times are due to the scaling (Σ_{d=1}^D z_nd,p = 1) and the different spatial window widths of the tracked sources at the corresponding time indices.

Fig. 6: The reconstructed spatial weights as given in Equation (21) for the two detected sources are illustrated in the two top panels, and the spatial weights corresponding to the background source are illustrated in the bottom panel.

D. Parameter Estimation

The multichannel NMF model (18) with the time-varying DOA kernel based SCM model (20) and the spatial weights as specified in Equation (21) results in the model

X_fn ≈ X̂_fn = Σ_{p=1}^P ( Σ_{d=1}^D W_fd z_nd,p ) ( Σ_{q=1}^Q b_q,p t_fq v_qn ),  (22)

where the first parenthesized sum is H_fn,p and the second approximates ŝ_fn,p. In order to use the above model for separation, the parameters b_q,p, t_fq and v_qn defining the magnitude spectrogram modeling part need to be estimated with respect to an appropriate optimization criterion. We use the squared Frobenius norm as the cost function, defined as Σ_{f=1}^F Σ_{n=1}^N ||X_fn − X̂_fn||²_F.
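For intuition, here is a toy sketch of minimizing this squared Frobenius cost with multiplicative updates in the simplest possible setting: a single channel and a single source (M = P = 1), where the SCM traces reduce to scalars and the updates collapse to the classic Euclidean NMF rules. This is a simplification for illustration, not the full multichannel procedure.

```python
import random

def nmf_euclidean(X, Q, iters=100, seed=0):
    """Multiplicative NMF updates under the squared (Frobenius/Euclidean) cost
    for an F x N non-negative matrix X; returns factors T (F x Q) and V (Q x N)."""
    rng = random.Random(seed)
    F, N = len(X), len(X[0])
    T = [[rng.random() + 0.1 for _ in range(Q)] for _ in range(F)]
    V = [[rng.random() + 0.1 for _ in range(N)] for _ in range(Q)]
    eps = 1e-12
    for _ in range(iters):
        Xhat = [[sum(T[f][q] * V[q][n] for q in range(Q)) for n in range(N)] for f in range(F)]
        # t_fq <- t_fq * (sum_n v_qn X_fn) / (sum_n v_qn Xhat_fn), the scalar form of Eq. (24)
        for f in range(F):
            for q in range(Q):
                num = sum(V[q][n] * X[f][n] for n in range(N))
                den = sum(V[q][n] * Xhat[f][n] for n in range(N)) + eps
                T[f][q] *= num / den
        Xhat = [[sum(T[f][q] * V[q][n] for q in range(Q)) for n in range(N)] for f in range(F)]
        # v_qn <- v_qn * (sum_f t_fq X_fn) / (sum_f t_fq Xhat_fn), the scalar form of Eq. (25)
        for q in range(Q):
            for n in range(N):
                num = sum(T[f][q] * X[f][n] for f in range(F))
                den = sum(T[f][q] * Xhat[f][n] for f in range(F)) + eps
                V[q][n] *= num / den
    return T, V

def frobenius_cost(X, T, V):
    """Squared Frobenius cost sum_{f,n} (X_fn - Xhat_fn)^2 for the scalar case."""
    Q = len(V)
    return sum((X[f][n] - sum(T[f][q] * V[q][n] for q in range(Q))) ** 2
               for f in range(len(X)) for n in range(len(X[0])))
```

The ratio form of the updates keeps the parameters non-negative and the cost non-increasing, which is the same mechanism the full updates (23)-(25) rely on.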
Multiplicative updates for estimating the optimal parameters in an iterative manner can be obtained by partial derivation of the cost function and the use of auxiliary variables, as in the expectation-maximization algorithm [48]. The procedure for obtaining multiplicative updates for different multichannel NMF models and optimization criteria is presented in [18] and can be extended to the newly proposed formulation in (22). The entire probabilistic formulation is not repeated here and can be reviewed in [18]. The update equations for the non-negative parameters are

b_q,p ← b_q,p [Σ_{f,n} t_fq v_qn tr(X_fn H_fn,p)] / [Σ_{f,n} t_fq v_qn tr(X̂_fn H_fn,p)],  (23)

t_fq ← t_fq [Σ_{n,p} b_q,p v_qn tr(X_fn H_fn,p)] / [Σ_{n,p} b_q,p v_qn tr(X̂_fn H_fn,p)],  (24)

v_qn ← v_qn [Σ_{f,p} b_q,p t_fq tr(X_fn H_fn,p)] / [Σ_{f,p} b_q,p t_fq tr(X̂_fn H_fn,p)].  (25)

Note that, in contrast to earlier works on multichannel NMF for separation of stationary sound sources [18], [19], we do not update the SCM part H_fn,p. It is assumed that the acoustic source tracking and the spatial weights of the DOA kernels W_fd fully represent the spatial behavior of the source. This strategy is assessed in more detail in the discussion in Section VI.

Fig. 7: Illustration of the recording setup and the source movement.

E. Source separation

For extracting the source signals from the mixture we use a combination of single-channel Wiener filtering and delay-and-sum beamforming. The separation soft mask m_fn,p for extracting the source spectrogram from the mixture is obtained using the estimated real-valued magnitude spectrogram ŝ_fn,p to formulate a generalized Wiener filter, defined as

y_fn,p = m_fn,p x_fn = [ŝ_fn,p / Σ_{p'} ŝ_fn,p'] x_fn.  (26)

We employ delay-and-sum beamforming to produce a single-channel source signal from the separated multichannel signals y_fn,p (which have the mixture signal phase).
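The masking and beamforming steps can be sketched for a single TF-point as follows; the per-microphone delays for the steering vector are assumed to be given by the array geometry, and the helper names are illustrative:

```python
import cmath, math

def wiener_masks(s_hat):
    """Generalized Wiener masks m_p = s_p / sum_p' s_p' (Eq. 26) for one TF-point,
    computed from the estimated source magnitude spectrograms."""
    total = sum(s_hat)
    return [s / total if total > 0 else 0.0 for s in s_hat]

def dsb_weights(freq_hz, delays_s):
    """Delay-and-sum steering vector towards a DOA: w_m = exp(-j*2*pi*f*tau_m)/M,
    where tau_m is the propagation delay of microphone m for that direction."""
    M = len(delays_s)
    return [cmath.exp(-2j * math.pi * freq_hz * t) / M for t in delays_s]

def dsb_output(w, y):
    """Beamformer output w^H y for one TF-point."""
    return sum(wm.conjugate() * ym for wm, ym in zip(w, y))
```

A plane wave whose per-channel phases match the steering delays is passed with unit gain, which is the defining property of the distortionless delay-and-sum combination.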
The final estimates of the sources are given as

ŷ_fn,p = w_fn,p^H y_fn,p,  (27)

where w_fn,p are the DSB weights (steering vector) towards the estimated direction of source p at time frame n. Finally, the time-domain signals are reconstructed by applying the inverse DFT to each frame, and the frames are combined using overlap-add processing.

V. EVALUATION

In this section we present the objective separation performance and tracking performance of the proposed method using real recordings of moving sound sources. Additionally, we evaluate the separation performance of the proposed algorithm in a setting where the sources are not moving, allowing comparison to conventional spatial and spectrogram factorization models that assume stationary sources.

A. Datasets with moving sources

The development and evaluation material was recorded using a compact microphone array consisting of four Sennheiser MKE2 omnidirectional condenser microphones placed in a diamond pattern illustrated in Figure 2; the exact locations of the microphones are documented in [19]. The recordings were conducted in an acoustically treated room with dimensions 4.53 m × 3.96 m × 2.59 m and reverberation time T60 = 0.26 s. The microphone array was placed approximately at the center of the room. Four persons spoke phonetically balanced sentences while walking clockwise (CW) and counterclockwise (CCW) around the microphone array at approximately constant velocity and at an average distance of one meter from the array center. The overall assembly of the recordings and the movement of the sources is illustrated in Figure 7. Two CW and two CCW 30-second recordings with four persons were made, totaling 16 signals. All speakers started from the same position and walked on average two times around the array within the recorded 30-second segment. The recordings were made individually, allowing mixture signals to be produced by combining the recordings from different persons.
The reference speech signals were captured by a close-field microphone (AKG C520). Additionally, 16 recordings were made with a loudspeaker playing babble noise and music outside the recording room with the door open; these are considered a stationary (S) sound source with a highly reflected propagation path. The movement of each individual speaker was annotated by hand based on SRP. An example of the annotations is illustrated in Figure 5. Note that the annotations are only plotted when the source is active (a simple energy threshold from the close-field microphone signal). The VAD information is only used for evaluation purposes and not by the proposed algorithm.

TABLE I: Description of datasets.

Development dataset
  Number of samples | Source 1 | Source 2   | Details
  2                 | CW       | CW         | ∠ > 45°
  2                 | CCW      | CCW        | ∠ > 45°
  4                 | CW       | CCW        | Sources intersect
  12                | CW/CCW   | S (babble) | SNR = -5, -10 and -15 dB
  4                 | CW/CCW   | S (music)  | SNR = -10 dB

Evaluation dataset, 2 sources
  Number of samples | Source 1 | Source 2 | Details
  16                | CW       | CW       | ∠ > 45°
  16                | CCW      | CCW      | ∠ > 45°
  16                | CW       | CCW      | Sources intersect

Evaluation dataset, 3 sources
  Number of samples | Source 1 | Source 2 | Source 3 | Details
  4                 | CW       | CW       | CW       | ∠ > 45°
  4                 | CCW      | CCW      | CCW      | ∠ > 45°
  4                 | CW       | CW       | CCW      | Sources intersect
  4                 | CCW      | CCW      | CW       | Sources intersect

Dataset with stationary sources from [19]
  Number of samples | Source 1   | Source 2    | Details
  8 / 8             | 45° / 135° | 90° / 180°  | ∠ = 45°
  8 / 8             | 0° / 45°   | 90° / 135°  | ∠ = 90°
  8 / 8             | 0° / 45°   | 135° / 180° | ∠ = 135°

Three different datasets were generated by mixing two and three individual speaker recordings: one for development and two for evaluation purposes. All mixture utterances in all datasets were 10 seconds in duration. In all datasets the signals were manually cut in such a way that, based on the annotations, speaker trajectories were no closer than 45° when going in the same direction (CW vs. CW and CCW vs. CCW). Naturally, the trajectories can intersect in the case of opposite directions (CW vs. CCW).
For the development set the first 15 seconds of each recording were used, while the remaining 15 to 30 seconds were used to generate the evaluation sets. For the development set, 8 mixtures of two speakers and 16 mixtures of a live speaker recording and the stationary source were generated, and each recording was used only once. The first evaluation dataset consists of 48 mixtures of two speakers using all possible unique combinations of the recordings with different speakers. The second evaluation dataset contains 16 mixtures of three speakers based on a subset of all possible unique combinations. The subset was chosen to represent all different source trajectory combinations (all sources moving in the same direction vs. one of the sources moving in the opposite direction). The datasets are summarized in Table I. The global parameters related to the tracking and separation performance of the proposed algorithm were optimized using the development dataset. The recorded signals were downsampled and processed with a sampling rate of Fs = 24000 Hz.

B. Dataset with stationary sources

In order to compare the performance of the proposed algorithm against conventional methods assuming stationary sources [18], [19], [20], we include an additional evaluation dataset with completely stationary sources. We use the dataset introduced in [19], consisting of two simultaneous sound sources. In short, the dataset contains speech, music and noise sources convolved with RIRs from various angles, captured in a regular room (7.95 m × 4.90 m × 3.25 m) with a reverberation time T60 = 350 ms. The array used for recording is exactly the same as the one used in the datasets introduced in Section V-A, and more details of the recordings can be found in [19]. In total the dataset contains 48 samples with 8 different source types and 6 different DOA combinations. Each sample is 10 seconds in duration. The different conditions are summarized in the last part of Table I.
C. Experimental setup

For the WGMM parameter estimation the peaks in the SRP function were enhanced by exponentiation, S_dn^{3/2}, which emphasizes high-energy peaks (direct path) while decreasing low-energy reflected content. This was found to improve operation in moderate reverberation. A five-component (K = 5) WGMM model (6) was estimated from the SRP, and the parameters from the previous frame were used as initialization for the next frame. The criteria for removing WGMM measurements were set to σ_n,k > 0.6 rad (34°) and a_n,k < 0.15 by visually inspecting the development set results. The acoustic tracker parameters were optimized by maximizing the development set tracking performance and were set to the following values. The average variance of the WGMM measurements was scaled to σ² = 0.25 for each processed signal, to be in an appropriate range for the particle filtering toolbox². The clutter prior probability was fixed for all measurements to C_P = 0.1. In the particle filtering framework the lifetime of a target is modeled using a gamma distribution with parameters α and β; the best tracking performance was achieved with α = 3 and β = 4. The target initial state was fixed to s_n^(p) = [cos(π), sin(π), 0.1, 0.1]. The pre-set prior probability of source birth was set to B_P = 0.005. The parameters of the multichannel NMF algorithm were set as follows: the window length was 2048 samples with 50% overlap, and 80 NMF components were used for modeling the magnitude spectrogram. The signals were processed in their entirety. Before restoring the spatial weights by Equation (21), minimum and maximum values for the variance σ²_n,p were set to 0.025 and 0.3, respectively. This was done in order to avoid unnecessarily wide or narrow spatial windows (as can be seen from Figure 6). With the chosen minimum and maximum values for σ̂²_n,p, the constant in the mapping σ̂²_n,p = c − σ²_n,p becomes c = max(σ²_n,p) + min(σ²_n,p) = 0.325.
The background source threshold for setting the spatial weights active was set to T = 0.01, which corresponds to an approximately ±30° exclusive spatial window for the tracked sources when the tracker output state variance is at its minimum, indicating high certainty of the source being present at the predicted direction.

D. Acoustic tracking performance

The acoustic tracking performance is evaluated against the hand-annotated ground truth source trajectories using accuracy (mean absolute error) and recall rate as the metrics. The tracking error for each source in each time frame, with 2π ambiguity, is specified as

e_n,p = μ̂^(ann.)_n,p − μ̂_n,p = ẽ_n,p + 2πN,  N ∈ Z,  (28)

where μ̂^(ann.)_n,p denotes the annotated DOA of the p-th source in time frame n and μ̂_n,p is obtained using Equation (14). Using the error term ẽ_n,p, which is wrapped to [−π, π], we specify the mean absolute error (MAE) as

MAE = (1/P̂) Σ_{p=1}^{P̂} (1/N) Σ_{n=1}^N |ẽ_n,p|,  (29)

where P̂ is the number of annotated sources. The recall rate is defined as the proportion of time instances in which a detected source is correctly active with respect to when the source was truly active and emitting sound. The ground truth of the active time instances is obtained by voice activity detection (VAD) using the close-field signal of the source. The VAD is used in order to take into account that some utterances start 1 to 2 seconds after the beginning of the signal, even though the annotations are continuous for the whole duration of the recordings. Additionally, if a tracked source dies before the end of the signal during a pause in speech, the duration of the pause is not counted as a recall error, but the remaining missing part is. We denote the recall rate by the variable recall ∈ [0, 1]. The proposed method uses a multi-target tracker that can detect an arbitrary number of sources and trajectories, denoted as P.
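The error wrapping of Eq. (28) and the per-source averaging inside Eq. (29) can be sketched as:

```python
import math

def wrap_angle(e):
    """Wrap an angular error to [-pi, pi], resolving the 2*pi*N ambiguity of Eq. (28)."""
    return (e + math.pi) % (2.0 * math.pi) - math.pi

def source_mae(annotated, estimated):
    """Mean absolute DOA error over frames for one source (the inner sum of
    Eq. 29), in radians; frames are assumed aligned between the two lists."""
    errors = [abs(wrap_angle(a - b)) for a, b in zip(annotated, estimated)]
    return sum(errors) / len(errors)
```

Without the wrapping, a source annotated at 179° and estimated at −179° would count as a 358° error instead of the true 2° error.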
For the evaluation of tracking and separation performance we need to match the annotated sources 1, ..., P̂ and the detected sources 1, ..., P by searching through all possible permutations r of the detected sources, denoted as P_r : {1, ..., P} → {1, ..., P̂}. The permutation matrix P_r is applied to change the order in which the detected sources are evaluated against the annotations. We propose to choose the permutation r for final scoring that maximizes a combination of MAE and recall. First, MAE is converted into a proportional measure MAER = 1 − (MAE/π) ∈ [0, 1], where 1 denotes zero absolute error and 0 denotes the maximum tracking error of π rad = 180° at all times. Summing the MAER and the recall rate with permutation r applied to the estimated sources gives

F_r = MAER_r + recall_r,  (30)

which is referred to as the overall accuracy. The best permutation for each signal is chosen by finding the maximum value of F_r over all permutations indexed by r. The combination of both measures is used to avoid favoring permutations with very short detected trajectories with small MAE over longer trajectories with slightly larger MAE, for example the accurate tracking of a single word from the entire utterance. Additionally, we do not consider or compensate for cases where one sound source is correctly tracked by two trajectories with a discontinuity during pauses in speech. The effect of this is negligible due to the short test signals used (10 seconds).

² http://becs.aalto.fi/en/research/bayes/rbmcda/

TABLE II: Acoustic tracking results.

Tracking performance
  Criteria | Dev. (2 sources) | Eval. (2 sources) | Eval. (3 sources)
  MAE      | 7.3°             | 6.1°              | 10.5°
  recall   | 86.3%            | 82.2%             | 64.7%

Source detection performance
  Criteria | Dev. (2 sources) | Eval. (2 sources) | Eval. (3 sources)
  P == P̂   | 79.2%            | 81.3%             | 50.0%
  P > P̂    | 20.8%            | 16.7%             | 25.0%
  P < P̂    | 0.0%             | 2.0%              | 25.0%
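The permutation search maximizing F_r (Eq. 30) can be sketched as follows; `maer[d][a]` and `recall[d][a]` are an assumed precomputed per-pairing score layout (with equal numbers of detected and annotated sources), and the exhaustive search is feasible because P is small:

```python
import itertools
import math

def best_permutation(maer, recall):
    """Assign detected sources to annotated sources by exhaustively maximizing
    the overall accuracy F_r = MAER_r + recall_r (Eq. 30)."""
    P = len(maer)
    best_perm, best_f = None, -math.inf
    for perm in itertools.permutations(range(P)):
        f = sum(maer[d][perm[d]] + recall[d][perm[d]] for d in range(P))
        if f > best_f:
            best_f, best_perm = f, perm
    return best_perm, best_f
```

Because MAER and recall both lie in [0, 1] and both reward good pairings, a short but accurate spurious trajectory cannot outrank a long, slightly less accurate one.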
The acoustic tracking performance averaged over all signals in each dataset, together with the source detection performance, is reported in Table II. The tracking error measured by MAE is below 10 degrees for the datasets with two sources, which can be regarded as a good result. Noticeably, the tracking accuracy is even better on the evaluation dataset. However, the recall rate drops by about 4 percentage points, mostly due to late detection of sources, which illustrates the difficulty of setting optimal values for the parameters controlling the birth and death of sources in the particle filtering. In general, a low recall rate can be considered conservative, detecting and tracking only the dominant portions of sources. Conversely, optimizing the parameters for a 100% recall rate would lead to the detection of numerous phantom sources caused by reverberation and noise in the recordings. The recall rate and tracking accuracy are noticeably decreased for the evaluation dataset with three simultaneous sources. As indicated by the second part of Table II, the percentage of correctly detected numbers of sources is approximately 80% for both two-source datasets and drops to 50% for the dataset with three sources. The errors in source detection are mostly caused by overdetection in the case of two simultaneous sources, whereas in the more difficult scenario of three sources underdetection is also a significant cause of error.

E. Source separation performance

1) Separation evaluation criteria: We evaluate the separation performance of the proposed algorithm using the following objective separation criteria, with the close-field microphone signal as the reference. From the separation evaluation toolbox proposed in [38] we have included the signal-to-distortion ratio (SDR) and signal-to-interference ratio (SIR), evaluated in short segments of 200 ms.
The score of each segment in dB scale is converted to linear scale and averaged over all segments, after which the resulting average is converted back to dB scale. The resulting metrics are abbreviated as segmental SDR (SSDR) and segmental SIR (SSIR). Segmental evaluation was chosen due to the operation of BSSEval, which projects the separated signal into the reference signal subspace and assumes that this projection is stationary. However, in the case of moving sound sources and a close-field microphone reference, the initial delay to the far-field array and the room reflections change from frame to frame, which requires the projection operator to also be time-variant. This is achieved by assuming projection stationarity within each 200 ms segment. The other metrics include the short-time objective intelligibility measure (STOI) [39], which is used to predict the intelligibility of the separated speech in comparison to the reference signal, and the frequency-weighted segmental signal-to-noise ratio (fwSegSNR) [40]. The latter metrics, STOI and fwSegSNR, were calculated without segmenting.

2) Reference methods: The descriptions of the methods whose separation performance is evaluated are given in Table III and can be summarized as follows. The plain microphone signal from the array acts as the lowest-performance baseline, whereas the IRM indicates an upper limit. The tracking information is also used in the beamforming (DSB and MVDR) to specify the weights w_fn,p^H that enhance the signal originating from the estimated DOA at each time frame. The separation performance of the proposed method was also evaluated using the ground truth DOA trajectories to specify the source movement for the multichannel NMF part.
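The segmental averaging used for SSDR and SSIR above (dB to linear, arithmetic mean, back to dB) can be sketched as:

```python
import math

def segmental_average_db(segment_scores_db):
    """Average per-segment scores: convert dB to linear power ratios, take the
    arithmetic mean over segments, and convert the mean back to dB."""
    linear = [10.0 ** (s / 10.0) for s in segment_scores_db]
    return 10.0 * math.log10(sum(linear) / len(linear))
```

Note that this differs from averaging the dB values directly: segments with high linear scores dominate the mean, so for example segments of 0 dB and 20 dB average to about 17 dB rather than 10 dB.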
This evaluation indicates the highest achievable separation performance with perfect tracking information, and the results can also be used to validate the robustness of the overall proposed separation algorithm towards small tracking errors. The MVDR beamforming was implemented using the sample covariance method [42] for estimating the noise covariance matrix. The noise covariance was estimated from the M = 20 previous frames with respect to each processed frame, and it captures the stationary noise statistics as well as the immediate spectral details of interfering speech sources. Additionally, diagonal loading (σ = 5) of the noise covariance matrices was applied to improve the robustness of the MVDR beamformer. These parameters were optimized using the development set.

3) Separation results: The separation performance measured by SSDR, SSIR, STOI and fwSegSNR is calculated with the source permutation obtained by maximizing the accuracy criterion specified in Equation (30), and the results are averaged over all sources and mixtures.

TABLE III: Description of compared separation methods.

  Abbrv.    | Description
  mic       | Microphone signal from the array (ch #1).
  DSB       | Delay-and-sum beamforming.
  MVDR      | Minimum variance distortionless response beamforming.
  MNMF sta. | Multichannel NMF assuming stationary sources [20].
  MNMF      | Proposed method, i.e., multichannel NMF with time-varying SCM model.
  MNMF ann. | Proposed method with ground truth annotations as source trajectories.
  IRM       | Ideal ratio mask separation.

Fig. 8: Separation performance measured using various objective separation criteria: (a) SSDR, (b) SSIR, (c) STOI and (d) fwSegSNR, for the development set and the evaluation sets with two and three sources.
The separation performance for all tested methods with all the considered criteria is given in Figure 8 (a)-(d). Evaluation with the mixture signal (mic) indicates a baseline performance, resulting in an SSDR of approximately 4 dB for two simultaneous sources and 2 dB for three sources. Such a high absolute performance for the mixture signal is a consequence of the segmental evaluation. In contrast, evaluating the entire signals in one segment resulted in a negative average SDR for all tested methods (from -6 dB SDR for the mixture to -1 dB for the IRM), while the relative differences remained the same as reported in Figure 8. The absolute results obtained this way did not reflect the subjective separation performance and are not reported in the paper. The overall low scores were assumed to be caused by the problems of the projection operation in the BSSEval toolkit, discussed at the beginning of Section V-E. The beamforming methods, DSB and MVDR, consistently improve SSDR, SSIR and STOI in comparison to the microphone signal. However, the overall improvement in all datasets is relatively small: the SSDR improvement varies from 0.45 dB to 0.70 dB, and STOI barely reaches an index of 0.5, indicating low predicted intelligibility for the separated sources. Additionally, MVDR beamforming has a negative effect on the fwSegSNR criterion, which may be caused by unwanted cancellation of the target source due to the rudimentary noise covariance estimation method employed; DSB does not exhibit this negative effect. With all other evaluated criteria MVDR beamforming exceeds the DSB performance by a small margin. When violating the moving-sources assumption and using multichannel NMF intended for separation of stationary sound sources [20], the performance is poor, especially in terms of STOI, which decreases below the array microphone baseline.
The other evaluated metrics are similar to the beamforming approaches. When the source is momentarily at the target static direction (blindly estimated in [20]) the separation quality is high, but as the source moves it shifts out of the spatial focus of the separation. The results regarding the proposed method can be summarized by stating that it significantly increases the separation performance over the beamforming methods DSB and MVDR: in the case of two sources the SSDR improves by approximately 1.5 dB, and the improvement is even greater for three simultaneous sources (2 dB). The SSIR improvement follows the trend of SSDR, and STOI increases by approximately an index of 0.1. The use of ground truth annotations in the case of two simultaneous sources does not significantly increase the performance: SSDR increases only by 0.13 dB and 0.15 dB, SSIR is unchanged, and STOI increases by 0.01. This validates the good acoustic tracking results reported in Table II. The separation performance with three sources follows the poorer tracking performance, and the use of annotations has a greater impact on the objective criteria: SSDR is increased by 0.5 dB by the use of annotations, but interestingly SSIR decreases. This may be because the annotations are always active even if the speech starts 1 to 2 seconds after the beginning of the signal, resulting in non-zero separation output for the annotations, whereas in the actual tracking implementation the source signal is truly zero until the source is detected. The behavior of all methods with all criteria is consistent across all datasets.

TABLE IV: Results for the dataset and methods from [19] and [47] with two simultaneous stationary sources.

  Method         | SDR    | SIR    | SAR     | ISR
  MNMF proposed  | 3.2 dB | 9.0 dB | 7.0 dB  | 5.8 dB
  MNMF sta. [20] | 5.6 dB | 6.8 dB | 13.1 dB | 9.9 dB
  MNMF sta. [19] | 4.8 dB | 8.1 dB | 10.3 dB | 10.5 dB
  MNMF sta. [47] | 3.7 dB | 4.5 dB | 12.7 dB | 8.4 dB
  ICA [50]       | 2.0 dB | 4.5 dB | 8.2 dB  | 6.9 dB
Although the absolute differences in objective intelligibility are small, the increase in STOI by the proposed method over the beamforming methods is greater than the STOI improvement by beamforming over the microphone signal. The overall difficulty of each dataset can be estimated from the IRM performance, which indicates that the actual evaluation dataset with two sources is less difficult than the development dataset. The performance gap between the proposed method and IRM separation is considerable, especially in the case of three simultaneous sources. This is due to the fact that the IRM is not much affected by the addition of a third source, since speech is relatively sparse in the time-frequency domain and good separation can be achieved with oracle masks even with three simultaneous speakers. Evaluation of the IRM performance by SSDR and SSIR for the two-source dataset indicates a smaller difference in comparison to the proposed method. However, the IRM performance is also limited by the fact that the far-field and close-field signal spaces are extremely different, and time-frequency masking cannot recover the close-field signal perfectly because the mixture phase is used. In subjective evaluation the IRM preserves the intelligibility of the speech much better than any separation method, which is also indicated by the good results in the objective evaluation of intelligibility, i.e., STOI for the IRM is around 0.7-0.8. As a final result we provide a comparison of separation performance obtained with similar DOA-based spatial and spectrogram factorization models assuming stationary sources [18], [19], [20]. The evaluation dataset consists of all sources being stationary, see Section V-B. The proposed algorithm was run as is, with the exception that source reconstruction was done without DSB, since the reference signals are reverberated source signals and the evaluation in [20] is based on the spatial images of the sources [49].
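The ideal ratio mask (IRM) oracle discussed above can be sketched compactly: the mask is computed from the known source and interference magnitude spectrograms, and applied to the complex mixture STFT, reusing the mixture phase. The function names and the squared-magnitude mask definition below are one common formulation, assumed here for illustration rather than taken from the paper.

```python
import numpy as np

def ideal_ratio_mask(source_mag, interference_mag, eps=1e-12):
    """Oracle ratio mask from known source/interference magnitude
    spectrograms (a common IRM definition; a sketch, not the paper's
    exact oracle)."""
    return source_mag ** 2 / (source_mag ** 2 + interference_mag ** 2 + eps)

def apply_mask(mask, mixture_stft):
    # The mask only rescales magnitudes; the mixture phase is kept, which is
    # one reason oracle masking cannot perfectly recover a close-field signal.
    return mask * mixture_stft
```

Because the mask is applied per time-frequency bin, adding a third sparse speech source leaves most bins dominated by a single speaker, which illustrates why IRM performance degrades little with three simultaneous speakers.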
The details of the evaluation procedure and reference results are as presented in [20]. In theory, if the source DOA trajectory estimation were perfect, similar results should be obtained between multichannel NMF-based methods regardless of the source movement assumption. However, the methods proposed in [19], [20] also update the elements of the DOA kernels (Equation (19)), whereas in this work they are fixed to analytic anechoic array responses. The SDR, SIR, SAR and ISR are given in Table IV, which shows that the SDR performance of the proposed method is lower in comparison to methods utilizing the stationary assumption, while its SIR is the highest among the tested methods. The average tracking error (MAE) for the dataset was 8.8° and the recall rate was 79%, which are similar to the tracking performance for the other two-source datasets given in Table II. The comparison to multichannel NMF models assuming stationary sources motivates future work on reducing the performance gap while assuming moving sound sources; possible research directions are discussed in the next section.

VI. DISCUSSION

In this section we present a few remarks regarding the algorithm development choices and possible future work for improving and extending the method. The strategy of using the estimated source DOA trajectories for the definition of the SCM model in (22) effectively means using only channel-wise time differences and assuming an anechoic environment. This strategy can be questioned in comparison to also updating the channel-wise level differences as in [19], [20]. However, the difficulty of updating the level differences between input channels lies in the fact that with moving sources there may be only very few frames of data observed from each direction, and investigation of updating W_fd in such a setting was left for future work. The multichannel NMF model (22) would allow the use of a multichannel Wiener filter (MWF) for source reconstruction as in [18].
Informal experiments showed inferior performance with the MWF in comparison to the chosen combination of single-channel Wiener filter and DSB. There are several possible reasons for this finding. The multichannel model used for representing source SCMs relies only on the anechoic responses, which can be suboptimal for constructing the MWF for source reconstruction. Additionally, errors in source SCM estimation can lead to unexpectedly sharp spectral and spatial responses for source reconstruction with the MWF. The strategy of single-channel Wiener filter and DSB is argued to be less destructive with respect to small estimation errors. Analysis of the tracking performance indicated that fairly accurate source DOA trajectories can be estimated with existing methods in realistic capturing conditions, which justifies the applicability of the proposed separation algorithm for general use. Tracking errors may be caused by erroneously representing a single source with two consecutive but separate tracks due to pauses in speech. Also, in the case of intersecting DOA trajectories, the estimated tracks can switch between the actual acoustic targets, i.e., track 1 continues to follow the acoustic evidence of source 2 and vice versa. It should be noted that we did not account for the above problems in the tracking and separation performance evaluation. The extremely good results of deep learning for speech separation [10], [11], [51] are quickly displacing factorization-based models in source separation. With multichannel audio, the complex-valued spatial parameters require other approaches for SCM estimation; for example, in [52] DNNs are used for spectrogram estimation while the SCMs are estimated using a probabilistic model and the EM algorithm.
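Evaluating the DOA tracking errors discussed above requires circular statistics, since azimuth trajectories wrap around at 360°. The sketch below shows one plausible way to compute a mean absolute error in degrees with wrapping; the function name is illustrative, and the paper's exact procedure (e.g. the track-to-source assignment) may differ.

```python
import numpy as np

def circular_mae_deg(est_deg, ref_deg):
    """Mean absolute DOA error in degrees with 360-degree wrapping
    (a sketch of a plausible MAE computation, not the paper's exact one)."""
    # Map raw differences into (-180, 180] before taking absolute values,
    # so e.g. 359 degrees vs. 1 degree counts as a 2-degree error.
    diff = (np.asarray(est_deg, dtype=float)
            - np.asarray(ref_deg, dtype=float) + 180.0) % 360.0 - 180.0
    return float(np.mean(np.abs(diff)))
```

Without the wrapping step, a track hovering around 0°/360° would report errors near 360° even when the estimate is essentially correct.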
The strength of the proposed method compared to DNN-based separation is that it operates on spatial information and spectral factorization of the observed data only, and works relatively well in any scenario and with all sound content (music, noise, everyday sounds) without any training material.

VII. CONCLUSIONS

In this article a separation method for moving sound sources based on acoustic tracking and separation mask estimation by multichannel non-negative matrix factorization was proposed. We analyzed the objective separation performance, and the proposed method exceeded conventional beamforming using the same tracking information by a fair margin. The comparison against ground truth source DOA trajectories indicated only minor impairment to the objective separation performance. Additionally, analysis of the acoustic tracking realization showed good performance: a recall rate over 80% and an absolute tracking error of less than 10 degrees with two simultaneous moving sound sources. In conclusion, the proposed method was shown to be robust and capable of separating at least two moving targets from mixtures recorded with a compact-sized microphone array in realistic capturing conditions.

REFERENCES

[1] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, "The third CHiME speech separation and recognition challenge: Dataset, task and baselines," in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015, pp. 504–511.
[2] Y. Liu, P. Zhang, and T. Hain, "Using neural network front-ends on far field multiple microphones based speech recognition," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 5542–5546.
[3] V. Valimaki, A. Franck, J. Ramo, H. Gamper, and L. Savioja, "Assisted listening using a headset: Enhancing audio perception in real, augmented, and virtual environments," IEEE Signal Processing Magazine, vol. 32, no. 2, pp.
92–99, 2015.
[4] T. Heittola, A. Mesaros, T. Virtanen, and M. Gabbouj, "Supervised model training for overlapping sound events based on unsupervised source separation," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 8677–8681.
[5] P. Smaragdis, "Blind separation of convolved mixtures in the frequency domain," Neurocomputing, vol. 22, no. 1, pp. 21–34, 1998.
[6] F. Nesta, P. Svaizer, and M. Omologo, "Convolutive BSS of short mixtures by ICA recursively regularized across frequencies," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 3, pp. 624–639, 2011.
[7] A. Jourjine, S. Rickard, and O. Yilmaz, "Blind separation of disjoint orthogonal signals: Demixing N sources from 2 mixtures," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5. IEEE, 2000, pp. 2985–2988.
[8] O. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Transactions on Signal Processing, vol. 52, no. 7, pp. 1830–1847, 2004.
[9] S. Rickard and O. Yilmaz, "On the approximate W-disjoint orthogonality of speech," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2002, pp. 529–532.
[10] A. Narayanan and D. Wang, "Ideal ratio mask estimation using deep neural networks for robust speech recognition," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada, 2013, pp. 7092–7096.
[11] F. Weninger, J. R. Hershey, J. Le Roux, and B. Schuller, "Discriminatively trained recurrent neural networks for single-channel speech separation," in Proceedings of IEEE Global Conference on Signal and Information Processing (GlobalSIP). IEEE, 2014, pp. 577–581.
[12] A. A. Nugraha, A. Liutkus, and E.
Vincent, "Multichannel music separation with deep neural networks," in European Signal Processing Conference (EUSIPCO), 2016.
[13] ——, "Multichannel audio source separation with deep neural networks," Research Report RR-8740, Inria, 2015.
[14] J. T. Geiger, J. F. Gemmeke, B. Schuller, and G. Rigoll, "Investigating NMF speech enhancement for neural network based acoustic models," in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014.
[15] J. F. Gemmeke, T. Virtanen, and A. Hurmalainen, "Exemplar-based sparse representations for noise robust automatic speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2067–2080, 2011.
[16] A. Ozerov and C. Fevotte, "Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 550–563, 2010.
[17] S. Arberet, A. Ozerov, N. Q. Duong, E. Vincent, R. Gribonval, F. Bimbot, and P. Vandergheynst, "Nonnegative matrix factorization and spatial covariance model for under-determined reverberant audio source separation," in Proceedings of International Conference on Information Sciences, Signal Processing and their Applications (ISSPA), 2010.
[18] H. Sawada, H. Kameoka, S. Araki, and N. Ueda, "Multichannel extensions of non-negative matrix factorization with complex-valued data," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 5, pp. 971–982, 2013.
[19] J. Nikunen and T. Virtanen, "Direction of arrival based spatial covariance model for blind sound source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 22, no. 3, pp. 727–739, 2014.
[20] ——, "Multichannel audio separation by direction of arrival based spatial covariance model and non-negative matrix factorization," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2014, pp. 6727–6731.
[21] R. Mukai, H. Sawada, S. Araki, and S. Makino, "Robust real-time blind source separation for moving speakers in a room," in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5. IEEE, 2003, pp. V-469.
[22] ——, "Blind source separation for moving speech signals using blockwise ICA and residual crosstalk subtraction," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. 87, no. 8, pp. 1941–1948, 2004.
[23] J. Málek, Z. Koldovský, and P. Tichavský, "Semi-blind source separation based on ICA and overlapped speech detection," in Latent Variable Analysis and Signal Separation. Springer, 2012, pp. 462–469.
[24] D. Kounades-Bastian, L. Girin, X. Alameda-Pineda, S. Gannot, and R. Horaud, "A variational EM algorithm for the separation of time-varying convolutive audio mixtures," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 8, pp. 1408–1423, 2016.
[25] T. Higuchi, N. Takamune, T. Nakamura, and H. Kameoka, "Underdetermined blind separation and tracking of moving sources based on DOA-HMM," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 3191–3195.
[26] M. S. Brandstein and H. F. Silverman, "A practical methodology for speech source localization with microphone arrays," Computer Speech & Language, vol. 11, no. 2, pp. 91–126, 1997.
[27] U. Klee, T. Gehrig, and J. McDonough, "Kalman filters for time delay of arrival-based source localization," EURASIP Journal on Applied Signal Processing, vol. 2006, pp. 167–167, 2006.
[28] S. Gannot and T. G.
Dvorkind, "Microphone array speaker localizers using spatial-temporal information," EURASIP Journal on Applied Signal Processing, vol. 2006, pp. 174–174, 2006.
[29] J. Vermaak and A. Blake, "Nonlinear filtering for speaker tracking in noisy and reverberant environments," in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5. IEEE, 2001, pp. 3021–3024.
[30] D. B. Ward, E. A. Lehmann, and R. C. Williamson, "Particle filtering algorithms for tracking an acoustic source in a reverberant environment," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 826–836, 2003.
[31] S. Särkkä, A. Vehtari, and J. Lampinen, "Rao-Blackwellized particle filter for multiple target tracking," Information Fusion, vol. 8, no. 1, pp. 2–15, 2007.
[32] ——, "Rao-Blackwellized Monte Carlo data association for multiple target tracking," in Proceedings of the Seventh International Conference on Information Fusion, vol. 1, 2004, pp. 583–590.
[33] X. Zhong and J. R. Hopgood, "Time-frequency masking based multiple acoustic sources tracking applying Rao-Blackwellised Monte Carlo data association," in IEEE Workshop on Statistical Signal Processing. IEEE, 2009, pp. 253–256.
[34] J. Traa and P. Smaragdis, "A wrapped Kalman filter for azimuthal speaker tracking," IEEE Signal Processing Letters, vol. 20, no. 12, pp. 1257–1260, 2013.
[35] ——, "Multichannel source separation and tracking with RANSAC and directional statistics," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 2233–2243, 2014.
[36] Y. Agiomyrgiannakis and Y. Stylianou, "Wrapped Gaussian mixture models for modeling and high-rate quantization of phase data of speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 4, pp. 775–786, 2009.
[37] J. Hartikainen and S.
Särkkä, "RBMCDA box - Matlab toolbox of Rao-Blackwellized data association particle filters," documentation of RBMCDA Toolbox for Matlab, 2008.
[38] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
[39] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.
[40] Y. Hu and P. C. Loizou, "Evaluation of objective quality measures for speech enhancement," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 1, pp. 229–238, 2008.
[41] S. Srinivasan, N. Roman, and D. Wang, "Binary and ratio time-frequency masks for robust speech recognition," Speech Communication, vol. 48, no. 11, pp. 1486–1501, 2006.
[42] I. Tashev, Sound Capture and Processing: Practical Approaches. John Wiley & Sons Inc, 2009.
[43] P. Smaragdis and P. Boufounos, "Learning source trajectories using wrapped-phase hidden Markov models," in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2005, pp. 114–117.
[44] J. Traa, "Multichannel source separation and tracking with phase differences by random sample consensus," Master's thesis, Graduate College of the University of Illinois at Urbana-Champaign, 2013.
[45] C. Spille, B. Meyer, M. Dietz, and V. Hohmann, "Binaural scene analysis with multidimensional statistical filters," in The Technology of Binaural Listening. Springer, 2013, pp. 145–170.
[46] H. Nies, O. Loffeld, and R. Wang, "Phase unwrapping using 2D-Kalman filter - potential and limitations," in International Geoscience and Remote Sensing Symposium, vol. 4. IEEE, 2008, pp. 1213–1216.
[47] H. Sawada, H. Kameoka, S. Araki, and N.
Ueda, "New formulations and efficient algorithms for multichannel NMF," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2011, pp. 153–156.
[48] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, vol. 39, no. 1, pp. 1–38, 1977.
[49] E. Vincent, H. Sawada, P. Bofill, S. Makino, and J. P. Rosca, "First stereo audio source separation evaluation campaign: data, algorithms and results," in International Conference on Independent Component Analysis and Signal Separation. Springer, 2007, pp. 552–559.
[50] H. Sawada, S. Araki, R. Mukai, and S. Makino, "Grouping separated frequency components by estimating propagation model parameters in frequency-domain blind source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1592–1604, 2007.
[51] Y. Isik, J. L. Roux, Z. Chen, S. Watanabe, and J. R. Hershey, "Single-channel multi-speaker separation using deep clustering," arXiv preprint arXiv:1607.02173, 2016.
[52] A. A. Nugraha, A. Liutkus, and E. Vincent, "Multichannel audio source separation with deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 10, pp. 1652–1664, 2016.
