Online Localization and Tracking of Multiple Moving Speakers in Reverberant Environments
Authors: Xiaofei Li, Yutong Ban, Laurent Girin
Xiaofei Li*, Yutong Ban*, Laurent Girin, Xavier Alameda-Pineda and Radu Horaud

Abstract—We address the problem of online localization and tracking of multiple moving speakers in reverberant environments. The paper has the following contributions. We use the direct-path relative transfer function (DP-RTF), an inter-channel feature that encodes acoustic information robust against reverberation, and we propose an online algorithm well suited for estimating DP-RTFs associated with moving audio sources. Another crucial ingredient of the proposed method is its ability to properly assign DP-RTFs to audio-source directions. Towards this goal, we adopt a maximum-likelihood formulation and we propose to use exponentiated gradient (EG) to efficiently update source-direction estimates starting from their currently available values. The problem of multiple speaker tracking is computationally intractable because the number of possible associations between observed source directions and physical speakers grows exponentially with time. We adopt a Bayesian framework and we propose a variational approximation of the posterior filtering distribution associated with multiple speaker tracking, as well as an efficient variational expectation-maximization (VEM) solver. The proposed online localization and tracking method is thoroughly evaluated using two datasets that contain recordings performed in real environments.

Index Terms—Inter-channel acoustic features, reverberant environments, sound-source localization, multiple target tracking, speaker tracking, Bayesian variational inference, expectation-maximization.

I. INTRODUCTION

The localization and tracking of multiple speakers in real-world environments are very challenging tasks, in particular in the presence of reverberation, of ambient noise, and of natural conversations, e.g.
short sentences, speech pauses and frequent speech turns among speakers. Methods based on time differences of arrival (TDOAs) between microphones, such as generalized cross-correlation [1], are typically used for single-speaker localization, e.g. [2]. In the case of multiple speakers, beamforming-based methods, e.g. steered-response power (SRP) [3], and subspace methods, e.g. multiple signal classification (MUSIC) [4], are widely used. The W-disjoint orthogonality (WDO) principle [5] assumes that the audio signal is dominated by a single audio source in small regions of the time-frequency (TF) domain. This assumption is particularly valid in the case of speech signals. Applying the short-time Fourier transform (STFT), or any other TF representation, inter-channel localization features, such as the interaural phase differences (IPDs) [5], can be extracted. In [5], multiple-speaker localization is based on the histogram of inter-channel features, which is suitable only in the case where there is no wrapping of phase measures. In [6], a Gaussian mixture model (GMM) is used as a generative model of the inter-channel features of multiple speakers, with each GMM representing one speaker, and each GMM component representing one candidate inter-channel time delay. An expectation-maximization (EM) algorithm iteratively estimates the component weights and assigns the features to their corresponding candidate time delays. This method overcomes the phase ambiguity problem by jointly considering all frequencies in the likelihood maximization procedure.

*X. Li and Y. Ban have equally contributed to the paper. X. Li, Y. Ban, X. Alameda-Pineda and R. Horaud are with Inria Grenoble Rhône-Alpes and with Univ. Grenoble Alpes, France. L. Girin is with Univ. Grenoble Alpes, Grenoble-INP, GIPSA-lab and Inria. This work was supported by the ERC Advanced Grant VHIA #340113.
After maximizing the likelihood, the azimuth of each speaker is given by the component that has the highest weight in the corresponding GMM. The complex-valued version of the IPD, i.e. the pair-wise relative phase ratio (PRP), is used in [7]. Instead of setting one GMM for each speaker, a single complex Gaussian mixture model (CGMM) is used for all speakers, with each component representing one candidate speaker location. After maximizing the likelihood of the PRP features with an EM algorithm, the weight of each component represents the probability that there is an active speaker at the corresponding candidate location. Therefore, for an unknown number of speakers, counting and localization of active speakers can be jointly carried out by selecting components with large weights.

The inter-channel features and associated localization methods mentioned above assume a direct-path propagation model: hence, they perform poorly in reverberant environments. To overcome this limitation, several TDOA estimators based on system identification were proposed in [8]–[11]. In [12] it is proposed to use the DP-RTF as a TF-domain inter-channel localization feature robust against reverberation. The estimation of the DP-RTF is based on the identification of the room impulse response (RIR) in the STFT domain, i.e. the convolutive transfer function (CTF) [13], [14]. Overall, the method of [12] combines the merits of the robust TDOA estimators [8]–[11] and of the WDO assumption mentioned above.

To localize moving speakers, one-stage methods such as SRP and MUSIC can be directly applied using frame-wise spatial spectrum estimators. In contrast, methods based on inter-channel features require to assign frame-wise features to speakers in an adaptive/recursive way, e.g. the smoothed histogram method of [15].
Similar to [7], [16] uses one CGMM for each predefined speaker; the model is plugged into a recursive EM (REM) algorithm in order to update the mixture's weights.

[Fig. 1: Flowchart of the proposed multiple-speaker localization and tracking methodology — online DP-RTF estimation (Section II), multiple-speaker localization (Section III) and multiple-speaker tracking (Section IV).]

Speaker tracking methods are generally based on Bayesian inference, which combines localization with dynamic models in order to estimate the posterior probability distribution of audio-source directions, e.g. [17]–[19]. Kalman filtering and particle filtering were used in [20] and [21], respectively, for tracking a single audio source.
In order to address the problem of multiple speakers, possibly with an unknown and time-varying number of speakers, additional discrete latent variables are needed, i.e. observation-to-speaker assignments, as well as speaker-birth and -death processes, e.g. [22], [23]. Sampling-based methods were widely used, e.g. extended particle filtering [24]–[26], or sequential Monte Carlo implementations of the probability hypothesis density (PHD) filter [27], [28]. However, the computational burden of sampling-based methods can be prohibitive in practice. Under some assumptions, the multiple-target tracking GMM-PHD filter of [29] has an analytical solution and is computationally efficient: it was adopted for multiple-speaker tracking in [18].

In this paper we propose a method for the simultaneous localization and tracking of multiple moving speakers (please refer to Figure 1 for a method overview). The paper has the following original contributions:

• Since we deal with moving speakers or, more generally, with moving audio sources, DP-RTF features are computed using the online CTF estimation framework presented in [30], based on recursive least squares (RLS), rather than using the batch CTF estimation of [12], which assumes static audio sources. The online RLS algorithm has a faster convergence rate than the least mean squares (LMS) algorithms described in [8], [9]. This is important when dealing with multiple moving sources, where the adaptive estimator is required to quickly switch between multiple sources and to deal with moving sources.

• A crucial ingredient of multiple-speaker localization is to properly assign acoustic features, i.e. DP-RTFs, to audio-source directions. We adopt the maximum-likelihood formulation of [7]. We propose to use EG [31] to update the source directions from their current estimated values. The EG-based recursive estimator proposed below is better suited for moving sources/speakers than the batch estimator proposed in [12].
• The problem of multiple-speaker tracking is computationally intractable because the number of possible associations between acoustic features and sources/speakers grows exponentially with time. In this paper we adopt a Bayesian variational approximation of the posterior filtering distribution, which leads to an efficient VEM algorithm. In order to deal with a varying number of speakers, we propose a birth process which allows to initialize new speakers at any time.

This paper is an extended version of [30], which proposed an online DP-RTF method combined with REM to estimate the source directions. In this paper, while we keep the DP-RTF method of [30], we propose to use EG. The advantages of using EG instead of REM are described in detail in Section III. Moreover, the multiple-speaker tracking method is completely novel.

The paper is organized as follows (please refer to Figure 1). Section II presents the online DP-RTF estimation method. Section III describes the EG-based speaker localization method and Section IV describes the variational approximation of the tracker and the associated VEM algorithm. Section V presents an empirical evaluation of the method based on experiments performed with real audio recordings. Section VI concludes the paper. Supplemental materials are available on our website.¹

II. RECURSIVE MULTICHANNEL DP-RTF ESTIMATION

A. Recursive Least Squares

For the sake of clarity, we first consider the noise-free single-speaker case. In the time domain, x_i(τ) = a_i(τ) ⋆ s(τ) is the i-th microphone signal, i = 1, …, I, where τ is the time index, s(τ) is the source signal, a_i(τ) is the RIR from the source to the i-th microphone, and ⋆ denotes convolution. Applying the STFT and using the CTF approximation, for each frequency index f = 0, …, F−1 we have:

$$x^i_{t,f} = a^i_{t,f} \star s_{t,f} = \sum_{q=0}^{Q-1} a^i_{q,f}\, s_{t-q,f}, \qquad (1)$$

where x^i_{t,f} and s_{t,f} are the STFT coefficients of the corresponding signals, and the CTF a^i_{t,f} is a sub-band representation of a_i(τ). Here, the convolution is executed with respect to the frame index t. The number of CTF coefficients Q is related to the reverberation time of the RIR. The first CTF coefficient a^i_{0,f} mainly consists of the direct-path information, hence the DP-RTF is defined as the ratio between the first CTF coefficients of two channels: a^i_{0,f} / a^r_{0,f}, where channel r is the reference channel.

Based on the cross-relation method [32], using the CTF model of one microphone pair (i, j), we have x^i_{t,f} ⋆ a^j_{t,f} = x^j_{t,f} ⋆ a^i_{t,f}. This can be written in vector form as:

$$\mathbf{x}^{i\top}_{t,f}\, \mathbf{a}^j_f = \mathbf{x}^{j\top}_{t,f}\, \mathbf{a}^i_f, \qquad (2)$$

with a^i_f = (a^i_{0,f}, …, a^i_{Q−1,f})^⊤, where ⊤ denotes matrix/vector transpose, and x^i_{t,f} = (x^i_{t,f}, …, x^i_{t−Q+1,f})^⊤. The CTF vector involving all channels is defined as a_f = (a^{1⊤}_f, …, a^{I⊤}_f)^⊤. There is a total of I(I−1)/2 distinct microphone pairs, indexed by (i, j) with i = 1, …, I−1 and j = i+1, …, I. For each pair, we construct a cross-relation equation in terms of a_f. For this aim, we define:

$$\mathbf{x}^{ij}_{t,f} = [\underbrace{0,\ldots,0}_{(i-1)Q},\ \mathbf{x}^{j\top}_{t,f},\ \underbrace{0,\ldots,0}_{(j-i-1)Q},\ -\mathbf{x}^{i\top}_{t,f},\ \underbrace{0,\ldots,0}_{(I-j)Q}]^{\top}. \qquad (3)$$

Then, for each pair (i, j), we have:

$$\mathbf{x}^{ij\top}_{t,f}\, \mathbf{a}_f = 0. \qquad (4)$$

Let us assume, for simplicity, that the reference channel is r = 1. To avoid the trivial solution a_f = 0 of (4), we constrain the first CTF coefficient of the reference channel to be equal to 1.

¹ https://team.inria.fr/perception/research/multi-speaker-tracking/
This is done by dividing both sides of (4) by a^1_{0,f} and by moving the first entry of x^{ij}_{t,f}, denoted by −y^{ij}_{t,f}, to the right-hand side of (4), which rewrites as:

$$\tilde{\mathbf{x}}^{ij\top}_{t,f}\, \tilde{\mathbf{a}}_f = y^{ij}_{t,f}, \qquad (5)$$

where x̃^{ij}_{t,f} is x^{ij}_{t,f} with the first entry removed, and ã_f is the relative CTF vector:

$$\tilde{\mathbf{a}}_f = \left( \frac{\tilde{\mathbf{a}}^{1\top}_f}{a^1_{0,f}},\ \frac{\mathbf{a}^{2\top}_f}{a^1_{0,f}},\ \ldots,\ \frac{\mathbf{a}^{I\top}_f}{a^1_{0,f}} \right)^{\!\top}, \qquad (6)$$

where ã^1_f = (a^1_{1,f}, …, a^1_{Q−1,f})^⊤ denotes a^1_f with the first entry removed. For i = 2, …, I, the DP-RTFs appear in (6) as the first entries of a^{i⊤}_f / a^1_{0,f}. Therefore, DP-RTF estimation amounts to solving (5).

Equation (5) is defined for one microphone pair and for one frame. In batch mode, the terms x̃^{ij⊤}_{t,f} and y^{ij}_{t,f} of this equation can be concatenated across microphone pairs and frames to construct a least-squares formulation. For online estimation, we would like to update ã_f using the current frame t. For notational convenience, let m = 1, …, M denote the index of a microphone pair, where M = I(I−1)/2, and let the superscript ij be replaced with m. The fitting error of (5) is

$$e^m_{t,f} = y^m_{t,f} - \tilde{\mathbf{x}}^{m\top}_{t,f}\, \tilde{\mathbf{a}}_f. \qquad (7)$$

At the current frame t, for the microphone pair m, RLS aims to minimize the error

$$J^m_{t,f} = \sum_{t'=1}^{t-1} \sum_{m'=1}^{M} \lambda^{t-t'} \left| e^{m'}_{t',f} \right|^2 + \sum_{m'=1}^{m} \left| e^{m'}_{t,f} \right|^2, \qquad (8)$$

which sums up the fitting errors of all the microphone pairs for the past frames and of the microphone pairs up to m for the current frame.

Algorithm 1: RLS at frame t
  Input: x̃^m_{t,f}, y^m_{t,f}, m = 1, …, M
  Initialization: ã^0_{t,f} ← ã^M_{t−1,f},  P^0_{t,f} ← λ^{−1} P^M_{t−1,f}
  for m = 1 to M do
    e^m_{t,f} = y^m_{t,f} − x̃^{m⊤}_{t,f} ã^{m−1}_{t,f}
    g = P^{m−1}_{t,f} x̃^{m∗}_{t,f} / (1 + x̃^{m⊤}_{t,f} P^{m−1}_{t,f} x̃^{m∗}_{t,f})
    P^m_{t,f} = P^{m−1}_{t,f} − g x̃^{m⊤}_{t,f} P^{m−1}_{t,f}
    ã^m_{t,f} = ã^{m−1}_{t,f} + e^m_{t,f} g
  end for
  Output: ã^M_{t,f}, P^M_{t,f}
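As an illustration, one frame of Algorithm 1 can be sketched in NumPy as follows (a minimal sketch with hypothetical variable names; the number of pairs M, the vector length L = IQ − 1 and the forgetting factor are illustrative):

```python
import numpy as np

def rls_frame(a_prev, P_prev, X_tilde, y, lam=0.98):
    """One frame of Algorithm 1: sequential RLS over microphone pairs.

    a_prev : (L,) complex   -- relative CTF estimate from frame t-1
    P_prev : (L, L) complex -- inverse covariance from frame t-1
    X_tilde: (M, L) complex -- rows are the vectors x~^m_{t,f}
    y      : (M,) complex   -- right-hand sides y^m_{t,f}
    """
    a = a_prev.copy()
    P = P_prev / lam                        # forget older frames
    for m in range(X_tilde.shape[0]):       # process pairs one by one
        x = X_tilde[m]
        e = y[m] - x @ a                    # a-priori fitting error, Eq. (7)
        g = P @ x.conj() / (1.0 + x @ P @ x.conj())   # gain vector
        P = P - np.outer(g, x @ P)          # Sherman-Morrison rank-one update
        a = a + e * g                       # updated relative CTF estimate
    return a, P
```

Processing the pairs one by one keeps every covariance update rank-one, which is what allows the Sherman-Morrison formula to replace an explicit matrix inverse.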
The forgetting factor λ ∈ (0, 1] gives a lower weight to older frames, whereas all microphone pairs have the same weight at each frame. To minimize J^m_{t,f}, we set its complex derivative with respect to ã*_f to zero, where ∗ denotes complex conjugation. We obtain an estimate of ã_f at frame t for microphone pair m as:

$$\tilde{\mathbf{a}}^m_{t,f} = \mathbf{R}^{m\,-1}_{t,f}\, \mathbf{r}^m_{t,f}, \qquad (9)$$

with

$$\mathbf{R}^m_{t,f} = \sum_{t'=1}^{t-1} \sum_{m'=1}^{M} \lambda^{t-t'}\, \tilde{\mathbf{x}}^{m'*}_{t',f} \tilde{\mathbf{x}}^{m'\top}_{t',f} + \sum_{m'=1}^{m} \tilde{\mathbf{x}}^{m'*}_{t,f} \tilde{\mathbf{x}}^{m'\top}_{t,f},$$

$$\mathbf{r}^m_{t,f} = \sum_{t'=1}^{t-1} \sum_{m'=1}^{M} \lambda^{t-t'}\, \tilde{\mathbf{x}}^{m'*}_{t',f}\, y^{m'}_{t',f} + \sum_{m'=1}^{m} \tilde{\mathbf{x}}^{m'*}_{t,f}\, y^{m'}_{t,f}.$$

It can be seen that the covariance matrix R^m_{t,f} is computed by rank-one modification, hence its inverse, denoted by P^m_{t,f}, can be computed using the Sherman-Morrison formula, without the need for explicit matrix inversion. The recursion procedure is summarized in Algorithm 1, where g is the gain vector. The current frame t is initialized with the previous frame t−1. At the first frame, we initialize ã^0_{1,f} as zero, and P^0_{1,f} as the identity.

At each frame, all microphone pairs are related to the same CTF vector that corresponds to the current speaker direction, hence all microphone pairs should be simultaneously used to estimate the CTF vector of the current frame. In batch mode, this can be easily implemented by concatenating the microphone pairs. However, in RLS, to satisfy the rank-one modification of the covariance matrix, we need to process the microphone pairs one by one, as shown in (8) and Algorithm 1. At the end of the iterations over all microphone pairs, ã^M_{t,f} is the "final" CTF estimate for the current frame, and is used for speaker localization. The DP-RTF estimates, denoted as c̃^i_{t,f}, i = 2, …, I, are obtained from ã^M_{t,f}. Note that, implicitly, c̃^1_{t,f} = 1.

B. Multiple Moving Speakers

So far, the proposed online DP-RTF estimation method has been presented in the noise-free single-speaker case.
The noisy multiple-speaker case was considered in [12], but only for static speakers, i.e. in batch mode, and in the two-channel case. We summarize the principles of this method and then explain in detail the present online/multi-channel extension.

1) Estimation of the CTF vector: It is reasonable to assume that the CTF vector does not vary over a few consecutive frames and that only one speaker is active within a small region of the TF domain, due to the sparse representation of speech in this domain. Consequently, the CTF vector can be estimated over the current frame and a few past frames. An estimated CTF value, at each TF bin, is then assumed to be associated with only one speaker. The CTF vector computation in the case of multiple speakers can be carried out using the RLS algorithm, presented in Section II-A, by adjusting the forgetting factor λ to yield a short memory. The forgetting factor is set to λ = (P − 1)/(P + 1), where P is the number of frames being used. To efficiently estimate the CTF vector ã^M_{t,f} of length IQ − 1, we need ρ × (IQ − 1) equations, where the parameter ρ should be chosen so as to achieve a good tradeoff between the validity of the above assumptions and a robust estimation of ã^M_{t,f}. To guarantee that ρ × (IQ − 1) equations are available, we need P = ρ(IQ − 1)/(I(I − 1)/2) ≈ 2ρQ/(I − 1) frames. One may observe that the number of frames needed to estimate ã^M_{t,f} decreases as the number of microphones increases.

2) Noise reduction: When noise is present, especially if the noise sources are temporally/spatially correlated, the CTF estimate can be contaminated. In addition, even in a low-noise case, many TF bins are dominated by noise due to the sparsity of speech spectra. To classify speech frames and noise frames, and to remove the noise, we use the inter-frame spectral subtraction algorithm proposed in [30], [33].
The cross- and auto-power spectral densities (PSDs) between the convolution vector of the microphone signals, i.e. x^i_{t,f}, and the current frame of the reference channel, i.e. x^1_{t,f}, are computed by averaging the cross- and auto-periodograms over frames. In the present work, we use recursive averaging:

$$\boldsymbol{\phi}^i_{t,f} = \beta\, \boldsymbol{\phi}^i_{t-1,f} + (1-\beta)\, \mathbf{x}^i_{t,f}\, x^{1*}_{t,f}, \quad i = 1, \ldots, I, \qquad (10)$$

where the smoothing parameter β is set to achieve a good tradeoff between a low noise-PSD variance and fast tracking of speech variations. Noise frames and speech frames are classified based on the minimum statistics [33] of the PSD of x^1_{t,f}, i.e. the first entry of φ^1_{t,f}. If the frames are well classified, then the noise frames include only negligible speech power, due to the sparsity and non-stationarity of speech, and the speech frames include a noise power similar to that of the noise frames, due to the stationarity of noise. Therefore, inter-frame spectral subtraction can be performed as follows: for each speech frame, the cross- and auto-PSDs of its nearest noise frame are subtracted from its own cross- and auto-PSDs, yielding the noise-free cross- and auto-PSDs, denoted φ̂^i_{t,f}.

Instead of using x^i_{t,f}, we use φ̂^i_{t,f} to construct (3). Correspondingly, we obtain a new formula (4), which is still valid, since it is equivalent, with the noise removed, to taking the cross- and auto-PSDs between both sides of the initial formula (4) and x^1_{t,f}. In the RLS process, only the speech frames (after spectral subtraction) are used, and the noise frames are skipped. A speech frame with a preceding noise frame is initialized with the latest speech frame.

3) Consistency test: In practice, a DP-RTF estimate can sometimes be unreliable.
Possible reasons are that, in a small frame region, (i) the CTF is time-varying due to a fast movement of the speakers, (ii) multiple speakers are present, (iii) only noise is present due to a wrong noise/speech classification, or (iv) only reverberation is present at the end of a speech occurrence. In [12], a consistency test was proposed to tackle this problem: if a small frame region indeed corresponds to one active speaker, the DP-RTFs estimated using different reference channels are consistent; otherwise the DP-RTFs are biased, with inconsistent bias values. In the present work, using the first and the second channels as references, we obtain the DP-RTF estimates c̃^i_{t,f} (with c̃^1_{t,f} = 1) and c̄^i_{t,f} (with c̄^2_{t,f} = 1), respectively. Then c̃^i_{t,f} and c̄^i_{t,f} / c̄^1_{t,f} are two estimates of the same DP-RTF a^i_{0,f} / a^1_{0,f}. To measure the similarity between these two estimates, we define the vectors c^i_{1,t,f} = (1, c̃^i_{t,f})^⊤ and c^i_{2,t,f} = (1, c̄^i_{t,f} / c̄^1_{t,f})^⊤, where the first entries are the DP-RTFs corresponding to a^1_{0,f} / a^1_{0,f} = 1. The similarity is the cosine of the angle between the two vectors:

$$d^i_{t,f} = \frac{\left| \mathbf{c}^{iH}_{1,t,f}\, \mathbf{c}^i_{2,t,f} \right|}{\sqrt{\mathbf{c}^{iH}_{1,t,f}\, \mathbf{c}^i_{1,t,f}\ \ \mathbf{c}^{iH}_{2,t,f}\, \mathbf{c}^i_{2,t,f}}}, \qquad (11)$$

where H denotes conjugate transpose. If d^i_{t,f} ∈ [0, 1] is larger than a threshold (fixed to 0.75 in this work), then the two estimates are consistent; otherwise they are simply ignored. The two consistent estimates are then averaged and normalized as done in [12], resulting in a final complex-valued feature ĉ^i_{t,f} whose modulus lies in the interval [0, 1]. Finally, at frame t, we obtain a set of features C_t = {{ĉ^i_{t,f}}_{i∈I_f}}_{f=0}^{F−1}, where I_f ⊆ {2, …, I} denotes the set of microphone indices that pass the consistency test. Note that I_f is empty if frame t is a noise frame at frequency f, or if no channel passes the consistency test.
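For a single TF bin, the test in (11) can be sketched as follows (a sketch with hypothetical names; the final modulus normalization of [12] is omitted here):

```python
import numpy as np

def consistency_test(c_ref1, c_ref2_i, c_ref2_1, threshold=0.75):
    """Cosine-similarity test between two estimates of the DP-RTF a^i_0f / a^1_0f.

    c_ref1   : estimate obtained with channel 1 as reference (c-tilde)
    c_ref2_i : estimate of channel i obtained with channel 2 as reference (c-bar)
    c_ref2_1 : estimate of channel 1 with channel 2 as reference, so that
               c_ref2_i / c_ref2_1 is the second estimate of the same DP-RTF
    Returns (passed, fused): fused is the average of the two estimates,
    or (False, None) when they are inconsistent.
    """
    v1 = np.array([1.0, c_ref1])               # c^i_{1,t,f} = (1, c-tilde)^T
    v2 = np.array([1.0, c_ref2_i / c_ref2_1])  # c^i_{2,t,f} = (1, c-bar ratio)^T
    # Eq. (11): |c1^H c2| / sqrt((c1^H c1)(c2^H c2))
    d = abs(np.vdot(v1, v2)) / np.sqrt(np.vdot(v1, v1).real * np.vdot(v2, v2).real)
    if d <= threshold:
        return False, None
    return True, 0.5 * (v1[1] + v2[1])         # average the two consistent estimates
```

`np.vdot` conjugates its first argument, which matches the conjugate transpose in (11).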
Each one of these features is assumed to be associated with only one speaker.

III. LOCALIZATION OF MULTIPLE MOVING SPEAKERS

In this section we describe the proposed frame-wise online multiple-speaker localizer. We start by briefly presenting the underlying complex Gaussian mixture model, followed by the recursive estimation of its parameters.

A. Generative Model for Multiple-Speaker Localization

In order to associate the DP-RTF features of C_t with speakers and to localize each active speaker, we adopt the generative model proposed in [7]. Let D = {θ̃_1, …, θ̃_d, …, θ̃_D} be a set of D candidate source directions, e.g. azimuth angles. An observed feature ĉ^i_{t,f} (cf. Section II), when emitted by a sound source located along the direction θ̃_d, is assumed to be drawn from a complex-Gaussian distribution with mean c^{i,d}_f and variance σ², i.e. ĉ^i_{t,f} | d ∼ N_c(c^{i,d}_f, σ²). The mean c^{i,d}_f is the predicted feature at frequency f for channel i, and is precomputed based on direct-path propagation along azimuth θ̃_d to the microphones. The variance σ² is empirically set to a constant value. The marginal density of an observed feature ĉ^i_{t,f} (taking into account all candidate directions) is a CGMM with each component corresponding to a candidate direction:

$$p(\hat{c}^i_{t,f} \mid \mathcal{D}) = \sum_{d=1}^{D} w_d\, \mathcal{N}_c(\hat{c}^i_{t,f};\, c^{i,d}_f, \sigma^2), \qquad (12)$$

where w_d ≥ 0 is the prior probability (component weight) of the d-th component, with Σ_{d=1}^{D} w_d = 1. Let us denote the vector of weights by w = (w_1, …, w_D)^⊤. Note that this vector is the only free parameter of the model. Assuming that the observations in C_t are independent, the corresponding (normalized) negative log-likelihood, as a function of the w_d, is given by:

$$\mathcal{L}_t = -\frac{1}{|\mathcal{C}_t|} \sum_{\hat{c}^i_{t,f} \in \mathcal{C}_t} \log \sum_{d=1}^{D} w_d\, \mathcal{N}_c(\hat{c}^i_{t,f};\, c^{i,d}_f, \sigma^2), \qquad (13)$$

where |C_t| denotes the cardinality of C_t.
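For illustration, the marginal density (12) and the loss (13) can be evaluated in a few lines of NumPy (a sketch with hypothetical names; the circular complex-Gaussian density N_c(c; μ, σ²) = exp(−|c − μ|²/σ²)/(πσ²) is assumed):

```python
import numpy as np

def neg_log_likelihood(features, means, w, sigma2=1.0):
    """Normalized negative log-likelihood L_t of Eq. (13).

    features : (K,) complex   -- observed DP-RTFs of C_t, flattened over (i, f)
    means    : (K, D) complex -- predicted features c^{i,d}_f per candidate direction
    w        : (D,)           -- CGMM component weights, summing to 1
    """
    # Complex-Gaussian densities for every (feature, direction) pair
    dens = np.exp(-np.abs(features[:, None] - means) ** 2 / sigma2) / (np.pi * sigma2)
    # Mixture density per feature, Eq. (12), then average of -log, Eq. (13)
    return -np.mean(np.log(dens @ w))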
Once L_t is minimized, each weight w_d represents the probability that a speaker is active in the direction θ̃_d. Therefore, sound source localization amounts to the minimization of L_t. In addition, taking into account the fact that the number of actually active speakers is much lower than the number of candidate directions, an entropy term was proposed in [12] as a regularizer to impose a sparse solution for w_d. The entropy is defined as

$$H = -\sum_{d=1}^{D} w_d \log(w_d). \qquad (14)$$

The concave-convex procedure [34] was adopted in [12] to minimize the objective function L + γH w.r.t. w, where L is the normalized negative log-likelihood of the DP-RTF features of all frames, i.e. batch-mode optimization, and the positive scalar γ controls the tradeoff between likelihood minimization and imposing sparsity over the weights. In batch mode, the weight vector w is shared across all frames. Hence this method is not suitable for moving speakers.

B. Recursive Parameter Estimation

We now describe a recursive method for updating the weight vector from w_{t−1} to w_t, i.e. from frame t−1 to frame t, using the DP-RTF features at t. This can be formulated as the following online optimization problem [31]:

$$\mathbf{w}_t = \operatorname*{argmin}_{\mathbf{w}}\ \chi(\mathbf{w}, \mathbf{w}_{t-1}) + \eta\,(\mathcal{L}_t + \gamma H), \qquad (15)$$

$$\text{s.t.}\quad w_d > 0,\ \forall d \in \{1, \ldots, D\}, \quad \text{and} \quad \sum_{d=1}^{D} w_d = 1, \qquad (16)$$

where χ(a, b) is a distance between a and b. The positive scalar factor η controls the parameter update rate. To minimize (15), the derivative of the objective function w.r.t. w is set to zero, yielding a set of equations with no closed-form solution. To speed up the computation, it is assumed that w_t is close to w_{t−1}, hence the derivative of L_t + γH at w can be approximated with the derivative of L_t + γH at w_{t−1}. This assumption is reasonable when the parameter evolution is not too fast.
As a result, when the distance χ(w, w_{t−1}) is Euclidean, the objective function leads to gradient descent with a step length equal to η. Nevertheless, the constraints (16) lead to an inefficient gradient descent procedure. To obtain an efficient solver, we exploit the fact that the weights w_d are probability masses, hence we replace the Euclidean distance with the more suitable Kullback-Leibler divergence, i.e. χ(w, w_{t−1}) = Σ_{d=1}^{D} w_d log(w_d / w_{t−1,d}), which results in the exponentiated gradient algorithm [31]. The partial derivatives of L_t and H w.r.t. w_d at the point w_{t−1,d} are, respectively:

$$\left.\frac{\partial \mathcal{L}_t}{\partial w_d}\right|_{w_{t-1,d}} = -\frac{1}{|\mathcal{C}_t|} \sum_{\hat{c}^i_{t,f} \in \mathcal{C}_t} \frac{\mathcal{N}_c(\hat{c}^i_{t,f};\, c^{i,d}_f, \sigma^2)}{\sum_{d'=1}^{D} w_{t-1,d'}\, \mathcal{N}_c(\hat{c}^i_{t,f};\, c^{i,d'}_f, \sigma^2)},$$

$$\left.\frac{\partial H}{\partial w_d}\right|_{w_{t-1,d}} = -(1 + \log(w_{t-1,d})), \quad \forall d \in \{1, \ldots, D\}. \qquad (17)$$

Then, the exponentiated gradient

$$r_{t-1,d} = \exp\!\left( -\eta \left( \left.\frac{\partial \mathcal{L}_t}{\partial w_d}\right|_{w_{t-1,d}} + \gamma \left.\frac{\partial H}{\partial w_d}\right|_{w_{t-1,d}} \right) \right), \quad \forall d \in \{1, \ldots, D\}, \qquad (18)$$

is used to update the weights with:

$$w_{t,d} = \frac{r_{t-1,d}\, w_{t-1,d}}{\sum_{d'=1}^{D} r_{t-1,d'}\, w_{t-1,d'}}, \quad \forall d \in \{1, \ldots, D\}. \qquad (19)$$

It is clear from (19) that the parameter constraints (16) are necessarily satisfied. The exponentiated gradient algorithm sequentially evaluates (17), (18) and (19) at each frame. At the first frame, the weights are initialized with the uniform distribution, namely w_{1,d} = 1/D. When C_t is empty, e.g. during a silent period, the parameters are recursively updated with w_{t,d} = (1 − η₀) w_{t−1,d} + η₀/D.

The weight w_t as a function of θ̃_d, i.e. w_{t,d}, exhibits a handful of peaks that could correspond to active speakers. The entropy regularization term was shown both to suppress the small spurious peaks that appear without regularization and to sharpen the peaks corresponding to actually active speakers, thus allowing to better localize true speakers and to eliminate erroneous ones.
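One EG step, combining (17)–(19), can be sketched in NumPy as follows (a sketch with hypothetical names and illustrative values of η and γ; the complex-Gaussian densities are evaluated as in (12)):

```python
import numpy as np

def eg_step(w_prev, features, means, sigma2=1.0, eta=0.1, gamma=0.05):
    """One exponentiated-gradient update of the CGMM weights, Eqs. (17)-(19).

    w_prev   : (D,) weights w_{t-1}, positive and summing to 1
    features : (K,) complex observed DP-RTFs of C_t
    means    : (K, D) complex predicted features per candidate direction
    """
    dens = np.exp(-np.abs(features[:, None] - means) ** 2 / sigma2) / (np.pi * sigma2)
    # dL_t/dw_d at w_{t-1}: minus the averaged responsibility-like ratio, Eq. (17)
    grad_L = -np.mean(dens / (dens @ w_prev)[:, None], axis=0)
    # dH/dw_d at w_{t-1}
    grad_H = -(1.0 + np.log(w_prev))
    r = np.exp(-eta * (grad_L + gamma * grad_H))  # exponentiated gradient, Eq. (18)
    w = r * w_prev
    return w / w.sum()                            # normalization, Eq. (19)
```

Because the update is multiplicative followed by a normalization, the positivity and sum-to-one constraints of (16) hold by construction, which is the practical advantage of EG over projected gradient descent here.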
In the case of moving speakers, a peak should shift over time from a direction θ̃_d to a nearby direction. Spatial smoothing of the weight function raises the weight values around a peak, which results in smoother peak jumps. In our experiments, spatial smoothing is carried out with w_{t,d} = (w_{t,d} + 0.02 w_{t,d−1} + 0.02 w_{t,d+1}) / 1.04, where the smoothing factor 0.02 is empirically chosen in order to smooth peak motion from one frame to the next, while preventing the peaks from collapsing. One may think that spatial smoothing and entropy regularization neutralize each other, but in practice it was found that their combination is beneficial.

C. Peak Selection and Frame-wise Speaker Localization

Frame-wise localization and counting of active speakers could be carried out by selecting the peaks of w_t(θ̃_d) larger than a predefined threshold [12], [30]. However, peak selection does not exploit the temporal dependencies of moving speakers. Moreover, peak selection can be a risky process, since a too high or too low threshold value may lead to undesirable missed-detection or false-alarm rates. In order to avoid these problems, we adopt a weighted-data Bayesian framework: all the candidate directions and the associated weights are used as observations by the multiple-speaker tracking method described in Section IV below. The localization results obtained with peak selection are compared with the localization results obtained with the proposed tracker in Section V.

IV. MULTIPLE SPEAKER TRACKING

Let N be the maximum number of speakers that can be simultaneously active at any time t, and let n be the speaker index. Moreover, let n = 0 denote no speaker. We now introduce the main variables and their notations. Upper-case letters denote random variables, while lower-case letters denote their realizations. Let S_tn be a latent (or state) variable associated with speaker n at frame t, and let S_t = (S_{t1}, …, S_{tn}, …, S_{tN}). S_tn is composed of two parts: the speaker direction and the speaker velocity. In this work, the speaker direction is defined by an azimuth θ_tn. To avoid phase (circular) ambiguity, we describe the direction with the unit vector U_tn = (cos(θ_tn), sin(θ_tn))^⊤. Moreover, let V_tn ∈ ℝ be the angular velocity. Altogether, we define a realization of the state variable as s_tn = [u_tn; v_tn], where the notation [·;·] stands for vertical vector concatenation.

Let O_t = (O_{t1}, …, O_{td}, …, O_{tD}) be the observed variables at frame t. Each realization o_td of O_td is composed of a candidate location, or azimuth θ̃_td ∈ D, and a weight w_td. The weight w_td is the probability that there is an active speaker in the direction θ̃_td, namely (15). As above, let the azimuth be described by a unit vector b_td = (cos(θ̃_td), sin(θ̃_td))^⊤. In summary, we have o_td = [b_td; w_td]. Moreover, let Z_td be a (latent) assignment variable associated with each observed variable O_td, such that Z_td = n means that the observation indexed by d at frame t is assigned to active speaker n ∈ {0, …, N}. Note that Z_td = 0 is a "fake" assignment: the corresponding observation is assigned to an audio source that is either background noise or any other source that has not yet been identified as an active speaker.

The problem at hand can now be cast into the estimation of the filtering distribution p(s_t, z_t | o_{1:t}), and further inference of s_t and z_t. In this work we make two hypotheses, namely (i) that the observations at frame t only depend on the assignment and state variables at t, and (ii) that the prior distribution of the assignment variables is independent of all the other variables.
By applying Bayes' rule together with these hypotheses, and ignoring terms that do not depend on s_t and z_t, the filtering distribution is proportional to:

p(s_t, z_t | o_{1:t}) ∝ p(o_t | s_t, z_t) p(z_t) p(s_t | o_{1:t−1}),   (20)

which contains three terms: the observation model, the prior distribution of the assignment variables, and the predictive distribution over the source states. We now characterize each of these three terms.

1) Audio observation model: The audio observation model describes the distribution of the observations given the speaker states and assignments. We assume that the different observations are independent, conditioned on the speaker states and assignments, which can be written as:

p(o_t | z_t, s_t) = ∏_{d=1}^{D} p(o_{td} | z_t, s_t).   (21)

Since the weights describe the confidence associated with each observed azimuth, we adopt the weighted-data GMM model of [35]:

p(b_{td} | Z_{td} = n, s_{tn}; w_{td}) = N(b_{td}; M s_{tn}, (1/w_{td}) Σ) if n ∈ {1, ..., N}, and U(vol(G)) if n = 0,   (22)

where the matrix M = [I_{2×2}, 0_{2×1}] projects the state variable onto the space of source directions, and Σ is a covariance matrix (set empirically to a fixed value in the present study). Note that the weight plays the role of a precision: the higher the weight w_{td}, the more reliable the source direction b_{td}. The case Z_{td} = 0 follows a uniform distribution over the volume of the observation space.

2) Prior distribution: The prior distribution of the assignment variables is independent over observations and is assumed to be uniform over all the speakers (including the case n = 0), hence:

p(z_t) = ∏_{d=1}^{D} p(Z_{td} = n), with π_{dn} = p(Z_{td} = n) = 1/(N+1).   (23)

3) Predictive distribution: The predictive distribution describes the relationship between the state s_t and the past observations up to frame t, o_{1:t−1}.
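A minimal numerical sketch of the likelihood (22) follows; the values of Σ and of the observation-space volume vol(G) are our illustrative choices, not prescribed by the paper:

```python
import numpy as np

M = np.array([[1., 0., 0.],
              [0., 1., 0.]])          # projects [u; v] onto the direction u

def obs_likelihood(b, s, w, Sigma, n, vol_G=2 * np.pi):
    """Weighted-data GMM likelihood (22): for a speaker (n >= 1), the
    weight w scales the precision, i.e. the covariance is Sigma / w;
    for n = 0 the observation is uniform over the observation space."""
    if n == 0:
        return 1.0 / vol_G
    cov = Sigma / w
    diff = b - M @ s
    norm = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(cov, diff))

Sigma = 0.03 * np.eye(2)             # isotropic, as in the experiments
s = np.array([1.0, 0.0, 0.0])        # speaker at azimuth 0, zero velocity
b = np.array([1.0, 0.0])             # observation exactly at the mean
```

With the observation at the predicted direction, a larger weight w yields a sharper (higher-valued) likelihood, which is exactly the "weight as precision" behavior described above.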
To calculate this distribution, we first marginalize p(s_t | o_{1:t−1}) over s_{t−1}, writing:

p(s_t | o_{1:t−1}) = ∫ p(s_t | s_{t−1}) p(s_{t−1} | o_{1:t−1}) d s_{t−1},   (24)

where the two terms under the integral are the state dynamics and the marginal filtering distribution of the state variable at frame t−1, respectively. We model the state dynamics as a linear-Gaussian first-order Markov process, independent over the speakers, i.e.:

p(s_t | s_{t−1}) = ∏_{n=1}^{N} N(s_{tn}; D_{t−1,n} s_{t−1,n}, Λ_{tn}),   (25)

where Λ_{tn} is the dynamics' covariance matrix and D_{t−1,n} is the state transition matrix. Given the estimated azimuth θ_{t−1,n} and angular velocity v_{t−1,n} at frame t−1, we have the following relation:

[cos(θ_{tn}); sin(θ_{tn})] = [cos(θ_{t−1,n} + v_{t−1,n} Δt); sin(θ_{t−1,n} + v_{t−1,n} Δt)],   (26)

where Δt is the time increment between two consecutive frames. Expanding (26) and assuming that the angular displacement v_{t−1,n} Δt is small, the state transition matrix can be written as:

D_{t−1,n} = [1, 0, −sin(θ_{t−1,n}) Δt; 0, 1, cos(θ_{t−1,n}) Δt; 0, 0, 1].   (27)

In the following, D_{t−1,n} is written as D, only to lighten the equations.

A. Variational Expectation Maximization Algorithm

At this point, the standard solution to the calculation of the filtering distribution consists of using the EM methodology. EM alternates between evaluating the expected complete-data log-likelihood and maximizing this expectation with respect to the model parameters. More precisely, the expectation writes:

J(Θ, Θ°) = E_{p(z_t, s_t | o_{1:t}, Θ°)}[log p(z_t, s_t, o_{1:t} | Θ)],   (28)

where Θ° denotes the current parameter estimates and Θ denotes the new estimates, obtained via maximization of (28). However, given the hybrid combinatorial-and-continuous nature of the latent space, such a solution is intractable in practice, due to combinatorial explosion.
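The linearization (26)-(27) can be checked numerically. The following sketch (our own, with arbitrary values of θ, v and Δt) verifies that applying D to a state propagates the direction along the circle when the angular displacement v·Δt is small:

```python
import numpy as np

def transition_matrix(theta, dt):
    """State transition matrix (27): first-order expansion of the circular
    motion model (26), valid for a small angular displacement v * dt."""
    return np.array([[1., 0., -np.sin(theta) * dt],
                     [0., 1.,  np.cos(theta) * dt],
                     [0., 0.,  1.]])

theta, v, dt = 0.3, 0.5, 0.032        # e.g. one 32 ms tracking interval
s = np.array([np.cos(theta), np.sin(theta), v])
s_pred = transition_matrix(theta, dt) @ s
exact = np.array([np.cos(theta + v * dt), np.sin(theta + v * dt), v])
```

The linearization error is of order (v·Δt)²/2, i.e. negligible at the frame rates considered here; the velocity component is left unchanged by D, as the last row of (27) dictates.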
We thus propose to use a variational approximation to solve the problem efficiently. Inspired by [22], we propose the following factorization:

p(z_t, s_t | o_{1:t}) ≈ q(z_t, s_t) = q(z_t) ∏_{n=0}^{N} q(s_{tn}).   (29)

The optimal solution is then given by two E-steps, an E-S step for each individual state variable S_{tn} and an E-Z step for the assignment variable Z_t:

log q(s_{tn}) = E_{q(z_t) ∏_{m≠n} q(s_{tm})}[log p(z_t, s_t | o_{1:t})],   (30)

log q(z_t) = E_{q(s_t)}[log p(z_t, s_t | o_{1:t})].   (31)

It is easy to see that in order to compute (30) and (31), two elements are needed: the predictive distribution (24) and the marginal filtering distribution at t−1, p(s_{t−1} | o_{1:t−1}). Remarkably, as a consequence of the factorization (29), we can replace p(s_{t−1} | o_{1:t−1}) with q(s_{t−1}) = ∏_{n=1}^{N} q(s_{t−1,n}) in (24) and compute the predictive distribution as follows:

p(s_t | o_{1:t−1}) ≈ ∫ p(s_t | s_{t−1}) ∏_{n=1}^{N} q(s_{t−1,n}) d s_{t−1}.   (32)

This predictive distribution factorizes across speakers. Moreover, one prominent feature of the proposed variational approximation is that, if the posterior distribution at time t−1, q(s_{t−1,n}), is assumed to be a Gaussian, say

q(s_{t−1,n}) = N(s_{t−1,n}; μ_{t−1,n}, Γ_{t−1,n}),   (33)

then (the approximation of) the predictive distribution (32) is a Gaussian as well. More specifically, the derivation of (32) leads to:

p(s_{tn} | o_{1:t−1}) = N(s_{tn}; D μ_{t−1,n}, D Γ_{t−1,n} D^⊤ + Λ_{tn}).   (34)

Moreover, as we will see in the E-S step below, the posterior distribution at time t, q(s_{tn}), is also a Gaussian.

1) E-S step: The computation of the variational posterior distribution q(s_{tn}), for all currently tracked speakers, is carried out by developing (30) as follows. We first exploit (20), (21), (23) and (34) to rewrite log p(z_t, s_t | o_{1:t}) in (30) as a sum of individual log-probabilities.
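The Gaussian propagation (33)-(34) amounts to a standard Kalman-style predict step. A sketch, with assumed (hypothetical) values for the dynamics covariance:

```python
import numpy as np

def predict(mu_prev, Gamma_prev, D, Lam):
    """Predictive distribution (34): propagate the Gaussian posterior
    q(s_{t-1,n}) = N(mu_prev, Gamma_prev) through the linear-Gaussian
    dynamics (25) with transition D and dynamics covariance Lam."""
    return D @ mu_prev, D @ Gamma_prev @ D.T + Lam

D = np.eye(3)                        # zero-velocity case, for illustration
mu_prev = np.array([1.0, 0.0, 0.0])  # speaker at azimuth 0
Gamma_prev = 0.01 * np.eye(3)
Lam = 0.02 * np.eye(3)               # hypothetical dynamics covariance
mu_pred, Gamma_pred = predict(mu_prev, Gamma_prev, D, Lam)
```

With D = I, the predicted mean is unchanged and the covariance simply inflates by Λ, reflecting the increased uncertainty before the new observations are incorporated.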
Then we eliminate all the terms not depending on s_{tn} and evaluate the expectation of the remaining terms. Because the terms not depending on s_{tn} were disregarded, the expectation is computed only with respect to q(z_t). This conveniently makes the computation of q(s_{tn}) independent of the structure of q(s_{tm}) for m ≠ n. Eventually, this yields a Gaussian distribution:

q(s_{tn}) = N(s_{tn}; μ_{tn}, Γ_{tn}),   (35)

with the following parameters:

Γ_{tn} = ( ∑_{d=1}^{D} α_{tdn} w_{td} M^⊤ Σ^{−1} M + (Λ_{tn} + D Γ_{t−1,n} D^⊤)^{−1} )^{−1},   (36)

μ_{tn} = Γ_{tn} ( M^⊤ Σ^{−1} ∑_{d=1}^{D} α_{tdn} w_{td} b_{td} + (Λ_{tn} + D Γ_{t−1,n} D^⊤)^{−1} D μ_{t−1,n} ),   (37)

where α_{tdn} = q(Z_{td} = n) is the variational posterior distribution of the assignment variable, which will be detailed in Section IV-A2. Importantly, the first two entries of μ_{tn} in (37) represent the estimated azimuth of speaker n. The "final" azimuth estimate at frame t is thus given by this subvector at the end of the VEM iterations. Since we use a unit-vector representation, we normalize this subvector at each iteration of the algorithm. Finally, note that since we have shown that q(s_{t−1,n}) being Gaussian leads to q(s_{tn}) being Gaussian as well, it is sufficient to assume that q(s_{1n}) is Gaussian, namely at t = 1: q(s_{1n}) = N(s_{1n}; μ_{1n}, Γ_{1n}).

2) E-Z step: Developing (31) along the same principles as above, one can easily find that the variational posterior distribution of the assignment variables factorizes as:

q(z_t) = ∏_{d=1}^{D} q(z_{td}).   (38)

In addition, we obtain a closed-form expression for q(z_{td}):

α_{tdn} = q(Z_{td} = n) = ρ_{tdn} π_{dn} / ∑_{i=0}^{N} ρ_{tdi} π_{di},   (39)

where π_{dn} was defined in (23), and ρ_{tdn} is given by:

ρ_{tdn} = N(b_{td}; M μ_{tn}, (1/w_{td}) Σ) × exp(−(1/2) tr(w_{td} M^⊤ Σ^{−1} M Γ_{tn})) if 1 ≤ n ≤ N, and U(vol(G)) if n = 0.   (40)

3) M-step: Once the two expectation steps are executed, we maximize J in (28) with respect to the model parameters, i.e. the covariance matrix of the state dynamics, Λ_{tn}. By exploiting again the proposed variational approximation, the dependency of J on Λ_{tn} can be written as:

J(Λ_{tn}) = E_{q(s_{tn})}[log N(s_{tn}; D μ_{t−1,n}, D Γ_{t−1,n} D^⊤ + Λ_{tn})],

which can be further developed as:

J(Λ_{tn}) = log |D Γ_{t−1,n} D^⊤ + Λ_{tn}| + tr( (D Γ_{t−1,n} D^⊤ + Λ_{tn})^{−1} ((μ_{tn} − D μ_{t−1,n})(μ_{tn} − D μ_{t−1,n})^⊤ + Γ_{tn}) ).   (41)

By equating the gradient of (41) w.r.t. Λ_{tn} to zero, we obtain:

Λ_{tn} = Γ_{tn} − D Γ_{t−1,n} D^⊤ + (μ_{tn} − D μ_{t−1,n})(μ_{tn} − D μ_{t−1,n})^⊤.   (42)

B. Speaker-Birth Process

A birth process is used to initialize new tracks, i.e. speakers that become active. We take inspiration from the birth process for visual tracking proposed in [22] and adapt it to audio tracking. The general principle is the following. In a short period of time, say from t−L to t, with L small (typically 3), we assume that at most one new (yet untracked) speaker becomes active. For each frame from t−L to t, among the observations assigned to n = 0, we select the one with the highest weight, and thus obtain an observation sequence õ_{t−L:t}. We then compute the marginal likelihood of this sequence according to our model, τ_0 = p(õ_{t−L:t}). If these observations have been generated by a speaker that has not yet been detected (hypothesis H_1), then they are assumed to be consistent with the model, i.e. to exhibit a smooth trajectory, and τ_0 will be high; otherwise, i.e. if they have been generated by background noise (hypothesis H_0), they will be more randomly spread over the range of possible observations, and τ_0 will be low. Giving birth to a new speaker track amounts to setting a threshold τ_1 and deciding between the two hypotheses:

τ_0 ≷_{H_0}^{H_1} τ_1.   (43)

This process is applied continuously over time to detect new speakers. This includes speaker track initialization at t = 1.
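The closed-form M-step update (42) is a one-liner in practice. As a sanity check (our own, not from the paper): when the posterior mean does not move under the dynamics and its covariance matches the propagated one, the estimated dynamics covariance vanishes:

```python
import numpy as np

def m_step_lambda(mu_t, Gamma_t, mu_prev, Gamma_prev, D):
    """M-step update (42) for the state-dynamics covariance Lambda_tn."""
    d = mu_t - D @ mu_prev
    return Gamma_t - D @ Gamma_prev @ D.T + np.outer(d, d)

D = np.eye(3)
mu = np.array([1.0, 0.0, 0.1])
Gamma = 0.05 * np.eye(3)
Lam = m_step_lambda(mu, Gamma, mu, Gamma, D)   # stationary case
```

Note that (42) is not guaranteed to be positive semi-definite for arbitrary inputs; a practical implementation may need to project the result onto the PSD cone, a detail the paper does not discuss.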
Note that initially all the assignment variables are set to n = 0 (background noise), namely Z_{1d} = 0, ∀d. As for the computation of p(õ_{t−L:t}), we first rewrite it as the marginalization of the joint probability of the selected observations and the state trajectory ŝ_{t−L:t} of a potential speaker:

τ_0 = ∫ p(õ_{t−L:t}, ŝ_{t−L:t}) d ŝ_{t−L:t},   (44)

which, under the proposed model, is given by:

τ_0 = ∫ [ ∏_{i=t−L+1}^{t} p(õ_i | ŝ_i) p(ŝ_i | ŝ_{i−1}) ] p(õ_{t−L} | ŝ_{t−L}) p(ŝ_{t−L}) d ŝ_{t−L:t}.   (45)

All the terms in the above equation have been defined during the derivation of our model, except the marginal prior distribution of the state, p(ŝ_{t−L}), and all these terms are Gaussian. For the track-birth process, we just want to test whether the trajectory of observations from t−L to t is coherent, and we can define p(ŝ_{t−L}) as a non-informative distribution, such as a uniform distribution. In practice, we choose a Gaussian distribution with a very large covariance, to ensure a closed-form solution to (45). Due to room limitations, we do not present more details. Let us just mention that in practice we set L = 3, which enables efficient speaker-birth detection.

C. Speaker Activity Detection

A very interesting feature of the proposed model is that, once the speaker tracks have been estimated, the posterior distribution of the assignment variables Z_t can be used for speech activity detection, i.e. determining who the active speakers are at each frame, a task also referred to as speaker diarization in the multi-speaker context. This can be formalized as testing, for each frame t and each speaker n, between the two following hypotheses: H_1: speaker n is active at frame t, and H_0: speaker n is silent at frame t. In the present work, this is done by computing the following weighted sum of weights, averaged over a small number of frames L_0 to take speaker activity inertia into account, and comparing it with a threshold δ, a test formally written as:

∑_{i=t−L_0+1}^{t} ∑_{d=1}^{D} α_{idn} w_{id} ≷_{H_0}^{H_1} δ.   (46)

Overall, the variational EM tracking algorithm is summarized in Algorithm 2.

Algorithm 2: Variational EM tracking
Input: audio observations b_{1:t}
for t = 1 to end do
  Gather the observations at frame t
  for iter = 1 to N_iter do
    E-Z step: for d ∈ {1, ..., D} and n ∈ {0, ..., N}, evaluate q(Z_{td} = n) with (39)
    E-S step: for n ∈ {1, ..., N}, evaluate Γ_{tn} and μ_{tn} with (36) and (37)
    M-step: evaluate Λ_{tn} with (42)
  end for
  Speaker-birth process (see Section IV-B)
  Detect speaker activity (see Section IV-C)
  for n ∈ {1, ..., N} do
    if speaker n is detected as active then output μ_{tn}
  end for
end for

V. EXPERIMENTS

A. Experimental setup

1) Datasets: We tested and empirically validated our method with the LOCATA and the Kinovis multiple-speaker tracking (Kinovis-MST) datasets. The LOCATA data (an IEEE-AASP challenge for sound source localization and tracking) [36] were recorded in the Computing Laboratory of the Department of Computer Science of Humboldt University Berlin. The room size is 7.1 m × 9.8 m × 3 m, with a reverberation time T_60 ≈ 0.55 s. We report the results of the development corpus for tasks #3 and #5 with a single moving speaker, and for tasks #4 and #6 with two moving speakers, each task comprising three recorded sequences.² There are twelve microphones arranged to form a spherical array and placed on the head of a NAO robot. We used two microphone configurations: four quasi-planar microphones located on the top of the head, numbered 5, 8, 11, 12, and eight microphones numbered 1, 3, 4, 5, 8, 10, 11, 12.
An optical motion capture system was used to provide ground-truth positions of the robot and of the speakers. The participants speak continuously during the entire recordings. However, speech pauses are inevitable and may last several seconds. Each participant wears a head-mounted microphone. We applied the voice activity detector of [37] to these microphone signals to obtain ground-truth voice activity information for each participant. The signal-to-noise ratio (SNR) is approximately 23.4 dB.

The Kinovis-MST dataset was recorded in the Kinovis multiple-camera laboratory at INRIA Grenoble.³ The room size is 10.19 m × 9.87 m × 5.6 m, with T_60 ≈ 0.53 s. A v5 NAO robot with four microphones [38] was used. The geometric layout of the microphones is similar to that of the robot used in LOCATA. The speakers were moving around the robot with a speaker-to-robot distance ranging between 1.5 m and 3.5 m. As with LOCATA, a motion capture system was used to obtain ground-truth trajectories of the moving participants and the location of the robot. Ten sequences were recorded with up to three participants, for a total length of about 357 s. The robot's head has built-in fans located near the microphones, hence the recordings contain a notable amount of stationary and spatially correlated noise, with an SNR of approximately 2.7 dB [38]. The participants behave more naturally than in the LOCATA scenarios, i.e. they take speech turns in a natural multi-party dialog. When a participant is silent, he/she manually hides the infrared marker located on his/her head to make it invisible to the motion capture system. This provides ground-truth speech activity information for each

² The results obtained with the proposed method were officially submitted to the LOCATA challenge and will be available soon at https://locata.lms.tf.fau.de/.
³ The Kinovis-MST dataset is publicly available at: https://team.inria.fr/perception/the-kinovis-mst-dataset/

Fig. 2: Results of speaker localization and tracking for Recording 1 / Task 6 of the LOCATA data. (a) Ground-truth trajectory and voice activity (red for speaker 1, black for speaker 2); intervals in the trajectories are speaking pauses. (b)-(e) One-dimensional heat maps as a function of time for the four tested localization methods. (f) Results for the proposed VEM-based tracker; black and red colors demonstrate successful tracking, i.e. continuity of the tracks despite speech pauses.

participant. This dataset and the associated annotations allow us to test the proposed tracking algorithm when the number of active speakers varies over time.

2) Parameter setting: For both datasets, we perform 360°-wide azimuth estimation and tracking: D = 72 candidate azimuth directions, every 5° in [−175°, 180°], are used. The CGMM mean c_{i,d}^{f} is the head-related transfer function (HRTF) ratio between two microphones, precomputed from the direct-path propagation model for each candidate direction. For the Kinovis-MST dataset, the HRTFs have been measured to compute the CGMM means. For LOCATA, the TDOAs are computed from the microphone coordinates and are then used to compute the phase of the CGMM means, while the magnitude of the CGMM means is set to a constant, e.g. 0.5, for all frequencies. All the recorded signals are resampled to 16 kHz. The STFT uses a Hamming window with a length of 16 ms and a shift of 8 ms. The CTF length is Q = 8 frames. The RLS forgetting factor λ is computed using ρ = 1. The smoothing factor β is set to 0.9. The exponentiated-gradient update factor is η = 0.07. The smoothing factor η_0 is set to 0.065. The entropy regularization factor is γ = 0.1.
For the tracker, the covariance matrix is set to be isotropic, Σ = 0.03 I_2. The threshold for giving birth to a new identity is τ_1 = 0.75, and L = 3. To decide whether a person is speaking or silent, L_0 = 3 frames are used, with a threshold δ = 0.15. At each time instance, the VEM algorithm runs 5 iterations. Corresponding to the STFT frame shift, i.e. 8 ms, the frame rate of the proposed system is 125 frames per second.

Fig. 3: ROC curve for the LOCATA dataset.

3) Comparison with baseline methods: The proposed method is evaluated both in "frame-wise localization" mode and in "tracker" mode. In the first mode, the frame-wise online localization module of Section III is applied without the tracker of Section IV; instead, it is followed by the peak selection process described in [12]. This method is referred to as DP-RTF using EG (DPRTF-EG). In tracker mode, DPRTF-EG is directly followed by the proposed VEM tracker, without peak selection; it is then simply referred to as VEM-tracker. In that case, the directions of the active speakers are given by the state variables, and the continuity of the speaker tracks is given by the assignment variables. We compare DPRTF-EG with several baseline methods:
• The standard beamforming-based localization method, SRP with phase transform (SRP-PHAT) [3]. The same STFT configuration and candidate directions are used for SRP-PHAT and for the proposed method. The steering vector for each candidate direction is derived from the HRTFs and TDOAs for the Kinovis-MST and LOCATA datasets, respectively. The frame-wise SRP is recursively smoothed with a smoothing factor set to 0.065.
• A method combining PRP features, the CGMM model and parameter updates using REM [16], referred to as PRP-REM. We also combine the DP-RTF features and the CGMM with REM (referred to as DPRTF-REM). This is to evaluate the proposed DP-RTF feature w.r.t. PRP, and the EG-based online parameter update method w.r.t. REM. For both baselines, the STFT and CGMM settings are the same as for the proposed method. The updating factor of REM is set to 0.065.

4) Evaluation metrics: For performance evaluation, the detected speakers must be assigned to the actual speakers. This is done using a greedy matching algorithm. First, the azimuth differences for all possible detected-actual speaker pairs are computed; then the detected-actual pair with the smallest difference is selected as a matched pair. This procedure is iterated until all the detected or all the actual speakers have been selected. For each matched pair, the detected speaker is considered to be successfully localized if the azimuth difference is not larger than 15°. The absolute error is calculated for the successfully localized sources. The mean absolute error (MAE) is computed by averaging the absolute error over all speakers and frames. For the unsuccessful localizations, we count the missed detections (MDs) (speaker active but not detected) and the false alarms (FAs) (speaker detected but not active). The MD and FA rates are then computed, over all frames, as the percentage of the total MDs and FAs out of the total number of actual speakers, respectively. In addition to these localization metrics, we also count the identity switches (IDs) to evaluate tracking continuity. ID is an absolute number: it represents the number of identity changes in the tracks over a whole test sequence. The computation time is measured with the real-time factor (RF), which is the processing time of a method divided by the length of the processed signal. Note that all the methods are implemented in MATLAB.

B. Results for the LOCATA Dataset

For convenience, both the spatial spectrum of SRP-PHAT and the CGMM component-weight profiles will be referred to as heatmaps. Fig. 2 shows an example of a result obtained with a LOCATA sequence. Two speakers are moving and continuously speaking, with short pauses.
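The greedy detected-to-actual matching used for evaluation can be sketched as follows; this is an illustrative implementation of ours (the function name and the circular-difference handling are our assumptions):

```python
import numpy as np

def greedy_match(detected, actual, thresh=15.0):
    """Greedily pair detected and ground-truth azimuths (in degrees):
    repeatedly select the remaining (detected, actual) pair with the
    smallest circular azimuth difference; a pair counts as a successful
    localization if the difference does not exceed thresh.
    Returns a list of (detected_idx, actual_idx, error) triples."""
    det = list(enumerate(detected))
    act = list(enumerate(actual))
    matches = []
    while det and act:
        i, j, err = min(
            ((i, j, abs((d - a + 180.0) % 360.0 - 180.0))
             for i, d in det for j, a in act),
            key=lambda t: t[2])
        det = [p for p in det if p[0] != i]
        act = [p for p in act if p[0] != j]
        if err <= thresh:
            matches.append((i, j, err))
    return matches
```

For instance, `greedy_match([10.0, 100.0], [12.0, 250.0])` pairs 10° with 12° (error 2°) and leaves the 100°/250° pair unmatched; in the metrics above, the unmatched actual speaker would count as an MD and the unmatched detection as an FA.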
The SRP-PHAT heatmap (Fig. 2(b)) is cluttered, due to the non-ideal beampattern of the microphone array and to the influence of reverberation and noise. Most of the time, SRP-PHAT has a prominent response power in the true speaker directions. The most dominant speaker can be localized by selecting the direction with the largest response power. However, it is difficult to correctly count the number of active speakers and to localize the less dominant speakers, since there exist a number of spurious peaks. PRP-REM (Fig. 2(c)) exhibits a clearer heatmap than SRP-PHAT, but some spurious trajectories remain, since the PRP features are contaminated by reverberation. DPRTF-REM (Fig. 2(d)) removes most of the spurious trajectories, which illustrates the robustness of the proposed DP-RTF feature against reverberation. From Fig. 2(e), it can be seen that the proposed EG algorithm further removes the interferences by applying the entropy regularization. In addition, the peak evolution is smoother than in Fig. 2(d), mainly due to the use of spatial smoothing. Fig. 2(f) illustrates the result obtained with the proposed VEM tracker, with DPRTF-EG providing the observations. The proposed tracker gives smoother and cleaner results than the other methods. Even when the observations have a low weight, the tracker is still able to give the correct speaker trajectories. This is ensured by the second term in (37), which exploits the source dynamics model and continues to provide localization information even when w_{t,d} (and/or α_{tdn}) becomes small. As a result, the tracker is able to preserve the identity of the speakers in spite of the (short) speech pauses. In the presented sequence example, the estimated speaker identities are quite consistent with the ground truth.

Fig. 4: Results of speaker localization and tracking for one sequence of the Kinovis-MST dataset. (a) Ground-truth trajectory and voice activity (red for speaker 1, black for speaker 2, blue for speaker 3). (b)-(e) One-dimensional heat maps as a function of time for the four tested localization methods. (f) Results for the proposed VEM-based tracker.

To empirically evaluate the quality of the heatmaps provided by the localization methods, we computed the receiver operating characteristic (ROC) curve (MD rate versus FA rate) for the LOCATA dataset by varying the peak selection threshold for each tested method (Fig. 3). For an ROC curve, the closer to the bottom-left the better. As already mentioned, in addition to using four microphones, we also tested an eight-microphone configuration, referred to as DPRTF-EG-8ch. By analyzing the ROC curves, one notices that the methods based on DP-RTF perform better than SRP-PHAT and PRP-REM, which is consistent with the heatmaps of Fig. 2: SRP-PHAT and PRP-REM are more sensitive to reverberation than the proposed methods. The performances of DPRTF-REM and DPRTF-EG cannot be easily discriminated using the ROC curves. DPRTF-EG-8ch performs slightly better than DPRTF-EG, which means that the performance of the proposed method can be slightly improved by increasing the number of microphones. One may conclude that the proposed method is well suited when only a small number of microphones is available. With all methods, the FA rate can be trivially decreased to close to 0 by increasing the peak selection threshold. However, the MD rate cannot be decreased to 0 even with a very small peak-selection threshold, since some speech frames that are actually present cannot be detected as heatmap peaks, due to the influence of noise

Fig. 5: ROC curve for the Kinovis-MST dataset.
and reverberation, and to a possible latency in the detection.

TABLE I: Localization and tracking results for the LOCATA data.

Method         MD rate (%)  FA rate (%)  MAE (°)  IDs  RF
SRP-PHAT       39.2         18.6         5.2      -    0.06
PRP-REM        30.9         19.6         5.0      -    0.30
DPRTF-REM      23.3         15.2         4.6      -    0.97
DPRTF-EG       23.9         13.0         4.0      -    0.97
DPRTF-EG-8ch   22.7         13.2         4.1      -    3.03
VEM + EG       22.7         12.4         4.1      10   2.05
VEM + EG-8ch   22.9         11.0         3.2      6    4.05

For each curve, a good balance between the FA rate and the MD rate is achieved at the bottom-left corner, which can be detected as the point with the minimum distance to the origin. The average localization results corresponding to this optimal bottom-left point are summarized in Table I for each tested method. It can be seen that, besides MD and FA, the DPRTF-based methods achieve a smaller MAE than SRP-PHAT and PRP-REM, since the proposed DP-RTF features are robust against reverberation and thus lead to smaller biases for the heatmap peaks. DPRTF-EG has a higher MD rate than DPRTF-REM, while it also has a lower FA rate and a lower MAE, due to the effect of the entropy regularization. With eight microphones, i.e. DPRTF-EG-8ch, the MD rate is 1% smaller than that of DPRTF-EG, since a non-coplanar microphone setup provides more accurate localization than a coplanar one. The proposed tracker performs best in terms of MD and FA. For the four-microphone configuration, the tracker slightly reduces the FA rate compared to DPRTF-EG. It also reduces the MD score, since some correct speaker trajectories can be recovered even when the observations have (very) low weights, as explained above. In addition, the MAE is noticeably reduced when more microphones are used by the VEM tracker, which is not the case with the DPRTF-EG localizer.
This phenomenon indicates that, compared with the localizer, the tracker is able to better exploit the additional information available with extra microphones, namely to revise the speaker trajectory estimates, since the state dynamics of the tracker help correct the possibly inaccurate additional localization information.

TABLE II: Localization and tracking results for the Kinovis-MST dataset.

Method      MD rate (%)  FA rate (%)  MAE (°)  IDs  RF
SRP-PHAT    60.0         37.1         5.5      -    0.07
PRP-REM     40.3         23.1         5.1      -    0.32
DPRTF-REM   37.6         22.0         5.5      -    0.73
DPRTF-EG    31.4         19.5         5.3      -    0.73
VEM + EG    31.1         11.7         4.9      11   2.12

The proposed tracker achieves quite consistent speaker ID estimation. For the whole LOCATA dataset, only ten identity switches were observed when using DPRTF-EG, and this number is reduced to six when using DPRTF-EG-8ch. The remaining identity switches are mainly due to speakers with crossing trajectories, a hard case for multiple audio-source tracking.

As for the computation time, SRP-PHAT has the smallest RF. Since the RFs of DPRTF-REM and DPRTF-EG are identical, we can conclude that the REM algorithm and the proposed EG algorithm have comparable computational complexities. The RFs of PRP-REM, DPRTF-REM (or DPRTF-EG) and DPRTF-EG-8ch differ because of the different computational complexities of the feature estimation, more precisely because of the different dimensions of the vector to be estimated. The CTF identification used for DP-RTF estimation solves an RLS problem with the unknown CTF vector ã_f ∈ ℂ^{(IQ−1)×1}. Recall that I and Q denote the number of microphones and the CTF length, respectively. In the present work, we have set I = 4, Q = 8 for DPRTF-REM (or DPRTF-EG), and I = 8, Q = 8 for DPRTF-EG-8ch. PRP is defined under the narrow-band assumption, or equivalently under the CTF model with Q = 1, hence I = 4, Q = 1 for PRP-REM. The proposed localization method, i.e. DPRTF-EG with four microphones, has an RF smaller than one, which means that it can run in real time. The RF of the proposed tracker (VEM) is computed as the sum of the localization time and the tracking time. For acoustic tracking, the tracker observes a direction-of-arrival (DOA) estimate every 8 ms. However, an 8 ms speaker motion is small. Thus, in practice, the tracker uses one DOA estimate per 32 ms interval, which leads to an RF of 2.05 for the four-channel (4ch) case and 4.05 for the eight-channel (8ch) case. The RF can be further improved by using fewer DOA estimates.

C. Results for the Kinovis-MST Dataset

Fig. 4 shows an example of a result for a Kinovis-MST sequence. Three participants are moving and speaking intermittently. It can be seen that, for many frames, the response power of SRP-PHAT and the CGMM component weights of PRP-REM corresponding to the true active speakers are not prominent compared to the spurious trajectories. Again, DPRTF-REM and DPRTF-EG provide much better heatmaps, though they also miss some speaking frames, e.g. at the beginning of Speaker 3's trajectory (in blue). The possible reasons are (i) the NAO robot (v5) has relatively strong ego-noise [38], and thus the signal-to-noise ratio of the recorded signals is relatively low, and (ii) the speakers move with a varying source-to-robot distance, and the direct-path speech is contaminated by more reverberation when the speakers are distant. Overall, DPRTF-REM and DPRTF-EG are able to monitor the motion, appearance, and disappearance of active speakers most of the time, with a small time lag due to the temporal smoothing. This kind of recording/scenario is very challenging for the tracking method, especially for speaker identity preservation, since the participants are intermittently speaking and moving. In a general manner, the proposed tracker achieves relatively good results, as illustrated in Fig. 4(f). The tracked trajectories are smooth and clean.
If the true trajectory of a speaker has an approximately constant direction, the tracker is able to re-identify the speaker even after a long silence, thanks to the above-mentioned combination of observations and dynamics in (37), e.g. Speaker 1's trajectory in red. If the speaker changes his/her movement while silent, the track can be lost. When the person speaks again, it is indeed difficult to re-identify him/her based on the dynamics estimated before the silence period; the tracker may then prefer to give birth to a new speaker. This is illustrated by the black trajectory turning into green, and the blue trajectory turning into cyan, in Fig. 4. Note that the silence periods are here much longer than in the LOCATA example of Fig. 2.

Fig. 5 shows the ROC curves for the Kinovis-MST dataset. Compared to the ROC curves for the LOCATA dataset, all four localization methods have a worse ROC curve, especially along the MD-rate axis, for the reasons mentioned above. Table II summarizes the localization and tracking results for the optimal bottom-left point of the ROC curves. It can be seen that, for the four localization methods, the MAEs are quite close, i.e. the heatmap peaks have similar biases. Compared with the results for the LOCATA dataset, the advantage of the proposed tracker is more significant for the Kinovis-MST dataset. In particular, the FA rate is reduced by 7.8% relative to DPRTF-EG, and is similar to the FA rate obtained with the LOCATA dataset. This means that the dynamical model associated with the tracker can efficiently reduce the influence of incorrect source localizations caused by noise and by complex source movements. The identity switches are mainly caused by speakers changing their direction of movement during silent periods, as discussed above.
Compared to the LOCATA dataset, DPRTF-EG has a smaller RF, since the Kinovis-MST dataset is noisier and more noise frames are skipped by the RLS algorithm.

D. Discussion

The experimental results obtained with the two datasets clearly show the effectiveness of the proposed method based on DP-RTF estimation, multiple-speaker localization, and variational tracking. To improve robustness, temporal smoothing is used, which introduces localization/tracking latency. This latency causes MDs and FAs at both the beginning and the end of continuous speech segments. However, the examples shown in Fig. 2 and Fig. 4 indicate that the latency is not severe. The Kinovis-MST dataset is more challenging than the LOCATA dataset for speaker localization/tracking, due to its lower SNR and casual speaking style. Even so, the proposed method achieves a comparable FA rate on the two datasets. Concerning the MD score, the method yields larger MD rates on Kinovis-MST than on LOCATA. This is due to the large number of noise-dominated TF bins present in the Kinovis-MST recordings; the tracker's dynamics attenuate the influence of these TF bins only to a limited extent.

VI. CONCLUSION

In this paper, we proposed and combined i) a recursive DP-RTF feature estimation method, ii) an online multiple-speaker localization method, and iii) a multiple-speaker tracking method. The resulting framework provides online speaker counting, localization, and consistent tracking (i.e. preserving speaker identity over a track in spite of intermittent speech production). The three algorithms are computationally efficient. In particular, the tracking algorithm, implemented in a variational Bayesian framework, yields a tractable solver in the form of a VEM algorithm. Experiments with two datasets recorded in realistic environments show that the proposed method is robust against reverberation and noise.
Moreover, the tracker is able to efficiently track multiple moving speakers and to detect whether they are speaking or silent, as long as the motion of silent people is smooth. However, tracking a person through a silent-to-active transition remains a difficult task. The combination of the proposed method with speaker identification will be addressed in future work. The proposed VEM tracker can easily be adapted to work in tandem with any frame-wise localizer providing source-location estimates and/or corresponding weights (if no weights are provided by the localizer, the tracker can be applied with all weights set to one). This makes the proposed tracker very flexible and easily reusable by the audio processing research community.

REFERENCES

[1] C. Knapp and G. C. Carter, "The generalized correlation method for estimation of time delay," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 24, no. 4, pp. 320–327, 1976.
[2] J. Chen, J. Benesty, and Y. Huang, "Time delay estimation in room acoustic environments: an overview," EURASIP Journal on Applied Signal Processing, vol. 2006, pp. 170–170, 2006.
[3] J. H. DiBiase, H. F. Silverman, and M. S. Brandstein, "Robust localization in reverberant rooms," in Microphone Arrays (M. S. Brandstein and D. Ward, eds.), pp. 157–180, Springer, 2001.
[4] C. T. Ishi, O. Chatot, H. Ishiguro, and N. Hagita, "Evaluation of a music-based real-time sound localization of multiple sound sources in real noisy environments," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2027–2032, 2009.
[5] O. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Transactions on Signal Processing, vol. 52, no. 7, pp. 1830–1847, 2004.
[6] M. I. Mandel, R. J. Weiss, and D. P.
Ellis, "Model-based expectation-maximization source separation and localization," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 2, pp. 382–394, 2010.
[7] Y. Dorfan and S. Gannot, "Tree-based recursive expectation-maximization algorithm for localization of acoustic sources," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 10, pp. 1692–1703, 2015.
[8] Y. Huang and J. Benesty, "Adaptive multichannel time delay estimation based on blind system identification for acoustic source localization," in Adaptive Signal Processing, pp. 227–247, Springer, 2003.
[9] S. Doclo and M. Moonen, "Robust adaptive time delay estimation for speaker localization in noisy and reverberant acoustic environments," EURASIP Journal on Applied Signal Processing, vol. 2003, pp. 1110–1124, 2003.
[10] T. G. Dvorkind and S. Gannot, "Time difference of arrival estimation of speech source in a noisy and reverberant environment," Signal Processing, vol. 85, no. 1, pp. 177–204, 2005.
[11] K. Kowalczyk, E. A. Habets, W. Kellermann, and P. A. Naylor, "Blind system identification using sparse learning for TDOA estimation of room reflections," IEEE Signal Processing Letters, vol. 20, no. 7, pp. 653–656, 2013.
[12] X. Li, L. Girin, R. Horaud, and S. Gannot, "Multiple-speaker localization based on direct-path features and likelihood maximization with spatial sparsity regularization," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1997–2012, 2017.
[13] Y. Avargel and I. Cohen, "System identification in the short-time Fourier transform domain with crossband filtering," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1305–1319, 2007.
[14] R. Talmon, I. Cohen, and S. Gannot, "Relative transfer function identification using convolutive transfer function approximation," IEEE Transactions on Audio, Speech, and Language Processing, vol.
17, no. 4, pp. 546–555, 2009.
[15] D. Pavlidi, A. Griffin, M. Puigt, and A. Mouchtaris, "Real-time multiple sound source localization and counting using a circular microphone array," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2193–2206, 2013.
[16] O. Schwartz and S. Gannot, "Speaker tracking using recursive EM algorithms," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 2, pp. 392–402, 2014.
[17] N. Roman and D. Wang, "Binaural tracking of multiple moving sources," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 4, pp. 728–739, 2008.
[18] C. Evers, A. H. Moore, P. A. Naylor, J. Sheaffer, and B. Rafaely, "Bearing-only acoustic tracking of moving speakers for robot audition," in IEEE International Conference on Digital Signal Processing (DSP), pp. 1206–1210, 2015.
[19] Y. Ban, L. Girin, X. Alameda-Pineda, and R. Horaud, "Exploiting the complementarity of audio and visual data in multi-speaker tracking," in ICCV Workshop on Computer Vision for Audio-Visual Media, vol. 3, 2017.
[20] Z. Liang, X. Ma, and X. Dai, "Robust tracking of moving sound source using multiple model Kalman filter," Applied Acoustics, vol. 69, no. 12, pp. 1350–1355, 2008.
[21] J. Vermaak and A. Blake, "Nonlinear filtering for speaker tracking in noisy and reverberant environments," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. 3021–3024, 2001.
[22] S. Ba, X. Alameda-Pineda, A. Xompero, and R. Horaud, "An on-line variational Bayesian model for multi-person tracking from cluttered scenes," Computer Vision and Image Understanding, vol. 153, pp. 64–76, 2016.
[23] I. Gebru, S. Ba, X. Li, and R. Horaud, "Audio-visual speaker diarization based on spatiotemporal Bayesian fusion," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[24] M. F. Fallon and S. J.
Godsill, "Acoustic source localization and tracking of a time-varying number of speakers," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 4, pp. 1409–1415, 2012.
[25] J.-M. Valin, F. Michaud, and J. Rouat, "Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering," Robotics and Autonomous Systems, vol. 55, no. 3, pp. 216–228, 2007.
[26] V. Cevher, R. Velmurugan, and J. H. McClellan, "Acoustic multitarget tracking using direction-of-arrival batches," IEEE Transactions on Signal Processing, vol. 55, no. 6, pp. 2810–2825, 2007.
[27] B.-N. Vo, S. Singh, and W. K. Ma, "Tracking multiple speakers using random sets," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2, pp. ii–357, 2004.
[28] W.-K. Ma, B.-N. Vo, S. S. Singh, and A. Baddeley, "Tracking an unknown time-varying number of speakers using TDOA measurements: A random finite set approach," IEEE Transactions on Signal Processing, vol. 54, no. 9, pp. 3291–3304, 2006.
[29] B.-N. Vo and W.-K. Ma, "The Gaussian mixture probability hypothesis density filter," IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4091–4104, 2006.
[30] X. Li, B. Mourgue, L. Girin, S. Gannot, and R. Horaud, "Online localization of multiple moving speakers in reverberant environments," in The Tenth IEEE Workshop on Sensor Array and Multichannel Signal Processing, 2018.
[31] J. Kivinen and M. K. Warmuth, "Exponentiated gradient versus gradient descent for linear predictors," Information and Computation, vol. 132, no. 1, pp. 1–63, 1997.
[32] G. Xu, H. Liu, L. Tong, and T. Kailath, "A least-squares approach to blind channel identification," IEEE Transactions on Signal Processing, vol. 43, no. 12, pp. 2982–2993, 1995.
[33] X. Li, L. Girin, R. Horaud, and S.
Gannot, "Estimation of relative transfer function in the presence of stationary noise based on segmental power spectral density matrix subtraction," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 320–324, 2015.
[34] A. L. Yuille and A. Rangarajan, "The concave-convex procedure," Neural Computation, vol. 15, no. 4, pp. 915–936, 2003.
[35] I. D. Gebru, X. Alameda-Pineda, F. Forbes, and R. Horaud, "EM algorithms for weighted-data clustering with application to audio-visual scene analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 12, pp. 2402–2415, 2016.
[36] H. W. Löllmann, C. Evers, A. Schmidt, H. Mellmann, H. Barfuss, P. A. Naylor, and W. Kellermann, "The LOCATA challenge data corpus for acoustic source localization and tracking," in IEEE Sensor Array and Multichannel Signal Processing Workshop, Sheffield, UK, July 2018.
[37] X. Li, R. Horaud, L. Girin, and S. Gannot, "Voice activity detection based on statistical likelihood ratio with adaptive thresholding," in IEEE International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 1–5, 2016.
[38] X. Li, L. Girin, F. Badeig, and R. Horaud, "Reverberant sound localization with a robot head based on direct-path relative transfer function," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2819–2826, 2016.