A cascaded multiple-speaker localization and tracking system

LOCATA Challenge Workshop, a satellite event of IWAENC 2018, September 17-20, 2018, Tokyo, Japan

A CASCADED MULTIPLE-SPEAKER LOCALIZATION AND TRACKING SYSTEM

Xiaofei Li,¹ Yutong Ban,¹ Laurent Girin,¹,² Xavier Alameda-Pineda,¹ Radu Horaud¹
¹ Inria Grenoble Rhône-Alpes, France
² Univ. Grenoble Alpes, Grenoble-INP, GIPSA-lab, France

ABSTRACT

This paper presents an online multiple-speaker localization and tracking method, as the INRIA-Perception contribution to the LOCATA Challenge 2018. First, the recursive least-squares method is used to adaptively estimate the direct-path relative transfer function as an interchannel localization feature. The feature is assumed to be associated with a single speaker at each time-frequency bin. Second, a complex Gaussian mixture model (CGMM) is used as a generative model of the features. The weight of each CGMM component represents the probability that this component corresponds to an active speaker, and is adaptively estimated with an online optimization algorithm. Finally, taking the CGMM component weights as observations, a Bayesian multiple-speaker tracking method based on the variational expectation-maximization algorithm is used. The tracker accounts for the variation of active speakers and for localization miss measurements by introducing speaker birth and sleeping processes. Experiments carried out on the development dataset of the challenge are reported.

Index Terms — sound-source localization, multiple moving speakers, tracking, reverberant environments, LOCATA Challenge

1. INTRODUCTION

For multiple-speaker localization, the W-disjoint orthogonality (WDO) [1] assumption is widely used. It assumes that the audio signal is dominated by only one speaker in each small region of the time-frequency (TF) domain, because of the natural sparsity of speech signals in this domain. After applying the short-time Fourier transform (STFT), interchannel localization features, e.g.
the interaural phase difference, can be extracted. To assign the interchannel features to multiple speakers, a mixture of Gaussian mixture models (GMMs) was used in [2] as a generative model of the interchannel features of multiple speakers, with each GMM representing one speaker; feature assignment was then done based on the maximum-likelihood criterion. In [3], instead of setting one GMM per speaker, a single complex GMM (CGMM) is used for all speakers, with each component representing one candidate speaker location. After maximizing the likelihood of the features with an expectation-maximization (EM) algorithm, the weight of each component represents the probability that there is an active speaker at the corresponding candidate location. To localize moving speakers, a recursive EM algorithm based on a CGMM model similar to [3] was proposed in [4] to update the CGMM component weights online. Counting and localization of active speakers can then be jointly carried out by selecting the components with large weights. Taking the instantaneous outputs of a localization method as observations and using a speaker dynamic model, Bayesian tracking techniques estimate the posterior distribution of source locations, e.g. [5, 6]. To tackle the tracking problem with an unknown and time-varying number of speakers, additional model features such as observation-to-speaker assignments, speaker track birth/death processes and a model of speech activity can be included [7, 8, 9, 10].

(This work was supported by the ERC Advanced Grant VHIA #340113.)

In the present paper, we present our online multiple-speaker localization and tracking method contributed to the LOCATA Challenge, which is composed of three modules:

• A recursive DP-RTF (direct-path relative transfer function) estimation module. In the STFT domain, the room impulse response (RIR) can be approximated by the convolutive transfer function (CTF) [11, 12].
The DP-RTF is defined as the ratio between the first taps of the CTFs of two microphones; it thus encodes the direct-path information and is used as an interchannel feature that is robust against reverberation. The CTF estimation used for DP-RTF extraction was formulated in batch mode in [13, 14], with speakers considered as static. An online CTF estimation method based on recursive least-squares (RLS) was proposed in [15, 16] for moving-speaker localization, and will be briefly presented in this paper.

• An online multiple-speaker localization module. In [14], we adopted the above-mentioned CGMM model [3] to assign the DP-RTF features to speakers; in addition, an entropy-based regularization term was used to impose spatial sparsity on the estimated component weights. However, [14] only considered the batch mode and static speakers. The recursive EM algorithm proposed in [4] was adopted in [15] for online likelihood maximization, without the entropy regularization. Further, an online optimization method, namely exponentiated gradient (EG) [17], was used in [16] for simultaneous online likelihood maximization and entropy minimization. This method will be briefly presented in this paper.

• A multiple-speaker tracking module. In [16], a multiple-speaker tracking method was proposed. The results of the above localization module are taken as inputs and exploited in a Bayesian framework. The problem is efficiently solved by a variational expectation-maximization (VEM) algorithm. Speaker birth and sleeping processes are included in the tracking process; the sleeping process is effective for handling missed detections by the localization procedure. This paper will briefly present this multiple-speaker tracker; please refer to [8, 10, 18, 16] for more detailed algorithmic derivation and description.

In Fig. 1, the three layers of temporal evolution depict these three modules, respectively.
In the following, we present the three modules one by one in Sections 2 to 4, and then give the experiments on the development dataset of the LOCATA challenge [19] in Section 5.

Figure 1: Flowchart of the three localization and tracking modules.

2. RECURSIVE DP-RTF ESTIMATION

To simplify the presentation, let us first consider the noise-free single-speaker case. In the time domain, the $i$-th, $i = 1, \dots, I$, microphone signal is $x^i(n) = a^i(n) \star s(n)$, where $n$ is the time index, $s(n)$ is the source signal, $a^i(n)$ is the RIR from the source to the $i$-th microphone, and $\star$ denotes convolution. Applying the STFT and using the CTF approximation, we have for each frequency index $f = 0, \dots, F-1$: $x^i_{t,f} = a^i_{t,f} \star s_{t,f}$, where $x^i_{t,f}$ and $s_{t,f}$ are the STFT coefficients of the corresponding signals, and the CTF $a^i_{t,f}$ is a subband representation of $a^i(n)$. Here, the convolution is executed w.r.t. the frame index $t$. The first CTF coefficient $a^i_{0,f}$ mainly consists of the direct-path information, and the DP-RTF is defined as the ratio between the first CTF coefficients of two channels, $a^i_{0,f} / a^r_{0,f}$, where channel $r$ is the reference channel.

Based on the cross-relation method [20], for one microphone pair $(i,j)$, we have $x^{i\top}_{t,f} a^j_f = x^{j\top}_{t,f} a^i_f$, where $\top$ denotes matrix/vector transpose, and the convolution vectors are $a^i_f = [a^i_{0,f}, \dots, a^i_{Q-1,f}]^\top$ and $x^i_{t,f} = [x^i_{t,f}, \dots, x^i_{t-Q+1,f}]^\top$, with $Q$ denoting the CTF length. We concatenate the CTF vectors of all channels as $a_f = [a^{1\top}_f, \dots, a^{I\top}_f]^\top$. For each microphone pair $(i,j)$, we construct a cross-relation equation in terms of $a_f$. To this aim, define:

$$x^{ij}_{t,f} = [\underbrace{0, \dots, 0}_{(i-1)Q}, \; x^{j\top}_{t,f}, \; \underbrace{0, \dots, 0}_{(j-i-1)Q}, \; -x^{i\top}_{t,f}, \; \underbrace{0, \dots, 0}_{(I-j)Q}]^\top.$$
(1)

Then we have $x^{ij\top}_{t,f} a_f = 0$. The CTF vector $a_f$ can be estimated by solving this equation. To avoid the trivial solution $a_f = 0$, we constrain the first CTF coefficient of the reference channel, say $r = 1$, to be equal to 1. This leads to the following equation:

$$\tilde{x}^{ij\top}_{t,f} \tilde{a}_f = y^{ij}_{t,f}, \quad (2)$$

where $-y^{ij}_{t,f}$ is the first entry of $x^{ij}_{t,f}$, $\tilde{x}^{ij}_{t,f}$ is $x^{ij}_{t,f}$ with the first entry removed, and $\tilde{a}_f$ is the relative CTF vector:

$$\tilde{a}_f = \left[ \frac{\tilde{a}^{1\top}_f}{a^1_{0,f}}, \frac{a^{2\top}_f}{a^1_{0,f}}, \dots, \frac{a^{I\top}_f}{a^1_{0,f}} \right]^\top, \quad (3)$$

where $\tilde{a}^1_f = [a^1_{1,f}, \dots, a^1_{Q-1,f}]^\top$ is $a^1_f$ with the first entry removed. For $i = 2, \dots, I$, the DP-RTFs appear in (3) as the first entries of $a^{i\top}_f / a^1_{0,f}$. The DP-RTF estimation thus amounts to solving the linear problem of Eq. (2).

Eq. (2) is defined for one microphone pair at one frame. For the online case, we would like to update the estimate of $\tilde{a}_f$ using the current frame, say $t$. There is a total of $M = I(I-1)/2$ distinct microphone pairs. For notational convenience, instead of using $ij$, we use $m = 1, \dots, M$ to denote the index of a microphone pair. The fitting error of (2) is $e^m_{t,f} = y^m_{t,f} - \tilde{x}^{m\top}_{t,f} \tilde{a}_f$. At the current frame $t$, for the microphone pair $m$, RLS aims to minimize the error

$$J^m_{t,f} = \sum_{t'=1}^{t-1} \sum_{m'=1}^{M} \lambda^{t-t'} |e^{m'}_{t',f}|^2 + \sum_{m'=1}^{m} |e^{m'}_{t,f}|^2, \quad (4)$$

which sums up the fitting errors of all the microphone pairs for the past frames and of the microphone pairs up to $m$ for the current frame. The forgetting factor $\lambda \in (0,1]$ gives exponentially lower weight to older frames, whereas at a given frame, all microphone pairs have the same weight. In RLS, this minimization problem can be solved recursively. For the detailed recursion procedure, please refer to Algorithm 1 in [16]. For each frame $t$, let $\tilde{a}_{t,f}$ denote the CTF estimate, and $\tilde{c}^i_{t,f}$, $i = 2, \dots, I$, denote the DP-RTF estimates extracted from $\tilde{a}_{t,f}$.
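As an illustration of how such a recursive minimization proceeds, here is a minimal sketch of a standard complex-valued RLS step for the cost (4), applied to one pair equation (2); this is not the exact Algorithm 1 of [16], and all variable names are hypothetical:

```python
import numpy as np

def rls_step(a, P, x, y, lam=0.99):
    """One RLS update for one microphone pair at one frame.

    a   : current relative-CTF estimate (complex vector, like a~_f)
    P   : inverse of the exponentially weighted correlation matrix
    x   : regressor x~^m_{t,f} (complex vector)
    y   : scalar target y^m_{t,f}
    lam : forgetting factor lambda in (0, 1]
    """
    # Gain vector for the linear model y = x^T a
    Px = P @ x.conj()
    k = Px / (lam + x @ Px)
    # A priori fitting error e^m_{t,f} = y - x^T a
    e = y - x @ a
    # Update the estimate and the inverse correlation matrix
    a = a + k * e
    P = (P - np.outer(k, x) @ P) / lam
    return a, P
```

Iterating this step over the $M$ microphone pairs of each speech frame, and then reading off the first entry of each per-channel sub-vector of $\tilde{a}_{t,f}$, would yield the DP-RTF estimates.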
Note that, implicitly, $\tilde{c}^1_{t,f} = 1$.

We now introduce how to extend the above method to the noisy multiple-speaker case. We assume that, over a short time, the speakers are static and only one source is active at each frequency bin. Therefore, the CTF can be estimated using a small number of recent frames. This can be done by adjusting the forgetting factor $\lambda$: to approximately have a memory of $P$ frames, we can set $\lambda = \frac{P-1}{P+1}$. $P$ should be empirically set to achieve a good trade-off between the validity of the above assumptions and a robust CTF estimate. To suppress the noise, we use the inter-frame spectral subtraction algorithm proposed in [21, 15]. At each frequency bin $f$, frames are first classified into speech frames and noise frames, then inter-frame spectral subtraction is applied between them. In the RLS process, only the speech frames (after spectral subtraction) are used, and the noise frames are skipped.

In practice, a DP-RTF estimate can sometimes be unreliable due to the imperfect performance of the above-mentioned methods. We use the consistency-test method proposed in [14, 16] to detect the unreliable estimates. Finally, at frame $t$, we obtain a set of features $C_t = \{\{\hat{c}^i_{t,f}\}_{i \in \mathcal{I}_f}\}_{f=0}^{F-1}$, where $\mathcal{I}_f \subseteq \{2, \dots, I\}$ denotes the set of microphone indices that pass the consistency test. Each of the features is assumed to be associated with a single speaker.

3. ONLINE MULTIPLE-SPEAKER LOCALIZATION

In order to assign the DP-RTF features in $C_t$ to speakers, the CGMM generative model proposed in [3] is adopted. We define a set $\mathcal{D}$ of $D$ candidate source locations, and let $d = 1, \dots, D$ denote the location index.
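As a concrete illustration of such a candidate set, and of how direction-dependent CGMM means can be precomputed from far-field TDOAs, here is a sketch: the grid parameters (72 azimuths, 5° apart) come from Section 5, while the microphone coordinates and function names are hypothetical, since the exact array geometry is not reproduced in the paper.

```python
import numpy as np

# Hypothetical planar 4-microphone geometry (metres); the paper uses
# microphones {5, 8, 11, 12} of the robot-head array, whose exact
# coordinates are not given here.
mics = np.array([[0.03, 0.0], [0.0, 0.03], [-0.03, 0.0], [0.0, -0.03]])

def candidate_grid():
    """72 azimuth candidates, every 5 degrees in [-175, 180] (Section 5)."""
    return np.deg2rad(np.arange(-175, 181, 5))

def cgmm_means(freqs_hz, ref=0, c=343.0):
    """Direction-dependent complex means with constant (unit) magnitude.

    Returns an array of shape (D, I-1, F): one complex mean per candidate
    direction, non-reference microphone, and frequency.
    """
    az = candidate_grid()                        # (D,)
    u = np.stack([np.cos(az), np.sin(az)], 1)    # unit direction vectors (D, 2)
    # Far-field TDOA of each mic relative to the reference mic, per direction
    tdoa = (mics @ u.T).T / c                    # (D, I)
    tdoa = tdoa - tdoa[:, [ref]]
    others = [i for i in range(len(mics)) if i != ref]
    # e^{-j 2 pi f tau}: constant magnitude, direction-dependent phase
    return np.exp(-2j * np.pi * tdoa[:, others, None] * freqs_hz[None, None, :])
```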
The probability that an observed feature $\hat{c}^i_{t,f}$ is emitted by the candidate locations is modelled by a CGMM:

$$P(\hat{c}^i_{t,f} \mid \mathcal{D}) = \sum_{d=1}^{D} w_d \, \mathcal{N}_c(\hat{c}^i_{t,f} ; c^{i,d}_f, \sigma^2), \quad (5)$$

where the mean $c^{i,d}_f$ is precomputed based on the direct-path propagation model from the $d$-th candidate location to microphone $i$. The variance $\sigma^2$ is empirically set to a constant value. The component weight $w_d \ge 0$, with $\sum_{d=1}^{D} w_d = 1$, is the only free model parameter. Let us denote the vector of weights by $w = [w_1, \dots, w_D]^\top$.

Let $\mathcal{L}_t$ denote the log-likelihood (normalized by the number of features) of the features in $C_t$, as a function of the $w_d$. Once $\mathcal{L}_t$ is maximized, the component weight $w_d$ represents the probability that there exists an active speaker at the $d$-th candidate location. In addition, taking into account the fact that the number of actually active speakers is much lower than the number of candidate locations, entropy minimization is used as a regularization term to impose a sparse distribution of the $w_d$. The entropy is defined as $H = -\sum_{d=1}^{D} w_d \log(w_d)$. The overall cost function is then $-\mathcal{L}_t + \gamma H$.

In the present online framework, we proceed to a recursive update of the weight vector at the current frame $t$, hence now denoted $w_t$, based on the estimate at the previous frame, $w_{t-1}$, and on the DP-RTF features of the current frame. This can be formulated as an online optimization problem [17]:

$$w_t = \underset{w}{\operatorname{argmin}} \; \chi(w, w_{t-1}) + \eta(-\mathcal{L}_t + \gamma H), \quad (6)$$

with the constraints that the $w_d$, $d = 1, \dots, D$, are positive and sum to 1. $\chi(w, w_{t-1})$ is some distance measure between $w$ and $w_{t-1}$, and the positive constant $\eta$ controls the parameter update rate. To exploit the fact that the $w_d$ are probability masses, we use the Kullback-Leibler divergence, i.e.
$\chi(w, w_{t-1}) = \sum_{d=1}^{D} w_d \log \frac{w_d}{w^d_{t-1}}$, which results in the exponentiated gradient algorithm [17]. Let

$$\Delta_{t-1} = \left. \frac{\partial\, \eta(-\mathcal{L}_t + \gamma H)}{\partial w_d} \right|_{w^d_{t-1}}$$

denote the partial derivative of $\eta(-\mathcal{L}_t + \gamma H)$ w.r.t. $w_d$ at the point $w^d_{t-1}$. Then, the exponentiated gradient $r^d_{t-1} = e^{-\Delta_{t-1}}$ is used to update the weights:

$$w^d_t = \frac{r^d_{t-1} w^d_{t-1}}{\sum_{d'=1}^{D} r^{d'}_{t-1} w^{d'}_{t-1}}, \quad \forall d \in \mathcal{D}. \quad (7)$$

It is obvious from (7) that the parameter constraints, namely positivity and summation to 1, are automatically satisfied.

4. MULTIPLE-SPEAKER TRACKING

In the following, upper-case letters denote random variables while lower-case letters denote their realizations. Let $N$ be the maximum number of tracks (speakers), and let $n$ be the speaker index. Moreover, let $n = 0$ denote no speaker, or background noise. Let $S_{tn}$ be a latent (or state) variable associated with speaker $n$ at frame $t$, and let $S_t = (S_{t1}, \dots, S_{tn}, \dots, S_{tN})$. $S_{tn}$ is composed of two parts: the speaker direction and the speaker velocity. In this work, the speaker direction is defined by an azimuth $\theta_{tn}$. To avoid phase (circular) ambiguity, we describe the direction with the unit vector $U_{tn} = (\cos(\theta_{tn}), \sin(\theta_{tn}))^\top$. Moreover, let $V_{tn} \in \mathbb{R}$ be the angular velocity. Altogether, we define a realization of the state variable as $s_{tn} = [u_{tn}; v_{tn}]$, where the notation $[\cdot\,; \cdot]$ stands for vertical vector concatenation.

As mentioned above, the CGMM component weight $w_d$ represents the probability that there exists an active speaker at the $d$-th candidate location. Frame-wise localization of active speakers can be carried out by peak-picking over the $w_d$. However, to fully use the weight information, all the candidate locations and their associated weights are used, without applying peak-picking. Formally, let $O_t = (O_{t1}, \dots, O_{td}, \dots, O_{tD})$ be the observed variables at frame $t$.
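Before detailing the tracking model, note that the weight update (7) of the localization module takes only a few lines of code. The sketch below takes a generic gradient vector as input rather than the exact derivative of $-\mathcal{L}_t + \gamma H$ (which is derived in [16]); function names are hypothetical, and $\eta = 0.07$ is the value used in Section 5:

```python
import numpy as np

def eg_step(w_prev, grad, eta=0.07):
    """Exponentiated-gradient update of the CGMM weights, as in Eq. (7).

    w_prev : previous weight vector w_{t-1} (positive, sums to 1)
    grad   : partial derivatives of the cost -L_t + gamma*H at w_{t-1}
    eta    : update-rate constant
    """
    r = np.exp(-eta * grad)   # exponentiated gradient r_{t-1}
    w = r * w_prev            # multiplicative update
    return w / w.sum()        # renormalize: w_t stays on the simplex

def entropy_grad(w, eps=1e-12):
    """Gradient of H = -sum_d w_d log w_d w.r.t. w (to be scaled by gamma)."""
    return -(np.log(w + eps) + 1.0)
```

By construction, the output is positive and sums to 1, so the simplex constraints of (6) never need to be enforced explicitly; a constant gradient leaves the weights unchanged, since it cancels in the normalization.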
Each realization $o_{td}$ of $O_{td}$ is composed of a candidate location, i.e. an azimuth $\tilde{\theta}_{td} \in \mathcal{D}$, and a weight $w_{td}$. As above, let the azimuth be described by a unit vector $b_{td} = (\cos(\tilde{\theta}_{td}), \sin(\tilde{\theta}_{td}))^\top$. In summary, we have $o_{td} = [b_{td}; w_{td}]$. Moreover, let $Z_{td}$ be a (latent) assignment variable associated with each observed variable $O_{td}$, such that $Z_{td} = n$ means that the observation indexed by $d$ at frame $t$ is assigned to speaker $n \in \{0, \dots, N\}$.

4.1. Bayesian Tracking Model

The problem at hand can be cast into the estimation of the filtering distribution $p(s_t, z_t \mid o_{1:t})$, and further inference of $s_t$ and $z_t$. By applying Bayes' rule, the filtering distribution is proportional to:

$$p(s_t, z_t \mid o_{1:t}) \propto p(o_t \mid s_t, z_t)\, p(z_t)\, p(s_t \mid o_{1:t-1}), \quad (8)$$

which contains the following three terms.

The audio observation model $p(o_t \mid s_t, z_t)$ describes the distribution of the observations given the speakers' state and assignment. We assume the different observations are independent conditionally on the speakers' state and assignment. For each observation, we adopt the weighted-data GMM model of [22]: $p(b_{td} \mid Z_{td} = n, s_{tn}; w_{td}) = \mathcal{N}(b_{td}; M s_{tn}, \frac{1}{w_{td}} \Sigma)$ for $n \in \{1, \dots, N\}$, where the matrix $M = [I_{2\times 2}, 0_{2\times 1}]$ projects the state variable onto the space of source directions, and $\Sigma$ is a covariance matrix (set empirically to a fixed value). Note that the weight plays the role of a precision: the higher the weight $w_{td}$, the more reliable the source direction $b_{td}$. The case $Z_{td} = 0$ follows a uniform distribution over the volume of the observation space, i.e. $p(b_{td} \mid Z_{td} = 0) = \mathcal{U}(\mathrm{vol}(\mathcal{G}))$.

The prior distribution of the assignment variable, $p(z_t)$, is independent over observations and is assumed to be uniformly distributed over all the speakers, i.e. $p(Z_{td} = n) = \pi_{dn} = \frac{1}{N+1}$.
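To make the role of the weight concrete, a minimal sketch of the per-observation log-likelihood under the weighted-data model (a hypothetical helper, with the state laid out as $[u; v] \in \mathbb{R}^3$) could read:

```python
import numpy as np

def obs_loglik(b, state, w, Sigma):
    """Log-likelihood of a direction observation b under one speaker.

    Evaluates log N(b; M s, Sigma / w): the weight w acts as a precision,
    so high-weight observations are sharply peaked around the predicted
    direction M s.
    """
    M = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])   # projects [u; v] onto the direction u
    mean = M @ state
    cov = Sigma / w
    diff = b - mean
    _, logdet = np.linalg.slogdet(2.0 * np.pi * cov)
    return -0.5 * (logdet + diff @ np.linalg.solve(cov, diff))
```

With $\Sigma = 0.03\,I_2$ as in Section 5, an observation lying exactly on the predicted direction becomes more likely as its weight grows, while a distant observation becomes less likely, which is exactly the precision-like behaviour described above.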
To calculate the state predictive distribution $p(s_t \mid o_{1:t-1})$, we marginalize over $s_{t-1}$:

$$p(s_t \mid o_{1:t-1}) = \int p(s_t \mid s_{t-1})\, p(s_{t-1} \mid o_{1:t-1})\, ds_{t-1}. \quad (9)$$

We model the state dynamics $p(s_t \mid s_{t-1})$ as a linear-Gaussian first-order Markov process, independent over the speakers, i.e. $p(s_{t,n} \mid s_{t-1,n}) = \mathcal{N}(s_{tn}; D_{t-1,n} s_{t-1,n}, \Lambda_{tn})$, where $\Lambda_{tn}$ is the dynamics' covariance matrix and $D_{t-1,n}$ is the state transition matrix. Importantly, since the state position subvector $u_{tn}$ lies on the unit circle, the dynamic model is designed for circular motion. Given the estimated azimuth angle $\theta_{t-1,n}$ at frame $t-1$, the state transition matrix can be written as:

$$D_{t-1,n} = \begin{bmatrix} 1 & 0 & -\sin(\theta_{t-1,n}) \\ 0 & 1 & \cos(\theta_{t-1,n}) \\ 0 & 0 & 1 \end{bmatrix}. \quad (10)$$

In the following, $D_{t-1,n}$ is written as $D$ for notational simplicity.

4.2. Variational Expectation Maximization Algorithm

To estimate the model parameters $\Theta$ and infer the posterior distribution of $(s_t, z_t)$, we adopt an EM algorithm that alternates between computing and maximizing the expected complete-data log-likelihood:

$$J(\Theta, \Theta^o) = \mathbb{E}_{p(z_t, s_t \mid o_{1:t}, \Theta^o)}[\log p(z_t, s_t, o_{1:t} \mid \Theta)], \quad (11)$$

where $\mathbb{E}$ denotes expectation and $\Theta^o$ are the "old" model parameter estimates (obtained at the previous iteration). Given the hybrid combinatorial-continuous nature of the latent space, it is computationally heavy to operate with the exact posterior distribution $p(z_t, s_t \mid o_{1:t})$. We thus use a variational approximation to solve the problem [8], which factorizes the posterior distribution as:

$$p(z_t, s_t \mid o_{1:t}) \approx q(z_t, s_t) = q(z_t) \prod_{n=0}^{N} q(s_{tn}). \quad (12)$$

It is seen that the posterior distribution factorizes across speakers. This principle also holds for time $t-1$.
As a result, the predictive distribution (9) also factorizes across speakers. If the posterior distribution $q(s_{t-1,n})$ is assumed to be a Gaussian with mean $\mu_{t-1,n}$ and variance $\Gamma_{t-1,n}$, then the per-speaker predictive distribution $p(s_{tn} \mid o_{1:t-1})$ is a Gaussian:

$$p(s_{tn} \mid o_{1:t-1}) = \mathcal{N}(s_{tn}; D\mu_{t-1,n}, D\Gamma_{t-1,n}D^\top + \Lambda_{tn}). \quad (13)$$

In the E-S step, it can be derived that the variational posterior distribution $q(s_{tn})$ is a Gaussian, with variance and mean respectively given by

$$\Gamma_{tn} = \Big[ \sum_{d=1}^{D} \alpha_{tdn} w^d_t\, M^\top \Sigma^{-1} M + \big( \Lambda_{tn} + D\Gamma_{t-1,n}D^\top \big)^{-1} \Big]^{-1},$$

$$\mu_{tn} = \Gamma_{tn} \Big[ M^\top \Sigma^{-1} \Big( \sum_{d=1}^{D} \alpha_{tdn} w^d_t\, b_{td} \Big) + \big( \Lambda_{tn} + D\Gamma_{t-1,n}D^\top \big)^{-1} D\mu_{t-1,n} \Big],$$

where $\alpha_{tdn} = q(Z_{td} = n)$ is the variational posterior distribution of the assignment variable, which is derived in the E-Z step, and the model parameter $\Lambda_{tn}$ is updated in the M-step; please refer to [16] for detailed descriptions. At time $t$, the VEM converges after a few iterations (e.g. 5 iterations are used in this work). After convergence, the posterior mean of $s_t$, i.e. $\mu_t = (\mu_{t1}, \dots, \mu_{tn}, \dots, \mu_{tN})$, is output as the result of speaker tracking and, together with the posterior variance $\Gamma_t = (\Gamma_{t1}, \dots, \Gamma_{tn}, \dots, \Gamma_{tN})$, is transmitted to time $t+1$.

4.3. Speaker Track Birth Process and Activity Detection

A track birth process is used to initialize new tracks, i.e. new speakers that enter the scenario. The general principle is the following. In a short period of time, say from frame $t-L$ to frame $t$, we assume that at most one new (yet untracked) speaker appears. For each frame from $t-L$ to $t$, we select the observation assigned to the background noise with the highest weight, and obtain an observation sequence $\tilde{o}_{t-L:t}$. We then calculate the marginal likelihood of this sequence according to our model, $\tau_0 = p(\tilde{o}_{t-L:t})$.
If these observations have been generated by a new speaker, they exhibit a smooth trajectory, and $\tau_0$ will be high. Therefore, the birth process is conducted by comparing $\tau_0$ with a threshold.

The posterior distribution of the assignment variable, i.e. $\alpha_{tdn}$, can be used for multiple-speaker activity detection. This can be formalized as testing, for each frame $t$ and each speaker $n$, whether the weighted assignment $\sum_{d=1}^{D} \alpha_{tdn} w^d_t$ (averaged over a small number of frames) is larger than a threshold.

5. EXPERIMENTS ON LOCATA DEVELOPMENT DATA

We report the results on the LOCATA development corpus for tasks #3 and #5, with a single moving speaker, and tasks #4 and #6, with two moving speakers, each task comprising three recorded sequences. In this work, we use four microphones, with indices {5, 8, 11, 12}, out of the twelve microphones of a spherical array built in the head of a humanoid robot (NAO), to perform azimuth localization and tracking. These four microphones are mounted on the top of the robot head, and they approximately lie in a horizontal plane. We perform 360°-wide azimuth estimation and tracking: $D = 72$ azimuth directions, every 5° in $[-175°, 180°]$, are used as candidate directions. The TDOAs are computed based on the coordinates of the microphones, and are then used to compute the phase of the CGMM means, while the magnitude of the CGMM means is set to a constant for all frequencies.

All the recorded signals are resampled to 16 kHz. The STFT uses a Hamming window with a length of 16 ms and a shift of 8 ms. The CTF length is $Q = 8$ frames. The RLS forgetting factor $\lambda$ is computed using $\rho = 1$. The exponentiated gradient update factor is $\eta = 0.07$. The entropy regularization factor is $\gamma = 0.1$. For the tracker, the covariance matrix is set to be isotropic: $\Sigma = 0.03\, I_2$.

A localization and tracking example. Fig. 2 shows an example for a LOCATA sequence.
Two speakers are moving and continuously speaking, with short pauses. Fig. 2(b) shows that the localization method achieves a heatmap with negligible interference and smooth peak evolution. From Fig. 2(c), it is seen that the tracking method further smooths the speakers' moving trajectories. Even when the observations have a low weight, the tracker is still able to give the correct speaker trajectories; this is ensured by exploiting the source dynamics model. As a result, the tracker is able to preserve the identity of the speakers in spite of the (short) speech pauses. In the presented sequence example, the estimated speaker identities are quite consistent with the ground truth.

Quantitative results. The following metrics are used for quantitative evaluation. A detected speaker is considered to be successfully localized if the azimuth difference is not larger than 15°. The absolute error is calculated for the successfully localized sources, and the mean absolute error (MAE) is computed by averaging the absolute error over all speakers and frames. For the unsuccessful localizations, we count the missed detections (MD) (speaker active but not detected) and false alarms (FA) (speaker detected but not active). The MD and FA rates are then computed, using all the frames, as the percentage of the total MDs and FAs out of the total number of actual speakers, respectively. In addition to these localization metrics, we also count the identity switches (IDs) to evaluate the tracking continuity: IDs represents the number of identity changes in the tracks over a whole test sequence.

Table 1 gives the quantitative localization and tracking results. The localization results are obtained by applying peak-picking on the GMM weights. The tracker slightly reduces the FA rate compared to the localization method alone, mainly by eliminating some spurious peaks that are present in the localization outputs.
It also reduces the MD rate, since some correct speaker trajectories can be recovered even when the observations have (very) weak weights, as explained above.

Figure 2: Results of speaker localization and tracking for Recording #1 / Task #6. (a) Ground-truth trajectory and voice activity (red for speaker 1, black for speaker 2); intervals in the trajectories are speaking pauses. (b) Result of localization, i.e. GMM weights. (c) Result of the tracker; black and red colors demonstrate a successful tracking, i.e. continuity of the tracks in spite of speech pauses.

Table 1: Localization and tracking results.

             | MD rate (%) | FA rate (%) | MAE (°) | IDs
Localization |    24.1     |    12.7     |   4.0   |  -
Tracking     |    22.7     |    12.4     |   4.1   |  10

The identity switches are mainly due to the crossing of speaker trajectories, a hard case for the source dynamics model.

6. CONCLUSION

In this paper, we presented the INRIA-Perception contribution to the LOCATA Challenge 2018. We combined i) a recursive DP-RTF feature estimation method, ii) an online multiple-speaker localization method, and iii) a multiple-speaker tracking method. The resulting framework provides online speaker counting, localization and consistent tracking (i.e. preserving speaker identity over a track in spite of intermittent speech production).

7. REFERENCES

[1] O. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Transactions on Signal Processing, vol. 52, no. 7, pp. 1830-1847, 2004.

[2] M. I. Mandel, R. J. Weiss, and D. P. Ellis, "Model-based expectation-maximization source separation and localization," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 2, pp. 382-394, 2010.

[3] Y. Dorfan and S.
Gannot, "Tree-based recursive expectation-maximization algorithm for localization of acoustic sources," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 10, pp. 1692-1703, 2015.

[4] O. Schwartz and S. Gannot, "Speaker tracking using recursive EM algorithms," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 2, pp. 392-402, 2014.

[5] J. Vermaak and A. Blake, "Nonlinear filtering for speaker tracking in noisy and reverberant environments," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, 2001, pp. 3021-3024.

[6] J.-M. Valin, F. Michaud, and J. Rouat, "Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering," Robotics and Autonomous Systems, vol. 55, no. 3, pp. 216-228, 2007.

[7] C. Evers, A. H. Moore, P. A. Naylor, J. Sheaffer, and B. Rafaely, "Bearing-only acoustic tracking of moving speakers for robot audition," in IEEE International Conference on Digital Signal Processing (DSP), 2015, pp. 1206-1210.

[8] S. Ba, X. Alameda-Pineda, A. Xompero, and R. Horaud, "An on-line variational Bayesian model for multi-person tracking from cluttered scenes," Computer Vision and Image Understanding, vol. 153, pp. 64-76, 2016.

[9] I. Gebru, S. Ba, X. Li, and R. Horaud, "Audio-visual speaker diarization based on spatiotemporal Bayesian fusion," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

[10] Y. Ban, L. Girin, X. Alameda-Pineda, and R. Horaud, "Exploiting the complementarity of audio and visual data in multi-speaker tracking," in ICCV Workshop on Computer Vision for Audio-Visual Media, vol. 3, 2017.

[11] Y. Avargel and I. Cohen, "System identification in the short-time Fourier transform domain with crossband filtering," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1305-1319, 2007.
[12] R. Talmon, I. Cohen, and S. Gannot, "Relative transfer function identification using convolutive transfer function approximation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 4, pp. 546-555, 2009.

[13] X. Li, L. Girin, R. Horaud, and S. Gannot, "Estimation of the direct-path relative transfer function for supervised sound-source localization," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 11, pp. 2171-2186, 2016.

[14] ——, "Multiple-speaker localization based on direct-path features and likelihood maximization with spatial sparsity regularization," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1997-2012, 2017.

[15] X. Li, B. Mourgue, L. Girin, S. Gannot, and R. Horaud, "Online localization of multiple moving speakers in reverberant environments," in The Tenth IEEE Workshop on Sensor Array and Multichannel Signal Processing, 2018.

[16] X. Li, Y. Ban, L. Girin, X. Alameda-Pineda, and R. Horaud, "Online localization and tracking of multiple moving speakers in reverberant environment," Jul. 2018, submitted to Journal on Selected Topics in Signal Processing. [Online]. Available: https://hal.inria.fr/hal-01851985

[17] J. Kivinen and M. K. Warmuth, "Exponentiated gradient versus gradient descent for linear predictors," Information and Computation, vol. 132, no. 1, pp. 1-63, 1997.

[18] Y. Ban, X. Alameda-Pineda, F. Badeig, S. Ba, and R. Horaud, "Tracking a varying number of people with a visually-controlled robotic head," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2017, pp. 4144-4151.

[19] H. W. Löllmann, C. Evers, A. Schmidt, H. Mellmann, H. Barfuss, P. A. Naylor, and W. Kellermann, "The LOCATA challenge data corpus for acoustic source localization and tracking," in IEEE Sensor Array and Multichannel Signal Processing Workshop (SAM), Sheffield, UK, July 2018.

[20] G. Xu, H. Liu, L.
Tong, and T. Kailath, "A least-squares approach to blind channel identification," IEEE Transactions on Signal Processing, vol. 43, no. 12, pp. 2982-2993, 1995.

[21] X. Li, L. Girin, R. Horaud, and S. Gannot, "Estimation of relative transfer function in the presence of stationary noise based on segmental power spectral density matrix subtraction," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2015, pp. 320-324.

[22] I. D. Gebru, X. Alameda-Pineda, F. Forbes, and R. Horaud, "EM algorithms for weighted-data clustering with application to audio-visual scene analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 12, pp. 2402-2415, 2016.
