Description of algorithms for Ben-Gurion University Submission to the LOCATA challenge

This paper summarizes the methods used to localize the sources recorded for the LOCalization And TrAcking (LOCATA) challenge. The tasks of stationary sources and arrays were considered, i.e., tasks 1 and 2 of the challenge, which were recorded with t…

Authors: Lior Madmoni, Hanan Beit-On, Hai Morgenstern, Boaz Rafaely

LOCATA Challenge Workshop, a satellite event of IWAENC 2018, September 17-20, 2018, Tokyo, Japan

DESCRIPTION OF ALGORITHMS FOR BEN-GURION UNIVERSITY SUBMISSION TO THE LOCATA CHALLENGE

Lior Madmoni, Hanan Beit-On, Hai Morgenstern and Boaz Rafaely
Department of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel
liomad@gmail.com; hananbo26@gmail.com; haimorg@post.bgu.ac.il; br@bgu.ac.il

ABSTRACT

This paper summarizes the methods used to localize the sources recorded for the LOCalization And TrAcking (LOCATA) challenge. The tasks of stationary sources and arrays were considered, i.e., tasks 1 and 2 of the challenge, which were recorded with the Nao robot array and the Eigenmike array. For both arrays, direction of arrival (DOA) estimation has been performed with measurements in the short-time Fourier transform domain, and with direct-path dominance (DPD) based tests, which aim to identify time-frequency (TF) bins dominated by the direct sound. For the recordings with Nao, a DPD test which is applied directly to the microphone signals was used. For the Eigenmike recordings, a DPD based test designed for plane-wave density measurements in the spherical harmonics domain was used. After acquiring DOA estimates with TF bins that passed the DPD tests, a stage of k-means clustering is performed, to assign a final DOA estimate for each speaker.

Index Terms: Direction of arrival estimation, direct-path dominance test, robot audition, spherical arrays.

1. DOA ESTIMATION WITH THE NAO ROBOT ARRAY

This section describes the method for direction of arrival (DOA) estimation in tasks 1 and 2 that was performed with the Nao robot array. In this paper, the same spherical coordinate system is used as described in [1], denoted by $(r, \theta, \phi)$, where $r$ is the distance from the origin, and $\theta$ and $\phi$ are the elevation and azimuth angles, respectively.
Consider an array of $Q$ omnidirectional microphones, representing the array mounted on Nao. In this case, let $\{\mathbf{r}_q \equiv (r_q, \theta_q, \phi_q)\}_{q=1}^{Q}$ denote the microphone positions, arranged according to the configuration used in the LOCATA challenge for Nao [1]. In addition, a sound field comprised of $L$ far-field sources is considered, arriving from directions $\{\Psi_l \equiv (\theta_l, \phi_l)\}_{l=1}^{L}$. These $L$ sources can represent the direct sound from speakers in a room and the reflections due to objects and room boundaries. In this case, the sound pressure measured by the array can be described in the short-time Fourier transform (STFT) domain as [2]

$$\mathbf{p}(\tau, \omega) = \mathbf{V}(\omega, \Psi)\,\mathbf{s}(\tau, \omega) + \mathbf{n}(\tau, \omega), \tag{1}$$

where $\mathbf{p}(\tau, \omega) = [p(\tau, \omega, \mathbf{r}_1), p(\tau, \omega, \mathbf{r}_2), \ldots, p(\tau, \omega, \mathbf{r}_Q)]^T$ is a $Q \times 1$ vector holding the recorded sound pressure, $\mathbf{s}(\tau, \omega) = [s_1(\tau, \omega), s_2(\tau, \omega), \ldots, s_L(\tau, \omega)]^T$ is an $L \times 1$ vector holding the source signal amplitudes, $\mathbf{V}(\omega, \Psi)$ is a $Q \times L$ matrix holding the steering vectors between each source and microphone, with $\Psi = [\Psi_1, \Psi_2, \ldots, \Psi_L]^T$ denoting the DOAs of the sources, $\mathbf{n}(\tau, \omega) = [n_1(\tau, \omega), n_2(\tau, \omega), \ldots, n_Q(\tau, \omega)]^T$ is a $Q \times 1$ vector holding the noise components, $\tau$ and $\omega$ are the time and frequency indices, respectively, and $(\cdot)^T$ denotes the transpose operator.

The signals recorded by the Nao robot array were transformed to the STFT domain with a Hanning window of 512 samples (32 ms) and an overlap of 50%. A focusing process was then applied to the measured pressure vector in order to remove the frequency dependence of the steering matrices across every $J_\omega = 15$ adjacent frequency indices. The purpose of the focusing process is to enable the implementation of frequency smoothing while preserving the spatial information.
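The STFT analysis stated above (512-sample Hanning window, 50% overlap) can be sketched as follows. This is an illustration only: the sampling rate of 16 kHz is inferred from the stated 512 samples per 32 ms, while the channel count `Q = 12`, the white-noise input and the function name `stft_multichannel` are placeholders of this sketch rather than details given in the text.

```python
import numpy as np

def stft_multichannel(x, n_fft=512, hop=256):
    """STFT of a (Q, n_samples) array; returns shape (Q, n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)  # Hanning window of 512 samples
    n_frames = 1 + (x.shape[1] - n_fft) // hop  # hop = n_fft / 2 gives 50% overlap
    frames = np.stack(
        [x[:, i * hop : i * hop + n_fft] * window for i in range(n_frames)],
        axis=1,
    )  # shape (Q, n_frames, n_fft)
    return np.fft.rfft(frames, axis=-1)

fs = 16000  # assumed: 512 samples / 32 ms
Q = 12      # placeholder channel count
rng = np.random.default_rng(0)
x = rng.standard_normal((Q, fs))  # one second of noise standing in for a recording
P = stft_multichannel(x)
print(P.shape)  # (12, 61, 257)
```

With this layout, `P[:, tau, omega]` plays the role of the $Q \times 1$ vector $\mathbf{p}(\tau, \omega)$ in (1).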
The focusing was performed by multiplying the sound pressure vector at each frequency index $\omega$ by a focusing transformation $\mathbf{T}(\omega, \omega_0)$ that satisfies

$$\mathbf{T}(\omega, \omega_0)\,\mathbf{V}(\omega, \Psi) = \mathbf{V}(\omega_0, \Psi), \tag{2}$$

where $\omega_0$ is the center frequency in the frequency-smoothing range. The focusing transformations were computed in advance according to [3], using a spherical harmonics (SH) order of $N = 4$. With ideal focusing, the cross-spectrum matrix of the focused sound pressure can be written as [3]

$$\mathbf{S}_{\tilde{\mathbf{p}}}(\tau, \omega) = \mathbf{V}(\omega_0, \Psi)\,\mathbf{S}_{\mathbf{s}}(\tau, \omega)\,\mathbf{V}(\omega_0, \Psi)^H + \mathbf{S}_{\tilde{\mathbf{n}}}(\tau, \omega), \tag{3}$$

where

$$\mathbf{S}_{\tilde{\mathbf{p}}}(\tau, \omega) = E\left[\mathbf{T}(\omega, \omega_0)\,\mathbf{p}(\tau, \omega)\,\mathbf{p}(\tau, \omega)^H\,\mathbf{T}(\omega, \omega_0)^H\right],$$
$$\mathbf{S}_{\mathbf{s}}(\tau, \omega) = E\left[\mathbf{s}(\tau, \omega)\,\mathbf{s}(\tau, \omega)^H\right],$$
$$\mathbf{S}_{\tilde{\mathbf{n}}}(\tau, \omega) = E\left[\mathbf{T}(\omega, \omega_0)\,\mathbf{n}(\tau, \omega)\,\mathbf{n}(\tau, \omega)^H\,\mathbf{T}(\omega, \omega_0)^H\right],$$

and $(\cdot)^H$ is the Hermitian (conjugate transpose) operator. In practice, averaging across $J_\tau = 3$ time frames is used to approximate the expectation. Frequency smoothing is then applied to $\mathbf{S}_{\tilde{\mathbf{p}}}(\tau, \omega)$ by averaging across $J_\omega = 15$ frequency bins. Denoting the smoothed variables by an overline, i.e., $\overline{\mathbf{S}}(\tau, \omega) = \sum_{j_\omega = 0}^{J_\omega - 1} \mathbf{S}(\tau, \omega - j_\omega)$, the smoothed focused cross-spectrum matrix can be written as

$$\overline{\mathbf{S}}_{\tilde{\mathbf{p}}}(\tau, \omega) = \mathbf{V}(\omega_0, \Psi)\,\overline{\mathbf{S}}_{\mathbf{s}}(\tau, \omega)\,\mathbf{V}(\omega_0, \Psi)^H + \overline{\mathbf{S}}_{\tilde{\mathbf{n}}}(\tau, \omega). \tag{4}$$

The purpose of the frequency-smoothing operation is to restore the rank of the source cross-spectrum matrix, $\mathbf{S}_{\mathbf{s}}(\tau, \omega)$, which is singular when coherent sources, such as reflections, are present. After applying focusing and frequency smoothing, the effective rank [4] of $\overline{\mathbf{S}}_{\tilde{\mathbf{p}}}(\tau, \omega)$ reflects the number of sources $L$, and the noise subspace can be correctly estimated [3].
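The rank-restoring effect of frequency smoothing can be illustrated with a small numerical sketch. It assumes ideal focusing, so the steering vectors `v1` and `v2` are already frequency independent. The focusing transformation of [3] itself is not implemented here. The two orthonormal steering vectors, the unit-magnitude source spectrum, the reflection gain of 0.8, the choice of trailing time frames for the average, and both function names are assumptions made for the illustration. The effective rank follows the entropy-based definition of [4].

```python
import numpy as np

def smoothed_cross_spectrum(P_focused, tau, omega, J_tau=3, J_omega=15):
    """Time-average over J_tau frames, then sum over J_omega bins ending at (tau, omega).

    P_focused: complex array (Q, n_frames, n_bins) holding the focused pressure.
    """
    Q = P_focused.shape[0]
    S = np.zeros((Q, Q), dtype=complex)
    for j_w in range(J_omega):
        for j_t in range(J_tau):
            p = P_focused[:, tau - j_t, omega - j_w]
            S += np.outer(p, p.conj()) / J_tau  # p p^H
    return S

def effective_rank(S):
    """Exponential of the Shannon entropy of the normalised singular values [4]."""
    sigma = np.linalg.svd(S, compute_uv=False)
    p = sigma / sigma.sum()
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(np.exp(-np.sum(p * np.log(p))))

# Toy scene: a direct sound plus one coherent reflection (delayed, attenuated copy)
rng = np.random.default_rng(0)
v1 = np.array([1, 1, 1, 1]) / 2.0    # stand-in focused steering vector, direct path
v2 = np.array([1, -1, 1, -1]) / 2.0  # stand-in focused steering vector, reflection
n_frames, n_bins = 3, 15
s1 = np.exp(2j * np.pi * rng.random((n_frames, n_bins)))     # unit-magnitude source
c = 0.8 * np.exp(-2j * np.pi * np.arange(n_bins) / n_bins)   # delay: phase varies with bin
P = v1[:, None, None] * s1 + v2[:, None, None] * (c * s1)

single_bin = smoothed_cross_spectrum(P, tau=2, omega=14, J_omega=1)
smoothed = smoothed_cross_spectrum(P, tau=2, omega=14, J_omega=15)
print(effective_rank(single_bin), effective_rank(smoothed))
```

In this toy case the single-bin matrix has an effective rank close to 1, because the two coherent arrivals collapse into one rank-one term, whereas the frequency-smoothed matrix has an effective rank close to 2.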
Time-frequency (TF) bins in which the direct path is dominant are identified in a similar way to that proposed in the direct-path dominance (DPD) test [5]:

$$\mathcal{A}_{\text{MD-DPD}} = \left\{ (\tau, \omega) : \frac{\lambda_1\left(\overline{\mathbf{S}}_{\tilde{\mathbf{p}}}(\tau, \omega)\right)}{\lambda_2\left(\overline{\mathbf{S}}_{\tilde{\mathbf{p}}}(\tau, \omega)\right)} > TH_{\text{MD-DPD}} \right\},$$

where $\lambda_1\left(\overline{\mathbf{S}}_{\tilde{\mathbf{p}}}(\tau, \omega)\right)$ and $\lambda_2\left(\overline{\mathbf{S}}_{\tilde{\mathbf{p}}}(\tau, \omega)\right)$ are the largest and the second largest eigenvalues of $\overline{\mathbf{S}}_{\tilde{\mathbf{p}}}(\tau, \omega)$, and $TH_{\text{MD-DPD}}$ is the test threshold, chosen independently for each recording to ensure that 5% of all available bins pass the test. Then, MUSIC with a one-dimensional signal subspace was applied to each of the bins in $\mathcal{A}_{\text{MD-DPD}}$. The noise subspace was estimated by the singular value decomposition of $\overline{\mathbf{S}}_{\tilde{\mathbf{p}}}(\tau, \omega)$.

Next, k-means clustering was performed with the DOA estimates from the bins that passed the test. For task 1, a single speaker was present; thus, k-means clustering was performed with a single cluster. For task 2, the number of clusters was set to the number of sources, which was estimated for each recording by examining the scatter of DOA estimates on an azimuth-elevation grid, and was therefore assumed to be known a priori. This was done in order to focus on the performance of the DOA estimation process rather than on source number estimation. Finally, since the sources in tasks 1 and 2 are known to be stationary, the final DOA estimates have been associated with a unique source identifier for all timestamps, regardless of its activity.

2. DOA ESTIMATION WITH THE EIGENMIKE ARRAY

This section describes the method for DOA estimation in tasks 1 and 2 that was performed with the Eigenmike array. The sound pressure system model described in (1) can be used with $r_q = r$ for all $q = 1, \ldots, Q$ and with the same STFT parameters, such that it now describes a spherical array.
This formulation can facilitate the processing of signals in the SH domain [6, 7, 8], which was performed up to an SH order of $N = 3$. Following that, plane-wave decomposition was performed, leading to [9]:

$$\mathbf{a}_{nm}(\tau, \omega) = \mathbf{Y}^H(\Psi)\,\mathbf{s}(\tau, \omega) + \tilde{\mathbf{n}}(\tau, \omega), \tag{5}$$

where $\mathbf{a}_{nm}(\tau, \omega) = [a_{00}(\tau, \omega), a_{1(-1)}(\tau, \omega), a_{10}(\tau, \omega), \ldots, a_{NN}(\tau, \omega)]^T$ is an $(N + 1)^2 \times 1$ vector holding the recorded plane-wave density (PWD) coefficients in the SH domain, and $\mathbf{Y}^H(\Psi) = [\mathbf{y}^*(\Psi_1), \mathbf{y}^*(\Psi_2), \ldots, \mathbf{y}^*(\Psi_L)]$ is the $(N + 1)^2 \times L$ steering matrix in this domain, with $\mathbf{y}(\Psi_l) = [Y_0^0(\Psi_l), Y_1^{-1}(\Psi_l), \ldots, Y_N^N(\Psi_l)]^T$ holding the SH functions $Y_n^m(\cdot)$ of order $n$ and degree $m$. These functions are assumed to be order limited to $N$, which usually holds when both $N = \lceil kr \rceil$ and $(N + 1)^2 \leq Q$ [10, 8], where $k$ is the wavenumber. The noise components in this domain are described by the $(N + 1)^2 \times 1$ vector $\tilde{\mathbf{n}}(\tau, \omega)$, and $(\cdot)^*$ denotes the complex conjugate. In this challenge, the plane-wave decomposition was performed in a similar manner to the R-PWD method described in [11] (equation (2.27)).

Next, the local TF correlation matrices are computed for every TF bin by [5]:

$$\tilde{\mathbf{S}}_a(\tau, \omega) = \frac{1}{J_\tau J_\omega} \sum_{j_\omega = 0}^{J_\omega - 1} \sum_{j_\tau = 0}^{J_\tau - 1} \mathbf{a}_{nm}(\tau - j_\tau, \omega - j_\omega)\,\mathbf{a}_{nm}^H(\tau - j_\tau, \omega - j_\omega), \tag{6}$$

where $J_\tau$ and $J_\omega$ are the numbers of time and frequency bins used for the averaging, respectively. The values chosen for this array are $J_\tau = 2$ and $J_\omega = 15$. Note that in (6) frequency smoothing is performed directly, without focusing matrices, because the steering matrix in (5) does not depend on frequency [12].

The direct-path dominance enhanced plane-wave decomposition (DPD-EDS) test is designed for PWD measurements in the SH domain, and it uses the local TF correlation matrix $\tilde{\mathbf{S}}_a(\tau, \omega)$, as in (6).
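Equation (6) can be evaluated for all valid TF bins at once, as in the following sketch. The array layout `(K, n_frames, n_bins)` with $K = (N + 1)^2$, the random toy input and the function name `local_tf_correlation` are assumptions of this illustration. Bins that lack the required $J_\tau - 1$ earlier frames or $J_\omega - 1$ lower frequency bins are simply skipped.

```python
import numpy as np

def local_tf_correlation(A, J_tau=2, J_omega=15):
    """Eq. (6) for every TF bin that has enough preceding frames and bins.

    A: complex array (K, n_frames, n_bins) of PWD coefficients a_nm(tau, omega).
    Returns an array (n_frames - J_tau + 1, n_bins - J_omega + 1, K, K) whose
    entry [i, j] belongs to the bin (tau, omega) = (i + J_tau - 1, j + J_omega - 1).
    """
    K, n_frames, n_bins = A.shape
    # Outer product a a^H for each bin, shape (n_frames, n_bins, K, K)
    outer = np.einsum("ktf,ltf->tfkl", A, A.conj())
    n_t = n_frames - J_tau + 1
    n_f = n_bins - J_omega + 1
    S = np.zeros((n_t, n_f, K, K), dtype=complex)
    for j_t in range(J_tau):
        for j_f in range(J_omega):
            S += outer[j_t : j_t + n_t, j_f : j_f + n_f]
    return S / (J_tau * J_omega)

# Toy input: random coefficients for N = 3, i.e. K = (N + 1)**2 = 16
rng = np.random.default_rng(0)
A = rng.standard_normal((16, 10, 40)) + 1j * rng.standard_normal((16, 10, 40))
S_all = local_tf_correlation(A)
print(S_all.shape)  # (9, 26, 16, 16)
```

Each `S_all[i, j]` is Hermitian, so its eigenvectors can be obtained with `np.linalg.eigh`.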
With the aim of identifying TF bins dominated by the direct sound, it was shown in [13] that, under some conditions, the dominant eigenvector of $\tilde{\mathbf{S}}_a(\tau, \omega)$, denoted by $\mathbf{u}_1(\tau, \omega)$, may approximately satisfy

$$\mathbf{u}_1(\tau, \omega) \propto \mathbf{y}^*(\Psi_1), \tag{7}$$

where $\Psi_1$ is the direction of the direct sound in the TF bin. Motivated by (7), a bin dominated by the direct sound can be identified by examining $\mathbf{u}_1(\tau, \omega)$ and measuring to what extent it represents a single plane wave. In this challenge, this has been performed by the following MUSIC-based measure:

$$EDS(\tau, \omega) = \max_{\Omega} \frac{1}{\left\| \mathbf{P}^{\perp}_{\mathbf{u}_1(\tau, \omega)}\,\mathbf{y}^*(\Omega) \right\|^2}, \tag{8}$$

where $\mathbf{P}^{\perp}_{\mathbf{u}_1(\tau, \omega)}$ is the projection onto the subspace orthogonal to $\mathbf{u}_1(\tau, \omega)$. Next, the following DPD-EDS test was performed:

$$\mathcal{A}_{\text{EDS}} = \left\{ (\tau, \omega) : EDS(\tau, \omega) > TH_{\text{EDS}} \right\}, \tag{9}$$

where $TH_{\text{EDS}}$ is the test threshold, which should satisfy $TH_{\text{EDS}} \gg 1$, and which in this challenge was chosen for each recording separately to ensure that 2.5% of all available bins pass the test. Similarly to the previous section, a DOA estimate from each TF bin is given by the argument $\Omega$ that maximizes $EDS(\tau, \omega)$,

$$\Omega_{\text{EDS}} = \left\{ \Omega : \arg\max_{\Omega} EDS(\tau, \omega),\ \forall (\tau, \omega) \in \mathcal{A}_{\text{EDS}} \right\}, \tag{10}$$

which is already computed in (8). For further information on the DPD-EDS test, the reader is referred to [13, 14]. The final DOA estimates are produced in the same way as described for the Nao robot array in the previous section, using k-means clustering.

For most recordings, an analysis frequency range of [400, 6000] Hz was employed, with the exception of several recordings where the frequency range was reduced to [400, 4000] Hz, which seemed to yield tighter clusters of DOA estimates.
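The chain of steps in (8)-(10), followed by clustering, can be sketched as below. Several simplifications are assumptions of this illustration and are not taken from the text: the SH order is reduced to $N = 1$ so that the SH functions can be written out explicitly, $\theta$ is taken as the angle from the z-axis, the maximization over $\Omega$ is a search on a 5-degree grid, and k-means is run on Cartesian unit vectors with a deterministic farthest-first initialisation. All function names are hypothetical.

```python
import numpy as np

def sh_order1(theta, phi):
    """Complex SH [Y_0^0, Y_1^-1, Y_1^0, Y_1^1]; theta measured from the z-axis (assumed)."""
    theta, phi = np.asarray(theta, float), np.asarray(phi, float)
    c1 = np.sqrt(3.0 / (8.0 * np.pi))
    return np.stack([
        np.full(theta.shape, np.sqrt(1.0 / (4.0 * np.pi)), dtype=complex),
        c1 * np.sin(theta) * np.exp(-1j * phi),
        np.sqrt(3.0 / (4.0 * np.pi)) * np.cos(theta) + 0j,
        -c1 * np.sin(theta) * np.exp(1j * phi),
    ])

def eds_measure(S_a, Y_conj_grid, eps=1e-12):
    """Eq. (8) on a direction grid: returns (EDS value, index of the maximizing direction)."""
    _, vecs = np.linalg.eigh(S_a)  # eigenvalues ascending, so the last column is u_1
    u1 = vecs[:, -1]
    # For unit-norm u_1: ||P_perp y*||^2 = ||y*||^2 - |u_1^H y*|^2
    resid = np.sum(np.abs(Y_conj_grid) ** 2, axis=0) - np.abs(u1.conj() @ Y_conj_grid) ** 2
    measure = 1.0 / np.maximum(resid, eps)  # eps guards against division by zero
    g = int(np.argmax(measure))
    return float(measure[g]), g

def select_top_fraction(values, fraction=0.025):
    """Pick TH so that about `fraction` of the bins satisfy value > TH, as in Eq. (9)."""
    th = np.quantile(values, 1.0 - fraction)
    return th, values > th

def kmeans_doa(theta, phi, k, n_iter=20):
    """Lloyd's k-means on unit vectors; returns centre angles (theta, phi) and labels."""
    x = np.stack([np.sin(theta) * np.cos(phi), np.sin(theta) * np.sin(phi), np.cos(theta)], axis=1)
    centers = [x[0]]  # farthest-first initialisation, an assumption of this sketch
    for _ in range(1, k):
        d = np.min([np.sum((x - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(x[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = x[labels == c].mean(axis=0)
    r = np.linalg.norm(centers, axis=1)
    return np.arccos(centers[:, 2] / r), np.arctan2(centers[:, 1], centers[:, 0]), labels

# Toy check: a bin containing a single plane wave from (theta, phi) = (60 deg, 120 deg)
thetas = np.radians(np.arange(5, 180, 5))
phis = np.radians(np.arange(0, 360, 5))
theta_grid, phi_grid = (m.ravel() for m in np.meshgrid(thetas, phis, indexing="ij"))
Y_conj_grid = sh_order1(theta_grid, phi_grid).conj()
y_true = sh_order1(np.radians(60.0), np.radians(120.0)).conj()
value, g = eds_measure(np.outer(y_true, y_true.conj()), Y_conj_grid)
print(value, np.degrees(theta_grid[g]), np.degrees(phi_grid[g]))
```

For a bin that contains exactly one plane wave, the residual in (8) vanishes at the true direction, so the measure is limited only by `eps`. This is consistent with the requirement $TH_{\text{EDS}} \gg 1$. Clustering on unit vectors rather than on raw angles avoids problems at the azimuth wrap-around. The text does not state which representation was used in the submission.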
When the development data of the Eigenmike recordings was analyzed, a relatively constant bias of $+8^\circ$ in the azimuth angle and $-5^\circ$ in the elevation angle, relative to the ground truth data, was present. Hence, this bias was subtracted from the final DOA estimates that were calculated with the evaluation data, for all recordings.

3. REFERENCES

[1] H. W. Löllmann, C. Evers, A. Schmidt, H. Mellmann, H. Barfuss, P. A. Naylor, and W. Kellermann, “The LOCATA challenge data corpus for acoustic source localization and tracking,” in IEEE Sensor Array Multichannel Signal Process. Workshop (SAM), 2018.
[2] H. L. Van Trees, Optimum array processing: Part IV of detection, estimation and modulation theory. Wiley Online Library, 2002, vol. 1.
[3] H. Beit-On and B. Rafaely, “Speaker localization using the direct-path dominance test for arbitrary arrays,” in Proceedings of the International Conference On The Science Of Electrical Engineering (ICSEE 2018), accepted for publication.
[4] O. Roy and M. Vetterli, “The effective rank: A measure of effective dimensionality,” in Signal Processing Conference, 2007 15th European. IEEE, 2007, pp. 606–610.
[5] O. Nadiri and B. Rafaely, “Localization of multiple speakers under high reverberation using a spherical microphone array and the direct-path dominance test,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1494–1505, 2014.
[6] J. Meyer and G. Elko, “A highly scalable spherical microphone array based on an orthonormal decomposition of the soundfield,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2002, vol. 2. IEEE, 2002, pp. II–1781.
[7] T. D. Abhayapala and D. B. Ward, “Theory and design of high order sound field microphones using spherical microphone array,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2002, vol. 2. IEEE, 2002, pp. II–1949.
[8] B. Rafaely, Fundamentals of spherical array processing. Springer, 2015, vol. 8.
[9] D. Khaykin and B. Rafaely, “Coherent signals direction-of-arrival estimation using a spherical microphone array: Frequency smoothing approach,” in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2009. WASPAA'09. IEEE, 2009, pp. 221–224.
[10] D. B. Ward and T. D. Abhayapala, “Reproduction of a plane-wave sound field using an array of loudspeakers,” IEEE Transactions on Speech and Audio Processing, vol. 9, no. 6, pp. 697–707, 2001.
[11] D. L. Alon and B. Rafaely, “Spatial decomposition by spherical array processing,” in Parametric Time-frequency Domain Spatial Audio, V. Pulkki, S. Delikaris-Manias, and A. Politis, Eds. John Wiley & Sons, 2017.
[12] D. Khaykin and B. Rafaely, “Coherent signals direction-of-arrival estimation using a spherical microphone array: Frequency smoothing approach,” in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2009. WASPAA'09. IEEE, 2009, pp. 221–224.
[13] L. Madmoni and B. Rafaely, “Direction of arrival estimation for reverberant speech based on enhanced decomposition of the direct sound,” IEEE Journal of Selected Topics in Signal Processing, pp. 1–1, 2018.
[14] L. Madmoni and B. Rafaely, “Improved direct-path dominance test for speaker localization in reverberant environments,” in Proceedings of the 2018 26th European Signal Processing Conference (EUSIPCO), Sept 2018, pp. 2424–2428.
