Description of algorithms for Ben-Gurion University Submission to the LOCATA challenge

LOCATA Challenge Workshop, a satellite event of IWAENC 2018, September 17-20, 2018, Tokyo, Japan

Lior Madmoni, Hanan Beit-On, Hai Morgenstern and Boaz Rafaely

Department of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel

liomad@gmail.com; hananbo26@gmail.com; haimorg@post.bgu.ac.il; br@bgu.ac.il

ABSTRACT

This paper summarizes the methods used to localize the sources recorded for the LOCalization And TrAcking (LOCATA) challenge. The tasks of stationary sources and arrays were considered, i.e., tasks 1 and 2 of the challenge, which were recorded with the Nao robot array and the Eigenmike array. For both arrays, direction of arrival (DOA) estimation was performed with measurements in the short-time Fourier transform (STFT) domain, and with direct-path dominance (DPD) based tests, which aim to identify time-frequency (TF) bins dominated by the direct sound. For the recordings with Nao, a DPD test that is applied directly to the microphone signals was used. For the Eigenmike recordings, a DPD based test designed for plane-wave density measurements in the spherical harmonics domain was used. After acquiring DOA estimates from the TF bins that passed the DPD tests, a stage of k-means clustering is performed to assign a final DOA estimate to each speaker.

Index Terms: Direction of arrival estimation, direct-path dominance test, robot audition, spherical arrays.

1. DOA ESTIMATION WITH THE NAO ROBOT ARRAY

This section describes the method for direction of arrival (DOA) estimation in tasks 1 and 2 that was performed with the Nao robot array. This paper uses the spherical coordinate system described in [1], denoted by $(r, \theta, \phi)$, where $r$ is the distance from the origin, and $\theta$ and $\phi$ are the elevation and azimuth angles, respectively.
Consider an array of $Q$ omnidirectional microphones, representing the array mounted on Nao. Let $\{\mathbf{r}_q \equiv (r_q, \theta_q, \phi_q)\}_{q=1}^{Q}$ denote the microphone positions, arranged according to the configuration used in the LOCATA challenge for Nao [1]. In addition, consider a sound field comprised of $L$ far-field sources arriving from directions $\{\Psi_l \equiv (\theta_l, \phi_l)\}_{l=1}^{L}$. These $L$ sources can represent the direct sound from speakers in a room and the reflections due to objects and room boundaries. In this case, the sound pressure measured by the array can be described in the short-time Fourier transform (STFT) domain as [2]

$$\mathbf{p}(\tau, \omega) = \mathbf{V}(\omega, \Psi)\,\mathbf{s}(\tau, \omega) + \mathbf{n}(\tau, \omega), \tag{1}$$

where $\mathbf{p}(\tau, \omega) = [p(\tau, \omega, \mathbf{r}_1), p(\tau, \omega, \mathbf{r}_2), \ldots, p(\tau, \omega, \mathbf{r}_Q)]^T$ is a $Q \times 1$ vector holding the recorded sound pressure, $\mathbf{s}(\tau, \omega) = [s_1(\tau, \omega), s_2(\tau, \omega), \ldots, s_L(\tau, \omega)]^T$ is an $L \times 1$ vector holding the source signal amplitudes, $\mathbf{V}(\omega, \Psi)$ is a $Q \times L$ matrix holding the steering vectors between each source and microphone, with $\Psi = [\Psi_1, \Psi_2, \ldots, \Psi_L]^T$ denoting the DOAs of the sources, $\mathbf{n}(\tau, \omega) = [n_1(\tau, \omega), n_2(\tau, \omega), \ldots, n_Q(\tau, \omega)]^T$ is a $Q \times 1$ vector holding the noise components, $\tau$ and $\omega$ are the time and frequency indices, respectively, and $(\cdot)^T$ denotes the transpose operator.

The signals recorded by the Nao robot array were transformed to the STFT domain with a Hanning window of 512 samples (32 ms) and an overlap of 50%. A focusing process was then applied to the measured pressure vector in order to remove the frequency dependence of the steering matrices across every $J_\omega = 15$ adjacent frequency indices. The purpose of the focusing process is to enable the implementation of frequency smoothing while preserving the spatial information.
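The STFT front end described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `stft_multichannel`, the four-channel test signal and the sampling rate of 16 kHz (inferred from 512 samples spanning 32 ms) are assumptions made for illustration only.

```python
import numpy as np

def stft_multichannel(x, frame_len=512, overlap=0.5):
    """STFT of a multichannel signal.

    x: array of shape (Q, n_samples).
    Returns an array of shape (Q, n_frames, frame_len // 2 + 1).
    """
    hop = int(frame_len * (1.0 - overlap))        # 256 samples for 50% overlap
    window = np.hanning(frame_len)                # symmetric Hann (Hanning) window
    n_frames = 1 + (x.shape[1] - frame_len) // hop
    frames = np.stack(
        [x[:, i * hop:i * hop + frame_len] for i in range(n_frames)], axis=1
    )                                             # (Q, n_frames, frame_len)
    return np.fft.rfft(frames * window, axis=-1)  # one-sided spectrum per frame

# Toy input (assumption): 1 s of a 1 kHz tone on 4 channels with small phase offsets.
fs = 16000                                        # assumed: 512 samples / 32 ms
t = np.arange(fs) / fs
x = np.stack([np.sin(2 * np.pi * 1000 * t + 0.1 * q) for q in range(4)])

P = stft_multichannel(x)                          # p(tau, omega) for every microphone
peak_bin = int(np.argmax(np.abs(P[0, 0])))        # frequency index of the tone
```

With these parameters the frequency resolution is 16000 / 512 = 31.25 Hz per bin, so a smoothing range of 15 adjacent bins spans roughly 470 Hz.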
The focusing was performed by multiplying the sound pressure vector at each frequency index, $\omega$, with a focusing transformation $\mathbf{T}(\omega, \omega_0)$ that satisfies

$$\mathbf{T}(\omega, \omega_0)\,\mathbf{V}(\omega, \Psi) = \mathbf{V}(\omega_0, \Psi), \tag{2}$$

where $\omega_0$ is the center frequency in the frequency-smoothing range. The focusing transformations were computed in advance according to [3], using a spherical harmonics (SH) order of $N = 4$. With ideal focusing, the cross-spectrum matrix of the focused sound pressure can be written as [3]

$$\mathbf{S}_{\tilde{p}}(\tau, \omega) = \mathbf{V}(\omega_0, \Psi)\,\mathbf{S}_s(\tau, \omega)\,\mathbf{V}(\omega_0, \Psi)^H + \mathbf{S}_{\tilde{n}}(\tau, \omega), \tag{3}$$

where

$$\mathbf{S}_{\tilde{p}}(\tau, \omega) = E\left[\mathbf{T}(\omega, \omega_0)\,\mathbf{p}(\tau, \omega)\,\mathbf{p}(\tau, \omega)^H\,\mathbf{T}(\omega, \omega_0)^H\right],$$

$$\mathbf{S}_s(\tau, \omega) = E\left[\mathbf{s}(\tau, \omega)\,\mathbf{s}(\tau, \omega)^H\right],$$

$$\mathbf{S}_{\tilde{n}}(\tau, \omega) = E\left[\mathbf{T}(\omega, \omega_0)\,\mathbf{n}(\tau, \omega)\,\mathbf{n}(\tau, \omega)^H\,\mathbf{T}(\omega, \omega_0)^H\right],$$

and $(\cdot)^H$ is the Hermitian operator. In practice, averaging across $J_\tau = 3$ time frames is used to approximate the expectation. Frequency smoothing is then applied to $\mathbf{S}_{\tilde{p}}(\tau, \omega)$ by averaging across $J_\omega = 15$ frequency bins. Denoting the smoothed variables by an overline, i.e., $\overline{\mathbf{S}}(\tau, \omega) = \sum_{j_\omega = 0}^{J_\omega - 1} \mathbf{S}(\tau, \omega - j_\omega)$, the smoothed focused cross-spectrum matrix can be written as

$$\overline{\mathbf{S}}_{\tilde{p}}(\tau, \omega) = \mathbf{V}(\omega_0, \Psi)\,\overline{\mathbf{S}}_s(\tau, \omega)\,\mathbf{V}(\omega_0, \Psi)^H + \overline{\mathbf{S}}_{\tilde{n}}(\tau, \omega). \tag{4}$$

The purpose of the frequency-smoothing operation is to restore the rank of the source cross-spectrum matrix, $\mathbf{S}_s(\tau, \omega)$, which is singular when coherent sources, such as reflections, are present. After applying focusing and frequency smoothing, the effective rank [4] of $\overline{\mathbf{S}}_{\tilde{p}}(\tau, \omega)$ reflects the number of sources $L$, and the noise subspace can be correctly estimated [3].
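The rank-restoring effect of frequency smoothing can be illustrated with a small noise-free NumPy sketch. It assumes ideal focusing, so the steering vectors are the same in every bin, and models a reflection as an attenuated, delayed copy of the direct sound. The steering vectors, the attenuation of 0.7 and the per-bin phase step are hypothetical values chosen for illustration and are not taken from the challenge data.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, J_omega = 6, 15                      # microphones (toy value), bins in the smoothing range

# Hypothetical focused steering vectors V(omega_0, Psi): direct path and one reflection.
V = rng.standard_normal((Q, 2)) + 1j * rng.standard_normal((Q, 2))

# The reflection is fully coherent with the direct sound in every single bin,
# but its relative phase changes with frequency because of the delay.
delay_phase = np.exp(-1j * 2 * np.pi * 0.12 * np.arange(J_omega))
s = np.stack([np.ones(J_omega), 0.7 * delay_phase])      # (2, J_omega)

p = V @ s                                                # focused pressure, (Q, J_omega)
S_bins = np.einsum('qf,rf->fqr', p, p.conj())            # per-bin cross-spectrum matrices

def eig_desc(S):
    """Eigenvalues of a Hermitian matrix in descending order."""
    return np.linalg.eigvalsh(S)[::-1]

lam_single = eig_desc(S_bins[0])                         # one bin: rank one
lam_smooth = eig_desc(S_bins.sum(axis=0))                # sum over J_omega bins

ratio_single = lam_single[1] / lam_single[0]             # close to 0
ratio_smooth = lam_smooth[1] / lam_smooth[0]             # clearly above 0
```

In a single bin the two coherent arrivals collapse into one effective steering vector, so the second eigenvalue vanishes. After summing over the smoothing range, the second eigenvalue becomes significant, which is what allows a two-dimensional signal subspace to be separated from the noise subspace.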
Time-frequency (TF) bins in which the direct path is dominant are identified in a similar way to that proposed in the direct-path dominance (DPD) test [5]:

$$\mathcal{A}_{\text{MD-DPD}} = \left\{ (\tau, \omega) : \frac{\lambda_1\left(\overline{\mathbf{S}}_{\tilde{p}}(\tau, \omega)\right)}{\lambda_2\left(\overline{\mathbf{S}}_{\tilde{p}}(\tau, \omega)\right)} > TH_{\text{MD-DPD}} \right\},$$

where $\lambda_1(\overline{\mathbf{S}}_{\tilde{p}}(\tau, \omega))$ and $\lambda_2(\overline{\mathbf{S}}_{\tilde{p}}(\tau, \omega))$ are the largest and the second largest eigenvalues of $\overline{\mathbf{S}}_{\tilde{p}}(\tau, \omega)$, and $TH_{\text{MD-DPD}}$ is the test threshold, chosen independently for each recording to ensure that 5% of all available bins pass the test. Then, MUSIC with a one-dimensional signal subspace was applied to each of the bins in $\mathcal{A}_{\text{MD-DPD}}$. The noise subspace was estimated by the singular value decomposition of $\overline{\mathbf{S}}_{\tilde{p}}(\tau, \omega)$.

Next, k-means clustering was performed with the DOA estimates from the bins that passed the test. For task 1, a single speaker was present; thus, k-means clustering was performed with a single cluster. For task 2, the number of clusters was set to the number of sources, which was estimated for each recording by examining the scatter of DOA estimates on an azimuth-elevation grid, and was therefore assumed to be known a priori. This was done in order to focus on the performance of the DOA estimation process rather than on source number estimation. Finally, since the sources in tasks 1 and 2 are known to be stationary, the final DOA estimates were associated with a unique source identifier for all timestamps, regardless of the source's activity.

2. DOA ESTIMATION WITH THE EIGENMIKE ARRAY

This section describes the method for DOA estimation in tasks 1 and 2 that was performed with the Eigenmike array.

The sound pressure system model described in (1) can be used with $r_q = r$ for all $q = 1, \ldots, Q$ and with the same STFT parameters, such that it now describes a spherical array.
This formulation facilitates the processing of signals in the SH domain [6, 7, 8], which was performed up to an SH order of $N = 3$. Following that, plane-wave decomposition was performed, leading to [9]:

$$\mathbf{a}_{nm}(\tau, \omega) = \mathbf{Y}^H(\Psi)\,\mathbf{s}(\tau, \omega) + \tilde{\mathbf{n}}(\tau, \omega), \tag{5}$$

where $\mathbf{a}_{nm}(\tau, \omega) = [a_{00}(\tau, \omega), a_{1(-1)}(\tau, \omega), a_{10}(\tau, \omega), \ldots, a_{NN}(\tau, \omega)]^T$ is an $(N+1)^2 \times 1$ vector holding the recorded plane-wave density (PWD) coefficients in the SH domain, and $\mathbf{Y}^H(\Psi) = [\mathbf{y}^*(\Psi_1), \mathbf{y}^*(\Psi_2), \ldots, \mathbf{y}^*(\Psi_L)]$ is the $(N+1)^2 \times L$ steering matrix in this domain, where $(\cdot)^*$ denotes the complex conjugate. The vectors $\mathbf{y}(\Psi_l) = [Y_0^0(\Psi_l), Y_1^{-1}(\Psi_l), \ldots, Y_N^N(\Psi_l)]^T$ hold the SH functions $Y_n^m(\cdot)$ of order $n$ and degree $m$. These functions are assumed to be order limited to $N$, which usually holds when both $N = \lceil kr \rceil$ and $(N+1)^2 \le Q$ [10, 8], where $k$ is the wavenumber. The noise components in this domain are described by the $(N+1)^2 \times 1$ vector $\tilde{\mathbf{n}}(\tau, \omega)$. In this challenge, the plane-wave decomposition was performed in a similar manner to the R-PWD method described in [11] (equation (2.27)).

Next, the local TF correlation matrices are computed for every TF bin by [5]:

$$\tilde{\mathbf{S}}_a(\tau, \omega) = \frac{1}{J_\tau J_\omega} \sum_{j_\omega = 0}^{J_\omega - 1} \sum_{j_\tau = 0}^{J_\tau - 1} \mathbf{a}_{nm}(\tau - j_\tau, \omega - j_\omega)\,\mathbf{a}_{nm}^H(\tau - j_\tau, \omega - j_\omega), \tag{6}$$

where $J_\tau$ and $J_\omega$ are the number of time and frequency bins used for the averaging, respectively. The values chosen for this array are $J_\tau = 2$ and $J_\omega = 15$. Notice in (6) that, in this domain, frequency smoothing is performed directly, without focusing matrices [12].

The direct-path dominance enhanced plane-wave decomposition (DPD-EDS) test is designed for PWD measurements in the SH domain, and it uses the local TF correlation matrix $\tilde{\mathbf{S}}_a(\tau, \omega)$ in (6).
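The local TF correlation matrix in (6), together with the dominant eigenvector that the DPD-EDS test examines, can be sketched in NumPy as follows. This is an illustrative sketch rather than the code used for the submission: the function name `local_tf_correlation`, the array layout and the random unit vector that stands in for an SH steering vector $\mathbf{y}^*(\Psi_1)$ are all assumptions.

```python
import numpy as np

def local_tf_correlation(a, J_tau=2, J_omega=15):
    """Local TF correlation matrices following (6).

    a: PWD coefficients of shape (K, n_frames, n_bins), with K = (N + 1)^2.
    Returns S of shape (n_frames - J_tau + 1, n_bins - J_omega + 1, K, K);
    S[t, f] corresponds to tau = t + J_tau - 1 and omega = f + J_omega - 1.
    """
    K, T, F = a.shape
    outer = np.einsum('ktf,ltf->tfkl', a, a.conj())   # a a^H for every TF bin
    T_out, F_out = T - J_tau + 1, F - J_omega + 1
    S = np.zeros((T_out, F_out, K, K), dtype=complex)
    for j_tau in range(J_tau):                        # average over time frames
        for j_omega in range(J_omega):                # average over frequency bins
            S += outer[j_tau:j_tau + T_out, j_omega:j_omega + F_out]
    return S / (J_tau * J_omega)

# Toy check (assumption): a single noiseless plane wave, a = y* s, with N = 3.
rng = np.random.default_rng(1)
K, T, F = 16, 10, 40
y_conj = rng.standard_normal(K) + 1j * rng.standard_normal(K)
y_conj /= np.linalg.norm(y_conj)                      # stand-in for y*(Psi_1)
s = rng.standard_normal((T, F)) + 1j * rng.standard_normal((T, F))
a = y_conj[:, None, None] * s                         # (K, T, F)

S = local_tf_correlation(a)
eigvals, eigvecs = np.linalg.eigh(S[0, 0])            # eigenvalues in ascending order
u1 = eigvecs[:, -1]                                   # dominant eigenvector
alignment = abs(np.vdot(u1, y_conj))                  # 1 when u1 is parallel to y*
```

In this idealized case the correlation matrix has rank one and its dominant eigenvector is parallel to the steering vector. Reflections and noise reduce the alignment, which is the property a direct-path dominance measure can exploit.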
With the aim of identifying TF bins dominated by the direct sound, it was shown in [13] that, under some conditions, the dominant eigenvector of $\tilde{\mathbf{S}}_a(\tau, \omega)$, denoted by $\mathbf{u}_1(\tau, \omega)$, may approximately satisfy

$$\mathbf{u}_1(\tau, \omega) \propto \mathbf{y}^*(\Psi_1), \tag{7}$$

where $\Psi_1$ is the direction of the direct sound in the TF bin. Motivated by (7), a bin dominated by the direct sound can be identified by examining $\mathbf{u}_1(\tau, \omega)$ and measuring to what extent it represents a single plane wave. In this challenge, this was done with the following MUSIC-based measure:

$$EDS(\tau, \omega) = \max_{\Omega} \frac{1}{\left\| \mathbf{P}^{\perp}_{\mathbf{u}_1(\tau, \omega)}\,\mathbf{y}^*(\Omega) \right\|^2}, \tag{8}$$

where $\mathbf{P}^{\perp}_{\mathbf{u}_1(\tau, \omega)}$ is the projection onto the subspace that is orthogonal to $\mathbf{u}_1(\tau, \omega)$. Next, the following DPD-EDS test was performed:

$$\mathcal{A}_{\text{EDS}} = \left\{ (\tau, \omega) : EDS(\tau, \omega) > TH_{\text{EDS}} \right\}, \tag{9}$$

where $TH_{\text{EDS}}$ is the test threshold, which should satisfy $TH_{\text{EDS}} \gg 1$, and which in this challenge was chosen separately for each recording to ensure that 2.5% of all available bins pass the test. Similarly to the previous section, a DOA estimate from each TF bin is given by the argument $\Omega$ that maximizes $EDS(\tau, \omega)$,

$$\Omega_{\text{EDS}} = \left\{ \Omega : \arg\max_{\Omega} EDS(\tau, \omega),\ \forall (\tau, \omega) \in \mathcal{A}_{\text{EDS}} \right\}, \tag{10}$$

which is already computed in (8). For further information on the DPD-EDS test, the reader is referred to [13, 14]. The final DOA estimates are produced in the same way as described for the Nao robot array in the previous section, using k-means clustering.

For most recordings, an analysis frequency range of [400, 6000] Hz was employed, with the exception of several recordings for which the frequency range was reduced to [400, 4000] Hz, which seemed to yield tighter clusters of DOA estimates. When the development data of the Eigenmike recordings was analyzed, a relatively constant bias of $+8^\circ$ in the azimuth angle and $-5^\circ$ in the elevation angle, relative to the ground truth data, was present.
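Correcting a constant angular offset of this kind amounts to a subtraction followed by wrapping of the azimuth. The sketch below is a hypothetical helper and not the authors' code. It assumes that the bias is defined as the estimate minus the ground truth, that angles are in degrees, and that the azimuth is wrapped to $[-180^\circ, 180^\circ)$; the elevation is left unclipped.

```python
import numpy as np

AZ_BIAS_DEG = 8.0    # azimuth bias observed on the development data
EL_BIAS_DEG = -5.0   # elevation bias observed on the development data

def remove_bias(az_deg, el_deg):
    """Subtract the constant bias and wrap the azimuth to [-180, 180) degrees."""
    az = np.asarray(az_deg, dtype=float) - AZ_BIAS_DEG
    el = np.asarray(el_deg, dtype=float) - EL_BIAS_DEG
    az = (az + 180.0) % 360.0 - 180.0   # wrapping convention is an assumption
    return az, el

# Toy estimates (assumption): the second azimuth crosses the -180 degree boundary.
az, el = remove_bias([30.0, -175.0], [85.0, 40.0])
```

The wrapping step matters only for sources close to the azimuth discontinuity, where a plain subtraction would leave the admissible range.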
Hence, this bias was subtracted from the final DOA estimates that were calculated with the evaluation data, for all recordings.

3. REFERENCES

[1] H. W. Löllmann, C. Evers, A. Schmidt, H. Mellmann, H. Barfuss, P. A. Naylor, and W. Kellermann, "The LOCATA challenge data corpus for acoustic source localization and tracking," in IEEE Sensor Array Multichannel Signal Process. Workshop (SAM), 2018.

[2] H. L. Van Trees, Optimum Array Processing: Part IV of Detection, Estimation and Modulation Theory. Wiley Online Library, 2002, vol. 1.

[3] H. Beit-On and B. Rafaely, "Speaker localization using the direct-path dominance test for arbitrary arrays," in Proceedings of the International Conference on the Science of Electrical Engineering (ICSEE 2018), accepted for publication.

[4] O. Roy and M. Vetterli, "The effective rank: A measure of effective dimensionality," in Signal Processing Conference, 2007 15th European. IEEE, 2007, pp. 606–610.

[5] O. Nadiri and B. Rafaely, "Localization of multiple speakers under high reverberation using a spherical microphone array and the direct-path dominance test," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1494–1505, 2014.

[6] J. Meyer and G. Elko, "A highly scalable spherical microphone array based on an orthonormal decomposition of the soundfield," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2002, vol. 2. IEEE, 2002, pp. II–1781.

[7] T. D. Abhayapala and D. B. Ward, "Theory and design of high order sound field microphones using spherical microphone array," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2002, vol. 2. IEEE, 2002, pp. II–1949.

[8] B. Rafaely, Fundamentals of Spherical Array Processing. Springer, 2015, vol. 8.

[9] D. Khaykin and B. Rafaely, "Coherent signals direction-of-arrival estimation using a spherical microphone array: Frequency smoothing approach," in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2009. WASPAA'09. IEEE, 2009, pp. 221–224.

[10] D. B. Ward and T. D. Abhayapala, "Reproduction of a plane-wave sound field using an array of loudspeakers," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 6, pp. 697–707, 2001.

[11] D. L. Alon and B. Rafaely, "Spatial decomposition by spherical array processing," in Parametric Time-Frequency Domain Spatial Audio, V. Pulkki, S. Delikaris-Manias, and A. Politis, Eds. John Wiley & Sons, 2017.

[12] D. Khaykin and B. Rafaely, "Coherent signals direction-of-arrival estimation using a spherical microphone array: Frequency smoothing approach," in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2009. WASPAA'09. IEEE, 2009, pp. 221–224.

[13] L. Madmoni and B. Rafaely, "Direction of arrival estimation for reverberant speech based on enhanced decomposition of the direct sound," IEEE Journal of Selected Topics in Signal Processing, pp. 1–1, 2018.

[14] L. Madmoni and B. Rafaely, "Improved direct-path dominance test for speaker localization in reverberant environments," in Proceedings of the 2018 26th European Signal Processing Conference (EUSIPCO), Sept 2018, pp. 2424–2428.
