Realtime Active Sound Source Localization for Unmanned Ground Robots Using a Self-Rotational Bi-Microphone Array


Authors: Deepak Gala, Nathan Lindsay, Liang Sun

Abstract This work presents a novel technique that performs both orientation and distance localization of a sound source in a three-dimensional (3D) space using only the interaural time difference (ITD) cue, generated by a newly-developed self-rotational bi-microphone robotic platform. The system dynamics are established in the spherical coordinate frame using a state-space model. The observability analysis of the state-space model shows that the system is unobservable when the sound source is placed at elevation angles of 90 and 0 degrees. The proposed method utilizes the difference between the azimuth estimates resulting from the 3D and two-dimensional models, respectively, to check the zero-degree-elevation condition and further estimates the elevation angle using a polynomial curve-fitting approach. The proposed method is also capable of detecting a 90-degree elevation by extracting the zero-ITD signal 'buried' in noise. Additionally, distance localization is performed by first rotating the microphone array to face toward the sound source and then shifting the microphone array perpendicular to the source-robot vector by a predefined distance over a fixed number of steps. The integrated rotational and translational motions of the microphone array provide a complete orientation and distance localization using only the ITD cue. A novel robotic platform using a self-rotational bi-microphone array was also developed for unmanned ground robots performing sound source localization. The proposed technique was first tested in simulation and was then verified on the newly-developed robotic platform.
Experimental data collected by the microphones installed on a KEMAR dummy head were also used to test the proposed technique. All results show the effectiveness of the proposed technique.

Deepak Gala, drgala@nmsu.edu, New Mexico State University, NM, USA; Nathan Lindsay, nl22@nmsu.edu, New Mexico State University, NM, USA; Liang Sun, lsun@nmsu.edu, New Mexico State University, NM, USA.

1 INTRODUCTION

The localization problem in the robotic field has been recognized as the most fundamental problem to make robots truly autonomous [10]. Localization techniques are of great importance for autonomous unmanned systems to identify their own locations (i.e., self-localization) and situational awareness (e.g., locations of surrounding objects), especially in an unknown environment. Mainstream technology for localization is based on computer vision, supported by visual sensors (e.g., cameras), which, however, are subject to lighting and line-of-sight conditions and rely on computationally demanding image-processing algorithms. An acoustic sensor (e.g., a microphone), as a complementary component in a robotic sensing system, does not require a line of sight and is able to work under varying light (or completely dark) conditions in an omnidirectional manner. Thanks to the advancement of microelectromechanical technology, microphones have become inexpensive and do not require significant power to operate. Sound-source localization (SSL) techniques have been developed that identify the location of sound sources (e.g., speech and music) in terms of directions and distances.
SSL techniques have been widely used in civilian applications, such as intelligent video conferencing [26,52], environmental monitoring [49], human-robot interaction (HRI) for humanoid robotics [25], and robot motion planning [37], as well as military applications, such as passive sonar for submarine detection and surveillance systems that locate hostile tanks, artillery, incoming missiles [27], aircraft [7], and UAVs [11]. SSL techniques have great potential both to enhance the sensing capability of autonomous unmanned systems on their own and to work together with vision-based localization techniques.

SSL has been achieved by using microphone arrays with more than two microphones [38,45,47,48,50]. The accuracy of the localization techniques based on microphone arrays is dictated by their physical sizes [5,12,56]. Microphone arrays are usually designed using particular (e.g., linear or circular) structures, which result in relatively large sizes and sophisticated control components for operation. Therefore, it becomes difficult to use them on either small robots or large systems due to the complexity of mounting and maneuvering.

In the past decade, research has been carried out to give robots auditory behaviors (e.g., attending to an event, locating a sound source in potentially dangerous situations, and locating and paying attention to a speaker) by mimicking human auditory systems. Humans perform sound localization with their two ears using three integrated types of cues, i.e., the interaural level difference (ILD), the interaural time difference (ITD), and the spectral information [22,35]. ILD and ITD cues are usually used to identify the horizontal location (i.e., azimuth angle) of a sound source at higher and lower frequencies, respectively.
Spectral cues are usually used to identify the vertical location (i.e., elevation angle) of a sound source at higher frequencies. Additionally, acoustic landmarks aid in bettering SSL by humans [55].

To mimic human acoustic systems, researchers have developed sound source localization techniques using only two microphones. All three types of cues have been used by Rodemann et al. [42] in a binaural approach to estimating the azimuth angle of a sound source, while the authors also stated that reliable elevation estimation would need a third microphone. Spectral cues were used by the head-related transfer function (HRTF) that was applied to identify both the azimuth and elevation angles of a sound source for binaural sensor platforms [20,25,28,29]. The ITD cues have also been used in binaural sound source localization [15], where the problem of the cone of confusion [51] has been overcome by incorporating head movements, which also enable both azimuth and elevation estimation [40,51]. Lu et al. [33] used a particle filter for binaural tracking of a mobile sound source on the basis of ITD and motion parallax, but the localization was limited to a two-dimensional (2D) plane and was not impressive under static conditions. Pang et al. [39] presented an approach for binaural azimuth estimation based on reverberation weighting and generalized parametric mapping. Lu et al. [34] presented a binaural distance localization approach using the motion-induced rate of intensity change, which requires the use of parallax motion; errors up to 3.4 m were observed. Kneip and Baumann [32] established formulae for binaural identification of the azimuth and elevation angles as well as the distance of a sound source by combining the rotational and translational motion of the interaural axis.
However, large localization errors were observed and no solution was given to handle sensor noise or model uncertainty. Rodemann [41] proposed a binaural azimuth and distance localization technique using signal amplitude along with ITD and ILD cues in an indoor environment with a sound source ranging from 0.5 m to 6 m. However, the azimuth estimation degrades with distance, and the error remaining after the required calibration was still large. Kumon and Uozumi [31] proposed a binaural system on a robot to localize a mobile sound source, but it requires the robot to move with a constant velocity to achieve 2D localization. Also, further study was proposed for a parameter α0 introduced in the EKF. Zhong et al. [46,54] and Gala et al. [17] utilized the extended Kalman filtering (EKF) technique to perform orientation localization using the ITD data acquired by a set of binaural self-rotating microphones. However, large errors were observed in [54] when the elevation angle of a sound source was close to zero.

To the best of our knowledge, the works presented in the literature for SSL using two microphones based on ITD cues mainly provided formulae that calculate the azimuth and elevation angles of a sound source without incorporating sensor noise [32]. The works that use probabilistic recursive filtering techniques (e.g., EKF) for orientation estimation [54] did not conduct any observability analysis on the system dynamics. In other words, no discussion of the limitations of the techniques for orientation estimation was found. In addition, no probabilistic recursive filtering technique was used to acquire distance information of a sound source. This paper aims to address these research gaps.
The contributions of this paper include (1) an observability analysis of the system dynamics for three-dimensional (3D) SSL using two microphones and the ITD cue only; (2) a novel algorithm that provides the estimation of the elevation angle of a sound source when the states are unobservable; and (3) a new EKF-based technique that estimates the robot-sound distance. Both simulations and experiments were conducted to validate the proposed techniques.

The rest of this paper is organized as follows. Section 2 describes the preliminaries. In Section 3, the 2D and 3D orientation localization models are presented along with their observability analysis. In Section 4, a novel method is proposed to detect the non-observability conditions and a solution to the non-observability problem is presented. Section 5 presents a distance localization model with its observability analysis. The EKF algorithm is presented in Section 6. In Sections 7 and 8, the simulation and experimental results are presented, respectively, followed by Section 9, which concludes the paper.

Fig. 1 Interaural Time Delay (ITD) estimation between signals y1(t) and y2(t) using the cross-correlation technique.

2 PRELIMINARIES

2.1 Calculation of ITD

The only cue used for localization in this paper is the ITD, which is the time difference of a sound signal traveling to the two microphones and can be calculated using the cross-correlation technique [3,30]. Consider a single stationary sound source placed in an environment.
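As a concrete preview of this computation, the following sketch estimates the ITD of two synthetic signals by locating the peak of their cross-correlation (a minimal sketch assuming NumPy; the sample rate, delay, attenuation, and noise levels are illustrative and not the paper's values):

```python
import numpy as np

# Illustrative ITD estimation by cross-correlation (synthetic signals).
fs = 44100                      # sample rate (Hz), assumed
c0 = 345.0                      # sound speed (m/s), as used in the paper
rng = np.random.default_rng(0)

true_delay = 20                 # delay between the two microphones (samples)
s = rng.standard_normal(4096)   # broadband source signal s(t)
n1 = 0.1 * rng.standard_normal(4096)
n2 = 0.1 * rng.standard_normal(4096)

y1 = s + n1                              # y1(t) = s(t) + n1(t)
y2 = 0.9 * np.roll(s, true_delay) + n2   # attenuated, delayed copy of s plus noise

# Cross-correlate and locate the peak: T_hat = argmax over lags of R_{y1,y2}
R = np.correlate(y2, y1, mode="full")
lag = np.argmax(R) - (len(y1) - 1)       # lag in samples

itd = lag / fs                           # estimated time difference (s)
d = itd * c0                             # path-length difference d = T_hat * c0 (m)
```

Applying pre-filters H1(f) and H2(f) before the correlation, as in Figure 1, turns the same peak search into the GCC estimate.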
Let y1(t) and y2(t) be the sound signals captured by two spatially separated microphones in the presence of noise, which are given by [30]

y1(t) = s(t) + n1(t),  (1)
y2(t) = δ · s(t + td) + n2(t),  (2)

where s(t) is the sound signal, n1(t) and n2(t) are real and jointly stationary random processes, td denotes the time difference of s(t) arriving at the two microphones, and δ is the signal attenuation factor due to the different traveling distances of the sound signal to the two microphones. It is commonly assumed that δ changes slowly and that s(t) is uncorrelated with the noises n1(t) and n2(t) [30]. The cross-correlation function of y1(t) and y2(t) is given by

R_{y1,y2}(τ) = E[y1(t) · y2(t − τ)],

where E[·] represents the expectation operator. Figure 1 shows the process of delay estimation between y1(t) and y2(t), where H1(f) and H2(f) represent scaling functions or pre-filters [30]. Various techniques can be used to eliminate or reduce the effect of background noise and reverberations [8,9,18,19,36,44]. An improved version of the cross-correlation method incorporating H1(f) and H2(f) is called Generalized Cross-Correlation (GCC) [30], which further improves the estimation of the time delay. The time difference of y1(t) and y2(t), i.e., the ITD, is given by

T̂ ≜ arg max_τ R_{y1,y2}(τ).

The distance difference of the sound signal traveling to the two microphones is given by d ≜ T̂ · c0, where c0 is the sound speed and is usually selected to be 345 m/s.

2.2 Far-Field Assumption

The area around a sound source can be divided into five different fields: the free field, near field, far field, direct field, and reverberant field [1,21]. The region close to a source where the sound pressure and the acoustic particle velocity are not in phase is regarded as the near field.
The range of the near field is limited to a distance from the source approximately equal to one wavelength of the sound or to three times the largest dimension of the sound source, whichever is larger. The far field of a source begins where the near field ends and extends to infinity. Under the far-field assumption, the acoustic wavefront reaching the microphones is planar rather than spherical, in the sense that the waves travel in parallel, i.e., the angle of incidence is the same for the two microphones [14].

2.3 Observability Analysis

Consider a nonlinear system described by a state-space model

ẋ = f(x),  (3)
y = h(x),  (4)

where x ∈ Rⁿ and y ∈ Rᵐ are the state and output vectors, respectively, and f(·) and h(·) are the process and output functions, respectively. The observability matrix of the system described by (3) and (4) is then given by [23]

Ω = [ (∂L⁰_f h/∂x)ᵀ  (∂L¹_f h/∂x)ᵀ  · · · ]ᵀ,

where the Lie derivatives are given by L⁰_f h = h(x) and Lⁿ_f h = (∂Lⁿ⁻¹_f h/∂x) f. The system is observable if the observability matrix Ω has rank n.

3 Mathematical Models and Observability Analysis for Orientation Localization

The complete localization of a sound source is usually achieved in two stages: the orientation (i.e., azimuth and elevation angles) localization and the distance localization. In this section, the methodology of the orientation localization is presented.

Fig. 2 Top view of the robot illustrating different angle definitions due to the rotation of the microphone array.

Fig. 3 3D view of the system for orientation localization.

3.1 Definitions

As shown in Figures 2 and 3, the acoustic signal generated by the sound source S is collected by the left and right microphones, L and R, respectively. Let O be the center of the robot as well as of the two microphones.
The location of S is represented by (D, θ, ϕ), where D is the distance between the source and the center of the robot, i.e., the length of segment OS, θ ∈ [0, π/2] is the elevation angle defined as the angle between OS and the horizontal plane, and ϕ ∈ (−π, π] is the azimuth angle defined as the angle measured clockwise from the robot heading vector, p, to OS. Letting the unit vector q be the orientation (heading) of the microphone array, β be the angle between p and q, and ψ be the angle between q and OS, both following a right-hand rotation rule, we have

ϕ = ψ + β.  (5)

For a clockwise rotation, we have β(t) = ωt, where ω is the rotational speed of the two microphones, and ψ(t) = ϕ − ωt. In the shaded triangle, △SOF, shown in Figures 3 and 4, define α ≜ ∠SOF, and we have α + ψ = π/2 and cos α = cos θ sin ψ. Based on the far-field assumption in Section 2.2, we have

d = T̂ · c0 = b cos α = b cos θ sin ψ,  (6)

where b is the distance between the two microphones, i.e., the length of the segment LR.

Fig. 4 The shaded triangle in Figure 3.

To avoid the cone of confusion [51] in SSL, the two-microphone array is rotated with a nonzero angular velocity [54]. Without loss of generality, in this paper we assume a clockwise rotation of the microphone array on the horizontal plane while the robot itself neither rotates nor translates throughout the entire estimation process, which implies that ϕ is constant.

3.2 2D Localization

If the sound source and the robot are on the same horizontal plane, i.e., θ = 0, we have d = b sin ψ. Assume that the microphone array rotates clockwise with a constant angular velocity, ω. Considering the state-space model for 2D localization with the state x_2D ≜ ψ and the output y_2D ≜ d, we have

ẋ_2D = ψ̇ = −ω,  (7)
y_2D = b sin ψ.
(8)

Theorem 1 The system described by Equations (7) and (8) is observable if 1) b ≠ 0 and 2) ω ≠ 0 or ψ ≠ 2kπ ± π/2, where k ∈ Z.

Proof The observability matrix [23,24] for the system described by Equations (7) and (8) is given by

O_2D = [ b cos ψ   bω sin ψ   −bω² cos ψ   · · · ]ᵀ.  (9)

The system is observable if O_2D has rank one, which implies b ≠ 0. If ω = 0, observability requires that cos ψ ≠ 0, which implies ψ ≠ 2kπ ± π/2. If ω ≠ 0, O_2D is full rank for all ψ.

Remark 1 Since the two microphones are separated by a non-zero distance (i.e., b ≠ 0) and the microphone array rotates with a non-zero constant angular velocity (i.e., ω ≠ 0), the system is observable in the domain of definition.

3.3 3D Localization

Considering the state-space model for 3D localization with the state x_3D ≜ [θ, ψ]ᵀ and the output y_3D ≜ d, we have

ẋ_3D = [θ̇, ψ̇]ᵀ = [0, −ω]ᵀ,  (10)
y_3D = b cos θ sin ψ.  (11)

Theorem 2 The system described by Equations (10) and (11) is observable if 1) b ≠ 0, 2) ω ≠ 0, 3) θ ≠ 0°, and 4) θ ≠ 90°.

Proof The observability matrix for (10) and (11) is given by

O_3D = [ −b sin θ sin ψ     b cos θ cos ψ
          bω sin θ cos ψ     bω cos θ sin ψ
          bω² sin θ sin ψ    −bω² cos θ cos ψ
          −bω³ sin θ cos ψ   −bω³ cos θ sin ψ
          · · ·              · · ·            ].  (12)

It should be noted that higher-order Lie derivatives do not add rank to O_3D. Consider the square matrix consisting of the first two rows of O_3D,

Ω_3D = [ −b sin θ sin ψ   b cos θ cos ψ
          bω sin θ cos ψ   bω cos θ sin ψ ],

whose determinant is det{Ω_3D} = −b²ω sin θ cos θ. The system is observable if

b ≠ 0, ω ≠ 0, θ ≠ 0°, and θ ≠ 90°.  (13)

Further investigation can be done by selecting two even (or odd) rows from O_3D to form a square matrix, whose determinant is always zero.
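The determinant derived in the proof of Theorem 2 can be checked symbolically; a minimal sketch assuming SymPy (not part of the paper's implementation):

```python
import sympy as sp

# Symbolic check of the 3D observability conditions of Theorem 2.
b, w, th, psi = sp.symbols('b omega theta psi', real=True)

x = sp.Matrix([th, psi])            # state [theta, psi]
f = sp.Matrix([0, -w])              # process model (10)
h = b * sp.cos(th) * sp.sin(psi)    # output model (11)

row0 = sp.Matrix([h]).jacobian(x)   # d(L0_f h)/dx
L1 = (row0 * f)[0]                  # L1_f h = (dh/dx) f
row1 = sp.Matrix([L1]).jacobian(x)  # d(L1_f h)/dx

Omega = sp.Matrix.vstack(row0, row1)  # first two rows of O_3D
det = sp.simplify(Omega.det())        # expect -b^2 * omega * sin(theta) * cos(theta)
```

Setting θ to 0 or π/2 makes the determinant vanish, matching conditions 3) and 4).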
Remark 2 As it is always true that b ≠ 0 and ω ≠ 0 due to Remark 1, the system is observable only when θ ≠ 0° and θ ≠ 90°. Experimental results presented by Zhong et al. [54] using a similar model illustrate large estimation errors when θ is close to zero. To further investigate the system observability, consider the following two special cases: (1) θ is known and (2) ψ is known.

Assume that θ is known and consider the following system:

ẋ_ψ = ψ̇ = −ω,  (14)
y_ψ = b cos θ sin ψ.  (15)

Corollary 1 The azimuth angle in the system described by Equations (14) and (15) is observable if 1) b ≠ 0, 2) ω ≠ 0, and 3) θ ≠ 90°.

Proof The observability matrix associated with (14) and (15) is given by

O_ψ = [ b cos θ cos ψ   bω cos θ sin ψ   · · · ]ᵀ.  (16)

So, the system is observable if

b ≠ 0, θ ≠ 90°, and ω ≠ 0 or ψ ≠ 2kπ ± π/2.  (17)

This shows that ψ is unobservable when θ = 90°.

Assume that ψ is known and consider the following system:

ẋ_θ = θ̇ = 0,  (18)
y_θ = b cos θ sin ψ.  (19)

Corollary 2 The elevation angle in the system described by Equations (18) and (19) is observable if the following conditions are satisfied: 1) b ≠ 0, 2) ω ≠ 0, and 3) θ ≠ 0°.

Proof The observability matrix associated with (18) and (19) is given by

O_θ = [ −b sin θ sin ψ   0   · · · ]ᵀ.  (20)

So the system is observable if

b ≠ 0, θ ≠ 0°, and ψ ≠ kπ.  (21)

As ψ is time-varying, it will not stay at kπ. It can be seen that θ is unobservable when θ = 0°.

4 Complete Orientation Localization

To handle the unobservable situations, i.e., θ = 0° and θ = 90°, we present a novel algorithm in this section that utilizes both the 2D and 3D localization models to enable the orientation localization of a sound source residing anywhere in the domain of definition, i.e., θ ∈ [0, π/2] and ϕ ∈ (−π, π].
Fig. 5 The signals after taking the Discrete Fourier Transform (DFT) of the noised signal d(t) with the source located at θ = 50° and θ = 85°, respectively. The two big peaks in the upper figure occur at ±2π/5 rad/sec (i.e., the angular velocity of the rotation of the microphone array) when θ = 50°, whereas small peaks are present in the bottom figure when θ = 85°.

Fig. 6 Estimated amplitudes Â_d of the signal d(t) using the DFT for the source located at θ = 50° and θ = 85°, respectively. The maximum of Â_d occurs when the frequency is at ±2π/5 rad/sec. When θ = 85°, the maximum of Â_d is less than 0.017 m.

4.1 Identification of θ = 90°

The ITD could be zero due to either a 90° elevation or the absence of sound, the latter of which can be detected by evaluating the power reception of the microphones. In this paper, we focus on the former case. Assume that the sensor noise is Gaussian, which dominates the ITD signal when θ gets close to 90°. To check for the presence of the signal d(t) buried in the noise, we can first apply the Discrete Fourier Transform (DFT) onto the stored d(t).
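In code, this check reduces to estimating the DFT amplitude of d(t) at the known rotation frequency and comparing it against a threshold, as detailed in the following paragraphs. A sketch assuming NumPy (the sample rate, rotation rate, and noise level are illustrative; d_threshold = 0.017 m corresponds to θ = 85° as in Figure 6):

```python
import numpy as np

fs = 10.0                  # samples of d(t) per second, assumed
w_rot = 2 * np.pi / 5      # rotation rate of the array (rad/s)
b = 0.18                   # microphone separation (m)
d_threshold = 0.017        # amplitude threshold (m), i.e., 5-degree accuracy

t = np.arange(0, 30, 1 / fs)
rng = np.random.default_rng(1)

def near_90_elevation(theta_deg):
    """True if the DFT amplitude of d(t) at the rotation frequency
    falls below d_threshold, i.e., theta is close to 90 degrees."""
    theta = np.radians(theta_deg)
    # d(t) = b*cos(theta)*sin(w_rot*t) plus Gaussian sensor noise (synthetic)
    d = b * np.cos(theta) * np.sin(w_rot * t) + 0.002 * rng.standard_normal(t.size)
    N = d.size
    A = 2.0 / N * np.abs(np.fft.rfft(d))            # A_hat(w) = (2/N)|X(w)|
    freqs = 2 * np.pi * np.fft.rfftfreq(N, 1 / fs)  # DFT bin frequencies (rad/s)
    k = np.argmin(np.abs(freqs - w_rot))            # bin nearest the rotation rate
    return bool(A[k] < d_threshold)
```

With these values, a source at θ = 50° yields an amplitude of roughly b·cos 50° ≈ 0.116 m at the rotation frequency, well above the threshold, while θ = 85° yields roughly 0.016 m, below it.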
The N-point DFT of the signal d(t) results in a sequence of complex numbers of the form X_real + jX_imag, where X_real and X_imag represent the real and imaginary coordinates of the complex number. The magnitude of the complex number is then obtained by |X(ω)| = √(X_real² + X_imag²). Figure 5 shows the resulting magnitude (|X(ω)|) signals of d(t) after taking the DFT when the sound source is placed at θ = 50° and 85°, respectively, in simulation. Two big peaks in the top subfigure (i.e., when θ = 50°) are observed when the frequency is at ±2π/5 rad/sec (i.e., the angular velocity of the rotation of the microphone array). However, the peaks observed in the bottom subfigure (i.e., when θ = 85°) are comparatively very small.

To eliminate the noise in Figure 5, define the estimated amplitude of the ITD signal as Â_d(ω) = (2/N) · |X(ω)|. Figure 6 shows the estimated amplitude (Â_d) of the signal d(t) resulting from Figure 5. The bottom subfigure (i.e., when θ = 85°) shows that the maximum value of Â_d is very small compared to the top subfigure (i.e., when θ = 50°). The ITD is considered to be zero if the maximum value of the estimated amplitude Â_d (when the frequency equals the angular velocity of the rotation of the microphone array) is less than a predefined threshold, d_threshold. The selection of d_threshold determines the accuracy of the estimation when the sound source is around 90° elevation. The value of d_threshold, for example, can be selected as 0.017 m, which corresponds to θ = 85° as in Figure 6, thereby giving an accuracy of 5°.

4.2 Identification of θ = 0°

Theorem 1 guarantees accurate azimuth angle estimation using the 2D model when the sound source is located at zero elevation.
We observed that when the elevation of the sound source is not close to zero, the estimation of the azimuth angle provided by the 2D model is far off the real value.

Fig. 7 Comparison of azimuth angle estimations using the 2D and 3D localization models when a sound source is located at θ = 0° and θ = 20°, respectively.

On the other hand, Theorem 2 guarantees that the azimuth angle estimation using the 3D model is accurate for all elevation angles except θ = 90°, which is detected by the approach in Section 4.1. Therefore, the estimations resulting from both the 2D and 3D models will be identical if the sound source is located at θ = 0°, as shown in Figure 7. The root-mean-square error (RMSE) is used as a measure of the difference between the two azimuth estimations, as it includes both the mean absolute error (MAE) and additional information related to the variance [13]. This error is dependent on the value of the elevation angle, and it increases as the elevation angle increases, as shown in Figure 8.

Fig. 8 RMSE between the 2D and 3D localization models' estimated azimuth angles.

Fig. 9 Approximation of the elevation angle from the RMSE data using the least-squares fitted polynomial.
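The mapping illustrated in Figure 9 can be reproduced with a standard polynomial fit; a sketch assuming NumPy, with synthetic RMSE-elevation pairs standing in for the values collected beforehand in the target environment:

```python
import numpy as np

# Sketch of the curve-fitting step of Figure 9. The RMSE-elevation
# pairs below are synthetic placeholders, not the paper's data.
rmse_deg = np.array([0.0, 0.5, 1.2, 2.1])     # mean RMSE per elevation (deg)
theta_deg = np.array([0.0, 5.0, 10.0, 15.0])  # corresponding elevations (deg)

# Least-squares polynomial mapping RMSE -> elevation. With four pairs a
# cubic interpolates them exactly; with more pairs np.polyfit returns a
# true least-squares fit.
coeffs = np.polyfit(rmse_deg, theta_deg, deg=3)
estimate_theta = np.poly1d(coeffs)

theta_hat = float(estimate_theta(1.2))  # elevation implied by an RMSE of 1.2 deg
```

At runtime, the RMSE between the 2D and 3D azimuth estimates is computed and passed through `estimate_theta` to recover the small elevation angle.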
In order to get an accurate estimate of the elevation angle close to zero, a polynomial curve-fitting approach is used to map (in a least-squares sense) the RMSE values to the elevation angles. Different RMSE values are collected beforehand in the environment where the localization will be done. The RMSE values associated with the same elevation angle but different azimuth angles express small variations, as seen in Figure 8. Therefore, for a particular elevation angle, the mean of all RMSE values with different azimuth angles is selected as the RMSE value corresponding to that elevation angle. An example curve is shown in Figure 9.

Algorithm 0.1 Complete 3D orientation localization
1: Calculate the ITD, T̂, from the recorded signals of the two microphones.
2: IF Â_d < d_threshold THEN
3:   The elevation angle of the sound source is θ = 90° and the azimuth angle, ϕ, is undefined.
4: ELSE
5:   Estimate the azimuths ϕ_2D and ϕ_3D using the 2D and 3D localization models, respectively
6:   Calculate the RMSE between ϕ_2D and ϕ_3D
7:   IF RMSE < RMSE_threshold THEN
8:     Use polynomial curve fitting to determine θ from the calculated RMSE value and estimate ϕ using either the 2D or 3D localization model
9:   ELSE estimate both θ and ϕ using the 3D localization model
10:  END IF
11: END IF

4.3 Complete Orientation Localization Algorithm

Fig. 10 Flow chart addressing the non-observability problems reflected in the 3D localization model.
Figure 10 illustrates the flow chart of the proposed algorithm for the complete orientation localization. The pseudocode of the proposed complete orientation localization is given in Algorithm 0.1. The RMSE_threshold is the value used to check when the elevation angle is close to 0°. This threshold value decides the point up to which the curve fitting is required, and after which the 3D model can be trusted for elevation estimation.

Fig. 11 3D view of the system for distance localization.

Fig. 12 Gray triangle in Figure 11.

5 Distance Localization

The novel distance localization approach presented in this section depends on an accurate orientation localization. Assume that the angular location of the sound source has been obtained by using Algorithm 0.1 and that the microphone array has been regulated to face toward the sound source, as shown in Figure 11. The proposed distance localization approach requires the microphone array, LR, to translate by a distance ∆d along the line perpendicular to the center-source vector (on the horizontal plane). This translation shifts the center of the microphone array, O, to a new point, O′, and γ is defined as the angle between vectors O′S and OS, as shown in Figure 12. Note that the center of the robot, O, is unchanged. The objective is to estimate the distance D between the center of the robot, O, and the source, S.

5.1 Mathematical Model for Distance Localization

Consider the gray triangle shown in Figure 12. Based on the far-field assumption in Section 2.2, the length R′P′ is given by

d′ = b sin γ.  (22)

In triangle △SOO′, we have

sin γ = ∆d / √((∆d)² + D²).  (23)

Defining the state as x_dist ≜ D and the output as y_dist, the state-space model is given by

ẋ_dist = 0,  (24)
y_dist = b∆d / √((∆d)² + D²).
(25)

Theorem 3 The system described by Equations (24) and (25) is observable if the following conditions are satisfied: 1) b ≠ 0, 2) ∆d ≠ 0, and 3) D ≠ 0.

Proof The observability matrix associated with (24) and (25) is given by

O_dist = [ −2b²(∆d)²D / ((∆d)² + D²)²   · · · ]ᵀ.  (26)

So the system is observable if

b ≠ 0, ∆d ≠ 0, and D ≠ 0.  (27)

Remark 3 As the microphones are separated by a non-zero distance, i.e., b ≠ 0, and the microphone array is translated by a non-zero distance, i.e., ∆d ≠ 0, the system is always observable unless the sound source and the robot are at the same location, making D = 0, which is not within the scope of this paper.

6 Extended Kalman Filter

The estimation of the angles and distance of the sound source is conducted by extended Kalman filters. A detailed mathematical derivation of the EKF can be found in [4]. Algorithm 0.2 summarizes the EKF procedure used in this paper for SSL. The sensor covariance matrix (R) is defined as σ_w², and the process covariance matrix (Q) is defined as σ_v² for the distance localization, σ_v1² for the 2D orientation localization, and diag{σ_v1², σ_v2²} for the 3D orientation localization, respectively, where σ_vi is the process noise variance corresponding to the i-th state and σ_w is the sensor noise variance. Key parameters are listed in Table 1. The complete EKF-based SSL procedure is illustrated in Figure 13.

Table 1 EKF parameters.
Parameter                                    Angular localization   Distance localization
Process noise variance (σ_vi, i = 1, 2)      0.01                   0.1
Sensor noise variance (σ_w)                  0.01                   0.001
Initial azimuth angle estimate (ϕ_initial)   5°                     –
Initial elevation angle estimate (θ_initial) 5°                     –
Initial distance estimate (D_initial)        –                      1 m

Algorithm 0.2 Pseudocode for EKF [4]
1: Initialize: x̂
2: At each value of sample rate T_out,
3: FOR i = 1 to N DO
   Prediction
4:   x̂ = x̂ + (T_out/N) f(x̂, u)
5:   A_J = ∂f/∂x (x̂, u)
6:   P = P + (T_out/N)(A_J P + P A_Jᵀ + Q)
   Update
7:   C_J = ∂h/∂x (x̂, u)
8:   K = P C_Jᵀ (R + C_J P C_Jᵀ)⁻¹
9:   P = (I − K C_J) P
10:  x̂ = x̂ + K(y[n] − h(x̂, u[n]))
11: END FOR

Fig. 13 Block diagram showing the process for the proposed complete angular and distance localization of a sound source using successive rotational and translational motions of a set of two microphones.

7 Simulation Results

In this section, we present the simulation results of the proposed localization technique for both the angle and distance localization of a sound source.

7.1 Simulation Environment

The Audio Array Toolbox [16] is used to simulate a rectangular space using the image method described in [2]. The robot was placed in the center (origin) of the room. The two microphones were separated by a distance of 0.18 m from each other, which is equal to the approximate distance between human ears. The sound source and the microphones are assumed omnidirectional, and the attenuation of the sound is calculated as per the specifications in Table 2.
Table 2 Simulated room specifications

Parameter | Value
Dimension | 20 m x 20 m x 20 m
Reflection coefficient of each wall | 0.5
Reflection coefficient of the floor | 0.5
Reflection coefficient of the ceiling | 0.5
Velocity of sound | 345 m/s
Temperature | 22 °C
Static pressure | 29.92 mmHg
Relative humidity | 38 %

7.2 Validation of Observability

As discussed earlier, Theorem 1 shows that the 2D model is always observable; however, it does not provide any elevation information about the sound source. On the other hand, Theorem 2 shows that the 3D model is unobservable when the elevation angle of the sound source is 0° or 90°. To validate the observability analysis, localization was performed in the simulated environment.

For a sound source located on a 2D plane, Figure 14 shows the average of the absolute estimation errors versus different azimuth angles with the sound source at distances of 5 m and 10 m from the robot, respectively. It can be seen that all errors are smaller than 1.8° and that the mean of the average absolute errors is approximately 1° for the two cases.

Fig. 14 Average of absolute errors in azimuth angle estimation using the 2D model with a sound source placed at different azimuth locations at constant distances of 5 m and 10 m from the center of the robot.

Fig. 15 Sound source locations with a fixed distance of 5 m to the center of the robot in the simulated room.

To verify the observability conditions for the 3D model as described by Equations (10) and (11), the sound source was placed at different locations with a distance of 5 m from the robot in the simulated room, evenly covering the hemisphere above the ground, as shown in Figure 15. Figure 16 shows the averaged absolute errors in the elevation estimation versus the actual azimuth and elevation angles of the sound source. Larger errors were observed when the elevation was close to 0°, which coincides with Theorem 2. Figure 17 shows the averaged absolute errors in the azimuth angle estimation for a single sound source at different positions. Larger errors were observed when the elevation was close to 90°, which again echoes Theorem 2.

Fig. 16 Average of absolute errors in elevation estimation using the 3D localization model. Relatively large errors illustrate the non-observability condition in elevation angle estimation with the sound source placed around 0° elevation, as described by Theorem 2.

Fig. 17 Average of absolute errors in azimuth estimation using the 3D localization model. Relatively large errors illustrate the non-observability condition in azimuth angle estimation with the sound source placed around 90° elevation, as described by Theorem 2.

7.3 Simulation Results for Orientation Localization

A number of experiments were performed to validate the performance of the proposed SSL technique for orientation localization, as described in Algorithm 0.1. White noise and speech signals were used as sound sources, each placed at different locations in the simulated room with the specifications summarized in Table 2.
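The non-observability at 90° elevation can also be seen directly from a far-field ITD model. Assuming the standard bi-microphone form ITD ≈ (b/c) cos θ sin(φ − ψ), where ψ is the array heading (this specific form is an assumption written from general far-field geometry, not quoted from the paper's Section 2), the measurement is identically zero at θ = 90° regardless of azimuth, so rotating the array yields no information:

```python
import math

B, C = 0.18, 345.0  # mic separation (m) and speed of sound (m/s), per Table 2


def itd(theta_deg, phi_deg, heading_deg):
    """Far-field ITD of a rotating bi-mic array (assumed model form)."""
    theta = math.radians(theta_deg)
    rel = math.radians(phi_deg - heading_deg)
    return (B / C) * math.cos(theta) * math.sin(rel)


# Sweep the array heading through one rotation for two source elevations
headings = range(0, 360, 10)
itd_20 = [itd(20, 50, psi) for psi in headings]  # informative sinusoid
itd_90 = [itd(90, 50, psi) for psi in headings]  # essentially zero everywhere

print(max(abs(v) for v in itd_20) > 1e-4)   # True: azimuth is recoverable
print(max(abs(v) for v in itd_90) < 1e-12)  # True: the zero-ITD singular case
```

This is exactly the zero-ITD signal "buried" in noise that the proposed method detects to flag the 90° elevation condition.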
The microphone array was rotated at an angular velocity of ω = 2π/5 rad/s in the clockwise direction for three complete revolutions. The ITD was calculated after every 1° of rotation, followed by estimation using the EKF with the parameters given in Table 1. Four sets of experiments were performed with the source at different locations. In the first two sets, the source was placed in all four quadrants, including on the axes, at different distances, keeping the elevation constant at 20° and 60°, respectively. To validate the performance of the proposed solution to the non-observability conditions, the other two sets of experiments were performed with the sound source at elevations close to 0° and 90°. The results of the localization are presented in Tables 3 and 4. It can be seen that orientation localization is achieved with errors less than 4° using both speech and white-noise sound sources. Large errors are observed when the elevation of the sound source is around 0° or 90°; further, the errors with source elevation around 0° are smaller than those with source elevation around 90°. This was achieved by using the polynomial curve fitting approach described in Section 4.2, with RMSE_threshold = 1.9°, which corresponds to θ = 15° on the fitted curve shown in Figure 9. The value of d_threshold was calculated as 0.017 m for the simulated environment with the specifications given in Table 2 (corresponding to θ = 85°, thereby giving an accuracy of 5° when the sound source gets close to 90° elevation).

Table 3 Simulation results of orientation localization for speech

Expt. No. | Act. D (m) | Act. φ (°) | Est. φ (°) | Avg abs error (°) | Act. θ (°) | Est. θ (°) | Avg abs error (°)
1a | 5 | 0 | 0.60 | 0.60 | 20 | 20.39 | 0.39
1b | 5 | 50 | 51.03 | 1.03 | 20 | 21.44 | 1.44
1c | 7 | 90 | 91.21 | 0.21 | 20 | 20.83 | 0.83
1d | 7 | 120 | 121.57 | 1.57 | 20 | 20.96 | 0.96
1e | 3 | 180 | 181.03 | 1.03 | 20 | 20.16 | 0.16
1f | 3 | -40 | -39.33 | 0.67 | 20 | 19.10 | 0.90
1g | 10 | -90 | -88.85 | 1.15 | 20 | 21.66 | 1.66
1h | 10 | -140 | -139.52 | 0.48 | 20 | 21.18 | 1.18
2a | 5 | 0 | 2.31 | 2.31 | 60 | 60.68 | 0.68
2b | 5 | 50 | 50.65 | 0.65 | 60 | 60.53 | 0.53
2c | 7 | 90 | 91.79 | 1.79 | 60 | 60.70 | 0.70
2d | 7 | 120 | 121.85 | 1.85 | 60 | 60.84 | 0.84
2e | 3 | 180 | 181.66 | 1.66 | 60 | 60.05 | 0.05
2f | 3 | -40 | -38.66 | 1.34 | 60 | 60.38 | 0.38
2g | 10 | -90 | -89.38 | 0.62 | 60 | 59.62 | 0.38
2h | 10 | -140 | -138.20 | 1.80 | 60 | 59.78 | 0.22
3a | 5 | 50 | 50.69 | 0.31 | 0 | 3.39 | 3.39
3b | 7 | -120 | -119.00 | 1.00 | 4 | 2.40 | 1.60
4a | 5 | -40 | not def. | not def. | 86 | 90.00 | 4.00
4b | 7 | 150 | not def. | not def. | 89 | 90.00 | 1.00

Table 4 Simulation results of orientation localization for white noise

Expt. No. | Act. D (m) | Act. φ (°) | Est. φ (°) | Avg abs error (°) | Act. θ (°) | Est. θ (°) | Avg abs error (°)
1a | 5 | 0 | 1.18 | 1.18 | 20 | 19.66 | 0.34
1b | 5 | 50 | 51.03 | 1.03 | 20 | 20.44 | 0.44
1c | 7 | 90 | 90.25 | 0.25 | 20 | 20.11 | 0.11
1d | 7 | 120 | 121.35 | 1.35 | 20 | 19.70 | 0.30
1e | 3 | 180 | 180.41 | 0.41 | 20 | 20.48 | 0.48
1f | 3 | -40 | -39.44 | 0.56 | 20 | 19.75 | 0.25
1g | 10 | -90 | -89.11 | 0.89 | 20 | 19.71 | 0.29
1h | 10 | -140 | -139.67 | 0.33 | 20 | 21.18 | 1.18
2a | 5 | 0 | 1.31 | 1.31 | 60 | 60.38 | 0.38
2b | 5 | 50 | 51.59 | 1.59 | 60 | 60.39 | 0.39
2c | 7 | 90 | 90.74 | 0.74 | 60 | 60.87 | 0.87
2d | 7 | 120 | 121.21 | 1.21 | 60 | 60.39 | 0.39
2e | 3 | 180 | 181.16 | 1.16 | 60 | 60.51 | 0.51
2f | 3 | -40 | -38.66 | 1.34 | 60 | 60.41 | 0.41
2g | 10 | -90 | -88.90 | 1.10 | 60 | 60.70 | 0.70
2h | 10 | -140 | -138.64 | 1.36 | 60 | 60.57 | 0.57
3a | 5 | 50 | 51.45 | 1.45 | 0 | 1.57 | 1.57
3b | 7 | -120 | -118.36 | 1.64 | 4 | 1.57 | 2.43
4a | 5 | -40 | not def. | not def. | 86 | 90.00 | 4.00
4b | 7 | 150 | not def. | not def. | 89 | 90.00 | 1.00

7.4 Simulation Results for Distance Localization

Speech and white-noise sounds were also used to test the performance of the distance localization.
A single sound source was placed at different locations, and the ITD signal was recorded while the microphone array was continuously shifted for 200 steps, each with a distance of Δd = 0.0007 m. The results are summarized in Tables 5 and 6. The key parameters of the EKF are given in Table 1. The results for the distance localization with a sound source placed at different locations are shown in Figure 18. It is observed that the estimation error converges quickly, and a total microphone-array shift of approximately 3 cm is sufficient for the estimates to converge completely to, and remain within, the three-standard-deviation bounds. The average of the absolute estimation error is found to be less than 0.6 m for both the speech and the white-noise sound sources.

Table 5 Simulation results of distance localization using a speech sound source

Expt. No. | Act. φ (°) | Act. θ (°) | Act. D (m) | Est. D (m) | Avg abs error (m)
1a | 0 | 20 | 5 | 5.01 | 0.01
1b | 50 | 20 | 5 | 5.01 | 0.01
1c | 90 | 20 | 7 | 6.94 | 0.06
1d | 120 | 20 | 7 | 6.93 | 0.07
1e | 180 | 20 | 3 | 3.01 | 0.01
1f | -40 | 20 | 3 | 3.01 | 0.01
1g | -90 | 20 | 10 | 9.54 | 0.46
1h | -140 | 20 | 10 | 9.81 | 0.19
2a | 0 | 60 | 5 | 5.02 | 0.02
2b | 50 | 60 | 5 | 5.02 | 0.02
2c | 90 | 60 | 7 | 6.94 | 0.06
2d | 120 | 60 | 7 | 6.94 | 0.06
2e | 180 | 60 | 3 | 3.00 | 0.00
2f | -40 | 60 | 3 | 3.01 | 0.01
2g | -90 | 60 | 10 | 9.52 | 0.48
2h | -140 | 60 | 10 | 9.41 | 0.59
3a | 50 | 0 | 5 | 5.02 | 0.02
3b | -120 | 4 | 7 | 6.87 | 0.13
4a | -40 | 86 | 5 | 5.02 | 0.02
4b | 150 | 89 | 7 | 6.83 | 0.17

Table 6 Simulation results of distance localization using a white-noise sound source

Expt. No. | Act. φ (°) | Act. θ (°) | Act. D (m) | Est. D (m) | Avg abs error (m)
1a | 0 | 20 | 5 | 5.01 | 0.01
1b | 50 | 20 | 5 | 5.01 | 0.01
1c | 90 | 20 | 7 | 6.92 | 0.08
1d | 120 | 20 | 7 | 6.92 | 0.08
1e | 180 | 20 | 3 | 3.01 | 0.01
1f | -40 | 20 | 3 | 3.01 | 0.01
1g | -90 | 20 | 10 | 9.52 | 0.48
1h | -140 | 20 | 10 | 9.44 | 0.56
2a | 0 | 60 | 5 | 5.01 | 0.01
2b | 50 | 60 | 5 | 5.01 | 0.01
2c | 90 | 60 | 7 | 6.92 | 0.08
2d | 120 | 60 | 7 | 6.92 | 0.08
2e | 180 | 60 | 3 | 3.01 | 0.01
2f | -40 | 60 | 3 | 3.01 | 0.01
2g | -90 | 60 | 10 | 9.48 | 0.52
2h | -140 | 60 | 10 | 9.43 | 0.57
3a | 50 | 0 | 5 | 5.01 | 0.01
3b | -120 | 4 | 7 | 6.89 | 0.11
4a | -40 | 86 | 5 | 5.01 | 0.01
4b | 150 | 89 | 7 | 6.90 | 0.10
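The convergence behavior described above can be illustrated by specializing Algorithm 0.2 to the one-state distance model of Equations (24) and (25). The sketch below is not the authors' code: the covariance tuning (Q, R, initial P) is an assumption, the cumulative-shift interpretation of Δd is an assumption, and the synthetic measurements are noise-free for clarity.

```python
import math

B = 0.18       # mic separation b (m)
D_TRUE = 5.0   # true source distance (m); illustrative
STEP = 0.0007  # array shift per step (m), as in the simulations
Q = 1e-4       # process noise added per step (assumed tuning)
R = 1e-10      # measurement noise variance (assumed; demo is noise-free)


def h(D, dd):
    """Measurement model (25): y = b*dd / sqrt(dd^2 + D^2)."""
    return B * dd / math.sqrt(dd**2 + D**2)


def h_jac(D, dd):
    """Scalar Jacobian C_J = dh/dD used in the EKF update."""
    return -B * dd * D / (dd**2 + D**2) ** 1.5


x_hat, P = 1.0, 1.0  # D_initial = 1 m (Table 1); initial P is illustrative
for k in range(1, 201):       # 200 shifts of the array center
    dd = k * STEP             # cumulative perpendicular shift (assumed reading)
    # Prediction: f = 0 for a static source, so only P grows (Algorithm 0.2, line 6)
    P += Q
    # Update (lines 7-10) with a synthetic noise-free measurement of (25)
    y = h(D_TRUE, dd)
    C = h_jac(x_hat, dd)
    K = P * C / (R + C * P * C)
    x_hat += K * (y - h(x_hat, dd))
    P *= 1.0 - K * C

print(round(x_hat, 2))  # approaches D_TRUE = 5.0
```

Because h is monotone in D and the gain is bounded by the linearized update, the estimate moves toward the true distance on every step, mirroring the quick convergence reported above.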
Fig. 18 Simulation results for distance estimation using the EKF. A single sound source was placed at two different locations with distances of 5 m and 10 m, respectively. The bounds represent three standard deviations of the estimation error.

8 Experimental Results

Experiments were conducted using two different hardware platforms: a KEMAR dummy head in a well-equipped hearing laboratory, and a robotic platform equipped with a set of two rotational microphones. The following subsections discuss the hardware platforms and the results.

8.1 Results using KEMAR Dummy Head

Experiments using the KEMAR dummy head were conducted in a high-frequency-focused sound-treated room [53] with dimensions 4.6 m x 3.7 m x 2.7 m, as shown in Figure 19. The ITD, however, is mostly effective as a spatial hearing cue for low-frequency sounds below 1.5 kHz [35]. The walls, floor, and ceiling of the room were covered with polyurethane acoustic foam with a thickness of only 5 cm, which is small compared to the sound wavelength and therefore provides relatively little attenuation at low and middle frequencies [6], making it a challenging acoustic environment. For broadband noise, T60 (i.e., the time required for the sound level to decay by 60 dB [43]) was 97 ms. In an octave band centered at 1000 Hz, T60 for the noise was 324 ms on average.
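T60 figures like those above are commonly obtained from a recorded impulse response via Schroeder backward integration; the sketch below is a minimal illustration of that standard procedure (not the authors' measurement setup), validated on a synthetic decay with a known T60:

```python
import numpy as np


def rt60_schroeder(ir, fs):
    """Estimate T60 from an impulse response via Schroeder backward
    integration: fit the -5 dB to -25 dB decay and extrapolate to 60 dB."""
    energy = np.cumsum(ir[::-1] ** 2)[::-1]       # backward-integrated energy
    edc = 10.0 * np.log10(energy / energy[0])     # energy decay curve (dB)
    i5 = np.argmax(edc <= -5.0)                   # start of the linear fit
    i25 = np.argmax(edc <= -25.0)                 # end of the linear fit
    t = np.arange(len(ir)) / fs
    slope, _ = np.polyfit(t[i5:i25], edc[i5:i25], 1)  # decay rate, dB/s
    return -60.0 / slope


# Synthetic exponentially decaying noise with a known T60 of 0.1 s
fs, t60_true = 44100, 0.1
t = np.arange(int(fs * 0.5)) / fs
rng = np.random.default_rng(0)
ir = rng.standard_normal(t.size) * 10.0 ** (-3.0 * t / t60_true)
print(round(rt60_schroeder(ir, fs), 3))  # close to 0.1
```

The amplitude envelope 10^(−3t/T60) corresponds to a 60 dB energy decay over T60 seconds, so the fitted slope of about −600 dB/s recovers the 0.1 s value.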
The audio signals, digitally generated by a MATLAB program, were produced by three 12-channel digital-to-analog converters running at 44,100 samples per second per channel, amplified by AudioSource AMP 1200 amplifiers, and played from an array of 36 loudspeakers. The two microphones were installed on the KEMAR dummy head, which was temporarily mounted on a rotating chair that was rotated at an approximate rate of 32°/s for about two circles in the middle of the room. The data collected during the second rotation was used for the EKF. Motion data was collected by the gyroscope mounted on top of the dummy head. The audio signals were amplified and collected by a sound card and then stored on a desktop computer for further processing. The ITD was computed with a generalized cross-correlation model [30] in each time frame corresponding to the 120 Hz sampling rate of the gyroscope. The computation was completed by a MATLAB program on a desktop computer. Raw data with a single sound source located at four different locations were collected.

Fig. 19 Setup of the KEMAR dummy head on a rotating chair in the middle of the sound-treated room [46].

Fig. 20 Experimental results for orientation localization using the KEMAR dummy head. When θ = 0°, the azimuth estimates using the 2D and 3D models are very close (in the top-left figure), which implies the elevation estimates are not reliable (in the bottom-left figure). When θ = 30°, the azimuth estimates are obviously different (in the top-right figure), which implies reliable elevation estimates using the 3D model (in the bottom-right figure).

The left two subfigures in Figure 20 were generated when the actual elevation angle is 0°. It can be seen that the azimuth estimates using the 2D and 3D models are very close, which implies that the actual elevation angle is close to 0° and that the elevation estimate using the 3D model is not reliable. The right two subfigures in Figure 20 were generated when the actual elevation angle is 30°. It can be seen that the azimuth estimates using the 2D and 3D localization models are obviously different while the elevation estimate using the 3D model is fairly accurate, which verifies the proposed algorithm shown in Figure 10. Table 7 shows the estimation results obtained using the 3D localization model. It can be seen that the RMSE of the difference between the azimuth estimates obtained respectively with the 2D and 3D models works well in checking the zero-elevation condition.

8.2 Results using Robotic Platform

Experiments were also performed using the robotic platform shown in Figure 21. In these experiments, two microelectromechanical systems (MEMS) analog/digital microphones were used for recording the sound signal coming from the sound source. Flex adapters were used to hold the microphones. The angular speed of rotation of the microphone array was controlled by a bipolar stepper motor with the gear ratio adjusted to 0.9° per step. The stepper motor was controlled by an Arduino microprocessor. The distance between the two microphones was kept constant at 0.3 m.
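The generalized cross-correlation used above to extract the ITD [30] can be sketched as follows. The PHAT weighting shown here is one common variant (the paper does not state which weighting was used), demonstrated on synthetic white noise with a known sample delay:

```python
import numpy as np


def gcc_phat_delay(left, right, fs):
    """Estimate the delay of `right` relative to `left` via GCC-PHAT [30]."""
    n = 2 * max(len(left), len(right))        # zero-pad for linear correlation
    Lf = np.fft.rfft(left, n)
    Rf = np.fft.rfft(right, n)
    cross = Rf * np.conj(Lf)                  # positive lag = right is delayed
    cross /= np.abs(cross) + 1e-12            # PHAT weighting (whitening)
    cc = np.fft.irfft(cross, n)
    cc = np.concatenate((cc[-n // 2:], cc[:n // 2]))  # center zero lag
    lag = int(np.argmax(np.abs(cc))) - n // 2
    return lag / fs


fs, true_delay = 44100, 20                    # 20 samples ~ 0.45 ms
rng = np.random.default_rng(1)
sig = rng.standard_normal(4096)
left = sig
right = np.concatenate((np.zeros(true_delay), sig[:-true_delay]))
print(round(gcc_phat_delay(left, right, fs) * fs))  # recovers 20
```

In the experiments, one such delay estimate per 1/120 s frame (matching the gyroscope rate) provides the ITD sequence fed to the EKF.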
An audio clip (music) played from a loudspeaker was used as the sound source and was kept at different locations. The estimation results are shown in Figure 22 and Table 8.

Fig. 21 A two-microphone system equipped on a ground robot.

It can be seen that the azimuth estimates using the 2D and 3D models shown in the top-left subfigure of Figure 22, generated when the actual elevation angle is 0°, are very close, which implies that the elevation is close to 0° and that the elevation estimate shown in the bottom-left subfigure of Figure 22, obtained using the 3D localization model, is not reliable. The two subfigures on the right in Figure 22 were generated by keeping the sound source at an elevation angle of 55°. As proposed in the algorithm shown in Figure 10, the azimuth estimates using the 2D and 3D localization models are different, while the elevation estimate using the 3D model is fairly accurate. Table 8 shows the estimation results obtained using the 3D localization model. It can be seen that the zero-elevation condition can be checked using the RMSE of the difference between the azimuth estimates obtained respectively with the 2D and 3D models.

Table 7 Experimental results using the KEMAR dummy head: orientation localization using the 3D model. (RMSE: difference between azimuth estimations using the 2D and 3D models, respectively.)

Expt. No. | Act. φ (°) | Est. φ (°) | Avg abs error (°) | RMSE (°) | Act. θ (°) | Est. θ (°) | Avg abs error (°)
1 | 90 | 91.21 | 1.21 | 1.39 | 0 | 13.64 | 13.64
2 | -20 | -21.53 | 1.53 | 1.16 | 0 | 48.14 | 48.14
3 | 90 | 90.40 | 0.40 | 79.94 | 60 | 59.05 | 0.95

Table 8 Experimental results using the robotic platform: orientation localization using the 3D model. (RMSE: difference between azimuth estimations using the 2D and 3D models, respectively.)

Expt. No. | Act. φ (°) | Est. φ (°) | Avg abs error (°) | RMSE (°) | Act. θ (°) | Est. θ (°) | Avg abs error (°)
1 | -140 | -140.65 | 0.65 | 0.72 | 0 | 14.96 | 14.96
2 | 180 | 178.71 | 1.29 | 0.69 | 5 | 11.59 | 6.59
3 | 40 | 39.67 | 0.33 | 8.80 | 55 | 55.24 | 0.24
4 | 40 | 38.20 | 1.80 | 10.96 | 65 | 64.67 | 0.33

Fig. 22 Experimental results for orientation localization using the robotic platform. When θ = 0°, the azimuth estimates using the 2D and 3D models are very close (in the top-left figure), which implies the elevation estimates are not reliable (in the bottom-left figure). When θ = 55°, the azimuth estimates are obviously different (in the top-right figure), which implies reliable elevation estimates using the 3D model (in the bottom-right figure).

A fitted curve similar to the one shown in Figure 9 can be generated for a given environment by keeping the sound source at different elevation angles and recording the RMSE values between the φ_2D and φ_3D estimates. The value of the parameter RMSE_threshold can then be decided and used to check the θ = 0° scenario. Further, the generated fitted curve can be used to give a closer estimate of the elevation angle.
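The per-environment calibration described above, fitting a curve of RMSE between the 2D and 3D azimuth estimates versus elevation, choosing RMSE_threshold from it, and inverting it to refine low-elevation estimates, can be sketched as follows. The calibration points below are invented for illustration only (chosen so that RMSE ≈ 1.9° at θ = 15°, matching the simulation setting); in practice they come from measurements in the target environment:

```python
import numpy as np

# Hypothetical calibration data: RMSE (deg) between the 2D and 3D azimuth
# estimates, recorded with the source placed at known elevations (deg).
theta_cal = np.array([0.0, 5.0, 10.0, 15.0, 20.0, 30.0])
rmse_cal = np.array([0.3, 0.7, 1.2, 1.9, 2.8, 5.1])

# Fit RMSE as a polynomial in elevation, as in the paper's curve-fitting step
coeffs = np.polyfit(theta_cal, rmse_cal, deg=2)


def elevation_from_rmse(rmse):
    """Invert the fitted curve: find the elevation whose predicted RMSE matches."""
    grid = np.linspace(0.0, 30.0, 3001)
    return grid[np.argmin(np.abs(np.polyval(coeffs, grid) - rmse))]


RMSE_THRESHOLD = 1.9  # below this, the 3D elevation estimate is not trusted

rmse_obs = 1.2        # observed RMSE for a new source (hypothetical)
if rmse_obs < RMSE_THRESHOLD:
    theta_refined = elevation_from_rmse(rmse_obs)  # near 10 deg on this data
```

Since the fitted curve is monotone over the calibrated range, the inversion is unambiguous; above the threshold the 3D model's own elevation estimate is used instead.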
9 Conclusion

This paper presents a novel technique that performs a complete localization (i.e., both orientation and distance) of a stationary sound source in a three-dimensional (3D) space. Two singular conditions under which unreliable orientation localization occurs (the elevation angle equal to 0° or 90°) were found using observability theory. The root-mean-squared error (RMSE) of the difference between the azimuth estimates obtained respectively with the 2D and 3D models was used to check the 0° elevation condition, and the elevation was then estimated using a polynomial curve fitting technique. The 90° elevation was detected by checking for a zero-ITD signal. Based on an accurate orientation localization, the distance localization was performed by first rotating the microphone array to face toward the sound source and then shifting the microphones perpendicular to the source-robot vector by a fixed number of steps of predefined length. Under challenging acoustic environments with relatively low-energy targets and high-energy noise, high localization accuracy was achieved in both simulations and experiments. In simulation, the mean of the average absolute estimation error was less than 4° for angular localization and less than 0.6 m for distance localization, and the techniques to detect θ = 0° and 90° were verified in both simulation and experimental results.

Acknowledgements The authors would like to thank Dr. Xuan Zhong for providing the experimental raw data using the KEMAR dummy head.

References

1. International Organization for Standardization (ISO), British, European and International Standards (BSEN): Noise emitted by machinery and equipment – Rules for the drafting and presentation of a noise test code. 12001:1997 Acoustics
2. Allen, J.B., Berkley, D.A.: Image method for efficiently simulating small-room acoustics. The Journal of the Acoustical Society of America 65(4), 943–950 (1979). doi:10.1121/1.382599
3. Azaria, M., Hertz, D.: Time delay estimation by generalized cross correlation methods. IEEE Transactions on Acoustics, Speech, and Signal Processing 32(2), 280–285 (1984). doi:10.1109/TASSP.1984.1164314
4. Beard, R., McLain, T.: Small Unmanned Aircraft: Theory and Practice. Princeton University Press (2012)
5. Benesty, J., Chen, J., Huang, Y.: Microphone array signal processing, vol. 1. Springer Science & Business Media (2008). doi:10.1007/978-3-540-78612-2
6. Beranek, L.L., Mellow, T.J.: Acoustics: sound fields and transducers. Academic Press (2012)
7. Blumrich, R., Altmann, J.: Medium-range localisation of aircraft via triangulation. Applied Acoustics 61(1), 65–82 (2000). doi:10.1016/S0003-682X(99)00066-3
8. Boll, S.: Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing 27(2), 113–120 (1979). doi:10.1109/TASSP.1979.1163209
9. Boll, S., Pulsipher, D.: Suppression of acoustic noise in speech using two microphone adaptive noise cancellation. IEEE Transactions on Acoustics, Speech, and Signal Processing 28(6), 752–753 (1980). doi:10.1109/TASSP.1980.1163472
10. Borenstein, J., Everett, H., Feng, L.: Navigating mobile robots: systems and techniques. A K Peters Ltd. (1996)
11. Brandes, T.S., Benson, R.H.: Sound source imaging of low-flying airborne targets with an acoustic camera array. Applied Acoustics 68(7), 752–765 (2007). doi:10.1016/j.apacoust.2006.04.009
12. Brandstein, M., Ward, D.: Microphone arrays: signal processing techniques and applications. Springer Science & Business Media (2013). doi:10.1007/978-3-662-04619-7
13. Brassington, G.: Mean absolute error and root mean square error: which is the better metric for assessing model performance? In: EGU General Assembly Conference Abstracts, vol. 19, p. 3574 (2017)
14. Calmes, L.: Biologically inspired binaural sound source localization and tracking for mobile robots. Ph.D. thesis, RWTH Aachen University (2009)
15. Chen, J., Benesty, J., Huang, Y.: Time delay estimation in room acoustic environments: an overview. EURASIP Journal on Applied Signal Processing, pp. 170–170 (2006). doi:10.1155/ASP/2006/26
16. Donohue, K.D.: Audio array toolbox. [Online] Available: http://vis.uky.edu/distributed-audio-lab/about/, 2017, Dec 22
17. Gala, D., Lindsay, N., Sun, L.: Three-dimensional sound source localization for unmanned ground vehicles with a self-rotational two-microphone array. In: Proceedings of the 5th International Conference of Control, Dynamic Systems, and Robotics (CDSR'18). Accepted (2018)
18. Gala, D.R., Misra, V.M.: SNR improvement with speech enhancement techniques. In: Proceedings of the International Conference and Workshop on Emerging Trends in Technology, ICWET '11, pp. 163–166. ACM (2011). doi:10.1145/1980022.1980058
19. Gala, D.R., Vasoya, A., Misra, V.M.: Speech enhancement combining spectral subtraction and beamforming techniques for microphone array. In: Proceedings of the International Conference and Workshop on Emerging Trends in Technology, ICWET '10, pp. 163–166 (2010). doi:10.1145/1741906.1741938
20. Gill, D., Troyansky, L., Nelken, I.: Auditory localization using direction-dependent spectral information. Neurocomputing 32, 767–773 (2000). doi:10.1016/S0925-2312(00)00242-3
21. Goelzer, B., Hansen, C.H., Sehrndt, G.: Occupational exposure to noise: evaluation, prevention and control. World Health Organisation (2001)
22. Goldstein, E.B., Brockmole, J.: Sensation and perception. Cengage Learning (2016)
23. Hedrick, J.K., Girard, A.: Control of nonlinear dynamic systems: Theory and applications. Controllability and Observability of Nonlinear Systems, p. 48 (2005)
24. Hermann, R., Krener, A.: Nonlinear controllability and observability. IEEE Transactions on Automatic Control 22(5), 728–740 (1977). doi:10.1109/TAC.1977.1101601
25. Hornstein, J., Lopes, M., Santos-Victor, J., Lacerda, F.: Sound localization for humanoid robots – building audio-motor maps based on the HRTF. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1170–1176 (2006). doi:10.1109/IROS.2006.281849
26. Huang, Y., Benesty, J., Elko, G.W.: Passive acoustic source localization for video camera steering. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. II909–II912 (2000). doi:10.1109/ICASSP.2000.859108
27. Kaushik, B., Nance, D., Ahuja, K.: A review of the role of acoustic sensors in the modern battlefield. In: 11th AIAA/CEAS Aeroacoustics Conference (2005). doi:10.2514/6.2005-2997
28. Keyrouz, F.: Advanced binaural sound localization in 3-D for humanoid robots. IEEE Transactions on Instrumentation and Measurement 63(9), 2098–2107 (2014). doi:10.1109/TIM.2014.2308051
29. Keyrouz, F., Diepold, K.: An enhanced binaural 3D sound localization algorithm. In: IEEE International Symposium on Signal Processing and Information Technology, pp. 662–665 (2006). doi:10.1109/ISSPIT.2006.270883
30. Knapp, C., Carter, G.: The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech, and Signal Processing 24(4), 320–327 (1976). doi:10.1109/TASSP.1976.1162830
31. Kumon, M., Uozumi, S.: Binaural localization for a mobile sound source. Journal of Biomechanical Science and Engineering 6(1), 26–39 (2011). doi:10.1299/jbse.6.26
32. Kneip, L., Baumann, C.: Binaural model for artificial spatial sound localization based on interaural time delays and movements of the interaural axis. The Journal of the Acoustical Society of America, pp. 3108–3119 (2008). doi:10.1121/1.2977746
33. Lu, Y.C., Cooke, M.: Motion strategies for binaural localisation of speech sources in azimuth and distance by artificial listeners. Speech Communication 53(5), 622–642 (2011). doi:10.1016/j.specom.2010.06.001
34. Lu, Y.C., Cooke, M., Christensen, H.: Active binaural distance estimation for dynamic sources. In: INTERSPEECH, pp. 574–577 (2007)
35. Middlebrooks, J.C., Green, D.M.: Sound localization by human listeners. Annual Review of Psychology 42(1), 135–159 (1991). doi:10.1146/annurev.ps.42.020191.001031
36. Naylor, P., Gaubitch, N.D.: Speech dereverberation. Springer Science & Business Media (2010). doi:10.1007/978-1-84996-056-4
37. Nguyen, Q.V., Colas, F., Vincent, E., Charpillet, F.: Long-term robot motion planning for active sound source localization with Monte Carlo tree search. In: Hands-free Speech Communications and Microphone Arrays (HSCMA), pp. 61–65 (2017). doi:10.1109/HSCMA.2017.7895562
38. Omologo, M., Svaizer, P.: Acoustic source location in noisy and reverberant environment using CSP analysis. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 921–924 (1996). doi:10.1109/ICASSP.1996.543272
39. Pang, C., Liu, H., Zhang, J., Li, X.: Binaural sound localization based on reverberation weighting and generalized parametric mapping. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25(8), 1618–1632 (2017). doi:10.1109/TASLP.2017.2703650
40. Perrett, S., Noble, W.: The effect of head rotations on vertical plane sound localization. The Journal of the Acoustical Society of America 102(4), 2325–2332 (1997). doi:10.1121/1.419642
41. Rodemann, T.: A study on distance estimation in binaural sound localization. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 425–430 (2010). doi:10.1109/IROS.2010.5651455
42. Rodemann, T., Ince, G., Joublin, F., Goerick, C.: Using binaural and spectral cues for azimuth and elevation localization. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2185–2190 (2008). doi:10.1109/IROS.2008.4650667
43. Sabine, W.: Collected Papers on Acoustics. Harvard University Press (1922)
44. Spriet, A., Van Deun, L., Eftaxiadis, K., Laneau, J., Moonen, M., Van Dijk, B., Van Wieringen, A., Wouters, J.: Speech understanding in background noise with the two-microphone adaptive beamformer BEAM in the Nucleus Freedom cochlear implant system. Ear and Hearing 28(1), 62–72 (2007). doi:10.1097/01.aud.0000252470.54246.54
45. Sturim, D.E., Brandstein, M.S., Silverman, H.F.: Tracking multiple talkers using microphone-array measurements. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 371–374 (1997). doi:10.1109/ICASSP.1997.599650
46. Sun, L., Zhong, X., Yost, W.: Dynamic binaural sound source localization with interaural time difference cues: Artificial listeners. The Journal of the Acoustical Society of America 137(4), 2226–2226 (2015). doi:10.1121/1.4920636
47. Tamai, Y., Kagami, S., Amemiya, Y., Sasaki, Y., Mizoguchi, H., Takano, T.: Circular microphone array for robot's audition. In: Proceedings of IEEE Sensors, 2004, vol. 2, pp. 565–570 (2004). doi:10.1109/ICSENS.2004.1426228
48. Tamai, Y., Sasaki, Y., Kagami, S., Mizoguchi, H.: Three ring microphone array for 3D sound localization and separation for mobile robot audition. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4172–4177 (2005). doi:10.1109/IROS.2005.1545095
49. Tiete, J., Domínguez, F., Silva, B.d., Segers, L., Steenhaut, K., Touhafi, A.: SoundCompass: a distributed MEMS microphone array-based sensor for sound source localization. Sensors 14(2), 1918–1949 (2014). doi:10.3390/s140201918
50. Valin, J.M., Michaud, F., Rouat, J., Letourneau, D.: Robust sound source localization using a microphone array on a mobile robot. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), vol. 2, pp. 1228–1233 (2003). doi:10.1109/IROS.2003.1248813
51. Wallach, H.: On sound localization. The Journal of the Acoustical Society of America 10(4), 270–274 (1939). doi:10.1121/1.1915985
52. Wang, H., Chu, P.: Voice source localization for automatic camera pointing system in video conferencing. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 187–190 (1997). doi:10.1109/ICASSP.1997.599595
53. Yost, W.A., Zhong, X.: Sound source localization identification accuracy: Bandwidth dependencies. The Journal of the Acoustical Society of America 136(5), 2737–2746 (2014). doi:10.1121/1.4898045
54. Zhong, X., Sun, L., Yost, W.: Active binaural localization of multiple sound sources. Robotics and Autonomous Systems 85, 83–92 (2016). doi:10.1016/j.robot.2016.07.008
55. Zhong, X., Yost, W., Sun, L.: Dynamic binaural sound source localization with ITD cues: Human listeners. The Journal of the Acoustical Society of America 137(4), 2376–2376 (2015). doi:10.1121/1.4920636
56. Zietlow, T., Hussein, H., Kowerko, D.: Acoustic source localization in home environments – the effect of microphone array geometry. In: 28th Conference on Electronic Speech Signal Processing, pp. 219–226 (2017)
