Multi-Sound-Source Localization Using Machine Learning for Small Autonomous Unmanned Vehicles with a Self-Rotating Bi-Microphone Array
Abstract: While vision-based localization techniques have been widely studied for small autonomous unmanned vehicles (SAUVs), sound-source localization capabilities have not been fully enabled for SAUVs. This paper presents two novel approaches for SAUVs to perform three-dimensional (3D) multi-sound-source localization (MSSL) using only the inter-channel time difference (ICTD) signal generated by a self-rotating bi-microphone array. The two proposed approaches are based on two machine learning techniques, namely the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Random Sample Consensus (RANSAC) algorithms, whose performances are tested and compared in both simulations and experiments. The results show that both approaches are capable of correctly identifying the number of sound sources along with their 3D orientations in a reverberant environment.
Authors: Deepak Gala, Nathan Lindsay, Liang Sun
1 Introduction

Small autonomous unmanned vehicles (SAUVs, e.g., quadcopters and ground robots) have revolutionized civilian and military missions by creating a platform for observation and permitting access to locations that are too dangerous, too difficult, or too costly to send humans. The sensing capability of SAUVs has been enabled by various sensors, such as RGB cameras, infrared cameras, LiDARs, RADARs, and ultrasound sensors. However, these mainstream sensors are subject to either lighting conditions or line-of-sight requirements.
On the other end of the spectrum, because sound travels in all directions and can be transmitted even in the presence of small obstacles [1], sound sensors have great potential to overcome the line-of-sight constraints of the aforementioned sensors and provide SAUVs with omnidirectional, full-span sensing coverage. Sound-based sensing capabilities would significantly facilitate SAUVs in critical applications (e.g., search and rescue) and enable sociable and service robots (e.g., shopping assistants and restaurant waiters) to collaborate with humans in complicated scenarios [2, 3].

Among the sensing tasks for SAUVs, localization is of utmost significance [4]. While vision-based localization techniques have been developed based on cameras, sound source localization (SSL) has been achieved using microphone arrays with different numbers (e.g., 2, 4, 8, 16) of microphones. Although it has been reported that localization accuracy improves as the number of microphones increases [5, 6], this comes at the price of algorithm complexity and hardware cost, especially due to the expense of analog-to-digital converters (ADCs), which is proportional to the number of microphone channels. Humans and many other animals can locate sound sources with decent accuracy and responsiveness using only two ears together with head rotations to avoid ambiguity (i.e., the cone of confusion) [7].

1 Deepak Gala is with the Klipsch School of Electrical and Computer Engineering, New Mexico State University, Las Cruces, NM, 88001 USA. drgala@nmsu.edu
2 Nathan Lindsay is with the Department of Mechanical and Aerospace Engineering, New Mexico State University, Las Cruces, NM, 88001 USA. nl22@nmsu.edu
3 Liang Sun is with the Department of Mechanical and Aerospace Engineering, New Mexico State University, Las Cruces, NM, 88001 USA. lsun@nmsu.edu
SSL techniques using a self-rotating bi-microphone array have been reported in the literature [8-12]. To eliminate the cone of confusion [7], the bi-microphone array is rotated around the center of the robot/dummy head on the horizontal plane, so that an inter-channel time difference (ICTD) signal is generated whose data points form multiple discontinuous sinusoidal waveforms. Single-sound-source localization (SSSL) techniques with different numbers of microphones have been well studied [13], while reported multi-sound-source localization (MSSL) techniques typically require large microphone arrays with specific structures, which are not easy to mount on SAUVs. Pioneering work on MSSL assumed the number of sources to be known beforehand [14, 15]. Some of these approaches [16-18] are based on sparse component analysis (SCA), which requires the sources to be W-disjoint orthogonal [19] (i.e., in some time-frequency components, at most one source is active), thereby making them unsuitable for reverberant environments. Pavlidi et al. [20] and Loesch et al. [21] presented SCA-based methods that count and localize multiple sound sources but require one sound source to be dominant over the others in a time-frequency zone. Clustering methods have also been used to conduct MSSL [16-18]. Catalbas et al. [22] presented an approach for MSSL that uses four microphones and requires the sound sources to be present within a predefined boundary. The technique was limited to localizing sound orientations in the two-dimensional plane using k-medoids clustering, and the number of sound sources was calculated using the exhaustive elbow method, which is intuitive yet computationally expensive. Traa et al. [23] presented an approach that utilizes the time delay between the microphones in the frequency domain to model the phase differences in each frequency bin of a short-time Fourier transform.
Using the linear relationship between phase difference and frequency, the data were then clustered using random sample consensus (RANSAC). In our previous work [11, 12, 24, 25], we developed an SSSL technique based on an extended Kalman filter and an MSSL technique based on a cross-correlation approach, which was computationally expensive.

The contributions of this paper include two novel MSSL approaches for SAUVs. Both approaches are able to identify both the number of sound sources and their 3D locations using only a self-rotating bi-microphone array. In the first approach, a novel mapping mechanism is developed to convert the acquired ICTD signal to an orientation domain. Unsupervised classification is then conducted using Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [26]. The second approach is based on a sinusoidal ICTD regression using a RANSAC-based method. Both simulations and experiments were conducted to verify the proposed methodology.

The rest of the paper is organized as follows. In Section 2, the mathematical calculation of the ICTD signal generated by the self-rotating microphone array is presented. In Section 3, the mapping mechanism for regression and clustering is presented. Section 4 presents the two proposed approaches for MSSL. Simulation and experimental results are presented and discussed in Section 5. Section 6 concludes the paper.

2 Preliminaries

2.1 Inter-Channel Time Difference (ICTD)

The ICTD is the time difference between a sound signal arriving at two microphones and can be calculated using the cross-correlation technique [27, 28]. Consider a single stationary sound source and two spatially separated microphones placed in an environment.
Let $y_1(t)$ and $y_2(t)$ be the sound signals captured by the microphones in the presence of noise, which are given by [27]
$$y_1(t) = s(t) + n_1(t), \qquad y_2(t) = \delta \cdot s(t + t_d) + n_2(t),$$
where $s(t)$ is the sound signal, $n_1(t)$ and $n_2(t)$ are real and jointly stationary random noises, $t_d$ denotes the time difference of $s(t)$ arriving at the two microphones, and $\delta$ is the signal attenuation factor due to different traveling distances. It is commonly assumed that $\delta$ changes slowly and that $s(t)$ is uncorrelated with the noises $n_1(t)$ and $n_2(t)$ [27]. The cross-correlation of $y_1$ and $y_2$ is given by
$$R_{y_1,y_2}(\tau) = E[y_1(t) \cdot y_2(t - \tau)],$$
where $E[\cdot]$ represents the expectation operator. Various pre-filters that eliminate or reduce the effect of background noise and reverberation have been used prior to the cross-correlation [29-31]. The time difference of $y_1$ and $y_2$, i.e., the ICTD, is given by
$$\hat{T} \triangleq \arg\max_\tau R_{y_1,y_2}(\tau).$$
The distance difference of the sound signal traveling to the two microphones is given by
$$d \triangleq \hat{T} \cdot c_0,$$
where $c_0$ is the sound speed, usually taken as 345 m/s at the Earth's surface.

Remark 1. For simplicity, the signal $d$ is referred to as the ICTD in this paper. The ICTD is the only cue used in this paper for source counting and localization; none of the aforementioned scaling functions or pre-filters are used.

2.2 Far-Field Assumption

The five different fields around a sound source are the free field, near field, far field, direct field, and reverberant field [32, 33]. The region where the sound pressure and the acoustic particle velocity are not in phase is regarded as the near field. The far field of a source begins where the near field ends and extends to infinity. Under the far-field assumption, the acoustic wavefront reaching the microphones is planar rather than spherical, in the sense that the waves travel in parallel.
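The ICTD computation above can be sketched in a few lines (a minimal illustration using a plain sample-domain cross-correlation, consistent with Remark 1's statement that no pre-filters are used; the pulse signal, sampling rate, and 5-sample delay are made-up test values):

```python
import numpy as np

def ictd(y1, y2, fs, c0=345.0):
    """Estimate the ICTD between two microphone signals from the peak of
    their cross-correlation, then convert it to a path-length difference
    d = T_hat * c0 (metres)."""
    r = np.correlate(y1, y2, mode="full")
    # convert the argmax index to a signed lag l such that y1[n + l] ~ y2[n]
    lag = np.argmax(r) - (len(y2) - 1)
    t_hat = lag / fs        # seconds
    return t_hat * c0       # metres

# toy check: mic 2 hears a pulse 5 samples late, i.e. t_d = -5/fs in the
# paper's y2(t) = delta * s(t + t_d) convention
fs = 44100.0
s = np.zeros(256)
s[100] = 1.0
y1 = s
y2 = np.roll(s, 5)          # delayed copy of the pulse
print(ictd(y1, y2, fs))
```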
This means that the angle of incidence is the same for the two microphones. Further, it can be shown that for $D/b > 2.7$ the error of the far-field approximation drops below $0.5^\circ$, where $D$ is the distance from the sound source to the center of the microphone array and $b$ is the distance between the microphones [34].

2.3 Mathematical Model for the ICTD Signal

In this paper, the location of a single sound source is defined in a spherical coordinate frame whose origin is assumed to coincide with the center of a ground robot.

Fig. 1: Top-down view of the system.
Fig. 2: 3D view of the system.

As shown in Figs. 1 and 2, the left and right microphones, $L$ and $R$, collect the acoustic signal generated by the sound source $S$. Let $O$ be the center of the robot as well as of the bi-microphone array. The sound source location is represented by $(D, \theta, \phi)$, where $D$ is the distance between the source and the center of the robot, i.e., the length of segment $OS$; $\theta \in [0, \frac{\pi}{2}]$ is the elevation angle, defined as the angle between $OS$ and the horizontal plane; and $\phi \in (-\pi, \pi]$ is the azimuth angle, defined as the angle measured clockwise from the robot heading vector, $p$, to $OS'$. Letting the unit vector $q$ be the orientation (heading) of the microphone array, $\beta$ be the angle from $p$ to $q$, and $\psi$ be the angle from $q$ to $OS'$, both following a clockwise rotation rule, we have
$$\phi = \psi + \beta. \quad (1)$$

Fig. 3: Top-down view of the plane containing triangle $SOF$.

In the shaded triangle $SOF$ shown in Figs. 2 and 3, define $\alpha = \angle SOF$, and we have $\cos\alpha = \cos\theta \sin\psi$. Based on the far-field assumption [32], we have
$$d \triangleq \hat{T} \cdot c_0 = 2b \cos\alpha = 2b \cos\theta \sin\psi. \quad (2)$$
To avoid the cone of confusion [7] in SSL, we consider an ICTD signal generated by a microphone pair with time-varying positions, such as a self-rotating bi-microphone array.
Without loss of generality, we assume in this paper a clockwise rotation of the microphone array on the horizontal plane, while the robot itself does not rotate throughout the entire estimation process, which implies that $\phi$ in Equation (1) is constant. The rotation of the microphone array is assumed to be triggered by a sound detection mechanism, which is beyond the scope of this paper and hence will not be discussed. The initial heading of the microphone array is configured to coincide with the heading of the robot, i.e., $\beta(t=0) = 0$, which implies that $\phi = \psi(0)$. As the microphone array rotates clockwise with a constant angular velocity, $\omega$, we have $\beta(t) = \omega t$, and from Equation (1) we have
$$\psi(t) = \phi - \beta(t) = \phi - \omega t.$$
The resulting time-varying $d(t)$, from Equation (2), is then given by
$$d(t) = 2b \cos\theta \sin(-\omega t + \phi). \quad (3)$$
Because the microphone array rotates on the horizontal plane, $\theta$ does not change during the rotation for a stationary sound source. The resulting $d(t)$ is a sinusoidal signal with amplitude $A \triangleq 2b\cos\theta$, which implies that
$$\theta = \cos^{-1}\frac{A}{2b}. \quad (4)$$
It can be seen from Equation (3) that the phase angle of $d(t)$ is the azimuth angle of the sound source. Therefore, the location (i.e., the azimuth and elevation angles) of the sound source can be determined by estimating the characteristics (i.e., the amplitude and phase angle) of the sinusoidal signal $d(t)$.

The collection of the ICTD signal for multiple sound sources (as shown in Fig. 13a) forms a group of multiple discontinuous sinusoidal waveforms, each of which corresponds to a single sound source and satisfies the amplitude-elevation and phase-azimuth relationships mentioned above.

3 Model for Mapping and Sinusoidal Regression

The signal $d(t)$ in Equation (3) is sinusoidal, with amplitude $A = 2b\cos\theta$ and phase angle $\phi$ corresponding to the azimuth angle of the sound source.
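As a sanity check on Equations (3) and (4), one can synthesize $d(t)$ for a known source and read the elevation back from the amplitude and the azimuth back from the phase (a minimal sketch; the microphone separation $b$, rotation speed $\omega$, and source angles are arbitrary test values, not values from the paper's setup):

```python
import numpy as np

b = 0.18                  # microphone separation (m), test value
omega = 2 * np.pi / 5     # rotation speed (rad/s), test value
theta = np.deg2rad(30.0)  # true elevation
phi = np.deg2rad(200.0)   # true azimuth

# one full rotation sampled every degree, as in the paper's setup
t = np.linspace(0, 2 * np.pi / omega, 360, endpoint=False)
d = 2 * b * np.cos(theta) * np.sin(phi - omega * t)   # Equation (3)

A_hat = d.max()                          # amplitude of the sinusoid
theta_hat = np.arccos(A_hat / (2 * b))   # Equation (4)
# d peaks where phi - omega*t = pi/2, so the azimuth is omega*t_peak + pi/2
phi_hat = (omega * t[np.argmax(d)] + np.pi / 2) % (2 * np.pi)

print(np.rad2deg(theta_hat), np.rad2deg(phi_hat))
```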
Since the frequency, $\omega$, of $d(t)$ is the known rotational speed of the microphone array, the localization task (i.e., identifying $\theta$ and $\phi$) is to estimate the amplitude and phase angle of $d(t)$, i.e., $A$ and $\phi$. Consider a general form of $d(t)$ expressed as
$$d(t) = A_1 s_{\omega t} + A_2 c_{\omega t}, \quad (5)$$
where $s_{\omega t} = \sin(\omega t)$ and $c_{\omega t} = \cos(\omega t)$, and we have
$$A = \sqrt{A_1^2 + A_2^2}, \qquad \phi = \tan^{-1}\frac{A_2}{A_1}.$$
Consider two data points $y_1 = d(t_1)$ and $y_2 = d(t_2)$, collected at two distinct time instants $t_1$ and $t_2$, respectively; we have
$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} s_{\omega t_1} & c_{\omega t_1} \\ s_{\omega t_2} & c_{\omega t_2} \end{bmatrix} \begin{bmatrix} A_1 \\ A_2 \end{bmatrix}. \quad (6)$$
If $s_{\omega(t_2 - t_1)} \neq 0$, then we can obtain
$$A = \frac{\sqrt{y_1^2 + y_2^2 - 2 y_1 y_2 c_{\omega(t_2 - t_1)}}}{s_{\omega(t_2 - t_1)}}, \quad (7)$$
and
$$\phi = \tan^{-1}\frac{y_1 s_{\omega t_2} - y_2 s_{\omega t_1}}{y_2 c_{\omega t_1} - y_1 c_{\omega t_2}}. \quad (8)$$

4 Methodology

4.1 DBSCAN-Based MSSL

DBSCAN is one of the most popular nonlinear clustering techniques; it can discover arbitrarily shaped clusters of densely grouped points in a data set and outperforms other clustering methods in the literature [35, 36]. In the DBSCAN algorithm [35], a random point from the data set is considered a core cluster point when more than $m$ points (including itself) exist within a distance $\varepsilon$ (the epsilon ball) of it. The cluster is then extended by checking all other points satisfying the $\varepsilon$-$m$ criterion, thereby letting the cluster grow. A new arbitrary point is then chosen and the process is repeated. A point that is not part of any cluster and has fewer than $m$ points in its $\varepsilon$-ball is considered a "noise point". The DBSCAN technique is well suited to applications with noise and outperforms the k-means method, which requires prior knowledge of the number of clusters and their approximate initial centroids, and which is highly sensitive to noisy data points and to the selection of the initial centroids [37].
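A quick numerical check of the two-point mapping in Equations (7), (8), and (4) (a sketch assuming the $d(t) = A_1 s_{\omega t} + A_2 c_{\omega t}$ form above, i.e., $d(t) = A\sin(\omega t + \phi)$; `arctan2` is used in place of $\tan^{-1}$ to resolve the quadrant, and all numeric values are made-up test values):

```python
import numpy as np

def two_point_estimate(y1, t1, y2, t2, omega, b):
    """Map two ICTD samples (t1, y1) and (t2, y2) to an (elevation, azimuth)
    pair via Equations (7), (8), and (4); valid when sin(omega*(t2-t1)) != 0."""
    dt = omega * (t2 - t1)
    A = np.sqrt(y1**2 + y2**2 - 2 * y1 * y2 * np.cos(dt)) / abs(np.sin(dt))  # Eq. (7)
    phi = np.arctan2(y1 * np.sin(omega * t2) - y2 * np.sin(omega * t1),
                     y2 * np.cos(omega * t1) - y1 * np.cos(omega * t2))      # Eq. (8)
    theta = np.arccos(np.clip(A / (2 * b), -1.0, 1.0))                       # Eq. (4)
    return theta, phi

# check against a noiseless sinusoid with known elevation/azimuth
b, omega = 0.18, 2 * np.pi / 5
theta0, phi0 = np.deg2rad(40.0), np.deg2rad(70.0)
d = lambda t: 2 * b * np.cos(theta0) * np.sin(omega * t + phi0)
th, ph = two_point_estimate(d(0.3), 0.3, d(1.1), 1.1, omega, b)
print(np.rad2deg(th), np.rad2deg(ph))
```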
The proposed DBSCAN-based MSSL technique consists of two stages. In the first stage, the data points of the ICTD signal are mapped to the orientation (i.e., elevation-azimuth coordinate) domain. The data set consisting of all data points in a multi-source ICTD signal contains not only inliers but also outliers, which produce undesired mapped locations. When the number of inliers is significantly greater than the number of outliers, highly dense clusters form after a number of iterations. In the second stage, these clusters are detected using the DBSCAN technique by carefully selecting the parameters $m$ and $\varepsilon$. The number of clusters corresponds to the number of sound sources, and the centroids of the clusters represent the locations of the sound sources.

The complete DBSCAN-based MSSL algorithm is described in Algorithm 1. Two points in the data set are selected randomly and mapped into the orientation domain by calculating the angles $\theta$ and $\phi$ using Equations (3), (7), (4), and (8). A set $M := \{(\theta_1, \phi_1), (\theta_2, \phi_2), \ldots, (\theta_N, \phi_N)\}$ of these mapped points is then created, and the cluster detection process is started. A point $(\theta_i, \phi_i)$ in $M$ is randomly chosen and is decided to be a core cluster point or a noise point by checking the density-reachability criterion under the $m$-$\varepsilon$ condition [35]. The time complexity of Algorithm 1 is $O(N_D^2)$, where $N_D$ is the number of iterations for mapping and clustering. The selection of parameters for Algorithm 1 is discussed in Section 4.3.
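The clustering stage just described can be sketched with a compact density-based pass (a minimal illustration, not the authors' implementation: a simplified DBSCAN that labels core points, merges density-connected cores, and simply treats border points as noise; the clump centers, spreads, point counts, eps, and m below are made-up test values):

```python
import numpy as np

def dbscan_core_clusters(pts, eps, m):
    """Points with at least m neighbours (including themselves) within eps
    are core points; core points whose eps-balls overlap are merged into one
    cluster; everything else is treated as noise. Returns cluster centroids."""
    n = len(pts)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    core = (dist <= eps).sum(axis=1) >= m
    labels = -np.ones(n, dtype=int)          # -1 = noise / unvisited
    k = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        stack, labels[i] = [i], k            # grow cluster k from seed i
        while stack:
            j = stack.pop()
            for nb in np.where((dist[j] <= eps) & core)[0]:
                if labels[nb] == -1:
                    labels[nb] = k
                    stack.append(nb)
        k += 1
    return [pts[labels == c].mean(axis=0) for c in range(k)]

# two dense clumps of mapped (elevation, azimuth) points plus stray outliers,
# mimicking the mapped-point sets produced by the first stage
rng = np.random.default_rng(0)
a = rng.normal([20.0, 50.0], 0.5, size=(60, 2))    # source near (20 deg, 50 deg)
c = rng.normal([60.0, 300.0], 0.5, size=(60, 2))   # source near (60 deg, 300 deg)
noise = rng.uniform([0.0, 0.0], [90.0, 360.0], size=(15, 2))
cents = dbscan_core_clusters(np.vstack([a, c, noise]), eps=3.0, m=40)
print(len(cents))    # estimated number of sound sources
```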
Algorithm 1: DBSCAN-Based MSSL
1: Capture d(t) for one full rotation of the bi-microphone array
2: Select m and ε
3: Select the number of iterations N_D
4: for i = 1 to N_D do
5:   Randomly choose a non-repeated pair of points y_1 and y_2 from d, such that y_1 ≠ y_2 and the two do not equal zero simultaneously
6:   Calculate Â and φ̂ using Equations (7) and (8)
7:   Calculate θ̂_i using Equation (4) and set φ̂_i = φ̂
8: end for
9: for i = 1 to N_D do
10:  Randomly choose a pair (θ_i, φ_i) from the set M := {(θ_1, φ_1), (θ_2, φ_2), ..., (θ_{N_D}, φ_{N_D})}
11:  Calculate the distance between the chosen (θ_i, φ_i) and every other point in M
12:  if the number of points within range ε is greater than m then
13:    Label (θ_i, φ_i) as a core cluster point
14:  else
15:    Label (θ_i, φ_i) as a noise point
16:  end if
17: end for

4.2 RANSAC-Based MSSL

The RANSAC [38] algorithm iteratively uses a set of observed data to estimate the parameters of a mathematical model and can identify inliers in a data set that may contain a significantly large number of outliers. The input to the RANSAC algorithm includes a set of data, a parameterized model, and a confidence parameter (σ_conf). In each iteration, a subset of the original data is randomly selected and used to fit the predefined parameterized model. All other data points in the original data set are then tested against the fitted model. A point is determined to be an inlier of the fitted model if it satisfies the σ_conf condition. The process is repeated by selecting another random subset of the data. After a number of iterations, the parameters of the best-fitting estimated model (the one with the maximum number of inliers) are selected.
Algorithm 2: RANSAC-Based MSSL
1: Capture d(t) for one full rotation of the bi-microphone array
2: Select N_R, σ_conf and initialize e = 0
3: while there are samples in d do
4:   for j = 1 to N_R do
5:     Randomly choose a non-repeated pair of points y_1 and y_2 from d, such that y_1 ≠ y_2 and the two do not equal zero simultaneously
6:     Calculate Â and φ̂ using Equations (7) and (8)
7:     Calculate d̂ = Â sin(ωt + φ̂)
8:     Calculate count = number of points in d fitting d̂ within σ_conf
9:     if e < count then
10:      A_K = Â, φ_K = φ̂ and e = count
11:    end if
12:  end for
13:  Calculate θ_K using Equation (4)
14:  e = 0
15:  Remove the samples on d̂ within σ_conf from d
16: end while

The RANSAC-based MSSL method is described in Algorithm 2. It can be seen from Equation (3) that the signal d(t) generated by the self-rotating bi-microphone array is sinusoidal. Two points from the ICTD signal are selected randomly, and a sine wave with the given frequency (i.e., the angular speed of the rotation, ω) is generated. The count represents the number of points whose distance to the fitted sine wave is less than σ_conf, which is the threshold for a point to be considered an inlier. The points in d that belong to d̂ according to the σ_conf condition are then removed from d. This procedure is repeated for N_R iterations, and the parameters A_K and φ_K are updated every time the number of inliers is greater than that of the previous iterations. The process is repeated until either all the points in d are examined or N_R iterations are completed. The time complexity of Algorithm 2 is O(n² · N_R), where n is the number of samples in the ICTD. After the first few of the N_R iterations, most of the data points are removed, which makes n small compared to N_R.

4.3 Parameter Selection

The value of σ_conf can be chosen depending on how close together the sound sources may be and on the noise level.
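The fit-count-peel loop of Algorithm 2 can be sketched as follows (a minimal illustration, not the authors' implementation: it uses the d̂ = Â sin(ωt + φ̂) model of line 7, replaces the relative-confidence qualification of Section 4.3 with a fixed minimum inlier count, and crudely fakes a discontinuous two-source ICTD by interleaving samples; all numeric values are made-up test values):

```python
import numpy as np

def fit_two_points(t1, y1, t2, y2, omega):
    # closed-form amplitude/phase through two samples, Equations (7)-(8)
    dt = omega * (t2 - t1)
    A = np.sqrt(y1**2 + y2**2 - 2 * y1 * y2 * np.cos(dt)) / abs(np.sin(dt))
    phi = np.arctan2(y1 * np.sin(omega * t2) - y2 * np.sin(omega * t1),
                     y2 * np.cos(omega * t1) - y1 * np.cos(omega * t2))
    return A, phi

def ransac_mssl(t, d, omega, b, n_iter=500, sigma_conf=0.0157, min_count=30):
    """Repeatedly fit A sin(omega*t + phi) through two random ICTD samples,
    keep the model with the most inliers, peel those inliers off, and repeat
    until no model attracts at least min_count points."""
    rng = np.random.default_rng(1)
    sources = []
    while len(t) >= 2:
        best = (0, None)
        for _ in range(n_iter):
            i, j = rng.choice(len(t), size=2, replace=False)
            if abs(np.sin(omega * (t[j] - t[i]))) < 1e-6:
                continue                      # degenerate pair, Eq. (7) undefined
            A, phi = fit_two_points(t[i], d[i], t[j], d[j], omega)
            inl = np.abs(d - A * np.sin(omega * t + phi)) < sigma_conf
            if inl.sum() > best[0]:
                best = (inl.sum(), (A, phi, inl))
        if best[0] < min_count:
            break                             # remaining points are noise
        A, phi, inl = best[1]
        theta = np.arccos(np.clip(A / (2 * b), -1.0, 1.0))   # Eq. (4)
        sources.append((np.rad2deg(theta), np.rad2deg(phi) % 360))
        t, d = t[~inl], d[~inl]               # remove the explained samples
    return sources

# two sources plus mild sensor noise over one rotation (1-degree steps)
b, omega = 0.18, 2 * np.pi / 5
t = np.linspace(0, 5, 360)
d = 2 * b * np.cos(np.deg2rad(20.0)) * np.sin(omega * t + np.deg2rad(50.0))
half = np.arange(360) % 2 == 1                # every other sample: second source
d[half] = 2 * b * np.cos(np.deg2rad(60.0)) * np.sin(omega * t[half] + np.deg2rad(300.0))
d = d + np.random.default_rng(2).normal(0, 0.001, size=360)
print(ransac_mssl(t, d, omega, b))            # [(theta, phi), ...] in degrees
```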
The maximum number of possible unique combinations of the randomly chosen data points $y_1$ and $y_2$ is
$$C(u, v) = \frac{u!}{(u - v)!\, v!},$$
where $u$ is the number of data points and $v$ is the number of randomly chosen data points. However, the number of iterations, $N_D$, needs only to be large enough for the efficient formation of the clusters. The larger the value of $N_D$, the denser the clusters, which in turn affects the choice of $\varepsilon$ in Algorithm 1. On the other hand, $N_R$ should be chosen large enough to ensure that at least one of the sets of randomly selected points includes no outlier. The value of $N_R$ can be smaller than $N_D$ because all data points within the $\sigma_{conf}$ range are removed at each pass, which reduces the number of data points left to process. Further, both values also depend on how noisy the data is. The number of iterations for RANSAC-based MSSL can be calculated using
$$N_R = \frac{\log(1 - s)}{\log(1 - (1 - \epsilon)^w)},$$
where $s$ is the probability of success, $\epsilon$ is the outlier ratio, and $w$ is the number of points required to fit the model.

The number of sound sources is determined by carefully selecting a threshold, as shown in Fig. 4. The confidence in the presence of a sound source depends on the count value. The source with the maximum count is considered qualified with 100% confidence, and the confidence values for the other sources are calculated relative to it. A source with a confidence value less than the threshold is considered noise and is not qualified as a sound source.

Fig. 4: Confidence in the presence of sound sources.

5 Simulation and Experimental Results

5.1 Simulation and Experimental Setup

The Audio Array Toolbox [39] was used to establish an emulated rectangular room using the image method described in [40]. The robot was placed at the origin of the room.
The sound sources and the microphones are assumed omnidirectional, and the attenuation of the sound is calculated per the specifications in Table 1. Fig. 5 shows the simulation setup with the robot placed at the origin and four sound sources placed at different azimuth and elevation angles.

Table 1: Simulated room specifications
  Dimension: 20 m x 20 m x 20 m
  Reflection coefficient (walls, floor, and ceiling): 0.5
  Sound speed: 345 m/s
  Temperature: 22 °C
  Static pressure: 29.92 mmHg
  Relative humidity: 38 %

Fig. 5: Simulation setup showing four sound sources placed at S1(20°, 50°), S2(30°, 150°), S3(50°, 200°), and S4(60°, 300°), at a distance of 5 m from the center (origin), which is also the center of the microphone array on the robot.

A number of recorded audio signals available at [41] were used as sound sources to test the technique. Different numbers of sound sources were placed at various azimuth and elevation angles at a fixed distance of 5 m, and the ICTD signal was recorded by the rotating bi-microphone array with its microphones separated by a distance of 0.18 m. The sound sources were separated by at least 20° in azimuth and at least 10° in elevation. The ICTD value was calculated and recorded every 1° of rotation. Zero-mean noise with a variance (σ_noise) of 0.001 was added to the ICTD signal in simulations to account for sensor noise.

Experiments were conducted using a Kobuki Turtlebot 2 robot, shown in Fig. 6, mounted with a robotic platform, shown in Fig. 7, consisting of two microelectromechanical systems (MEMS) analog microphones, shown in Fig. 8. These experiments were conducted in an indoor environment with a reverberation time of RT60 = 670 ms (where RT60 is the time required for a sound to decay by 60 dB). Fig. 10 shows the impulse response of the room.
The experimental setup is shown in Fig. 9. The sampling frequency of the two microphones was 44,100 Hz while recording the signal. A microphone evaluation board assembly was used for data acquisition.

Fig. 6: Kobuki Turtlebot robot with the platform mounted with two microelectromechanical systems (MEMS) microphones.
Fig. 7: Robotic platform with the two MEMS microphones, the microphone evaluation board for acquisition of the sound signals recorded by the microphones, and the motor shield and Arduino board used to drive the bipolar stepper motor.

The rotation of the microphone array was driven by a bipolar stepper motor with its gear ratio adjusted to 0.9° per step, rotating at an angular velocity of 2π/5 rad/s under the control of an Arduino board. The distance between the two microphones was kept constant at 0.3 m. Audio clips were played through a number of loudspeakers, which served as the sound sources. The speakers were kept at different locations at a distance of 1.5 m from the robot, thereby satisfying the far-field approximation [34].

Fig. 8: One of the two MEMS microphones mounted on the robotic platform.
Fig. 9: Experimental setup showing four sound sources, S1(20°, 10°), S2(180°, 20°), S3(340°, 30°), and S4(220°, 60°), mounted on four poles. The poles are marked with the zero-degree elevation mark, which corresponds to the height of the bi-microphone array on the robotic platform placed at the center (origin).

Table 2: Parameters for RANSAC-based and DBSCAN-based MSSL
  Parameter   For simulations   For experiments
  σ_conf      0.0157 m          0.0261 m
  N_D         10000             10000
  N_R         5000              5000
  Threshold   10 %              7 %
  ε           3°                3°
  m           40                40
  σ_noise     0.001 m           -

Fig. 10: Impulse response of the room reverberation, showing secondary peaks representing the reflections from the floor and the walls.

5.2 Results and Discussion

Fig. 11: ICTD samples mapped to the orientation domain. (a) Simulation results showing the mapped data points for three sound sources placed at (30°, 340°), (10°, 20°), and (20°, 180°). (b) Experimental results showing the mapped data points for four sound sources placed at (30°, 340°), (10°, 20°), (60°, 220°), and (20°, 180°).

Figures 11a and 11b show the formation of the clusters during the mapping of the ICTD data points to their respective φ and θ angles using Equations (4), (7), and (8), in simulation and in the real environment, respectively.

Fig. 12: Clusters detected using the DBSCAN technique. (a) Simulation results showing the clusters detected using the DBSCAN-based MSSL technique; three sound sources placed at (30°, 340°), (10°, 20°), and (20°, 180°) were localized at (29.91°, 340.11°), (14.40°, 20.01°), and (21.02°, 180.58°), respectively. (b) Experimental results showing the clusters detected using the DBSCAN-based MSSL technique; four sound sources placed at (30°, 340°), (10°, 20°), (60°, 220°), and (20°, 180°) in the real environment were localized at (29.35°, 341.06°), (13.55°, 20.36°), (60.20°, 220.76°), and (22.87°, 182.05°), respectively.

Figures 12a and 12b show the detection of these clusters in simulation and in the real environment, respectively, using the DBSCAN technique for a sample run. The parameters used for the RANSAC- and DBSCAN-based algorithms are listed in Table 2. The value of σ_conf was chosen to be 0.0157 m for simulations and 0.0261 m for experiments, which implies that sound sources with the same azimuth are assumed to be separated by at least 5° in elevation. Table 3 shows the simulation and experimental localization results with the number of sound sources varying from one to four. 1000 Monte Carlo simulation runs were performed with each of the two proposed approaches, with the specifications given in Table 1. Simulations were run with K = 1, 2, 3, 4, 5 sources, and the source counting results are listed in Table 4.

Table 3: Mean absolute error (MAE) for localization performed with DBSCAN-based and RANSAC-based MSSL in simulation (Sim) and experiments (Expt) for different numbers of sound sources.

  DBSCAN
  Sources   φ Sim (deg)   θ Sim (deg)   φ Expt (deg)   θ Expt (deg)
  4         1.73          3.20          3.71           5.66
  3         0.80          4.68          1.77           5.78
  2         1.06          2.18          2.89           3.92
  1         0.94          0.57          2.27           0.35

  RANSAC
  4         0.88          8.12          3.92           7.20
  3         2.51          5.56          2.33           5.62
  2         2.55          5.01          3.07           3.88
  1         1.61          1.97          2.15           3.07

Table 4: Estimated vs. actual number of sound sources for DBSCAN-based and RANSAC-based MSSL in the simulated environment.
                         Estimated count
  Actual K     1      2      3      4      5      6      ≥7
  DBSCAN
  1            944    56     0      0      0      0      0
  2            11     902    65     15     7      0      0
  3            1      61     847    47     38     5      1
  4            12     39     126    687    82     36     18
  5            4      17     80     183    595    110    11
  RANSAC
  1            1000   0      0      0      0      0      0
  2            7      991    2      0      0      0      0
  3            0      56     898    46     0      0      0
  4            0      4      104    888    4      0      0
  5            0      0      1      85     750    139    25

Figures 13a and 13b show the results of a sample run of the RANSAC-based algorithm in simulation and in a real environment, respectively. Since the ICTD signal is noisy, any point within σ_conf = 0.015688 m of any of the estimated signals d_n was taken by the RANSAC algorithm to lie on the ICTD. The signal-to-noise ratio (SNR) of the measured signal d was 18.94 dB. For a source to be considered a qualified sound source, the confidence threshold that worked for us was 10% in simulation and 7% in experiments.

As shown in the top subfigure of Fig. 15, the average error of orientation localization with the DBSCAN-based algorithm is smaller than that of the RANSAC-based algorithm, which, however, generates comparatively more accurate results for source counting, as shown in the bottom subfigure of Fig. 15. In both simulations and experiments, the error of elevation-angle estimation was found to be large for sources kept close to zero elevation, which coincides with the conclusion in [12]. The performance of localization and source counting using both proposed techniques can be improved by increasing the number of rotations of the bi-microphone array.

Fig. 13: Estimated signals d_n from the multi-source ICTD signal d (each panel plots amplitude (m) of d versus the rotation angle ψ (deg)). (a) Simulation results of estimating the signals d_n from the multi-source signal d using the RANSAC-based MSSL technique. Three sound sources placed at (30°, 340°), (10°, 20°), and (20°, 180°) were localized at (33.11°, 338.86°), (14.40°, 22.01°), and (24.27°, 181.15°), respectively. (b) Experimental results of estimating the signals d_n from the multi-source signal d using the RANSAC-based MSSL technique. Four sound sources placed at (30°, 340°), (10°, 20°), (60°, 220°), and (20°, 180°) were localized at (32.41°, 341.26°), (13.69°, 20.86°), (58.20°, 219.02°), and (22.27°, 181.15°), respectively.

6 Conclusion and Future Scope

Two novel techniques are presented for small autonomous unmanned vehicles (SAUVs) to perform 3D multi-sound-source localization (MSSL) using a self-rotating bi-microphone array. The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) based MSSL approach iteratively maps randomly chosen points in the inter-channel time difference (ICTD) signal to the orientation domain, producing a data set for clustering. The number of clusters represents the number of sound sources, and the location of a cluster centroid represents the location of a sound source. The Random Sample Consensus (RANSAC) based approach iteratively estimates the parameters of a model using two randomly chosen data points from the ICTD signal data, and then uses a threshold to decide the number of qualified sound sources. The simulation and experimental results show the effectiveness of both approaches in identifying the number and the 3D orientations of the sound sources.

The techniques presented in this paper localize multiple stationary sound sources from a stationary robotic platform. Considerations of the presence of obstacles and of robot/sound motion motivate our future work.

Fig. 14: Experimental results for source count (each panel plots confidence (%) versus the number of sound sources N, with the decision threshold marked). (a) Experimental results for the confidence on the presence of sound sources placed at (30°, 340°), (10°, 20°), (60°, 220°), and (20°, 180°) using DBSCAN. (b) Experimental results for the confidence on the presence of sound sources placed at (30°, 340°), (10°, 20°), (60°, 220°), and (20°, 180°) using RANSAC.

Fig. 15: The top subfigure shows the average of the simulation and experimental localization errors, and the bottom subfigure shows the percentage error of sound-source-count identification in simulation, for the DBSCAN-based and RANSAC-based MSSL with different numbers of sound sources.

References

1. Q. Wang, K. Ren, M. Zhou, T. Lei, D. Koutsonikolas, and L. Su, "Messages behind the sound: real-time hidden acoustic signal capture with smartphones," in Proceedings of the 22nd Annual International Conference on Mobile Computing and Networking. ACM, 2016, pp. 29–41.
2. H.-J. Böhme, T. Wilhelm, J. Key, C. Schauer, C. Schröter, H.-M. Groß, and T. Hempel, "An approach to multi-modal human–machine interaction for intelligent service robots," Robotics and Autonomous Systems, vol. 44, no. 1, pp. 83–96, 2003.
3. J. C. Murray, H. Erwin, and S.
Wermter, "Robotics sound-source localization and tracking using interaural time difference and cross-correlation," in AI Workshop on NeuroBotics, 2004.
4. J. Borenstein, H. Everett, and L. Feng, Navigating Mobile Robots: Systems and Techniques. A K Peters Ltd., 1996.
5. D. V. Rabinkin, "Optimum sensor placement for microphone arrays," Ph.D. dissertation, Rutgers, The State University of New Jersey – New Brunswick, 1998.
6. M. Brandstein and D. Ward, Microphone Arrays: Signal Processing Techniques and Applications. Springer Science & Business Media, 2013.
7. H. Wallach, "On sound localization," The Journal of the Acoustical Society of America, vol. 10, no. 4, pp. 270–274, 1939.
8. S. Lee, Y. Park, and Y.-s. Park, "Three-dimensional sound source localization using inter-channel time difference trajectory," International Journal of Advanced Robotic Systems, vol. 12, no. 12, p. 171, 2015.
9. A. A. Handzel and P. Krishnaprasad, "Biomimetic sound-source localization," IEEE Sensors Journal, vol. 2, no. 6, pp. 607–616, 2002.
10. G. H. Eriksen, "Visualization tools and graphical methods for source localization and signal separation," Master's thesis, University of Oslo, Department of Informatics, 2006.
11. X. Zhong, W. Yost, and L. Sun, "Dynamic binaural sound source localization with ITD cues: Human listeners," The Journal of the Acoustical Society of America, vol. 137, no. 4, pp. 2376–2376, 2015.
12. D. Gala, N. Lindsay, and L. Sun, "Three-dimensional sound source localization for unmanned ground vehicles with a self-rotational two-microphone array," in Proceedings of the 5th International Conference of Control, Dynamic Systems, and Robotics (CDSR'18), 2018, pp. 104.1–104.11.
13. J.-M. Valin, F. Michaud, J. Rouat, and D. Létourneau, "Robust sound source localization using a microphone array on a mobile robot," in Intelligent Robots and Systems, 2003 (IROS 2003), 2003 IEEE/RSJ International Conference on, vol. 2. IEEE, 2003, pp. 1228–1233.
14. L. Sun and Q. Cheng, "Indoor multiple sound source localization using a novel data selection scheme," in 48th Annual Conference on Information Sciences and Systems (CISS). IEEE, 2014, pp. 1–6.
15. X. Zhong, L. Sun, and W. Yost, "Active binaural localization of multiple sound sources," Robotics and Autonomous Systems, vol. 85, pp. 83–92, 2016.
16. C. Blandin, A. Ozerov, and E. Vincent, "Multi-source TDOA estimation in reverberant audio using angular spectra and clustering," Signal Processing, vol. 92, no. 8, pp. 1950–1960, 2012.
17. M. Swartling, B. Sällberg, and N. Grbić, "Source localization for multiple speech sources using low complexity non-parametric source separation and clustering," Signal Processing, vol. 91, no. 8, pp. 1781–1788, 2011.
18. T. Dong, Y. Lei, and J. Yang, "An algorithm for underdetermined mixing matrix estimation," Neurocomputing, vol. 104, pp. 26–34, 2013.
19. O. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Transactions on Signal Processing, vol. 52, no. 7, pp. 1830–1847, 2004.
20. D. Pavlidi, A. Griffin, M. Puigt, and A. Mouchtaris, "Real-time multiple sound source localization and counting using a circular microphone array," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2193–2206, 2013.
21. B. Loesch and B. Yang, "Source number estimation and clustering for underdetermined blind source separation," in International Workshop on Acoustic Signal Enhancement (IWAENC), Seattle, Washington, USA, 2008.
22. M. C. Catalbas and S. Dobrisek, "3D moving sound source localization via conventional microphones," Elektronika ir Elektrotechnika, vol. 23, no. 4, pp. 63–69, 2017.
23. J. Traa and P. Smaragdis, "Blind multi-channel source separation by circular-linear statistical modeling of phase differences," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 4320–4324.
24. D. Gala and L. Sun, "Moving sound source localization and tracking using a self-rotating bi-microphone array," in ASME 2019 Dynamic Systems and Control Conference. American Society of Mechanical Engineers Digital Collection, 2019.
25. D. Gala, N. Lindsay, and L. Sun, "Realtime active sound source localization for unmanned ground robots using a self-rotational bi-microphone array," Journal of Intelligent & Robotic Systems, vol. 95, no. 3–4, pp. 935–954, 2019.
26. M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in KDD, vol. 96, no. 34, 1996, pp. 226–231.
27. C. Knapp and G. Carter, "The generalized correlation method for estimation of time delay," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320–327, Aug 1976.
28. M. Azaria and D. Hertz, "Time delay estimation by generalized cross correlation methods," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 280–285, 1984.
29. P. Naylor and N. D. Gaubitch, Speech Dereverberation. Springer Science & Business Media, 2010.
30. D. R. Gala, A. Vasoya, and V. M. Misra, "Speech enhancement combining spectral subtraction and beamforming techniques for microphone array," in Proceedings of the International Conference and Workshop on Emerging Trends in Technology (ICWET), 2010, pp. 163–166.
31. D. R. Gala and V. M. Misra, "SNR improvement with speech enhancement techniques," in Proceedings of the International Conference and Workshop on Emerging Trends in Technology (ICWET). ACM, 2011, pp. 163–166.
32. International Organization for Standardization (ISO), British, European and International Standards (BSEN), "Noise emitted by machinery and equipment – Rules for the drafting and presentation of a noise test code," 12001:1997 Acoustics.
33. B. Goelzer, C. H. Hansen, and G. Sehrndt, Occupational Exposure to Noise: Evaluation, Prevention and Control. World Health Organisation, 2001.
34. L. Calmes, "Biologically inspired binaural sound source localization and tracking for mobile robots," Ph.D. dissertation, RWTH Aachen University, 2009.
35. C. D. Raj, "Comparison of K-means, K-medoids, and DBSCAN algorithms using DNA microarray dataset," International Journal of Computational and Applied Mathematics (IJCAM), 2017.
36. N. Farmani, L. Sun, and D. J. Pack, "A scalable multitarget tracking system for cooperative unmanned aerial vehicles," IEEE Transactions on Aerospace and Electronic Systems, vol. 53, no. 4, pp. 1947–1961, Aug 2017.
37. M. E. Celebi, H. A. Kingravi, and P. A. Vela, "A comparative study of efficient initialization methods for the k-means clustering algorithm," Expert Systems with Applications, vol. 40, no. 1, pp. 200–210, 2013.
38. M. A. Fischler and R. C. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
39. K. D. Donohue, "Audio array toolbox," [Online] Available: http://vis.uky.edu/distributed-audio-lab/about/, 2019, May 20.
40. J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, 1979.
41. K. D. Donohue, "Audio systems lab experimental data – single-track single-speaker speech," [Online] Available: http://web.engr.uky.edu/~donohue/audio/Data/audioexpdata.htm, 2019, May 20.