SUBMITTED FOR PUBLICATION IN IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING, VOL. XX, NO. YY, MARCH 2018

Multiple Sound Source Localisation with Steered Response Power Density and Hierarchical Grid Refinement

Mert Burkay Çöteli, Student Member, IEEE, Orhun Olgun, Student Member, IEEE, and Hüseyin Hacıhabiboğlu, Senior Member, IEEE

Abstract—Estimation of the direction-of-arrival (DOA) of sound sources is an important step in sound field analysis. Rigid spherical microphone arrays allow the calculation of a compact spherical harmonic representation of the sound field. A basic method for analysing sound fields recorded using such arrays is steered response power (SRP) maps wherein the source DOA can be estimated as the steering direction that maximises the output power of a maximally-directive beam. This approach is computationally costly since it requires steering the beam in all possible directions. This paper presents an extension to SRP called steered response power density (SRPD) and an associated, signal-adaptive search method called hierarchical grid refinement (HiGRID) for reducing the number of steering directions needed for DOA estimation. The proposed method can localise coherent as well as incoherent sources while jointly providing the number of prominent sources in the scene. It is shown to be robust to reverberation and additive white noise. An evaluation of the proposed method using simulations and real recordings under highly reverberant conditions as well as a comparison with state-of-the-art methods are presented.

Index Terms—Source localization, rigid spherical microphone arrays, steered response power maps, direction-of-arrival estimation

I. INTRODUCTION

Sound source localization is an essential stage in sound field analysis and is used in a multitude of contexts including object-based audio, robot audition, and acoustic surveillance.
Spatial information encapsulated in the sound field provides a basis to localize sound sources. Such information can be extracted from recordings made with microphone arrays [1]. Recent years have seen progress in the development of the theory and applications of rigid spherical microphone arrays [2]. Different methods developed for DOA estimation using such arrays include those which are specific to array signal processing, subspace-based methods adapted from classical spectrum estimation theory, and methods that are based on the energetic analysis of sound fields [3]. These methods suffer from at least one of the following problems: they either have a high computational cost, require prior information about the number of sources, or fail when there are multiple coherent sources or a high level of reverberation.

The steered-response power (SRP) method¹ is based on the calculation of a spatial map of sound power calculated at different steering directions. Source DOAs correspond to the maxima of this distribution and can be obtained after the SRP map is calculated. Although SRP maps can provide excellent DOA estimation accuracy, two important problems prevent their widespread use in DOA estimation: 1) computational cost [5], and 2) robustness to additive noise [4].

[M. B. Çöteli, O. Olgun, and H. Hacıhabiboğlu are with the Spatial Audio Research Group (SPARG), Graduate School of Informatics, Middle East Technical University (METU), Ankara, TR-06800, Turkey, e-mail: {mbcoteli, oolgun, hhuseyin}@metu.edu.tr. The work reported in this paper is supported by the Turkish Scientific and Technological Research Council (TÜBİTAK) Research Grant 113E513 "Spatial Audio Reproduction Using Analysis-based Synthesis Methods". Manuscript received March 03, 2018.]
A new DOA estimation and source counting method robust to reverberation and additive noise that aims to reduce the computational cost of steered beamforming based sound source localization is proposed in this article. First, a measure of the spatially averaged steered response power, called the steered response power density (SRPD), is defined. SRPD is then interpreted as the probability of a sound source being present in a spherical quadrilateral sector and used in the adaptive refinement of a hierarchical search grid. This is done by selecting the new steering directions based on the change in the total information gain. Only the directional sectors that are likely to contain a source are selected and scanned, allowing DOA estimation without calculating the SRP map on the whole sphere. This is followed by the segmentation and labelling of the multiresolution SRPD maps, allowing the joint localization and counting of the prominent sources in the acoustic scene. The effectiveness of the algorithm is demonstrated via an extensive localization experiment. Operation of the proposed method under real-life conditions is demonstrated. A comparison of the proposed method with three state-of-the-art methods as well as the baseline SRP method is also presented.

The paper is structured as follows. A summary of earlier research using spherical microphone arrays for DOA estimation is given in Sec. II. A brief introduction to the theory of rigid spherical microphone arrays is given in Sec. III. Steered response power density (SRPD) is defined and its calculation from microphone array recordings is described in Sec. IV. A method using a signal-adaptive hierarchical search based on spatial entropy is proposed in Sec. V.
The computational cost of the proposed method is compared with that of SRP in Sec. VI. The evaluation of the proposed method under different emulated conditions is given, a comparison with the state-of-the-art methods is presented, and the utility of the method under real-life conditions is demonstrated in Sec. VII. Sec. VIII concludes the paper.

[¹ It should be noted that the SRP method that we are referring to in this article is based on plane-wave decomposition (PWD) using a maximally-directive beampattern and not, for example, minimum-variance distortionless response beamforming in the eigenbeam domain (EB-MVDR) [4].]

II. BACKGROUND

One of the most straightforward ways of DOA estimation using rigid spherical microphone arrays involves scanning all possible directions by steering a maximum directivity beam and calculating a map of the response power for each of these directions [6]. The maxima of this map occur at the DOAs of the sound sources present in the recorded scene. This brute-force approach, called the steered response power (SRP) map, incurs a computational cost that can be prohibitively high for practical applications.

The use of SRP for sound source localisation is also common for general microphone arrays which may or may not have a regular geometric structure (e.g. linear or planar arrays). In such cases, SRP maps are typically obtained via generalised cross correlation with phase transform (GCC-PHAT) [7] or its derivatives. Several different methods that aim to reduce the computational cost associated with this approach have been proposed. For example, stochastic region contraction (SRC) [8] and coarse-to-fine region contraction (CFRC) [9] are two methods that search for the volumetric region with the global SRP maximum by gradually shrinking the search grid boundaries.
Other methods search for the global SRP maximum from an SRP map obtained via a spatially averaged GCC-PHAT by iteratively decomposing the search volume into subvolumes [10], [11] or use the branch and bound method to subdivide the search grid to obtain the global SRP maximum [12].

Subspace-based DOA estimation methods such as multiple signal classification in the eigenbeam domain (EB-MUSIC) and estimation of signal parameters via rotational invariance in the eigenbeam domain (EB-ESPRIT) [13], [4] are adaptations of the well-known spectrum estimation methods to the problem of DOA estimation. These methods provide accurate DOA estimations. However, they require a priori information on the number of sources and are computationally costly. Also, while EB-ESPRIT has a lower computational cost in comparison with EB-MUSIC, it may still fail for directions where the estimation function is singular [6].

Methods that aim to reduce the computational cost of DOA estimation with rigid spherical microphone arrays have also been proposed. One of these methods, inspired by energetic analysis of sound fields, uses pseudo-intensity vectors (PIV) calculated from zeroth and first-order spherical harmonics [5], [14]. While the PIV method was shown to be successful in localizing multiple sources when these sources are approximately W-disjoint orthogonal [15] (e.g. speech), its performance will decrease for coherent sources. Such a situation may occur, for example, in cases with multiple musical instruments or a source and its strong specular reflections. More recently, two extensions to the PIV method have been proposed in order to improve its estimation accuracy. The augmented intensity vectors (AIV) method extends the PIV concept using higher-order spherical harmonics [16], [17].
Subspace PIV (SSPIV) uses DOA estimations obtained from PIVs in the signal subspace calculated from the time-frequency smoothed spatial covariance matrix via singular value decomposition (SVD) [18]. Similarly, a multisource estimation consistency metric, which provides a measure of estimation confidence, can be used to improve accuracy for multiple source scenarios [19].

The desire to exploit the DOA estimation accuracy of EB-MUSIC prompted the investigation of ways to decrease the computational cost it incurs. One possibility to achieve this aim is to estimate DOAs using EB-MUSIC only at time-frequency bins for which only a single source is active [20], [21]. For such bins, it is possible to deduce the dimensions of the signal and noise subspaces and also to reduce the number of time-frequency bins over which the EB-MUSIC spectrum is to be calculated. The direct-path dominance (DPD) test uses the ratio of the largest two singular values of the time-frequency smoothed spatial covariance matrix in order to identify bins with only one active source [22]. The DPD test was also used as a pre-processing stage for the PIV method [23]. While the DPD test identifies time-frequency bins that contain a single source, it is also possible to use the correlation coefficient between microphone pairs to identify directional zones that contain a single source and search for sources (e.g. by steered beamforming) in those zones only [24], [25]. This approach is conceptually similar to the method proposed in this article.

III. RIGID SPHERICAL MICROPHONE ARRAYS

A function defined on the unit sphere can be represented using the spherical harmonic functions as a basis such that:

f(\theta, \phi) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} f_{nm} Y_n^m(\theta, \phi).   (1)

Here, the spherical harmonic functions of order n ∈ ℕ and degree m ∈ ℤ are defined as:

Y_n^m(\theta, \phi) = \sqrt{\frac{2n+1}{4\pi} \frac{(n-m)!}{(n+m)!}} \, P_n^m(\cos\theta) \, e^{im\phi}   (2)

with n ≥ 0, |m| ≤ n, where P_n^m(·) are the associated Legendre functions, θ and φ are the inclination and azimuth angles, and

f_{nm} = \int_0^{2\pi} \int_0^{\pi} f(\theta, \phi) [Y_n^m(\theta, \phi)]^* \sin\theta \, d\theta \, d\phi   (3)

are the corresponding spherical harmonic coefficients, respectively. The projection of the function, f(θ, φ), onto the spherical harmonic basis as described above is called the spherical harmonic decomposition (SHD). Since SHD results in a compact representation of bandlimited functions and distributions on a sphere using only a few non-zero SHD coefficients, it is used in many different fields of physics as well as in signal processing including applications in acoustics [26].

If the pressure distribution is known at each point on a sphere, (3) can be used to obtain the SHD directly. However, in real-life applications pressure can be sampled only at a finite number of points on the sphere. If the pressure is sampled at Q ≥ (N + 1)² discrete points, (θ_q, φ_q), the spherical harmonic components can be calculated using the corresponding spherical quadrature up to a maximum order of N as:

p_{nm}(k) = \sum_{q=1}^{Q} w_q \, p(\theta_q, \phi_q, k) [Y_n^m(\theta_q, \phi_q)]^*   (4)

where k = 2πf/c is the wave number, c is the speed of sound, and w_q are the quadrature weights. In order for this expression to converge to the real SHD coefficients, the spherical sampling scheme has to satisfy the discrete orthonormality condition [27].
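As a concrete illustration of (4), the sketch below computes discrete SHD coefficients with SciPy's spherical harmonics on a Gauss-Legendre sampling scheme, one choice that satisfies the discrete orthonormality condition up to order N. The grid construction and the `shd` helper are illustrative assumptions, not the authors' implementation; note that SciPy's `sph_harm(m, n, az, incl)` takes azimuth first and inclination second.

```python
import numpy as np
from scipy.special import sph_harm

N = 3                                                   # maximum SHD order
x, w_gl = np.polynomial.legendre.leggauss(N + 1)        # nodes/weights in cos(theta)
theta = np.arccos(x)                                    # inclinations theta_q
phi = 2 * np.pi * np.arange(2 * N + 2) / (2 * N + 2)    # equiangular azimuths phi_q
TH, PH = np.meshgrid(theta, phi, indexing='ij')
W = np.outer(w_gl, np.full(phi.size, 2 * np.pi / phi.size))  # quadrature weights w_q

def shd(p):
    """Discrete SHD of eq. (4): p_nm = sum_q w_q p(theta_q, phi_q) Y_n^m(theta_q, phi_q)*."""
    coeffs = {}
    for n in range(N + 1):
        for m in range(-n, n + 1):
            Y = sph_harm(m, n, PH, TH)
            coeffs[(n, m)] = np.sum(W * p * np.conj(Y))
    return coeffs

# Round-trip check: the SHD of Y_2^1 should be 1 at (n, m) = (2, 1), 0 elsewhere
p = sph_harm(1, 2, PH, TH)
c = shd(p)
print(abs(c[(2, 1)]))    # close to 1
print(abs(c[(1, 0)]))    # close to 0
```

With (N + 1) Gauss-Legendre nodes in inclination and 2N + 2 uniform azimuths, products of spherical harmonics up to order N are integrated exactly, which is precisely the discrete orthonormality requirement.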
A spatially bandlimited approximation of the pressure around a rigid sphere of radius r_a, located at the origin, due to a unit amplitude plane wave incident from the direction (θ_S, φ_S) and with a frequency f, can be written for r ≥ r_a as:

p(\theta, \phi, k) = \sum_{n=0}^{N} \sum_{m=-n}^{n} p_{nm}(k) Y_n^m(\theta, \phi),   (5)

where

p_{nm}(k) = 4\pi i^n b_n(k r_a) [Y_n^m(\theta_S, \phi_S)]^*   (6)

with

b_n(kr) = j_n(kr) - \frac{j'_n(k r_a)}{h^{(2)\prime}_n(k r_a)} h^{(2)}_n(kr).   (7)

Here, j_n(·), h_n^{(2)}(·), j'_n(·), and h_n^{(2)\prime}(·) are the spherical Bessel function of the first kind, the spherical Hankel function of the second kind, and their derivatives with respect to their arguments, respectively [2]. It may be observed that an important benefit of this representation is the decoupling of direction and frequency dependent terms.

Rigid spherical microphone arrays consist of a number of pressure sensitive microphones positioned at appropriate quadrature nodes on the surface of a rigid spherical baffle. Such arrays have gained popularity in acoustic scene analysis applications as they possess spherical symmetry and allow for an almost trivial calculation of spherical harmonic coefficients subject to the order limitations mentioned above. Spatial aliasing that results from using a finite number of samples on the sphere is discussed elsewhere [28].

IV. STEERED RESPONSE POWER DENSITY

In a realistic scenario where multiple sources or a source and its reflections are present, the SHD coefficients will consist of a linear combination of multiple directional terms:

p_{nm}(k) = 4\pi i^n b_n(k r_a) \sum_{s=1}^{S} \alpha_s(k) [Y_n^m(\theta_s, \phi_s)]^*   (8)

where α_s(k) ∈ ℂ is the amplitude of a single plane wave.
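The mode strength b_n of eq. (7) can be evaluated with SciPy's spherical Bessel functions, using h_n^{(2)} = j_n − i y_n. This is a sketch under our own naming (`mode_strength`), not the authors' code; the example wave number corresponds to the 4.2 cm aperture and 3 kHz frequency used later in the paper, assuming c = 343 m/s.

```python
import numpy as np
from scipy.special import spherical_jn, spherical_yn

def mode_strength(n, kr, kra):
    """b_n(kr) of eq. (7) for a rigid sphere of radius r_a (kra = k * r_a),
    with h_n^(2) = j_n - i*y_n; illustrative helper, not the authors' code."""
    h2 = spherical_jn(n, kr) - 1j * spherical_yn(n, kr)
    h2p_a = (spherical_jn(n, kra, derivative=True)
             - 1j * spherical_yn(n, kra, derivative=True))
    return spherical_jn(n, kr) - spherical_jn(n, kra, derivative=True) / h2p_a * h2

# e.g. r_a = 4.2 cm, f = 3 kHz, c = 343 m/s  ->  k*r_a ~ 2.31
kra = 2 * np.pi * 3000 / 343 * 0.042
for n in range(4):
    print(n, abs(mode_strength(n, kra, kra)))
```

On the array surface (kr = kr_a = x) the Wronskian of j_n and y_n reduces (7) to b_n(x) = −i / (x² h_n^{(2)\prime}(x)), a useful sanity check for any implementation.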
Using the completeness property of the spherical harmonic functions [2], it can be shown that:

y(\theta, \phi, k) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} \frac{p_{nm}(k)}{4\pi i^n b_n(k r_a)} Y_n^m(\theta, \phi)   (9)
                   = \sum_{s=1}^{S} \alpha_s(k) \, \delta(\cos\theta - \cos\theta_s) \, \delta(\phi - \phi_s),

where δ(·) is the Dirac delta function. In other words, if there is no order limitation on the SHD, the above operation will result in a combination of a finite number, S, of Dirac impulses on the unit sphere with complex amplitudes, α_s(k), located at directions that correspond to the directions of the individual plane waves, (θ_s, φ_s). This is called plane wave decomposition (PWD) [29]. Only a spatially bandlimited approximation of PWD can be obtained with rigid spherical microphone arrays due to the order limitation mentioned above, such that:

y_N(\theta, \phi, k) = \sum_{n=0}^{N} \sum_{m=-n}^{n} \frac{p_{nm}(k)}{4\pi i^n b_n(k r_a)} Y_n^m(\theta, \phi).   (10)

It may then be shown by the spherical harmonics addition theorem [30] that the effect of order limitation results in spatially bandlimited impulses:

y_N(\theta, \phi, k) = \frac{N+1}{4\pi} \sum_{s=1}^{S} \alpha_s(k) \, \frac{P_{N+1}(\cos\Theta_s) - P_N(\cos\Theta_s)}{P_1(\cos\Theta_s) - P_0(\cos\Theta_s)}

where Θ_s is the angle between the direction of the incident plane wave, (θ_s, φ_s), and the analysis direction, (θ, φ), and P_n(·) is the Legendre polynomial of order n. The resulting beam pattern is called the regular beam pattern [31] or plane-wave decomposition beam pattern [30] and is maximally directive. The functional y_N(θ, φ, k) is called the steered response power (SRP). When it is interpreted as a distribution on the unit sphere, the spatial detail that it can resolve, also known as the Rayleigh condition, is approximately π/N [29]. In other words, only sources that have more than π/N separation between their DOAs can be discriminated as separate, independent sources.
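The closed-form regular beampattern above follows from summing (2n + 1)/(4π) P_n(cos Θ) over orders n = 0…N (Christoffel-Darboux, with P_1 − P_0 = cos Θ − 1). A minimal sketch, assuming a single unit-amplitude plane wave (α_s = 1), cross-checks the closed form against the direct sum:

```python
import numpy as np
from numpy.polynomial.legendre import legval

def regular_beam(N, cos_T):
    """Closed-form order-N regular (PWD) beampattern:
    (N+1)/(4*pi) * (P_{N+1} - P_N) / (P_1 - P_0), with P_1 - P_0 = cos_T - 1."""
    PN = legval(cos_T, [0.0] * N + [1.0])          # P_N(cos_T)
    PN1 = legval(cos_T, [0.0] * (N + 1) + [1.0])   # P_{N+1}(cos_T)
    return (N + 1) / (4 * np.pi) * (PN1 - PN) / (cos_T - 1.0)

def regular_beam_sum(N, cos_T):
    """Same pattern via the direct sum over orders: sum_n (2n+1)/(4*pi) P_n."""
    return legval(cos_T, (2 * np.arange(N + 1) + 1) / (4 * np.pi))

print(regular_beam(4, 0.3), regular_beam_sum(4, 0.3))
```

The closed form is singular at cos Θ = 1 (the steering direction itself), where the pattern attains its maximum value (N + 1)²/(4π) in the limit.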
DOA estimation using SRP involves finding the steering direction, (θ_u, φ_u), that maximises the power, such that:

(\theta_u, \phi_u)_k = \arg\max_{\theta_u, \phi_u} |y_N(\theta_u, \phi_u, k)|^2.   (11)

For multiple sources, the SRP functional will have multiple peaks corresponding to source DOAs. These can be accurately identified by using a dense search grid with a desired resolution. However, this brute-force approach is usually not suitable for practical applications due to its high computational cost. For the case where a sparse, multiresolution search grid is to be used, SRP can miss peaks especially at low resolutions. A more appropriate functional which can be used for source localization is proposed in this article.

We define the steered response power density (SRPD) for an arbitrary, bounded surface element, S_i, on the unit sphere as:

P_i(k) = \frac{1}{A_i} \int_{S_i} |y_N(\theta, \phi, k)|^2 \, dS_i   (12)

where A_i ≜ A(S_i) is the area of the surface element. Note that SRP is related to SRPD as:

\lim_{A_i \to 0} P_i(k) = |y_N(\theta_i, \phi_i, k)|^2   (13)

where (θ_i, φ_i) is the centre of the differential region, S_i.

It is possible to express SRPD using (10) as:

P_i(k) = \sum_{n,m,n',m'} \frac{p_{nm}(k) \, p^*_{n'm'}(k)}{i^{n-n'} \, b_n(k r_a) \, b^*_{n'}(k r_a)} \, Q_{n,m}^{n',m'}(S_i)   (14)

where:

Q_{n,m}^{n',m'}(S_i) = \frac{1}{(4\pi)^2 A_i} \int_{S_i} Y_n^m(\theta, \phi) \left[ Y_{n'}^{m'}(\theta, \phi) \right]^* dS_i.   (15)

Here, the decoupling of the frequency and the direction-dependent terms via SHD allows isolating the integration to terms which do not depend on the SHD coefficients but only on the analysis region. It is possible to express the sum in (14) as the grand sum of the matrix:

H_i = P \circ Q_i   (16)

where ∘ represents the Hadamard (i.e. element-wise) product, P = p p^H is an (N + 1)² × (N + 1)² matrix with

p = \left[ \frac{p_{00}(k)}{4\pi b_0(k r_a)}, \frac{p_{1,-1}(k)}{4\pi i \, b_1(k r_a)}, \cdots, \frac{p_{NN}(k)}{4\pi i^N b_N(k r_a)} \right]^T   (17)

and the (N + 1)² × (N + 1)² cross spatial density matrix Q_i is given as:

Q_i = \begin{bmatrix}
Q_{0,0}^{0,0}(S_i) & Q_{0,0}^{1,-1}(S_i) & \cdots & Q_{0,0}^{N,N}(S_i) \\
Q_{1,-1}^{0,0}(S_i) & Q_{1,-1}^{1,-1}(S_i) & \cdots & Q_{1,-1}^{N,N}(S_i) \\
\vdots & \vdots & \ddots & \vdots \\
Q_{N,N}^{0,0}(S_i) & Q_{N,N}^{1,-1}(S_i) & \cdots & Q_{N,N}^{N,N}(S_i)
\end{bmatrix}.   (18)

Note that both P and Q_i are Hermitian. The grand sum of H_i can be represented as:

P_i = e^T H_i e = e^T (P \circ Q_i) e   (19)

where e is a column vector of ones. This expression can be simplified by employing an identity of the Hadamard product [32] and the eigendecomposition of the cross spatial density matrix such that:

P_i = \mathrm{tr}(P Q_i^T) = \mathrm{tr}(P V_i D_i V_i^H)   (20)

where tr(·) is the trace operator. Here, the columns of V_i are the eigenvectors, and the diagonal matrix D_i contains the eigenvalues λ_{i,m} of Q_i^T. Using the cyclic permutation invariance property of the trace operator [33]:

P_i = \mathrm{tr}(V_i^H P V_i D_i)   (21)
    = e^T \left[ (V_i^H P V_i) \circ D_i \right] e.   (22)

Since the eigenvalue matrix D_i is diagonal, the grand sum of the Hadamard product will only contain a sum of the diagonal elements of V_i^H P V_i weighted by the eigenvalues, λ_{i,m}, of the cross spatial density matrix, such that the SRPD is given as:

P_i = \sum_m \lambda_{i,m} \left[ V_i^H P V_i \right]_{m,m},   (23)

or more compactly as:

P_i = \| V_i^H p \|^2_{D_i}   (24)

where ||x||_D = (x^H D x)^{1/2} represents the weighted norm. In other words, once the SHD coefficients are known, SRPD can be calculated using the eigenvalues and the eigenvectors of the cross spatial density matrix, Q_i. The expression above assumes that all the eigenvalues and eigenvectors of the spatial density matrix are employed. However, most of the eigenvalues will be very small depending on the area over which the SRPD is calculated.
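The chain of identities (19)-(24) can be verified numerically. The sketch below uses a random vector p and a random Hermitian positive semidefinite matrix as stand-ins for the actual SHD vector and cross spatial density matrix Q_i; the point is only that the grand sum of P ∘ Q_i equals the eigenvalue-weighted norm.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 16                                        # (N + 1)^2 for N = 3
p = rng.normal(size=K) + 1j * rng.normal(size=K)
P = np.outer(p, p.conj())                     # P = p p^H, rank one, Hermitian
A = rng.normal(size=(K, K)) + 1j * rng.normal(size=(K, K))
Q = A @ A.conj().T                            # stand-in Hermitian PSD Q_i

# eq. (19): grand sum of the Hadamard product
grand = np.ones(K) @ (P * Q) @ np.ones(K)

# eqs. (20)-(24): eigendecomposition of Q_i^T and the weighted norm
lam, V = np.linalg.eigh(Q.T)
weighted = np.sum(lam * np.abs(V.conj().T @ p) ** 2)

print(np.allclose(grand, weighted))           # True
```

This is why the per-sector integrals never have to be recomputed at run time: V_i and λ_{i,m} depend only on the sector geometry and can be precomputed, leaving a weighted norm of V_i^H p per sector.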
Fewer eigenvalues and eigenvectors can be used in order to reduce the computational cost associated with the calculation of SRPD. Defining an energy ratio as:

L_{i,M} \triangleq \frac{\sum_{m=1}^{M} \lambda_{i,m}^2}{\sum_{m=1}^{(N+1)^2} \lambda_{i,m}^2}   (25)

the largest M < (N + 1)² eigenvalues and the corresponding eigenvectors can be selected, for which this ratio is greater than a given threshold close to unity. This way the computational cost can be substantially reduced without having a detrimental effect on the calculated SRPD for the corresponding area. In the examples that we present in the article, we use the largest M eigenvalues and eigenvectors such that L_{i,M} ≥ 0.99.

The most important difference between SRP and SRPD is that of coverage: while SRP corresponds to the response power for a distinct steering direction, SRPD corresponds to the power density in an area of interest. As will be evident in the next section, the latter is more suitable when a multiresolution search grid is used for DOA estimation.

V. SPATIAL ENTROPY BASED GRID REFINEMENT

The method proposed in this paper aims to provide accurate DOA estimations comparable with those provided by SRP but with a lower computational cost, for multiple, possibly coherent sources. This is attained by using a hierarchical, multiresolution search grid which is refined based on information gain. The proposed method is based on the clustering of local DOA estimates obtained from individual bins in the time-frequency domain. For this purpose, short-time Fourier transforms (STFT) of each microphone channel from the spherical microphone array are obtained first. This is followed by a time-frequency bin selection process that chooses bins which are likely to contain one or more strong direct path components. We use time-frequency bins that correspond to onsets. Note that, unlike the DPD-test, such selection does not preclude cases where multiple coherent onsets are present.
The method, which we will call hierarchical grid refinement (HiGRID) henceforth, is applied using the SHD coefficients of the selected time-frequency bins. The output of HiGRID for each time-frequency bin is an SRPD map defined on a sparse, multiresolution grid. Segmentation and labelling of this map will jointly provide the number of sources and their DOAs for the analysed bin. This is achieved by using a variant of connected components labelling (CCL) followed by the calculation of cluster centroids on the labelled SRPD map segments. A global histogram of the centroids obtained from all the processed TF bins is then used to estimate the source DOAs.

The diagram in Fig. 1 shows the proposed algorithm, which consists of: 1) pre-processing, 2) SRPD map calculation via grid refinement, 3) local DOA estimation, and 4) post-processing stages. These stages will be explained in more detail in the following sections.

[Fig. 1. The flow diagram of the proposed algorithm. The core part is highlighted with a blue box.]

A. Pre-processing: STFT, SHD, and Onset Detection

Pre-processing involves obtaining the time-frequency representations of the microphone array signals, selecting the time-frequency bins over which DOA estimations will be made, and calculating the SHD coefficients for these bins.

In the first preprocessing stage, a windowed Fourier transform is used to represent the recorded microphone signals in the time-frequency domain, p_q(τ, κ), where q is the index of the microphone, and τ and κ are the time and frequency indices, respectively.
The second preprocessing stage involves the selection of bins at which the proposed algorithm will be applied. For this purpose, spectrum based onset detection is applied on the omnidirectional component,

p_o(\tau, \kappa) \approx \frac{1}{Q} \sum_{q=1}^{Q} p_q(\tau, \kappa),   (26)

in order to obtain the time indices, {τ̂}, at which the onsets occur. This is followed by the selection of the bins, (τ̂, κ̂), whose normalized energy is above a threshold. While many different onset detection algorithms could be suitable for this purpose, we use the SuperFlux algorithm that is robust to short-time frequency modulations [34]. The set of bins that are selected this way will be represented as T ≜ {(τ̂, κ̂)}.

The selection of time-frequency bins based on onset detection is not a strictly necessary step for localizing multiple sources. However, it is done for reasons based on the presupposition that sound sources of interest such as speech and music will always have onsets. These reasons are as follows²: 1) since the direct path from at least one of the sources will always be present at the onset time frames, selecting these frames for analysis would improve estimation accuracy, and 2) the number of time-frequency bins to be processed, and thus the associated computational cost, would be significantly reduced.

[² While onset detection was chosen for its simplicity, it is not the only way to select TF bins. The proposed method can almost trivially be adapted for use with other TF bin selection methods such as the DPD test [22] or the sound field directivity test [35].]

In the third pre-processing stage, the SHD coefficients are calculated as in (4) such that:

p_{nm}(\hat\tau, \hat\kappa) = \sum_{q=1}^{Q} w_q \, p_q(\hat\tau, \hat\kappa) [Y_n^m(\theta_q, \phi_q)]^*   (27)

for all the selected bins, (τ̂, κ̂) ∈ T. These coefficients are used in the calculation of SRPD maps at their respective time-frequency locale.
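The bin-selection step of eq. (26) can be sketched as follows. The onset frame indices are assumed to come from an external onset detector (SuperFlux in the paper); the helper name `select_bins` and the threshold handling are our own illustrative assumptions.

```python
import numpy as np

def select_bins(P, onset_frames, energy_thresh=0.1):
    """TF-bin selection sketch around eq. (26). P is a (Q, T, K) array of
    microphone STFTs; onset_frames holds the frame indices {tau_hat}
    returned by an onset detector (hypothetical input here). Returns the
    set T of (frame, bin) pairs with normalized energy >= energy_thresh."""
    p_o = P.mean(axis=0)                      # omnidirectional estimate, eq. (26)
    E = np.abs(p_o[onset_frames, :]) ** 2     # energy at onset frames only
    E = E / E.max()                           # normalize to [0, 1]
    fr, kb = np.nonzero(E >= energy_thresh)
    return {(int(onset_frames[f]), int(k)) for f, k in zip(fr, kb)}
```

Selecting a small subset of (τ̂, κ̂) pairs is what keeps the downstream HiGRID cost manageable, since the refinement is run once per selected bin.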
B. Hierarchical Grid Refinement (HiGRID)

The proposed algorithm consists of the identification and refinement of regions of interest in the SRPD map that contain sources, followed by the clustering of contiguous regions to count and localize these sources.

1) Analysis Grid: The selection of a suitable spherical analysis grid is essential. A necessary property is that it is hierarchical, such that each refinement of the grid also contains the coarser grid and each grid element can be represented as a node in a tree structure. A desirable property is that the grid is approximately, even if not exactly, uniform. Hierarchical Equal Area isoLatitude Pixelization (HEALPix) [36] is used here as it allows an efficient representation of spherical data using quadrilateral, non-overlapping, and refinable grid elements (i.e. pixels) on a sphere. At the resolution level, l ∈ ℕ, HEALPix provides a spherical tessellation with 12 · 2^{2l} equal area grid elements. The elements are separated uniformly with an angular resolution (in radians) of:

\Theta_\Delta = \sqrt{\frac{3}{\pi}} \, \frac{\pi}{3 \cdot 2^l},   (28)

and their areas depend on the resolution level such that:

A_l = \frac{4\pi R^2}{12 \cdot 2^{2l}},   (29)

where R is the radius of the sphere [36]. Note that at the lowest resolution level (i.e. l = 0) there are 12 grid elements. Mesh refinement partitions a grid element, S_{l,m}, at the resolution level l with index m, into four new grid elements neighboring each other at the higher resolution level, l + 1, such that:

S_{l,m} = S_{l+1,4m} \oplus S_{l+1,4m+1} \oplus S_{l+1,4m+2} \oplus S_{l+1,4m+3}   (30)

where S_{l+1,4m+k} for k = 0, …, 3 are the four higher resolution grid elements. This allows for a quadtree representation where a spherical function or distribution can be represented at different levels of detail for different directions. The flexible multiresolution representation afforded by HEALPix is essential for the DOA estimation and source counting method proposed in this paper.

2) Entropy-based Hierarchical Grid Refinement: When multiple coherent sources are present in the sound scene, SRPD maps will have more than one meaningful local maximum, corresponding to the directions of the active sources. In order to find these maxima without steering a beam in all possible directions, it is necessary to identify the regions that contain them and scan these regions at a higher resolution. Spatial entropy [37], which is a measure of spatial disorganization, is used to identify such regions.

Let us assume that at a given iteration, t, the analysis grid (or equivalently the leaf nodes of the corresponding quadtree), G_t, for a time-frequency bin (τ̂, κ̂) consists of L_t elements (i.e. nodes) G_t = {S^{(t)}_{l,m}} at different resolution levels, l ∈ {0, …, t − 1}. We can define the total spatial entropy of the representation as:

H(\mathcal{G}_t) = -\sum_{\forall S^{(t)}_{l,m} \in \mathcal{G}_t} \gamma(S^{(t)}_{l,m}) \log \frac{\gamma(S^{(t)}_{l,m})}{A(S^{(t)}_{l,m})}   (31)

with

\gamma(S^{(t)}_{L,M}) = \frac{P_{L,M}}{\sum_{\forall S^{(t)}_{l,m} \in \mathcal{G}_t} P_{l,m}}   (32)

where P_{l,m} is the SRPD of the grid element with index m at level l [as in (24)], and A(S^{(t)}_{l,m}) is its area. Here, 0 ≤ γ(S^{(t)}_{l,m}) ≤ 1 is interpreted as the probability of the grid element S^{(t)}_{l,m} containing a source, and:

\sum_{\forall S^{(t)}_{l,m} \in \mathcal{G}_t} \gamma(S^{(t)}_{l,m}) = 1.   (33)

The decision on whether or not to refine a specific grid element is made according to its effect on total spatial entropy. A candidate grid G'_t can be obtained from the existing grid G_t by refining a specific grid element, S^{(t)}_{l,m}, such that:

\mathcal{G}'_t = \mathcal{G}_t \cup V(S_{l,m}) \setminus \{S^{(t)}_{l,m}\}   (34)

where V(S_{l,m}) = {S_{l+1,4m}, S_{l+1,4m+1}, S_{l+1,4m+2}, S_{l+1,4m+3}} is the set of children nodes in the quadtree representation of the analysis grid.
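The quadtree indexing of eq. (30) and the spatial entropy of eqs. (31)-(32) can be sketched with plain arrays standing in for the quadtree leaves; both helper names are ours, not the authors'.

```python
import numpy as np

def children(l, m):
    """Children of grid element S_{l,m} under eq. (30):
    S_{l+1,4m}, ..., S_{l+1,4m+3} (HEALPix nested indexing)."""
    return [(l + 1, 4 * m + k) for k in range(4)]

def spatial_entropy(srpd, areas):
    """Total spatial entropy of eqs. (31)-(32): SRPD values are normalized
    to probabilities gamma, then H = -sum gamma * log(gamma / A)."""
    g = np.asarray(srpd, dtype=float)
    g = g / g.sum()                    # eq. (32): gamma sums to one
    a = np.asarray(areas, dtype=float)
    return float(-np.sum(g * np.log(g / a)))
```

Note the area term in (31): for a uniform SRPD over four equal cells of area A, the entropy reduces to log(4A), so finer tessellations of a flat region do not change the total entropy by themselves.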
The proposed algorithm aims to decrease the total spatial entropy of the representation by using the difference between the total entropy before and after the refinement, i.e.:

I(l, m) = H(\mathcal{G}_t) - H(\mathcal{G}'_t).   (35)

This difference is known as mutual information or information gain in the context of decision trees [38] and, when positive, indicates that the refinement would result in the creation of children nodes with similar SRPD values indicating a local maximum. Therefore, if the refinement decreases the total spatial entropy such that I(l, m) > 0, the representation is updated with the refined candidate grid, G'_t. Otherwise the corresponding branch of the quadtree is pruned at its present resolution level and removed from the search path. Algorithm 1 shows the pseudocode for the HiGRID method.

Algorithm 1: Hierarchical Grid Refinement (HiGRID)
Data: SHD coefficients, p_nm, for the time-frequency bin (τ̂, κ̂) and the maximum refinement level, L_max
Result: HEALPix quadtree with the set of leaf nodes, B = {S_{l,m}, l = 0..L}, containing the SRPD evaluated at the corresponding pixel, P_{l,m}

    treeLevel ← 1
    Initialize the quadtree by calculating the SRPD at each leaf node, P_{1,m}
    N ← ∅
    while treeLevel ≤ L_max do
        B ← {S_{l,m} | l = treeLevel}
        while B ≠ ∅ do
            S ← fetchRandomLeafNodeFrom(B)
            C ← childrenOf(S)          // Calculate SRPDs for C = {C_i, i = 1…4}
            B' ← B ∪ C \ S
            E_B ← spatialEntropy(B)
            E_B' ← spatialEntropy(B')
            if E_B' < E_B then
                N ← N ∪ C
            else
                N ← N ∪ {S}
            end
            B ← B \ S
        end
        B ← N
        treeLevel ← treeLevel + 1
    end

Fig. 2 shows the progress of the proposed algorithm using the SHD coefficients of the sum of three unit amplitude, monochromatic plane waves with the common frequency of F = 3 kHz, incident on a spherical aperture of radius 4.2 cm from the directions of (φ_1, θ_1) = (π/2, 3π/5), (φ_2, θ_2) = (2π/3, π/5), and (φ_3, θ_3) = (π/3, 9π/5), respectively.
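The refine-or-prune decision of eq. (35) can be sketched as below. The helper `refine_gain` is hypothetical: it represents leaves as plain (SRPD, area) pairs rather than a quadtree, and evaluates the entropy change caused by replacing one parent leaf with its four children.

```python
import numpy as np

def refine_gain(grid, parent, child_srpd, child_area):
    """Information gain I = H(G_t) - H(G'_t) of eq. (35) for refining one
    grid element. grid is a list of (srpd, area) leaves; parent indexes
    the leaf to refine; child_srpd holds the four child SRPD values, each
    covering child_area. Positive gain -> accept the refinement."""
    def H(cells):                      # spatial entropy, eqs. (31)-(32)
        g = np.array([s for s, _ in cells])
        a = np.array([x for _, x in cells])
        g = g / g.sum()
        return -np.sum(g * np.log(g / a))
    candidate = [c for i, c in enumerate(grid) if i != parent]
    candidate += [(s, child_area) for s in child_srpd]
    return float(H(grid) - H(candidate))

# Concentrated children (a likely local maximum) -> positive gain -> refine;
# flat children -> zero gain -> prune the branch at this level.
g = [(4.0, 1.0), (4.0, 1.0)]
print(refine_gain(g, 0, [3.97, 0.01, 0.01, 0.01], 0.25))
print(refine_gain(g, 0, [1.0, 1.0, 1.0, 1.0], 0.25))
```

Because the area term in (31) makes a uniformly subdivided flat cell entropy-neutral, only cells whose children concentrate the power, i.e. those plausibly containing a peak, survive refinement.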
The real DOAs of the plane waves are denoted by black cross signs. The decompositions at different resolution levels, from 1 through 4, obtained using the HiGRID algorithm are shown using the Mollweide projection. It may be observed that the SRPD map contracts in the regions corresponding to the DOA of each plane wave. SRPD is calculated at 33, 72, 165 and 429 possible directions for levels 1, 2, 3, and 4, respectively. In order to attain the same angular resolution without the entropy-based selection of regions of interest, 48, 192, 768 and 3072 directions would have had to be scanned at these respective levels.

Fig. 2. Hierarchical grid refinement (HiGRID) applied on a single time-frequency bin, representing a sum of three unit-amplitude, monochromatic plane waves with a frequency of F = 3 kHz. The progress of the algorithm at levels 1-4 is shown in (a)-(d), respectively.

3) Source Localization: Interpreting the HEALPix grid as a quadtree allows a natural clustering of the underlying data that the grid represents. Once the multiresolution SRPD map is formed up to a given resolution level, the corresponding quadtree can provide a spatial map of the recorded scene, with the leaves (i.e. pixels; grid elements) at the highest level representing the probable source directions. We adapt the connected components labelling (CCL) [39] technique from the computer vision domain for use with the quadtree representation of the grid to identify and label regions of interest based on neighbourship relations between the nodes. The algorithm, which we call neighbouring nodes labelling (NNL), is shown in Algorithm 2.³
CCL and NNL are similar in that they both identify regions based on their contiguity and carry out labelling and clustering jointly. The major difference between CCL and NNL is that NNL can operate on a multiresolution representation, whereas CCL operates on uniformly sampled pixels. Also, as opposed to CCL, NNL takes into account the periodic boundaries (i.e. at \phi = 2\pi, and \theta = 0, \pi) on the spherical grid while identifying and labelling contiguous regions. Similar to CCL, which operates on binarized images, NNL operates on a set of leaf nodes whose values are above a threshold. Thresholding also allows obtaining more compact regions of interest. While it is possible to employ more elaborate thresholding techniques such as Bradley's adaptive thresholding [41], we used the mean value of the leaf nodes of the tree as a threshold due to its simplicity. Once contiguous regions are identified, the centroid of each region, n(\hat{\tau}, \hat{\kappa}), is stored as the local DOA estimate for one of the sources in the time-frequency bin. Note that the directions of these unit vectors will not necessarily coincide with the centres of HEALPix grid elements.

³ It should be noted that the NNL algorithm is provided here for convenience and not for its novelty, since there exist similar applications of CCL in quadtree representations of data, e.g. [40].

C. Post-processing

The post-processing stage uses a 2D direction histogram (with a uniform 1° bin size for \theta and \phi) obtained from the set of unit vectors denoting source DOAs, {n(\hat{\tau}, \hat{\kappa})}. Three operations are applied to this histogram prior to the final DOA estimation:
Algorithm 2: Neighbouring Nodes Labelling (NNL)
  Data: Set of leaf nodes, G = {S_{l,m}, l = 0..L}, of a HEALPix quadtree containing the values of the SRPD distribution, P_{l,m}
  Result: Sets of nodes C_s, where s = 1..S
  // Binarization via thresholding
  THR <- calculateThreshold({P_{l,m}})
  for {S_{l,m}} in G do
      if P_{l,m} < THR then
          Remove node S_{l,m} from G
      end
  end
  s <- 1                            // Initialize the label count
  G_L <- {S_{l,m} | l = L_max}      // Select leaf nodes at the highest resolution level
  while G_L is not empty do
      stopFlag <- True
      C_s <- {}                     // Initialize cluster s
      S <- fetchRandomElementFrom(G_L)
      C_s <- C_s u {S}
      G_L <- G_L \ {S}
      while stopFlag is True do
          N <- neighborsOfSet(C_s)
          D <- G_L n N
          if D is not empty then
              C_s <- C_s u D
              G_L <- G_L \ D
          else
              s <- s + 1
              stopFlag <- False
          end
      end
  end

1) Due to the small number of DOA estimates from the selected time-frequency bins, the directional histogram will be sparse. In order to denoise the 2D histogram, spurious bins with a single occurrence are eliminated and a median filter with a neighbourhood size of 3 is applied to the resulting 2D histogram.
2) The denoised histogram is then processed with a Gaussian filter with a kernel size of 3 to obtain a dense representation that is suitable for segmentation-based clustering.
3) The resulting histogram is then cast back to a HEALPix grid to be used with the NNL method described earlier.

Finally, NNL is applied on this aggregate representation and the centroids of the identified clusters are registered as source DOAs.

VI. COMPUTATIONAL COST

It was shown in the previous section that, using HiGRID, it is possible to obtain DOA estimates using far fewer steering directions than SRP. However, there is some overhead due to the calculation of the information gain and also due to the calculation of SRPD, which is different from SRP. An analytic

Fig. 3.
Ratio of the computing time of HiGRID (T_HiGRID) to that of SRP (T_SRP) for different resolution levels from 1 to 4 and for different numbers of coherent plane waves with random directions. Each data point represents the ratio calculated over the average of the computing times for 50 simulations.

Fig. 4. Ratio of the computing time of HiGRID (T_HiGRID) to that of SRP (T_SRP) for a total of 18 plane waves with different numbers (S = 1, S = 2, S = 3 and S = 4) of coherent plane waves with random directions. Each data point represents the ratio calculated over the average of the computing times for 50 simulations.

derivation of the asymptotic complexity of HiGRID is very hard, if not impossible, to obtain. Therefore, the time required to process a single time-frequency bin is used as a measure to compare the computational performances of HiGRID and SRP. The comparisons that we present here involved simulating up to 18 monochromatic plane waves (F = 3 kHz), incident from random directions with at least \pi/4 separation. A total of 50 repetitions were made for each source count. Vectorised implementations of both methods, programmed in Python 2.7 and running on a MacBook Pro with an Intel Core i5 at 2.9 GHz and 16 GB of RAM, were used.

Fig. 3 shows the ratio of the time spent using HiGRID (T_HiGRID) to the time spent using SRP (T_SRP) for different numbers of coherent, unit-amplitude plane waves. It may be observed that HiGRID provides drastic computational advantages at higher resolutions (e.g. at level 4) and at lower source counts in comparison with SRP. At resolution level 4 and for a single unit-amplitude plane wave, the computation time required for HiGRID is 16.4% of that required by SRP. The computation times become similar for 12 plane waves. At level 3, the computation times are similar for 6 plane waves. It may also be observed that at level 1, SRP outperforms HiGRID in terms of computation time by a factor of 2.
This is due to the dimensions of the basis used in the SRPD calculations. However, since the angular resolution at this level is 29.3°, which is too low to be useful in a practical application, this difference is not relevant. Note that cases with many unit-amplitude plane waves are not representative of real-life scenarios, as it is unlikely that a high number of fully coherent plane waves with equal amplitudes will impinge on the array simultaneously. However, these conditions represent worst-case scenarios and were hence chosen to demonstrate the computational efficiency of the proposed algorithm in comparison with SRP.

In order to evaluate the computational performance of the proposed method in a more realistic scenario, simulations of a total of up to 18 plane waves were carried out at a resolution level of 4. In the four different simulations, S = 1, S = 2, S = 3, and S = 4 of the waves were unit-amplitude, coherent plane waves, respectively. The remaining plane waves were assigned random, complex-valued amplitudes with magnitudes drawn from a uniform distribution in [0, 0.5]. This represents a situation similar to a combination of direct and diffuse fields. Fig. 4 shows the ratio of the time spent using HiGRID to the time spent using SRP, as well as lines fit to the averages of these ratios for an increasing number of components. It may be observed that the ratio is below 20%, 35%, 51% and 57% for S = 1, S = 2, S = 3, and S = 4 coherent plane waves, respectively. It increases linearly with the total number of components; however, this trend is very slow, with a maximum slope of 0.6% for S = 3.

VII. PERFORMANCE EVALUATION

The proposed algorithm is evaluated using acoustic impulse responses (AIRs) measured in a highly reverberant classroom. Robustness of the proposed method to additive white noise is assessed via emulated acoustic scenes. A separate comparison with state-of-the-art methods is presented.
In addition, a microphone array recording of a musical performance was used to demonstrate the performance of the algorithm in a real-life scenario.

A. Acoustic Impulse Response Measurements

In order to provide flexibility in terms of creating different test scenarios, we measured multichannel acoustic impulse responses (AIRs) in an empty classroom at the METU Graduate School of Informatics with a rigid spherical microphone array (mh Acoustics Eigenmike em32). The logarithmic sine sweep method [42] was used in the measurements. The sound source was a Genelec 6010A two-way loudspeaker whose acoustic axis pointed at the vertical axis of the array at all measurement positions. The Eigenmike em32 has 32 microphones positioned on a rigid sphere on the vertices of a truncated icosahedron and allows the decomposition of the sound field up to fourth order (i.e. N = 4) [43]. The classroom in which the measurements were made has a high reverberation time (T60 ≈ 1.12 s) when empty. It is approximately rectangular and has the dimensions 6.5 × 8.3 × 2.9 m. AIR measurements were made at 240 points on a rectilinear grid with 0.5 m horizontal and 0.3 m vertical resolution surrounding the array. The array was positioned at a height of 1.5 m. The measurement planes were positioned at the heights of 0.9, 1.2, 1.5, 1.8 and 2.1 m from the floor level.

Fig. 5. Top view of the measurement positions inside the classroom. Circles represent the source locations and the square represents the position of the microphone array. The walls are made of concrete covered with plaster, the floor is carpeted, and the ceiling has acoustic tiles. One of the walls also has a window and a wooden door, both of which were kept closed during the measurements.
These positions cover the whole azimuth range and an elevation range of approximately ±50° above and below the horizontal plane. We also measured room impulse responses (RIRs) at the same positions using an Alctron M6 omnidirectional microphone. This second set of measurements was used to calculate the corresponding direct-to-reverberant (D/R) energy ratios for these positions. Fig. 5 shows the top view of the measurement positions.

B. Source signals

In order to evaluate the proposed algorithm with near-incoherent (i.e. nominally W-disjoint) sources, we used the first 4 seconds of anechoic speech recordings from the B&O Music for Archimedes CD [44]. These signals included speech recordings in English and Danish from two female and two male speakers. These were resampled to 48 kHz.

In order to evaluate the proposed algorithm with near-coherent sources, we used 4 seconds (01:00-01:04) of the anechoic recordings of Gustav Mahler's Symphony Nr. 1, fourth movement [45]. Only the four violin tracks were used, in which the violins play the same musical phrase in unison. It should be noted that, while there may be minor differences, the source signals have substantially similar magnitude spectrograms (see Fig. 6). This selection of test signals constitutes a very exacting, albeit realistic, scenario, for example in object-based audio capture. The sampling rate was 48 kHz. The speech signals were normalized according to ITU P.56 [46] using the VOICEBOX toolbox [47]. The violin signals were normalized for energy.

Fig. 6. Magnitude spectrograms of the first second of the four violin tracks used in the evaluations.

C. Performance Indicators

We use three indicators for assessing DOA estimation performance.
The first indicator is the average DOA estimation error, defined as:

\epsilon_{DOA} = \frac{1}{S_{est}} \sum_{s=1}^{S_{est}} \left| \arccos \langle n_{est,s}, n_{src,s} \rangle \right|    (36)

where S_est is the number of correctly estimated sources, and n_{est,s} and n_{src,s} are the unit vectors in the directions of the estimated DOA and the nominal source direction, respectively. The closer this value is to zero, the better the DOA estimation.

The second performance indicator is the average number of localised sources, S_avg. While this indicator may provide insight about the average success of the algorithm in identifying the source count, it may also be misleading. For example, if for a scenario involving 4 sources, 8 sources are identified in one trial and none in another trial, the average would indicate ideal performance.

The third indicator depends on the source count disparity, defined as:

\Delta_s = S_{est} - S_{act}    (37)

where S_act is the actual number of sources. Ideally, \Delta_s = 0 (i.e. S_est = S_act). It should be noted that \Delta_s < 0 (i.e. S_est < S_act) is a more serious problem than \Delta_s > 0 (i.e. S_est > S_act), since the former indicates one or more undetected sources, while the latter may indicate that all sources in the scene are detected alongside some strong coherent reflections. Therefore, the third performance indicator, negative source disparity, is defined as the number of times the source count disparity is negative in a given number of trials.

D. Evaluation of the HiGRID algorithm

In the performance tests explained below, the HiGRID algorithm was used with a maximum tree level of 3, which corresponds to a grid resolution of 7.33°. The maximum order of SHD was N = 4. The frequency range was between 2608 Hz and 5216 Hz, for which the mode strength for order N = 4 is high. The FFT size and the hop size were 1024 and 64 samples, respectively.
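The two quantitative indicators in (36) and (37) can be sketched as follows; the function names are illustrative, and the inputs to the error measure are assumed to be already matched pairs of unit vectors:

```python
import numpy as np

def doa_error_deg(est, src):
    """Average great-circle DOA estimation error (36), in degrees.

    est, src : arrays of matched unit vectors (one row per source) for the
               estimated and nominal source directions, respectively.
    """
    est = np.asarray(est, dtype=float)
    src = np.asarray(src, dtype=float)
    dots = np.clip(np.sum(est * src, axis=1), -1.0, 1.0)  # inner products
    return float(np.degrees(np.mean(np.abs(np.arccos(dots)))))

def source_count_disparity(s_est, s_act):
    """Source count disparity (37); negative values flag undetected sources."""
    return s_est - s_act
```

The clipping guards against inner products drifting marginally outside [-1, 1] due to floating-point rounding before the arccosine is taken.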
For each of the test cases, the selection of the prescribed number of sources was made randomly, subject to two criteria. First, the source positions were grouped according to their D/R ratios. Seven D/R ratio clusters with a cluster width of 1.38 dB were used; the cluster centres corresponded to -1.65, -0.27, 1.11, 2.50, 3.88, 5.26 and 6.65 dB, respectively. The source combinations in each test scenario included positions from only one of these clusters, so that the D/R ratio was similar across the different source positions. Second, it was made sure that the separation between any pair of positions in a scenario was greater than \pi/4. Ten test scenarios were randomly generated for each of the S = 1, 2, 3, and 4 source cases and for each of the D/R ratio clusters. The Kuhn-Munkres algorithm [48] was used to assign the DOA estimates to the ground truth. In all trials, if the DOA estimation error was found to be larger than \pi/4, that case was identified as an extreme value and excluded from the analysis.

1) Ideal conditions: This condition corresponds to a noiseless scenario where the source signals were convolved with the measured AIRs and summed to obtain the test signals. Fig. 7 shows the DOA estimation errors for the case where speech signals are used. There were a total of 19 extreme values in 280 distinct test scenarios and 700 source positions. These were not included in the calculation of the average DOA estimation error. The average DOA estimation errors for S = 1, 2, 3, and 4 sources were 2.71°, 2.99°, 3.20°, and 3.39°, respectively. The overall average was 3.18°. It may be observed that the DOA estimation error is affected neither by the D/R ratio nor by the number of sources in the scene. Fig. 8 shows the DOA estimation errors for the case where violin signals are used.
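The assignment of DOA estimates to the ground truth and the extreme-value screening described above can be sketched with SciPy's implementation of the Kuhn-Munkres (Hungarian) algorithm; the interface below is illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_estimates(est, src, extreme_threshold=np.pi / 4):
    """Match estimated DOAs (rows of `est`) to ground-truth directions
    (rows of `src`), both unit vectors, minimising total angular distance.
    Matches farther apart than `extreme_threshold` are flagged as extreme
    values, to be excluded from the error statistics."""
    est = np.asarray(est, dtype=float)
    src = np.asarray(src, dtype=float)
    cost = np.arccos(np.clip(est @ src.T, -1.0, 1.0))  # pairwise angles
    rows, cols = linear_sum_assignment(cost)           # Kuhn-Munkres
    extreme = cost[rows, cols] > extreme_threshold
    return list(zip(rows.tolist(), cols.tolist())), extreme
```

Using the angular distance itself as the assignment cost makes the matching independent of the order in which the clusters were discovered.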
There were 5 extreme values in 280 distinct test scenarios and 700 source positions, which were not included in the calculation of the average DOA estimation errors. The average DOA estimation errors for S = 1, 2, 3, and 4 sources were 3.34°, 3.44°, 3.97°, and 4.37°, respectively. The overall average was 3.96°. Similar to the case with speech signals, the DOA estimation error for violin signals is not affected by the D/R ratio or by the number of sources in the scene.

The average numbers of sources identified for the S = 1, 2, 3, and 4 source cases were S_avg = 1.67, 2.97, 3.52, 4.65 for speech, and 1.02, 2.01, 2.78, 3.52 for violin signals, respectively. Negative source disparity was low when the D/R ratio was high and also when the source count was low. For speech signals, the worst performance occurred for the low D/R ratio of -0.27 dB and for S = 4; the average number of sources identified for this condition was 5.7. For violin signals, the worst performance occurred for the moderate D/R ratio of 1.11 dB and for S = 4; the average number of sources identified for this condition was 3.2.

Fig. 7. Results of DOA estimation for speech signals using HiGRID for S = 1, 2, 3, and 4 sources and different D/R ratio levels. The box plots indicate the distribution of the error. The solid lines indicate the mean DOA error. Asterisks indicate outliers. Extreme values are not shown.

Fig. 8. Results of DOA estimation for violin signals using HiGRID for S = 1, 2, 3, and 4 sources and different D/R ratio levels. The box plots indicate the distribution of the error. The solid lines indicate the mean DOA error. Asterisks indicate outliers. Extreme values are not shown.

Total negative source
disparity was observed 8 times for speech and 47 times for violins over the total of 280 test scenarios each.

Fig. 9. Histograms of DOA estimation errors for different numbers of sources and different noise levels. Bin size is 0.5° and each count corresponds to the DOA estimation error for a single source.

2) Effect of noise: Sensor noise was simulated by adding uncorrelated white Gaussian noise to the individual microphone signals at different signal-to-noise ratios (SNR). The noise level was calculated based on the energy of the omnidirectional component, obtained as the average of the individual channels. The tested levels were 0, 10, 20 and 30 dB SNR. Only the violin signals were used in this evaluation, and the selection of the scenarios was made in the same way as in the ideal case.

Fig. 9 shows the distribution of the error for the different numbers of sources and the different noise levels tested. The rows in the figure indicate grouping according to the number of sources and the columns indicate grouping according to the noise level. A one-way analysis of variance (ANOVA) was carried out to understand the effect of the noise level on the DOA estimation error. The independent variable was the SNR, which has four levels from 0 to 30 dB. The dependent variable was the DOA estimation error. The result of the ANOVA was significant at the \alpha = 0.05 level (F(3, 2564) = 3.625, p = 0.013). Levene's test indicated that the variance was different across the tested groups (F(3, 2564) = 3.796, p = 0.01). Post-hoc comparisons using the Games-Howell test indicated that the differences between the error means for 0 dB and 10 dB SNR, as well as for 0 dB and 20 dB SNR, are significant. However, these differences (i.e. 0.51° and 0.49°, respectively) are negligible in a practical sound source localisation scenario.

The average number of identified sources and the negative source disparity also did not change with noise. Table I shows the average number of localised sources. It may be observed that for lower source counts (i.e. S = 1 and S = 2) the source count is slightly overestimated on average, and vice versa. Total negative source disparity was 46, 44, 45 and 43 for SNRs of 0, 10, 20, and 30 dB, over 280 trials each, respectively. Therefore, the proposed method is considered to be robust to additive, uncorrelated white Gaussian noise.

TABLE I
AVERAGE NUMBER OF LOCALISED SOURCES FOR DIFFERENT SNRS

SNR (dB)   S = 1   S = 2   S = 3   S = 4
   0       1.22    2.08    2.92    3.57
  10       1.01    2.30    2.74    3.75
  20       1.00    2.12    2.78    3.38
  30       1.05    2.16    2.84    3.65
   ∞       1.02    2.01    2.78    3.52

E. Comparison of HiGRID with state-of-the-art methods

We compared HiGRID with three other state-of-the-art methods as well as with SRP-based DOA estimation. The state-of-the-art methods chosen for comparison are PIV [5], SSPIV [18], and DPD-MUSIC [22]. A general overview of the compared methods as well as the specific test setup and results are presented in this section. Specific details of the algorithms are available in the respective articles cited in the text. We use the SRP method based on plane-wave decomposition as the baseline. The parameters used for the PIV, SSPIV and DPD-MUSIC methods in the comparison (e.g. the time and frequency resolutions, the amount of overlap in the time-frequency representation, as well as the employed thresholds) were selected to match those in the original publications.
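The significance analysis described above can be reproduced in outline with a one-way ANOVA; the error samples below are synthetic stand-ins with deliberately separated group means, not the paper's data:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# Hypothetical DOA-error samples (in degrees) for four SNR groups; the
# group means differ by a full degree so the effect is clearly significant.
groups = [rng.normal(loc=3.0 + g, scale=0.5, size=100) for g in range(4)]
f_stat, p_value = f_oneway(*groups)  # independent variable: SNR group
```

In the paper's setting, each group would hold the per-source DOA errors observed at one SNR level, and a significant F statistic would be followed by post-hoc pairwise comparisons.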
1) DOA estimation methods used in the comparison: The PIV method is based on the concept of active intensity, which is the vector denoting the net flow of energy in the free field [49]. While it is impossible to measure active intensity using a rigid spherical microphone array, an approximation called the pseudo-intensity vector (PIV) can be obtained by using the zeroth- and first-order SHD coefficients. Assuming that the sources are W-disjoint orthogonal, their DOAs can be estimated from the long-term directional statistics of the PIVs. In the comparisons, the frequency range was between 0 and 4 kHz, the window size was 8 ms, and the overlap was 4 ms [5].

SSPIV extends the PIV method by utilizing higher-order SHD coefficients. The SSPIV is calculated at each time-frequency bin by decomposing the time-frequency-smoothed spatial covariance matrix obtained from the SHD coefficients into signal and noise subspaces via singular value decomposition (SVD). The SSPIV vector is calculated from the first three elements of the signal subspace vector [18]. In the comparisons, the frequency range was between 500 and 3850 Hz, the window size was 8 ms, the overlap was 6 ms, and the decomposition order was N = 3 [5].

DPD-MUSIC is a subspace-based DOA estimation method that uses EB-MUSIC for estimating DOAs in time-frequency bins with a single dominant component [22]. While there are recent extensions of DPD-MUSIC that aim to improve estimation accuracy [50] or reduce computational cost [35], we chose the coherent version of the original algorithm for our comparison, since these extensions are originally intended for single-source scenarios. In the comparisons, the frequency range for DPD-MUSIC was between 500 and 3875 Hz, the window size was 16 ms, the overlap was 75%, and the decomposition order was N = 3 [22].
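A sketch of the pseudo-intensity computation for a single time-frequency bin, from the zeroth- and first-order SHD coefficients. This assumes complex spherical harmonics with the Condon-Shortley phase and coefficients proportional to Y_n^m evaluated at the source direction (under the conjugate convention the m-dependent signs flip); it is an illustration of the idea, not the reference implementation of [5]:

```python
import numpy as np

def pseudo_intensity(p00, p1m1, p10, p11):
    """Pseudo-intensity vector for one time-frequency bin; its direction
    points toward the source, and the overall scaling is irrelevant for
    DOA estimation."""
    # Map the first-order coefficients to Cartesian dipole components
    px = (p1m1 - p11) / np.sqrt(2.0)
    py = 1j * (p1m1 + p11) / np.sqrt(2.0)
    pz = p10
    v = 0.5 * np.real(np.conj(p00) * np.array([px, py, pz]))
    n = np.linalg.norm(v)
    return v / n if n > 0 else v  # unit vector toward the source
```

For a single plane wave this recovers the source direction exactly under the stated convention; with multiple sources, the DOAs are obtained from the long-term directional statistics of the per-bin vectors, as described above.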
The EB-MUSIC spectrum was calculated at all the pixel centroids of a third-level HEALPix grid, and the DOA was estimated for each time-frequency bin as the pixel centroid at which this spectrum is maximum.

For both SRP and HiGRID, the frequency range was between 2608 and 5214 Hz, the window size was 21.3 ms, the overlap was 20 ms, and the decomposition order was N = 4. SRP was calculated at all the pixel centroids of a third-level HEALPix grid. For HiGRID, the maximum depth of the analysis tree was 3. For both methods, DOA estimation was made using the NNL method explained above.

2) Test Scenario: A simple scenario involving four concurrent sources positioned at S1: (90°, 45°), S2: (90°, 135°), S3: (90°, 225°), and S4: (90°, 315°) at a distance of 1.41 m was emulated using the AIR measurements described above. The D/R ratios for these specific source positions were 3.04, 3.29, 3.56, and 2.39 dB, respectively. This constitutes a simple setup with spatially well-separated sources and a high D/R ratio. The two types of sources (speech and violin) that were used in the previous section for evaluating the HiGRID approach were also used in the comparison. As in the previous section, the sampling rate and the duration were F_s = 48 kHz and 4 s, respectively. In the case involving speech signals, sources 1 to 4 are female speech (English), male speech (English), female speech (Danish), and male speech (Danish), respectively.

3) Results: The DOA estimation results for the two different source types are shown in Tables II and III. While calculating the DOA estimation errors, the mapping between the identified sources and the ground truth was made using the Kuhn-Munkres algorithm [48].
TABLE II
DOA ESTIMATION ERRORS FOR FOUR CONCURRENT SPEECH SOURCES USING DIFFERENT METHODS

Source    SRP     HiGRID   PIV      SSPIV   DPD-MUSIC
1         1.54°   1.15°    13.12°   0.87°   0.69°
2         1.87°   1.34°    5.76°    1.21°   2.04°
3         1.39°   0.32°    6.86°    1.56°   1.17°
4         0.80°   0.36°    3.51°    1.28°   0.56°
Average   1.40°   0.79°    7.31°    1.23°   1.12°

TABLE III
DOA ESTIMATION ERRORS FOR FOUR CONCURRENT VIOLIN SOURCES USING DIFFERENT METHODS

Source    SRP     HiGRID   PIV      SSPIV   DPD-MUSIC
1         1.21°   1.15°    4.89°    6.60°   3.45°
2         1.05°   1.34°    10.07°   4.11°   4.21°
3         1.15°   0.70°    20.41°   2.31°   2.28°
4         1.05°   1.18°    11.92°   7.20°   1.12°
Average   1.12°   1.09°    11.82°   5.06°   2.77°

In both scenarios, HiGRID showed a performance that is very similar to SRP, with consistent DOA estimates and very small estimation errors, both for individual sources and on average. This indicates a distinct advantage over SRP, in that the computational cost associated with HiGRID is significantly lower, as discussed above.

It may be observed from Table II, showing the DOA estimates for the speech signals, that except for PIV, all methods provide an acceptable level of accuracy, with DOA estimation errors in the order of one degree. Overall, the PIV method provides the worst performance while HiGRID provides the best performance. However, the differences between SRP, HiGRID, SSPIV and DPD-MUSIC are not so high as to be significant in a practical context.

The results in Table III show the detrimental effects of coherent sources on PIV and, to a lesser extent, on SSPIV and DPD-MUSIC. Except for HiGRID and SRP, the average DOA estimation errors are much higher than in the test using speech signals. HiGRID and SRP provide a DOA estimation performance that is similar to the case with speech sources.
However, the performance of PIV and SSPIV is reduced sharply, with the DOA estimate from PIV being completely unusable for Source 3. The DOA estimation errors from DPD-MUSIC for the individual sources are all higher than those for the first test case. The average DOA estimation error is larger than that in the first case by a factor of 2.5.

F. Real Recording

In order to test the performance of HiGRID under real-life conditions, we recorded the pre-concert rehearsal of the Nemeth Quartet (a classical quartet consisting of two violins, a viola and a cello) in the small recital hall of the Erimtan Museum in Ankara. The reverberation time of the hall was T60 = 1.19 s. The recording was made by positioning the spherical array at a central position between the musicians. The height of the microphone array was chosen as 1.5 m in order to allow the musicians to maintain eye contact. We were not able to make any precise measurements of the positions of the sound sources due to practical considerations (e.g. the movement of the musicians during the performance). However, we made AIR measurements with a Genelec 6010A loudspeaker at positions that roughly coincide with the positions of the musical instruments. In order to assess the accuracy of the proposed method, SRP-based DOA estimates obtained from windowed direct-path components of the AIRs were used as reference. The program material used in the evaluation was a five-second excerpt from the third movement of Ludwig van Beethoven's String Quartet Nr. 11, Op. 95. All instruments were playing in this section, but the individual parts were not in unison, unlike the coherent source scenario simulated in the previous section. Fig. 10 shows the recording setup.

Fig. 10. Setup for the recording of the classical quartet. The position of the microphone array and the directions of the instruments with respect to the array are indicated.

HiGRID was used with a maximum tree level of 3.
The maximum order of SHD was N = 4. Two different frequency ranges were used: 1) 2608 Hz to 5216 Hz, and 2) 1308 Hz to 5216 Hz. A 1024-point FFT with a hop size of 64 samples was used for the STFT. The reference DOAs and the estimation results are given in Table IV. It may be observed that, using the first frequency range, three out of the four instruments (the two violins and the viola) can be localised. No spurious sources were detected in this case. Using the second frequency range, all instruments can be localised, along with six spurious DOAs corresponding to strong room reflections (not shown in the table). It should be noted that reporting the DOA estimation error with respect to the reference values would be misleading due to the inaccuracy of the reference DOAs, as well as the sources being volumetric and not static. Therefore, the presented results are only intended to show, with reasonable but imprecisely quantified accuracy, that HiGRID can find source DOAs under real-life conditions.

TABLE IV
REFERENCE AND ESTIMATED DOAS FOR THE MICROPHONE ARRAY RECORDING OF THE CLASSICAL QUARTET

(θ, φ)          Violin 1           Violin 2           Viola              Cello
Reference       (116.1°, 92.4°)    (114.9°, 32.3°)    (124.4°, 327.8°)   (132.4°, 268.4°)
2608-5216 Hz    (109.7°, 89.4°)    (109.7°, 28.1°)    (112.3°, 340.4°)   -
1304-5216 Hz    (118.1°, 88.5°)    (107.1°, 30.9°)    (115.0°, 337.7°)   (138.1°, 276.4°)

VIII. CONCLUSIONS

Estimation of the DOAs of multiple, possibly coherent sources in highly reverberant environments is an important step in sound scene analysis. While there exist different methods proposed for this purpose, they are either computationally prohibitively costly, or they fail for highly coherent sound sources. An extension to steered response power (SRP) and an accompanying entropy-based hierarchical grid refinement method (HiGRID) were proposed in this article. The extension of SRP, which we call steered response power density (SRPD), uses the spherical harmonic decomposition of a sound field to extract the power density in a given region on the unit sphere. This allows representing the probability of a source being present in the given analysis region on the unit sphere. SRPD is calculated on a HEALPix grid, which is a spherical tessellation allowing a multiresolution representation of data. The maxima of the SRPD map are searched on this grid using information gain as a decision measure. The obtained multiresolution maps are segmented to obtain multiple local DOA estimates. These are then aggregated to estimate the DOAs of multiple sources. We showed that HiGRID has a lower computational cost than SRP, is robust to reverberation and additive noise, and can accurately localise highly coherent sources as well as speech. HiGRID also compared favourably with other state-of-the-art methods in terms of its DOA estimation accuracy. The application of HiGRID to a microphone array recording of a musical performance demonstrated its utility under real-life conditions.

ACKNOWLEDGMENT

We would like to thank the members of the Nemeth Quartet (http://www.nemethquartet.com) for allowing us to record their rehearsal. This paper is dedicated to the memory of the third author's late grandmother, Belkıs Hacıhabiboğlu.

REFERENCES

[1] M. Brandstein and D. Ward, Microphone Arrays: Signal Processing Techniques and Applications. Springer Science & Business Media, 2013.
[2] B. Rafaely, Fundamentals of Spherical Array Processing. Springer-Verlag, 2015.
[3] D. P. Jarrett, E. A. P. Habets, and P. A. Naylor, Theory and Applications of Spherical Microphone Array Processing. Springer, 2016.
[4] H. Sun, H. Teutsch, E. Mabande, and W.
Mert Burkay Çöteli (S'18) received his B.Sc. (honors) and M.S. degrees in electrical and electronic engineering from the Middle East Technical University (METU), Ankara, Turkey, in 2009 and 2013, respectively. He is currently pursuing his Ph.D. in Information Systems at the Graduate School of Informatics at the same university. He is also working as a senior system engineer at Aselsan A.S. His research interests include microphone array signal processing and acoustic scene analysis.

Orhun Olgun (S'18) was born in Ankara, Turkey, in 1994. He received the B.Sc. degree in electrical and electronics engineering from Bilkent University, Ankara, Turkey, in 2016. He is currently pursuing his M.Sc. in Multimedia Informatics at the Graduate School of Informatics, METU, Ankara, Turkey. His research interests include 3D audio, microphone array signal processing, and acoustic scene analysis.

Hüseyin Hacıhabiboğlu (S'96-M'00-SM'12) is an Associate Professor of Signal Processing and Head of the Department of Modelling and Simulation at the Graduate School of Informatics, Middle East Technical University (METU), Ankara, Turkey. He received the B.Sc. (honors) degree from METU, Ankara, Turkey, in 2000, the M.Sc. degree from the University of Bristol, Bristol, U.K., in 2001, both in electrical and electronic engineering, and the Ph.D. degree in computer science from Queen's University Belfast, Belfast, U.K., in 2004. He held research positions at the University of Surrey, Guildford, U.K. (2004-2008) and King's College London, London, U.K. (2008-2011). His research interests include audio signal processing, room acoustics, multichannel audio systems, psychoacoustics of spatial hearing, microphone arrays, and game audio. Dr. Hacıhabiboğlu is a Senior Member of the IEEE, a member of the IEEE Signal Processing Society, the Audio Engineering Society (AES), the Turkish Acoustics Society (TAD), and the European Acoustics Association (EAA), and an associate editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing.
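As a supplementary illustration of the entropy-based refinement decision that drives HiGRID, the sketch below shows one plausible form of the rule: a HEALPix cell is subdivided only when the SRPD values of its four child cells are sufficiently peaked, i.e. when their normalized spatial entropy is low. This is a minimal sketch under our own assumptions, not the authors' implementation; the function names, the use of Shannon entropy normalized by the log of the number of children, and the threshold value are all illustrative.

```python
import math

def normalized_entropy(powers):
    """Shannon entropy of a cell's child SRPD values, normalized to [0, 1].

    A value near 1 means the power is spread uniformly over the children
    (no localised source); a value near 0 means it is concentrated in one
    child (a likely source direction).
    """
    total = sum(powers)
    if total <= 0:
        return 1.0  # empty cell: treat as maximally uninformative
    probs = [p / total for p in powers if p > 0]
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(len(powers))

def should_refine(child_powers, threshold=0.9):
    """Illustrative refinement rule: subdivide only cells whose children
    show a peaked (low-entropy) SRPD distribution; the threshold is an
    assumed tuning parameter, not a value from the paper."""
    return normalized_entropy(child_powers) < threshold
```

For example, a cell whose children carry SRPD values `[10.0, 0.1, 0.1, 0.1]` would be refined further, while a cell with a uniform distribution `[1.0, 1.0, 1.0, 1.0]` (normalized entropy exactly 1) would be pruned from the search, which is what reduces the number of steering directions relative to an exhaustive SRP scan.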