Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network
Sharath Adavanne*¹, Archontis Politis*², Tuomas Virtanen¹
¹Laboratory of Signal Processing, Tampere University of Technology, Finland
²Department of Signal Processing and Acoustics, Aalto University, Finland

Abstract—This paper proposes a deep neural network for estimating the directions of arrival (DOA) of multiple sound sources. The proposed stacked convolutional and recurrent neural network (DOAnet) generates a spatial pseudo-spectrum (SPS) along with the DOA estimates in both azimuth and elevation. We avoid any explicit feature extraction step by using the magnitudes and phases of the spectrograms of all the channels as input to the network. The proposed DOAnet is evaluated by estimating the DOAs of multiple concurrently present sources in anechoic, matched and unmatched reverberant conditions. The results show that the proposed DOAnet is capable of estimating the number of sources and their respective DOAs with good precision, and of generating an SPS with high signal-to-noise ratio.

I. INTRODUCTION

Direction of arrival (DOA) estimation is the task of identifying the relative position of the sound sources with respect to the microphone. DOA estimation is a fundamental operation in microphone array processing and forms an integral part of speech enhancement [1], multichannel sound source separation [2] and spatial audio coding [3]. Popular approaches to DOA estimation are based on time-delay-of-arrival (TDOA) [4], the steered-response power (SRP) [5], or on subspace methods such as multiple signal classification (MUSIC) [6] and the estimation of signal parameters via rotational invariance technique (ESPRIT) [7]. The aforementioned methods differ from each other in terms of algorithmic complexity and their suitability to various arrays and sound scenarios.
MUSIC specifically is very generic with respect to array geometry and directional properties, and can handle multiple simultaneously active narrowband sources. On the other hand, MUSIC and subspace methods in general require a good estimate of the number of active sources, which is often unavailable or difficult to obtain. Furthermore, MUSIC can suffer at low signal-to-noise ratio (SNR) and in reverberant scenarios [8]. In this paper, we propose to overcome the above shortcomings with a deep neural network (DNN) method, referred to as DOAnet, that learns the number of sources from the input data, generates high-precision DOA estimates and is robust to reverberation. The proposed DOAnet also generates a spatial acoustic activity map similar to the MUSIC pseudo-spectrum (SPS) as an intermediate output. The SPS has numerous applications that rely on a directional map of acoustic activity, such as soundfield visualization [9] and room acoustics analysis [10]. In comparison, the proposed DOAnet outputs the SPS and the DOAs of multiple overlapping sources similar to popular DOA estimators like MUSIC, ESPRIT or SRP, but without requiring the critical information of the number of active sound sources. A successful implementation of this will enable the integration of such DNN methods into higher-level learning-based end-to-end sound analysis and detection systems.

Recently, several DNN-based approaches have been proposed for DOA estimation [11], [12], [13], [14], [15], [16].

*Equally contributing authors. The research leading to these results has received funding from the European Research Council under the European Union's H2020 Framework Programme through ERC Grant Agreement 637422 EVERYSOUND. The authors also wish to acknowledge CSC-IT Center for Science, Finland, for computational resources.
There are six significant differences between them and the proposed method: a) All the aforementioned works focused on azimuth estimation, with the exception of [15], where the 2-D Cartesian coordinates of sound sources in a room were predicted, and [11], which trained separate networks for azimuth and elevation estimation. In contrast, we demonstrate the estimation of both azimuth and elevation for the DOA by sampling the unit sphere uniformly and predicting the probability of a sound source at each direction. b) The past works focused on the estimation of a single DOA at every time frame, with the exception of [13], where localization in azimuth of up to two simultaneous sources was proposed. In contrast, the proposed DOAnet does not algorithmically limit the number of directions to be estimated, i.e., with a higher number of input audio channels, the DOAnet can potentially estimate a larger number of sound events. c) Past works were evaluated with different array geometries, making comparison difficult. Although the DOAnet can be applied to any array geometry, we evaluate the method using real spherical harmonic input signals, an emerging popular spatial audio format under the name Ambisonics. Microphone signals from various arrays, such as spherical, circular, planar or volumetric, can be transformed to Ambisonic signals by an appropriate transform [17], resulting in a common representation of the 3-D sound recording. Although the DOAnet is scalable to higher-order Ambisonics, in this paper we evaluate it using the compact four-channel first-order Ambisonics (FOA). d) Regarding classifiers, earlier methods have used fully connected (FC) neural networks [11], [12], [13], [14], [15] and convolutional neural networks (CNN) [16]. In this work, along with CNNs we use recurrent neural network (RNN) layers. The usage of RNNs allows the network to learn long-term temporal information.
Such an architecture is referred to as a convolutional recurrent neural network (CRNN) in the literature and is the state-of-the-art method in many single-channel [18], [19] and multichannel [20], [21] audio tasks. e) Previous methods used inter-channel features such as generalized cross-correlation with phase transform (GCC-PHAT) [15], [12], eigen-decomposition of the spatial covariance matrix [13], and inter-channel time delays (ITD) and inter-channel level differences (ILD) [11], [14]. More recently, Chakrabarty et al. [16] proposed to use only the phase component of the spectrogram, avoiding explicit feature extraction. In the proposed method, we use both the magnitude and the phase component. Contrary to [16], which employed omnidirectional sensors only, general arrays with directional microphones additionally encode the DOA information in magnitude differences, while the Ambisonics format especially encodes directional information mainly in the magnitude component. f) All previous methods were evaluated on speech recordings that were synthetically spatialized and spatially static. We continue to use static sound sources in the present work and extend the evaluation to a larger variety of sound events, such as impulsive and transient sounds.

II. METHOD

The block diagram of the proposed DOAnet is presented in Figure 1. The DOAnet takes multichannel audio as input and first extracts the spectrograms of all the channels. The phases and the magnitudes of the spectrograms are mapped using a CRNN to two outputs sequentially. The first output, the spatial pseudo-spectrum (SPS), is generated as a regression task, followed by the DOA estimates as a classification task. The DOA is defined by the azimuth φ and elevation λ with respect to the microphone, and the SPS is the intensity of sound along the DOA, given by S(φ, λ).
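The SPS S(φ, λ) is evaluated on a discrete direction grid, detailed next. A sketch of such a grid at 10° resolution follows; treating the two poles as single points, which yields exactly 614 directions (36 azimuths × 17 elevations + 2 poles), is an assumed convention consistent with the count reported in the paper.

```python
import numpy as np

def make_doa_grid(step_deg=10):
    """Uniform (azimuth, elevation) grid at `step_deg` resolution.

    Elevations -90 and +90 degrees are kept as single points (the
    poles), giving 36 * 17 + 2 = 614 directions for a 10-degree step.
    """
    azimuths = np.arange(-180, 180, step_deg)             # 36 values
    elevations = np.arange(-90 + step_deg, 90, step_deg)  # 17 values
    grid = [(int(az), int(el)) for el in elevations for az in azimuths]
    grid.append((0, -90))  # south pole
    grid.append((0, 90))   # north pole
    return grid

grid = make_doa_grid()
assert len(grid) == 614
# The DOA output of the network uses only the subset with elevations
# limited to [-60, 60] degrees (432 directions in the paper).
```

The pole handling matters only for the grid count; any enumeration order works as long as it is kept consistent between training targets and predictions.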
In this paper, we use discrete φ and λ by uniformly sampling the 2-D polar coordinate space with a resolution of 10 degrees in both azimuth and elevation, resulting in 614 sampled directions. The SPS is computed at each sampled direction, whereas a subset of 432 directions, with the elevations limited between -60 and 60 degrees, is used for the DOA output.

A. Feature extraction

The spectrogram is calculated for each of the audio channels, whose sampling frequency is 44100 Hz. A 2048-point discrete Fourier transform (DFT) is calculated on Hamming windows of 40 ms with 50% overlap. We keep the 1024 values of the DFT corresponding to the positive frequencies, without the zeroth bin. L frames of features, each containing the 1024 magnitude and phase values of the DFT extracted from all the C channels, are stacked in an L × 1024 × 2C 3-D tensor and used as the input to the proposed neural network. The 2C dimension results from ordering the magnitude components of all channels first, followed by the phases. We use a sequence length L of 100 (= 2 s) in this work.

Fig. 1. DOAnet: neural network architecture for direction of arrival estimation of multiple sound sources. The first branch maps the 100 × 1024 × 2C input through four 2-D CNN blocks (64 filters, 3 × 3 receptive fields, ReLU, max-pooling of 1 × 8, 1 × 8, 1 × 4 and 1 × 2 along frequency) to a 100 × 2 × 64 map, followed by two bidirectional GRU layers of 64 units (tanh) and a 614-unit time-distributed linear dense layer producing Output 1, the spatial pseudo-spectrum (SPS, 100 × 614). The second branch maps the 100 × 614 × 1 SPS through two 2-D CNN blocks (16 filters, 3 × 3, ReLU, one 1 × 2 max-pool) to 100 × 307 × 16, a 32-unit time-distributed linear dense layer, two bidirectional GRU layers of 16 units (tanh), and a 432-unit time-distributed sigmoid dense layer producing Output 2, the direction of arrival (DOA) estimates.
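A minimal sketch of this feature extraction with a plain NumPy STFT follows (40 ms ≈ 1764 samples at 44.1 kHz, 50% hop, 2048-point DFT). The channel count C = 4 and the white-noise test signal are illustrative assumptions, not part of the paper.

```python
import numpy as np

FS = 44100
WIN = int(0.04 * FS)   # 40 ms Hamming window -> 1764 samples
HOP = WIN // 2         # 50% overlap
NFFT = 2048
NBINS = 1024           # positive frequencies, zeroth bin dropped
L = 100                # sequence length (= 2 s)

def extract_features(audio):
    """audio: (num_samples, C) array -> (L, 1024, 2C) input tensor."""
    num_samples, C = audio.shape
    window = np.hamming(WIN)
    n_frames = 1 + (num_samples - WIN) // HOP
    spec = np.empty((n_frames, NBINS, C), dtype=complex)
    for ch in range(C):
        for i in range(n_frames):
            frame = audio[i * HOP:i * HOP + WIN, ch] * window
            # keep bins 1..1024 of the 2048-point DFT
            spec[i, :, ch] = np.fft.rfft(frame, n=NFFT)[1:NBINS + 1]
    spec = spec[:L]  # assume at least L frames are available
    # magnitudes of all channels first, then the phases -> 2C maps
    return np.concatenate([np.abs(spec), np.angle(spec)], axis=-1)

audio = np.random.randn(2 * FS + WIN, 4)  # ~2 s of 4-channel noise
feats = extract_features(audio)
assert feats.shape == (L, NBINS, 8)
```

A library STFT (e.g. `scipy.signal.stft`) would work equally well; the explicit loop is kept only to make the framing and bin selection visible.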
B. Direction of arrival estimation network (DOAnet)

Local shift-invariant features are extracted from the input spectrogram tensor (of dimension L × 1024 × 2C) using CNN layers. In every CNN layer, the intra-channel time-frequency features are processed using a receptive field of 3 × 3 with rectified linear unit (ReLU) activation, and the resulting activation map is zero-padded to keep the output dimension equal to the input. Batch normalization and a max-pooling operation along the frequency axis are performed after every CNN layer, reducing the final dimension to L × 2 × N_C, where N_C is the number of filters in the last CNN layer. The CNN activations are reshaped to L × 2N_C, keeping the time-axis length unchanged, and fed to RNN layers in order to learn the temporal structure. Specifically, bidirectional gated recurrent units (GRU) with tanh activation are used. Further, the RNN output is mapped to the first output, the SPS, in a regression manner using FC layers with linear activation.

The SPS is further mapped to the DOA estimates, the final output of the proposed method, using a similar CRNN network as above with two minor architectural changes. An FC layer is introduced between the CNN and RNN layers to reduce the dimension of the data fed to the RNN. Additionally, the output layer which predicts the DOA uses sigmoid activation in order to allow more than one DOA estimate for a given time frame. Each node in this output layer represents a direction in the 2-D polar space. During testing, the probabilities at these nodes are thresholded at 0.5: any value above the threshold indicates the presence of a source in that direction, and any value below it the absence of a source. We refer to the combined architecture for SPS and DOA estimation as DOAnet.

The DOAnet is trained using the target SPS computed at each sampled direction and for every time frame by applying MUSIC (see Section III-B); the SPS is represented using nonnegative real numbers.
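Read together with Fig. 1, the two-stage CRNN can be sketched in Keras (the framework the paper reports using). Layer sizes follow the figure; the exact placement of the dimension-reducing dense layer in the second stage is reconstructed from the text and is therefore an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

L, C, N_DIRS_SPS, N_DIRS_DOA = 100, 4, 614, 432

inp = layers.Input(shape=(L, 1024, 2 * C))

# Stage 1: spectrogram -> spatial pseudo-spectrum (SPS), regression.
x = inp
for pool in (8, 8, 4, 2):                       # 1024 -> 2 along frequency
    x = layers.Conv2D(64, (3, 3), padding='same', activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(pool_size=(1, pool))(x)
x = layers.Reshape((L, 2 * 64))(x)              # (L, 128)
for _ in range(2):
    x = layers.Bidirectional(layers.GRU(64, activation='tanh',
                                        return_sequences=True))(x)
sps = layers.TimeDistributed(layers.Dense(N_DIRS_SPS), name='sps')(x)

# Stage 2: SPS -> per-direction probabilities, multi-label classification.
y = layers.Reshape((L, N_DIRS_SPS, 1))(sps)
y = layers.Conv2D(16, (3, 3), padding='same', activation='relu')(y)
y = layers.MaxPooling2D(pool_size=(1, 2))(y)    # 614 -> 307
y = layers.Conv2D(16, (3, 3), padding='same', activation='relu')(y)
y = layers.Reshape((L, 307 * 16))(y)
y = layers.TimeDistributed(layers.Dense(32))(y)  # dimension reduction (assumed position)
for _ in range(2):
    y = layers.Bidirectional(layers.GRU(16, activation='tanh',
                                        return_sequences=True))(y)
doa = layers.TimeDistributed(layers.Dense(N_DIRS_DOA, activation='sigmoid'),
                             name='doa')(y)

model = tf.keras.Model(inp, [sps, doa])
# Summed losses, as described in the paper: MSE for the SPS output,
# binary cross-entropy for the DOA output.
model.compile(optimizer='adam',
              loss={'sps': 'mse', 'doa': 'binary_crossentropy'})
```

Keras sums the per-output losses by default, which matches the training objective described below; dropout and early stopping are omitted from the sketch.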
For the DOA output, the DOAnet makes a discrete decision about the presence of a source in a certain direction; during training, the targets are the ground truth DOAs used to synthesize the audio (see Section III-A). The DOAnet was trained for 1000 epochs using the Adam optimizer, with a mean squared error loss for the SPS output and a binary cross-entropy loss for the DOA output. The sum of the two losses was used for backpropagation. Dropout was used after every layer, and early stopping was applied if the DOA metric (Section III-C) did not improve for 100 epochs. The DOAnet was implemented using the Keras framework with the Theano backend.

III. EVALUATION

A. Dataset

To our knowledge, there are no publicly available real or synthetic datasets consisting of general sound events, each associated with a 2-D spatial coordinate, with which to evaluate the proposed DOAnet. Since DNN-based methods need sufficiently large datasets to train on, most proposed DNN-based methods [11], [12], [14], [15], [16] have studied the performance on synthetic datasets. In a similar fashion, we evaluate the proposed DOAnet on synthetic datasets of about the same size as in the previous works. We synthesize datasets consisting of static point sources, each associated with a spatial coordinate, in two contexts: anechoic and reverberant. For each context, three datasets are generated: with no temporally overlapping sources (O1), with a maximum of two overlapping sources (O2), and with a maximum of three overlapping sources (O3). We refer to the anechoic context datasets as OxA and the reverberant ones as OxR, where x denotes the number of overlapping sources. Each of these datasets has three cross-validation (CV) splits with 240 recordings for training and 60 for testing. Recordings are sampled at 44.1 kHz and are 30 s long. To generate these datasets, we use the isolated real-life sound event recordings from DCASE 2016 task 2 [22].
This dataset consists of 11 sound event classes, each with 20 examples; the classes include speech, coughing, door slams, page turning, phone ringing and keyboard sounds. During CV, for each of the splits we randomly chose disjoint sets of 16 and 4 examples per class for training and testing, amounting to 176 examples for training and 44 for testing. To synthesize a recording, a random subset of the 176 or 44 sound examples was chosen from the respective split. The subset size varied for each recording based on the chosen sound examples. We start synthesizing a recording by randomly choosing the onset time of the first randomly chosen sound example within the first second of the recording. The next randomly chosen sound example is placed 250-500 ms after the end of the first sound example. On reaching the maximum recording length of 30 s, the process is repeated as many times as the number of required overlapping sound events.

Each of the sound examples was assigned a DOA randomly, using the following conditions. All sound events were placed on a spatial grid of ten degrees resolution along both azimuth and elevation. Two temporally overlapping sound events have at least ten degrees of spatial separation to avoid spatial overlap. The elevation was constrained within the range [-60, 60] degrees, as most natural sound events occur in this range. Finally, for the anechoic dataset, the sound sources were randomly placed at a distance d in the range 1-10 m. For the reverberant dataset, the sound events were randomly placed inside a room of dimensions 10 × 8 × 4 m with the microphone in the center of the room. Spatialization for the anechoic case was done as follows.
Each point source signal s_i with DOA (φ_i, λ_i) was converted to the Ambisonics format by multiplying the signal with the vector y(φ_i, λ_i) = [Y_00(φ_i, λ_i), Y_1(-1)(φ_i, λ_i), Y_10(φ_i, λ_i), Y_11(φ_i, λ_i)]^T of real orthonormalized spherical harmonics Y_nm(φ, λ). The complete anechoic multichannel sound scene recording x_A was generated as x_A = Σ_i g_i s_i y(φ_i, λ_i), with the gains g_i < 1 modeling the distance attenuation. Each entry of x_A corresponds to one channel, and g_i = √(1/10^(d/d_max)), where d_max = 10 m is the maximum distance.

In the reverberant case, a fast geometrical acoustics simulator was used to model natural reverberation based on the rectangular-room image-source model [23]. For each point source s_i with a DOA in the dataset, K image sources were generated, modeling reflections up to a predefined time limit. Based on the room and its propagation properties, each image source was associated with a propagation filter h_ik and a DOA (φ_k, λ_k), resulting in the spatial impulse response h_i = Σ_{k=1}^{K} h_ik y(φ_k, λ_k). The reverberant scene signal was finally generated as x_R = Σ_i s_i ∗ h_i, where (∗) denotes convolution of the source signal with the spatial impulse response. The room absorption properties were adjusted to match reverberation times of typical office spaces. Three sets of testing data were generated: with a room size similar to the training data (Room 1), with 80% of the room size (8 × 8 × 4 m) and reverberation time (Room 2), and with 60% of the room size (8 × 6 × 4 m) and reverberation time (Room 3).

B. Baseline

To our knowledge, the proposed method is the first DNN-based implementation for 2-D DOA estimation of multiple overlapping sound events. Thus, in order to evaluate the complete features of the proposed DOAnet, we compare its performance with the conventional high-resolution DOA estimator based on MUSIC.
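A hedged sketch of the anechoic spatialization described above: real orthonormalized (N3D) first-order spherical harmonics are one common convention for y(φ, λ), and the distance gain is reconstructed here as g = √(1/10^(d/d_max)); both conventions are assumptions where the extracted text is ambiguous.

```python
import numpy as np

def foa_encoder(azi_deg, ele_deg):
    """Real orthonormalized first-order spherical harmonics
    [Y00, Y1(-1), Y10, Y11] for azimuth phi and elevation lam."""
    phi, lam = np.radians(azi_deg), np.radians(ele_deg)
    c0, c1 = np.sqrt(1.0 / (4 * np.pi)), np.sqrt(3.0 / (4 * np.pi))
    return np.array([c0,
                     c1 * np.cos(lam) * np.sin(phi),   # Y1(-1)
                     c1 * np.sin(lam),                 # Y10
                     c1 * np.cos(lam) * np.cos(phi)])  # Y11

def distance_gain(d, d_max=10.0):
    # assumed reading of the paper's attenuation model: -10 dB at d_max
    return np.sqrt(1.0 / 10.0 ** (d / d_max))

def anechoic_scene(sources):
    """sources: list of (signal, azi_deg, ele_deg, distance) tuples.
    Returns the 4-channel FOA mix x_A = sum_i g_i s_i y(phi_i, lam_i)."""
    n = max(len(s) for s, *_ in sources)
    x = np.zeros((n, 4))
    for s, azi, ele, d in sources:
        x[:len(s)] += distance_gain(d) * np.outer(s, foa_encoder(azi, ele))
    return x

scene = anechoic_scene([(np.random.randn(44100), 30, 10, 2.0),
                        (np.random.randn(44100), -90, 0, 8.0)])
assert scene.shape == (44100, 4)
```

The reverberant case replaces the single gain vector per source with a sum of per-reflection filtered copies, as in the image-source expression above.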
Similar to the SPS and DOA outputs estimated by the DOAnet, the MUSIC method also estimates an SPS and DOAs, thus allowing a direct one-to-one comparison. The MUSIC SPS is based on a measure of orthogonality between the signal subspace of the spatial covariance matrix C_s (dominated by the source signals) and the noise subspace (dominated by diffuse and ambient sounds, late reverberation, and microphone noise). The spatial covariance matrix is calculated as C_s = E_{f,t}[X(f,t) X(f,t)^H], where the spectrogram X(f,t) is a frequency- (f) and time- (t) dependent C-dimensional vector, C is the number of channels, (·)^H is the conjugate transpose, and E_{f,t} denotes the expectation over f and t. For a sound scene with O sources, the MUSIC SPS S_GT is obtained from C_s by first performing an eigenvalue decomposition C_s = EΛE^H. The eigenvectors in E, sorted by decreasing magnitude of their eigenvalues, are partitioned into the two aforementioned subspaces, E = [U_s U_n], where the signal subspace U_s is composed of the O eigenvectors corresponding to the largest eigenvalues and the rest form the noise subspace U_n. The SPS along the direction (φ_i, λ_i) is then given by S_GT(φ_i, λ_i) = 1 / (y^T(φ_i, λ_i) U_n U_n^H y(φ_i, λ_i)). Finally, the source DOAs are found by selecting the directions (φ_i, λ_i) corresponding to the O largest peaks of S_GT.

C. Metric

The SPS estimated by the DOAnet, S_E(φ, λ), is evaluated with respect to the ground truth SPS estimated by the baseline MUSIC, S_GT(φ, λ), using an SNR metric calculated as SNR = 10 log10( Σ_φ Σ_λ S_GT(φ, λ)² / Σ_φ Σ_λ (S_E(φ, λ) − S_GT(φ, λ))² ). As the DOA metric, we use the angle in degrees between the estimated DOA (defined by azimuth φ_E and elevation λ_E) and the ground truth DOA (φ_GT, λ_GT) used to synthesize the dataset.
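The baseline computation can be sketched with NumPy. The two-source FOA covariance below is synthetic and purely illustrative, and the spherical-harmonic steering vectors reuse the real orthonormalized convention assumed earlier.

```python
import numpy as np

def steering(azi_deg, ele_deg):
    # real orthonormalized first-order spherical harmonics (assumed convention)
    phi, lam = np.radians(azi_deg), np.radians(ele_deg)
    c0, c1 = np.sqrt(1 / (4 * np.pi)), np.sqrt(3 / (4 * np.pi))
    return np.array([c0, c1 * np.cos(lam) * np.sin(phi),
                     c1 * np.sin(lam), c1 * np.cos(lam) * np.cos(phi)])

def music_sps(C_s, grid, num_sources):
    """MUSIC pseudo-spectrum S(phi, lam) = 1 / (y^T U_n U_n^H y)."""
    eigvals, eigvecs = np.linalg.eigh(C_s)         # ascending eigenvalues
    U_n = eigvecs[:, :C_s.shape[0] - num_sources]  # noise subspace
    sps = []
    for azi, ele in grid:
        y = steering(azi, ele)
        denom = y @ U_n @ U_n.conj().T @ y
        sps.append(1.0 / max(denom.real, 1e-12))
    return np.array(sps)

# Illustrative covariance of two uncorrelated sources plus a little noise.
true_doas = [(40, 20), (-120, 0)]
C_s = sum(np.outer(steering(a, e), steering(a, e)) for a, e in true_doas)
C_s = C_s + 1e-6 * np.eye(4)

grid = [(a, e) for e in range(-60, 70, 10) for a in range(-180, 180, 10)]
sps = music_sps(C_s, grid, num_sources=len(true_doas))
# The largest peaks coincide with the true directions.
top2 = [grid[i] for i in np.argsort(sps)[-2:]]
assert set(top2) == set(true_doas)
```

With four channels and three sources, only a single noise-subspace vector would remain, which is the instability the results section discusses.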
This is calculated as σ = arccos( sin λ_E sin λ_GT + cos λ_E cos λ_GT cos(φ_GT − φ_E) ) · 180/π. Further, to accommodate the scenario of an unequal number of estimated and ground truth DOAs, we calculate and report the minimum distance between them using the Hungarian algorithm [24], along with the percentage of frames in which the number of estimated DOAs was correct. The final metric for the entire dataset, referred to as the DOA error, is calculated by normalizing the minimum distance by the total number of estimated DOAs.

D. Evaluation procedure

The parameter tuning for the DOAnet was performed on the O1A test data, and the best configuration is shown in Figure 1. This configuration has 677 K weights, and the same configuration is used in all of the following studies. At test time, the SNR metric for the SPS output of the DOAnet (S_E) is calculated with respect to the SPS of the baseline MUSIC (S_GT). The DOA metrics for the DOAs predicted by the DOAnet and by the baseline MUSIC are calculated with respect to the ground truth DOAs used to synthesize the dataset.

In the above experiment, the baseline MUSIC algorithm uses the knowledge of the number of active sources. In order to have a fair evaluation, we also test the DOAnet in a similar scenario where the number of sources is known. We use this knowledge to choose the top probabilities in the prediction layer of the DOAnet instead of thresholding them at 0.5.

Fig. 2. SPS for two closely located sound sources, as estimated by (a) MUSIC and (b) the DOAnet. The black-cross markers represent the ground truth DOA. The horizontal axis is the azimuth and the vertical axis the elevation angle (in degrees).

TABLE I. Evaluation metric scores for the spatial power map and DOAs estimated by the DOAnet for different datasets.

                                         Anechoic         Reverberant (Room 1)
Max. no. of overlapping sources          1     2      3      1      2      3
SPS SNR (in dB)                       9.90  3.35  -0.26   3.11   1.24   0.13
DOA error, unknown number of active sources (threshold of 0.5)
  DOAnet                              0.57  8.03  18.34   6.31  11.46  38.41
  Correctly predicted frames (in %)   95.4  42.7    1.8   59.3   15.8    1.2
DOA error, known number of active sources
  DOAnet                              1.14 27.52  49.30  12.61  38.98  67.07
  MUSIC                               2.29  8.60  28.66  25.80  57.33  91.72

IV. RESULTS AND DISCUSSION

The results of the evaluations are presented in Table I. The high SNRs for the SPS in both contexts with up to one and two overlapping sound events show that the SPS generated by the DOAnet (S_E) is comparable with the baseline MUSIC SPS (S_GT). Figure 2 shows S_E and the respective S_GT when two active sources are closely located. In the case of up to three overlapping sound events, the baseline MUSIC is already at its theoretical limit of estimating N − 1 sources from an N-dimensional signal space [25]. In practice, for N − 1 sources only one noise subspace vector U_n is used to generate the SPS, which for real signals is too weak for a stable estimate. In the present evaluation, the DOAnet is trained with four-channel audio features and the MUSIC SPS, so for the case of three overlapping sound sources the SPS used is an unstable estimate, resulting in poor training and consequently poor results. With more than four input channels, to which the proposed DOAnet can easily be extended, it can potentially localize more than two sound sources simultaneously.

The DOA error of the proposed DOAnet when the number of active sources is unknown is presented in Table I. The DOAnet error is considerably better than that of the baseline MUSIC, which uses the knowledge of the number of active sources, for all datasets. However, the number of frames in which the DOAnet produced the correct number of active sources was small.
For example, in the case of anechoic recordings with up to two overlapping sound events, only 42.7% of the estimated frames had the correct number of DOA predictions. This number drops drastically when the number of sources is three, due to the theoretical limit of MUSIC explained previously, and consequently of the DOAnet, since the MUSIC SPS is used for its training. Finally, the confusion matrices for the number of DOA estimates per frame for the O1 and O2 datasets are visualized in Figure 3. We skip the confusion matrices for the O3 datasets, as they were not meaningful for reasons similar to those explained above.

Fig. 3. Confusion matrices for the number of DOAs estimated per frame by the DOAnet, for (a) O1A, (b) O2A, (c) O1R and (d) O2R. The horizontal axis is the DOAnet estimate, and the vertical axis is the ground truth.

With the knowledge of the number of active sources (Table I), the DOAnet performs considerably better than the baseline MUSIC for all datasets other than O2A and O3A. The MUSIC DOAs were chosen using a 2-D peak finder on the MUSIC SPS, whereas the DOAs of the DOAnet were chosen by simply picking the top probabilities in the final DOA prediction layer. A smarter peak-picking method for the DOAnet, or using the number of sources as an additional input, can potentially result in better scores across all datasets. Further, the DOAnet error on unmatched reverberant data is presented in Table II.

TABLE II. Evaluation scores for unmatched reverberant rooms.

                                        Room 2        Room 3
Max. no. of overlapping sources          1     2       1     2
SPS SNR (in dB)                       3.53  1.49    3.49  1.46
DOA error, unknown number of sources
  DOAnet                              3.44  6.88    4.59 10.89
  Correctly predicted frames (in %)   46.2  14.3    49.7  14.1
DOA error, known number of sources
  DOAnet                              8.60 32.10    9.17 33.82
  MUSIC                              31.52 58.47   33.25 60.76
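For completeness, the DOA error of Section III-C, great-circle angles between estimates and ground truth matched with the Hungarian algorithm when their counts differ, can be sketched as follows (scipy's `linear_sum_assignment` implements the Hungarian method; the example directions are illustrative).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def angular_distance_deg(doa_a, doa_b):
    """Central angle between two (azimuth, elevation) pairs, in degrees."""
    (az1, el1), (az2, el2) = np.radians(doa_a), np.radians(doa_b)
    cos_sigma = (np.sin(el1) * np.sin(el2)
                 + np.cos(el1) * np.cos(el2) * np.cos(az1 - az2))
    return np.degrees(np.arccos(np.clip(cos_sigma, -1.0, 1.0)))

def doa_error(estimated, ground_truth):
    """Minimum total angular distance under Hungarian matching,
    normalized by the number of estimated DOAs."""
    cost = np.array([[angular_distance_deg(e, g) for g in ground_truth]
                     for e in estimated])
    rows, cols = linear_sum_assignment(cost)  # handles unequal counts
    return cost[rows, cols].sum() / max(len(estimated), 1)

gt = [(0, 0), (90, 0)]
est = [(90, 10), (10, 0)]      # deliberately out of order
err = doa_error(est, gt)
assert abs(err - 10.0) < 1e-6  # both matches are 10 degrees off
```

The `np.clip` guards against floating-point values marginally outside [-1, 1] before the arccos.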
The performance of the DOAnet is seen to be consistent with that on the matched reverberant data in Table I, and significantly better than the performance of MUSIC.

In this paper, since the baseline was chosen to be MUSIC, for a fair comparison the DOAnet was also trained using the MUSIC SPS. Ideally, considering that the DOAnet is trained using datasets for which the ground truth DOAs are known, we can generate an accurate high-resolution SPS from the ground truth DOAs as required by the application and use it for training. Alternatively, the DOAnet can be trained without the SPS to directly generate the DOAs; the SPS was only used in this paper to present the complete potential of the method in the limited paper space. In general, the above results show that the proposed DOAnet has the potential to learn the 2-D direction information of multiple overlapping sound sources directly from the spectrogram of the input audio, without knowledge of the number of active sound sources. An exhaustive study with more detailed experiments, including both synthetic and real datasets, is planned for future work.

V. CONCLUSION

A convolutional recurrent neural network (DOAnet) was proposed for multiple-source localization. The DOAnet was shown to learn the number of active sources directly from the input spectrogram and to estimate precise DOAs in the 2-D polar space. The method was evaluated on anechoic, matched and unmatched reverberant datasets. The proposed DOAnet performed considerably better than the baseline MUSIC in most scenarios, showing the potential of the DOAnet to learn a highly computational algorithm without prior knowledge of the number of sources.

REFERENCES

[1] M. Woelfel and J. McDonough, Distant Speech Recognition. Wiley, 2009.
[2] J. Nikunen and T. Virtanen, "Direction of arrival based spatial covariance model for blind sound source separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 3, 2014.
[3] A. Politis et al., "Sector-based parametric sound field reproduction in the spherical harmonic domain," IEEE Journal of Selected Topics in Signal Processing, vol. 9, no. 5, pp. 852–866, 2015.
[4] Y. Huang et al., "Real-time passive source localization: a practical linear-correction least-squares approach," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 8, 2001.
[5] M. S. Brandstein and H. F. Silverman, "A high-accuracy, low-latency technique for talker localization in reverberant environments using microphone arrays," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1997.
[6] R. O. Schmidt, "Multiple emitter location and signal parameter estimation," IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, 1986.
[7] R. Roy and T. Kailath, "ESPRIT-estimation of signal parameters via rotational invariance techniques," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 7, 1989.
[8] J. H. DiBiase et al., "Robust localization in reverberant rooms," in Microphone Arrays, 2001, pp. 157–180.
[9] A. O'Donovan et al., "Imaging concert hall acoustics using visual and audio cameras," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008.
[10] D. Khaykin and B. Rafaely, "Acoustic analysis by spherical microphone array processing of room impulse responses," The Journal of the Acoustical Society of America, vol. 132, no. 1, 2012.
[11] R. Roden et al., "On sound source localization of speech signals using deep neural networks," in Deutsche Jahrestagung für Akustik (DAGA), 2015.
[12] X. Xiao et al., "A learning-based approach to direction of arrival estimation in noisy and reverberant environments," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
[13] R. Takeda and K. Komatani, "Discriminative multiple sound source localization based on deep neural networks using independent location model," in IEEE Spoken Language Technology Workshop (SLT), 2016.
[14] A. Zermini et al., "Deep neural network based audio source separation," in International Conference on Mathematics in Signal Processing, 2016.
[15] F. Vesperini et al., "A neural network based algorithm for speaker localization in a multi-room environment," in IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2016.
[16] S. Chakrabarty and E. A. P. Habets, "Broadband DOA estimation using convolutional neural networks trained with noise signals," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017.
[17] H. Teutsch, Modal Array Signal Processing: Principles and Applications of Acoustic Wavefield Decomposition. Springer, 2007, vol. 348.
[18] T. N. Sainath et al., "Convolutional, long short-term memory, fully connected deep neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
[19] M. Malik et al., "Stacked convolutional and recurrent neural networks for music emotion recognition," in Sound and Music Computing Conference (SMC), 2017.
[20] T. Sainath et al., "Multichannel signal processing with deep neural networks for automatic speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017.
[21] S. Adavanne et al., "Sound event detection using spatial features and convolutional recurrent neural network," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
[22] E. Benetos et al., "Sound event detection in synthetic audio," http://www.cs.tut.fi/sgn/arg/dcase2016/, 2016.
[23] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," The Journal of the Acoustical Society of America, vol. 65, no. 4, 1979.
[24] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Research Logistics Quarterly, vol. 2, 1955, pp. 83–97.
[25] B. Ottersten et al., "Exact and large sample maximum likelihood techniques for parameter estimation and detection in array processing," in Radar Array Processing, Springer Series in Information Sciences, 1993.