Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks


Authors: Sharath Adavanne, Archontis Politis, Joonas Nikunen, and Tuomas Virtanen

Sharath Adavanne, Member, IEEE, Archontis Politis, Member, IEEE, Joonas Nikunen, Member, IEEE, and Tuomas Virtanen, Member, IEEE

Abstract—In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram time-frames as input and maps it to two outputs in parallel. As the first output, sound event detection (SED) is performed as a multi-label classification task on each time-frame, producing temporal activity for all the sound event classes. As the second output, localization is performed by estimating the 3D Cartesian coordinates of the direction-of-arrival (DOA) for each sound event class using multi-output regression. The proposed method is able to associate multiple DOAs with the respective sound event labels and further track this association with respect to time. The proposed method uses separately the phase and magnitude components of the spectrogram calculated on each audio channel as features, thereby avoiding any method- and array-specific feature extraction. The method is evaluated on five Ambisonic and two circular array format datasets with different overlapping sound events in anechoic, reverberant and real-life scenarios. The proposed method is compared with two SED, three DOA estimation, and one SELD baselines. The results show that the proposed method is generic and applicable to any array structure, robust to unseen DOA values, reverberation, and low SNR scenarios. The proposed method achieved a consistently higher recall of the estimated number of DOAs across datasets in comparison to the best baseline.
Additionally, this recall was observed to be significantly better than that of the best baseline method for a higher number of overlapping sound events.

Index Terms—Sound event detection, direction of arrival estimation, convolutional recurrent neural network

S. Adavanne, J. Nikunen and T. Virtanen are with the Signal Processing Laboratory, Tampere University of Technology, Finland, e-mail: firstname.lastname@tut.fi. A. Politis is with the Department of Signal Processing and Acoustics, Aalto University, Finland, e-mail: archontis.politis@aalto.fi. The research leading to these results has received funding from the European Research Council under the European Union's H2020 Framework Programme through ERC Grant Agreement 637422 EVERYSOUND. The authors also wish to acknowledge CSC-IT Center for Science, Finland, for computational resources. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

I. INTRODUCTION

Sound event localization and detection (SELD) is the combined task of identifying the temporal activities of each sound event, estimating their respective spatial location trajectories when active, and further associating textual labels with the sound events. Such a method can, for example, automatically describe social and human activities and assist the hearing impaired to visualize sounds. Robots can employ this for navigation and natural interaction with their surroundings [1–4]. Smart cities, smart homes, and industries could use it for audio surveillance [5–8]. Smart meeting rooms can recognize speech among other events and use this information to beamform and enhance the speech for teleconferencing or for robust automatic speech recognition [9–13]. Naturalists could use it for biodiversity monitoring [14–16].
Further, in virtual reality (VR) applications with 360° audio, SELD can be used to assist the user in visualizing sound events.

A. Sound event detection

The SELD task can be broadly divided into two subtasks: sound event detection (SED) and sound source localization. SED aims at detecting temporally the onsets and offsets of sound events and further associating textual labels with the detected events. Sound events in real life most often overlap with other sound events in time, and the task of recognizing all the overlapping sound events is referred to as polyphonic SED. The SED task in the literature has most often been approached using different supervised classification methods that predict the frame-wise activity of each sound event class. Some of the classifiers include the Gaussian mixture model (GMM) - hidden Markov model (HMM) [27], fully connected (FC) neural networks [28], recurrent neural networks (RNN) [29–32], and convolutional neural networks (CNN) [33, 34]. More recently, state-of-the-art results were obtained by stacking CNN, RNN and FC layers consecutively, referred to jointly as the convolutional recurrent neural network (CRNN) [35–39]. Lately, in order to improve the recognition of overlapping sound events, several multichannel SED methods have been proposed [39–43], and these were among the top performing methods in the real-life SED task of the DCASE 2016¹ and 2017² evaluation challenges. More recently, we studied the SED performance on identical sound scenes captured using single, binaural and first-order Ambisonics (FOA) microphones [35], where the order denotes the spatial resolution of the format and the first order corresponds to four channels.
The results showed that the recognition of overlapping sound events improved with an increase in spatial sampling, and the best performance was obtained with FOA.

¹http://www.cs.tut.fi/sgn/arg/dcase2016/task-results-sound-event-detection-in-real-life-audio#system-characteristics
²http://www.cs.tut.fi/sgn/arg/dcase2017/challenge/task-sound-event-detection-in-real-life-audio-results#system-characteristics

TABLE I: Summary of DNN-based DOA estimation methods in the literature. The azimuth and elevation angles are denoted as 'azi' and 'ele', distance as 'dist'; 'x' and 'y' represent the distance along the respective Cartesian axis. 'Full' represents the estimation in the complete range of the respective format, and 'regression' denotes the estimation type.

Approach | Input feature | Output format | Sources | DNN | Array | SELD
Chakrabarty et al. [17, 18] | Phase spectrum | azi | 1, multiple | CNN | Linear | ×
Yalta et al. [3] | Spectral power | azi (Full) | 1 | CNN ResNet | Robot | ×
Xiao et al. [19] | GCC | azi (Full) | 1 | FC | Circular | ×
Takeda et al. [1, 2] | Eigenvectors of spatial covariance matrix | azi (Full) | 1, 2 | FC | Robot | ×
He et al. [4] | GCC | azi (Full) | Multiple | CNN | Robot | ×
Hirvonen [20] | Spectral power | azi (Full) for each class | Multiple | CNN | Circular | ✓
Yiwere et al. [21] | ILD, cross-correlation | azi and dist | 1 | FC | Binaural | ×
Ferguson et al. [22] | GCC, cepstrogram | azi and dist (regression) | 1 | CNN | Linear | ×
Vesperini et al. [23] | GCC | x and y (regression) | 1 | FC | Distributed | ×
Sun et al. [24] | GCC | azi and ele | 1 | PNN | Cartesian | ×
Adavanne et al. [25] | Phase and magnitude spectrum | azi and ele (Full) | Multiple | CRNN | Generic | ×
Roden et al. [26] | ILD, ITD, phase and magnitude spectrum | azi, ele and dist (separate NN) | 1 | FC | Binaural | ×
Proposed | Phase and magnitude spectrum | azi and ele (Full, regression) for each class | Multiple | CRNN | Generic | ✓

B. Sound source localization

Sound source localization is the task of determining the direction or position of a sound source with respect to the microphone. In this paper, we only deal with the estimation of the sound event direction, generally referred to as direction-of-arrival (DOA) estimation. The DOA methods in the literature can be broadly categorized into parametric- and deep neural network (DNN)-based approaches. Some popular parametric methods are based on the time-difference-of-arrival (TDOA) [44], the steered-response power (SRP) [45], multiple signal classification (MUSIC) [46], and the estimation of signal parameters via rotational invariance technique (ESPRIT) [47]. These methods vary in terms of algorithmic complexity, constraints on array geometry, and model assumptions about the acoustic scenario. Subspace methods like MUSIC can be applied with different array types and can produce high-resolution DOA estimates of multiple sources. On the other hand, subspace methods require a good estimate of the number of active sources, which may be hard to obtain, and they have been found sensitive to reverberant and low signal-to-noise ratio (SNR) scenarios [48]. Recently, DNN-based methods were employed to overcome some of the drawbacks of parametric methods, while being robust to reverberation and low SNR scenarios. Additionally, implementing the localization task in the DNN framework allows seamless integration into broader DNN tasks such as SELD [20]; robots can use it for sound-source-based navigation and natural interaction in multi-speaker scenarios [1–4].
A summary of the most recent DNN-based DOA estimation methods is presented in Table I. All these methods estimate DOAs for static point sources and were shown to perform equally well as or better than the parametric methods in reverberant scenarios. Further, the methods in [4, 18, 20, 25] proposed to simultaneously detect the DOAs of overlapping sound events by estimating the number of active sources from the data itself. Most methods used a classification approach, thereby estimating the source presence likelihood at a fixed set of angles, while [22, 23] used a regression approach and let the DNN produce a continuous output. All of the past works were evaluated on different array geometries, making a direct performance comparison difficult. Most of the methods estimated the full azimuth ('Full' in Table I) using microphones mounted on a robot, circular, and distributed arrays, while the rest of the methods used linear arrays, thereby estimating only the azimuth angles in a range of 180°. Although a few of the existing methods estimated the azimuth and elevation jointly [24, 25], most of them estimated only the azimuth angle [1–4, 17–20]. In particular, we studied the joint estimation of azimuth and elevation angles in [25]; this was enabled by the use of Ambisonic signals (FOA) obtained using a spherical array. Ambisonics are also known as spherical harmonic (SH) signals in the array processing literature, and they can be obtained from various array configurations, such as circular or planar (for 2D capture) and spherical or volumetric (for 3D capture), using an appropriate linear transform of the recordings [49]. The same Ambisonic channels have the same spatial characteristics independent of the recording setup, and hence studies on such hardware-independent formats make the evaluation and results more easily comparable in the future.
Most of the previously proposed DNN-based DOA estimation methods relied on a single array or distributed arrays of omnidirectional microphones, and captured source location information mostly in the phase or time-delay differences between the microphones. However, compact microphone arrays with full azimuth and elevation coverage, such as spherical microphone arrays, rely strongly on the directionality of the sensors to capture spatial information; this is reflected mainly in the magnitude differences between channels. Motivated by this fact, we proposed to use both the magnitude and phase components of the spectrogram as input features in [25], thus making the DOA estimation method of [25] generic to the array configuration by avoiding method-specific feature extractions such as the inter-aural level difference (ILD), the inter-aural time difference (ITD), generalized cross-correlation (GCC), or the eigenvectors of the spatial covariance matrix used in previous methods (Table I).

C. Joint localization and detection

In the presence of multiple overlapping sound events, the DOA estimation task becomes the classical tracking problem of correctly associating the multiple DOA estimates with the respective sources, without necessarily identifying the source [50, 51]. The problem is further extended for the polyphonic SELD task if the SED and DOA estimation are done separately, resulting in a data association problem between the recognized sound events and the estimated DOAs [13]. One solution to the data association problem is to jointly predict the SED and DOA. In this regard, to the best of the authors' knowledge, [20] is the only DNN-based method which performs SELD. Other works combining SED and parametric DOA estimation include [6, 13, 52, 53]. Lopatka et al.
[53] used a 3D sound intensity acoustic vector sensor with MPEG-7 spectral and temporal features, along with a support vector machine classifier, to estimate the DOA along azimuth for five classes of non-overlapping sound events. Butko et al. [13] used distributed microphone arrays to recognize 14 different sound events with an overlap of two at a time, using a GMM-HMM classifier, and localized them inside a meeting room using the SRP method. Chakraborty et al. [52] replaced the SRP-based localization in [13] with a sound-model-based localization, thereby fixing the data association problem faced in [13]. In contrast, Hirvonen [20] extracted the frame-wise spectral power from each microphone of a circular array and used a CNN classifier to map it to eight angles in full azimuth for each sound event class in the dataset. In this output format, the resolution of azimuth is limited to the trained directions, and the performance on unseen DOA values is unknown. For larger datasets with a higher number of sound events and increased resolution along the azimuth and elevation directions, this approach results in a large number of output nodes. Training such a DNN with a large number of output nodes, where the number of positive class labels per frame is one or two against a high number of negative class labels, poses the challenges of an imbalanced dataset. Additionally, training such a large number of classes requires a huge dataset with enough examples for each class. On the other hand, this output format allows the network to simultaneously recognize more than one instance of the same sound event in a given time frame at different locations.

D. Contributions of this paper

In general, the number of existing SELD methods is limited [6, 13, 20, 52, 53], with only one published DNN-based approach [20]. On the other hand, there are several DNN-based methods in the literature for the SELD subtasks of SED and DOA estimation.
Yet, there is no comprehensive published work that studies the various choices affecting the performance of these DNN-based SED, DOA and SELD methods, compares them with multiple competitive baselines, and evaluates them over a wide range of acoustic conditions. Besides, with respect to the SELD task, the existing methods [6, 13, 52, 53] localize at most one or two overlapping sound events and do not scale to a higher number of overlapping sources. Further, the only DNN-based SELD method [20] localizes sound events exclusively at a predefined grid of directions and requires a large number of output classes for a higher number of sound event labels and increased spatial resolution. Additionally, all the above SELD approaches use method-specific features and are hence not independent of the input array structure. In contrast to existing SELD methods, this paper presents novelty in two broad areas: the proposed SELD method, and the exhaustive evaluation studies presented. The novelty of the proposed SELD method is as follows. It is the first method that addresses the problem of localizing and recognizing more than two overlapping sound events simultaneously and tracking their activity with respect to time. The proposed method is able to localize sources at any azimuth and elevation angles while being robust to unseen spatial locations, reverberation, and ambiance. Further, the method itself is generic enough to learn to perform SELD from any input array structure. Specifically, we propose to use the polyphonic SED output [39] as a confidence measure for choosing the DOAs estimated in a regression manner. By this approach, we not only extend the state-of-the-art polyphonic SED performance [39] to polyphonic SELD but also tackle the data association problem caused by the polyphony in SELD tasks [13].
As the second broad area of novelty, we present the performance of the proposed method with respect to various design choices made, such as the DNN architecture, the input feature, and the DOA output format. Additionally, we present comprehensive results of the proposed method with respect to six baselines (two SED, three DOA estimation, and one SELD baseline) evaluated on seven datasets with different acoustic conditions (anechoic and reverberant scenarios with simulated and real-life impulse responses), array configurations (Ambisonic and circular array), and numbers of overlapping sound events.

In order to facilitate reproducibility of research, the proposed method and all the datasets used have been made publicly available³. Additionally, the real-life impulse responses used to simulate the datasets have also been published to enable users to experiment with custom sound events.

The rest of the paper is organized as follows. In Section II, we describe the proposed SELD method and the training procedure. In Section III, we describe the datasets, the baseline methods, the metrics, and the experiments carried out for evaluating the proposed method. The experimental results on the evaluation datasets are presented, compared with the baselines, and discussed in Section IV. Finally, we summarize the conclusions of the work in Section V.

³https://github.com/sharathadavanne/seld-net

II. METHOD

The block diagram of the proposed method for SELD is presented in Figure 1a. The input to the method is multichannel audio. The phase and magnitude spectrograms are extracted from each audio channel and are used as separate features. The proposed method takes as input a sequence of features from consecutive spectrogram frames and predicts all the sound event classes active in each input frame, along with their respective spatial locations, producing the temporal activity and DOA trajectory for each sound event class.
In particular, a CRNN is used to map the feature sequence to the two outputs in parallel. At the first output, SED is performed as a multi-label classification task, allowing the network to simultaneously estimate the presence of multiple sound events in each frame. At the second output, DOA estimates in the continuous 3D space are obtained as a multi-output regression task, where each sound event class is associated with three regressors that estimate the 3D Cartesian coordinates x, y and z of the DOA on a unit sphere around the microphone. The SED output of the network is in the continuous range of [0, 1] for each sound event in the dataset, and this value is thresholded to obtain a binary decision for the respective sound event activity, as shown in Figure 1b. Finally, the respective DOA estimates for these active sound event classes provide their spatial locations. The feature extraction and the proposed method are described in detail in the following sections.

A. Feature extraction

The spectrogram is extracted from each of the C channels of the multichannel audio using an M-point discrete Fourier transform (DFT) on Hamming windows of length M with 50% overlap. The phase and magnitude of the spectrogram are then extracted and used as separate features. Only the M/2 positive frequencies, without the zeroth bin, are used. The output of the feature extraction block in Figure 1a is a feature sequence of T frames, with an overall dimension of T × M/2 × 2C, where the 2C dimension consists of C magnitude and C phase components.

B. Neural network architecture

The output of the feature extraction block is fed to the neural network, as shown in Figure 1a. In the proposed architecture, the local shift-invariant features in the spectrogram are learned using multiple layers of 2D CNN.
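The per-channel feature extraction of Section II-A can be sketched with NumPy as follows. This is a minimal sketch under the stated parameters (M-point DFT, Hamming window, 50% overlap, M/2 positive bins without the zeroth); the function and variable names are ours, not from the paper.

```python
import numpy as np

def extract_features(audio, M=1024):
    """Per-channel magnitude/phase spectrogram features of shape T x M/2 x 2C.

    audio: (samples, C) multichannel signal; the first C feature planes hold
    the magnitudes and the last C planes the phases, as described in the text.
    """
    C = audio.shape[1]
    hop = M // 2                         # 50% overlap
    win = np.hamming(M)
    T = (audio.shape[0] - M) // hop + 1  # number of spectrogram frames
    feat = np.zeros((T, M // 2, 2 * C))
    for t in range(T):
        for c in range(C):
            frame = audio[t * hop:t * hop + M, c] * win
            spec = np.fft.rfft(frame)[1:M // 2 + 1]  # drop the zeroth bin
            feat[t, :, c] = np.abs(spec)             # C magnitude components
            feat[t, :, C + c] = np.angle(spec)       # C phase components
    return feat
```

A hop of M/2 samples implements the 50% window overlap; stacking magnitude and phase along the channel axis yields exactly the T × M/2 × 2C tensor fed to the network.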
Each CNN layer has P filters with 3 × 3 × 2C receptive fields (as in [25]) acting along the time-frequency-channel axes, with rectified linear unit (ReLU) activations. The use of filter kernels spanning all the channels allows the CNN to learn the relevant inter-channel features required for localization, whereas the time and frequency dimensions of the kernel allow learning the relevant intra-channel features suitable for both the DOA and SED tasks. After each CNN layer, the output activations are normalized using batch normalization [54], and the dimensionality is reduced using max-pooling (MPi) along the frequency axis, thereby keeping the sequence length T unchanged. The output after the final CNN layer with P filters is of dimension T × 2 × P, where the reduced frequency dimension of 2 is a result of the max-pooling across the CNN layers (see Section IV-1).

Fig. 1. a) The proposed SELDnet and b) the frame-wise output for frame t in Figure a). A sound event is said to be localized and detected when the confidence of the SED output exceeds the threshold.
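The architecture of Fig. 1a can be sketched in Keras (the framework the paper reports using). This is a minimal sketch, not the tuned network: the layer counts, the values of P, Q, R, N and the pooling tuple below are illustrative placeholders, and the default pooling (8, 8, 4) merely reproduces the final frequency dimension of 2 for M = 1024.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_seldnet(T=128, M=1024, C=4, N=11, P=64, Q=128, R=128,
                  pools=(8, 8, 4)):
    """Sketch of SELDnet: CNN blocks -> bidirectional GRUs -> two FC branches.

    Input: T x M/2 x 2C feature sequence.
    Outputs: SED (T x N, sigmoid) and DOA (T x 3N, tanh).
    """
    inp = layers.Input(shape=(T, M // 2, 2 * C))
    x = inp
    for mp in pools:                       # CNN feature extractor
        x = layers.Conv2D(P, (3, 3), padding='same', activation='relu')(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D(pool_size=(1, mp))(x)  # pool frequency only
    x = layers.Reshape((T, -1))(x)         # T-frame sequence of feature vectors
    for _ in range(2):                     # temporal context
        x = layers.Bidirectional(
            layers.GRU(Q, activation='tanh', return_sequences=True))(x)
    sed = layers.TimeDistributed(layers.Dense(R))(x)          # SED branch
    sed = layers.TimeDistributed(
        layers.Dense(N, activation='sigmoid'), name='sed')(sed)
    doa = layers.TimeDistributed(layers.Dense(R))(x)          # DOA branch
    doa = layers.TimeDistributed(
        layers.Dense(3 * N, activation='tanh'), name='doa')(doa)
    return tf.keras.Model(inputs=inp, outputs=[sed, doa])
```

The two heads would then be trained jointly with the weighted loss combination described in Section II-C, e.g. `model.compile(optimizer='adam', loss={'sed': 'binary_crossentropy', 'doa': 'mse'}, loss_weights={'sed': 1.0, 'doa': 50.0})`, where the loss weights here are placeholders rather than the paper's values.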
The output activations from the CNN are further reshaped to a T-frame sequence of length-2P feature vectors and fed to bidirectional RNN layers, which are used to learn the temporal context information from the CNN output activations. Specifically, Q nodes of gated recurrent units (GRU) with tanh activations are used in each layer. This is followed by two parallel branches of FC layers, one each for SED and DOA estimation. The FC layers share weights across time steps. The first FC layer in each branch contains R nodes with linear activations. The last FC layer in the SED branch consists of N nodes with sigmoid activations, each corresponding to one of the N sound event classes to be detected. The use of sigmoid activations enables multiple classes to be active simultaneously. The last FC layer in the DOA branch consists of 3N nodes with tanh activations, where each of the N sound event classes is represented by 3 nodes corresponding to the sound event location in x, y, and z, respectively. For a DOA estimate on a unit sphere centered at the origin, the range of the location along each axis is [−1, 1]; thus we use the tanh activation for these regressors to keep the output of the network in a similar range. We refer to the above architecture as SELDnet. The SED output of the SELDnet is in the continuous range of [0, 1] for each class, while the DOA output is in the continuous range of [−1, 1] for each axis of the sound class location. A sound event is said to be active, and its respective DOA estimate is chosen, if the SED output exceeds a threshold of 0.5, as shown in Figure 1b. The network hyperparameters are optimized based on cross-validation, as explained in Section III-D1.

C. Training procedure

In each frame, the target values for each of the active sound events in the SED branch output are one, while those of the inactive events are zero.
Similarly, for the DOA branch, the reference DOA x, y, and z values are used as targets for the active sound events, and x = 0, y = 0, and z = 0 are used for the inactive events. A binary cross-entropy loss is used between the SED predictions of the SELDnet and the reference sound class activities, while a mean square error (MSE) loss is used between the DOA estimates of the SELDnet and the reference DOA. By using the MSE loss for DOA estimation in 3D Cartesian coordinates, we truly represent the distance between two points in space. The distance between two points (x1, y1, z1) and (x2, y2, z2) in 3D space is given by sqrt(SE), where SE = (x1 − x2)² + (y1 − y2)² + (z1 − z2)², while the MSE between the same points is given by SE/3. Thus the MSE loss is simply a scaled version of the squared distance in 3D space, and reducing the MSE loss implies reducing the distance between the two points.

Theoretically, the advantage of using Cartesian coordinates instead of azimuth and elevation for regression can be observed when predicting the DOA in full azimuth and/or full elevation. The angles are discontinuous at the wrap-around boundary (for example, the −180°/180° boundary for azimuth), while the Cartesian coordinates are continuous. This continuity allows the network to learn better. Further experiments on this are discussed in Section III-D.

We train the SELDnet with a weighted combination of the MSE and binary cross-entropy losses for 1000 epochs, using the Adam optimizer with default parameters as used in the original paper [55]. Early stopping is used to keep the network from over-fitting to the training split. The training is stopped if the SELD score (Section III-C) on the test split does not improve for 100 epochs. The network was implemented using the Keras library [56] with the TensorFlow [57] backend.

III. EVALUATION

A. Datasets

The proposed SELDnet is evaluated on seven datasets that are summarized in Table II.
Four of the datasets are synthesized with artificial impulse responses (IRs), consisting of anechoic and reverberant scenarios virtually recorded both with a circular array and in the Ambisonics format. Three of the datasets are synthesized with real-life impulse responses, recorded with a spherical array and encoded into the Ambisonics format. All the datasets consist of stationary point sources, each associated with a spatial coordinate. The synthesis procedure for all the datasets consists of mixing isolated sound event instances at different spatial locations, since this allows producing the reference event locations and times of activity for the evaluation and training of the methods.

1) TUT Sound Events 2018 - Ambisonic, Anechoic and Synthetic Impulse Response (ANSYN) dataset: This dataset consists of spatially located sound events in an anechoic environment, synthesized using artificial IRs. It comprises three subsets: no temporally overlapping sources (O1), maximum two temporally overlapping sources (O2), and maximum three temporally overlapping sources (O3). Each of the subsets consists of three cross-validation splits with 240 training and 60 testing FOA-format recordings of length 30 s, sampled at 44100 Hz. The dataset is generated using the 11 isolated sound event classes from the DCASE 2016 task 2 dataset [58], such as speech, coughing, door slam, page turning, phone ringing, and keyboard. Each of these sound classes has 20 examples, of which 16 are randomly chosen for the training set and the remaining four for the testing set, amounting to 176 examples from 11 classes for training and 44 for testing. During the synthesis of a recording, a random collection of examples is chosen from the respective set and randomly placed on a spatial grid of 10° resolution along azimuth and elevation, such that two overlapping sound events are separated by 10°, and the elevation is in the range of [−60°, 60°).
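The angular placement grid just described can be sketched as follows; this is a hypothetical helper of ours (the paper does not publish this exact routine), reflecting only the stated 10° grid and the [−60°, 60°) elevation range.

```python
import random

def random_grid_doa(rng=random):
    """Draw one DOA from an ANSYN-style placement grid: 10-degree resolution
    in azimuth and elevation, with elevation restricted to [-60, 60)."""
    azimuth = rng.choice(range(-180, 180, 10))   # full azimuth, 36 directions
    elevation = rng.choice(range(-60, 60, 10))   # limited elevation range
    return azimuth, elevation
```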
In order to have variability in amplitude, the sound events are randomly placed at a distance ranging from 1 to 10 m from the microphone, with 0.5 m resolution. More details regarding the synthesis can be found in [25].

2) TUT Sound Events 2018 - Ambisonic, Reverberant and Synthetic Impulse Response (RESYN) dataset: This dataset is synthesized with the same details as the Ambisonic ANSYN dataset, with the only difference being that the sound events are spatially placed within a room using the image source method [59]. Specifically, the microphone is placed at the center of the room, and the sound events are randomly placed around the microphone, with their distance ranging from 1 m from the microphone to the respective end of the room, at 0.5 m resolution.

TABLE II: Summary of datasets

Audio format | Sound scene | Impulse response | Dataset acronym | Train/Test, notes
Ambisonic (four channel) | Anechoic | Synthetic | ANSYN | 240/60
Ambisonic (four channel) | Reverberant | Synthetic | RESYN | 240/60
Ambisonic (four channel) | Reverberant | Real life | REAL | 240/60
Ambisonic (four channel) | Reverberant | Real life | REALBIG | 600/150
Ambisonic (four channel) | Reverberant | Real life | REALBIGAMB | 600/150, ambiance
Circular array (eight channel) | Anechoic | Synthetic | CANSYN | 240/60
Circular array (eight channel) | Reverberant | Synthetic | CRESYN | 240/60

The three cross-validation splits of each of the three subsets O1, O2 and O3 are generated for a moderately reverberant room of size 10 × 8 × 4 m (room 1), with reverberation times of 1.0, 0.8, 0.7, 0.6, 0.5, and 0.4 s per octave band, for band center frequencies of 125 Hz–4 kHz. Additionally, to study the performance in mismatched reverberant scenarios, testing splits are generated for two different-sized rooms: room 2, which is 80% of the volume (8 × 8 × 4 m) with the per-band reverberation times of room 1, and room 3, which is 125% of the volume (10 × 10 × 4 m) with the per-band reverberation times of room 1. In order to remove any ambiguity when comparing the performance of room 1 with rooms 2 and 3, we keep the sound events and their respective spatial locations in rooms 2 and 3 identical to the testing split of room 1.
However, individual sound events whose distance from the microphone exceeded the room dimensions were reassigned a new distance within the room. Further details on the reverberant synthesis can be found in [25].

3) TUT Sound Events 2018 - Ambisonic, Reverberant and Real-life Impulse Response (REAL) dataset: In order to study the performance of SELDnet in a real-life scenario, we generated a dataset by collecting impulse responses from a real environment using the Eigenmike 4 spherical microphone array. For the IR acquisition, we used a continuous measurement signal as in [60]. The measurement was done by slowly moving a Genelec G Two loudspeaker 5, continuously playing a maximum length sequence, around the array in a circular trajectory at one elevation at a time, as shown in Fig. 2. The playback volume was set 30 dB above the ambient sound level. The recording was done in a corridor inside the university, with classrooms around it.

The moving-source IRs were obtained with a freely available tool from the CHiME challenge [61], which estimates the time-varying responses in the STFT domain by forming a least-squares regression between the known measurement signal and the far-field recording, independently at each frequency. The IR for any azimuth within one trajectory can be analyzed by assuming block-wise stationarity of the acoustic channel. The average angular speed of the loudspeaker in the measurements was 6°/s, and we used a block size of 860 ms (81 STFT frames, with an analysis frame size of 1024, 50% overlap, and sample rate Fs = 48 kHz) for the estimation of IRs of length 170 ms (16 STFT frames). The IRs were collected at elevations −40° to 40° with 10° increments at 1 m from the Eigenmike, and at elevations −20° to 20° with 10° increments at 2 m. For the dataset creation, we analyzed the DOA of each time frame using MUSIC and extracted IRs for azimuthal angles at 10° resolution (36 IRs for each elevation). The IR estimation tool [61] was applied independently on all 32 channels of the Eigenmike.

4 https://mhacoustics.com/products
5 https://www.genelec.com/home-speakers/g-series-active-speakers

Fig. 2. Recording real-life impulse responses for sound scene generation. A person walks around the Eigenmike holding a Genelec loudspeaker playing a maximum length sequence at different elevation angles and distances.

In order to synthesize sound scenes from the estimated IRs, we used isolated real-life sound events from the urbansound8k dataset [62]. This dataset consists of 10 sound event classes: air conditioner, car horn, children playing, dog barking, drilling, engine idling, gunshot, jackhammer, siren and street music. Among these, we did not include the children playing and air conditioner classes, since these can also occur in our ambiance recording, which we use as background in the REALBIGAMB dataset (Section III-A5). From the sound examples in urbansound8k, we only used the ones marked as foreground, in order to have clean isolated sound events. Similarly to the other datasets used in this paper, we used splits 1, 8 and 9 provided in urbansound8k as the three cross-validation splits. These splits were chosen as they had a good number of examples for all the chosen sound event classes after selecting only the foreground examples. The final selected examples varied in length from 100 ms to 4 s, and amount to 15671.5 seconds of audio from 4542 examples. During the sound scene synthesis, we randomly chose a sound event example and associated it with a random distance among the collected ones, and a random azimuth and elevation angle. The sound event example was then convolved with the respective IR for the given distance, azimuth and elevation to spatially position it.
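The per-event spatialization step described above can be sketched as follows. This is a minimal illustration assuming mono events and a bank of multichannel IRs indexed by (distance, azimuth, elevation); the function and the toy IR bank are assumptions for the sketch, not the authors' actual synthesis code.

```python
import numpy as np

def spatialize(event, ir_bank, distance, azimuth, elevation):
    """Position a mono sound event in space by convolving it with the
    multichannel impulse response measured for the given direction.

    event:   (n_samples,) mono sound event
    ir_bank: dict mapping (distance, azimuth, elevation) -> (ir_len, n_ch) IR
    returns: (n_samples + ir_len - 1, n_ch) spatialized multichannel event
    """
    ir = ir_bank[(distance, azimuth, elevation)]          # (ir_len, n_ch)
    return np.stack(
        [np.convolve(event, ir[:, ch]) for ch in range(ir.shape[1])], axis=1)

# Toy usage: a 4-channel IR bank with one measured direction (1 m, 30°, -10°).
rng = np.random.default_rng(0)
ir_bank = {(1.0, 30, -10): rng.standard_normal((256, 4)) * 0.01}
event = rng.standard_normal(2000)                         # short mono event
out = spatialize(event, ir_bank, 1.0, 30, -10)
print(out.shape)  # (2255, 4)
```

In the full pipeline, the spatialized events would then be summed at their temporal positions into the 30 s multichannel recording before conversion to FOA.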
Finally, after positioning all the sound events in a recording, we converted the multichannel audio to the FOA format. The transform of the microphone signals to FOA was performed using the tools published in [63]. In total, we generated 300 such 30 s recordings in a similar fashion as ANSYN and RESYN, with 240 of them earmarked for training and 60 for testing. Similarly to the ANSYN recordings, we also generated three subsets O1, O2 and O3 with different numbers of overlapping sound events.

4) TUT Sound Events 2018 - Ambisonic, Reverberant and Real-life Impulse Response big (REALBIG) dataset: In order to study the performance of SELDnet with respect to the size of the dataset, we generated for each of the three ambisonic REAL subsets a REALBIG subset of 750 recordings of 30 s length, with 600 for training and 150 for testing.

5) TUT Sound Events 2018 - Ambisonic, Reverberant, Real-life Impulse Response and Ambiance big (REALBIGAMB) dataset: Additionally, to simulate a real sound scene, we recorded 30 min of ambient sound in the same location as the IR recordings, without changing the setup, to use as background noise. We mixed randomly chosen segments of the recorded ambiance at three different SNRs (0, 10 and 20 dB) into each of the three ambisonic REALBIG subsets, and refer to the results as the REALBIGAMB subsets. The ambiance used for the testing set was kept separate from that of the training set.

6) TUT Sound Events 2018 - Circular array, Anechoic and Synthetic Impulse Response (CANSYN) dataset: To study the performance of SELDnet on generic array configurations, similarly to the SELD baseline method [20] (Section III-B3), we synthesized the ANSYN recordings for a circular array of radius 5 cm with eight omnidirectional microphones at 0°, 45°, 90°, 135°, 180°, 225°, 270° and 315°, with the array plane parallel to the ground, and refer to it as CANSYN.
It is an exact replica of the ANSYN dataset in terms of the synthesized sound events, except for the microphone array setup, and hence the number of channels. Similarly to ANSYN, the CANSYN dataset has three subsets with different numbers of overlapping sound events, each with three cross-validation splits.

7) TUT Sound Events 2018 - Circular array, Reverberant and Synthetic Impulse Response (CRESYN) dataset: Similarly to the CANSYN dataset, we synthesize the circular-array version of the ambisonic RESYN room 1 dataset, referred to as CRESYN. During synthesis, the circular microphone array is placed at the center of the room, with the array plane parallel to the floor.

B. Baseline methods

The SELDnet is compared with six different baselines, as summarized in Table III: two SED baselines (single- and multichannel), three DOA baselines (parametric and DNN-based), and a SELD baseline.

1) SED baseline: The SED capabilities of the proposed SELDnet are compared with the existing state-of-the-art multichannel SED method [39], referred to here as MSEDnet. MSEDnet is easily scalable to any number of input audio channels, and won [38] the recently concluded real-life SED task in DCASE 2017 [64]. In particular, it took the top two positions among 34 submissions, first using the single-channel mode (referred to as SEDnet) and a close second using the multichannel mode. The SED performance of SELDnet is compared with both the single- and the multichannel modes of MSEDnet. In the original MSEDnet implementation [39], the input is a sequence of log mel-band energy frames (40 bands), which are mapped to an equal-length sequence of sound event activities. The SED metrics (Section III-C) for MSEDnet did not change much when using the phase and magnitude components of the STFT spectrogram instead of log mel-band energies. Hence, in order to have a one-to-one comparison with SELDnet, we use the phase and magnitude components of the STFT spectrogram for MSEDnet in this paper.
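As an illustration of this feature, a minimal per-channel phase-and-magnitude spectrogram extraction might look as follows. This is a sketch with an assumed 1024-sample frame, 50% overlap, and a Hann window; it is not the authors' exact feature pipeline.

```python
import numpy as np

def stft_features(audio, n_fft=1024, hop=512):
    """Compute per-channel magnitude and phase spectrogram features.

    audio: (n_samples, n_ch) multichannel time-domain signal
    returns: (n_frames, n_fft // 2, 2 * n_ch) array, with the magnitudes of
             all channels stacked first, followed by the phases.
    """
    n_samples, n_ch = audio.shape
    window = np.hanning(n_fft)
    n_frames = (n_samples - n_fft) // hop + 1
    specs = []
    for ch in range(n_ch):
        frames = np.stack([audio[i * hop:i * hop + n_fft, ch] * window
                           for i in range(n_frames)])
        # Keep the n_fft/2 positive-frequency bins, dropping the DC bin.
        specs.append(np.fft.rfft(frames, axis=1)[:, 1:n_fft // 2 + 1])
    spec = np.stack(specs, axis=2)                # (n_frames, n_fft/2, n_ch)
    return np.concatenate([np.abs(spec), np.angle(spec)], axis=2)

x = np.random.randn(48000, 4)                     # 1 s of 4-channel audio
f = stft_features(x)
print(f.shape)  # (92, 512, 8)
```

The same feature tensor serves as input for both SELDnet and the modified MSEDnet, which is what makes the one-to-one comparison possible.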
We train the MSEDnet for 500 epochs, with early stopping when the SED score (Section III-C) stops improving for 100 epochs.

2) DOA baseline: The DOA estimation performance of the SELDnet is evaluated with respect to three baselines. As a parametric baseline, we use MUSIC [46]; as DNN-based baselines, we use the recently proposed DOAnet [25], which estimates DOAs in 3D, and the method of [18], which estimates only the DOA azimuth angle, referred to as AZInet.

TABLE III
BASELINE AND PROPOSED METHOD SUMMARY

Task   Acronym        Notes            Datasets evaluated
SED    SEDnet [39]    Single channel   All
       MSEDnet [39]   Multichannel     All
DOA    MUSIC*         Azi and ele      All
       DOAnet [25]    Azi and ele      All except CANSYN and CRESYN
       AZInet [18]    Azi              CANSYN and CRESYN
SELD   HIRnet [20]    Azi              CANSYN and CRESYN
       SELDnet-azi    Azi              All
       SELDnet        Azi and ele      All

*Parametric; all other methods are DNN-based.

i) MUSIC: MUSIC is a versatile high-resolution subspace method that can detect multiple narrowband source DOAs and can be applied to generic array setups. It is based on a subspace decomposition of the spatial covariance matrix of the multichannel spectrogram. For a broadband estimation of DOAs, we combine narrowband spatial covariance matrices over three frames and over the frequency bins from 50 to 8000 Hz. The steering vector information required to produce the MUSIC pseudospectrum, from which the DOAs are extracted, is adapted to the recording system under use: uniform circular array steering vectors for the CANSYN and CRESYN datasets, and real SH vectors for all the other, ambisonic datasets. MUSIC requires a good estimate of the number of active sound sources in order to estimate their DOAs. In this paper, we run MUSIC with the number of active sources taken from the reference of the dataset. Hence, the DOA estimation results of MUSIC can be considered the best possible for the given dataset, and they serve as a benchmark for DOA estimation with and without knowledge of the number of active sources.
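To make the subspace decomposition concrete, a bare-bones narrowband MUSIC pseudospectrum for a generic array can be sketched as follows. This is an illustrative single-frequency-bin implementation with known steering vectors and a known source count; it is not the broadband 50–8000 Hz configuration used in the paper.

```python
import numpy as np

def music_spectrum(X, steering, n_src):
    """Narrowband MUSIC pseudospectrum.

    X:        (n_ch, n_frames) STFT snapshots at one frequency bin
    steering: (n_dirs, n_ch) candidate steering vectors a(theta)
    n_src:    assumed number of active sources
    returns:  (n_dirs,) pseudospectrum; peaks indicate source DOAs
    """
    R = X @ X.conj().T / X.shape[1]            # spatial covariance matrix
    eigval, eigvec = np.linalg.eigh(R)         # eigenvalues in ascending order
    En = eigvec[:, :X.shape[0] - n_src]        # noise subspace
    proj = steering.conj() @ En                # a(theta)^H E_n for each theta
    return 1.0 / np.sum(np.abs(proj) ** 2, axis=1)

# Toy example: 8-mic uniform circular array (radius 5 cm) at 2 kHz.
n_ch, radius, freq, c = 8, 0.05, 2000.0, 343.0
mic_az = np.arange(n_ch) * 2 * np.pi / n_ch
grid = np.deg2rad(np.arange(0, 360, 5))        # candidate azimuths
steering = np.exp(1j * 2 * np.pi * freq / c * radius
                  * np.cos(grid[:, None] - mic_az[None, :]))
true_az = np.deg2rad(120)
a_true = np.exp(1j * 2 * np.pi * freq / c * radius * np.cos(true_az - mic_az))
rng = np.random.default_rng(1)
s = rng.standard_normal(200) + 1j * rng.standard_normal(200)
X = a_true[:, None] * s[None, :] + 0.01 * (
    rng.standard_normal((n_ch, 200)) + 1j * rng.standard_normal((n_ch, 200)))
P = music_spectrum(X, steering, n_src=1)
print(np.rad2deg(grid[np.argmax(P)]))  # peak near the true azimuth, 120
```

Note how the source count `n_src` directly sets the noise-subspace dimension; this is why MUSIC needs the number of active sources, and why feeding it the reference count gives its best-case performance.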
For a detailed description of MUSIC and other subspace methods, the reader is referred to [65]; for the application of MUSIC to SH signals similar to this work, please refer to [66].

ii) DOAnet: Among the recently proposed DNN-based DOA estimation methods listed in Table I, the only method that attempts DOA estimation of multiple overlapping sources in 3D space is the DOAnet [25]. Hence, DOAnet serves as a suitable baseline against which to compare the DOA estimation performance of the proposed SELDnet. DOAnet is based on a similar CRNN architecture, the input to which is a sequence of multichannel phase and magnitude spectrum frames. It treats DOA estimation as a multi-label classification task, sampling directions at a resolution of 10° along azimuth and elevation and estimating the likelihood of a sound source being active at each of these points.

iii) AZInet: Among the DOA-only estimation methods listed in Table I, apart from the DOAnet [25], the methods of [18] and [4] are the only ones that attempt simultaneous DOA estimation of overlapping sources. Since [4] is evaluated on a dataset collected using microphones mounted on a humanoid robot, it is difficult to replicate the setup. Hence, in this paper we use the AZInet, evaluated on a linear array in [18], as the baseline. The AZInet is a CNN-based method that uses the phase component of the spectrogram of each channel as input, and maps it to azimuth angles in the range 0° to 180° at 5° resolution as a multi-label classification task. AZInet uses only the phase spectrogram, since the dataset it was evaluated on employs omnidirectional microphones, which, for compact arrays and sources in the far field, preserve directional information in the inter-channel phase differences. Thus, although the evaluation in [18] was carried out on a linear array, the method is generic to any omnidirectional array under these conditions.
Further, in order to have a direct comparison, we extend the output of AZInet to the full azimuth range with 10° resolution, and reduce the output of SELDnet to generate only the azimuth, i.e., we only estimate the x and y coordinates of the DOA (SELDnet-azi). To enable this full-azimuth estimation, we use the datasets recorded with circular arrays of omnidirectional microphones, CANSYN and CRESYN.

3) SELD baseline (HIRnet): The joint SED and DOA estimation performance of SELDnet is compared with the method proposed by Hirvonen [20], hereafter referred to as HIRnet. HIRnet was proposed for a circular array of omnidirectional microphones; hence, we compare its performance only on the CANSYN and CRESYN datasets. HIRnet is a CNN-based network that uses the log-spectral power of each channel as the input feature, and maps it to eight angles in the full azimuth range for each of two classes (speech and music) as a multi-label classification task. More details about HIRnet can be found in [20]. In order to have a direct comparison with SELDnet-azi, we extend HIRnet to estimate DOAs at 10° resolution for each of the sound event classes in our testing datasets.

C. Evaluation metrics

The proposed SELDnet is evaluated using individual metrics for SED and DOA estimation. For SED, we use the standard polyphonic SED metrics, error rate (ER) and F-score, calculated in non-overlapping segments of one second, as proposed in [67, 68]. The segment-wise results are obtained from the frame-level predictions of the classifier by considering a sound event to be active in the entire segment if it is active in any of the frames within the segment. Similarly, we obtain labels for one-second segments of the reference from its frame-wise annotation, and calculate the segment-wise ER and F-scores.
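The frame-to-segment collapsing described above can be sketched as follows. This is a minimal illustration; the `frames_per_segment` interface and the zero-padding of the last partial segment are assumptions for the sketch, not the exact reference implementation of [67, 68].

```python
import numpy as np

def frames_to_segments(activity, frames_per_segment):
    """Collapse frame-wise activity to segment-wise activity.

    A class is marked active in a segment if it is active in any frame
    of that segment.

    activity: (n_frames, n_classes) binary array
    returns:  (n_segments, n_classes) binary array
    """
    n_frames, n_classes = activity.shape
    n_segments = -(-n_frames // frames_per_segment)     # ceil division
    pad = n_segments * frames_per_segment - n_frames
    padded = np.pad(activity, ((0, pad), (0, 0)))       # zero-pad last segment
    return padded.reshape(n_segments, frames_per_segment, n_classes).max(axis=1)

# 4 frames per segment, one class: active only in frame 2,
# so the first segment is active and the second is not.
act = np.array([[0], [0], [1], [0], [0], [0], [0], [0]])
print(frames_to_segments(act, 4).ravel())  # [1 0]
```

The same collapsing is applied to both predictions and reference, so the segment-wise ER and F-score compare like with like.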
Mathematically, the F-score is calculated as

F = \frac{2 \sum_{k=1}^{K} TP(k)}{2 \sum_{k=1}^{K} TP(k) + \sum_{k=1}^{K} FP(k) + \sum_{k=1}^{K} FN(k)},  (1)

where the number of true positives TP(k) is the total number of sound event classes that were active in both the reference and the predictions for the k-th one-second segment. The number of false positives in a segment, FP(k), is the number of sound event classes that were active in the prediction but inactive in the reference. Similarly, FN(k) is the number of false negatives, i.e., the number of sound event classes inactive in the predictions but active in the reference. The ER metric is calculated as

ER = \frac{\sum_{k=1}^{K} S(k) + \sum_{k=1}^{K} D(k) + \sum_{k=1}^{K} I(k)}{\sum_{k=1}^{K} N(k)},  (2)

where, for each one-second segment k, N(k) is the total number of active sound event classes in the reference. The substitutions S(k) count the number of times an event was detected but given the wrong label; they are obtained by merging the false negatives and false positives, without individually pairing each false positive with a particular false negative. The remaining false positives and false negatives, if any, are counted as insertions I(k) and deletions D(k), respectively. These statistics are defined as

S(k) = \min(FN(k), FP(k)),  (3)
D(k) = \max(0, FN(k) - FP(k)),  (4)
I(k) = \max(0, FP(k) - FN(k)).  (5)

An SED method is jointly evaluated using the F-score and ER metrics; an ideal method has an F-score of one (reported as a percentage in the tables) and an ER of zero. More details regarding the F-score and ER metrics can be found in [67, 68]. The predicted DOA estimates (x_E, y_E, z_E) are evaluated with respect to the reference (x_G, y_G, z_G) used to synthesize the dataset, using the central angle σ ∈ [0, 180]°.
The σ is the angle formed by (x_E, y_E, z_E) and (x_G, y_G, z_G) at the origin, in degrees, and is given by

\sigma = 2 \arcsin\!\left( \frac{\sqrt{\Delta x^2 + \Delta y^2 + \Delta z^2}}{2} \right) \cdot \frac{180}{\pi},  (6)

where Δx = x_G − x_E, Δy = y_G − y_E, and Δz = z_G − z_E. The DOA error for the entire dataset is then calculated as

DOA\ error = \frac{1}{D} \sum_{d=1}^{D} \sigma\big( (x_G^d, y_G^d, z_G^d), (x_E^d, y_E^d, z_E^d) \big),  (7)

where D is the total number of DOA estimates across the entire dataset, and σ((x_G^d, y_G^d, z_G^d), (x_E^d, y_E^d, z_E^d)) is the angle between the d-th estimated and reference DOAs. Additionally, in order to account for time frames in which the numbers of estimated and reference DOAs are unequal, we report the frame recall, calculated as TP/(TP + FN) in percentage, where the true positives TP are the total number of time frames in which the number of predicted DOAs equals that of the reference, and the false negatives FN are the total number of frames in which they are unequal. A DOA estimation method is jointly evaluated using the DOA error and the frame recall; an ideal method has a frame recall of one (reported as a percentage in the tables) and a DOA error of zero. During the training of SELDnet, we perform early stopping based on the combined SELD score, calculated as

SELD\ score = (SED\ score + DOA\ score)/2,  (8)

where

SED\ score = (ER + (1 - F))/2,  (9)
DOA\ score = \big( DOA\ error/180 + (1 - frame\ recall) \big)/2,  (10)

and an ideal SELD method has an SELD score of zero. In the proposed method, the localization performance depends on the detection performance. This relation is represented by the frame recall metric of DOA estimation. As a consequence, the SELD score, which includes the frame recall metric in addition to the SED metrics, can be seen to weigh the SED performance more than the DOA performance.

D. Experiments

The SELDnet is evaluated along different dimensions in order to understand its potential and drawbacks.
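Before moving on, the localization metrics of Eqs. (6)–(10) above can be sketched as follows. This is a minimal numpy version; the function names and the scalar-input interface of `seld_score` are assumptions for illustration, not the authors' evaluation code.

```python
import numpy as np

def central_angle(doa_est, doa_ref):
    """Eq. (6): central angle in degrees between unit DOA vectors.

    doa_est, doa_ref: (..., 3) Cartesian unit vectors.
    """
    chord = np.linalg.norm(doa_ref - doa_est, axis=-1)   # chord length
    return 2.0 * np.arcsin(np.clip(chord / 2.0, 0.0, 1.0)) * 180.0 / np.pi

def seld_score(er, f, doa_error, frame_recall):
    """Eqs. (8)-(10): combined SELD score, where 0 is ideal.

    f and frame_recall are fractions in [0, 1]; doa_error is in degrees.
    """
    sed = (er + (1.0 - f)) / 2.0
    doa = (doa_error / 180.0 + (1.0 - frame_recall)) / 2.0
    return (sed + doa) / 2.0

# Opposite directions on the unit sphere are 180 degrees apart.
print(central_angle(np.array([1.0, 0, 0]), np.array([-1.0, 0, 0])))  # ~180
# A perfect system scores 0.
print(seld_score(er=0.0, f=1.0, doa_error=0.0, frame_recall=1.0))    # 0.0
```

The chord-to-arc conversion in Eq. (6) works precisely because both DOA vectors lie on the unit sphere; the clip guards against floating-point values marginally above 1.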
The experiments carried out with the different datasets in this regard are explained below.

1) SELDnet architecture and model parameter tuning: A wide variety of architectures with different combinations of CNN, RNN and FC layers are explored on the ANSYN O2 subset with frame length M = 1024 (23.2 ms). Additionally, for each architecture, we tune the model parameters, such as the number of CNN, RNN and FC layers (0 to 4) and nodes (in the set [16, 32, 64, 128, 256, 512]). The input sequence length is tuned in the set [32, 64, 128, 256, 512], the DOA and SED branch output loss weights in the set [1, 5, 50, 500], the regularization (dropout in the set [0, 0.1, 0.2, 0.3, 0.4, 0.5]; L1 and L2 in the set [0, 10^-1, 10^-2, 10^-3, 10^-4, 10^-5, 10^-6, 10^-7]), and the CNN max-pooling in the set [2, 4, 6, 8, 16] for each layer. The best set of parameters is the one that gives the lowest SELD score on the three cross-validation splits of the dataset. After finding the best network architecture and configuration, we tune the input audio feature parameter M by varying it in the set [512, 1024, 2048]. Simultaneously, the sequence length is changed with respect to M such that the input audio length is kept constant (1.49 s, obtained from the first round of tuning). We perform fine-tuning of the model parameters for the different M and sequence length values; this time, only the numbers of CNN, RNN and FC nodes are tuned, in a small range (neighboring values in the set [16, 32, 64, 128, 256, 512]), to identify the optimum parameters. Similar fine-tuning is repeated for the other datasets.

2) Selecting the SELDnet output format: The output format for polyphonic SED in the literature has become standardized to estimating the temporal activity of each sound class using frame-wise binary numbers [31–34].
On the other hand, the output formats for DOA estimation are still being experimented with, as seen in Table I. Among the DOA estimation methods using the regression mode, there are two possible output formats: predicting azimuth and elevation, or predicting the x, y, z coordinates of the DOA on the unit sphere. In order to identify the better of these two output formats, we evaluate the SELDnet with both. During this evaluation, only the output weight parameter of the model is fine-tuned, in the set [1, 5, 50, 500]. Additionally, for a regression-based model, the default output, i.e., the DOA target when the event is not active, must be chosen carefully. In this study, we chose the default DOA output to be 180° in azimuth and 60° in elevation (the datasets do not contain sound events at these DOA values), and x = 0, y = 0, z = 0 for the default Cartesian output. The chosen default Cartesian coordinates are equidistant from all the possible DOA values. On the other hand, there are no such equidistant azimuth and elevation values; hence, we chose the default values (180°, 60°) to be in a similar range as the true DOA values.

3) Continuous DOA estimation and performance on unseen DOA values: Theoretically, the advantage of a regression-based DOA estimator over a classification-based one is that the network is not limited to a fixed set of DOA angles, but can operate as a high-resolution, continuous DOA estimator. To study this, we train the SELDnet on ANSYN subsets whose sound events are placed on an angular grid of 10° resolution along azimuth and elevation, and test the model on a dataset in which the angular grid is shifted by 5° along azimuth and elevation while the temporal locations are kept unchanged. This shift makes the DOA values of the testing split unseen, and correctly predicting these DOAs will prove that the regression model can estimate DOAs in a continuous space.
Additionally, it also proves the robustness of the SELDnet in predicting unseen DOA values.

4) Performance on mismatched reverberant datasets: Parametric DOA estimation methods are known to be sensitive to reverberation [48]. In this regard, we first evaluate the performance of SELDnet on the simulated (RESYN) and real-life (REAL, REALBIG, and REALBIGAMB) reverberant datasets, and further compare the results with the parametric baseline MUSIC. DNN-based methods are known to fail when the training and testing splits come from different domains. For example, the performance of a DNN trained on an anechoic dataset would be poor on a reverberant testing dataset. This performance can only be improved by training the DNN on a reverberant dataset similar to the testing dataset. On the other hand, it is impractical to train such a DNN for every existing room dimension, surface material distribution, and the reverberation times associated with them. In this regard, it would be ideal if the proposed method were robust to moderate mismatches in reverberant conditions, so that a single model could be used for a range of comparable room configurations. Motivated by this, we study the sensitivity of SELDnet to moderately mismatched reverberant data. Specifically, we train the SELDnet on the RESYN room 1 dataset and test it on the RESYN room 2 and 3 datasets, which are mismatched in terms of volume and reverberation times, as described in Section III-A2.

5) Performance with respect to the size of the dataset: We study the performance of SELDnet on two datasets, REAL and REALBIG, that are similar in content but different in size.

6) Performance with ambiance at different SNRs: The performance of SELDnet with respect to different SNRs (0, 10 and 20 dB) of the sound events is studied on the REALBIGAMB dataset.

7) Generic to array structure: SELDnet is a generic method that learns to localize and recognize sound events from any array structure.
This additionally implies that the SELDnet will continue to work in the desired manner if the configuration of the array, such as the individual microphone responses, the microphone spacing and the number of microphones, remains the same between the training and testing sets. If the array configuration changes between training and testing, then the SELDnet has to be retrained for the new array configuration. In order to prove that the SELDnet is applicable to any array configuration and not just dependent on the Ambisonics format, SELDnet is evaluated on a circular array. In comparison to the Ambisonic format, the chosen circular array has a different number of microphones, each placed on a single plane, and with an omnidirectional response. Further, we compare the SELDnet performance with the dataset-compatible baselines, such as SEDnet, MSEDnet, HIRnet, and AZInet. Since the HIRnet and AZInet baseline methods are proposed for estimating azimuth only, we compare their results with the SELDnet-azi version. Additionally, we also report the performance of SELDnet with DOA estimation along the x, y, z axes on the CANSYN and CRESYN datasets.

Fig. 3. SELD score for the ANSYN O2 dataset for different CNN, RNN and CRNN architecture configurations.
Fig. 4. SELD score for the ANSYN datasets for different combinations of FFT length and input sequence length in frames.
Fig. 5. SELD score for the ANSYN datasets with respect to different weights for the DOA output.
Fig. 6. SELD score for the ANSYN datasets with respect to the DOA output formats.
In general, for all our experiments, the only difference between the training and testing splits is the mutually exclusive set of sound examples. Apart from experiment III-D3, the training and testing splits contain the same set of spatial locations, i.e., azimuth and elevation angles at 10° resolution, amounting to 468 spatial locations (36 azimuth angles × 13 elevation angles). However, the distance of the sound event at each of these 468 spatial locations is an added variable. For example, in the anechoic case, a sound event can be placed anywhere between 1 and 10 m at 0.5 m resolution. This variable amounts to 8892 spatial positions (468 locations × 19 distances) that are coarsely grouped into the 468 locations. This complexity is stretched further in experiment III-D3, where both the sound event examples and their spatial locations in the testing split differ from those of the training split.

IV. RESULTS AND DISCUSSION

1) SELDnet architecture and model parameter tuning: The SELD scores obtained with hyperparameter tuning of the different CNN, RNN, and CRNN configurations, as explained in Section III-D1, are visualized with respect to the number of model parameters in Figure 3. CNN in this figure refers to a SELDnet architecture with no RNN layers, only CNN and FC layers. Similarly, RNN refers to SELDnet without CNN layers, while CRNN refers to SELDnet with CNN, RNN and FC layers. This experiment was carried out on the ANSYN O2 dataset. The CRNN architecture was seen to perform the best, followed by the RNN architecture. After hyperparameter tuning of the CRNN architecture, the optimum model across the ANSYN subsets was found to have three CNN layers with 64 nodes each (P in Figure 1a), followed by two layers of GRUs with 128 nodes each (Q in Figure 1a), and one FC layer with 128 nodes (R in Figure 1a). The max-pooling over frequency after each of the three CNN layers (MPi in Figure 1a) was (8, 8, 2).
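As a rough sanity check of this configuration, the frequency-axis dimensionality can be traced through the three max-pooling steps. The sketch below assumes the best-performing frame length M = 512 (i.e., 256 positive-frequency bins) and a sequence length of 256 frames; the layer bookkeeping is illustrative, not the released implementation.

```python
# Trace the feature-map shape through the tuned SELDnet CNN front end:
# three CNN layers with 64 filters each and frequency max-pooling (8, 8, 2).
seq_len, freq_bins = 256, 256        # frames x positive-frequency bins (M=512)
filters, pooling = 64, (8, 8, 2)

shape = (seq_len, freq_bins, 2 * 4)  # magnitude + phase of 4 channels
for i, mp in enumerate(pooling, 1):
    # 'same'-padded convolutions keep the time/frequency size;
    # max-pooling divides only the frequency axis.
    shape = (shape[0], shape[1] // mp, filters)
    print(f"after CNN layer {i}: {shape}")

# The remaining bins are flattened per frame before the recurrent layers.
gru_input = shape[1] * shape[2]
print(gru_input)  # 128 features per frame, matching the 128-node GRU layers
```

Note that pooling only along frequency preserves the full time resolution, which is what lets the network output SED activity and DOA estimates for every input frame.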
This configuration has about 513,000 parameters. Further, the SELDnet was seen to perform best with no regularization (no dropout, L1 or L2 regularization of the weights). A frame length of M = 512 and a sequence length of 256 frames gave the best results across the ANSYN subsets (Figure 4). Furthermore, on tuning the sequence length with the frame length fixed (M = 512), the best scores were obtained using 512 frames (2.97 s). Longer sequences could not be studied due to hardware restrictions. For the output weights, weighting the DOA output 50 times more than the SED output was seen to give the best results across subsets (Figure 5). On fine-tuning the SELDnet parameters obtained with the ANSYN dataset for the RESYN subsets, the only change that improved the performance was using a sequence length of 256 instead of 512, leaving the total number of network parameters unchanged at 513,000. A similar configuration gave the best results for the CANSYN and CRESYN datasets. Model parameters identical to those for the ANSYN dataset were observed to perform best on the REAL subsets. The same parameters were also used for the studies of the REALBIG and REALBIGAMB subsets.

2) Selecting the SELDnet output format: In the output format study, using the Cartesian x, y, z format in place of azimuth/elevation angles was observed to genuinely help the network learn better across datasets, as seen in Figure 6. This suggests that the discontinuity at the angle wrap-around boundary reduces the performance of DOA estimation, and hence the SELD score.

TABLE IV
SED AND DOA ESTIMATION METRICS FOR THE ANSYN AND RESYN DATASETS. THE RESULTS FOR THE RESYN ROOM 2 AND 3 TESTING SPLITS WERE OBTAINED FROM CLASSIFIERS TRAINED ON THE RESYN ROOM 1 TRAINING SET. BEST SCORES FOR SUBSETS IN BOLD.
                                  ANSYN              RESYN Room 1       RESYN Room 2       RESYN Room 3
Overlap                           1     2     3      1     2     3      1     2     3      1     2     3
SED metrics
  SELDnet        ER               0.04  0.16  0.19   0.10  0.29  0.32   0.11  0.33  0.35   0.13  0.32  0.34
                 F                97.7  89.0  85.6   92.5  79.6  76.5   91.6  79.5  75.8   89.8  79.1  75.5
  MSEDnet [39]   ER               0.10  0.13  0.17   0.17  0.28  0.29   0.19  0.30  0.26   0.18  0.29  0.30
                 F                94.4  90.1  87.2   89.1  79.1  75.6   88.3  78.2  74.2   86.5  80.5  76.1
  SEDnet [39]    ER               0.14  0.16  0.18   0.18  0.28  0.30   0.19  0.32  0.28   0.21  0.32  0.33
                 F                91.9  89.1  86.7   88.2  76.9  74.1   87.6  76.4  73.2   85.1  78.2  75.6
DOA metrics
  SELDnet        DOA error        3.4   13.8  17.3   9.2   20.2  26.0   11.5  26.0  33.1   12.1  25.4  31.9
                 Frame recall     99.4  85.6  70.2   95.8  74.9  56.4   96.2  78.9  61.2   95.9  78.2  60.7
  DOAnet [25]    DOA error        0.6   8.0   18.3   6.3   11.5  38.4   3.4   6.9   -      4.6   10.9  -
                 Frame recall     95.4  42.7  1.8    59.3  15.8  1.2    46.2  14.3  -      49.7  14.1  -
  MUSIC          DOA error        4.1   7.2   15.8   40.2  47.1  50.5   45.7  58.1  74.0   48.3  60.6  75.6

3) Continuous DOA estimation and performance on unseen DOA values: The input and outputs of SELDnet trained on the ANSYN O1 and O2 subsets, for a respective 1000-frame test sequence, are visualized in Figure 7. The figure represents each sound event class and its associated DOA outputs with a unique color. In the case of ANSYN O1, we see that the network predictions of SED and DOA are almost perfect. In the case of unseen DOA values (× markers), the network predictions continue to be accurate. This shows that the regression-mode output format helps the network learn continuous DOA values, and further that the network is robust to unseen DOA values. In the case of ANSYN O2, the SED predictions are accurate, while the DOA estimates, in general, are seen to vary around the respective mean reference values. In this work, the DOA and SED labels for a single sound event instance are considered constant for its entire duration, even though the instance has inherent magnitude variations and silences within it.
From Figure 7b, it appears that these variations and silences lead to fluctuating DOA estimates, while the SED predictions are unaffected. In general, we see that the proposed method successfully recognizes, localizes in time and space, and tracks multiple overlapping sound events simultaneously.

Table IV presents the evaluation metric scores for the SELDnet and the baseline methods on the ANSYN and RESYN datasets. In the SED metrics for the ANSYN datasets, the SELDnet performed better than the best baseline, MSEDnet, on the O1 subset, while MSEDnet performed slightly better on the O2 and O3 subsets. With regard to the DOA metrics, the SELDnet is significantly better than the baseline DOAnet in terms of frame recall. This improvement in frame recall is a direct result of using the SED output as a confidence measure for estimating DOAs, thereby extending state-of-the-art SED performance to SELD. Although the frame recall of DOAnet is poor, its DOA error for the O1 and O2 subsets is observed to be lower than that of SELDnet. The DOA error of the parametric baseline MUSIC, given knowledge of the number of sources, is seen to be the best among the evaluated methods for the O2 and O3 subsets.

Fig. 7. SELDnet input and outputs visualized for the ANSYN O1 (a) and O2 (b) datasets.
The horizontal axes of all sub-plots for a given dataset represent the same time frames; the vertical axis of the spectrogram sub-plot represents the frequency bins; the vertical axes of the SED reference and prediction sub-plots represent the unique sound event class identifiers; and for the DOA reference and prediction sub-plots, they represent the distance from the origin along the respective axes. The bold lines visualize both the reference labels and the predictions of DOA and SED for the ANSYN O1 and O2 datasets, while the × markers in Figure 7a visualize the results for the testing split with unseen DOA values (shifted by 5° along azimuth and elevation).

TABLE V
SED AND DOA ESTIMATION METRICS FOR THE REAL, REALBIG AND REALBIGAMB DATASETS. BEST SCORES FOR SUBSETS IN BOLD.

                                REAL               REALBIG            REALBIGAMB 20 dB   REALBIGAMB 10 dB   REALBIGAMB 0 dB
Overlap                         1     2     3      1     2     3      1     2     3      1     2     3      1     2     3
SED metrics
  SELDnet       ER              0.40  0.49  0.53   0.37  0.42  0.50   0.34  0.46  0.52   0.37  0.49  0.52   0.46  0.58  0.59
                F               60.3  53.1  51.1   65.4  61.5  56.5   65.6  58.5  55.0   66.3  55.4  53.3   57.9  48.6  49.0
  MSEDnet [39]  ER              0.35  0.38  0.41   0.34  0.39  0.38   0.35  0.40  0.41   0.38  0.43  0.42   0.48  0.56  0.54
                F               66.2  61.6  59.5   67.3  61.8  61.9   66.0  61.6  60.1   63.2  58.7  59.3   54.5  49.3  51.3
  SEDnet [39]   ER              0.38  0.42  0.43   0.38  0.43  0.44   0.39  0.42  0.43   0.41  0.44  0.46   0.51  0.61  0.57
                F               64.6  61.5  57.2   68.0  62.4  62.4   65.7  60.1  59.2   62.7  56.3  56.9   52.6  46.0  50.4
DOA metrics
  SELDnet       DOA error       26.6  33.7  36.1   23.1  31.3  34.9   25.4  32.5  36.1   27.2  32.5  36.1   30.7  33.7  36.7
                Frame recall    64.9  41.5  24.6   68.0  45.2  28.3   69.1  42.8  25.8   66.9  40.0  27.3   62.5  35.2  23.4
  DOAnet [25]   DOA error       6.3   20.1  25.8   7.5   17.8  22.9   6.3   18.9  25.78  8.0   20.1  24.1   14.3  24.1  27.5
                Frame recall    46.5  11.5  2.9    44.1  12.5  3.1    34.7  11.6  3.2    42.1  13.5  3.3    30.1  10.5  2.8
  MUSIC         DOA error       36.3  49.5  54.3   35.8  49.6  53.8   54.5  56.1  61.3   51.6  54.5  62.6   41.9  47.5  62.3

4) Performance on mismatched reverberant datasets: From the Table IV results on the RESYN room 1
subsets, we see that the performance of the parametric method MUSIC is poor in comparison to SELDnet in reverberant conditions. The SELDnet is seen to perform significantly better than the baseline DOAnet in terms of frame recall, although the DOAnet has lower DOA error for the O1 and O2 subsets. The SED metrics of SELDnet are comparable to, if not better than, the best baseline performance of MSEDnet. Further, on training the SELDnet on the room 1 dataset and testing on the moderately mismatched reverberant room 2 and 3 datasets, the SED and DOA metric trends remain similar to the results of the room 1 testing split. That is, the SELDnet has higher frame recall, the DOAnet has better DOA error, MUSIC performs poorly, and the SED metrics of SELDnet are comparable to MSEDnet. These results prove that the SELDnet is robust to reverberation in comparison to the baseline methods and performs seamlessly on moderately mismatched room configurations.
[Figure 8: confusion matrices for (a) ANSYN O1, (b) RESYN O1, (c) ANSYN O2, (d) RESYN O2, (e) ANSYN O3 and (f) RESYN O3.]
Fig. 8. Confusion matrix for the number of sound event classes estimated to be active per frame by the SELDnet for the ANSYN and RESYN datasets. The horizontal axis represents the SELDnet estimate, and the vertical axis represents the reference.
Figure 8 visualizes the confusion matrices for the number of sound event classes estimated per frame by SELDnet. For example, in Figure 8c the SELDnet correctly estimated the number of sources to be two in 76% (true positive percentage) of the frames which had two sources in the reference.
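The per-frame source-count comparison summarized by Figure 8, and the frame recall metric derived from it, can be sketched as follows (a minimal illustration; the function and variable names are ours, not the authors' code):

```python
import numpy as np

def count_confusion(ref_counts, est_counts, max_sources=3):
    """Confusion matrix of per-frame source counts
    (rows: reference count, columns: estimated count)."""
    cm = np.zeros((max_sources + 1, max_sources + 1), dtype=int)
    for r, e in zip(ref_counts, est_counts):
        cm[min(r, max_sources), min(e, max_sources)] += 1
    return cm

def frame_recall(ref_counts, est_counts):
    """Fraction of frames in which the estimated number of DOAs equals
    the reference number, i.e. the diagonal of the confusion matrix
    divided by the total number of frames."""
    ref = np.asarray(ref_counts)
    est = np.asarray(est_counts)
    return float(np.mean(ref == est))

# toy example over six frames
ref = [1, 1, 2, 2, 3, 0]
est = [1, 2, 2, 1, 3, 0]
recall = frame_recall(ref, est)  # 4 of 6 frames agree on the source count
cm = count_confusion(ref, est)
```

In this sense, frame recall compresses the whole confusion matrix into its normalized trace, which is why a method can have low DOA error yet poor frame recall.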
In context, the frame recall value used as a metric to evaluate DOA estimation represents this confusion matrix in one number. From the confusion matrices, we observe that the percentage of true positives drops with a higher number of sources, and this drop is even more significant in the reverberant scenario. However, in comparison to the frame recall metric of the baseline DOAnet in Table IV, the SELDnet frame recall is significantly better for higher numbers of overlapping sound events, especially in reverberant conditions.
5) Performance on the size of the dataset: The overall performance of SELDnet on the REAL dataset (Table V) reduced in comparison to the ANSYN and RESYN datasets. The baseline MSEDnet is seen to perform better than SELDnet in terms of SED metrics. A similar performance drop on real-life datasets has also been reported in other SED studies [37]. With regard to DOA metrics, the frame recall of SELDnet continues to be significantly better than that of DOAnet, while the DOA error of DOAnet is lower than that of SELDnet. The performance of MUSIC is seen to be poor in comparison to both DOAnet and SELDnet. With the larger REALBIG dataset the SELDnet performance was seen to improve. A similar study was done with larger ANSYN and RESYN datasets, where the results were comparable with those of the smaller datasets. This shows that datasets with real-life IRs are more complicated than synthetic IR datasets, and that having larger real-life datasets helps the network learn better.
6) Performance with ambiance at different SNRs: In the presence of ambiance, SELDnet was seen to be robust for the 10 and 20 dB SNR REALBIGAMB datasets (Table V). In comparison to the SED metrics of the REALBIG dataset with no ambiance, the SELDnet performance on the O1 subsets of 10 dB and 20 dB

TABLE VI
SED AND DOA ESTIMATION METRICS FOR CANSYN AND CRESYN DATASETS. BEST SCORES FOR SUBSETS IN BOLD.
                             CANSYN             CRESYN
  Overlap                    1     2     3      1     2     3
SED metrics
  SELDnet       ER           0.11  0.18  0.19   0.13  0.22  0.30
                F score      93.0  86.6  85.3   90.4  82.2  78.0
  SELDnet-azi   ER           0.08  0.19  0.24   0.06  0.18  0.20
                F score      94.7  87.5  83.8   96.3  87.9  85.6
  MSEDnet [39]  ER           0.09  0.18  0.16   0.12  0.22  0.26
                F score      94.6  89.0  86.7   92.7  83.7  80.7
  SEDnet [39]   ER           0.15  0.21  0.20   0.18  0.26  0.25
                F score      91.4  87.3  84.7   90.5  84.3  82.8
  HIRnet [20]   ER           0.41  0.45  0.62   0.43  0.46  0.50
                F score      60.0  54.9  58.8   59.3  60.2  58.6
DOA metrics
  SELDnet       DOA error    29.5  31.3  34.3   28.4  33.7  41.0
                Frame recall 97.9  78.8  67.0   96.4  75.7  60.7
  SELDnet-azi   DOA error     7.5  14.4  19.6    5.2  13.2  18.4
                Frame recall 98.0  82.1  66.2   98.5  82.3  70.6
  HIRnet [20]   DOA error     5.2  16.3  33.0    7.4  18.6  43.3
                Frame recall 60.2  35.9  18.4   56.9  20.5  10.7
  AZInet [18]   DOA error     1.2   4.0   7.4    2.3   6.9   9.7
                Frame recall 99.4  80.5  60.5   97.3  65.2  44.8

ambiance is comparable, while a small drop in performance was observed for the respective O2 and O3 subsets, whereas the performance was seen to drop considerably for the 0 dB SNR dataset. With respect to DOA error, the SELDnet performed better than MUSIC but poorer than DOAnet across datasets; on the other hand, SELDnet gave significantly higher frame recall than DOAnet. From the insight of SELDnet performance on the REAL dataset (Section IV-5), the more complex the acoustic scene, the larger the dataset size required to learn well. Considering that the SELDnet is jointly estimating the DOA along with SED in a challenging acoustic scene with ambiance, the SELDnet performance can potentially improve with larger datasets.
7) Generic to array structure: The results on the circular array datasets are presented in Table VI. With respect to SED metrics, the SELDnet-azi performance is seen to be better than the best baseline MSEDnet for all subsets of the CRESYN dataset, while MSEDnet is seen to perform better for the O2 and O3 subsets of the CANSYN dataset.
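The segment-based ER and F-score reported in Tables V and VI follow the standard polyphonic SED definitions of Mesaros et al. [67]. A minimal sketch of how such scores are computed (our own illustrative code, not the evaluation tooling used in the paper):

```python
def sed_metrics(ref, pred):
    """Segment-based error rate (ER) and F-score for polyphonic SED,
    in the spirit of the metrics of Mesaros et al. [67].
    ref, pred: one set of active class labels per segment."""
    tp = fp = fn = subs = dels = ins = n_ref = 0
    for r, p in zip(ref, pred):
        tp_k, fp_k, fn_k = len(r & p), len(p - r), len(r - p)
        tp, fp, fn = tp + tp_k, fp + fp_k, fn + fn_k
        subs += min(fn_k, fp_k)        # substitutions: a miss paired with a false alarm
        dels += max(0, fn_k - fp_k)    # deletions: unpaired misses
        ins += max(0, fp_k - fn_k)     # insertions: unpaired false alarms
        n_ref += len(r)
    er = (subs + dels + ins) / max(n_ref, 1)
    f = 2 * tp / max(2 * tp + fp + fn, 1)
    return er, f

# toy example over three segments
ref = [{"speech"}, {"speech", "car"}, set()]
pred = [{"speech"}, {"car"}, {"dog"}]
er, f = sed_metrics(ref, pred)
```

Note that ER can exceed 1.0 when insertions dominate, which is why the tables report both ER (lower is better) and F (higher is better).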
Similarly, in the case of DOA metrics, the SELDnet-azi has better frame recall than the best baseline method AZInet across datasets (except for CANSYN O1), whereas AZInet has lower DOA error than SELDnet-azi. Between SELDnet and SELDnet-azi, even though the frame recall is of the same order, the DOA errors of SELDnet-azi are lower than those of SELDnet. This shows that estimating DOA in 3D (x, y, z) is challenging using a circular array. Overall, the SELDnet is shown to perform consistently across different array structures (Ambisonic and circular array), with good results in comparison to the baselines.
The usage of the SED output as a confidence measure for estimating the number of DOAs in the proposed SELDnet is shown to improve the frame recall significantly and consistently across the evaluated datasets. On the other hand, the DOA error obtained with SELDnet is consistently higher than that of the classification-based baseline DOA estimation methods [18, 25]. We believe that this might be a result of the regression-based DOA estimation approach in SELDnet not having completely learned the full mapping between the input feature and the continuous DOA space; the investigation of this is planned for future work. In general, a classification-only or a classification-regression based SELD approach can be chosen based on the required frame recall, DOA error, resolution of DOA labels, training split size, and robustness to unseen DOA values and reverberation.

V. CONCLUSION

In this paper, we proposed a convolutional recurrent neural network (SELDnet) to simultaneously recognize, localize and track sound events with respect to time. The localization is done by estimating the direction of arrival (DOA) on a unit sphere around the microphone using 3D Cartesian coordinates. We tie each sound event output class in the SELDnet to three regressors to estimate the respective Cartesian coordinates.
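The unit-sphere Cartesian DOA representation and the angular DOA error used in the evaluation can be illustrated as follows (a sketch with our own naming; the azimuth/elevation convention shown is one common choice and may differ from the authors' exact implementation):

```python
import numpy as np

def sph_to_cart(azi_deg, ele_deg):
    """Map azimuth/elevation (degrees) to a point on the unit sphere,
    i.e. the 3D Cartesian (x, y, z) DOA representation."""
    azi, ele = np.radians(azi_deg), np.radians(ele_deg)
    return np.array([np.cos(ele) * np.cos(azi),
                     np.cos(ele) * np.sin(azi),
                     np.sin(ele)])

def doa_error_deg(u, v):
    """Angle in degrees between two DOA vectors; both are normalized
    first, since regressed coordinates need not lie exactly on the sphere."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return np.degrees(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))

# a reference DOA and an estimate 5 degrees off in azimuth
ref_doa = sph_to_cart(30.0, 10.0)
est_doa = sph_to_cart(35.0, 10.0)
error = doa_error_deg(ref_doa, est_doa)  # slightly under 5 deg at this elevation
```

Because a fixed azimuth offset corresponds to a smaller great-circle distance at higher elevations, evaluating the error on the unit vectors, rather than on raw azimuth/elevation differences, is the natural metric for this representation.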
We show that using regression helps in estimating DOA in a continuous space, and also in estimating unseen DOA values accurately. On the other hand, estimating a single DOA for each sound event class does not allow recognizing multiple overlapping instances of the same class. We plan to tackle this problem in our future work. The usage of the SED output as a confidence measure to estimate DOA was seen to extend the state-of-the-art SED performance to SELD, resulting in a higher recall of DOAs. With respect to the estimated DOA error, although the classification-based baseline methods had poor recall, they had lower DOA error in comparison to the proposed regression-based DOA estimation. The proposed SELDnet uses the phase and magnitude spectrogram as the input feature. The usage of such a non-method-specific feature makes the method generic and easily extendable to different array structures. We prove this by evaluating on datasets of Ambisonic and circular array formats. The proposed SELDnet is shown to be robust to reverberation, low SNR scenarios and unseen rooms of comparable sizes. Finally, the overall performance on the datasets synthesized using real-life impulse responses (IRs) was seen to drop in comparison to the artificial IR datasets, suggesting the need for larger real-life training datasets and more powerful classifiers in the future.

REFERENCES

[1] R. Takeda and K. Komatani, "Sound source localization based on deep neural networks with directional activate function exploiting phase information," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.
[2] ——, "Discriminative multiple sound source localization based on deep neural networks using independent location model," in IEEE Spoken Language Technology Workshop (SLT), 2016.
[3] N. Yalta, K. Nakadai, and T. Ogata, "Sound source localization using deep learning models," in Journal of Robotics and Mechatronics, vol. 29, no. 1, 2017.
[4] W.
He, P. Motlicek, and J.-M. Odobez, "Deep neural networks for multiple speaker detection and localization," in International Conference on Robotics and Automation (ICRA), 2018.
[5] M. Crocco, M. Cristani, A. Trucco, and V. Murino, "Audio surveillance: A systematic review," in ACM Computing Surveys (CSUR), 2016.
[6] C. Grobler, C. Kruger, B. Silva, and G. Hancke, "Sound based localization and identification in industrial environments," in IEEE Industrial Electronics Society (IECON), 2017.
[7] P. W. Wessels, J. V. Sande, and F. V. der Eerden, "Detection and localization of impulsive sound events for environmental noise assessment," in The Journal of the Acoustical Society of America, vol. 141, no. 5, 2017.
[8] P. Foggia, N. Petkov, A. Saggese, N. Strisciuglio, and M. Vento, "Audio surveillance of roads: A system for detecting anomalous sounds," in IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 1, 2015.
[9] C. Busso, S. Hernanz, C.-W. Chu, S.-i. Kwon, S. Lee, P. G. Georgiou, I. Cohen, and S. Narayanan, "Smart room: participant and speaker localization and identification," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2005.
[10] H. Wang and P. Chu, "Voice source localization for automatic camera pointing system in videoconferencing," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1997.
[11] M. Wölfel and J. McDonough, Distant Speech Recognition. John Wiley & Sons, 2009.
[12] P. Swietojanski, A. Ghoshal, and S. Renals, "Convolutional neural networks for distant speech recognition," in IEEE Signal Processing Letters, vol. 21, 2014.
[13] T. Butko, F. G. Pla, C. Segura, C. Nadeu, and J. Hernando, "Two-source acoustic event detection and localization: Online implementation in a smart-room," in European Signal Processing Conference (EUSIPCO), 2011.
[14] S. Chu, S. Narayanan, and C. J.
Kuo, "Environmental sound recognition with time-frequency audio features," in IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 6, 2009.
[15] T. A. Marques et al., "Estimating animal population density using passive acoustics," in Biological Reviews of the Cambridge Philosophical Society, vol. 88, no. 2, 2012.
[16] B. J. Furnas and R. L. Callas, "Using automated recorders and occupancy models to monitor common forest birds across a large geographic region," in Journal of Wildlife Management, vol. 79, no. 2, 2014.
[17] S. Chakrabarty and E. A. P. Habets, "Broadband DOA estimation using convolutional neural networks trained with noise signals," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017.
[18] ——, "Multi-speaker localization using convolutional neural network trained with noise," in Neural Information Processing Systems (NIPS), 2017.
[19] X. Xiao, S. Zhao, X. Zhong, D. L. Jones, E. S. Chng, and H. Li, "A learning-based approach to direction of arrival estimation in noisy and reverberant environments," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
[20] T. Hirvonen, "Classification of spatial audio location and content using convolutional neural networks," in Audio Engineering Society Convention 138, 2015.
[21] M. Yiwere and E. J. Rhee, "Distance estimation and localization of sound sources in reverberant conditions using deep neural networks," in International Journal of Applied Engineering Research, vol. 12, no. 22, 2017.
[22] E. L. Ferguson, S. B. Williams, and C. T. Jin, "Sound source localization in a multipath environment using convolutional neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
[23] F. Vesperini, P. Vecchiotti, E. Principi, S. Squartini, and F.
Piazza, "A neural network based algorithm for speaker localization in a multi-room environment," in IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2016.
[24] Y. Sun, J. Chen, C. Yuen, and S. Rahardja, "Indoor sound source localization with probabilistic neural network," in IEEE Transactions on Industrial Electronics, vol. 29, no. 1, 2017.
[25] S. Adavanne, A. Politis, and T. Virtanen, "Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network," in European Signal Processing Conference (EUSIPCO), 2018.
[26] R. Roden, N. Moritz, S. Gerlach, S. Weinzierl, and S. Goetze, "On sound source localization of speech signals using deep neural networks," in Deutsche Jahrestagung für Akustik (DAGA), 2015.
[27] A. Mesaros, T. Heittola, A. Eronen, and T. Virtanen, "Acoustic event detection in real-life recordings," in European Signal Processing Conference (EUSIPCO), 2010.
[28] E. Çakır, T. Heittola, H. Huttunen, and T. Virtanen, "Polyphonic sound event detection using multi-label deep neural networks," in IEEE International Joint Conference on Neural Networks (IJCNN), 2015.
[29] G. Parascandolo, H. Huttunen, and T. Virtanen, "Recurrent neural networks for polyphonic sound event detection in real life recordings," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.
[30] S. Adavanne, G. Parascandolo, P. Pertilä, T. Heittola, and T. Virtanen, "Sound event detection in multichannel audio using spatial and harmonic features," in Detection and Classification of Acoustic Scenes and Events (DCASE), 2016.
[31] T. Hayashi, S. Watanabe, T. Toda, T. Hori, J. L. Roux, and K. Takeda, "Duration-controlled LSTM for polyphonic sound event detection," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 11, 2017.
[32] M. Zöhrer and F.
Pernkopf, "Virtual adversarial training and data augmentation for acoustic event detection with gated recurrent neural networks," in INTERSPEECH, 2017.
[33] H. Zhang, I. McLoughlin, and Y. Song, "Robust sound event recognition using convolutional neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
[34] H. Phan, L. Hertel, M. Maass, and A. Mertins, "Robust audio event recognition with 1-max pooling convolutional neural networks," in INTERSPEECH, 2016.
[35] S. Adavanne, A. Politis, and T. Virtanen, "Multichannel sound event detection using 3D convolutional neural networks for learning inter-channel features," in IEEE International Joint Conference on Neural Networks (IJCNN), 2018.
[36] H. Lim, J. Park, K. Lee, and Y. Han, "Rare sound event detection using 1D convolutional recurrent neural networks," in Detection and Classification of Acoustic Scenes and Events (DCASE), 2017.
[37] E. Çakır, G. Parascandolo, T. Heittola, H. Huttunen, and T. Virtanen, "Convolutional recurrent neural networks for polyphonic sound event detection," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 6, 2017.
[38] S. Adavanne and T. Virtanen, "A report on sound event detection with different binaural features," in Detection and Classification of Acoustic Scenes and Events (DCASE), 2017.
[39] S. Adavanne, P. Pertilä, and T. Virtanen, "Sound event detection using spatial features and convolutional recurrent neural network," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
[40] I.-Y. Jeong, S. Lee, Y. Han, and K. Lee, "Audio event detection using multiple-input convolutional neural network," in Detection and Classification of Acoustic Scenes and Events (DCASE), 2017.
[41] J.
Zhou, "Sound event detection in multichannel audio LSTM network," in Detection and Classification of Acoustic Scenes and Events (DCASE), 2017.
[42] R. Lu and Z. Duan, "Bidirectional GRU for sound event detection," in Detection and Classification of Acoustic Scenes and Events (DCASE), 2017.
[43] A. Temko, C. Nadeu, and J.-I. Biel, "Acoustic event detection: SVM-based system and evaluation setup in CLEAR'07," in Multimodal Technologies for Perception of Humans. Springer, 2008.
[44] Y. Huang, J. Benesty, G. Elko, and R. Mersereau, "Real-time passive source localization: a practical linear-correction least-squares approach," in IEEE Transactions on Speech and Audio Processing, vol. 9, no. 8, 2001.
[45] M. S. Brandstein and H. F. Silverman, "A high-accuracy, low-latency technique for talker localization in reverberant environments using microphone arrays," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1997.
[46] R. O. Schmidt, "Multiple emitter location and signal parameter estimation," in IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, 1986.
[47] R. Roy and T. Kailath, "ESPRIT-estimation of signal parameters via rotational invariance techniques," in IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 7, 1989.
[48] J. H. DiBiase, H. F. Silverman, and M. S. Brandstein, "Robust localization in reverberant rooms," in Microphone Arrays. Springer, 2001.
[49] H. Teutsch, Modal Array Signal Processing: Principles and Applications of Acoustic Wavefield Decomposition. Springer, 2007, vol. 348.
[50] J.-M. Valin, F. Michaud, and J. Rouat, "Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering," Robotics and Autonomous Systems, vol. 55, no. 3, pp. 216–228, 2007.
[51] J. Traa and P.
Smaragdis, "Multiple speaker tracking with the factorial von Mises-Fisher filter," in IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2014.
[52] R. Chakraborty and C. Nadeu, "Sound-model-based acoustic source localization using distributed microphone arrays," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
[53] K. Lopatka, J. Kotus, and A. Czyzewski, "Detection, classification and localization of acoustic events in the presence of background noise for acoustic surveillance of hazardous situations," Multimedia Tools and Applications Journal, vol. 75, no. 17, 2016.
[54] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning (ICML), 2015.
[55] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations (ICLR), 2015.
[56] F. Chollet, "Keras v2.0.8," 2015, accessed on 7 May 2018. [Online]. Available: https://github.com/fchollet/keras
[57] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, accessed on 7 May 2018. [Online]. Available: https://www.tensorflow.org/
[58] E. Benetos, M. Lagrange, and G. Lafay, "Sound event detection in synthetic audio," 2016, accessed on 7 May 2018. [Online]. Available: https://archive.org/details/dcase2016_task2_train_dev
[59] J. B. Allen and D. A.
Berkley, "Image method for efficiently simulating small-room acoustics," in The Journal of the Acoustical Society of America, vol. 65, no. 4, 1979.
[60] G. Enzner, "3D-continuous-azimuth acquisition of head-related impulse responses using multi-channel adaptive filtering," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2009.
[61] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, "The third CHiME speech separation and recognition challenge: Dataset, task and baselines," in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2015.
[62] J. Salamon, C. Jacoby, and J. P. Bello, "A dataset and taxonomy for urban sound research," in ACM International Conference on Multimedia (ACM-MM), 2014.
[63] A. Politis, "Microphone array processing for parametric spatial audio techniques," Ph.D. thesis, Aalto University, 2016.
[64] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, "DCASE 2017 challenge setup: tasks, datasets and baseline system," in Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE), 2017.
[65] B. Ottersten, M. Viberg, P. Stoica, and A. Nehorai, "Exact and large sample maximum likelihood techniques for parameter estimation and detection in array processing," in Radar Array Processing. Springer Series in Information Sciences, 1993.
[66] D. Khaykin and B. Rafaely, "Acoustic analysis by spherical microphone array processing of room impulse responses," The Journal of the Acoustical Society of America, vol. 132, no. 1, 2012.
[67] A. Mesaros, T. Heittola, and T. Virtanen, "Metrics for polyphonic sound event detection," in Applied Sciences, vol. 6, no. 6, 2016.
[68] A. Mesaros, T. Heittola, and D. Ellis, "Datasets and evaluation," in Computational Analysis of Sound Scenes and Events, T. Virtanen, M. Plumbley, and D. Ellis, Eds. Springer International Publishing, 2018, ch. 6.
Sharath Adavanne received his M.Sc. degree in Information Technology from Tampere University of Technology (TUT), Finland, in 2011. From 2011 to 2016 he worked in industry solving problems related to music information retrieval, speech recognition, audio fingerprinting and general audio content analysis. Since 2016, he has been pursuing his Ph.D. degree at the Laboratory of Signal Processing at TUT. His current research interest is in the application of machine learning based methods for real-life auditory scene analysis.

Archontis Politis obtained an M.Eng. degree in civil engineering from Aristotle University, Thessaloniki, Greece, and his M.Sc. degree in sound and vibration studies from the Institute of Sound and Vibration Research (ISVR), Southampton, UK, in 2006 and 2008 respectively. From 2008 to 2010 he worked as a graduate acoustic consultant at Arup Acoustics, UK, and as a researcher in a joint collaboration between Arup Acoustics and the Glasgow School of Art, on architectural auralization using spatial sound techniques. In 2016 he obtained a Doctor of Science degree on the topic of parametric spatial sound recording and reproduction from Aalto University, Finland. He has also completed an internship at the Audio and Acoustics Research Group of Microsoft Research during the summer of 2015. He is currently a post-doctoral researcher at Aalto University. His research interests include spatial audio technologies, virtual acoustics, array signal processing and acoustic scene analysis.

Joonas Nikunen received the M.Sc. degree in signal processing and communications engineering and the Ph.D. degree in signal processing from Tampere University of Technology (TUT), Finland, in 2010 and 2015, respectively. He is currently a post-doctoral researcher at TUT focusing on sound source separation with applications in spatial audio analysis, modification and synthesis.
His other research interests include microphone array signal processing, 3D/360 audio in general, and machine and deep learning for source separation.

Tuomas Virtanen is a Professor at the Laboratory of Signal Processing, Tampere University of Technology (TUT), Finland, where he leads the Audio Research Group. He received the M.Sc. and Doctor of Science degrees in information technology from TUT in 2001 and 2006, respectively. He has also worked as a research associate at the Cambridge University Engineering Department, UK. He is known for his pioneering work on single-channel sound source separation using non-negative matrix factorization based techniques, and their application to noise-robust speech recognition and music content analysis. Recently he has made significant contributions to sound event detection in everyday environments. In addition to the above topics, his research interests include content analysis of audio signals in general and machine learning. He has authored more than 150 scientific publications on the above topics, which have been cited more than 6000 times. He received the IEEE Signal Processing Society 2012 Best Paper Award for his article "Monaural Sound Source Separation by Nonnegative Matrix Factorization with Temporal Continuity and Sparseness Criteria", as well as three other best paper awards. He is an IEEE Senior Member, a member of the Audio and Acoustic Signal Processing Technical Committee of the IEEE Signal Processing Society, an Associate Editor of IEEE/ACM Transactions on Audio, Speech, and Language Processing, and a recipient of an ERC 2014 Starting Grant.
