Deep Ad-hoc Beamforming

Xiao-Lei Zhang 1,2

1 Research & Development Institute of Northwestern Polytechnical University in Shenzhen, Shenzhen, China
2 Center for Intelligent Acoustics and Immersive Communications, School of Marine Science and Technology, Northwestern Polytechnical University, China

Email address: xiaolei.zhang@nwpu.edu.cn (Xiao-Lei Zhang)

Abstract

Far-field speech processing is an important and challenging problem. In this paper, we propose deep ad-hoc beamforming, a deep-learning-based multichannel speech enhancement framework based on ad-hoc microphone arrays, to address the problem. It contains three novel components. First, it combines ad-hoc microphone arrays with deep-learning-based multichannel speech enhancement, which significantly reduces the probability of the occurrence of far-field acoustic environments. Second, it groups the microphones around the speech source into a local microphone array by a supervised channel selection framework based on deep neural networks. Third, it develops a simple time synchronization framework to synchronize the channels that have different time delays. Besides the above novelties and advantages, the proposed model is trained in a single-channel fashion, so that it can easily adopt new developments of speech processing techniques. Its test stage is also flexible in incorporating any number of microphones without retraining or modifying the framework. We have developed many implementations of the proposed framework and conducted extensive experiments in scenarios where the locations of the speech sources are far-field, random, and blind to the microphones. Results on speech enhancement tasks show that our method outperforms its counterpart that works with linear microphone arrays by a considerable margin in both diffuse noise reverberant environments and point source noise reverberant environments. We have also tested the framework with different handcrafted features. Results show that, although designing good features leads to high performance, the choice of feature does not affect the conclusion on the effectiveness of the proposed framework.

Keywords: Adaptive beamforming, ad-hoc microphone array, channel selection, deep learning, distributed microphone array

1. Introduction

Deep-learning-based speech enhancement has demonstrated strong denoising ability in adverse acoustic environments (Wang & Chen, 2018) and has attracted much attention since its first appearance (Wang & Wang, 2013). Although many positive results have been observed, existing deep-learning-based speech enhancement and its applications have been studied mostly with a single microphone or a conventional microphone array, such as a linear array in a portable device. Its performance drops when the distance between the speech source and the microphone (array) increases. Consequently, how to maintain the enhanced speech at the same high quality throughout a physical space of interest becomes a new problem.

Ad-hoc microphone arrays provide a potential solution to the above problem. As illustrated in Fig. 1, an ad-hoc microphone array is a set of randomly distributed microphones that collaborate with each other. Compared with conventional microphone arrays, an ad-hoc microphone array has the following two advantages. First, it has a chance to enhance a speaker's voice with equally good quality within the range that the array covers.
Second, its performance is not limited by the physical size of application devices, e.g. cellphones, gooseneck microphones, or smart speaker boxes. Ad-hoc microphone arrays also have a chance to become widespread in real-world environments, such as meeting rooms, smart homes, and smart cities.

Figure 1: Illustration of an ad-hoc microphone array.

The research on ad-hoc microphone arrays is an emerging direction (Markovich-Golan et al., 2012; Heusdens et al., 2012; Zeng & Hendriks, 2014; Wang et al., 2015; O'Connor et al., 2016; O'Connor & Kleijn, 2014; Tavakoli et al., 2016; Jayaprakasam et al., 2017; Tavakoli et al., 2017; Zhang et al., 2018; Wang & Cavallaro, 2018; Koutrouvelis et al., 2018). It involves at least the following three fundamental problems:

• Channel selection. Because the microphones may be distributed over a large area, taking all microphones into consideration may not be the best choice, since the microphones that are far away from the speech source may be too noisy. Channel selection aims to group a handful of microphones around a speaker into a local microphone array out of a large number of randomly distributed microphones.

• Device synchronization. Because the microphones are distributed at different positions, and may also belong to different devices, their output signals may have different time delays, clock rates, or adaptive gain controllers. Device synchronization aims to synchronize the signals from the microphones, so as to facilitate the subsequent applications.

• Task-driven multichannel signal processing. It aims to maximize the performance of a specific task by, e.g., adapting a multichannel signal processing algorithm designed for a conventional array to an ad-hoc array. The tasks include speech enhancement, multi-talker speech separation, speech recognition, speaker recognition, etc.

However, current research on ad-hoc microphone arrays is still at an early stage. For example, some work discussed a so-called message passing problem between microphones, where it is assumed that the ground-truth noise spectrum is known (Heusdens et al., 2012) or that the steering vectors from the speech sources to the microphones are known (Zeng & Hendriks, 2014). Some work focused on the channel selection problem in an ideal scenario where perfect noise estimation and voice activity detection are available (Zhang et al., 2018). Although some work tried to jointly conduct noise estimation and channel selection with advanced mathematical formulations, it has to make many assumptions, such as the ground-truth distances between microphones, the ground-truth geometry of the array, and a free-space signal transmission model without reverberation (O'Connor et al., 2016).

A possible explanation for the above difficulty is that an ad-hoc microphone array lacks so much important prior knowledge, and contains so many interferences, that in the extreme case we have little information about the array beyond the received signals. To overcome the difficulty, we may consider learning the priors, parameters, or hidden variables of the array instead of making unrealistic assumptions. Supervised deep learning, which learns prior knowledge and parameters by neural networks, provides this opportunity, as it did for supervised speech separation with conventional microphone arrays (Wang & Chen, 2018).
In this paper, we propose a framework named deep ad-hoc beamforming (DAB), which brings deep learning to ad-hoc microphone arrays. It has the following four novelties:

• A supervised channel selection framework is proposed. It first predicts the quality of the received speech signal of each channel by a deep neural network. Then, it groups the microphones that have high speech quality and strong cross-channel signal correlation into a local microphone array. Several channel selection algorithms have been developed, including a one-best channel selection method and several N-best channel selection methods, where the positive integer N ≥ 1 is either predefined or automatically determined according to different channel selection criteria.

• A simple supervised time synchronization framework is proposed. It first picks the output of the best channel as a reference signal, then estimates the relative time delays of the other channels by a traditional time delay estimator, and finally synchronizes the channels according to the estimation result, where the best channel is selected in a supervised manner.

• A speech enhancement algorithm is implemented as an example. The algorithm applies the channel selection and time synchronization frameworks to deep beamforming. It is designed to demonstrate the overall effectiveness and flexibility of DAB. Its implementation is straightforward and requires no large modification of existing deep beamforming algorithms.

• The effects of acoustic features on performance are studied. It is known that the performance of deep-learning-based speech enhancement relies heavily on acoustic features. In this paper, we further emphasize their importance to DAB by carrying out the first study on the effects of different handcrafted features on the performance. In this study, a variant of the short-time Fourier transform (STFT) and the common multi-resolution cochleagram (MRCG) features are used for comparison.

We have conducted an extensive experimental comparison between DAB and its deep-learning-based multichannel speech enhancement counterpart with linear microphone arrays, in scenarios where the speech sources and microphone arrays were placed randomly in typical physical spaces with random time delays, and where the noise sources were either diffuse noise or point source noise. Experimental results with noise-independent training show that DAB outperforms its counterpart by a large margin. The experimental conclusion is consistent across different hyperparameter settings and handcrafted features, though a good handcrafted feature does improve the overall performance.

The core idea of the proposed method has been released in (Zhang, 2018). The main difference between the method here and (Zhang, 2018) is that we have added the time synchronization module, more channel selection algorithms, and experiments in the point source noise environment. These are incremental extensions that do not affect the fundamental claim of the novelty of the paper, compared to some related methods published after (Zhang, 2018).

This paper is organized as follows. Section 2 presents the mathematical notations of this paper. Section 3 presents the signal model of ad-hoc microphone arrays. Section 4 presents the framework of the proposed DAB. Section 5 presents the channel selection module of DAB. Section 6 presents the application of DAB to speech enhancement. Section 7 evaluates the effectiveness of the proposed method.
Finally, Section 8 concludes our findings.

1.1. Related work

Current deep-learning-based techniques employ either a single microphone or a conventional microphone array to pick up speech signals. Here, a conventional microphone array means that the microphone array is fixed in a single device. Deep-learning-based single-channel speech enhancement, e.g. (Wang & Wang, 2013; Zhang & Wu, 2013b; Lu et al., 2013; Wang et al., 2014; Huang et al., 2015; Xu et al., 2015; Weninger et al., 2015; Williamson et al., 2016; Zhang & Wang, 2016b), employs a deep neural network (DNN), which is a multilayer perceptron with more than one nonlinear hidden layer, to learn a nonlinear mapping function from noisy speech to clean speech or to its ideal time-frequency masks. This field progresses rapidly. We list some of the recent progress as follows. The phase spectrum, which was originally believed to be unhelpful for speech enhancement, has been shown to be helpful in the deep learning methodology (Zheng & Zhang, 2018; Tan & Wang, 2019). Some end-to-end speech enhancement methods have been proposed, including the representative gated residual networks (Tan et al., 2018), the fully-convolutional time-domain audio separation network (Luo & Mesgarani, 2019), and fully convolutional neural networks (Pandey & Wang, 2019). The long-standing difficulty caused by the nonlinear distortion of the enhanced speech for speech recognition has been overcome as well (Wang et al., 2019). Solid theoretical analyses on the generalization ability of deep-learning-based speech enhancement have been made (Qi et al., 2019).

Deep-learning-based multichannel speech enhancement has two major forms. The first form (Jiang et al., 2014) uses a microphone array as a feature extractor to extract spatial features as the input of DNN-based single-channel enhancement. The second form (Heymann et al., 2016; Erdogan et al., 2016), which we denote as deep beamforming, estimates a monaural time-frequency (T-F) mask (Wang et al., 2014; Heymann et al., 2016; Higuchi et al., 2016) using a single-channel DNN, so that the spatial covariance matrices of speech and noise can be derived for adaptive beamforming, e.g. minimum variance distortionless response (MVDR) or generalized eigenvalue beamforming. It is fundamentally a linear method, whose output does not suffer from nonlinear distortions. Due to its success in speech recognition, it has been extensively studied, including the aspects of integration with spatial-clustering-based masking (Nakatani et al., 2017), acoustic features (Wang & Wang, 2018), model training (Xiao et al., 2017; Tu et al., 2017; Higuchi et al., 2018; Zhou & Qian, 2018), mask estimation (Erdogan et al., 2016), post-processing (Zhang et al., 2017), rank-1 estimation of steering vectors (Taherian et al., 2019), etc.

The effectiveness of deep-learning-based speech enhancement relies strongly on acoustic features. The earliest studies take the concatenation of multiple acoustic features, such as STFT and Mel frequency cepstral coefficients (MFCC), as the input (Wang & Wang, 2013; Zhang & Wu, 2013a), for the sake of mining the complementary information between the features.
Later on, Chen et al. (2014) found that the cochleagram feature based on gammatone filterbanks is a strongly noise-robust acoustic feature, after a wide comparison between 17 acoustic features covering gammatone-domain, autocorrelation-domain, and modulation-domain features, as well as linear prediction features, MFCC variants, pitch-based features, etc., in various adverse acoustic environments. Because the STFT has a perfect inverse transform, the log spectral magnitude has become popular (Xu et al., 2015). Recently, Delfarah & Wang (2017) performed another feature study in room reverberant situations, where the log spectral magnitude and log mel-spectrum features were further added to the comparison. The conclusions in (Chen et al., 2014) and (Delfarah & Wang, 2017) are consistent. Although learnable features are becoming a new trend (Tan et al., 2018; Luo & Mesgarani, 2019; Pandey & Wang, 2019), very recent research results demonstrate that handcrafted acoustic features are still competitive with learnable filters, e.g. (Ditter & Gerkmann, 2020; Pariente et al., 2020). For multichannel speech enhancement, the interaural time difference, interaural level difference (Jiang et al., 2014), interaural phase difference, and their variants (Yang & Zhang, 2019) are widely used spatial features. See (Wang & Chen, 2018, Section 4) for an excellent summary of the acoustic features.

2. Notations

We first introduce some notations here. Regular lower-case letters, e.g. s, f, and γ, indicate scalars. Bold lower-case letters, e.g. y and α, indicate vectors. Bold capital letters, e.g. P and Φ, indicate matrices. Letters in calligraphic fonts, e.g. X, indicate sets. 0 (1) is a vector with all entries being 0 (1). The operator (·)^T denotes the transpose, and (·)^H denotes the conjugate transpose of complex numbers.

3. Signal model of ad-hoc microphone arrays

Ad-hoc microphone arrays can significantly reduce the probability of the occurrence of far-field environments. We take the case described in Fig. 2 as an example. When a speaker and a microphone array are distributed randomly in a room, the distribution of the distance between the speaker and an ad-hoc microphone array has a smaller variance than that between the speaker and a conventional microphone array (Figs. 2a and 2b). For example, the conventional array has a probability of 24% of being placed over 10 meters away from the speech source, while the corresponding probability for the ad-hoc array is only 7%. Particularly, the distance between the best microphone in the ad-hoc array and the speech source is only 1.9 meters on average, and the probability that this distance is larger than 5 meters is only 2% (Fig. 2c).

Figure 2: Monte Carlo simulation of the distance distribution between a speech source and a microphone array in comparison. The physical spaces for this simulation contain a square room, a rectangular room, and a circular room (see sFig. 1 in the supplementary materials for the details of the three rooms).
The farthest distance between the speech source and the microphone array in any of the rooms is set to 20 meters. Each microphone array in comparison consists of 16 microphones. (a) Probability density function (PDF) of the distance distribution of a conventional microphone array. The mean and standard deviation of this distribution are 7.28 and 3.71 meters respectively. (b) PDF of the distance distribution of an ad-hoc microphone array, where the distance is defined as the average distance between the speaker and each microphone in the ad-hoc array. The mean and standard deviation of this distribution are 7.28 and 1.68 meters respectively. (c) PDF of the distribution of the distance between the speech source and the best microphone in the ad-hoc microphone array, where "best microphone" denotes the microphone closest to the speech source. The mean and standard deviation of this distribution are 1.92 and 1.21 meters respectively. (d) Cumulative distribution functions (CDF) of the distance distributions in Figs. 2a, 2b, and 2c.

Here we build the signal model of an ad-hoc microphone array. All speech enhancement methods throughout the paper operate in the frequency domain on a frame-by-frame basis. Suppose that a physical space contains one target speaker, multiple noise sources, and an ad-hoc microphone array of M microphones. The physical model for the signals arriving at the ad-hoc array is assumed to be

\mathbf{v}(t,f) = \mathbf{c}(f)\,s(t,f) + \mathbf{h}(t,f) + \mathbf{n}(t,f)    (1)

where s(t, f) is the short-time Fourier transform (STFT) value of the target clean speech at time t and frequency f, and c(f) is the time-invariant acoustic transfer function from the speech source to the array, which is an M-dimensional complex vector:

\mathbf{c}(f) = [c_1(f), c_2(f), \ldots, c_M(f)]^T.    (2)

c(f)s(t, f) and h(t, f) are the direct sound and the early and late reverberation of the target signal respectively, and n(t, f) is the additive noise:

\mathbf{n}(t,f) = [n_1(t,f), n_2(t,f), \ldots, n_M(t,f)]^T    (3)

\mathbf{v}(t,f) = [v_1(t,f), v_2(t,f), \ldots, v_M(t,f)]^T    (4)

whose entries are the STFT values of the signals received by the m-th microphone at time t and frequency f. Usually, we denote x(t, f) = c(f)s(t, f).

After being processed by the devices {D_m(·)}_{m=1}^M on which the microphones are fixed, the signals that DAB finally receives are

z_m(t,f) = D_m(v_m(t,f)), \quad \forall m = 1, \ldots, M    (5)

with z(t, f) = [z_1(t, f), ..., z_M(t, f)]^T. Real-world devices {D_m(·)}_{m=1}^M may cause many problems, including unsynchronized time delays, clock rates, adaptive gain controllers, etc. Here we consider the time unsynchronization problem:

z_m(t,f) = v_m(t+\tau_m, f) = x_m(t+\tau_m, f) + h_m(t+\tau_m, f) + n_m(t+\tau_m, f)    (6)

where τ_m is the time delay caused by the m-th device.

Figure 3: Diagram of deep ad-hoc beamforming. The channel-selection framework is described in the red dashed box.

4. Deep ad-hoc beamforming: A system overview

A system overview of DAB is shown in Fig. 3. It contains three core components: a supervised channel selection framework, a supervised time synchronization framework, and a speech enhancement module.

The core idea of the channel selection framework is to filter the received signals z(t, f) by a channel-selection vector p = [p_1, ..., p_M]^T:

\mathbf{z}_p(t,f) = \mathbf{p} \circ \mathbf{z}(t,f)    (7)

such that the channels that output low-quality speech signals can be suppressed or even discarded, where p is the output mask of the channel-selection method described in the red box of Fig. 3, and ∘ denotes the element-wise product operator. Without loss of generality, we assume that the selected channels are z_1(t, f), ..., z_N(t, f).
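For illustration, the channel-selection filtering of Eq. (7) reduces to a single broadcasted multiplication. The following minimal NumPy sketch shows this; the array layout and function name are assumptions for this example, not the authors' implementation.

```python
import numpy as np

def apply_channel_selection(Z, p):
    """Eq. (7): z_p(t, f) = p o z(t, f) applied to every time-frequency bin.

    Z: complex STFTs of the received signals, shape (M, T, F)
       (M channels, T frames, F frequency bins).
    p: channel-selection vector of length M; a zero entry discards a channel,
       an entry in (0, 1] suppresses or keeps it.
    """
    p = np.asarray(p, dtype=float)
    return Z * p[:, None, None]   # broadcast the mask over time and frequency
```

Channels whose mask entry is zero can afterwards simply be dropped before beamforming.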
The time synchronization module first selects the noisy signal from the best channel, assumed to be z_k(t, f), as a reference signal by a supervised 1-best channel selection algorithm that will be described in Section 5.2. Then, it estimates the relative time delay of the noisy signals from the selected microphones over the reference signal by a time delay estimator:

\hat{\tau}_n = h\big(z_n(t,f) \mid z_k(t,f)\big), \quad \forall n = 1, \ldots, N    (8)

where h(z_n(t, f) | z_k(t, f)) is the time delay estimator with z_k(t, f) as the reference signal, and τ̂_n is the estimated relative time delay of z_n(t, f) over z_k(t, f). Finally, it synchronizes the microphones according to the estimated time delays:

y_n(t,f) = z_n(t - \hat{\tau}_n, f), \quad \forall n = 1, \ldots, N.    (9)

Note that τ̂_n consists of the relative time delay caused by both the device and the signal transmission through the air. Because developing a new accurate time delay estimator is not the focus of this paper, we simply use the classic generalized cross-correlation phase transform (Knapp & Carter, 1976; Carter, 1987) as the estimator, though many other time delay estimators can be adopted as well (Chen et al., 2006). For example, the deep-neural-network-based time delay estimators (Wang et al., 2018), which were originally proposed for the estimation of the signal direction of arrival, can be adopted here too.

The speech enhancement module takes y(t, f) = [y_1(t, f), ..., y_N(t, f)]^T as its input. Many deep-learning-based speech enhancement methods can be used directly, or with slight modification, as the speech enhancement module. Here we take MVDR-based deep beamforming as an example. Because the deep beamforming is trained in a single-channel fashion, as are the other parts of DAB, the overall DAB is flexible in incorporating any number of microphones in the test stage without retraining or modifying the DAB model. This is an important requirement of real-world applications that should also be considered in other DAB implementations. Note that, if N = 1, then DAB outputs the noisy speech of the selected single channel directly, without resorting to deep beamforming anymore.

In the following two sections, we will present the supervised channel selection framework and the speech enhancement module respectively.
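As a concrete illustration of the synchronization step in Eqs. (8)-(9), below is a minimal NumPy sketch of a GCC-PHAT time-delay estimator followed by alignment of the selected channels. It operates on time-domain waveforms; the function names, the maximum-delay bound, and the circular-shift alignment are simplifying assumptions, not the authors' exact implementation.

```python
import numpy as np

def gcc_phat(ref, sig, fs, max_delay=0.6):
    """Estimate the delay (in samples) of `sig` relative to `ref` with GCC-PHAT."""
    n = len(ref) + len(sig)
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n)       # phase-transform weighting
    max_shift = int(max_delay * fs)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return int(np.argmax(np.abs(cc))) - max_shift          # tau_hat of Eq. (8)

def synchronize(channels, ref_idx, fs):
    """Align every selected channel to the reference channel, cf. Eq. (9)."""
    ref = channels[ref_idx]
    aligned = []
    for z in channels:
        tau = gcc_phat(ref, z, fs)
        aligned.append(np.roll(z, -tau))                    # crude circular alignment
    return aligned
```

In practice the circular shift would be replaced by proper padding or trimming; the sketch only shows where Eqs. (8) and (9) sit in the pipeline.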
5. Supervised channel selection

The channel-selection algorithm is applied to each channel independently. It contains two steps, described in the following two subsections respectively.

5.1. Channel-reweighting model

Suppose there is a test utterance of U frames, and suppose the received speech signal at the i-th channel is {z̃_i(t)}_{t=1}^U:

\tilde{\mathbf{z}}_i(t) = [|z|_i(t,1), \ldots, |z|_i(t,F)]^T    (10)

where |z|_i(t, f) is the amplitude spectrogram of z(t, f) at the i-th channel. The channel-reweighting model estimates the channel weight q_i of the i-th channel by

q_i = g\big(\bar{\tilde{\mathbf{z}}}_i\big)    (11)

where g(·) is a DNN-based channel-reweighting model, and \bar{\tilde{\mathbf{z}}}_i is the average-pooling result of {z̃_i(t)}_{t=1}^U:

\bar{\tilde{\mathbf{z}}}_i = \frac{1}{U}\sum_{t=1}^{U}\tilde{\mathbf{z}}_i(t).    (12)

To train g(·), we need to first define a training target and then extract noise-robust handcrafted features, which are described as follows.

5.1.1. Training targets

This paper uses a variant of SNR as the target:

\frac{\sum_t |x_{\rm time}(t)|}{\sum_t |x_{\rm time}(t)| + \sum_t |n_{\rm time}(t)|}    (13)

where {x_time(t)}_t and {n_time(t)}_t are the direct sound and additive noise of the received noisy speech signal in the time domain. Many other measurements may be used as training targets as well, such as the equivalent form of (13) in the time-frequency domain

\frac{\sum_{t=1}^{U}\sum_{f=1}^{F}|x(t,f)|}{\sum_{t=1}^{U}\sum_{f=1}^{F}|x(t,f)| + \sum_{t=1}^{U}\sum_{f=1}^{F}|h(t,f)+n(t,f)|},    (14)

performance evaluation metrics including the signal-to-distortion ratio (SDR) and short-time objective intelligibility (STOI) (Taal et al., 2011), application-driven metrics including the equal error rate (EER) (Bai et al., 2020a), the partial area under the ROC curve (Bai et al., 2020c,b), and the word error rate, as well as other device-specific metrics such as the battery life of a cell phone. For example, if a cell phone is about to run out of power, then DAB should prevent the cell phone from being an activated channel.

5.1.2. Handcrafted features

Because the STFT feature {z̃_i(t)}_{t=1}^U may not be noise-robust enough, an important issue for the effectiveness of (11) is the acoustic feature. Here we introduce two handcrafted features: the enhanced STFT (eSTFT) and the multi-resolution cochleagram (MRCG) feature.

1) eSTFT: As shown in the red dashed box of Fig. 3, we first use a DNN-based single-channel speech enhancement method, denoted as DNN1, to generate an estimated ideal ratio mask (IRM) of the direct sound of z̃_i(t), denoted as {x̂_i(t)}_{t=1}^U:

\hat{\mathbf{x}}_i(t) = [\widehat{\rm IRM}_i(t,1), \ldots, \widehat{\rm IRM}_i(t,F)]^T    (15)

where \widehat{\rm IRM}_i(t,f) is the estimate of the IRM at the i-th channel. The IRM is the training target of DNN1:

{\rm IRM}(t,f) = \frac{|x(t,f)|}{|x(t,f)| + |h(t,f)+n(t,f)|}    (16)

where |x(t, f)|, |h(t, f)|, and |n(t, f)| are the amplitude spectrograms of the direct and early reverberant speech, the late reverberant speech, and the noise components of the single-channel noisy speech respectively. Then, we denote the concatenation of the estimated IRM x̂_i(t) and the noisy feature z̃_i(t) as the eSTFT feature, which is used to replace z̃_i(t) in (12). To distinguish the channel selection model g(·) from DNN1, we denote g(·) as DNN2.

As presented above, both DNN1 and DNN2 are trained on single-channel data only, instead of multichannel data from ad-hoc microphone arrays, which is an important merit for the practical use of DAB. In practice, the training data of DNN1 and DNN2 need to be independent so as to prevent overfitting.

2) MRCG: An alternative to the STFT is the MRCG, which has been shown to be a noise-robust acoustic feature for speech separation (Chen et al., 2014). (Code is downloadable from http://web.cse.ohio-state.edu/pnl/software.html.) The key idea of the MRCG is to incorporate both global and local information of speech through multi-resolution extraction. The global information is produced by extracting cochleagram features with a large frame length or a large smoothing window (i.e., low resolutions). The local information is produced by extracting cochleagram features with a small frame length and a small smoothing window (i.e., high resolutions). It has been shown that cochleagram features with a low resolution, such as a frame length of 200 ms, can detect patterns of noisy speech better than those with only a high resolution, and that features with high resolutions complement those with low resolutions. Therefore, concatenating them together is better than using them separately. In this paper, we adopt the implementation in (Zhang & Wang, 2016a).

Because g(·) does not need to recover the time-domain signal, many handcrafted acoustic features can be used beyond the above two examples to further improve the estimation accuracy. Some candidate acoustic features are listed in (Chen et al., 2014; Guido, 2018b). Besides, a large family of wavelet transforms (Mallat, 1989) has not been deeply studied yet. Here we list some possible candidates for DAB: the wavelet-packet transform (Sepúlveda et al., 2013), the discrete shapelet transform (Guido, 2018a), the fractal-wavelet transform (Guariglia & Silvestrov, 2016; Guariglia, 2019), and the adaptive multiscale wavelet transform (Zheng et al., 2019).
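To tie the pieces of Section 5.1 together, the following sketch shows the SNR-variant training target of Eq. (13), the IRM of Eq. (16) that DNN1 estimates for the eSTFT feature, and the utterance-level average pooling of Eq. (12) that feeds DNN2. The helper names and the callable `dnn2` are hypothetical placeholders introduced only for illustration.

```python
import numpy as np

def snr_variant_target(x_time, n_time):
    """Eq. (13): sum|x| / (sum|x| + sum|n|) over the time-domain direct sound
    x_time and additive noise n_time of one channel."""
    num = np.sum(np.abs(x_time))
    return num / (num + np.sum(np.abs(n_time)) + 1e-12)

def ideal_ratio_mask(x, h, n):
    """Eq. (16): |x| / (|x| + |h + n|), from complex STFTs of the direct sound x,
    the reverberation h, and the noise n of one channel."""
    return np.abs(x) / (np.abs(x) + np.abs(h + n) + 1e-12)

def channel_weight(noisy_mag, irm_est, dnn2):
    """Eqs. (10)-(12) with the eSTFT feature: concatenate the noisy magnitude
    spectrogram (T, F) with the estimated IRM (T, F), average-pool over frames,
    and let DNN2 predict the channel weight q_i in (0, 1)."""
    estft = np.concatenate([noisy_mag, irm_est], axis=1)   # (T, 2F) eSTFT frames
    pooled = estft.mean(axis=0)                            # Eq. (12), utterance level
    return float(dnn2(pooled))                             # Eq. (11)
```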
5.2. Channel-selection algorithms

Given the estimated weights q = [q_1, ..., q_M]^T of the test utterance, many advanced sparse learning methods are able to project q to p, i.e. p = δ(q), where δ(·) is a channel-selection function that enforces sparse constraints on q. This section designs several δ(·) functions as follows.

5.2.1. One-best channel selection (1-best)

The simplest channel-selection method is to pick the channel with the highest SNR:

p_i = \begin{cases} 1, & \text{if } q_i = \max_{1 \le k \le M} q_k \\ 0, & \text{otherwise} \end{cases} \qquad \forall i = 1, \ldots, M.    (17)

After the channel selection, DAB outputs the noisy speech from the selected channel directly.

5.2.2. All channels (all-channels)

Another simple channel-selection method is to select all channels with equal importance:

p_i = 1, \qquad \forall i = 1, \ldots, M.    (18)

This method is an extreme case of channel selection that usually performs well when the microphones are distributed in a small space.

5.2.3. N-best channel selection with predefined N (fixed-N-best)

When the microphone number M is large enough, there may exist several microphones close to the speech source whose received signals are more informative than the others. It is better to group the informative microphones together into a local array instead of selecting one best channel:

p_i = \begin{cases} 1, & \text{if } q_i \in \{q'_1, q'_2, \ldots, q'_N\} \\ 0, & \text{otherwise} \end{cases} \qquad \forall i = 1, \ldots, M    (19)

where q'_1 \ge q'_2 \ge \ldots \ge q'_M is the descending ordering of \{q_i\}_{i=1}^M, and N is a user-defined hyperparameter with N \le M.

5.2.4. N-best channel selection where N is determined on-the-fly (auto-N-best)

Here we develop a simple method that determines the hyperparameter N in (19) on-the-fly. It first finds q^* = \max_{i \in \{1,\ldots,M\}} q_i, and then determines p by

p_i = \begin{cases} 1, & \text{if } \frac{q_i}{q^*}\cdot\frac{1-q^*}{1-q_i} > \gamma \\ 0, & \text{otherwise} \end{cases} \qquad \forall i = 1, \ldots, M    (20)

where γ ∈ [0, 1] is a tunable threshold. See the Appendix for the proof of (20).

5.2.5. Soft N-best channel selection (soft-N-best)

One way to encode the signal quality of the selected channels in (20) is to use soft weights as follows:

p_i = \begin{cases} q_i, & \text{if } \frac{q_i}{q^*}\cdot\frac{1-q^*}{1-q_i} > \gamma \\ 0, & \text{otherwise} \end{cases} \qquad \forall i = 1, \ldots, M.    (21)
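The rules of Sections 5.2.1-5.2.5 reduce to a few lines of array code. The sketch below is a plain NumPy rendering of Eqs. (17)-(21); the function name and its interface are assumptions added for illustration.

```python
import numpy as np

def select_channels(q, method="auto-N-best", N=None, gamma=0.5):
    """Map predicted channel weights q (one value in (0, 1) per channel) to a
    channel-selection vector p, following Eqs. (17)-(21)."""
    q = np.asarray(q, dtype=float)
    M = q.size
    p = np.zeros(M)
    if method == "1-best":                                  # Eq. (17)
        p[np.argmax(q)] = 1.0
    elif method == "all-channels":                          # Eq. (18)
        p[:] = 1.0
    elif method == "fixed-N-best":                          # Eq. (19): keep the N largest weights
        p[np.argsort(q)[::-1][:N]] = 1.0
    elif method in ("auto-N-best", "soft-N-best"):          # Eqs. (20)-(21)
        q_star = q.max()
        ratio = (q / q_star) * (1.0 - q_star) / (1.0 - q + 1e-12)
        keep = ratio > gamma
        p[keep] = q[keep] if method == "soft-N-best" else 1.0
    return p
```

For example, with q = [0.9, 0.85, 0.2, 0.6] and γ = 0.5, "auto-N-best" keeps only the first two channels, since only their ratios in Eq. (20) exceed the threshold.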
5.2.6. Machine-learning-based N-best channel selection (learning-N-best)

The above channel selection methods determine the selected channels by SNR only, without considering the correlation between the channels. As we know, the correlation between the channels, which encodes environmental information and the time delays between the microphones, is important to adaptive beamforming. Here, we develop a spectral-clustering-based channel selection method that incorporates this correlation into the design of the affinity matrix of the spectral clustering.

Unlike the other channel selection algorithms, "learning-N-best" should first conduct the time synchronization, which takes y(t, f) = [y_1(t, f), ..., y_M(t, f)]^T as its input. Then, it calculates the covariance matrix of the noisy speech across the channels by

\mathbf{\Phi}_{yy}(f) = \sum_t \mathbf{y}(t,f)\mathbf{y}(t,f)^H    (22)

and normalizes (22) to an amplitude covariance matrix \mathbf{\Phi}^{\rm norm}_{yy}(f):

\mathbf{\Phi}^{\rm norm}_{yy}(f)(i,j) = \frac{|\mathbf{\Phi}_{yy}(f)(i,j)|^2}{\mathbf{\Phi}_{yy}(f)(i,i)\,\mathbf{\Phi}_{yy}(f)(j,j)}, \qquad \forall i, j = 1, \ldots, M.    (23)

After that, it calculates a new matrix K by averaging the amplitude covariance matrix along the frequency axis:

\mathbf{K}(i,j) = \frac{1}{F}\sum_{f=1}^{F}\mathbf{\Phi}^{\rm norm}_{yy}(f)(i,j), \qquad \forall i, j = 1, \ldots, M    (24)

where F is the number of DFT bins. The affinity matrix A of the spectral clustering is defined as

\mathbf{A} = \exp\!\left(-\frac{|\mathbf{K}-\mathbf{I}|^2}{2\sigma^2}\right)    (25)

where I is the identity matrix, and σ is a hyperparameter with a default value of 1. Following the Laplacian eigenvalue decomposition (Ng et al., 2001) of A, it obtains a J × M-dimensional representation of the channels, U = [u_1, ..., u_M], where u_i is the representation of the i-th microphone and J denotes the dimension of the representation.

"Learning-N-best" conducts agglomerative hierarchical clustering on U, and takes the maximal lifetime of the dendrogram as the threshold to partition the microphones into B clusters (1 ≤ B ≤ M), denoted as U_1, ..., U_B. The maximum predicted SNRs of the microphones in the clusters are denoted as q'_1, ..., q'_B respectively. Finally, it groups the microphones that satisfy the following condition into a local microphone array:

p_i = \begin{cases} 1, & \text{if } \mathbf{u}_i \in \mathcal{U}_b \text{ and } \frac{q'_b}{q'_*}\cdot\frac{1-q'_*}{1-q'_b} > \gamma \\ 0, & \text{otherwise} \end{cases} \qquad \forall i = 1, \ldots, M, \ \forall b = 1, \ldots, B    (26)

where q'_* = \max_{1 \le b \le B} q'_b.

Figure 4: Two examples of the learning-N-best channel selection method in point source noise environments. (a1) Dendrogram of Example 1. (b1) Channel selection result of Example 1. (a2) Dendrogram of Example 2. (b2) Channel selection result of Example 2.

Figure 4 shows two examples of the "learning-N-best" method. From the figure, we see that the microphones around the speech sources are grouped into clusters, while the microphones that are far away from the speech sources have weak correlations; hence they form a number of special clusters that each contain only one microphone. Note that, as shown in Example 2 of Fig. 4, because the selection criterion is determined by (26), there is no guarantee that the clusters that contain more than one microphone will be selected, or that the clusters that contain only a single microphone will be discarded.
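A rough NumPy/SciPy sketch of the "learning-N-best" pipeline of Eqs. (22)-(26) follows: it builds the frequency-averaged amplitude covariance matrix, turns it into an affinity matrix, embeds the channels with a normalized spectral decomposition in the spirit of Ng et al. (2001), and clusters them hierarchically. The maximal-lifetime thresholding is approximated crudely, and all function names are assumptions; this is an illustrative sketch, not the authors' code.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def channel_affinity(Y, sigma=1.0):
    """Affinity matrix A of Eq. (25) from synchronized STFTs Y of shape (M, T, F)."""
    M, _, F = Y.shape
    K = np.zeros((M, M))
    for f in range(F):
        Yf = Y[:, :, f]                                    # (M, T)
        Phi = Yf @ Yf.conj().T                             # Eq. (22)
        d = np.real(np.diag(Phi))
        K += np.abs(Phi) ** 2 / (np.outer(d, d) + 1e-12)   # Eq. (23), accumulated over f
    K /= F                                                 # Eq. (24)
    return np.exp(-np.abs(K - np.eye(M)) ** 2 / (2.0 * sigma ** 2))   # Eq. (25)

def cluster_channels(A, J):
    """Spectral embedding of the channels followed by agglomerative clustering."""
    D = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    _, vecs = np.linalg.eigh(D @ A @ D)                    # normalized affinity
    U = vecs[:, -J:]                                       # one J-dim row per channel
    U /= np.linalg.norm(U, axis=1, keepdims=True) + 1e-12
    Z = linkage(U, method="average")
    gaps = np.diff(Z[:, 2])                                # crude "lifetimes" between merges
    if len(gaps):
        i = int(np.argmax(gaps))
        thr = 0.5 * (Z[i, 2] + Z[i + 1, 2])                # cut inside the largest gap
    else:
        thr = 0.5 * Z[-1, 2]
    return fcluster(Z, t=thr, criterion="distance")        # cluster label per channel
```

The cluster labels are then combined with the predicted channel weights through the rule of Eq. (26).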
6. Speech enhancement: An application case

After obtaining the synchronized signals y(t, f) = [y_1(t, f), ..., y_N(t, f)]^T, we may use existing multichannel signal processing techniques directly, or with slight modification, for a specific application. Here we use a deep beamforming algorithm (Heymann et al., 2016; Wang & Wang, 2018) directly for speech enhancement as an example.

The deep beamforming algorithm finds a linear estimator w_opt(f) to filter y(t, f) by the following equation:

\hat{x}_{\rm ref}(t,f) = \mathbf{w}^H_{\rm opt}(f)\,\mathbf{y}(t,f)    (27)

where x̂_ref(t, f) is an estimate of the direct sound at the reference microphone of the array. For example, MVDR finds w_opt by minimizing the average output power of the beamformer while maintaining the energy along the target direction:

\min_{\mathbf{w}(f)} \ \mathbf{w}^H(f)\mathbf{\Phi}_{nn}(f)\mathbf{w}(f) \quad \text{subject to} \quad \mathbf{w}^H(f)\mathbf{c}(f) = 1    (28)

where Φ_nn(f) is an M × M-dimensional cross-channel covariance matrix of the received noise signal n(f). Problem (28) has a closed-form solution:

\mathbf{w}_{\rm opt}(f) = \frac{\hat{\mathbf{\Phi}}^{-1}_{nn}(f)\,\hat{\mathbf{c}}(f)}{\hat{\mathbf{c}}^H(f)\,\hat{\mathbf{\Phi}}^{-1}_{nn}(f)\,\hat{\mathbf{c}}(f)}    (29)

where \hat{\mathbf{\Phi}}_{nn}(f) and \hat{\mathbf{c}}(f) are the estimates of Φ_nn(f) and c(f) respectively, which are derived by the following equations according to (Zhang et al., 2017; Wang & Wang, 2018):

\hat{\mathbf{\Phi}}_{xx}(f) = \frac{1}{\sum_t \eta(t,f)}\sum_t \eta(t,f)\,\mathbf{y}(t,f)\mathbf{y}(t,f)^H    (30)

\hat{\mathbf{\Phi}}_{nn}(f) = \frac{1}{\sum_t \xi(t,f)}\sum_t \xi(t,f)\,\mathbf{y}(t,f)\mathbf{y}(t,f)^H    (31)

\hat{\mathbf{c}}(f) = {\rm principal}\big(\hat{\mathbf{\Phi}}_{xx}(f)\big)    (32)

where \hat{\mathbf{\Phi}}_{xx}(f) is an estimate of the covariance matrix of the direct sound x(t, f), principal(·) is a function returning the first principal component of the input square matrix, and η(t, f) and ξ(t, f) are defined as the products of the individual estimated T-F masks:

\eta(t,f) = \prod_{i=1}^{M}\widehat{\rm IRM}_i(t,f)    (33)

\xi(t,f) = \prod_{i=1}^{M}\big(1 - \widehat{\rm IRM}_i(t,f)\big).    (34)

Note that, in our experiments, when we calculate η(t, f) and ξ(t, f), we take all channels of the ad-hoc array into consideration, which empirically results in a slight performance improvement over taking only the selected channels into the calculation.
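The mask-based MVDR of Eqs. (27)-(34) can be summarized in a short NumPy sketch. The function name and the diagonal-loading term added before the matrix inversion are illustrative assumptions, and the sketch computes η and ξ over whichever channels are passed in, whereas the paper computes them over all M channels of the array.

```python
import numpy as np

def mvdr_deep_beamforming(Y, irm):
    """Mask-based MVDR following Eqs. (27)-(34).

    Y:   synchronized complex STFTs of the selected channels, shape (N, T, F).
    irm: estimated ideal ratio masks from DNN1 for the same channels, shape (N, T, F).
    Returns the enhanced STFT of shape (T, F).
    """
    N, T, F = Y.shape
    eta = np.prod(irm, axis=0)                             # Eq. (33)
    xi = np.prod(1.0 - irm, axis=0)                        # Eq. (34)
    out = np.zeros((T, F), dtype=complex)
    for f in range(F):
        Yf = Y[:, :, f]                                    # (N, T)
        Phi_xx = (Yf * eta[:, f]) @ Yf.conj().T / (eta[:, f].sum() + 1e-12)   # Eq. (30)
        Phi_nn = (Yf * xi[:, f]) @ Yf.conj().T / (xi[:, f].sum() + 1e-12)     # Eq. (31)
        _, vecs = np.linalg.eigh(Phi_xx)
        c_hat = vecs[:, -1]                                # principal component, Eq. (32)
        Phi_nn_inv = np.linalg.inv(Phi_nn + 1e-6 * np.eye(N))   # diagonal loading for stability
        w = Phi_nn_inv @ c_hat / (c_hat.conj() @ Phi_nn_inv @ c_hat)          # Eq. (29)
        out[:, f] = w.conj() @ Yf                          # Eq. (27): x_ref = w^H y
    return out
```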
7. Experiments

In this section, we study the effectiveness of DAB in diffuse noise and point source noise environments under the situation where the output signals of the channels have random time delays caused by devices. Specifically, we first present the experimental settings in Section 7.1, then present the experimental results in the diffuse noise and point source noise environments in Section 7.2, and finally discuss the effects of hyperparameter settings on performance in Sections 7.3 and 7.4.

7.1. Experimental settings

Datasets: The clean speech was generated from the TIMIT corpus. We randomly selected half of the training speakers to construct the database for training DNN1, and the remaining half for training DNN2. We used all test speakers for the test. The noise source for the training database was a large-scale sound effect library which contains over 20,000 sound effects. The additive noise for the test database was the babble, factory1, and volvo noise respectively from the NOISEX-92 database.

Training data: We simulated a rectangular room for each training utterance. The length and width of the room were generated randomly from a range of [5, 30] meters. The height was generated randomly from [2.5, 4] meters. The reverberant environment was simulated by an image-source model (https://github.com/ehabets/RIR-Generator). Its T60 was selected randomly from a range of [0, 1] second. A speech source, a noise source, and a single microphone were placed randomly in the room. The SNR, which is the energy ratio between the speech and noise at the locations of their sources, was randomly selected from a range of [-10, 20] dB. We synthesized 50,000 noisy utterances to train DNN1, and 100,000 noisy utterances to train DNN2.

Test data: We constructed a rectangular room for each test utterance. The length, width, and height of the room were randomly generated from [10, 20], [10, 20], and [2.7, 3.5] meters respectively. The additive noise is assumed to be either diffuse noise or point source noise.

For the diffuse noise environment, a speech source and a microphone array were placed randomly in the room. The T60 for generating reverberant speech was selected randomly from a range of [0.4, 0.8] second. To simulate uncorrelated diffuse noise, the noise segments at different microphones do not overlap, and they were added directly to the reverberant speech at the microphone receivers without reverberation. The noise power at the locations of all microphones was maintained at the same level, which was calculated from the SNR of the direct sound over the additive noise at a place 1 meter away from the speech source, denoted as the SNR at the origin (SNRatO). Note that the SNRs at different microphones were different due to the energy degradation of the speech signal during its propagation. The SNRatO was selected from 10, 15, and 20 dB respectively. We generated 1,000 test utterances for each SNRatO, each noise type, and each kind of microphone array, which amounts to 9 test scenarios and 18,000 test utterances.

For the point source noise environment, a speech source, a point noise source, and a microphone array were placed randomly in the room. The T60 of the room was selected randomly from a range of [0.4, 0.8] second for generating reverberant speech and reverberant noise at the microphone receivers. The SNRatO was defined as the log ratio of the speech power over the noise power at their source locations respectively. It was chosen from {-5, 5, 15} dB. As in the diffuse noise environment, we also generated 1,000 test utterances for each SNRatO, each noise type, and each kind of microphone array.

For both test environments, we generated a random time delay τ from a range of [0, 0.5] second at each microphone of an ad-hoc microphone array to simulate the time delay caused by devices.

Comparison methods: The baseline is the MVDR-based DB (Heymann et al., 2016) with a linear array of 16 microphones, which is described in Section 6. All DB models employed DNN1 for the single-channel noise estimation. The aperture size of the linear microphone array (i.e. the distance between two neighboring microphones) was set to 10 centimeters. DAB also employed an ad-hoc array of 16 microphones. We denote the DAB with different channel selection algorithms as:

• DAB + 1-best.
• DAB + all-channels.
• DAB + fixed-N-best. We set N = √M.
• DAB + auto-N-best. We set γ = 0.5.
• DAB + soft-N-best. We set γ = 0.5.
• DAB + learning-N-best. We set J = M/2, σ = 1, and γ = 0.5.

To study the effectiveness of the time synchronization (TS) module, we further compared the following systems:

• DAB + channel selection method. It does not use the TS module.
• DAB + channel selection method + GT. It uses the ground-truth (GT) time delay caused by the devices to synchronize the microphones.
• DAB + channel selection method + TS. It uses the TS module to estimate the time delay caused by both the different locations of the microphones and the different devices where the microphones are installed.

We implemented the above comparison methods with different channel selection algorithms. For example, if the channel selection algorithm is "auto-N-best", then the comparison systems are "DAB + auto-N-best", "DAB + auto-N-best + GT", and "DAB + auto-N-best + TS" respectively.
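For concreteness, one way to sample a test-scene configuration with the ranges stated in this subsection (room size, T60, SNRatO, and per-microphone device delay) is sketched below. It only illustrates the sampling ranges and is not the authors' generation script; the actual room impulse responses come from the image-source RIR generator cited above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_test_scene(noise_type="diffuse"):
    """Sample one test-scene configuration with the ranges of Section 7.1 (illustrative only)."""
    return {
        "room_m": [rng.uniform(10, 20), rng.uniform(10, 20), rng.uniform(2.7, 3.5)],
        "t60_s": rng.uniform(0.4, 0.8),
        "snr_at_origin_db": rng.choice([10, 15, 20] if noise_type == "diffuse" else [-5, 5, 15]),
        # random device delay per microphone of a 16-microphone ad-hoc array
        "device_delay_s": rng.uniform(0.0, 0.5, size=16).tolist(),
    }
```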
DNN models: For each comparison method, we set the frame length and frame shift to 32 and 16 milliseconds respectively, and extracted 257-dimensional STFT features. We used the same DNN1 for DB and DAB. DNN1 is a standard feedforward DNN. It contains two hidden layers, each with 1024 hidden units. The activation functions of the hidden units and output units are the rectified linear unit and the sigmoid function, respectively. The number of epochs was set to 50. The batch size was set to 512. The scaling factor for the adaptive stochastic gradient descent was set to 0.0015, and the learning rate decreased linearly from 0.08 to 0.001. The momentum of the first 5 epochs was set to 0.5, and the momentum of the other epochs was set to 0.9. A contextual window was used to expand each input frame to its context along the time axis. The window size was set to 7. DNN2 has the same parameter setting as DNN1, except that DNN2 does not need a contextual window, was trained with a batch size of 32, and took eSTFT as the acoustic feature. All DNNs were well tuned. Note that although a bi-directional long short-term memory network may lead to better performance, we simply used the feedforward DNN, since the type of the DNN models is not the focus of this paper.
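A hypothetical PyTorch rendering of the DNN1 topology described above (257-dimensional STFT frames expanded by a 7-frame context window, two 1024-unit ReLU hidden layers, and a sigmoid output predicting a 257-dimensional IRM) is given below; the optimizer schedule, momentum settings, and training loop are omitted. DNN2 differs only in its input (utterance-level eSTFT, no context window) and its scalar output (a channel weight).

```python
import torch
import torch.nn as nn

class DNN1(nn.Module):
    """Feedforward mask estimator sketched from the description in the text."""
    def __init__(self, feat_dim=257, context=7, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim * context, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim), nn.Sigmoid(),   # estimated IRM of the center frame
        )

    def forward(self, x):          # x: (batch, context * feat_dim)
        return self.net(x)
```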
Evaluation metrics: The performance evaluation metrics include STOI (Taal et al., 2011), the perceptual evaluation of speech quality (PESQ) (Rix et al., 2001), and the signal-to-distortion ratio (SDR) (Vincent et al., 2006). STOI evaluates the objective speech intelligibility of time-domain signals. It has been shown empirically that STOI scores are well correlated with human speech intelligibility scores (Wang et al., 2014; Du et al., 2014; Huang et al., 2015; Zhang & Wang, 2016b). PESQ is a test methodology for the automated assessment of the speech quality as experienced by a listener of a telephony system. SDR is a metric similar to SNR for evaluating the quality of enhancement. The higher the value of an evaluation metric is, the better the performance is.

Table 1: Results with 16 microphones per array in diffuse noise environments. For each noise type (Babble, Factory, Volvo), the three columns are STOI, PESQ, and SDR.

SNRatO = 10 dB
Noisy                        0.5989 1.86 1.12  | 0.5969 1.80 1.20  | 0.6785 2.10 1.62
DB                           0.6911 1.87 2.75  | 0.6900 1.86 3.42  | 0.7766 2.16 3.95
DAB (1-best)                 0.7154 2.06 5.14  | 0.7143 2.00 5.13  | 0.7892 2.31 5.23
DAB (all-channels)           0.5824 1.83 -0.93 | 0.5760 1.78 -1.55 | 0.6061 1.88 -1.49
DAB (all-channels + GT)      0.7206 2.06 4.40  | 0.7137 2.00 4.47  | 0.7831 2.39 4.42
DAB (all-channels + TS)      0.7405 2.00 3.49  | 0.7388 1.95 3.54  | 0.8039 2.32 3.25
DAB (fixed-N-best)           0.6026 1.87 -0.12 | 0.6022 1.82 -0.33 | 0.6351 1.92 -0.32
DAB (fixed-N-best + GT)      0.7451 2.12 5.10  | 0.7437 2.07 5.16  | 0.8117 2.42 5.54
DAB (fixed-N-best + TS)      0.7675 2.11 5.18  | 0.7634 2.05 5.01  | 0.8460 2.43 5.97
DAB (auto-N-best)            0.5982 1.87 -0.13 | 0.5927 1.83 -0.61 | 0.6573 2.00 0.82
DAB (auto-N-best + GT)       0.7531 2.14 5.74  | 0.7518 2.09 5.73  | 0.8164 2.46 5.97
DAB (auto-N-best + TS)       0.7696 2.12 5.45  | 0.7641 2.06 5.36  | 0.8405 2.44 5.85
DAB (soft-N-best)            0.5999 1.85 -0.28 | 0.5952 1.83 -0.76 | 0.6645 2.01 0.84
DAB (soft-N-best + GT)       0.7463 2.13 5.22  | 0.7455 2.07 5.26  | 0.8055 2.42 5.50
DAB (soft-N-best + TS)       0.7659 2.12 5.11  | 0.7610 2.05 5.09  | 0.8363 2.43 5.65
DAB (learning-N-best)        0.5973 1.86 -0.29 | 0.5892 1.81 -0.99 | 0.6488 1.98 0.21
DAB (learning-N-best + GT)   0.7405 2.12 5.22  | 0.7387 2.06 5.34  | 0.8026 2.43 5.35
DAB (learning-N-best + TS)   0.7631 2.07 4.55  | 0.7606 2.02 4.59  | 0.8330 2.41 4.97

SNRatO = 15 dB
Noisy                        0.6410 1.97 3.05  | 0.6400 1.93 2.79  | 0.6847 2.10 2.87
DB                           0.7350 2.02 4.37  | 0.7396 1.99 4.61  | 0.7804 2.19 4.86
DAB (1-best)                 0.7496 2.17 6.59  | 0.7527 2.14 6.45  | 0.7906 2.31 6.58
DAB (all-channels)           0.5977 1.85 -0.52 | 0.5990 1.84 -0.87 | 0.6102 1.88 -0.75
DAB (all-channels + GT)      0.7575 2.22 5.45  | 0.7588 2.18 5.39  | 0.7887 2.42 5.23
DAB (all-channels + TS)      0.7809 2.15 4.53  | 0.7869 2.12 4.54  | 0.8091 2.35 4.32
DAB (fixed-N-best)           0.6218 1.89 0.27  | 0.6189 1.88 0.03  | 0.6463 1.93 0.33
DAB (fixed-N-best + GT)      0.7788 2.26 6.15  | 0.7832 2.22 6.11  | 0.8172 2.43 6.37
DAB (fixed-N-best + TS)      0.8074 2.26 6.65  | 0.8095 2.21 6.36  | 0.8518 2.44 7.04
DAB (auto-N-best)            0.6188 1.90 0.44  | 0.6142 1.87 -0.13 | 0.6641 2.00 1.44
DAB (auto-N-best + GT)       0.7877 2.30 6.83  | 0.7946 2.26 6.92  | 0.8219 2.48 6.85
DAB (auto-N-best + TS)       0.8082 2.27 6.82  | 0.8140 2.23 6.81  | 0.8476 2.47 6.93
DAB (soft-N-best)            0.6179 1.90 0.20  | 0.6140 1.88 -0.29 | 0.6625 2.00 1.13
DAB (soft-N-best + GT)       0.7792 2.27 6.10  | 0.7858 2.24 6.18  | 0.8094 2.43 6.06
DAB (soft-N-best + TS)       0.8045 2.26 6.35  | 0.8090 2.22 6.28  | 0.8428 2.46 6.54
DAB (learning-N-best)        0.6187 1.90 0.31  | 0.6154 1.87 -0.24 | 0.6562 1.98 0.93
DAB (learning-N-best + GT)   0.7768 2.27 6.40  | 0.7799 2.24 6.30  | 0.8096 2.46 6.35
DAB (learning-N-best + TS)   0.8049 2.24 5.98  | 0.8100 2.20 5.90  | 0.8394 2.45 5.98

SNRatO = 20 dB
Noisy                        0.6622 2.03 3.81  | 0.6653 2.01 3.73  | 0.6860 2.12 3.71
DB                           0.7539 2.09 4.90  | 0.7619 2.09 5.23  | 0.7792 2.21 5.29
DAB (1-best)                 0.7768 2.25 7.25  | 0.7790 2.25 7.25  | 0.7967 2.31 7.13
DAB (all-channels)           0.6196 1.88 -0.31 | 0.6212 1.89 -0.19 | 0.6213 1.89 -0.50
DAB (all-channels + GT)      0.7784 2.33 5.74  | 0.7834 2.32 5.68  | 0.7937 2.43 5.44
DAB (all-channels + TS)      0.8057 2.27 5.21  | 0.8113 2.25 5.02  | 0.8161 2.38 4.72
DAB (fixed-N-best)           0.6583 1.96 1.03  | 0.6487 1.95 0.72  | 0.6553 1.94 0.57
DAB (fixed-N-best + GT)      0.8011 2.35 6.60  | 0.8046 2.34 6.33  | 0.8183 2.42 6.65
DAB (fixed-N-best + TS)      0.8352 2.37 7.36  | 0.8346 2.34 7.08  | 0.8551 2.45 7.47
DAB (auto-N-best)            0.6632 1.99 1.78  | 0.6504 1.97 1.20  | 0.6816 2.03 2.19
DAB (auto-N-best + GT)       0.8098 2.40 7.28  | 0.8134 2.39 7.10  | 0.8257 2.48 7.21
DAB (auto-N-best + TS)       0.8361 2.39 7.67  | 0.8383 2.36 7.21  | 0.8515 2.48 7.35
DAB (soft-N-best)            0.6610 1.99 1.52  | 0.6479 1.97 0.95  | 0.6810 2.03 2.00
DAB (soft-N-best + GT)       0.8018 2.37 6.70  | 0.8034 2.36 6.30  | 0.8164 2.45 6.59
DAB (soft-N-best + TS)       0.8317 2.38 7.16  | 0.8338 2.35 6.74  | 0.8473 2.46 7.01
DAB (learning-N-best)        0.6564 1.97 1.30  | 0.6491 1.96 0.93  | 0.6733 2.00 1.66
DAB (learning-N-best + GT)   0.7968 2.38 6.74  | 0.8006 2.37 6.53  | 0.8139 2.47 6.57
DAB (learning-N-best + TS)   0.8309 2.36 6.74  | 0.8338 2.33 6.37  | 0.8447 2.47 6.43

7.2. Main results

We list the performance of the comparison methods in the diffuse noise and point source noise environments in Tables 1 and 2 respectively.

Table 2: Results with 16 microphones per array in point source noise environments. For each noise type (Babble, Factory, Volvo), the three columns are STOI, PESQ, and SDR.

SNRatO = -5 dB
Noisy                        0.4465 1.29 -6.75 | 0.4336 1.19 -6.08 | 0.6286 1.90 -0.20
DB                           0.5429 1.63 -3.50 | 0.5250 1.51 -2.22 | 0.7406 2.04 3.82
DAB (1-best)                 0.5741 1.73 -1.96 | 0.5512 1.59 -1.52 | 0.7647 2.25 5.16
DAB (all-channels)           0.4246 1.95 -8.97 | 0.4194 1.73 -8.10 | 0.5106 1.71 -3.70
DAB (all-channels + GT)      0.5756 1.70 -2.41 | 0.5487 1.50 -2.07 | 0.7424 2.22 3.95
DAB (all-channels + TS)      0.5954 1.70 -2.43 | 0.5488 1.50 -2.20 | 0.7775 2.22 3.30
DAB (fixed-N-best)           0.4665 1.82 -6.69 | 0.4619 1.66 -5.83 | 0.5745 1.79 -1.85
DAB (fixed-N-best + GT)      0.5891 1.74 -1.99 | 0.5619 1.58 -1.68 | 0.7736 2.30 5.16
DAB (fixed-N-best + TS)      0.6065 1.74 -1.65 | 0.5692 1.58 -1.25 | 0.8124 2.33 5.57
DAB (auto-N-best)            0.4753 1.93 -6.60 | 0.4547 1.75 -6.48 | 0.5773 1.86 -1.18
DAB (auto-N-best + GT)       0.6029 1.76 -1.26 | 0.5707 1.55 -1.21 | 0.7745 2.32 5.55
DAB (auto-N-best + TS)       0.6160 1.75 -1.24 | 0.5696 1.55 -1.23 | 0.8047 2.32 5.21
DAB (soft-N-best)            0.4806 1.95 -6.26 | 0.4601 1.76 -6.08 | 0.5822 1.87 -1.05
DAB (soft-N-best + GT)       0.6035 1.77 -1.15 | 0.5725 1.57 -1.05 | 0.7681 2.29 5.08
DAB (soft-N-best + TS)       0.6164 1.76 -1.15 | 0.5719 1.58 -1.10 | 0.8013 2.32 4.92
DAB (learning-N-best)        0.4606 1.92 -7.34 | 0.4465 1.75 -6.94 | 0.5654 1.83 -1.80
DAB (learning-N-best + GT)   0.5915 1.73 -1.73 | 0.5621 1.53 -1.54 | 0.7617 2.29 4.91
DAB (learning-N-best + TS)   0.6086 1.72 -1.74 | 0.5604 1.53 -1.72 | 0.7967 2.29 4.39

SNRatO = 5 dB
Noisy                        0.5678 1.68 0.11  | 0.5607 1.61 0.50  | 0.6550 1.99 2.69
DB                           0.6975 1.87 2.80  | 0.6856 1.85 3.19  | 0.7695 2.14 4.78
DAB (1-best)                 0.7232 2.05 5.22  | 0.7187 2.00 5.53  | 0.7939 2.31 7.42
DAB (all-channels)           0.4942 1.79 -3.59 | 0.4806 1.74 -3.62 | 0.5207 1.74 -2.69
DAB (all-channels + GT)      0.7263 2.12 4.71  | 0.7168 2.06 4.98  | 0.7770 2.37 5.71
DAB (all-channels + TS)      0.7602 2.11 4.22  | 0.7481 2.05 4.31  | 0.8075 2.33 4.79
DAB (fixed-N-best)           0.5522 1.80 -1.80 | 0.5421 1.74 -1.82 | 0.5889 1.83 -0.92
DAB (fixed-N-best + GT)      0.7473 2.14 5.28  | 0.7425 2.09 5.55  | 0.8117 2.42 7.22
DAB (fixed-N-best + TS)      0.7768 2.15 5.61  | 0.7734 2.11 5.81  | 0.8492 2.44 7.60
DAB (auto-N-best)            0.5820 1.89 -0.36 | 0.5901 1.86 0.39  | 0.6118 1.93 0.55
DAB (auto-N-best + GT)       0.7601 2.17 5.94  | 0.7568 2.12 6.29  | 0.8187 2.46 7.68
DAB (auto-N-best + TS)       0.7835 2.17 6.02  | 0.7751 2.12 6.32  | 0.8455 2.45 7.54
DAB (soft-N-best)            0.5834 1.90 -0.47 | 0.5915 1.85 0.25  | 0.6122 1.93 0.42
DAB (soft-N-best + GT)       0.7534 2.16 5.48  | 0.7510 2.10 5.83  | 0.8081 2.42 6.85
DAB (soft-N-best + TS)       0.7797 2.16 5.61  | 0.7714 2.11 5.91  | 0.8420 2.44 7.09
DAB (learning-N-best)        0.5605 1.86 -1.27 | 0.5617 1.82 -0.86 | 0.5975 1.90 -0.09
DAB (learning-N-best + GT)   0.7462 2.15 5.37  | 0.7397 2.09 5.71  | 0.8034 2.43 6.95
DAB (learning-N-best + TS)   0.7806 2.16 5.40  | 0.7672 2.10 5.38  | 0.8378 2.43 6.57

SNRatO = 15 dB
Noisy                        0.6394 1.92 2.71  | 0.6405 1.90 2.76  | 0.6700 2.02 3.16
DB                           0.7534 2.11 4.85  | 0.7596 2.10 5.01  | 0.7767 2.21 5.21
DAB (1-best)                 0.7868 2.26 7.48  | 0.7886 2.23 7.39  | 0.8024 2.32 7.63
DAB (all-channels)           0.5215 1.76 -2.52 | 0.5152 1.73 -2.68 | 0.5183 1.76 -2.69
DAB (all-channels + GT)      0.7770 2.36 6.16  | 0.7763 2.33 6.12  | 0.7871 2.43 5.98
DAB (all-channels + TS)      0.8173 2.35 5.77  | 0.8189 2.31 5.66  | 0.8173 2.41 5.14
DAB (fixed-N-best)           0.5924 1.82 -0.69 | 0.5886 1.80 -0.78 | 0.5911 1.83 -0.81
DAB (fixed-N-best + GT)      0.8015 2.35 6.81  | 0.7999 2.32 6.77  | 0.8177 2.44 7.23
DAB (fixed-N-best + TS)      0.8434 2.40 7.68  | 0.8419 2.35 7.54  | 0.8591 2.48 7.87
DAB (auto-N-best)            0.6503 2.01 2.22  | 0.6042 1.89 0.39  | 0.6602 2.03 2.39
DAB (auto-N-best + GT)       0.8156 2.40 7.93  | 0.8088 2.38 7.48  | 0.8292 2.46 7.94
DAB (auto-N-best + TS)       0.8405 2.41 8.08  | 0.8422 2.37 7.50  | 0.8502 2.47 8.12
DAB (soft-N-best)            0.6499 2.01 2.13  | 0.6042 1.90 0.30  | 0.6595 2.03 2.27
DAB (soft-N-best + GT)       0.8088 2.39 7.44  | 0.8009 2.35 6.84  | 0.8226 2.44 7.45
DAB (soft-N-best + TS)       0.8379 2.40 7.79  | 0.8385 2.37 7.09  | 0.8477 2.46 7.85
DAB (learning-N-best)        0.6272 1.95 1.25  | 0.5862 1.85 -0.33 | 0.6340 1.98 1.18
DAB (learning-N-best + GT)   0.8028 2.39 7.30  | 0.7966 2.36 6.97  | 0.8148 2.45 7.28
DAB (learning-N-best + TS)   0.8415 2.42 7.28  | 0.8392 2.37 6.89  | 0.8500 2.49 7.22

From the tables, we see that the DAB variants given the TS module or the ground-truth time delay outperform the DB baseline significantly in terms of all evaluation metrics. Even the simplest "DAB + 1-best" is better than the DB baseline, which demonstrates the advantage of the ad-hoc microphone array.

We compare the DAB variants with the TS module or the ground-truth time delay for studying the effectiveness of the channel selection algorithms. We find that "auto-N-best" performs the best among the channel selection algorithms in most cases, followed by "soft-N-best". The "learning-N-best" and "fixed-N-best" algorithms perform equally well in general, and both perform better than the "all-channels" algorithm. Although "1-best" performs the poorest in terms of STOI and PESQ, it usually produces good SDR scores that are comparable to those produced by "auto-N-best". Note that although "learning-N-best" seems an advanced technique, this advantage does not transfer to superior performance. This may be caused by (26), which is an expansion of the channel selection result of "auto-N-best". This problem needs further investigation in the future. Comparing "auto-N-best" and "soft-N-best", we further find that the different amplitude ranges of the channels affect the performance, though this phenomenon is not obvious because the nonzero weights do not vary over a large range. As a byproduct, the idea of "soft-N-best" is a way of synchronizing the adaptive gain controllers of the devices. The synchronization of the adaptive gain controllers is not the focus of this paper; hence we leave it for future study.

We compare "DAB + channel selection method", "DAB + channel selection method + GT", and "DAB + channel selection method + TS" given different channel selection methods for studying the effectiveness of the time synchronization module.
We find that DAB without the TS module does not work at all when there exists a serious time unsynchronization problem caused by the devices. "DAB + channel selection method + TS" performs better than "DAB + channel selection method + GT" in terms of STOI, and is equally good in terms of PESQ and SDR at all SNRatO levels, even though the latter was given the ground-truth time delay caused by the devices. This phenomenon demonstrates the effectiveness of the proposed TS module. It also implies that the time unsynchronization problem caused by the different locations of the microphones affects the performance, though not seriously.

Figure 5: Examples of channel-selection results in the babble point source noise environments at the SNRatO of -5 dB, where the number in the brackets of the title of each sub-figure is the STOI value. (a) Room size: 20 × 10 × 3.5 m³. (b) Room size: 13 × 14 × 3.1 m³. (c) Room size: 14 × 11 × 3.1 m³.

Figure 5 shows three examples of the channel selection results, in which we find some interesting phenomena after looking into the details. Figure 5a is a typical scenario where the speech source is far away from the point noise source. We see clearly from the figure that, although "1-best" is noticeably poorer than the other algorithms, all comparison algorithms perform reasonably well according to the absolute STOI scores, since the SNRs at many of the selected microphones are relatively high. Figure 5b is a special scenario where the speech source is very close to the point noise source. Therefore, the SNRs at all microphones are low. As shown in the figure, it is better to select most microphones, as "auto-N-best" and "learning-N-best" do in this case; otherwise the performance is rather poor, as "1-best" yields. Figure 5c is a special scenario where there is a microphone very close to the speech source. It can be seen from the channel selection result that the best strategy is to select the closest microphone, while "all-channels" performs much more poorly than the other channel selection algorithms. To summarize the above phenomena, we see that the adaptive channel selection algorithms, i.e. "auto-N-best" and "learning-N-best", always produce top performance among the comparison algorithms.

7.3. Effect of the number of the microphones in an array

To study how the number of the microphones in an array affects the performance, we repeated the experimental setting in Section 7.1, except that the number of the microphones in an array was reduced to 4. Because the experimental phenomena were consistent across different SNRatO levels and noise types, we list the comparison results of only one test scenario in Tables 3 and 4 to save space.

Table 3: Results with 4 microphones per array in the babble diffuse noise environment at an SNRatO of 10 dB (STOI / PESQ / SDR).

Noisy                        0.5919 1.80 0.99
DB                           0.6830 1.91 3.14
DAB (1-best)                 0.6400 1.86 2.38
DAB (all-channels + TS)      0.7154 1.92 2.70
DAB (fixed-N-best + TS)      0.6821 1.90 2.58
DAB (auto-N-best + TS)       0.7112 1.93 3.01
DAB (soft-N-best + TS)       0.7013 1.91 2.27
DAB (learning-N-best + TS)   0.7135 1.92 2.82

From the tables, we see that, even if the number of the microphones in an array is limited, the DAB variants still perform comparably to DB, except "DAB + 1-best".
7.4. Effect of hyperparameter γ

To study how the hyperparameter γ affects the performance of “DAB + auto-N-best + TS”, “DAB + soft-N-best + TS”, and “DAB + learning-N-best + TS”, we tuned γ over {0.1, 0.3, 0.5, 0.7, 0.9}. To save space, we only show the results in the babble noise environments at the lowest SNR levels in Figs. 6 and 7. From the figures, we observe that “DAB + auto-N-best + TS” and “DAB + soft-N-best + TS” perform similarly when γ is well tuned, and both are better than “DAB + learning-N-best + TS”. The working range of γ is [0.5, 0.7] for “DAB + auto-N-best + TS” and “DAB + soft-N-best + TS”, and [0.7, 0.9] for “DAB + learning-N-best + TS”.

Figure 6: Effect of hyperparameter γ in the babble diffuse noise environment at an SNR of 10 dB.

Figure 7: Effect of hyperparameter γ in the babble point source noise environment at an SNR of −5 dB.

7.5. Effect of handcrafted features on performance

All of the above experiments were conducted with the eSTFT feature. To study how the handcrafted features affect the performance, we compared the DAB models that take eSTFT and MRCG, respectively, as the input of the channel-selection model g(·) in the babble noise environments at the lowest SNR levels. From the comparison results in Tables 5 and 6, we see that eSTFT is slightly better than MRCG in the diffuse noise environment and significantly outperforms MRCG in the point source noise environment. The effect of the acoustic feature on “DAB + 1-best + TS” in the point source noise environment is remarkable, which manifests the importance of designing a good handcrafted feature. We also observe that the advantage of the adaptive channel selection algorithms over “DAB + 1-best + TS” and “DAB + fixed-N-best + TS” is consistent across the two acoustic features, which demonstrates the robustness of the proposed channel selection algorithms to the choice of the acoustic features.

Table 5: Comparison results of handcrafted features for the DAB variants with 16 microphones per array in the babble diffuse noise environment at an SNR of 10 dB.

  SNR     Comparison method              STOI    PESQ   SDR
  10 dB   1-best (eSTFT)                 0.7154  2.06   5.14
          1-best (MRCG)                  0.7101  2.05   5.01
          fixed-N-best + TS (eSTFT)      0.7675  2.11   5.18
          fixed-N-best + TS (MRCG)       0.7617  2.10   4.96
          auto-N-best + TS (eSTFT)       0.7696  2.12   5.45
          auto-N-best + TS (MRCG)        0.7654  2.11   5.28
          soft-N-best + TS (eSTFT)       0.7659  2.12   5.11
          soft-N-best + TS (MRCG)        0.7621  2.11   4.88
          learning-N-best + TS (eSTFT)   0.7631  2.07   4.55
          learning-N-best + TS (MRCG)    0.7606  2.07   4.48

Table 6: Comparison results of handcrafted features for the DAB variants with 16 microphones per array in the babble point source noise environment at an SNR of −5 dB.

  SNR     Comparison method              STOI    PESQ   SDR
  -5 dB   1-best (eSTFT)                 0.5741  1.73  -1.96
          1-best (MRCG)                  0.5267  1.67  -3.77
          fixed-N-best + TS (eSTFT)      0.6065  1.74  -1.65
          fixed-N-best + TS (MRCG)       0.5838  1.67  -2.55
          auto-N-best + TS (eSTFT)       0.6160  1.75  -1.24
          auto-N-best + TS (MRCG)        0.5918  1.70  -2.14
          soft-N-best + TS (eSTFT)       0.6164  1.76  -1.15
          soft-N-best + TS (MRCG)        0.5934  1.70  -2.10
          learning-N-best + TS (eSTFT)   0.6086  1.72  -1.74
          learning-N-best + TS (MRCG)    0.5956  1.70  -2.27
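The exact eSTFT and MRCG definitions follow the earlier sections of the paper; purely as an illustration of what a frame-level spectral input to the channel-selection model g(·) looks like, the sketch below extracts log-magnitude STFT features with SciPy. The frame length, frame shift, and flooring constant are assumptions for the sketch, not the paper's settings.

    import numpy as np
    from scipy.signal import stft

    def log_stft_features(x, fs=16000, frame_len=0.032, frame_shift=0.016):
        """Frame-level log-magnitude STFT features for one channel (illustrative only)."""
        nperseg = int(frame_len * fs)                  # e.g. 32 ms frames
        noverlap = nperseg - int(frame_shift * fs)     # e.g. 16 ms shift
        _, _, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
        return np.log(np.abs(Z) + 1e-8).T              # shape: (num_frames, num_bins)

A gammatone-filterbank front end computed at several window lengths would play the analogous role for an MRCG-style feature.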
8. Conclusions and future work

In this paper, we have proposed deep ad-hoc beamforming (DAB), which is, to our knowledge, the first deep learning method designed for ad-hoc microphone arrays.³ DAB has the following novel aspects. First, DAB employs an ad-hoc microphone array to pick up speech signals, which has the potential to enhance the speech with equally high quality anywhere within the range covered by the array. It may also significantly improve the SNR at the microphone receivers, since some microphones are likely to be physically close to the speech source. Second, DAB employs a channel-selection algorithm to reweight the estimated speech signals with a sparsity constraint, which groups a handful of microphones around the speech source into a local microphone array; we have developed several channel-selection algorithms as well. Third, we have developed a time synchronization framework based on time delay estimators and the supervised 1-best channel selection. Finally, we have emphasized the importance of acoustic features to DAB by carrying out the first study of how different acoustic features affect its performance.

³ This claim is made on the basis that the core idea of the paper was put on arXiv (Zhang, 2018) in January 2019.

Besides the above novelties and advantages, the proposed DAB is flexible in incorporating new developments of DNN-based single-channel speech processing techniques, since its model is trained in a single-channel fashion. Its test process is also flexible in incorporating any number of microphones without retraining or revising the model, which meets the requirements of real-world applications. Moreover, although we applied DAB to speech enhancement as an example, it may be applied to other tasks as well by replacing the deep beamforming with other task-specific algorithms.

We have conducted extensive experiments in scenarios where the location of the speech source is far-field, random, and blind to the microphones. Experimental results in both the diffuse noise and point source noise environments demonstrate that DAB outperforms its MVDR-based deep beamforming counterpart by a large margin given a sufficient number of microphones. This conclusion is consistent across different acoustic features.

The research on DAB is only at the beginning, and many problems remain open. We list some urgent topics as follows. (i) How to synchronize microphones when the clock rates and power amplifiers of different devices differ. (ii) How to design new spatial acoustic features for ad-hoc microphone arrays beyond the interaural time difference and interaural level difference. (iii) How to design a model that can be trained with multichannel data collected from ad-hoc microphone arrays and generalize well to fundamentally different ad-hoc microphone arrays in the test stage. (iv) How to handle a large number of microphones (e.g. over 100) in a large room that contains many complicated acoustic environments.
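As background for the time delay estimators on which the TS framework is built, and for open problem (i), the sketch below shows a generic GCC-PHAT delay estimator in the spirit of Knapp & Carter (1976). It is a textbook illustration rather than the paper's TS implementation; the maximum-delay bound and the circular-shift alignment are simplifying assumptions.

    import numpy as np

    def gcc_phat_delay(x, y, fs, max_delay=0.05):
        """Estimate how much y lags x (in seconds) with GCC-PHAT (illustrative only)."""
        n = len(x) + len(y)                            # zero-padding for linear correlation
        X = np.fft.rfft(x, n=n)
        Y = np.fft.rfft(y, n=n)
        cross = Y * np.conj(X)
        cross /= np.abs(cross) + 1e-12                 # PHAT weighting: keep phase only
        cc = np.fft.irfft(cross, n=n)
        max_shift = int(max_delay * fs)
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        return (np.argmax(np.abs(cc)) - max_shift) / fs

    # A lagging channel could then be advanced before combination, e.g.:
    # d = gcc_phat_delay(ref_channel, other_channel, fs)
    # other_aligned = np.roll(other_channel, -int(round(d * fs)))  # circular shift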
Acknowledgments

The author would like to thank Prof. DeLiang Wang for helpful discussions. This work was supported in part by the National Key Research and Development Program of China under Grant No. 2018AAA0102200, in part by the National Science Foundation of China under Grant Nos. 61831019 and 61671381, in part by the Project of the Science, Technology, and Innovation Commission of Shenzhen Municipality under Grant No. JCYJ20170815161820095, and in part by the Open Research Project of the State Key Laboratory of Media Convergence and Communication, Communication University of China, under Grant No. SKLMCC2020KF009.

Appendix A.

Proof. We denote the energy of the direct-sound and additive-noise components of the test utterance at the $i$-th channel as $X_i$ and $N_i$, respectively, i.e. $X_i = \sum_t |x_i^{\mathrm{time}}(t)|$ and $N_i = \sum_t |n_i^{\mathrm{time}}(t)|$. Our core idea is to filter out the signals of the channels whose clean speech satisfies
\[
X_i < \gamma X^* . \tag{A.1}
\]
Under the assumptions that the estimated weights are perfect and that the statistics of the noise components are consistent across the channels, we have
\[
q_i = \frac{X_i}{X_i + N^*}, \qquad q^* = \frac{X^*}{X^* + N^*}. \tag{A.2}
\]
Substituting (A.2) into (A.1) yields (20).

References

Bai, Z., Zhang, X.-L., & Chen, J. (2020a). Cosine metric learning based speaker verification. Speech Communication, 118, 10–20.
Bai, Z., Zhang, X.-L., & Chen, J. (2020b). Partial AUC optimization based deep speaker embeddings with class-center learning for text-independent speaker verification. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6819–6823). IEEE.
Bai, Z., Zhang, X.-L., & Chen, J. (2020c). Speaker verification by partial AUC optimization with Mahalanobis distance metric learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
Carter, G. C. (1987). Coherence and time delay estimation. Proceedings of the IEEE, 75, 236–255.
Chen, J., Benesty, J., & Huang, Y. A. (2006). Time delay estimation in room acoustic environments: an overview. EURASIP Journal on Advances in Signal Processing, 2006, 026503.
Chen, J., Wang, Y., & Wang, D. L. (2014). A feature study for classification-based speech separation at very low signal-to-noise ratio. IEEE/ACM Trans. Audio, Speech, Lang. Process., 22, 1993–2002.
Delfarah, M., & Wang, D. (2017). Features for masking-based monaural speech separation in reverberant conditions. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25, 1085–1094.
Ditter, D., & Gerkmann, T. (2020). A multi-phase gammatone filterbank for speech separation via TasNet. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 36–40). IEEE.
Du, J., Tu, Y., Xu, Y., Dai, L., & Lee, C.-H. (2014). Speech separation of a target speaker based on deep neural networks. In Proc. IEEE Int. Conf. Signal Process. (pp. 473–477).
Erdogan, H., Hershey, J. R., Watanabe, S., Mandel, M. I., & Le Roux, J. (2016). Improved MVDR beamforming using single-channel mask prediction networks. In Interspeech (pp. 1981–1985).
Guariglia, E. (2019). Primality, fractality, and image analysis. Entropy, 21, 304.
Guariglia, E., & Silvestrov, S. (2016). Fractional-wavelet analysis of positive definite distributions and wavelets on D'(C). In Engineering Mathematics II (pp. 337–353). Springer.
Guido, R. C. (2018a). Fusing time, frequency and shape-related information: Introduction to the discrete shapelet transform's second generation (DST-II). Information Fusion, 41, 9–15.
Guido, R. C. (2018b). A tutorial review on entropy-based handcrafted feature extraction for information fusion. Information Fusion, 41, 161–175.
Heusdens, R., Zhang, G., Hendriks, R. C., Zeng, Y., & Kleijn, W. B. (2012). Distributed MVDR beamforming for (wireless) microphone networks using message passing. In Acoustic Signal Enhancement; Proceedings of IWAENC 2012; International Workshop on (pp. 1–4). VDE.
Heymann, J., Drude, L., & Haeb-Umbach, R. (2016). Neural network based spectral mask estimation for acoustic beamforming. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on (pp. 196–200). IEEE.
Higuchi, T., Ito, N., Yoshioka, T., & Nakatani, T. (2016). Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on (pp. 5210–5214). IEEE.
Higuchi, T., Kinoshita, K., Ito, N., Karita, S., & Nakatani, T. (2018). Frame-by-frame closed-form update for mask-based adaptive MVDR beamforming. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 531–535). IEEE.
Huang, P.-S., Kim, M., Hasegawa-Johnson, M., & Smaragdis, P. (2015). Joint optimization of masks and deep recurrent neural networks for monaural source separation. IEEE/ACM Trans. Audio, Speech, Lang. Process., 23, 2136–2147.
Jayaprakasam, S., Rahim, S. K. A., & Leow, C. Y. (2017). Distributed and collaborative beamforming in wireless sensor networks: Classifications, trends, and research directions. IEEE Communications Surveys & Tutorials, 19, 2092–2116.
Jiang, Y., Wang, D., Liu, R., & Feng, Z. (2014). Binaural classification for reverberant speech segregation using deep neural networks. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 22, 2112–2121.
Knapp, C., & Carter, G. (1976). The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24, 320–327.
Koutrouvelis, A. I., Sherson, T. W., Heusdens, R., & Hendriks, R. C. (2018). A low-cost robust distributed linearly constrained beamformer for wireless acoustic sensor networks with arbitrary topology. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 26, 1434–1448.
Lu, X., Tsao, Y., Matsuda, S., & Hori, C. (2013). Speech enhancement based on deep denoising autoencoder. In Interspeech (pp. 436–440).
Luo, Y., & Mesgarani, N. (2019). Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27, 1256–1266.
Mallat, S. G. (1989). A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11, 674–693.
Markovich-Golan, S., Gannot, S., & Cohen, I. (2012). Distributed multiple constraints generalized sidelobe canceler for fully connected wireless acoustic sensor networks. IEEE Transactions on Audio, Speech, and Language Processing, 21, 343–356.
Nakatani, T., Ito, N., Higuchi, T., Araki, S., & Kinoshita, K. (2017). Integrating DNN-based and spatial clustering-based mask estimation for robust MVDR beamforming. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on (pp. 286–290). IEEE.
Ng, A. Y., Jordan, M. I., & Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. In NIPS.
O'Connor, M., & Kleijn, W. B. (2014). Diffusion-based distributed MVDR beamformer. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on (pp. 810–814). IEEE.
O'Connor, M., Kleijn, W. B., & Abhayapala, T. (2016). Distributed sparse MVDR beamforming using the bi-alternating direction method of multipliers. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on (pp. 106–110). IEEE.
Pandey, A., & Wang, D. (2019). A new framework for CNN-based speech enhancement in the time domain. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27, 1179–1188.
Pariente, M., Cornell, S., Deleforge, A., & Vincent, E. (2020). Filterbank design for end-to-end speech separation. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6364–6368). IEEE.
Qi, J., Du, J., Siniscalchi, S. M., & Lee, C.-H. (2019). A theory on deep neural network based vector-to-vector regression with an illustration of its expressive power in speech enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27, 1932–1943.
Rix, A. W., Beerends, J. G., Hollier, M. P., & Hekstra, A. P. (2001). Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (pp. 749–752).
Sepúlveda, A., Guido, R. C., & Castellanos-Dominguez, G. (2013). Estimation of relevant time–frequency features using Kendall coefficient for articulator position inference. Speech Communication, 55, 99–110.
Taal, C. H., Hendriks, R. C., Heusdens, R., & Jensen, J. (2011). An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Trans. Audio, Speech, Lang. Process., 19, 2125–2136.
Taherian, H., Wang, Z.-Q., & Wang, D. (2019). Deep learning based multi-channel speaker recognition in noisy and reverberant environments. Proc. Interspeech 2019 (pp. 4070–4074).
Tan, K., Chen, J., & Wang, D. (2018). Gated residual networks with dilated convolutions for monaural speech enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27, 189–198.
Tan, K., & Wang, D. (2019). Learning complex spectral mapping with gated convolutional recurrent networks for monaural speech enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 380–390.
Tavakoli, V. M., Jensen, J. R., Christensen, M. G., & Benesty, J. (2016). A framework for speech enhancement with ad hoc microphone arrays. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 24, 1038–1051.
Tavakoli, V. M., Jensen, J. R., Heusdens, R., Benesty, J., & Christensen, M. G. (2017). Distributed max-SINR speech enhancement with ad hoc microphone arrays. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on (pp. 151–155). IEEE.
Tu, Y.-H., Du, J., Sun, L., & Lee, C.-H. (2017). LSTM-based iterative mask estimation and post-processing for multi-channel speech enhancement. In Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2017 (pp. 488–491). IEEE.
Vincent, E., Gribonval, R., & Févotte, C. (2006). Performance measurement in blind audio source separation. IEEE Trans. Audio, Speech, Lang. Process., 14, 1462–1469.
Wang, D., & Chen, J. (2018). Supervised speech separation based on deep learning: An overview. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
Wang, L., & Cavallaro, A. (2018). Pseudo-determined blind source separation for ad-hoc microphone networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26, 981–994.
Wang, L., Hon, T.-K., Reiss, J. D., & Cavallaro, A. (2015). Self-localization of ad-hoc arrays using time difference of arrivals. IEEE Transactions on Signal Processing, 64, 1018–1033.
Wang, P., Tan, K. et al. (2019). Bridging the gap between monaural speech enhancement and recognition with distortion-independent acoustic modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 39–48.
Wang, Y., Narayanan, A., & Wang, D. L. (2014). On training targets for supervised speech separation. IEEE/ACM Trans. Audio, Speech, Lang. Process., 22, 1849–1858.
Wang, Y., & Wang, D. L. (2013). Towards scaling up classification-based speech separation. IEEE Trans. Audio, Speech, Lang. Process., 21, 1381–1390.
Wang, Z.-Q., & Wang, D. (2018). All-neural multichannel speech enhancement. To appear in Interspeech.
Wang, Z.-Q., Zhang, X., & Wang, D. (2018). Robust speaker localization guided by deep learning-based time-frequency masking. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27, 178–188.
Weninger, F., Erdogan, H., Watanabe, S., Vincent, E., Le Roux, J., Hershey, J. R., & Schuller, B. (2015). Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In International Conference on Latent Variable Analysis and Signal Separation (pp. 91–99). Springer.
Williamson, D. S., Wang, Y., & Wang, D. L. (2016). Complex ratio masking for monaural speech separation. IEEE/ACM Trans. Audio, Speech, Lang. Process., 24, 483–492.
Xiao, X., Zhao, S., Jones, D. L., Chng, E. S., & Li, H. (2017). On time-frequency mask estimation for MVDR beamforming with application in robust speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on (pp. 3246–3250). IEEE.
Xu, Y., Du, J., Dai, L.-R., & Lee, C.-H. (2015). A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Trans. Audio, Speech, Lang. Process., 23, 7–19.
Yang, Z., & Zhang, X.-L. (2019). Boosting spatial information for deep learning based multichannel speaker-independent speech separation in reverberant environments. In 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) (pp. 1506–1510). IEEE.
Zeng, Y., & Hendriks, R. C. (2014). Distributed delay and sum beamformer for speech enhancement via randomized gossip. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22, 260–273.
Zhang, J., Chepuri, S. P., Hendriks, R. C., & Heusdens, R. (2018). Microphone subset selection for MVDR beamformer based noise reduction. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26, 550–563.
Zhang, X., Wang, Z.-Q., & Wang, D. (2017). A speech enhancement algorithm by iterating single- and multi-microphone processing and its application to robust ASR. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on (pp. 276–280). IEEE.
Zhang, X.-L. (2018). Deep ad-hoc beamforming. arXiv preprint arXiv:1811.01233.
Zhang, X.-L., & Wang, D. (2016a). Boosting contextual information for deep neural network based voice activity detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24, 252–264.
Zhang, X.-L., & Wang, D. (2016b). A deep ensemble learning method for monaural speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24, 967–977.
Zhang, X.-L., & Wu, J. (2013a). Deep belief networks based voice activity detection. IEEE Trans. Audio, Speech, Lang. Process., 21, 697–710.
Zhang, X.-L., & Wu, J. (2013b). Denoising deep neural networks based voice activity detection. In the 38th IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 853–857).
Zheng, N., & Zhang, X.-L. (2018). Phase-aware speech enhancement based on deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27, 63–76.
Zheng, X., Tang, Y. Y., & Zhou, J. (2019). A framework of adaptive multiscale wavelet decomposition for signals on undirected graphs. IEEE Transactions on Signal Processing, 67, 1696–1711.
Zhou, Y., & Qian, Y. (2018). Robust mask estimation by integrating neural network-based and clustering-based approaches for adaptive acoustic beamforming. In Int. Conf. on Acoustics, Speech, and Signal Processing, in press.
