Improving Reverberant Speech Training Using Diffuse Acoustic Simulation

Zhenyu Tang*, Lianwu Chen†, Bo Wu†, Dong Yu†, Dinesh Manocha*
*University of Maryland, †Tencent AI Lab
{zhy, dm}@cs.umd.edu, {lianwuchen, lambowu, dyu}@tencent.com

ABSTRACT

We present an efficient and realistic geometric acoustic simulation approach for generating and augmenting training data in speech-related machine learning tasks. Our physically-based acoustic simulation method is capable of modeling occlusion, specular and diffuse reflections of sound in complicated acoustic environments, whereas the classical image method can only model specular reflections in simple room settings. We show that by using our synthetic training data, the same neural networks gain significant performance improvement on real test sets in far-field speech recognition by 1.58% and keyword spotting by 21%, without fine-tuning using real impulse responses.

Index Terms: reverberation, diffuse reflection, speech recognition, data augmentation, acoustic simulation

This work is supported in part by ARO grant W911NF-18-1-0313, NSF grant #1910940, Tencent, Adobe, Facebook and Intel. The authors thank Jie Chen and Dan Su from Tencent for their help with the ASR and KWS systems. Project website: https://gamma.umd.edu/pro/speech/asr

1. INTRODUCTION

Over the past few years, deep learning approaches have gained significant ground in the speech community, surpassing the performance of many classical machine learning models in a variety of related subfields. State-of-the-art deep neural networks (DNNs) are powerful tools for exploiting variable-length contextual information embedded in noisy speech sequences. Well-known applications of DNN techniques in speech include Microsoft Cortana, Apple Siri, Google Now, and Amazon Alexa. These applications usually integrate several fundamental speech tasks such as speech enhancement and separation [1, 2], automated speech recognition (ASR) [3, 4, 5, 6], and keyword spotting (KWS) [7, 8]. Another important factor behind the success of DNNs in these tasks is the huge amount of annotated speech corpora made available by research groups and large companies. Deep learning theory indicates that having more training examples is crucial to reducing the generalization error of trained models in real test cases [9]. However, the majority of popular speech corpora were recorded under relatively ideal conditions, i.e., anechoic speech with negligible noise and environmental reverberation. When training models for real-world applications, it is therefore common to distort the clean speech by adding noise and reverberation as a pre-processing step that augments the training data [10, 11]. Reverberation is a characteristic effect of a particular acoustic environment and can be described by impulse responses (IRs) or frequency responses. In practice, both recorded IRs and synthetic IRs have been convolved with clean speech for this purpose, and significant improvements in model accuracy have been observed due to this type of data augmentation.
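To make this augmentation step concrete, the following is a minimal sketch of reverberant data generation: convolve a clean utterance with a (recorded or simulated) RIR and mix in noise at a target SNR. The function name and the truncation/padding choices are ours for illustration, not the authors' exact pipeline.

```python
import numpy as np
from scipy.signal import fftconvolve

def augment_utterance(clean, rir, noise, snr_db):
    """Reverberate clean speech with an RIR and add noise at a target SNR.

    `clean`, `rir`, and `noise` are 1-D float arrays at a common sample
    rate; `noise` must be at least as long as `clean`. Illustrative only.
    """
    # Reverberant speech: linear convolution with the room impulse response,
    # truncated to the utterance length.
    reverberant = fftconvolve(clean, rir)[: len(clean)]

    # Scale the noise so that 10*log10(P_signal / P_noise) == snr_db.
    sig_power = np.mean(reverberant ** 2)
    noise = noise[: len(reverberant)]
    noise_power = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(sig_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return reverberant + gain * noise

# The paper's training sets draw SNRs from 0-24 dB (Sec. 3.2.1), e.g.:
# noisy = augment_utterance(clean, rir, noise, snr_db=np.random.uniform(0, 24))
```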
However, a performance gap remains when the application is deployed in conditions that do not match the training conditions: IRs pre-recorded in a limited number of environments may not generalize well to the essentially unbounded variety of real-world conditions. Moreover, it is inefficient to shrink this gap by collecting more real-world IRs, because recording IRs is not a trivial task; it requires professional equipment and trained personnel.

An alternative and cost-effective way is to simulate room impulse responses (RIRs) using acoustic simulators. A simple RIR simulator takes in the room geometry, the source and listener positions, and the surface absorption/reflection properties, and generates an RIR for each source-listener pair. One classical approach is the image method (IM), which models specular sound reflections in rectangular rooms and has been proven to work well in some tasks. However, one notable drawback of this method is its over-simplification of room acoustics: it ignores diffuse reflections, which are very common in real-world environments, and it does not handle occlusion. These limitations make the image method less realistic for augmenting data, especially in applications where late reverberation plays a significant role.

Main contribution: To overcome the limitations of existing simulation methods and better augment the training data, we propose an efficient and realistic geometric acoustic simulation approach that models occlusion, specular, and diffuse reflections, where sound energy can be reflected randomly rather than following an ideal specular path. We sample 5000 different acoustic room configurations and use our method to simulate far-field sound propagation in each room. The speech training data is generated by randomly convolving over 1500 hours of clean speech utterances with simulated RIRs and adding environmental noise. We train two different models independently, based on 1D/2D convolution + long short-term memory (LSTM) structures, for an ASR task and a KWS task, and then evaluate them on real-world data. Using our method, we observe an absolute accuracy improvement of 1.58% on ASR and a 21% relative error reduction on KWS.

The rest of the paper is organized as follows. In Section 2 we explain our ray-tracing based geometric acoustic simulation algorithm. We describe several speech training benchmarks in Section 3 and present our results in Section 4.

2. ACOUSTIC SIMULATION

2.1. Impulse Response Modeling

Acoustic simulation engines have been used in computer-aided design (CAD), theoretical research, the game industry, and many other fields. The simulation goal is usually to observe how the sound pressure at one position changes over time when there is a sound source at some other position in space. The IR is the most common way to describe sound propagation between two points in a fixed environment, so we use IR(x_s, x_r, t) to denote the IR at time t from a point source at location x_s to a listener at location x_r. In practice, an IR can be measured by exciting an impulse (e.g., using a gunshot as the sound source) and recording the sound pressure at the target receiver location. From a first-principles view, the propagation of sound waves follows the acoustic wave equation [12], which describes the sound pressure variation in both the spatial and temporal domains and is the foundation of wave-based solvers. There are several ways to implement wave-based solvers, including Finite Element Methods (FEM), Boundary Element Methods (BEM), finite-difference time-domain (FDTD) approaches [13], and Adaptive Rectangular Decomposition (ARD) methods [14]. Wave-based techniques yield the most accurate results, but they are only feasible for low frequencies and small scenes because they do not scale well with space and time granularity.
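For context, the lossless acoustic wave equation that these wave-based solvers discretize can be written in its standard textbook form (stated here for the reader; not quoted from [12]):

\[ \frac{\partial^2 p(\mathbf{x}, t)}{\partial t^2} = c^2 \, \nabla^2 p(\mathbf{x}, t), \]

where p(x, t) is the sound pressure at position x and time t and c is the speed of sound. FDTD methods, for instance, approximate both derivatives by finite differences on a space-time grid, which is why their cost grows rapidly with frequency and scene size.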
When the wavelength of the sound is smaller than the size of the obstacles in the environment, the sound wave can be treated as a ray, which is the key idea of geometric methods. Typical geometric methods include the image method [15], path tracing methods [16, 17, 18, 19], and beam or frustum tracing methods [20, 21]. Our method is based on efficient Monte Carlo path tracing [22].

Fig. 1. Two types of reflections of sound at a surface: (a) specular reflections and (b) diffuse reflections. Both phenomena are frequency dependent.

2.2. Sound Propagation

From the perspective of geometric methods, two types of reflections can occur at a rigid surface: specular reflections and diffuse reflections. Specular reflections occur at mostly flat and uniform surfaces; the angle of the outgoing sound ray equals the angle of incidence, as in Fig. 1(a), following the law of reflection from geometric optics. However, real-world surfaces usually do not completely satisfy the specular condition and instead scatter sound energy in all directions according to Lambert's cosine law; these are called diffuse reflections, as illustrated in Fig. 1(b). IRs are constructed by accumulating sound energy from both specular and diffuse reflection paths with the correct time delay and energy decay, which can be calculated from the total length of each path. Conventionally, an IR is decomposed into three parts: the direct response, early reflections, and late reverberation. The direct response is determined by the visibility between the source and the listener. Early reflections are mostly due to specular reflections, whereas the late reverberation is caused by diffuse reflections. A typical IR energy distribution is shown in Fig. 2.

Fig. 2. Energy distribution of an impulse response over time. Our goal is to accurately model the late reverberation effects in simulated IRs.

2.3. Image Method

The image method is currently the most widely used method in the speech community for generating RIRs in various learning-based tasks [23]. It is based on the principle of specular reflections: all reflection paths can be constructed by mirroring sound sources with respect to the reflecting planes, as shown in Fig. 3. A source is mirrored multiple times, depending on the desired order of reflections. Because it models only specular paths, the image method fails to capture the late reverberation part of an IR. Computationally, for a scene with one source, N reflective surfaces, and reflection order d, the time complexity is O(N^d), which is prohibitive for simulations at high orders or scene complexities.

Fig. 3. Construction and validation of image paths. The source S is mirrored into 5 image sources, marked S1-S5, by 5 planes. A sound path is connected to the listener L from each image source. A path validation is then performed by checking whether the image path intersects the plane that generated its image source. The path from S1 to L does not intersect plane 1; it is therefore infeasible and rejected. The other 4 image paths are valid and can be used to compute the IR analytically.
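As a toy illustration of this mirroring principle (not the full, validated image method of [15], which handles arbitrary reflection orders and wall absorption), the sketch below computes the direct-path and first-order image-source delays in an empty shoebox room; all names and defaults are our assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s at room temperature

def first_order_image_delays(room, src, lis, fs=16000):
    """Sample delays of the direct path plus the 6 first-order image sources
    in an empty shoebox room (length, width, height in meters)."""
    src, lis = np.asarray(src, float), np.asarray(lis, float)
    images = [src]  # the source itself gives the direct path
    for axis in range(3):            # mirror the source across each wall pair
        for wall in (0.0, room[axis]):
            img = src.copy()
            img[axis] = 2.0 * wall - src[axis]
            images.append(img)
    # Path length -> propagation time -> delay in samples.
    return [int(round(np.linalg.norm(img - lis) / SPEED_OF_SOUND * fs))
            for img in images]

# Example: a 4 m x 5 m x 3 m room.
# delays = first_order_image_delays((4.0, 5.0, 3.0), (1, 2, 1.5), (3, 3, 1.5))
```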
2.4. Diffuse Acoustic Simulation

Diffuse reflections occur when sound energy is scattered into non-specular directions. They are widely observed in the real world and have been shown to be important for modeling sound fields in room environments [24, 25, 26]. Diffuse acoustic simulation correctly models not only the specular but also the diffuse sound field, and we propose our geometric acoustic simulation (GAS) method for this purpose. In contrast to the image method, our method is based on stochastic path tracing, illustrated in Fig. 4: sound paths are traced in random directions, and each path undergoes either specular or diffuse reflections. We explicitly define the scattering coefficient s between 0 and 1, which denotes the proportion of sound energy that is diffusely reflected at a surface (0 means perfectly specular and 1 means perfectly diffuse). Specifically, the sound energy L_r reflected at a surface point x into direction ω_r is computed by integrating the incoming energy over a hemisphere Ω centered at x on the surface:

\[ L_r(\mathbf{x}, \vec{\omega}_r) = \int_{\Omega} f_r(\mathbf{x}, \vec{\omega}_i \rightarrow \vec{\omega}_r)\, L_i(\mathbf{x}, \vec{\omega}_i) \cos\theta_i \, d\omega_i, \tag{1} \]

where θ_i is the incident angle, ω_i is the incoming direction, and f_r(x, ω_i → ω_r) is the probability distribution function describing the probability of generating the sound path from ω_i to ω_r; it is generic enough to cover both specular and diffuse reflections. In practice, Eq. (1) is recursive and can only be solved numerically using Monte Carlo integration. The diffuse reflection paths are generated by tracing random rays from the source, the listener, or both [27], and a large number of ray samples is required for convergence. The complexity of Monte Carlo path tracing is O(M log N), where M is the total number of rays traced to solve Eq. (1) and N is the number of surfaces in the scene. One of its computational advantages over the image method is that most of the invalid paths that the image method generates, verifies, and rejects are never considered in path tracing, so the number of surfaces does not greatly impact its efficiency. This allows us to compute both early reflections and late reverberation efficiently.

In a far-field speech simulation setting, we define an acoustic room by its length, width, and height. Acoustic absorption and scattering coefficients can be defined for each surface element (triangular mesh), which determine the relative strength of diffuse reflection. After specifying the sound source and receiver locations within the room, our simulation generates an RIR; detailed configurations are given in Section 3.1. One speech-related problem that has already benefited from more accurate simulation is direction-of-arrival estimation [28]. We argue that using a more accurate geometric acoustic simulation that faithfully models the late reverberation will lead to better performance of learning-based models in general speech-related training.

Fig. 4. Monte Carlo path tracing for solving the sound transport problem. Ray samples are generated in random directions from the source S. Reflections upon hitting a plane are simulated by generating subsequent random rays while conserving the total energy. Once a ray intersects the listener L, its energy is accumulated into the IR.
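To convey the stochastic energy-accumulation idea of Fig. 4 in a few lines, here is a deliberately crude Monte Carlo sketch of a diffuse late-reverberation energy decay. It ignores room geometry entirely (rays take exponentially distributed free paths with a fixed mean and lose a constant energy fraction per surface hit), so it is only a stand-in for the full path tracer described above; every parameter is an assumption.

```python
import numpy as np

def diffuse_energy_decay(mean_free_path, absorption, n_rays=20000,
                         fs=16000, t_max=0.5, c=343.0, seed=0):
    """Toy Monte Carlo estimate of the late (diffuse) energy decay of an IR.

    Each ray travels an exponentially distributed free path between surface
    hits and keeps a (1 - absorption) fraction of its energy per hit; this
    drastically simplifies Eq. (1) and serves only to illustrate how ray
    energies are accumulated into the IR's time bins.
    """
    rng = np.random.default_rng(seed)
    hist = np.zeros(int(t_max * fs))  # energy histogram, one bin per sample
    for _ in range(n_rays):
        t, energy = 0.0, 1.0 / n_rays
        while t < t_max and energy > 1e-9:
            t += rng.exponential(mean_free_path) / c  # time to next surface hit
            energy *= 1.0 - absorption                # energy lost at the hit
            idx = int(t * fs)
            if idx < hist.size:
                hist[idx] += energy                   # accumulate into the IR bins
    return hist

# e.g. decay = diffuse_energy_decay(mean_free_path=3.0, absorption=0.3)
```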
3. TRAINING WITH ACOUSTIC SIMULATION

To evaluate the proposed approach, we conduct far-field automated speech recognition and keyword spotting experiments and compare our approach with the popular image method. Both experiments are reverberant speech training tasks in which the test set always consists of real-world noisy reverberant speech recordings, while the training set may consist of clean speech or of synthetic reverberant speech generated by either the image method or our geometric acoustic simulation.

3.1. Impulse Response Generation

We consider a 6-microphone circular array of 7 cm diameter, with the speakers and the microphone array randomly located in the room at least 0.3 m away from the walls. Both the image method and the geometric sound simulation method were employed to simulate impulse responses for 5000 randomly generated room configurations, with sizes (length-width-height) ranging from 3m-3m-2.5m to 8m-10m-6m. The distance between the speaker and the microphones ranges from 0.5 m to 6 m, and the reverberation time T60 is sampled from the range 0.05 s to 0.5 s. This yields two IR sets of 5000 IRs each, one generated with the image method and one with the geometric sound simulation method. These IRs were used for data augmentation in the ASR and KWS tasks.
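A hedged sketch of this configuration sampling follows. The paper specifies only the ranges above; the uniform distributions and the rejection loop for the source-array distance are our assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_room_config():
    """Draw one random far-field room configuration matching Sec. 3.1's ranges."""
    size = rng.uniform([3.0, 3.0, 2.5], [8.0, 10.0, 6.0])  # L, W, H in meters
    t60 = rng.uniform(0.05, 0.5)                           # reverberation time (s)
    margin = 0.3                                           # keep 0.3 m from walls
    src = rng.uniform(margin, size - margin)               # speaker position
    while True:                                            # resample array center
        mic = rng.uniform(margin, size - margin)           # until distance fits
        if 0.5 <= np.linalg.norm(src - mic) <= 6.0:
            return size, t60, src, mic

configs = [sample_room_config() for _ in range(5000)]      # 5000 rooms, as in the text
```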
3.2. Automated Speech Recognition

3.2.1. Data

The training corpus consists of two sets: (i) a clean corpus of 1.5 million clean speech utterances, about 1500 hours in total, and (ii) a noisy far-field training set simulated from the clean corpus by adding reverberation and mixing in various environmental noises at SNRs ranging from 0 to 24 dB. For each IR generation method, the corresponding noisy far-field training set was generated using the IRs described in Section 3.1, and the first channel of the simulated data was used as the input to the ASR system. The clean speech was first used to train the acoustic model, and then both the clean speech and the simulated noisy speech were used to fine-tune the model. Depending on which of the two IR simulation methods was used to generate the noisy training set, we obtained two acoustic models, one for the image method and one for the GAS method; the dataset sizes for the clean, image method, and GAS conditions are identical.

The testing corpus contains 2000 utterances of real far-field recordings from 48 speakers; each utterance is 5 seconds on average, and the whole set is about 3 hours. The data was recorded in 5 different rooms with sizes of about 4m-4m-3m. The distances between the microphones and the speaker are randomly set to 0.5 m, 1 m, 3 m, and 5 m, and the SNR ranges from 5 to 20 dB with air-conditioning or fan background noise.

Fig. 5. The framework of our ASR system used for evaluations.

3.2.2. Model Configuration

The framework of the ASR system is shown in Fig. 5 and consists of feature extraction, an acoustic model [29], and a decoder. 40-dimensional Mel filter bank features were computed with a 25-ms window length and a 10-ms hop size, and combined with their first- and second-order differences to form a 120-dimensional vector. After normalization, the feature vector of the current frame is concatenated with those of the 5 preceding and 5 subsequent frames, resulting in an input vector of dimension 1320 = 120 × (5 + 1 + 5), as sketched after this section. The acoustic model contains two 2-dimensional convolutional layers, each with a kernel size of (3, 3) and a stride of (1, 1), followed by a max-pooling layer with a kernel size of (2, 2) and a stride of (2, 2), then five LSTM layers, each with 1024 hidden units and peepholes, and finally one fully-connected layer plus a softmax layer. Batch normalization is applied after each CNN and LSTM layer to accelerate convergence and improve model generalization. We use context-dependent (CD) phonemes as the output units, which form 12000 classes in our Chinese ASR system. The Adam optimizer was adopted with an initial learning rate of 0.0001. A 5-gram language model of size 190 GB was used; the vocabulary size was 280K, and the language-model training corpus was collected from news, blogs, messages, encyclopedias, etc.
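The frame-splicing step just described can be sketched as follows; the edge-padding behavior is our assumption, since the paper does not specify how utterance boundaries are handled.

```python
import numpy as np

def splice_frames(features, left=5, right=5):
    """Concatenate each frame with `left` preceding and `right` following
    frames (edges padded by repetition): 120-dim features with 5+1+5
    context give 1320-dim inputs, as in Sec. 3.2.2."""
    n, d = features.shape
    padded = np.pad(features, ((left, right), (0, 0)), mode="edge")
    return np.stack([padded[i:i + n] for i in range(left + right + 1)],
                    axis=1).reshape(n, (left + right + 1) * d)

# feats = np.random.randn(100, 120)   # 100 frames of fbank + deltas
# spliced = splice_frames(feats)      # shape (100, 1320)
```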
3.3. Keyword Spotting

3.3.1. Data

The original training corpus contains 2500 hours of clean speech data, including 1250 hours of the target keyword "Hi, Liu Bei" and 1250 hours of negative speech samples. The corresponding multi-channel reverberant data was simulated using each IR generation method, and noises with SNRs ranging from 0 to 24 dB were added to the augmented speech. The 2500 hours of simulated reverberant data are used for model training. The test corpus contains 8000 utterances with the target keyword, randomly selected from real user data recorded by smart speakers in a typical living-room scenario, as well as 33 hours of negative samples from different categories, including music, TV noise, chatter, and other indoor noises. The 6-channel microphone signals were processed by an MVDR beamformer [30], and the enhanced mono-channel output was used for keyword spotting.

Fig. 6. The framework of our KWS system used for evaluations.

3.3.2. Model Configuration

The framework of the keyword spotting system, which is similar to [7], is shown in Fig. 6; it comprises feature extraction, a classification model, and a posterior handling module. The 40-dimensional Mel filter bank features were computed with a 25-ms window length and a 10-ms hop size, and then combined with their first- and second-order differences to form a 120-dimensional frame feature. The current frame feature was concatenated with the 10 preceding and 5 subsequent frames, resulting in an input vector of dimension 1920 = 40 × 3 × (10 + 1 + 5). The classification model contains one 1D CNN layer [31] with a kernel size of 4, followed by a max-pooling layer with a kernel size of 3. The output of the CNN is passed to two LSTM layers (256 hidden units each) and then to a softmax layer with 4 output classes (3 words + 1 garbage). Cross entropy is used as the loss. The outputs are passed through a posterior handling module to obtain decisions: the final keyword score is defined as the largest product of the smoothed posteriors within an input sliding window, subject to the constraint that the individual words fire in the same order as specified in the keyword.

4. RESULTS AND ANALYSIS

Table 1 shows the character accuracy of ASR systems achieved with the clean acoustic model (Clean), the noisy acoustic model based on the image method (Noisy IM), and the noisy acoustic model based on the geometric sound simulation method (Noisy GAS). We collected 2K real-world test utterances corrupted by reverberation and noise to evaluate the IR methods. Compared with the "Clean" setup, the "Noisy IM" setup improved system performance significantly by adding simulated noisy training data. Our proposed approach outperformed the image method, increasing the accuracy from 59.96% to 61.54%, illustrating the superiority of the proposed realistic geometric sound simulation approach.

Table 1. Character accuracy of ASR systems. Our GAS method has the highest accuracy and outperforms IM by 1.58%.

    Model       Character accuracy (%)
    Clean       31.178
    Noisy IM    59.961
    Noisy GAS   61.540

Table 2. Equal error rates of KWS systems. Our GAS method has the lowest equal error rate and yields a 21% relative error reduction over IM.

    Model       EER (%)
    Noisy IM    1.48
    Noisy GAS   1.17

The equal error rates (EERs) of the keyword spotting systems are shown in Table 2. We achieve an EER of 1.17% when the augmented training data was generated with the geometric sound simulation method, versus 1.48% with the image method, a 21% relative EER reduction. In these experiments, the input to the keyword spotting system is the enhanced speech from an MVDR beamformer, which indicates that the proposed IRs are robust to multichannel signal processing algorithms. In both experiments, we carefully controlled the training and evaluation conditions so that the only difference is the RIR simulation method. Because our simulation faithfully models diffuse sound reflections, the domain gap between synthetic training data and real data is reduced, and we therefore observe significant accuracy gains.

5. DISCUSSION AND FUTURE WORK

In this paper, we described a geometric acoustic simulation method that simulates both the specular and the diffuse sound fields for reverberant speech training. On speech recognition and keyword spotting tasks, we showed that the proposed approach outperforms the popular image method, with the gain mostly attributable to the more realistic simulation of reverberation and diffuse reflections. One limitation of this work is that neither method can model low-frequency wave effects or diffraction phenomena; a partial solution would be to compensate the RIRs in low-frequency bands [32]. Although we demonstrated the efficacy of the proposed approach mainly on speech recognition and keyword spotting, we believe similar performance improvements can be achieved on tasks such as source localization [33], speech separation, and the cocktail party problem [1, 2], all of which can benefit from data-driven techniques; these are future research directions. The proposed approach is thus of wide interest, especially because it can significantly reduce the effort of collecting training data under real-usage scenarios.

6. REFERENCES

[1] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 31-35.
[2] D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, "Permutation invariant training of deep models for speaker-independent multi-talker speech separation," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 241-245.
[3] F. Seide, G. Li, and D. Yu, "Conversational speech transcription using context-dependent deep neural networks," in Twelfth Annual Conference of the International Speech Communication Association, 2011.
[4] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30-42, 2011.
[5] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig, "The Microsoft 2016 conversational speech recognition system," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2017.
[6] D. Yu and J. Li, "Recent progresses in deep learning based acoustic models," IEEE/CAA Journal of Automatica Sinica, vol. 4, no. 3, pp. 396-409, 2017.
[7] G. Chen, C. Parada, and G. Heigold, "Small-footprint keyword spotting using deep neural networks," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 4087-4091.
[8] R. Prabhavalkar, R. Alvarez, C. Parada, P. Nakkiran, and T. N. Sainath, "Automatic gain control and multi-style training for robust small-footprint keyword spotting with deep neural networks," 2015.
[9] M. L. Seltzer, D. Yu, and Y. Wang, "An investigation of deep neural networks for noise robust speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.
[10] C. Kim, A. Misra, K. Chin, T. Hughes, A. Narayanan, T. N. Sainath, and M. Bacchiani, "Generation of large-scale simulated utterances in virtual rooms to train deep-neural networks for far-field speech recognition in Google Home," in Interspeech, 2017.
[11] M. Doulaty, R. Rose, and O. Siohan, "Automatic optimization of data perturbation distributions for multi-style training in speech recognition," in Spoken Language Technology Workshop, 2017.
[12] R. P. Feynman, R. B. Leighton, and M. Sands, The Feynman Lectures on Physics, Vol. I: The New Millennium Edition: Mainly Mechanics, Radiation, and Heat. Basic Books, 2011.
[13] S. Sakamoto, A. Ushiyama, and H. Nagatomo, "Numerical analysis of sound propagation in rooms using the finite difference time domain method," The Journal of the Acoustical Society of America, vol. 120, no. 5, pp. 3008-3008, 2006.
[14] N. Raghuvanshi, R. Narain, and M. C. Lin, "Efficient and accurate sound propagation using adaptive rectangular decomposition," IEEE Transactions on Visualization and Computer Graphics, vol. 15, no. 5, pp. 789-801, 2009.
[15] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943-950, 1979.
[16] M. T. Taylor, A. Chandak, L. Antani, and D. Manocha, "RESound: Interactive sound rendering for dynamic virtual environments," in Proceedings of the 17th ACM International Conference on Multimedia. ACM, 2009, pp. 271-280.
[17] M. Taylor, A. Chandak, Q. Mo, C. Lauterbach, C. Schissler, and D. Manocha, "Guided multiview ray tracing for fast auralization," IEEE Transactions on Visualization and Computer Graphics, vol. 18, no. 11, pp. 1797-1810, 2012.
[18] C. Schissler and D. Manocha, "Interactive sound propagation and rendering for large multi-source scenes," ACM Transactions on Graphics (TOG), vol. 36, no. 1, p. 2, 2016.
[19] C. Schissler and D. Manocha, "Interactive sound rendering on mobile devices using ray-parameterized reverberation filters," arXiv preprint arXiv:1803.00430, 2018.
[20] T. Funkhouser, I. Carlbom, G. Elko, G. Pingali, M. Sondhi, and J. West, "A beam tracing approach to acoustic modeling for interactive virtual environments," in Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques. ACM, 1998, pp. 21-32.
[21] A. Chandak, C. Lauterbach, M. Taylor, Z. Ren, and D. Manocha, "AD-Frustum: Adaptive frustum tracing for interactive sound propagation," IEEE Transactions on Visualization and Computer Graphics, vol. 14, no. 6, pp. 1707-1722, 2008.
[22] J. T. Kajiya, "The rendering equation," in ACM SIGGRAPH Computer Graphics, vol. 20, no. 4. ACM, 1986, pp. 143-150.
[23] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 5220-5224.
[24] M. Hodgson, "Evidence of diffuse surface reflections in rooms," The Journal of the Acoustical Society of America, vol. 89, no. 2, pp. 765-771, 1991.
[25] B.-I. Dalenbäck, M. Kleiner, and P. Svensson, "A macroscopic view of diffuse reflection," Journal of the Audio Engineering Society, vol. 42, no. 10, pp. 793-807, 1994.
[26] Z. Tang, N. J. Bryan, D. Li, T. R. Langlois, and D. Manocha, "Scene-aware audio rendering via deep acoustic analysis," arXiv preprint arXiv:1911.06245, 2019.
[27] C. Cao, Z. Ren, C. Schissler, D. Manocha, and K. Zhou, "Interactive sound propagation with bidirectional path tracing," ACM Transactions on Graphics (TOG), vol. 35, no. 6, p. 180, 2016.
[28] Z. Tang, J. Kanu, K. Hogan, and D. Manocha, "Regression and classification for direction-of-arrival estimation with convolutional recurrent neural networks," in Interspeech, 2019.
[29] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, "Convolutional, long short-term memory, fully connected deep neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2015.
[30] O. Hoshuyama and A. Sugiyama, "Robust adaptive beamforming," IEEE Transactions on Acoustics, Speech & Signal Processing, vol. 35, no. 10, pp. 1365-1376, 2008.
[31] O. Abdel-Hamid, A. R. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, "Convolutional neural networks for speech recognition," IEEE/ACM Transactions on Audio, Speech & Language Processing, vol. 22, no. 10, pp. 1533-1545, 2014.
[32] Z. Tang, H.-Y. Meng, and D. Manocha, "Low-frequency compensated synthetic impulse responses for improved far-field speech recognition," arXiv preprint arXiv:1910.10815, 2019.
[33] R. Takeda and K. Komatani, "Sound source localization based on deep neural networks with directional activate function exploiting phase information," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 405-409.
