SMRU: Split-and-Merge Recurrent-based UNet for Acoustic Echo Cancellation and Noise Suppression
Zhihang Sun 1,2, Andong Li 1, Rilin Chen 1, Hao Zhang 1, Meng Yu 1, Yi Zhou 2, Dong Yu 1
1 Tencent AI Lab
2 School of Communications and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing, China

ABSTRACT

The proliferation of deep neural networks has spawned the rapid development of acoustic echo cancellation and noise suppression, and plenty of prior arts have been proposed, which yield promising performance. Nevertheless, they rarely consider deployment generality across different processing scenarios, such as edge devices and cloud processing. To this end, this paper proposes a general model, termed SMRU, to cover different application scenarios. The novelty is two-fold. First, a multi-scale band split layer and a band merge layer are proposed to effectively fuse local frequency bands for lower-complexity modeling. Second, by simulating the multi-resolution feature modeling characteristic of the classical UNet structure, a novel recurrent-dominated UNet is devised. It consists of multiple variable frame rate blocks, each of which involves a causal time down-/up-sampling layer with varying compression ratios and a dual-path structure for inter- and intra-band modeling. The model is configured from 50 M/s to 6.8 G/s in terms of MACs, and the experimental results show that the proposed approach yields competitive or even better performance over existing baselines, and has the full potential to adapt to more general scenarios with varying complexity requirements.

Index Terms — Acoustic echo cancellation, noise suppression

1. INTRODUCTION

Annoying acoustic echo and environmental noise are ubiquitous in real-time communication (RTC) systems, leading to hurdles in intelligibility and overall low audio quality.
Digital signal processing (DSP)-based methods for linear acoustic echo cancellation (AEC) were widely adopted in RTC scenarios [1, 2]. Nonetheless, these classical approaches cannot effectively cancel acoustic echo and struggle to maintain high speech quality in relatively low signal-to-noise ratios (SNRs) and double-talk scenarios. Recent years have witnessed the proliferation of deep neural networks (DNNs), and dozens of DNN-based AEC algorithms have been proposed. They can be roughly categorized into two classes, namely hybrid [3, 4, 5, 6] and fully neural network-based systems [7, 8, 9], depending on whether the linear AEC is involved as a prior for later DNN processing.

Fig. 1. Overview diagram of the proposed hybrid AEC system.

Regarding the neural network topology in the AEC task, an intuitive tactic is to transfer network structures from other front-end tasks like speech enhancement [10, 11, 12, 13]. For example, in [5], a classical UNet-style structure was utilized, in which a convolution-based encoder and decoder are adopted for feature extraction and target spectrum recovery, and stacked LSTM layers serve as the bottleneck for temporal and frequency modeling. In [6], a dual-path transformer structure was devised to effectively grasp global relations. Despite the promising performance these works have achieved, their computational complexity is usually prohibitive and they can be quite difficult to deploy on edge devices. Besides, some operators like self-attention may require large time buffers, which can bring laborious optimization costs. Considering that, an important question arises: how to devise an AEC network that encompasses different complexity and is also felicitous to adapt to real-time scenarios.
Note that most previous works require the number of frames to be unaltered in the forward process to follow the causality principle, and time down-sampling/up-sampling (DS/US) operations are often not allowed, which tends to result in high computational complexity. More recently, a causality-guaranteed DS/US strategy was proposed via future frame prediction [14, 15], leading to notably decreased complexity with only mild performance degradation. Besides, in [16], a band-split strategy was proposed to manually merge neighboring frequency sub-bands, which can effectively decrease the cost aroused by modeling in a large frequency dimension.

Fig. 2. Architecture of the proposed SMRU. Different modules are indicated with different colors for better illustration. (a) Overall diagram of the proposed SMRU. (b) Detailed structure of the multi-scale band split layer. (c) Detailed structure of the band merge layer. (d) Detailed structure of the variable frame rate block. (e) Detailed structure of the inter-band MLP.

Therefore, we believe it is significant to flexibly modulate the time and frequency dimensions while sustaining the causality characteristic for a real-time AEC framework. In this regard, we propose the Split-and-Merge Recurrent-based UNet, dubbed SMRU, which is, to the best of our knowledge, the first UNet-style recurrent-dominated framework for AEC and noise suppression. The whole framework adopts the UNet topology, in which the encoder part follows the fine-to-coarse principle and vice versa for the decoder, and skip connections are utilized for feature recalibration. Different from preliminary convolution-based UNet works, here "coarse/fine" refers to different temporal resolutions instead of frequency size, i.e., multi-level causal time DS/US operations are utilized in the encoder and decoder, respectively, enabling multi-scale temporal modeling and also notably reducing the computational complexity.
Within each module, a dual-path structure is devised, where an RNN excavates the temporal relations and an MLP-based band shuffler is adopted for global band modeling [17]. Benefiting from the multi-scale time compression method, the proposed model enjoys better flexibility in computational complexity control. In this paper, the proposed model can cover from 50 M/s to 6.8 G/s in terms of MACs, and both quantitative and qualitative results manifest the performance superiority of the proposed method.

The rest of the paper is organized as follows. Sec. 2 presents the proposed framework. Sec. 3 and Sec. 4 give the experimental setups and results, respectively. Conclusions are drawn in Sec. 5.

2. PROPOSED METHOD

The overview diagram of the proposed hybrid AEC system is shown in Figure 1. The input consists of the received microphone mixture signal $d(n)$, the reference far-end signal $x(n)$, the error signal $e(n)$, and the linear echo $y(n)$ generated by the LAEC algorithm [18], where $n$ denotes the time sample index. All signals are converted to the time-frequency (T-F) domain and fed into the SMRU. The framework of the proposed SMRU is shown in Figure 2. The real and imaginary parts of the four input spectra are concatenated along the channel axis to yield the input feature $I \in \mathbb{R}^{8 \times T \times F}$, where $T$ and $F$ denote the number of frames and frequency bins, respectively. The input first passes a 2D convolution layer to generate an initial feature map $R \in \mathbb{R}^{E \times T \times F}$, where $E$ denotes the embedding dimension. Similar to [16], the feature map is split into sub-bands by the band split layer (see Figure 2(b)) to compress the frequency dimension. The compressed feature is then fed into the proposed recurrent UNet for multi-scale modeling. After that, the band merge layer (see Figure 2(c)) is adopted for filter estimation. A lightweight postnet is optional and can be utilized for further post-processing.
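As an illustration, the input assembly described above can be sketched in PyTorch as follows. The 16 kHz sampling rate, the 320-point FFT, and the function name are assumptions for this sketch; Sec. 3.2 specifies a 20 ms window with 10 ms overlap, which at 16 kHz gives $F = 161$ frequency bins.

```python
import torch

def assemble_input(d, x, e, y, n_fft=320, hop=160):
    """Stack real/imag parts of the four input spectra along the channel axis.

    d, x, e, y: time-domain mic, far-end reference, LAEC error, and linear
    echo signals of shape (batch, samples). Returns I of shape (B, 8, T, F).
    """
    win = torch.hann_window(n_fft)
    feats = []
    for sig in (d, x, e, y):
        # (B, F, T) complex spectrogram; 20 ms window / 10 ms hop at 16 kHz
        spec = torch.stft(sig, n_fft, hop_length=hop, window=win,
                          return_complex=True)
        feats += [spec.real, spec.imag]
    # -> (B, 8, T, F): 4 signals x (real, imag), frames before frequency bins
    return torch.stack(feats, dim=1).transpose(2, 3)

B, n = 2, 16000
I = assemble_input(*(torch.randn(B, n) for _ in range(4)))
print(tuple(I.shape))  # (2, 8, 101, 161)
```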
We illustrate each module in detail below.

2.1. Split and Merge

To alleviate the computational burden caused by a large frequency dimension, we manually divide all frequency bins into three regions, representing the low, mid, and high frequency ranges. For each region, a different multi-scale convolution set is used to split and compress the frequency bins to a unified embedding dimension $E$. After the UNet modeling, each sub-band is converted back to its original size and merged. The split-and-merge pattern allows the feature stream to maintain a relatively low dimension, while the multi-scale convolution sets introduce richer inter-band information.

2.1.1. Multi-scale band split layer

Figure 2(b) shows the detail of the band split layer. The input feature is split along the frequency dimension into $P$ regions. Each region feature $R_p$ is processed by a set of 2D convolutions with different kernel sizes and the same number of output channels. The outputs are subsequently concatenated to obtain a compressed 3D representation $\tilde{R}_p$, where $p \in \{1, \cdots, P\}$. All $\tilde{R}_p$ are merged into $\tilde{R} \in \mathbb{R}^{(M \times E) \times T \times Q}$, where $M$ denotes the number of convolution scales and $Q$ denotes the number of sub-bands after compression. A 2D convolution is then applied to reduce the embedding dimension of $\tilde{R}$ from $M \times E$ to $E$. Collectively, the process can be formulated as:

$$\{R_1, \cdots, R_P\} = \mathrm{Region\text{-}split}(R), \quad (1)$$
$$\tilde{R}_p = \mathrm{Cat}\left(\mathrm{Conv}_{K=k_{p1},\, S=s_p}(R_p), \cdots, \mathrm{Conv}_{K=k_{pM},\, S=s_p}(R_p)\right), \quad (2)$$
$$H = \mathrm{Norm}\left(\mathrm{Conv}\left(\mathrm{Merge}(\tilde{R}_1, \cdots, \tilde{R}_P)\right)\right), \quad (3)$$

where $\mathrm{Cat}(\cdot)$ and $\mathrm{Merge}(\cdot)$ refer to concatenation along the channel and frequency axes, respectively, and $H \in \mathbb{R}^{E \times T \times Q}$ denotes the output of the split layer. Note that different regions adopt different strides in their convolution sets, as the frequency compression ratio varies for each $R_p$.
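A minimal PyTorch sketch of Eqs. (1)-(3) follows. The region widths, frequency kernels, and frequency strides follow the configuration given in Sec. 3.2; the padding scheme (chosen so that every scale within a set yields the same number of sub-bands), the embedding dimension, and the normalization choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleBandSplit(nn.Module):
    """Sketch of Eqs. (1)-(3): per-region multi-scale convs, concat, merge."""
    def __init__(self, emb=16, widths=(20, 60, 81),
                 strides=(4, 10, 20),
                 kernels=((4, 8, 12), (10, 20, 30), (20, 30, 40))):
        super().__init__()
        self.widths = widths
        self.convs = nn.ModuleList()
        for w, s, ks in zip(widths, strides, kernels):
            self.convs.append(nn.ModuleList([
                # kernel and stride act on the frequency axis only; padding
                # keeps all scales of a region at the same sub-band count
                nn.Conv2d(emb, emb, kernel_size=(1, k), stride=(1, s),
                          padding=(0, (k - s) // 2))
                for k in ks]))
        m = len(kernels[0])
        self.reduce = nn.Conv2d(m * emb, emb, 1)  # (M*E) -> E, Eq. (3)
        self.norm = nn.GroupNorm(1, emb)

    def forward(self, r):                                 # r: (B, E, T, F)
        regions = torch.split(r, self.widths, dim=-1)     # Eq. (1)
        merged = []
        for rp, convs in zip(regions, self.convs):
            merged.append(torch.cat([c(rp) for c in convs], dim=1))  # Eq. (2)
        h = torch.cat(merged, dim=-1)        # merge along the frequency axis
        return self.norm(self.reduce(h))                  # Eq. (3)

split = MultiScaleBandSplit()
h = split(torch.randn(2, 16, 50, 161))  # 20 + 60 + 81 = 161 bins
print(tuple(h.shape))  # (2, 16, 50, 15): Q = 5 + 6 + 4 sub-bands
```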
Recall that in [16], the band split/merge operations are implemented with a for-loop; we observe higher implementation efficiency with convolution operations, thanks to the internal optimizations of the PyTorch platform. When we adopt convolutions in the split layer but not in the merge part, only negligible performance degradation is observed in our internal trials.

2.1.2. Band merge layer

Figure 2(c) shows the internal structure of the band merge layer. To be specific, the output from the recurrent UNet is termed $U$. It is split into $Q$ sub-band features, and each sub-band feature is fed into a normalization layer and a separate multilayer perceptron (MLP) to estimate complex-valued T-F masks $G_q$, where $q \in \{1, \cdots, Q\}$. Finally, all $G_q$ are merged along the frequency axis to obtain the estimated T-F mask $G \in \mathbb{R}^{8 \times T \times F}$, which is combined with $I$ for target spectrum filtering. The process is given by:

$$\{U_1, \cdots, U_Q\} = \mathrm{Sub\text{-}band\text{-}Split}(U), \quad (4)$$
$$G_q = \mathrm{MLP}_q(\mathrm{Norm}(U_q)), \quad (5)$$
$$G = \mathrm{Merge}(G_1, \cdots, G_Q), \quad (6)$$
$$\hat{S}^{(1)} = \sum_{i=1}^{8} I_i \otimes G_i, \quad (7)$$

where $\otimes$ denotes element-wise multiplication and $\hat{S}^{(1)}$ denotes the target estimate after filtering.

2.2. Recurrent UNet

2.2.1. Variable frame rate block

The proposed recurrent UNet comprises multiple basic blocks with variable frame rates (VR) due to different time DS/US operations. Taking the encoder as an example, the internal structure of each VR block is shown in Figure 2(d). Denote the input feature as $Z_i$. It is first passed to a causal time DS operation to obtain a squeezed feature stream with a lower frame rate, termed $Z_{\downarrow}$. Then a dual-path module is utilized to model the intra-band and inter-band relations, respectively. Finally, a causal time US layer is adopted to recover the feature back to its original frame rate, termed $Z_o$.
The above-mentioned process can be summarized as:

$$Z_{\downarrow} = \mathrm{TimeDownSample}(Z_i), \quad (8)$$
$$Z^{(1)}_{\downarrow} = \mathrm{Reshape}\left(\mathrm{FC}(\mathrm{GRU}(\mathrm{Norm}(Z_{\downarrow}))) + Z_{\downarrow}\right), \quad (9)$$
$$Z^{(2)}_{\downarrow} = \mathrm{Inter\text{-}band\text{-}MLP}(Z^{(1)}_{\downarrow}) + Z^{(1)}_{\downarrow}, \quad (10)$$
$$Z_o = \mathrm{TimeUpSample}(Z^{(2)}_{\downarrow}), \quad (11)$$

where $Z^{(1)}_{\downarrow}$ and $Z^{(2)}_{\downarrow}$ denote the outputs of intra-band and inter-band modeling, respectively, and $\mathrm{Reshape}(\cdot)$ denotes the transpose operation.

2.2.2. Causal time Down-Sample and Up-Sample layers

The Down-Sample layer is implemented using a non-overlapped 1D causal convolution and a $\mathrm{Reshape}(\cdot)$ operation. Concretely, we merge the embedding and sub-band dimensions and pass the result through a causal 1D convolution layer to obtain a down-sampled version, i.e., $\mathbb{R}^{(E \times Q) \times T} \mapsto \mathbb{R}^{(E \times Q) \times (T/\lambda)}$, where $\lambda$ denotes the time compression ratio, and the kernel size and stride are set to the same value as the compression ratio to keep causality. In the Up-Sample layer, the input is first interpolated along the time axis, and a 1D point-wise convolution layer is then applied; causality is guaranteed by future frame prediction. Due to the space limit, we refer the reader to [15] for more details.

2.2.3. Inter-band MLP shuffler

The inter-band MLP shuffler is inspired by the gMLP proposed in [19]. It consists of channel and band projections and uses a split head and multiplicative gating for global inter-band modeling. The channel dimension of the feature map is doubled using a 1D convolution in the first channel projection. Band attention is achieved through the Gating Unit [17], which evenly divides the feature map into two parts along the channel dimension. One part is modeled inter-band along the time axis in the band projections using 1D convolution, and is then multiplicatively gated with the other part. Finally, in the last channel projection, we reshape the feature map and apply a 1D convolution again.

2.2.4. Cross-scale skip connections

In addition to the standard skip connections in UNet, we also introduce cross-scale skip connections, as shown by the blue curves in Figure 2. We adopt a connection mode similar to dense connections [20], but instead of channel-wise concatenation, we sum the features after normalization. By doing so, nearly no extra computational overhead is introduced. We observe that this strategy can effectively improve performance, as will be revealed in Sec. 4.2.

2.3. Post-processing module

Despite the effectiveness of the proposed model, residual noise components may still exist. To further suppress the remaining noise, a lightweight postnet can be cascaded for post-processing. Similar to [21], it consists of several GRU layers and a group linear layer to estimate coefficients for deep filtering. The complexity of the adopted postnet is only 30 M/s in terms of MACs, which is overall negligible.

2.4. Loss function

We adopt the mean absolute error (MAE) loss, formulated as

$$\mathcal{L}_{MAE} = \mathrm{MAE}(\hat{S}_R, S_R) + \mathrm{MAE}(\hat{S}_I, S_I) + \mathrm{MAE}(|\hat{S}|, |S|), \quad (12)$$

where $\{\hat{S}_R, \hat{S}_I, |\hat{S}|\}$ refer to the real, imaginary, and magnitude parts of the estimated spectrum, respectively, and $\{S_R, S_I, |S|\}$ refer to those of the target. To effectively suppress the echo, the echo-aware loss $\mathcal{L}_{echo}$ [5] is also adopted. Besides, when the near-end speech is absent, we expect the model to suppress the output as much as possible. To this end, a VAD-oriented loss $\mathcal{L}_{vad}$ is proposed, given by

$$\mathcal{L}_{vad} = 10 \log_{10}\left(\left\|\hat{S} \times (1 - I_{vad})\right\|_2^2 + \epsilon\right), \quad (13)$$

where $\hat{S}$ represents the predicted spectrum, $I_{vad}$ is the VAD label of the near-end speech, and $\epsilon$ prevents over-suppression of the target speech; we empirically set it to 0.1. The final loss is thus given by

$$\mathcal{L} = \mathcal{L}_{MAE} + 0.1\,\mathcal{L}_{echo} + \beta\,\mathcal{L}_{vad}, \quad (14)$$

where $\beta$ balances echo suppression and near-end speech preservation; its impact is studied in Sec. 4.2.

3. EXPERIMENTAL SETUP

3.1. Data preparation

In the training dataset, clean clips are randomly sampled from the train-clean-100 and train-clean-360 subsets of Librispeech [22]. Environmental noises are sampled from the DNS-Challenge corpus [23]. The echo data are simulated by convolving clean speech with room impulse responses (RIRs) from the SLR28 dataset [24]. The simulated SNR and signal-to-echo ratio (SER) are sampled from -5 to 15 dB. The scenario proportions of far-end single talk (ST-FE), near-end single talk (ST-NE), and double talk (DT) are set to 10%, 25%, and 65%, respectively. Besides, noise is absent in 10% of the data. The duration of the training set is around 530 hours, and 5% of the training data are held out for model validation.

The test set is simulated by the same method with different data sources. Clean audio is sourced from the test-clean subset of Librispeech. Noise audio is selected from the same dataset as the training set, with no data overlap. The echo data are obtained from real recorded echoes from the AEC Challenge [25]. SERs and SNRs of $-5$ dB, $5$ dB, $15$ dB, $+\infty$, and $-\infty$ are included in the test set. A SER of $+\infty$ corresponds to the ST-NE scenario, while a SER of $-\infty$ corresponds to the ST-FE scenario. The total duration of the test set is around 10 hours. In addition, we use the blind test set of the AEC Challenge [25] to investigate the generalization capability of the models.

3.2. Implementation details

A state-space-based linear filter is used to estimate the error signal $e(n)$ and linear echo $y(n)$ in the LAEC [18]. The window length for STFT and iSTFT is set to 20 ms, with an overlap of 10 ms. The number of VR blocks is 12, with 6 blocks in the encoder and 6 in the decoder.
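To make the VR block of Sec. 2.2 concrete, here is a minimal PyTorch sketch of Eqs. (8)-(11) under illustrative assumptions: the hidden sizes are arbitrary, the inter-band MLP shuffler is simplified to a residual band-mixing convolution, and nearest-neighbor interpolation stands in for the future-frame-prediction up-sampling of [15].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VRBlock(nn.Module):
    """Sketch of one variable frame rate block: causal DS -> dual-path -> US."""
    def __init__(self, emb=16, bands=15, ratio=4):
        super().__init__()
        ch = emb * bands
        self.emb, self.bands, self.ratio = emb, bands, ratio
        # Eq. (8): kernel == stride == ratio, non-overlapped and thus causal
        self.down = nn.Conv1d(ch, ch, kernel_size=ratio, stride=ratio)
        # Eq. (9): per-band temporal (intra-band) modeling
        self.norm = nn.LayerNorm(emb)
        self.gru = nn.GRU(emb, emb, batch_first=True)
        self.fc = nn.Linear(emb, emb)
        # Eq. (10): inter-band modeling, reduced here to a band-axis conv
        self.inter = nn.Conv1d(bands, bands, kernel_size=1)
        # Eq. (11): point-wise conv after time interpolation
        self.up = nn.Conv1d(ch, ch, kernel_size=1)

    def forward(self, z):                                 # z: (B, E, T, Q)
        b, e, t, q = z.shape
        zd = self.down(z.permute(0, 1, 3, 2).reshape(b, e * q, t))
        td = zd.shape[-1]                                 # T / ratio
        # (B*Q, T', E) for the per-band GRU, Eq. (9)
        h = zd.reshape(b, e, q, td).permute(0, 2, 3, 1).reshape(b * q, td, e)
        h = self.fc(self.gru(self.norm(h))[0]) + h
        # (B*T', Q, E): mix information across sub-bands, Eq. (10)
        g = h.reshape(b, q, td, e).permute(0, 2, 1, 3).reshape(b * td, q, e)
        g = self.inter(g) + g
        # Eq. (11): interpolate back to T frames, then point-wise conv
        zu = g.reshape(b, td, q, e).permute(0, 3, 2, 1).reshape(b, e * q, td)
        zu = F.interpolate(zu, size=t)
        return self.up(zu).reshape(b, e, q, t).permute(0, 1, 3, 2)

blk = VRBlock()
out = blk(torch.randn(2, 16, 64, 15))
print(tuple(out.shape))  # (2, 16, 64, 15): frame rate restored
```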
The time compression ratios $\lambda$ are set as $\{1, 2, 4, 8, 16, 32, 32, 16, 8, 4, 2, 1\}$. In our experiments, we can adjust the model complexity by changing the embedding dimension $E$. For $E = 10$ and $E = 200$, the computational complexity of the model (without the post-processing module) is 50 M/s and 6.8 G/s, respectively, which is adequate to cover both resource-limited and offline scenarios. For the multi-scale band split layer, the number of regions $P$ is set to 3, with the regions covering 20, 60, and 81 frequency bins, respectively. The strides of the 2D convolution sets for the three regions are set to $\{(2, 4), (2, 10), (2, 20)\}$, and the corresponding kernels $K_1, K_2, K_3$ are respectively set to $\{(1, 4), (1, 8), (1, 12)\}$, $\{(1, 10), (1, 20), (1, 30)\}$, and $\{(1, 20), (1, 30), (1, 40)\}$. The Adam optimizer is adopted with an initial learning rate of 0.001 and a decay coefficient of 0.99. Each model is trained for 200 epochs with a batch size of 16 at the utterance level.

Table 1. The objective results of the proposed SMRU and different baselines on the test set. BOLD indicates the best score.

| Model | MACs (G/s) | RTF | DT SI-SNR | DT PESQ | ST-NE SI-SNR | ST-NE PESQ | ST-FE ERLE |
|---|---|---|---|---|---|---|---|
| NSNet | 0.13 | 0.0114 | 10.01 | 1.82 | 12.94 | 2.08 | 52.16 |
| DeepFilterNet | 0.24 | 0.0347 | 11.48 | 2.11 | 14.01 | 2.31 | 55.94 |
| DTLN | 0.46 | 0.0351 | 11.65 | 2.08 | 13.02 | 2.14 | **67.39** |
| BSRNN | 1.38 | 0.2087 | 12.92 | 2.34 | 14.67 | 2.48 | 55.03 |
| FastFullSubNet | 1.75 | 0.1008 | 12.64 | 2.25 | 14.18 | 2.35 | 50.61 |
| SMRU-T | 0.05 | 0.0291 | 11.17 | 1.97 | 12.90 | 2.08 | 52.93 |
| SMRU-S | 0.11 | 0.0354 | 11.76 | 2.09 | 13.58 | 2.21 | 52.87 |
| +PostNet | 0.14 | 0.0496 | 12.29 | 2.17 | 13.97 | 2.29 | 52.44 |
| SMRU-L | 1.03 | 0.0972 | 13.28 | 2.35 | 14.77 | 2.48 | 57.18 |
| SMRU-H | 6.83 | 0.3452 | **14.11** | **2.50** | **15.65** | **2.65** | 58.91 |

Fig. 3. AECMOS metrics of the blind test set under the DT scenario.

3.3. Evaluation metrics

For the synthetic test set, the ST-NE and DT scenarios are evaluated using scale-invariant SNR (SI-SNR) [26] and wide-band perceptual evaluation of speech quality (WB-PESQ) [27], while the ST-FE scenario is evaluated using echo return loss enhancement (ERLE) [28]. For the blind test set, we use the AECMOS metric [29].

4. EXPERIMENTAL RESULTS

4.1. Result comparisons with baselines

We compare the proposed SMRU with five advanced baselines on the test set; quantitative results are shown in Table 1. Four modes of SMRU are investigated, namely tiny (T), small (S), large (L), and huge (H), with the complexity varying from 50 M/s to 6.83 G/s in terms of MACs. Baseline methods include NSNet [25], DTLN [8], DeepFilterNet, FastFullSubNet, and BSRNN. The latter three models were originally proposed for the speech enhancement task, and we adapt them to the AEC task by using the same input as the proposed method. The real-time factor (RTF) is measured on an Intel Core i7-9750H CPU clocked at 2.60 GHz.

Fig. 4. Spectrum visualizations of an example. (a) Mixed audio. (b) Target near-end speech. (c) Estimated spectrum processed by DeepFilterNet. (d) Estimated spectrum processed by SMRU-S.

From the table, several observations can be made. First, with the increase in computational complexity, the objective metric scores of the proposed method gradually improve, where the tiny version achieves overall better performance than NSNet at only around one-third of its complexity. Moreover, although DTLN performs well in the ST-FE scenario, it lacks capability in the other scenarios: in the DT and ST-NE scenarios, DTLN performs worse than SMRU-S, which has only a quarter of its complexity. For the large version, SMRU further outperforms BSRNN and FastFullSubNet with less complexity. This fully validates the superiority of the proposed method.
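For reference, the SI-SNR metric reported in Table 1 can be computed as below — a minimal sketch following the definition in [26]; the zero-mean normalization and the epsilon value are standard choices, not details taken from this paper.

```python
import torch

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB (Conv-TasNet style [26])."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # orthogonal projection of the estimate onto the reference signal
    s_t = (est * ref).sum(-1, keepdim=True) * ref \
        / (ref.pow(2).sum(-1, keepdim=True) + eps)
    e_n = est - s_t
    return 10 * torch.log10(s_t.pow(2).sum(-1) / (e_n.pow(2).sum(-1) + eps))

ref = torch.randn(1, 16000)
est = ref + 0.1 * torch.randn(1, 16000)
# SI-SNR is invariant to rescaling the estimate, hence "scale-invariant"
print(float(si_snr(est, ref)), float(si_snr(5.0 * est, ref)))
```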
Besides, when the PostNet is adopted, notable improvements can be observed in both the DT and ST-NE cases with only a slight degradation in ERLE for the ST-FE case, which validates the effectiveness of the post-processing. Finally, compared with BSRNN and FastFullSubNet, SMRU-L enjoys a notably lower RTF, which can be attributed to the proposed UNet structure with its multi-level time sampling strategy.

Figure 3 shows the AECMOS results on the AEC Challenge ICASSP 2022 blind test set for different approaches under the DT scenario. The MOS trend of SMRU across different complexities is shown by the blue curve. One can see that SMRU provides the best DT performance. The DT Echo MOS of NSNet and DeepFilterNet is slightly better than that of SMRU at the same complexity, but their near-end speech preservation is poor, resulting in a low DT Other MOS. Therefore, SMRU strikes a good trade-off between far-end echo suppression and near-end speech preservation.

Table 2. Ablation study on the proposed SMRU-S. Cond. 1 denotes the use of the multi-scale band split layer; Cond. 2 denotes the use of cross-scale skip connections.

| Model | Cond. 1 | Cond. 2 | DT SI-SNR | DT PESQ | ST-NE SI-SNR | ST-NE PESQ | ST-FE ERLE |
|---|---|---|---|---|---|---|---|
| SMRU-S | ✓ | ✗ | 11.69 | 2.06 | 13.39 | 2.17 | 52.51 |
| | ✗ | ✓ | 11.54 | 2.02 | 13.21 | 2.12 | 45.59 |
| | ✓ | ✓ | 11.76 | 2.09 | 13.58 | 2.21 | 52.87 |

Figure 4 shows the spectrum visualization of an example case; (a)-(d) denote the mixture, the near-end target, and the estimates of DeepFilterNet and SMRU-S, respectively. It can be observed that DeepFilterNet may over-suppress the voiced regions after the silent segment, while the proposed SMRU better preserves the harmonic structure of the target speech.

4.2. Ablation study

Ablation studies are conducted on SMRU-S on the test set to investigate the effects of the multi-scale band split layer and the cross-scale skip connections.
For the case without the multi-scale band split layer, single-scale convolution is used for band splitting, i.e., the number of convolutions in the convolution set for each region is set to 1. The results in Table 2 show that removing the multi-scale band split layer leads to significant performance degradation, as single-scale convolution cannot provide frequency representations at different resolutions. Besides, the cross-scale skip connections also provide a notable performance improvement while introducing only minimal computational overhead.

Table 3. Ablation study on the weight β of the proposed VAD-oriented loss.

| Model | β | DT SI-SNR | DT PESQ | ST-NE SI-SNR | ST-NE PESQ | ST-FE ERLE |
|---|---|---|---|---|---|---|
| SMRU-S | 0 | 11.85 | 2.08 | 13.51 | 2.19 | 50.73 |
| | 0.0002 | 11.76 | 2.09 | 13.58 | 2.21 | 52.87 |
| | 0.0005 | 11.76 | 2.06 | 13.45 | 2.18 | 55.56 |
| | 0.001 | 11.60 | 2.04 | 13.21 | 2.16 | 62.81 |

Table 3 shows the performance for different $\beta$ values in the loss function. When $\beta = 0.0002$, both the PESQ in the DT and ST-NE scenarios and the ERLE in the ST-FE scenario improve over $\beta = 0$. However, as $\beta$ further increases, the model exhibits better far-end echo suppression, i.e., a higher ERLE score, but at the cost of more speech distortion, i.e., lower PESQ and SI-SNR scores. Due to space constraints, we do not traverse more $\beta$ options; $\beta = 0.0002$ appears adequate to balance echo cancellation and target speech preservation.

5. CONCLUSION

In this paper, we propose SMRU, a UNet-based fundamental model for the echo cancellation and noise suppression task. To enable more flexible computational complexity control, we explore modulating both the frequency and time dimensions. For the former, the multi-scale band split layer and band merge layer are introduced to effectively decrease the modeling complexity in the frequency domain.
For the latter, we introduce the variable frame rate block as the basic unit for both intra- and inter-band modeling, while also effectively decreasing the computational complexity via different causal time down-/up-sampling rates. With these tactics together, we control the overall computational complexity from 50 M/s to 6.8 G/s in MACs, which is adequate to cover both resource-limited and cloud-processing scenarios. Both quantitative and qualitative results reveal the superiority of the proposed approach over existing advanced baselines. In future work, we plan to extend the proposed SMRU to more related tasks, e.g., dereverberation and multi-channel speech enhancement.

6. REFERENCES

[1] J.-S. Soo and Khee K. Pang, "Multidelay block frequency domain adaptive filter," IEEE Trans. Acoust. Speech Signal Process., vol. 38, no. 2, pp. 373–376, 1990.
[2] Gerald Enzner and Peter Vary, "Frequency-domain adaptive Kalman filter for acoustic echo control in hands-free telephones," Signal Process., vol. 86, no. 6, pp. 1140–1156, 2006.
[3] Haoran Zhao, Nan Li, Runqiang Han, Lianwu Chen, Xiguang Zheng, Chen Zhang, Liang Guo, and Bing Yu, "A deep hierarchical fusion network for fullband acoustic echo cancellation," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2022, pp. 9112–9116.
[4] Jan Franzen and Tim Fingscheidt, "Deep residual echo suppression and noise reduction: A multi-input FCRN approach in a hybrid speech enhancement system," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2022, pp. 666–670.
[5] Shimin Zhang, Ziteng Wang, Jiayao Sun, Yihui Fu, Biao Tian, Qiang Fu, and Lei Xie, "Multi-task deep residual echo suppression with echo-aware loss," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2022, pp. 9127–9131.
[6] Xingwei Sun, Chenbin Cao, Qinglong Li, Linzhang Wang, and Fei Xiang, "Explore relative and context information with transformer for joint acoustic echo cancellation and speech enhancement," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2022, pp. 9117–9121.
[7] Hao Zhang, Ke Tan, and DeLiang Wang, "Deep learning for joint acoustic echo and noise cancellation with nonlinear distortions," in Proc. Interspeech, 2019, pp. 4255–4259.
[8] Nils L. Westhausen and Bernd T. Meyer, "Acoustic echo cancellation with the dual-signal transformation LSTM network," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2021, pp. 7138–7142.
[9] Chengyu Zheng, Yuan Zhou, Xiulian Peng, Yuan Zhang, and Yan Lu, "Real-time speech enhancement with dynamic attention span," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2023, pp. 1–5.
[10] Ke Tan and DeLiang Wang, "Learning complex spectral mapping with gated convolutional recurrent networks for monaural speech enhancement," IEEE Trans. Audio Speech Lang. Process., vol. 28, pp. 380–390, 2019.
[11] Eesung Kim and Hyeji Seo, "SE-Conformer: Time-domain speech enhancement using Conformer," in Proc. Interspeech, 2021, pp. 2736–2740.
[12] Guochen Yu, Andong Li, Chengshi Zheng, Yinuo Guo, Yutian Wang, and Hui Wang, "Dual-branch attention-in-attention transformer for single-channel speech enhancement," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2022, pp. 7847–7851.
[13] Feng Dang, Hangting Chen, and Pengyuan Zhang, "DPT-FSNet: Dual-path transformer based full-band and sub-band fusion network for speech enhancement," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2022, pp. 6857–6861.
[14] Xiang Hao and Xiaofei Li, "Fast FullSubNet: Accelerate full-band and sub-band fusion model for single-channel speech enhancement," 2022.
[15] Hangting Chen, Jianwei Yu, Yi Luo, Rongzhi Gu, Weihua Li, Zhuocheng Lu, and Chao Weng, "Ultra dual-path compression for joint echo cancellation and noise suppression," 2023.
[16] Jianwei Yu, Yi Luo, Hangting Chen, Rongzhi Gu, and Chao Weng, "High fidelity speech enhancement with band-split RNN," 2022.
[17] Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, and Yinxiao Li, "MAXIM: Multi-axis MLP for image processing," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 5769–5780.
[18] Fabian Kuech, Edwin Mabande, and Gerald Enzner, "State-space architecture of the partitioned-block-based acoustic echo controller," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2014, pp. 1295–1299.
[19] Hanxiao Liu, Zihang Dai, David So, and Quoc V. Le, "Pay attention to MLPs," Proc. Adv. Neural Inf. Process. Syst., vol. 34, pp. 9204–9215, 2021.
[20] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger, "Densely connected convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 4700–4708.
[21] Hendrik Schroter, Alberto N. Escalante-B., Tobias Rosenkranz, and Andreas Maier, "DeepFilterNet: A low complexity speech enhancement framework for full-band audio based on deep filtering," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2022, pp. 7407–7411.
[22] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2015, pp. 5206–5210.
[23] Chandan K. A. Reddy, Harishchandra Dubey, Kazuhito Koishida, Arun Nair, Vishak Gopal, Ross Cutler, Sebastian Braun, Hannes Gamper, Robert Aichner, and Sriram Srinivasan, "Interspeech 2021 deep noise suppression challenge," 2021.
[24] Tom Ko, Vijayaditya Peddinti, Daniel Povey, Michael L. Seltzer, and Sanjeev Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2017, pp. 5220–5224.
[25] Ross Cutler, Ando Saabas, Tanel Parnamaa, Marju Purin, Hannes Gamper, Sebastian Braun, Karsten Sørensen, and Robert Aichner, "ICASSP 2022 acoustic echo cancellation challenge," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2022, pp. 9107–9111.
[26] Yi Luo and Nima Mesgarani, "Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation," IEEE Trans. Audio Speech Lang. Process., vol. 27, no. 8, pp. 1256–1266, 2019.
[27] ITU-T Rec. P.862.2, "Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs," Int. Telecommun. Union, 2005.
[28] Sergios Theodoridis and Rama Chellappa, Academic Press Library in Signal Processing: Image, Video Processing and Analysis, Hardware, Audio, Acoustic and Speech Processing, Academic Press, 2013.
[29] Marju Purin, Sten Sootla, Mateja Sponza, Ando Saabas, and Ross Cutler, "AECMOS: A speech quality assessment metric for echo impairment," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2022, pp. 901–905.