IMSE: Efficient U-Net-based Speech Enhancement using Inception Depthwise Convolution and Amplitude-Aware Linear Attention

Reading time: 5 minutes
...

📝 Original Info

  • Title: IMSE: Efficient U-Net-based Speech Enhancement using Inception Depthwise Convolution and Amplitude-Aware Linear Attention
  • ArXiv ID: 2511.14515
  • Date: 2025-11-18
  • Authors: Not specified in the provided paper information (possibly Xin Xin Tang et al., but this requires verification)

📝 Abstract

Achieving a balance between lightweight design and high performance remains a significant challenge for speech enhancement (SE) tasks on resource-constrained devices. Existing state-of-the-art methods, such as MUSE, have established a strong baseline with only 0.51M parameters by introducing a Multi-path Enhanced Taylor (MET) transformer and Deformable Embedding (DE). However, an in-depth analysis reveals that MUSE still suffers from efficiency bottlenecks: the MET module relies on a complex "approximate-compensate" mechanism to mitigate the limitations of Taylor-expansion-based attention, while the offset calculation for deformable embedding introduces additional computational burden. This paper proposes IMSE, a systematically optimized and ultra-lightweight network. We introduce two core innovations: 1) Replacing the MET module with Amplitude-Aware Linear Attention (MALA). MALA fundamentally rectifies the "amplitude-ignoring" problem in linear attention by explicitly preserving the norm information of query vectors in the attention calculation, achieving efficient global modeling without an auxiliary compensation branch. 2) Replacing the DE module with Inception Depthwise Convolution (IDConv). IDConv borrows the Inception concept, decomposing large-kernel operations into efficient parallel branches (square, horizontal, and vertical strips), thereby capturing spectrogram features with extremely low parameter redundancy. Extensive experiments on the VoiceBank+DEMAND dataset demonstrate that, compared to the MUSE baseline, IMSE significantly reduces the parameter count by 16.8% (from 0.513M to 0.427M) while achieving competitive performance comparable to the state-of-the-art on the PESQ metric (3.373). This study sets a new benchmark for the trade-off between model size and speech quality in ultra-lightweight speech enhancement.

💡 Deep Analysis

📄 Full Content

Speech enhancement (SE) algorithms, a core technology in the speech signal processing field, aim to recover clean speech from noise-corrupted signals, thereby improving speech quality and intelligibility. This technology has wide-ranging applications in mobile communications, hearing aid design, and automatic speech recognition (ASR) front-ends. In recent years, with the rapid development of deep learning, deep neural network (DNN)-based SE methods [1][2][3][4] have achieved performance significantly superior to traditional signal processing methods in suppressing non-stationary noise and reverberation.

Currently, mainstream SE models typically adopt encoder-decoder or two-stage architectures, using time-frequency (T-F) domain masking or complex spectral mapping for enhancement. To capture long-range contextual dependencies, the Transformer [5,6] and its variants have been widely introduced into SE tasks. However, the standard self-attention mechanism has a computational complexity of O(N²) (quadratic in the sequence length N), which is prohibitive for processing long speech sequences at high sampling rates. Furthermore, to achieve SOTA performance, existing models often stack a large number of layers and channels, leading to a surge in parameters and computation and making real-time deployment on edge devices difficult.
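To make this scaling concrete, the short sketch below compares the number of attention-score entries a softmax Transformer must materialize against the feature-map size of a kernelized linear variant. The frame rate and per-head dimension are illustrative assumptions, not values from the paper.

```python
import math

def frames(seconds, sr=16000, hop=160):
    # Number of STFT frames for a clip of the given length (assumed 16 kHz audio, 10 ms hop).
    return math.ceil(seconds * sr / hop)

d = 64  # per-head feature dimension (assumed)
for seconds in (1, 10, 60):
    n = frames(seconds)
    quadratic = n * n   # entries of the N x N softmax score matrix
    linear = n * d      # entries of the kernelized feature map, O(N*d)
    print(f"{seconds:>3} s -> N={n:6d}  softmax scores={quadratic:>13,}  linear features={linear:>9,}")
```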

To address these issues, Zizhen Lin et al. recently proposed MUSE [7], a lightweight U-Net-based model (0.51M parameters). MUSE’s success is attributed to two key designs: first, the use of Deformable Embedding (DE) to adapt to the irregular shapes of spectrogram patterns; second, the proposal of the Multi-path Enhanced Taylor (MET) transformer, which uses Taylor expansion to reduce attention complexity to linear and compensates for information loss with an additional convolutional branch.

Although MUSE achieves an impressive balance, we argue that its core components still have room for optimization. 1) Complexity of the MET module: MET employs Taylor Self-Attention (T-MSA) [8] as its core. However, Qihang Fan et al. [9] pointed out that standard linear attention formulations lose the amplitude information of the query vector during normalization, leading to overly smooth, non-selective attention distributions. To compensate, MUSE must introduce a complex CSA branch, which adds structural redundancy. 2) Overhead of the DE module: the offset calculation required by deformable embedding introduces additional computational burden.

Several prior works inform this direction. To handle long sequences, TSTNN [11] proposed a two-stage Transformer architecture. To reduce computation, MP-SENet [12] adopted a strategy of processing magnitude and phase in parallel. MUSE [7] further combined the U-Net architecture with a linear Transformer, exploring an ultra-lightweight design under 1M parameters. Our work continues this line of research, focusing on discovering more efficient fundamental algorithms.

The quadratic complexity of the standard Transformer is its main bottleneck in long-sequence tasks. Linear attention reduces the O(N²) complexity to O(N) using the kernel trick. Common approaches include using ϕ(x) = elu(x)+1 or ReLU as the kernel function. However, these methods often overlook the amplitude information loss caused by the non-linear activation. MALA, recently proposed by Qihang Fan et al. [9], deeply analyzes this “amplitude-ignoring” phenomenon and proposes a magnitude-preserving computation paradigm, significantly boosting linear attention performance. IMSE leverages this latest advancement to streamline the network structure.
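The kernel trick exploits the associativity of matrix multiplication: by aggregating keys and values first, the N×N score matrix is never formed. The minimal single-head PyTorch sketch below uses the ϕ(x) = elu(x)+1 kernel mentioned above; the tensor shapes and the omission of multi-head projections and masking are simplifying assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Standard attention: materializes the (N, N) score matrix -> O(N^2) memory/compute.
    scores = (q @ k.transpose(-1, -2)) / q.shape[-1] ** 0.5
    return scores.softmax(dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    # Kernelized attention with phi(x) = elu(x) + 1, which keeps features positive.
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = k.transpose(-1, -2) @ v                 # (d, d): aggregate keys and values first
    z = q @ k.sum(dim=-2).unsqueeze(-1)          # (N, 1): per-query normalizer
    return (q @ kv) / (z + eps)                  # O(N * d^2) instead of O(N^2 * d)

q, k, v = (torch.randn(1024, 64) for _ in range(3))
print(linear_attention(q, k, v).shape)           # torch.Size([1024, 64])
```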

To expand the receptive field of CNNs, researchers have proposed large-kernel convolutions (e.g., 7×7 or larger). However, these have high parameter and computational costs. Inception-NeXt [10] proposed an Inception-based decomposition strategy, splitting a large-kernel depthwise convolution into multiple branches, such as K × K, 1 × K, and K × 1. This design not only reduces computational cost but its strip convolution kernels are highly suitable for capturing features along the time axis (duration) and frequency axis (harmonic structures) in spectrograms, aligning well with the physical properties of speech signals.
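As a concrete illustration of this decomposition, the PyTorch sketch below splits the channels into an identity path plus three depthwise branches (square, frequency-strip, and time-strip) in the spirit of InceptionNeXt [10]. The split ratio, kernel sizes, and tensor layout are assumptions for illustration, not the exact IDConv configuration used in IMSE.

```python
import torch
import torch.nn as nn

class InceptionDWConv2d(nn.Module):
    def __init__(self, channels, square_k=3, band_k=11, branch_ratio=0.125):
        super().__init__()
        gc = int(channels * branch_ratio)  # channels assigned to each conv branch
        self.dw_square = nn.Conv2d(gc, gc, square_k, padding=square_k // 2, groups=gc)
        self.dw_freq = nn.Conv2d(gc, gc, (1, band_k), padding=(0, band_k // 2), groups=gc)
        self.dw_time = nn.Conv2d(gc, gc, (band_k, 1), padding=(band_k // 2, 0), groups=gc)
        self.split = (channels - 3 * gc, gc, gc, gc)  # remaining channels pass through

    def forward(self, x):  # x: (B, C, T, F) spectrogram features
        x_id, x_sq, x_f, x_t = torch.split(x, self.split, dim=1)
        return torch.cat(
            (x_id, self.dw_square(x_sq), self.dw_freq(x_f), self.dw_time(x_t)), dim=1)

y = InceptionDWConv2d(64)(torch.randn(2, 64, 100, 201))
print(y.shape)  # torch.Size([2, 64, 100, 201])
```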

The overall architecture of IMSE is shown in Fig. 1. It retains the four-level U-Net backbone of MUSE, processing complex-valued features from the STFT. This section focuses on the two core modules we replaced: Inception Depthwise Convolution Embedding (IDConv) and Amplitude-Aware Linear Attention (MALA).
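For context, the snippet below shows one way such complex-valued STFT features might be prepared from a noisy waveform. The FFT size, hop length, and power-law magnitude compression are assumptions for illustration, not the exact front-end settings of IMSE or MUSE.

```python
import torch

def stft_features(wav, n_fft=400, hop=100, compress=0.3):
    # Complex STFT of the noisy waveform, followed by power-law magnitude compression.
    spec = torch.stft(wav, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    mag, phase = spec.abs() ** compress, spec.angle()
    real, imag = mag * torch.cos(phase), mag * torch.sin(phase)
    return torch.stack((real, imag), dim=1)      # (B, 2, F, T) real/imaginary channels

feats = stft_features(torch.randn(2, 16000))     # 1 s of 16 kHz audio (random placeholder)
print(feats.shape)                               # torch.Size([2, 2, 201, 161])
```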

The MET module approximates Softmax via Taylor expansion but sacrifices the Query’s magnitude information, resulting in over-smoothed attention. Consequently, MUSE requires a redundant branch to compensate. We instead adopt MALA, which mathematically rectifies this flaw by reintroducing magnitude via a division-based normalization. This restores the sharpness of Softmax attention at O(N) complexity without structural redundancy.

To address this, we replace MET with Amplitude-Aware Linear Attention (MALA) [9]. The core idea of MALA is to reintroduce the amplitude information of the query Q_i into the attention calculation through its division-based normalization, so that the attention distribution regains the selectivity of Softmax while keeping linear complexity.
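The toy example below illustrates the phenomenon MALA targets: with a ReLU feature map, the linear-attention distribution is completely insensitive to the query’s magnitude, whereas softmax attention sharpens as the query grows. This only demonstrates the “amplitude-ignoring” effect under an assumed ReLU kernel; it is not the MALA formulation from [9].

```python
import torch

torch.manual_seed(0)
k = torch.randn(8, 4).relu()           # positive key features phi(K)
q = torch.randn(1, 4).relu() + 0.1     # positive query features phi(q)

def linear_weights(q):                 # kernelized attention distribution over the 8 keys
    s = q @ k.T
    return s / s.sum(-1, keepdim=True)

def softmax_weights(q):                # standard softmax attention distribution
    return (q @ k.T).softmax(-1)

# Scaling the query by 10 leaves the linear-attention distribution untouched...
print(torch.allclose(linear_weights(q), linear_weights(10 * q)))          # True
# ...but makes the softmax distribution noticeably sharper (more selective).
print(softmax_weights(q).max().item(), softmax_weights(10 * q).max().item())
```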

Reference

This content is AI-processed based on open access ArXiv data.
