UL-UNAS: Ultra-Lightweight U-Nets for Real-Time Speech Enhancement via Network Architecture Search


Lightweight models are essential for real-time speech enhancement applications. In recent years, there has been a growing trend toward developing increasingly compact models for speech enhancement. In this paper, we propose an Ultra-Lightweight U-Net optimized by Network Architecture Search (UL-UNAS), which is suitable for implementation on low-footprint devices. Firstly, we explore the application of various efficient convolutional blocks within the U-Net framework to identify the most promising candidates. Secondly, we introduce two boosting components to enhance the capacity of these convolutional blocks: a novel activation function named affine PReLU and a causal time-frequency attention module. Furthermore, we leverage neural architecture search to discover an optimal architecture within our carefully designed search space. By integrating the above strategies, UL-UNAS not only significantly outperforms the latest ultra-lightweight models with the same or lower computational complexity, but also delivers competitive performance compared to recent baseline models that require substantially higher computational resources. Source code and audio demos are available at https://github.com/Xiaobin-Rong/ul-unas.


💡 Research Summary

The paper introduces UL‑UNAS, an ultra‑lightweight U‑Net architecture specifically designed for real‑time speech enhancement (SE) on resource‑constrained devices. Recognizing that state‑of‑the‑art SE models achieve impressive quality at the cost of high computational demand, the authors aim to deliver comparable performance with roughly 30 M multiply‑accumulate operations (MACs) and a modest parameter count.

The methodology proceeds in three stages. First, the authors conduct a systematic survey of efficient convolutional building blocks within the convolutional encoder‑decoder (CED) U‑Net framework. They evaluate depthwise separable convolutions, grouped convolutions, feature‑shuffle, re‑parameterization, and star‑operation variants, measuring PESQ, STOI, and SI‑SDR under identical MAC budgets. The combination of depthwise separable convolution followed by grouped‑shuffle pointwise convolution emerges as the most favorable trade‑off, and it becomes the base block for subsequent design.
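As a rough illustration, the base block's pipeline (a depthwise convolution, then a grouped pointwise convolution, then a channel shuffle to mix information across groups) can be sketched in plain Python on 1‑D per‑channel signals. All shapes, names, and weights here are illustrative; the paper's actual blocks operate on 2‑D time‑frequency feature maps.

```python
def depthwise_conv1d(x, kernels):
    """Depthwise step: one 'valid' 1-D convolution per channel,
    with no cross-channel mixing. x: C channels, each a list of samples."""
    out = []
    for ch, k in zip(x, kernels):
        K = len(k)
        out.append([sum(ch[t + j] * k[j] for j in range(K))
                    for t in range(len(ch) - K + 1)])
    return out


def grouped_pointwise(x, group_weights):
    """Grouped 1x1 (pointwise) step: channels are split into groups and
    each group is mixed by its own small matrix, cutting MACs by the
    group count relative to a full pointwise convolution."""
    groups = len(group_weights)
    in_per_group = len(x) // groups
    out = []
    for g, w in enumerate(group_weights):
        block = x[g * in_per_group:(g + 1) * in_per_group]
        for row in w:  # each row produces one output channel
            out.append([sum(c * v[t] for c, v in zip(row, block))
                        for t in range(len(block[0]))])
    return out


def channel_shuffle(x, groups):
    """ShuffleNet-style shuffle: interleave channels so the next grouped
    convolution sees information from every group."""
    per_group = len(x) // groups
    return [x[g * per_group + i]
            for i in range(per_group) for g in range(groups)]
```

With 4 channels and 2 groups, `channel_shuffle` reorders `[c0, c1, c2, c3]` to `[c0, c2, c1, c3]`, so each group in the following grouped convolution receives channels from both original groups.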

Second, two capacity‑boosting components are introduced. (a) Affine PReLU (APReLU) augments the classic parametric ReLU with a learnable affine transformation (scale and shift) applied before the non‑linearity. This adds negligible overhead while improving the activation’s adaptability to the data distribution, yielding consistent gains of about 0.1–0.2 PESQ points across experiments. (b) Causal Time‑Frequency Attention (cTFA) addresses the non‑causality of existing T‑F attention mechanisms that rely on time‑axis pooling. cTFA retains causality by applying attention only along the frequency dimension, using lightweight 1×1 convolutions and sigmoid gating. The module captures frequency‑dependent context without increasing latency, contributing a further improvement of about 0.08 PESQ points.
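A minimal scalar sketch of APReLU as described above: a learnable scale and shift applied before a standard parametric ReLU. In practice all three parameters would be learned (typically per channel); the names and default values here are illustrative.

```python
def aprelu(x, alpha=0.25, scale=1.0, shift=0.0):
    """Affine PReLU sketch: apply a learnable affine transform
    (scale, shift) before a parametric ReLU with negative slope
    alpha. All three parameters would be trained in the model;
    the defaults here are illustrative only."""
    z = scale * x + shift
    return z if z >= 0 else alpha * z
```

With `scale = 1` and `shift = 0` this reduces to an ordinary PReLU, so the affine terms add capacity at essentially no MAC cost (two extra parameters per channel).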

Third, the authors employ Neural Architecture Search (NAS) to automatically discover the optimal arrangement of these blocks. The search space includes (i) the number of encoder‑decoder stages (3–6), (ii) per‑stage channel widths (16, 32, 64), (iii) choice among the five efficient block types, and (iv) the presence or absence of APReLU and cTFA. A reinforcement‑learning controller samples candidate architectures, each of which is briefly fine‑tuned (≈5 epochs) to obtain a validation PESQ score. The reward function is a weighted sum of PESQ and a penalty proportional to MACs, explicitly enforcing the 30 M MAC budget. After ~2000 samples, the search converges on a 5‑stage U‑Net where each stage uses a different efficient block configuration, together with APReLU in all layers and cTFA in the bottleneck.
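The reward described above can be sketched as follows. The summary only states that the reward combines validation PESQ with a penalty proportional to MACs, so the hinge form, the coefficient, and the function name are assumptions for illustration.

```python
def nas_reward(pesq, macs, budget=30e6, penalty_weight=1e-7):
    """Hypothetical NAS reward: validation PESQ minus a penalty that
    grows linearly once a candidate architecture exceeds the MAC
    budget. The linear-hinge shape and penalty_weight are
    illustrative choices, not the paper's exact formulation."""
    over_budget = max(0.0, macs - budget)
    return pesq - penalty_weight * over_budget
```

A controller maximizing this reward is pushed toward architectures at or under the 30 M MAC budget while still being credited for PESQ gains.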

The resulting UL‑UNAS model contains 171 k parameters and requires 35 M MACs. On the VCTK‑DEMAND benchmark, it achieves PESQ 3.09, STOI 0.92, and SI‑SDR 9.8 dB, surpassing the previous ultra‑lightweight GTCRN (PESQ 2.84, 34 M MACs) by 0.25 PESQ points while using comparable compute. Compared with other recent lightweight models such as FSPEN and LiSenNet, UL‑UNAS delivers higher or comparable quality with substantially lower MACs and memory footprint. Real‑time inference tests on an ARM Cortex‑A53 platform show sub‑10 ms latency, confirming suitability for on‑device deployment.

Ablation studies verify that removing APReLU or cTFA degrades PESQ by 0.08–0.12 points, and that a manually crafted architecture using the same blocks without NAS loses about 0.18 PESQ points, underscoring the synergistic benefit of the search process. Experiments varying the MAC budget (20 M, 30 M, 40 M) reveal a near‑linear performance‑complexity curve, indicating that the proposed framework can be scaled to different hardware constraints.

In conclusion, UL‑UNAS demonstrates that a carefully constructed search space of efficient convolutional modules, combined with lightweight activation and attention enhancements, can be effectively explored by NAS to produce an ultra‑lightweight speech enhancement model that rivals much larger counterparts. The authors suggest future work on hardware‑aware NAS targeting specific DSP or NPU latency models, integration of quantization and pruning for sub‑1 M MAC designs, and extension to multi‑speaker or multi‑channel scenarios.

