Study of Lightweight Transformer Architectures for Single-Channel Speech Enhancement
In speech enhancement, achieving state-of-the-art (SotA) performance while adhering to the computational constraints of edge devices remains a formidable challenge. Networks that stack temporal and spectral modelling blocks effectively leverage improved architectures such as transformers; however, they inevitably incur substantial computational complexity and growth in model size. Through systematic ablation analysis of transformer-based temporal and spectral modelling, we demonstrate that an architecture employing streamlined Frequency-Time-Frequency (FTF) stacked transformers efficiently learns global dependencies within a causal context while avoiding considerable computational demands. Utilising discriminators during training further improves learning efficacy and enhancement quality without introducing additional complexity at inference. The proposed lightweight, causal, transformer-based architecture with adversarial training (LCT-GAN) yields SotA performance on instrumental metrics among contemporary lightweight models while incurring far less overhead. Compared to DeepFilterNet2, LCT-GAN requires only 6% of the parameters at similar complexity and performance. Against CCFNet+(Lite), LCT-GAN saves 9% in parameters and 10% in multiply-accumulate operations while yielding improved performance. Furthermore, LCT-GAN even outperforms more complex, common baseline models on widely used test datasets.
💡 Research Summary
This paper tackles the persistent dilemma of achieving state‑of‑the‑art (SotA) speech enhancement performance while meeting the strict computational budgets of edge devices. The authors first observe that transformer‑based temporal and spectral modeling delivers powerful global dependency learning, yet the naïve stacking of such modules leads to prohibitive parameter counts and multiply‑accumulate (MAC) operations. To address this, they propose a streamlined “Frequency‑Time‑Frequency” (FTF) stacking strategy. In the FTF bottleneck, three lightweight transformers are arranged sequentially: a frequency‑domain transformer that exchanges information across frequency bins within a single frame, a causal time‑domain transformer that propagates context from past frames to the current one (using a trapezoidal mask to enforce causality and limit context to ≤1 s), and a second frequency transformer that re‑integrates the updated temporal information back across the frequency axis. Each transformer combines a grouped GRU and a multi‑head attention (MHA) block; the time transformer shares GRU parameters across frequency, while the frequency transformers share parameters across time. This design yields a “ladder‑shaped” information flow that captures global dependencies with dramatically reduced overhead—27 % fewer MACs and 98 % fewer parameters compared with conventional large‑scale transformers.
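The causal, bounded-context attention described above can be illustrated with a small sketch. This is not the paper's implementation; the function name and the boolean-matrix representation are hypothetical, and the actual trapezoidal mask would be applied as additive `-inf` biases inside multi-head attention. The key property it demonstrates is that frame `t` may attend only to itself and a limited number of past frames, which is what makes the time transformer both causal and bounded to ≤1 s of context.

```python
def causal_band_mask(n_frames, max_context):
    """Attention mask for causal, bounded-context self-attention.

    Entry [t][s] is True iff query frame t is allowed to attend to
    key frame s, i.e. s lies in [t - max_context + 1, t].  The allowed
    region is a lower-triangular band: triangular near the start and
    saturating to a fixed-width band (hence "trapezoidal") once
    t >= max_context.  Illustrative sketch only.
    """
    return [[0 <= t - s < max_context for s in range(n_frames)]
            for t in range(n_frames)]


# Example: 6 frames, each frame sees at most 3 frames of context.
mask = causal_band_mask(6, 3)
for row in mask:
    print("".join("#" if allowed else "." for allowed in row))
```

In a real model, `max_context` would be chosen so that `max_context * hop_size` does not exceed the 1 s context budget mentioned above.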
The second major contribution is the integration of adversarial training, forming the LCT‑GAN architecture. The generator (LCT) follows a causal U‑Net encoder‑decoder with three convolutional layers, predicting a compressed‑domain ideal ratio mask (IRM) with a compression factor of 0.3. The authors deliberately avoid complex‑valued outputs, showing that magnitude‑only estimation suffices for high‑quality enhancement. During training, two discriminators are employed: a multi‑scale discriminator that evaluates spectrograms at several STFT resolutions, and a multi‑period discriminator that captures long‑term periodic structures. These discriminators provide additional loss terms (L_adv_gen and L_adv_dis) but are discarded at inference, incurring zero extra latency.
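To make the compressed-domain mask target concrete, here is a minimal sketch of one common way such a target is formed: compress the magnitudes with a power law (exponent 0.3, matching the compression factor above) and compute the ratio mask in that compressed domain. The function name and the exact mask formula are assumptions for illustration; the paper's precise formulation may differ.

```python
def compressed_irm(clean_mag, noise_mag, c=0.3):
    """Ideal ratio mask on power-law-compressed magnitudes.

    Illustrative definition (assumed, not the paper's exact formula):
    both magnitudes are compressed as |.|**c before forming the ratio,
    which flattens the dynamic range and makes low-energy bins easier
    for a small network to learn.  Output lies in [0, 1].
    """
    s, n = clean_mag ** c, noise_mag ** c
    return s / (s + n + 1e-12)  # epsilon guards against silence


# Example: equal speech and noise energy gives a mask of ~0.5;
# noise-free bins approach 1, speech-free bins approach 0.
print(compressed_irm(1.0, 1.0))   # balanced bin
print(compressed_irm(1.0, 0.0))   # clean bin
```

Because the compression exponent is shared between training target and inference, only the mask prediction itself runs on-device; the discriminators in the adversarial setup never enter this path.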
Extensive experiments were conducted on VoiceBank+Demand and DNS‑3 datasets. An ablation study compared various transformer stacking orders (TT, FF, TF, FT, TFT, FTF, etc.). The FTF configuration consistently achieved the best objective scores (e.g., PESQ 2.68, STOI 0.932, SI‑SDR 16.29 dB) while keeping the model causal and without look‑ahead frames. Complex‑valued masking (RI, MCS) increased MACs but did not improve the intrusive metrics, confirming the sufficiency of magnitude‑only masks.
When benchmarked against recent lightweight SotA models—DeepFilterNet2/3, CCFNet+Lite, FRCRN, S‑8.1GF, and others—LCT‑GAN required only about 6 % of DeepFilterNet2’s parameters (0.14 M vs. 2.31 M) and comparable MACs (0.35 G vs. 0.36 G). Despite this drastic reduction, LCT‑GAN matched or exceeded these baselines across composite metrics (CSIG, CBAK, COVL), STOI, and SI‑SDR. Adding Perceptual Contrast Stretching (PCS) further raised PESQ to 3.20, on par with DeepFilterNet3, while still maintaining the lightweight footprint. Latency was kept under 48 ms (one‑frame look‑ahead) and even 32 ms in a fully causal variant.
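The headline efficiency claim is simple arithmetic and can be checked directly from the figures quoted above (parameter counts in millions, MACs in billions):

```python
# Figures as reported in the comparison above.
params_lct_gan, params_dfn2 = 0.14e6, 2.31e6   # parameters
macs_lct_gan, macs_dfn2 = 0.35e9, 0.36e9       # multiply-accumulates

param_ratio = params_lct_gan / params_dfn2
mac_ratio = macs_lct_gan / macs_dfn2

print(f"parameters: {param_ratio:.1%} of DeepFilterNet2")  # ~6 %
print(f"MACs:       {mac_ratio:.1%} of DeepFilterNet2")    # ~97 %, i.e. comparable
```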
In summary, the paper introduces a highly efficient transformer‑based bottleneck (FTF) that preserves the expressive power of global attention while dramatically shrinking model size, and couples it with adversarial training that boosts perceptual quality without inference cost. The resulting LCT‑GAN delivers SotA instrumental performance with a fraction of the parameters and MACs of existing lightweight models, making it especially suitable for real‑time speech enhancement on resource‑constrained platforms such as smartphones, hearing aids, and IoT devices. Future work may explore extending the FTF concept to multi‑channel inputs, non‑causal configurations, or integrating phase‑aware reconstruction to further close the gap to full‑complex models.