Global Rotation Equivariant Phase Modeling for Speech Enhancement with Deep Magnitude-Phase Interaction


While deep learning has advanced speech enhancement (SE), effective phase modeling remains challenging: conventional networks typically operate in a flat Euclidean feature space, which makes it difficult to capture the underlying circular topology of the phase. To address this, we propose a manifold-aware magnitude-phase dual-stream framework that aligns the phase stream with its intrinsic circular geometry by enforcing the Global Rotation Equivariance (GRE) property. Specifically, we introduce a Magnitude-Phase Interactive Convolutional Module (MPICM) for modulus-based information exchange and a Hybrid-Attention Dual-FFN (HADF) bottleneck for unified feature fusion, both of which are designed to preserve GRE in the phase stream. Comprehensive evaluations are conducted across phase retrieval, denoising, dereverberation, and bandwidth extension tasks to validate the superiority of the proposed method over multiple advanced baselines. Notably, the proposed architecture reduces Phase Distance by over 20% in the phase retrieval task and improves PESQ by more than 0.1 in zero-shot cross-corpus denoising evaluations. The overall superiority is also established in universal SE tasks involving mixed distortions. Qualitative analysis further reveals that the learned phase features exhibit distinct periodic patterns, consistent with the intrinsic circular nature of the phase. The source code is available at https://github.com/wangchengzhong/RENet.


💡 Research Summary

This paper tackles a long‑standing problem in deep learning‑based speech enhancement (SE): the mismatch between the circular topology of the phase spectrum and the Euclidean assumptions of conventional neural networks. While magnitude processing has been extensively studied, phase modeling remains difficult because most networks treat the complex spectrum as a flat vector, ignoring that phase lives on the unit circle (S¹). The authors propose a manifold‑aware, dual‑stream architecture that enforces Global Rotation Equivariance (GRE) on the phase branch, ensuring that the network’s operations commute with a global phase rotation Tθ(x)=x·e^{jθ}. In other words, the model becomes intrinsically insensitive to the absolute phase offset while preserving all relative phase structures (group delay, instantaneous phase, etc.).

To realize GRE, the authors identify two atomic operations that satisfy the equivariance condition: (1) bias‑free complex linear transformations (e.g., complex convolutions without additive bias) and (2) element‑wise modulation of the complex feature map by a rotation‑invariant real tensor. Building on these, they design two novel modules:
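Both atomic operations can be checked numerically: applying the global rotation Tθ(x) = x·e^{jθ} before or after the operation should give the same result. The sketch below (numpy, not the authors' code) verifies this for a bias-free complex linear map and for element-wise modulation by a magnitude-derived real gate:

```python
import numpy as np

rng = np.random.default_rng(0)

def rotate(x, theta):
    """Global phase rotation T_theta(x) = x * e^{j*theta}."""
    return x * np.exp(1j * theta)

theta = 0.7
x = rng.standard_normal(4) + 1j * rng.standard_normal(4)

# (1) Bias-free complex linear transformation y = W @ x.
# An additive complex bias would break equivariance; omitting it preserves it.
W = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
lhs = W @ rotate(x, theta)        # rotate first, then transform
rhs = rotate(W @ x, theta)        # transform first, then rotate
assert np.allclose(lhs, rhs)

# (2) Element-wise modulation by a rotation-invariant real tensor,
# e.g. a sigmoid gate computed from |x| (|x| is unchanged by rotation).
gate = 1.0 / (1.0 + np.exp(-np.abs(x)))
lhs = gate * rotate(x, theta)
rhs = rotate(gate * x, theta)
assert np.allclose(lhs, rhs)
```

Any composition of these two operations therefore also commutes with Tθ, which is what lets the full network preserve GRE end to end.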

  1. Magnitude‑Phase Interactive Convolutional Module (MPICM) – This replaces standard convolutions in both encoder and decoder. MPICM receives a real‑valued magnitude tensor M and a complex‑valued phase tensor P. It extracts the magnitude of P, scales it with a function of M (e.g., a normalized gating σ(M)), and recombines it with the original phase angle. Mathematically, the output is P′ = (|P|·σ(M))·e^{j∠P}. Because the scaling factor σ(M) is real and rotation‑invariant, the transformation preserves GRE.

  2. Hybrid‑Attention Dual‑FFN (HADF) – This serves as the bottleneck. Each HADF block contains a complex‑real hybrid attention mechanism followed by a dual feed‑forward network. The attention weights are computed from rotation‑invariant real features, guaranteeing that the attention operation does not break GRE. Two HADF blocks are stacked in a time‑first and frequency‑first order, allowing the model to capture long‑range temporal and spectral dependencies efficiently.
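The MPICM recombination P′ = (|P|·σ(M))·e^{j∠P} can be sketched in a few lines; the gating function σ and tensor shapes here are illustrative assumptions, not the paper's exact layer. The key property, verified at the end, is that a global rotation of P passes straight through the module:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mpicm_recombine(M, P):
    """Sketch of the MPICM exchange P' = (|P| * sigma(M)) * e^{j*angle(P)}.
    M: real-valued magnitude-stream feature; P: complex phase-stream feature.
    The real gate sigma(M) is rotation-invariant, so GRE is preserved."""
    return (np.abs(P) * sigmoid(M)) * np.exp(1j * np.angle(P))

rng = np.random.default_rng(1)
M = rng.standard_normal((2, 3))
P = rng.standard_normal((2, 3)) + 1j * rng.standard_normal((2, 3))
theta = 1.3

# GRE check: rotating the input rotates the output by the same angle,
# because |P| is invariant and angle(P) shifts by exactly theta.
out_of_rotated = mpicm_recombine(M, P * np.exp(1j * theta))
rotated_output = mpicm_recombine(M, P) * np.exp(1j * theta)
assert np.allclose(out_of_rotated, rotated_output)
```

The same invariance argument applies to the HADF attention weights: as long as they are computed from rotation-invariant real features such as |P|, multiplying the complex values by them cannot break GRE.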

The overall architecture follows an encoder‑bottleneck‑decoder paradigm. The encoder and decoder consist of a Dilated DenseNet where every dense connection respects the alignment of magnitude and phase channels. Down‑sampling and up‑sampling are performed by MPICM layers that keep the GRE property intact.

Training objectives combine a magnitude loss (log‑spectrogram L1/L2, SI‑SNR) with a phase‑specific loss (phase distance and complex L2). The phase loss directly penalizes angular errors while the GRE constraint ensures that the network does not waste capacity learning an arbitrary global phase offset.
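A phase loss must account for the 2π wrap-around of angles; a plain L1 distance on raw angles heavily over-penalizes estimates that are actually close on the circle. The snippet below shows one common anti-wrapping formulation (the paper's exact Phase Distance definition may differ):

```python
import numpy as np

def phase_distance(phi_est, phi_ref):
    """Wrap-aware angular error: map the raw difference into (-pi, pi]
    via atan2, then average the absolute value. This is a common
    formulation, assumed here for illustration."""
    diff = phi_est - phi_ref
    wrapped = np.arctan2(np.sin(diff), np.cos(diff))
    return np.mean(np.abs(wrapped))

# Example: 3.1 rad and -3.1 rad are only ~0.083 rad apart on the circle,
# but a naive L1 on raw angles reports an error of 6.2 rad.
phi_ref = np.array([3.1])
phi_est = np.array([-3.1])
naive = np.mean(np.abs(phi_est - phi_ref))     # ~6.2
wrapped = phase_distance(phi_est, phi_ref)     # ~0.083
assert wrapped < 0.1 < naive
```

Combined with the GRE constraint, such a loss only has to penalize relative angular errors; the global offset is handled structurally by the architecture rather than learned.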

Experimental evaluation is extensive. The authors test on four representative SE tasks: (i) phase retrieval (pure phase reconstruction given clean magnitude), (ii) additive‑noise denoising, (iii) dereverberation, and (iv) bandwidth extension. Datasets include WSJ0, VCTK, DNS‑2023, and simulated reverberant corpora. Key results:

  • Phase retrieval – The proposed model reduces Phase Distance by ~20 % compared with state‑of‑the‑art complex‑mask networks (e.g., CTS‑Net, DB‑AIA‑T).
  • Zero‑shot cross‑corpus denoising – PESQ improves by >0.1 over baselines, demonstrating strong generalization to unseen noise types.
  • Dereverberation – Gains of ~0.2 dB SDR and 0.1 PESQ over the best competing methods.
  • Bandwidth extension – High‑frequency reconstruction benefits from consistent phase continuity, yielding MOS and PESQ improvements of 0.12 and 0.08 points respectively.

The model uses roughly 3.2 M parameters and 1.1 G FLOPs, comparable to or slightly lower than existing SOTA systems, yet consistently outperforms them across all metrics.

A qualitative analysis visualizes the learned phase embeddings, revealing clear periodic patterns that align with the underlying S¹ symmetry, confirming that the network has internalized the circular nature of phase.

In summary, the paper makes three major contributions: (1) introducing a principled GRE constraint as a structural inductive bias for phase modeling, (2) designing MPICM and HADF modules that enable deep magnitude‑phase interaction without violating GRE, and (3) demonstrating that this approach yields superior phase accuracy and overall SE performance across a wide range of tasks while maintaining computational efficiency. The work opens a new direction for geometry‑aware neural design in audio signal processing, suggesting future extensions to other symmetries (e.g., time‑shift equivariance) and to broader audio domains such as music source separation.

