A Dual-Branch Parallel Network for Speech Enhancement and Restoration
We present a novel general speech restoration model, DBP-Net (dual-branch parallel network), designed to effectively handle complex real-world distortions including noise, reverberation, and bandwidth degradation. Unlike prior approaches that rely on a single processing path or on separate models for enhancement and restoration, DBP-Net introduces a unified architecture with two parallel branches: a masking-based branch for distortion suppression and a mapping-based branch for spectrum reconstruction. A key innovation of DBP-Net is parameter sharing between the two branches combined with a cross-branch skip fusion, in which the output of the masking branch is explicitly fused into the mapping branch. This design enables DBP-Net to leverage two complementary learning strategies, suppression and generation, simultaneously within a lightweight framework. Experimental results show that DBP-Net significantly outperforms existing baselines on comprehensive speech restoration tasks while maintaining a compact model size. These findings suggest that DBP-Net offers an effective and scalable solution for unified speech enhancement and restoration across diverse distortion scenarios.
💡 Research Summary
The paper introduces DBP‑Net, a Dual‑Branch Parallel Network designed for unified speech restoration under the simultaneous presence of three common degradations: additive noise, reverberation, and bandwidth limitation. Traditional approaches either focus on a single distortion type or employ separate modules and sequential pipelines, leading to sub‑optimal performance and high computational cost when multiple distortions co‑occur. DBP‑Net tackles this challenge by integrating two parallel processing branches within a single lightweight architecture.
The first branch adopts a masking‑based strategy (Sigmoid activation) to generate a suppression mask that attenuates noise and reverberation primarily in the low‑frequency region. The second branch follows a mapping‑based strategy (ReLU activation) that directly predicts the clean magnitude spectrum, with a particular emphasis on reconstructing the missing high‑frequency components caused by bandwidth reduction. Both branches share the same Conformer‑based encoder and backbone parameters, which drastically reduces model size while encouraging complementary feature learning.
A novel cross‑branch skip‑fusion mechanism is introduced to bridge the two branches. The low‑frequency output of the masking branch (M_mask) is multiplied by a learnable scalar α (≈0.38 after training) and added to the output of the mapping branch (M_map), yielding the fused magnitude estimate M̂ = M_map + α·M_mask. This selective fusion supplies the generative branch with clean low‑frequency information without re‑introducing residual noise, thereby balancing suppression and generation.
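The fusion rule M̂ = M_map + α·M_mask is simple enough to sketch directly. The snippet below is an illustrative sketch, not the paper's code: the function name, the `lf_bins` argument for restricting the fusion to low-frequency bins, and the fixed α are all assumptions; in the model α is a learnable scalar (reported ≈0.38 after training).

```python
import numpy as np

def skip_fusion(m_map, m_mask, alpha=0.38, lf_bins=None):
    """Fuse the masking-branch estimate into the mapping-branch estimate.

    m_map, m_mask: (freq, time) magnitude estimates.
    alpha: the learnable fusion scalar (~0.38 after training per the
    summary), fixed here for illustration.
    lf_bins: fuse only the lowest `lf_bins` frequency bins, mirroring the
    low-frequency-only fusion described above; None fuses all bins.
    """
    fused = m_map.copy()
    sl = slice(None) if lf_bins is None else slice(0, lf_bins)
    fused[sl] += alpha * m_mask[sl]  # M̂ = M_map + α·M_mask (on selected bins)
    return fused

# toy example: 4 frequency bins x 3 frames, fuse only the 2 lowest bins
m_map = np.ones((4, 3))
m_mask = np.full((4, 3), 0.5)
fused = skip_fusion(m_map, m_mask, lf_bins=2)
print(fused[0, 0], fused[3, 0])  # 1.19 1.0
```

Restricting the addition to low-frequency bins is what keeps the generative branch from inheriting residual high-frequency noise from the mask.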
The encoder backbone consists of a two‑stage Conformer architecture. In the first stage, the frequency dimension is folded into the batch axis so that the Conformer blocks, which combine multi‑head self‑attention with convolutional modules, attend along the time axis and capture long‑range temporal dependencies. In the second stage, the time dimension is folded into the batch axis and the blocks attend along frequency to model spectral correlations. This disentangled temporal‑spectral modeling yields robust representations suitable for both suppression and reconstruction tasks.
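The dual-stage reshaping can be sketched with plain array operations; which axis is folded into the batch at each stage determines whether attention runs over time or over frequency. This is a minimal sketch assuming MP-SENet-style dual-path reshaping; the shapes and the layout (batch, time, freq, channels) are illustrative assumptions.

```python
import numpy as np

# Hypothetical encoded features: (batch, time, freq, channels)
B, T, F, C = 2, 100, 257, 8
rng = np.random.default_rng(0)
x0 = rng.standard_normal((B, T, F, C))

# Temporal stage: to attend over time, fold frequency into the batch
# axis so each (T, C) sequence is one independent item.
x_t = x0.transpose(0, 2, 1, 3).reshape(B * F, T, C)
# ... time-axis Conformer blocks would process x_t here ...
x1 = x_t.reshape(B, F, T, C).transpose(0, 2, 1, 3)

# Spectral stage: to attend over frequency, fold time into batch.
x_f = x1.reshape(B * T, F, C)
# ... frequency-axis Conformer blocks would process x_f here ...
x2 = x_f.reshape(B, T, F, C)

print(x2.shape)  # (2, 100, 257, 8)
```

With the Conformer blocks elided, the two reshapes round-trip exactly, which is a convenient sanity check when wiring up a dual-path model.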
Phase information is handled by a dedicated phase decoder that receives the same encoded representation and predicts a clean phase spectrum using sine‑cosine embedding for the wrapped phase. The final complex spectrum is reconstructed as X̂ = M̂·e^{jP̂} and transformed back to the time domain via inverse STFT.
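The reconstruction step X̂ = M̂·e^{jP̂} followed by the inverse STFT can be demonstrated end-to-end with SciPy. This sketch substitutes a round-trip on a test tone for the model's predicted magnitude and phase; the sine-cosine recovery of the wrapped phase via `arctan2` mirrors the decoder's embedding as described above.

```python
import numpy as np
from scipy.signal import stft, istft

# Toy signal -> STFT -> magnitude/phase -> recombine -> inverse STFT.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)

_, _, X = stft(x, fs=fs, nperseg=512)
mag, phase = np.abs(X), np.angle(X)

# A sine-cosine embedded phase is recovered with arctan2, which returns
# the wrapped phase in (-pi, pi].
p_hat = np.arctan2(np.sin(phase), np.cos(phase))

# Final reconstruction used by the model: X̂ = M̂ · e^{jP̂}
X_hat = mag * np.exp(1j * p_hat)
_, x_hat = istft(X_hat, fs=fs, nperseg=512)

print(np.allclose(x, x_hat[:len(x)], atol=1e-8))  # True
```

In the model, `mag` would be the fused magnitude M̂ and `p_hat` the decoder's phase prediction; the round-trip here simply verifies the reconstruction path.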
Training employs a multi‑level loss: an L1 time‑domain loss, magnitude L2 loss, complex‑spectrum L2 loss, a PESQ‑based metric loss, and an anti‑wrapping phase loss. These components are weighted by hyper‑parameters γ₁…γ₅ (0.2, 0.9, 0.1, 0.05, 0.3) following the MP‑SENet setup. This composite loss encourages both objective distortion reduction and perceptual quality improvement.
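The composite loss is a weighted sum of the five terms above. In this sketch the PESQ-based metric loss is passed in as a placeholder scalar (it is not expressible as a simple numpy formula), and the anti-wrapping penalty is written as |Δ − 2π·round(Δ/2π)|, the standard anti-wrapping function; both the function names and this exact form are assumptions based on the MP-SENet setup the summary cites.

```python
import numpy as np

def antiwrap(d):
    """Map phase differences into [-pi, pi) before the L1 penalty."""
    return np.abs(d - 2 * np.pi * np.round(d / (2 * np.pi)))

def composite_loss(y, y_hat, mag, mag_hat, spec, spec_hat,
                   phase, phase_hat, pesq_term=0.0,
                   gammas=(0.2, 0.9, 0.1, 0.05, 0.3)):
    """Weighted multi-level loss with gamma_1..gamma_5 from the summary."""
    l_time  = np.mean(np.abs(y - y_hat))             # L1, time domain
    l_mag   = np.mean((mag - mag_hat) ** 2)          # L2, magnitude
    l_cplx  = np.mean(np.abs(spec - spec_hat) ** 2)  # L2, complex spectrum
    l_phase = np.mean(antiwrap(phase - phase_hat))   # anti-wrapping phase
    g1, g2, g3, g4, g5 = gammas
    return (g1 * l_time + g2 * l_mag + g3 * l_cplx
            + g4 * pesq_term + g5 * l_phase)

# identical estimates -> every term vanishes
y = np.zeros(16)
mag = np.ones((5, 4))
spec = mag.astype(complex)
phase = np.zeros((5, 4))
print(composite_loss(y, y, mag, mag, spec, spec, phase, phase))  # 0.0
```

The anti-wrapping term is what lets the phase loss ignore harmless 2π jumps: a prediction off by exactly one full cycle incurs zero penalty.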
Experiments use a synthetic dataset built from VCTK speech, DEMAND noise (0–20 dB SNR), simulated room impulse responses (RT₆₀ 0.3–0.9 s), and various low‑pass filters (cut‑off 2–4 kHz) to emulate bandwidth loss. The model is trained for 1 M steps with AdamW (lr 0.0005). Evaluation metrics include CSIG, CBAK, COVL (subjective MOS predictors), PESQ, STOI, SRMR (reverberation suppression), and LSD (bandwidth extension).
DBP‑Net, with only 2.05 M parameters, outperforms large baselines such as VoiceFixer (122 M), HD‑DEMUCS (24 M), and SGMSE+ (65 M) across all metrics (e.g., CSIG 3.90 vs 3.35, PESQ 2.61 vs 2.37, LSD 2.24 dB vs 2.72 dB). Ablation studies show that removing the skip‑fusion or the parameter sharing leads to notable performance drops, confirming the importance of these design choices.
In summary, DBP‑Net provides an elegant solution that unifies suppression‑based and generation‑based learning within a compact network, achieving state‑of‑the‑art speech restoration under complex, multi‑distortion conditions. Its lightweight nature makes it suitable for real‑time or mobile applications, and the architecture offers a promising foundation for future extensions to multi‑modal or multi‑channel speech enhancement scenarios.