From Diet to Free Lunch: Estimating Auxiliary Signal Properties using Dynamic Pruning Masks in Speech Enhancement Networks

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Speech Enhancement (SE) in audio devices is often supported by auxiliary modules for Voice Activity Detection (VAD), SNR estimation, or Acoustic Scene Classification to ensure robust, context-aware behavior and a seamless user experience. Just like SE, these tasks often employ deep learning; however, deploying additional models on-device is computationally impractical, whereas cloud-based inference would introduce additional latency and compromise privacy. Prior work on SE employed Dynamic Channel Pruning (DynCP) to reduce computation by adaptively disabling specific channels based on the current input. In this work, we investigate whether useful signal properties can be estimated from these internal pruning masks, thus removing the need for separate models. We show that simple, interpretable predictors achieve up to 93% accuracy on VAD, 84% on noise classification, and an R² of 0.86 on F0 estimation. With binary masks, predictions reduce to weighted sums, inducing negligible overhead. Our contribution is twofold: on one hand, we examine the emergent behavior of DynCP models through the lens of downstream prediction tasks, to reveal what they are learning; on the other, we repurpose DynCP as a holistic solution for efficient SE and simultaneous estimation of signal properties.


💡 Research Summary

This paper investigates whether the internal binary pruning masks generated by a Dynamic Channel Pruning (DynCP) speech‑enhancement (SE) network can serve as a source of information for a variety of auxiliary speech‑processing tasks, thereby eliminating the need for separate on‑device models. The authors build upon prior work that uses DynCP to adaptively disable convolutional channels in a Conv‑FSENet SE backbone based on the current acoustic input. Each processing block contains a lightweight gating sub‑network that outputs a binary mask indicating which channels are active for a given time‑frequency frame.
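
The gating sub-network described above can be sketched as a tiny two-layer network that maps a frame's features to a per-channel on/off decision. The weights, dimensions, and hard 0.5 threshold below are illustrative assumptions, not the paper's exact architecture (training such a gate would additionally require a straight-through or Gumbel-style estimator, omitted here):

```python
import numpy as np

def gating_mask(x, W1, b1, W2, b2):
    """Toy gating subnet: frame features -> 16 hidden units -> per-channel gate.

    Returns a binary mask (1 = channel active) by thresholding sigmoid
    scores at 0.5, plus the raw scores themselves.
    """
    h = np.maximum(x @ W1 + b1, 0.0)                 # hidden layer with ReLU
    scores = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))    # per-channel gating scores
    return (scores > 0.5).astype(np.float32), scores

# Illustrative dimensions: 128-d frame features, 16 hidden units,
# 128 residual channels per block, as reported in the summary.
rng = np.random.default_rng(0)
feat_dim, hidden, channels = 128, 16, 128
W1 = rng.normal(scale=0.1, size=(feat_dim, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(hidden, channels)); b2 = np.zeros(channels)
mask, scores = gating_mask(rng.normal(size=feat_dim), W1, b1, W2, b2)
```

The binary `mask` is what the SE network uses to skip channel computations, and it is also the quantity the rest of the paper mines for auxiliary information.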

The study proceeds in three stages. First, a DynCP‑enhanced SE model is trained on the VoiceBank+DEMAND corpus (≈30 min of speech and noise for training, the same amount for testing). The model comprises nine blocks, each with 128 residual channels; the gating subnet has 16 hidden units. During inference, the binary mask tensor G (dimensions: frames × blocks × channels) is recorded. Because many channels are either always on or always off, the authors filter out low‑variance masks (standard deviation < τ = 0.005), retaining 202 informative binary features (≈18 % of the total).
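
The low-variance filtering step might look like the following sketch; the function name `filter_masks` and the toy tensor are assumptions for illustration, using the shapes reported above (frames × 9 blocks × 128 channels):

```python
import numpy as np

def filter_masks(G, tau=0.005):
    """Keep only gating channels whose binary mask varies across frames.

    G: binary mask tensor of shape (frames, blocks, channels), flattened
    to (frames, blocks * channels) before filtering. Channels that are
    always on or always off have zero standard deviation and are dropped.
    """
    T = G.shape[0]
    flat = G.reshape(T, -1)             # (frames, blocks * channels)
    keep = flat.std(axis=0) > tau       # True for informative channels
    return flat[:, keep], keep

# Toy example: 100 frames, 9 blocks, 128 channels each.
rng = np.random.default_rng(0)
G = (rng.random((100, 9, 128)) > 0.5).astype(np.float32)
G[:, 0, :10] = 1.0                      # always-on channels get filtered out
G_tilde, keep = filter_masks(G)
```

In the paper, this filtering retains 202 of the roughly 1,150 mask dimensions; in the toy example the ten constant channels are the ones removed.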

Second, these filtered masks (denoted G̃) are used as input features for a suite of downstream tasks:

  • Classification: Voice Activity Detection (VAD), gender, accent, and noise‑category classification.
  • Regression: frame‑wise input and enhanced SNR, SI‑SDR, PESQ, and fundamental frequency (F0).

For each task, a simple linear (regression) or logistic (classification) model is trained, with ℓ2 regularization (α = 0.01) to mitigate collinearity. Because the inputs are binary, prediction reduces to weighted sums, incurring virtually no extra computation on the device. Baselines include the noisy log‑magnitude spectrogram (257 dimensions) and the SE model’s predicted suppression mask (also 257 dimensions). Additional ablations explore using only the first two blocks (67 features), the top‑64 most informative features (selected by regression coefficients), and the raw gating scores before binarization (R̃).

Third, the authors evaluate performance across all tasks. VAD achieves 93 % accuracy and 0.97 ROC‑AUC, substantially outperforming both baselines. Noise‑category classification reaches 84 % accuracy. Gender and accent classification, restricted to frames with speech activity, obtain 88 % and 81 % accuracy respectively. Regression results show strong correlations: input SNR R² = 0.78, enhanced SNR R² = 0.84, SI‑SDR R² = 0.71, PESQ R² = 0.68, and F0 R² = 0.86. Notably, the top‑64 binary features alone achieve performance comparable to using all 202 masks, confirming that most predictive power is concentrated in a small subset.

The paper also explores speaker verification using mask‑derived embeddings. For each utterance, frames with speech activity are averaged, ℓ2‑normalized, and fed to a standard linear backend (Within‑Class Covariance Normalization, Linear Discriminant Analysis, length normalization, cosine scoring). The resulting Equal Error Rate (EER) is 12.3 %, slightly higher than the 10.1 % obtained with STFT log‑magnitude embeddings but achieved with far fewer operations. t‑SNE visualizations of the mask space reveal clustering according to noise type, speech presence, and speaker gender, indicating that the gating subnet implicitly learns high‑level acoustic attributes. Coefficient analysis of the top‑64 features shows that mid‑to‑late blocks contribute most to discriminating voice activity and noise level, while early blocks are less informative.
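
The embedding construction can be sketched as follows. This omits the WCCN/LDA backend (which requires held-out training data) and keeps only the averaging, ℓ2 normalization, and cosine-scoring steps; function names and the toy data are assumptions:

```python
import numpy as np

def utterance_embedding(masks, vad):
    """Average mask vectors over speech-active frames, then l2-normalize.

    masks: (frames, features) binary mask matrix for one utterance.
    vad:   boolean speech-activity flag per frame.
    """
    emb = masks[vad].mean(axis=0)
    return emb / np.linalg.norm(emb)

def cosine_score(e1, e2):
    # Embeddings are unit-length, so the dot product is the cosine similarity.
    return float(e1 @ e2)

# Toy utterances: 50 frames of 202 binary mask features each.
rng = np.random.default_rng(2)
m1 = (rng.random((50, 202)) > 0.3).astype(np.float64)
m2 = (rng.random((50, 202)) > 0.3).astype(np.float64)
v = rng.random(50) > 0.2                    # stand-in VAD decisions
s = cosine_score(utterance_embedding(m1, v), utterance_embedding(m2, v))
```

In a full verification pipeline, the normalized embeddings would pass through WCCN and LDA before cosine scoring, as described above.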

Overall, the authors make two principal contributions: (1) they empirically demonstrate that DynCP masks encode rich acoustic metadata beyond their original purpose of computational budgeting, and (2) they show that this information can be extracted with ultra‑lightweight linear models, enabling real‑time, on‑device multi‑task inference with negligible overhead. The work suggests a new paradigm where dynamic neural networks serve simultaneously as efficient processors and as feature extractors for auxiliary tasks. Future directions include integrating mask‑based features directly into multi‑task training, refining mask‑derived speaker embeddings, and extending the approach to other audio domains such as music or environmental sound analysis.

