Perceptual audio loss function for deep learning
PESQ and POLQA are standards for automated assessment of the voice quality of speech as experienced by human listeners. The predictions of these objective measures should come as close as possible to the subjective quality scores obtained in listening tests. WaveNet is a deep neural network originally developed as a generative model of raw audio waveforms. The WaveNet architecture is based on dilated causal convolutions, which exhibit very large receptive fields. In this short paper we suggest using the WaveNet architecture, in particular its large receptive field, to learn the PESQ algorithm. By doing so, it can be used as a differentiable loss function for speech enhancement.
💡 Research Summary
The paper addresses a long‑standing gap between objective distortion metrics such as Mean Squared Error (MSE) or Peak Signal‑to‑Noise Ratio (PSNR) and the subjective quality scores (Mean Opinion Score, MOS) that listeners actually perceive. While standards like PESQ (Perceptual Evaluation of Speech Quality) and POLQA (Perceptual Objective Listening Quality Assessment) are known to correlate well with MOS, their algorithms are complex, non‑differentiable, and incorporate long‑term temporal dependencies, making them unsuitable as direct loss functions for neural network training.
To bridge this gap, the authors propose to train a WaveNet‑style model to emulate the PESQ scoring process. WaveNet’s dilated causal convolutions provide an extremely large receptive field, which is essential for capturing the long‑range dependencies inherent in PESQ calculations. In the experimental setup, 0.25‑second audio segments (4095 samples at 16 kHz) of both clean speech and its degraded counterpart are fed simultaneously into the network. The degraded signals are generated by randomizing the phase of the Fourier transform of the clean speech, yielding “speech‑shaped noise” that preserves the original spectral magnitude while introducing perceptual distortion.
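The phase‑randomization step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code; the function name and the choice to keep the DC and Nyquist bins real (so the inverse real FFT is exact) are our own.

```python
import numpy as np

def speech_shaped_noise(clean, seed=None):
    """Randomize the Fourier phase of a real signal while keeping its
    magnitude spectrum, producing 'speech-shaped noise'."""
    rng = np.random.default_rng(seed)
    spec = np.fft.rfft(clean)
    phase = rng.uniform(0.0, 2.0 * np.pi, size=spec.shape)
    # Keep the DC and highest-frequency bins real so irfft reconstructs
    # exactly a real signal with the same magnitude spectrum.
    phase[0] = 0.0
    phase[-1] = 0.0
    degraded = np.fft.irfft(np.abs(spec) * np.exp(1j * phase), n=len(clean))
    return degraded
```

Because only the phase changes, the degraded signal has the same magnitude spectrum as the clean one, yet sounds like noise, which is exactly the kind of perceptual distortion a waveform‑level metric like MSE cannot distinguish well.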
The model is conditioned on speaker identity, following the original WaveNet conditioning scheme, allowing a single network to learn speaker‑specific PESQ mappings. Training is supervised: the network’s output is a scalar prediction of the full‑reference PESQ score, and the loss is the L2 distance between this prediction and the true PESQ value computed by the official ITU‑T P.862 reference implementation. After training, the network becomes a differentiable function P(x_clean, x_degraded) that can be inserted into any speech enhancement pipeline as a perceptual loss term.
The authors suggest a composite loss for speech denoising:
L_total = P(x_clean, x_degraded) + λ·MSE(x_clean, x_degraded)
where λ is a non‑negative weighting coefficient that balances the perceptual term against the MSE term.
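The composite loss above can be written down directly once a trained surrogate is available. In this sketch `pesq_net` is a stand‑in callable for the learned predictor P, and the default value of `lam` is purely illustrative; in a real pipeline both signals and the predictor would live inside an autodiff framework so gradients can flow.

```python
import numpy as np

def total_loss(x_clean, x_degraded, pesq_net, lam=0.5):
    """Composite denoising loss: L_total = P(clean, degraded) + lam * MSE.

    pesq_net -- stand-in for the trained differentiable PESQ predictor P.
    lam      -- weighting coefficient (illustrative default)."""
    x_clean = np.asarray(x_clean, dtype=float)
    x_degraded = np.asarray(x_degraded, dtype=float)
    perceptual = pesq_net(x_clean, x_degraded)
    mse = np.mean((x_clean - x_degraded) ** 2)
    return float(perceptual + lam * mse)
```

The MSE term keeps the optimization well‑behaved at the sample level, while the learned perceptual term steers the enhancer toward outputs that score well under PESQ.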