Complex spectrogram enhancement by convolutional neural network with multi-metrics learning
📝 Abstract
This paper aims to address two issues existing in the current speech enhancement methods: 1) the difficulty of phase estimations; 2) a single objective function cannot consider multiple metrics simultaneously. To solve the first problem, we propose a novel convolutional neural network (CNN) model for complex spectrogram enhancement, namely estimating clean real and imaginary (RI) spectrograms from noisy ones. The reconstructed RI spectrograms are directly used to synthesize enhanced speech waveforms. In addition, since the log-power spectrogram (LPS) can be represented as a function of RI spectrograms, its reconstruction is also considered as another target. Thus a unified objective function, which combines these two targets (reconstruction of RI spectrograms and LPS), is equivalent to simultaneously optimizing two commonly used objective metrics: segmental signal-to-noise ratio (SSNR) and log-spectral distortion (LSD). Therefore, the learning process is called multi-metrics learning (MML). Experimental results confirm the effectiveness of the proposed CNN with RI spectrograms and MML in terms of improved standardized evaluation metrics on a speech enhancement task.
📄 Content
Szu-Wei Fu 12, Ting-yao Hu3, Yu Tsao1, Xugang Lu4
1 Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan; 2 Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan; 3 Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA; 4 National Institute of Information and Communications Technology, Kyoto, Japan
Index Terms—Convolutional neural network, complex spectrogram, speech enhancement, phase processing, multi-objective learning
1. INTRODUCTION
Recently, various types of deep-learning-based denoising models have been proposed and extensively investigated [1-12]. They have demonstrated superior ability to model the non-linear relationship between noisy and clean speech compared to traditional speech enhancement models. However, most existing denoising models focus only on processing the magnitude spectrogram (e.g., the log-power spectrogram, LPS), leaving the phase in its original noisy condition. This may be because there is no clear structure in the phase spectrogram, which makes estimating clean phase from noisy phase extremely difficult [13]. On the other hand, several studies have shown the importance of phase when spectrograms are resynthesized back into time-domain waveforms. Le Roux [14] demonstrated that when the inconsistency between magnitude and phase spectrograms is maximized, the same magnitude spectrogram can lead to extremely diverse resynthesized sounds, depending on the phase with which it is combined. Paliwal et al. [15] confirmed the importance of phase for perceptual quality in speech enhancement, especially when the window overlap and the length of the Fourier transform are increased.
To further improve the performance of speech enhancement, phase information has been considered in some recent studies [13, 16-19]. For time-domain signal reconstruction, Wang et al. [18] proposed a deep neural network (DNN) model that learns an optimal masking function given the noisy phase. Williamson et al. [13, 19] found that the structures in real and imaginary (RI) spectrograms are similar to those of magnitude spectrograms. Therefore, they employed a DNN to estimate the complex ratio mask (cRM) from a set of complementary features, so that magnitude and phase can be jointly enhanced. The quality of cRM-enhanced speech is improved compared to the ideal ratio mask (IRM) based model.
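As a minimal sketch (not the authors' code), the ideal complex ratio mask described above can be written as the element-wise complex quotient of the clean and noisy STFTs, so that applying the mask to the noisy spectrogram recovers the clean one; the function name and the `eps` stabilizer are illustrative assumptions.

```python
import numpy as np

def complex_ratio_mask(S, Y, eps=1e-8):
    """Ideal complex ratio mask M with S = M * Y; S, Y are complex STFT arrays.

    The complex division S / Y is written with the conjugate of Y to make
    the real/imaginary arithmetic explicit.
    """
    return (S * np.conj(Y)) / (np.abs(Y) ** 2 + eps)

# Toy check: masking the noisy spectrogram recovers the clean one (up to eps).
rng = np.random.default_rng(0)
S = rng.standard_normal((3, 4)) + 1j * rng.standard_normal((3, 4))
Y = S + 0.1 * (rng.standard_normal((3, 4)) + 1j * rng.standard_normal((3, 4)))
M = complex_ratio_mask(S, Y)
print(np.allclose(M * Y, S, atol=1e-4))  # → True
```

Note that both the real and imaginary parts of such a mask are unbounded, which is one practical motivation for estimating the RI spectrograms directly, as proposed in this paper.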
In this paper, we estimate clean RI spectrograms directly from noisy ones, instead of the complementary features (e.g., amplitude modulation spectrogram, relative spectral transform, and perceptual linear prediction) used in [13]. To efficiently exploit the relation between the RI spectrograms, they are treated as different input channels in the proposed convolutional neural network (CNN) model.
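The input arrangement described above can be sketched as follows: the real and imaginary parts of the noisy STFT are stacked as two "image" channels, so a standard 2-D CNN can convolve over time and frequency jointly. The helper name is a hypothetical illustration, not the authors' implementation.

```python
import numpy as np

def stft_to_ri_channels(Y):
    """Y: complex (freq, time) spectrogram -> float array (2, freq, time).

    Channel 0 holds the real spectrogram, channel 1 the imaginary one,
    matching the channels-first layout common in CNN frameworks.
    """
    return np.stack([Y.real, Y.imag], axis=0)

Y = np.array([[1 + 2j, 3 - 1j]])  # tiny example: 1 frequency bin, 2 frames
x = stft_to_ri_channels(Y)
print(x.shape)                   # (2, 1, 2)
print(x[0, 0, 0], x[1, 0, 0])    # 1.0 2.0
```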
Since the goal of speech enhancement is to improve the intelligibility and quality of noisy speech [20], several objective metrics have to be applied to evaluate the performance in different aspects. For example, segmental signal-to-noise ratio (SSNR, in dB) measures the signal difference in the time domain, and log-spectral distortion (LSD, in dB) [21] measures the spectrogram difference. Because the outputs of the proposed CNN are RI spectrograms, which do not lose any information from the raw waveform, other signal representations (e.g., waveform, log-power spectrum) can be derived from them. Using this characteristic, several metrics can be optimized simultaneously by including them in the objective function of our CNN. Each target corresponds to a metric; hence, the learning process is called multi-metrics learning (MML).
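A hedged sketch of the multi-metrics idea: because the network outputs RI spectrograms, the log-power spectrogram is a deterministic function of them, LPS = log(R² + I²), so an LPS reconstruction term can be added to the RI loss without any extra network outputs. The function names, the `alpha` weight, and the `eps` stabilizer are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def lps(R, I, eps=1e-8):
    """Log-power spectrogram as a function of the RI spectrograms."""
    return np.log(R ** 2 + I ** 2 + eps)

def multi_metrics_loss(R_hat, I_hat, R, I, alpha=0.1):
    """Combined objective over two targets derived from the same outputs."""
    ri_term = np.mean((R_hat - R) ** 2 + (I_hat - I) ** 2)    # RI (SSNR-related) term
    lps_term = np.mean((lps(R_hat, I_hat) - lps(R, I)) ** 2)  # LPS (LSD-related) term
    return ri_term + alpha * lps_term

# A perfect estimate gives zero loss:
R = np.ones((2, 3)); I = 0.5 * np.ones((2, 3))
print(multi_metrics_loss(R, I, R, I))  # 0.0
```

Because both terms are differentiable functions of the same RI outputs, a single backward pass trains the network toward both metrics at once.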