Complex spectrogram enhancement by convolutional neural network with multi-metrics learning


📝 Abstract

This paper aims to address two issues existing in the current speech enhancement methods: 1) the difficulty of phase estimations; 2) a single objective function cannot consider multiple metrics simultaneously. To solve the first problem, we propose a novel convolutional neural network (CNN) model for complex spectrogram enhancement, namely estimating clean real and imaginary (RI) spectrograms from noisy ones. The reconstructed RI spectrograms are directly used to synthesize enhanced speech waveforms. In addition, since log-power spectrogram (LPS) can be represented as a function of RI spectrograms, its reconstruction is also considered as another target. Thus a unified objective function, which combines these two targets (reconstruction of RI spectrograms and LPS), is equivalent to simultaneously optimizing two commonly used objective metrics: segmental signal-to-noise ratio (SSNR) and logspectral distortion (LSD). Therefore, the learning process is called multi-metrics learning (MML). Experimental results confirm the effectiveness of the proposed CNN with RI spectrograms and MML in terms of improved standardized evaluation metrics on a speech enhancement task.


📄 Content

COMPLEX SPECTROGRAM ENHANCEMENT BY CONVOLUTIONAL NEURAL NETWORK WITH MULTI-METRICS LEARNING

Szu-Wei Fu 1,2, Ting-yao Hu 3, Yu Tsao 1, Xugang Lu 4

1 Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan
2 Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
3 Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
4 National Institute of Information and Communications Technology, Kyoto, Japan

ABSTRACT

This paper aims to address two issues existing in the current speech enhancement methods: 1) the difficulty of phase estimations; 2) a single objective function cannot consider multiple metrics simultaneously. To solve the first problem, we propose a novel convolutional neural network (CNN) model for complex spectrogram enhancement, namely estimating clean real and imaginary (RI) spectrograms from noisy ones. The reconstructed RI spectrograms are directly used to synthesize enhanced speech waveforms. In addition, since log-power spectrogram (LPS) can be represented as a function of RI spectrograms, its reconstruction is also considered as another target. Thus a unified objective function, which combines these two targets (reconstruction of RI spectrograms and LPS), is equivalent to simultaneously optimizing two commonly used objective metrics: segmental signal-to-noise ratio (SSNR) and log-spectral distortion (LSD). Therefore, the learning process is called multi-metrics learning (MML). Experimental results confirm the effectiveness of the proposed CNN with RI spectrograms and MML in terms of improved standardized evaluation metrics on a speech enhancement task.

Index Terms—Convolutional neural network, complex spectrogram, speech enhancement, phase processing, multi-objective learning

  1. INTRODUCTION

Recently, various types of deep-learning-based denoising models have been proposed and extensively investigated [1-12]. They have demonstrated superior ability to model the non-linear relationship between noisy and clean speech compared to traditional speech enhancement models. However, most existing denoising models focus only on processing the magnitude spectrogram (e.g., log-power spectrogram, LPS), leaving the phase in its original noisy condition. This may be because there is no clear structure in the phase spectrogram, which makes estimating clean phase from noisy phase extremely difficult [13]. On the other hand, some research has shown the importance of phase when spectrograms are resynthesized back into time-domain waveforms. Le Roux [14] demonstrated that when the inconsistency between magnitude and phase spectrograms is maximized, the same magnitude spectrogram can lead to extremely diverse resynthesized sounds, depending on the phase with which it is combined. Paliwal et al. [15] confirmed the importance of phase for perceptual quality in speech enhancement, especially when window overlap and length of the Fourier transform are increased. To further improve the performance of speech enhancement, phase information is considered in some up-to-date research [13, 16-19]. For time-domain signal reconstruction, Wang et al. [18] proposed a deep neural network (DNN) model that tries to learn an optimal masking function given the noisy phase. Williamson et al. [13, 19] found that the structures in real and imaginary (RI) spectrograms are similar to those of magnitude spectrograms. Therefore, they employed a DNN for estimating the complex ratio mask (cRM) from a set of complementary features, so that magnitude and phase can be jointly enhanced. The quality of the cRM-enhanced speech is improved compared to the ideal ratio mask (IRM) based model.
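Applying a complex ratio mask amounts to a complex multiplication between the mask and the noisy spectrogram, which rotates and scales each time-frequency bin and thereby modifies both magnitude and phase. A minimal sketch of that operation; the function name and NumPy framing are illustrative, not taken from [13, 19]:

```python
import numpy as np

def apply_crm(noisy_real, noisy_imag, mask_real, mask_imag):
    """Apply a complex ratio mask (cRM) to a noisy complex spectrogram.

    Complex multiplication (Mr + j*Mi) * (Yr + j*Yi):
      enhanced_real = Mr*Yr - Mi*Yi
      enhanced_imag = Mr*Yi + Mi*Yr
    """
    enh_real = mask_real * noisy_real - mask_imag * noisy_imag
    enh_imag = mask_real * noisy_imag + mask_imag * noisy_real
    return enh_real, enh_imag
```

With the identity mask (Mr = 1, Mi = 0) the spectrogram passes through unchanged, while any nonzero imaginary part of the mask shifts the phase, which is what distinguishes the cRM from a magnitude-only mask such as the IRM.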
In this paper, we estimate clean RI spectrograms directly from noisy ones instead of complementary features (e.g., amplitude modulation spectrogram, relative spectral transform and perceptual linear prediction, etc.) used in [13]. To efficiently exploit the relation between RI spectrograms, they are treated as different input channels in the proposed convolutional neural network (CNN) model. Since the goal of speech enhancement is to improve the intelligibility and quality of noisy speech [20], several objective metrics have to be applied to evaluate the performance in different aspects. For example, segmental signal-to-noise ratio (SSNR in dB) measures the signal difference in the time domain, and log-spectral distortion (LSD in dB) [21] measures the spectrogram difference. Because the outputs of the proposed CNN are RI spectrograms, which do not lose any information from the raw waveform, other signal representations (e.g., waveform, log-power spectrogram) can be derived from them. Using this characteristic, several metrics can also be optimized simultaneously by including them in the objective function of our CNN. Each target corresponds to a metric; hence, the learning process is called multi-metrics learning (MML).
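Since the LPS is a deterministic function of the RI spectrograms, LPS = log(R² + I²), a single objective can couple the RI reconstruction error (related to time-domain error, and thus SSNR, via Parseval's theorem) with the LPS error (related to LSD). A minimal NumPy sketch of such a combined loss; the weighting `alpha` and the small `eps` stabilizer are illustrative assumptions, not values from the paper:

```python
import numpy as np

def lps(real, imag, eps=1e-8):
    """Log-power spectrogram as a function of RI spectrograms:
    LPS = log(|S|^2) = log(R^2 + I^2)."""
    return np.log(real**2 + imag**2 + eps)

def mml_loss(est_r, est_i, clean_r, clean_i, alpha=0.1):
    """Multi-metrics objective: RI reconstruction term plus LPS term.

    The RI term is tied to time-domain error (SSNR); the LPS term
    is tied to log-spectral distortion (LSD). `alpha` balances them.
    """
    ri_term = np.mean((est_r - clean_r)**2 + (est_i - clean_i)**2)
    lps_term = np.mean((lps(est_r, est_i) - lps(clean_r, clean_i))**2)
    return ri_term + alpha * lps_term
```

Because the LPS term is computed from the network's RI outputs rather than predicted separately, minimizing this single objective pushes both metrics at once without adding output dimensions to the model.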
