A Divide and Conquer Strategy for Musical Noise-free Speech Enhancement in Adverse Environments

📝 Original Info

  • Title: A Divide and Conquer Strategy for Musical Noise-free Speech Enhancement in Adverse Environments
  • ArXiv ID: 1802.02665
  • Date: 2018-02-09
  • Authors: Md Tauhidul Islam, Celia Shahnaz, Wei-Ping Zhu, M. Omair Ahmad

📝 Abstract

A divide and conquer strategy for enhancement of noisy speech in adverse environments involving lower levels of SNR is presented in this paper, where the total system of speech enhancement is divided into two separate steps. The first step is based on noise compensation on the short-time magnitude and the second step is based on phase compensation. The magnitude spectrum is compensated based on a modified spectral subtraction method where the cross-terms containing spectra of noise and clean speech, which are neglected in traditional spectral subtraction methods, are taken into consideration. By employing the modified magnitude and unchanged phase, a procedure is formulated to compensate for the overestimation or underestimation of noise by a phase compensation method based on the probability of speech presence. A modified complex spectrum based on these two steps is obtained to synthesize a musical-noise-free enhanced speech. Extensive simulations are carried out using the speech files available in the NOIZEUS database in order to evaluate the performance of the proposed method. It is shown in terms of objective measures, spectrogram analysis and formal subjective listening tests that the proposed method consistently outperforms some of the state-of-the-art methods of speech enhancement for noisy speech corrupted by street or babble noise at very low as well as medium levels of SNR.

📄 Full Content

The analysis-modification-synthesis (AMS) framework [16], [17], [18], [19] is employed for reconstructing the original speech after performing the enhancement operation.

In speech analysis, it is commonly believed that the human auditory system is phase-deaf, i.e., it ignores the phase spectrum and considers only the magnitude spectrum. That is why, in the conventional spectral subtraction based speech enhancement methods mentioned above, operations for synthesizing clean speech are performed only on the short-time magnitude spectrum while the short-time phase spectrum is left unaltered. Recently, it has been shown that the phase spectrum is also useful in speech analysis [20], [21], [22].

Among all the methods mentioned above, spectral subtraction has been widely used because it suppresses noise with simple computation. In Boll’s method [1] of spectral subtraction, the noise spectrum is estimated from the non-speech frames and subtracted from the noisy speech spectrum in the current frame. This simple formulation comes at a price. If too much noise is subtracted from the noisy speech spectrum, the speech is distorted. On the other hand, if too little noise is subtracted, the enhanced speech remains noisy. Many methods have been proposed for subtracting the proper amount of noise, such as [23], [24]. Another problem with spectral subtraction is musical noise, which arises when negative values in the resulting spectrum are set to zero [25]. Sometimes musical noise is more disturbing than the original noise. To address musical noise, the authors of [25] proposed flooring the negative spectrum values to a value other than zero.
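As a rough illustration of the subtraction and flooring ideas described above, the following minimal sketch operates on the magnitude bins of a single frame. The function name `spectral_subtract`, the `floor` parameter, and the choice of flooring to a fraction of the noisy magnitude are assumptions made for illustration; they are not the exact rules used in [1] or [25].

```python
import numpy as np

def spectral_subtract(noisy_mag, noise_mag, floor=0.0):
    """Subtract a noise magnitude estimate bin by bin (hypothetical helper).

    Bins where the difference is not positive are replaced by
    floor * noisy_mag. With floor=0.0 this is plain clipping at zero,
    the operation associated with musical noise in the text.
    """
    diff = noisy_mag - noise_mag
    return np.where(diff > 0.0, diff, floor * noisy_mag)

# Toy magnitudes for four frequency bins (made-up numbers).
noisy_mag = np.array([1.0, 0.2, 0.5, 0.1])
noise_mag = np.array([0.3, 0.3, 0.3, 0.3])

clipped = spectral_subtract(noisy_mag, noise_mag)             # negatives set to zero
floored = spectral_subtract(noisy_mag, noise_mag, floor=0.1)  # negatives set to a small positive value
```

With clipping, the second and fourth bins become exactly zero, leaving isolated spectral peaks; with a nonzero floor those bins keep a small residual instead.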

Spectral subtraction assumes that the noise and clean speech spectra are independent and that the cross-correlation between them is zero, which does not hold in most practical cases. In [26], the authors show that the cross terms have a crucial impact on speech enhancement performance when the signal-to-noise ratio (SNR) of the noisy speech is less than or near dB. Several attempts have been made to take the cross terms into account for speech enhancement, such as [26], [27].
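The role of the cross terms can be checked numerically. For an additive model the power spectrum of the noisy signal equals the sum of the speech and noise power spectra plus a cross term, and that cross term is generally nonzero for any single frame. The sketch below uses random complex numbers as stand-ins for the speech and noise spectra; the variable names are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random complex values standing in for clean-speech and noise spectra of one frame.
X = rng.standard_normal(8) + 1j * rng.standard_normal(8)
D = rng.standard_normal(8) + 1j * rng.standard_normal(8)
Y = X + D  # additive noise model in the frequency domain

# |Y|^2 = |X|^2 + |D|^2 + 2 Re(X D*): the last term is the cross term
# that traditional spectral subtraction assumes to be zero.
cross = 2.0 * np.real(X * np.conj(D))
lhs = np.abs(Y) ** 2
rhs = np.abs(X) ** 2 + np.abs(D) ** 2 + cross
```

Dropping `cross` from `rhs` leaves a per-bin error that can be positive or negative, which is one way to see why the subtracted noise may be over- or underestimated.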

Most of the speech enhancement methods discussed above perform well at high or moderate SNR levels. Only a few methods, such as [13], [28], have been proposed to cope with low SNR environments.

In this paper, we address the above-mentioned problems using a two-step formulation. The first step obtains a crude estimate of the clean speech spectrum through a modified spectral subtraction method, in which the cross-terms between the speech and noise spectra are treated as non-zero. The second step is based on phase compensation, which uses a probabilistic approach to calculate how much compensation should be imposed on the phase spectrum of the noisy speech. An enhanced complex spectrum is obtained by pairing the modified magnitude spectrum from the first step with the modified phase spectrum from the second step. Both steps produce non-negative results, which allows the proposed method to enhance the noisy speech without introducing musical noise. The proposed method is shown to be effective in producing good results even for noisy speech with very low SNR levels.

The paper is organized as follows. Section II presents the problem formulation and proposed method. Section III describes the results. Concluding remarks are presented in Section IV.

In any AMS framework, noisy speech frames are first transformed into another domain. Modifications are then carried out in the transformed domain, and finally the inverse transform followed by the overlap-add method is applied to reconstruct the enhanced speech. The proposed method is based on the AMS framework, where speech is analyzed, modified and synthesized frame by frame.
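The analysis, modification and synthesis stages described above can be sketched in a few lines. This is a generic illustration, not the paper's implementation: the function name `ams`, the Hann window, the frame length and hop size, and the window-sum normalization are all assumptions chosen so that an identity modification returns the input.

```python
import numpy as np

def ams(signal, frame_len=256, hop=128, modify=lambda spec: spec):
    """Generic analysis-modification-synthesis loop (illustrative sketch).

    `modify` receives the one-sided spectrum of each windowed frame and
    returns a spectrum of the same shape; the default leaves it unchanged.
    """
    window = np.hanning(frame_len)
    out = np.zeros(len(signal))
    norm = np.zeros(len(signal))
    for start in range(0, len(signal) - frame_len + 1, hop):
        stop = start + frame_len
        frame = signal[start:stop] * window              # analysis: window the frame
        spec = modify(np.fft.rfft(frame))                # modification in the frequency domain
        out[start:stop] += np.fft.irfft(spec, n=frame_len)  # synthesis with overlap-add
        norm[start:stop] += window                       # accumulated window weight
    covered = norm > 1e-8
    out[covered] /= norm[covered]                        # undo the overlapped window weighting
    return out

rng = np.random.default_rng(0)
sig = rng.standard_normal(1024)
rec = ams(sig)  # identity modification: output matches input where frames overlap
```

An enhancement method would plug its own spectral modification into `modify`, for example a function that changes the magnitude, the phase, or both before the inverse transform.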

In the presence of additive noise d[n], a clean speech signal x[n] gets contaminated and produces noisy speech y[n]. The noisy speech can be segmented into overlapping frames by using a sliding window.

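The $i$-th windowed noisy speech frame can be expressed in the time domain as

$$y_i[n] = x_i[n] + d_i[n], \quad i = 1, 2, \ldots, T \tag{1}$$

where $T$ is the total number of speech frames. If $Y_i[k]$, $X_i[k]$ and $D_i[k]$ are the short-time Fourier transform (STFT) representations of $y_i[n]$, $x_i[n]$ and $d_i[n]$, respectively, we can write

$$Y_i[k] = X_i[k] + D_i[k] \tag{2}$$

where $k = 0, 1, \ldots, N-1$ and $N$ is the length of a frame in samples. The $N$-point FFT $Y_i[k]$ of $y_i[n]$ can be computed as

$$Y_i[k] = \sum_{n=0}^{N-1} y_i[n]\, e^{-j 2 \pi n k / N} \tag{3}$$

The Fourier transform of the noisy speech frame, $Y_i[k]$, is modified in the proposed method to obtain an estimate of the clean speech spectrum.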
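The framing and FFT steps above can be illustrated with synthetic signals. Because windowing and the FFT are both linear, the spectrum of each noisy frame equals the sum of the clean speech and noise spectra of that frame. The helper `frame_signal`, the Hamming window, and the frame length and hop used below are assumptions for illustration, and random noise stands in for real speech.

```python
import numpy as np

def frame_signal(signal, frame_len, hop):
    """Split a signal into overlapping windowed frames (hypothetical helper)."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop:i * hop + frame_len] * window
                     for i in range(n_frames)])

rng = np.random.default_rng(1)
x = rng.standard_normal(800)        # stand-in for clean speech
d = 0.5 * rng.standard_normal(800)  # stand-in for additive noise
y = x + d                           # noisy speech

# One FFT per frame; each row is the spectrum of one windowed frame.
X = np.fft.fft(frame_signal(x, 200, 100), axis=1)
D = np.fft.fft(frame_signal(d, 200, 100), axis=1)
Y = np.fft.fft(frame_signal(y, 200, 100), axis=1)
```

Here `Y` has one row per frame and one column per frequency bin, and `Y` equals `X + D` up to floating-point error.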

An overview of the proposed speech enhancement method is shown as a block diagram in Fig. 1. As Fig. 1 shows, the short-time Fourier transform (STFT) is first applied to each input speech frame. The magnitude of the Fourier spectrum is compensated by a modified spectral subtraction method, which we call the M-step. The modified magnitude from the M-step is then combined with the unchanged phase to obtain the

Reference

This content is AI-processed based on open access ArXiv data.
