Waveform Modeling and Generation Using Hierarchical Recurrent Neural Networks for Speech Bandwidth Extension
Authors: Zhen-Hua Ling, Yang Ai, Yu Gu, and Li-Rong Dai
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007

Zhen-Hua Ling, Member, IEEE, Yang Ai, Yu Gu, and Li-Rong Dai

Abstract—This paper presents a waveform modeling and generation method using hierarchical recurrent neural networks (HRNN) for speech bandwidth extension (BWE). Different from conventional BWE methods which predict spectral parameters for reconstructing wideband speech waveforms, this BWE method models and predicts waveform samples directly without using vocoders. Inspired by SampleRNN, which is an unconditional neural audio generator, the HRNN model represents the distribution of each wideband or high-frequency waveform sample conditioned on the input narrowband waveform samples using a neural network composed of long short-term memory (LSTM) layers and feed-forward (FF) layers. The LSTM layers form a hierarchical structure and each layer operates at a specific temporal resolution to efficiently capture long-span dependencies between temporal sequences. Furthermore, additional conditions, such as the bottleneck (BN) features derived from narrowband speech using a deep neural network (DNN)-based state classifier, are employed as auxiliary input to further improve the quality of generated wideband speech. The experimental results of comparing several waveform modeling methods show that the HRNN-based method can achieve better speech quality and run-time efficiency than the dilated convolutional neural network (DCNN)-based method and the plain sample-level recurrent neural network (SRNN)-based method. Our proposed method also outperforms the conventional vocoder-based BWE method using LSTM-RNNs in terms of the subjective quality of the reconstructed wideband speech.

Index Terms—speech bandwidth extension, recurrent neural networks, dilated convolutional neural networks, bottleneck features
I. INTRODUCTION

Speech communication is important in people's daily life. However, due to the limitation of transmission channels and the restriction of speech acquisition equipment, the bandwidth of the speech signal is usually limited to a narrow band of frequencies. For example, the bandwidth of the speech signal in the public switched telephone network (PSTN) is less than 4 kHz. The missing high-frequency components of the speech signal usually lead to low naturalness and intelligibility, such as the difficulty of distinguishing fricatives and similar voices. Therefore, speech bandwidth extension (BWE), which aims to restore the missing high-frequency components of narrowband speech using the correlations that exist between the low- and high-frequency components of the wideband speech signal, has attracted the attention of many researchers. BWE methods can not only be applied to real-time voice communication, but can also benefit other speech signal processing areas such as text-to-speech (TTS) synthesis [1], speech recognition [2], [3], and speech enhancement [4], [5]. Many researchers have made great efforts in the field of BWE.

(This work was partially funded by the National Key Research and Development Project of China (Grant No. 2017YFB1002202) and the National Natural Science Foundation of China (Grant No. U1636201). Z.-H. Ling, Y. Ai, and L.-R. Dai are with the National Engineering Laboratory of Speech and Language Information Processing, University of Science and Technology of China, Hefei, 230027, China (e-mail: zhling@ustc.edu.cn, ay8067@mail.ustc.edu.cn, lrdai@ustc.edu.cn). Y. Gu is with the Baidu Speech Department, Baidu Technology Park, Beijing, 100193, China (e-mail: guyu04@baidu.com). This work was done when he was a graduate student at the National Engineering Laboratory of Speech and Language Information Processing, University of Science and Technology of China.)
Some early studies adopted the source-filter model of speech production and attempted to restore high-frequency residual signals and spectral envelopes separately from input narrowband signals. The high-frequency residual signals were usually estimated from the narrowband residual signals by spectral folding [6]. Estimating high-frequency spectral envelopes from narrowband signals is always a difficult task. To achieve this goal, simple methods, such as codebook mapping [7] and linear mapping [4], and statistical methods using Gaussian mixture models (GMMs) [8]–[11] and hidden Markov models (HMMs) [12]–[15], have been proposed. In statistical methods, acoustic models were built to represent the mapping relationship between narrowband spectral parameters and high-frequency spectral parameters. Although these statistical methods achieved better performance than simple mapping methods, the inadequate modeling ability of GMMs and HMMs may lead to over-smoothed spectral parameters, which constrains the quality of reconstructed speech signals [16].

In recent years, deep learning has become an emerging field in machine learning research. Deep learning techniques have been successfully applied to many signal processing tasks. In speech signal processing, neural networks with deep structures have been introduced to speech generation tasks including speech synthesis [17], [18], voice conversion [19], [20], speech enhancement [21], [22], and so on. In the field of BWE, neural networks have also been adopted to predict either the spectral parameters representing vocal-tract filter properties [23]–[25] or the original log-magnitude spectra derived by short-time Fourier transform (STFT) [26], [27]. The studied model architectures include deep neural networks (DNNs) [28]–[30], recurrent temporal restricted Boltzmann machines (RBMs) [31], recurrent neural networks (RNNs) with long short-term memory (LSTM) cells [32], and so on.
These methods achieved better BWE performance than conventional statistical models, like GMMs and HMMs, since deep-structured neural networks are more capable of modeling the complicated and nonlinear mapping relationship between input and output acoustic parameters.

However, all these existing methods are vocoder-based ones, which means vocoders are used to extract spectral parameters from narrowband waveforms and then to reconstruct waveforms from the predicted wideband or high-frequency spectral parameters. This may lead to two deficiencies. First, the parameterization process of vocoders usually degrades speech quality. For example, spectral details are always lost in the reconstructed waveforms when low-dimensional spectral parameters, such as mel-cepstra or line spectral pairs (LSPs), are adopted to represent spectral envelopes in vocoders. The spectral shapes of the noise components at voiced frames are always ignored when only F0 values and binary voiced/unvoiced flags are used to describe the excitation. Second, it is difficult to parameterize and to predict phase spectra due to the phase-wrapping issue. Thus, simple estimation methods, such as mirror inversion, are popularly used to predict the high-frequency phase spectra in existing methods [26], [32]. This also constrains the quality of the reconstructed wideband speech.

Recently, neural network-based speech waveform synthesizers, such as WaveNet [33] and SampleRNN [34], have been presented. In WaveNet [33], the distribution of each waveform sample conditioned on previous samples and additional conditions was represented using a neural network with dilated convolutional layers and residual architectures. SampleRNN [34] adopted recurrent neural layers with a hierarchical structure for unconditional audio generation.
Inspired by WaveNet, a waveform modeling and generation method using stacked dilated CNNs for BWE was proposed in our previous work [35], which achieved better subjective BWE performance than the vocoder-based approach utilizing LSTM-RNNs. On the other hand, methods of applying RNNs to directly model and generate speech waveforms for BWE have not yet been investigated. Therefore, this paper proposes a waveform modeling and generation method using RNNs for BWE. As discussed above, direct waveform modeling and generation can help avoid the spectral representation and phase modeling issues in vocoder-based BWE methods. Considering the sequence memory and modeling ability of RNNs and LSTM units, this paper adopts LSTM-RNNs to model and generate the wideband or high-frequency waveform samples directly given input narrowband waveforms. Inspired by SampleRNN [34], a hierarchical RNN (HRNN) structure is presented for the BWE task. There are multiple recurrent layers in an HRNN and each layer operates at a specific temporal resolution. Compared with plain sample-level deep RNNs, HRNNs are more capable and efficient at capturing long-span dependencies in temporal sequences. Furthermore, additional conditions, such as the bottleneck (BN) features [32], [36], [37] extracted from narrowband speech using a DNN-based state classifier, are introduced into HRNN modeling to further improve the performance of BWE.

The contributions of this paper are twofold. First, this paper makes the first successful attempt to model and generate speech waveforms directly at sample level using RNNs for the BWE task. Second, various RNN architectures for waveform-based BWE, including plain sample-level LSTM-RNNs, HRNNs, and HRNNs with additional conditions, are implemented and evaluated in this paper.
The experimental results of comparing several waveform modeling methods show that the HRNN-based method achieves better speech quality and run-time efficiency than the stacked dilated CNN-based method [35] and the plain sample-level RNN-based method. Our proposed method also outperforms the conventional vocoder-based BWE method using LSTM-RNNs in terms of the subjective quality of the reconstructed wideband speech.

This paper is organized as follows. In Section II, we briefly review previous BWE methods, including vocoder-based ones and the dilated CNN-based one. In Section III, the details of our proposed method are presented. Section IV reports our experimental results, and conclusions are given in Section V.

II. PREVIOUS WORK

A. Vocoder-Based BWE Using Neural Networks

Vocoder-based BWE methods using DNNs or RNNs have been proposed in recent years [26], [32]. In these methods, spectral parameters such as logarithmic magnitude spectra (LMS) were first extracted by short-time Fourier transform (STFT) [38]. Then, DNNs or LSTM-RNNs were trained under the minimum mean square error (MMSE) criterion to establish a mapping relationship from the LMS of narrowband speech to the LMS of the high-frequency components of wideband speech. Some additional features extracted from narrowband speech, such as bottleneck features, can be used as auxiliary inputs to improve the performance of the networks [32]. At the reconstruction stage, the LMS of wideband speech were reconstructed by concatenating the LMS of the input narrowband speech and the LMS of the high-frequency components predicted by the trained DNN or LSTM-RNN. The phase spectra of wideband speech were usually generated by some simple mapping algorithms, such as mirror inversion [26]. Finally, inverse FFT (IFFT) and the overlap-add algorithm were carried out to reconstruct the wideband waveforms from the predicted LMS and phase spectra.
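The per-frame reconstruction step just described can be sketched as follows. This is a minimal illustration, not the cited systems' exact implementation: the function name is ours, and the particular mirroring convention for the high-frequency phase is an assumption (conventions vary across papers).

```python
import numpy as np

def reconstruct_wideband_frame(nb_lms, hf_lms, nb_phase):
    """Rebuild one wideband STFT frame from narrowband LMS, predicted
    high-frequency LMS, and a mirror-based phase estimate.

    nb_lms, nb_phase: narrowband log-magnitude and phase (low bins).
    hf_lms: predicted high-frequency log-magnitude (high bins).
    """
    # Concatenate low-band and predicted high-band log-magnitudes.
    wb_lms = np.concatenate([nb_lms, hf_lms])
    # Mirror inversion (one simple convention): reuse the narrowband
    # phase, flipped and negated, for the missing high band.
    hf_phase = -nb_phase[::-1][:len(hf_lms)]
    wb_phase = np.concatenate([nb_phase, hf_phase])
    # Complex spectrum; a full system would IFFT and overlap-add frames.
    return np.exp(wb_lms) * np.exp(1j * wb_phase)
```

In a complete pipeline each such frame would be inverse-transformed and overlap-added, as described above.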
The experimental results of previous work showed that LSTM-RNNs can achieve better performance than DNNs in vocoder-based BWE [32]. Nevertheless, there are still some issues with the vocoder-based BWE approach, as discussed in Section I, such as the quality degradation caused by the parameterization of vocoders and the inadequacy of restoring phase spectra.

B. Waveform-Based BWE Using Stacked Dilated CNNs

Recently, a novel waveform generation model named WaveNet was proposed [33] and has been successfully applied to the speech synthesis task [39]–[41]. This model utilizes stacked dilated CNNs to describe the autoregressive generation process of audio waveforms without using frequency analysis and vocoders. A stacked dilated CNN consists of many convolutional layers with different dilation factors. The length of its receptive field grows exponentially with the network depth [33].

Fig. 1. The structure of stacked dilated non-causal CNNs [35].

Motivated by this idea, a waveform modeling and generation method for BWE was proposed [35], which described the conditional distribution of the output wideband or high-frequency waveform sequence y = [y_1, y_2, ..., y_T] conditioned on the input narrowband waveform sequence x = [x_1, x_2, ..., x_T] using stacked dilated CNNs. Similar to WaveNet, the samples x_t and y_t were all discretized by 8-bit µ-law quantization [42] and a softmax output layer was adopted. Residual and parameterized skip connections together with gated activation functions were also employed to enable the training of deep networks and to accelerate the convergence of model estimation. Different from WaveNet, this method modeled the mapping relationship between two waveform sequences, not the autoregressive generation process of the output waveform sequence.
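The exponential receptive-field growth mentioned above can be checked with a short sketch. Kernel size 2 and doubling dilation factors are assumptions in the spirit of WaveNet, not the exact configuration of [35]:

```python
def receptive_field(dilations, kernel_size=2):
    """Receptive field (in samples) of a stack of dilated conv layers.

    Each layer with dilation d and kernel size k widens the receptive
    field by (k - 1) * d samples.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Doubling dilations: a stack of depth n covers 2**n samples.
print(receptive_field([1, 2, 4, 8, 16]))  # -> 32
```

With doubling dilations, each added layer doubles the span of input samples that influence one output sample, which is what makes long contexts affordable without very deep stacks.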
Both causal and non-causal model structures were implemented, and experimental results showed that the non-causal structure achieved better performance than the causal one [35]. The stacked dilated non-causal CNN, as illustrated in Fig. 1, described the conditional distribution as

p(y | x) = ∏_{t=1}^{T} p(y_t | x_{t−N/2}, x_{t−N/2+1}, ..., x_{t+N/2}),  (1)

where N + 1 is the length of the receptive field. At the extension stage, given input narrowband speech, each output sample was obtained by selecting the quantization level with the maximum posterior probability. Finally, the generated waveforms were processed by a high-pass filter and then added to the input narrowband waveforms to reconstruct the final wideband waveforms. Experimental results showed that this method achieved better subjective BWE performance than the vocoder-based method using LSTM-RNNs [35].

III. PROPOSED METHODS

Inspired by SampleRNN [34], which is an unconditional audio generator containing recurrent neural layers with a hierarchical structure, this paper proposes waveform modeling and generation methods using RNNs for BWE. In this section, we first introduce the plain sample-level RNNs (SRNNs) for waveform modeling. Then the structures of hierarchical RNNs (HRNNs) and conditional HRNNs are explained in detail. Finally, the flowchart of BWE using RNNs is introduced.

Fig. 2. The structure of SRNNs for BWE, where concentric circles represent LSTM layers and inverted trapezoids represent FF layers.

A. Sample-Level Recurrent Neural Networks

The LSTM-RNNs for speech generation are usually built at frame level in order to model the acoustic parameters extracted by vocoders with a fixed frame shift [32], [43].
It is straightforward to model and generate speech waveforms at sample level using a similar LSTM-RNN framework. The structure of sample-level recurrent neural networks (SRNNs) for BWE is shown in Fig. 2, which is composed of a cascade of LSTM layers and feed-forward (FF) layers. Both the input waveform samples x = [x_1, x_2, ..., x_T] and output waveform samples y = [y_1, y_2, ..., y_T] are quantized to discrete values by µ-law. The embedding layer maps each discrete sample value x_t to a real-valued vector e_t. The LSTM layers model the sequence of embedding vectors in a recurrent manner. When there is only one LSTM layer, the calculation process can be formulated as

h_t = H(h_{t−1}, e_t),  (2)

where h_t is the output of the LSTM layer at time step t and H represents the activation function of the LSTM units. If there are multiple LSTM layers, their outputs are calculated layer by layer. Then, h_t passes through the FF layers. The activation function of the last layer is a softmax function which generates the probability distribution of the output sample y_t conditioned on the previous and current input samples {x_1, x_2, ..., x_t} as

p(y_t | x_1, x_2, ..., x_t) = FF(h_t),  (3)

where the function FF denotes the calculation of the FF layers. Given a training set with parallel input and output waveform sequences, the model parameters of the LSTM and FF layers are estimated using the cross-entropy cost function. At generation time, each output sample y_t is obtained by maximizing the conditional probability distribution (3). Our preliminary and informal listening test showed that this generation criterion can achieve better subjective performance than generating random samples from the distribution. Random sampling is necessary for the conventional WaveNet and SampleRNN models because of their autoregressive architecture. However, the model structure shown in Fig. 2 is not an autoregressive one.
The input waveforms provide the necessary randomness to synthesize the output speech, especially the unvoiced segments.

In an SRNN, the generation of each output sample depends on all previous and current input samples. However, this plain LSTM-RNN architecture still has some deficiencies for waveform modeling and generation. First, sample-level modeling makes it difficult to model long-span dependencies between input and output speech signals due to the significantly increased sequence length compared with frame-level modeling. Second, SRNNs suffer from inefficiency of waveform generation due to the point-by-point calculation at all layers and the dimension expansion at the embedding layer. Therefore, inspired by SampleRNN [34], a hierarchical RNN (HRNN) structure is proposed in the next subsection to alleviate these problems.

B. Hierarchical Recurrent Neural Networks

Fig. 3. The structure of HRNNs for BWE, where concentric circles represent LSTM layers and inverted trapezoids represent FF layers.

The structure of HRNNs for BWE is illustrated in Fig. 3. Similar to the SRNNs mentioned in Section III-A, HRNNs are also composed of LSTM layers and FF layers. Different from the plain LSTM-RNN structure of SRNNs, the LSTM and FF layers in HRNNs form a hierarchical structure of multiple tiers and each tier operates at a specific temporal resolution. The bottom tier (i.e., Tier 1 in Fig.
3) deals with individual samples and outputs sample-level predictions. Each higher tier operates at a lower temporal resolution (i.e., dealing with more samples per time step). Each tier is conditioned on the tier above it, except the top tier. This model structure is similar to SampleRNN [34]. The main difference is that the original SampleRNN model is an unconditional audio generator which employs the history of output waveforms as network input and generates output waveforms in an autoregressive way, while the HRNN model shown in Fig. 3 describes the mapping relationship between two waveform sequences directly, without considering the autoregressive property of the output waveforms. This HRNN structure is specifically designed for BWE because narrowband waveforms are used as inputs in this task. Removing autoregressive connections can help reduce the computation complexity and facilitate parallel computing at generation time. Although conditional SampleRNNs have been developed and used as neural vocoders to reconstruct speech waveforms from acoustic parameters [44], they still follow the autoregressive framework and are different from HRNNs.

Assume an HRNN has K tiers in total (e.g., K = 3 in Fig. 3). Tier 1 works at sample level and the other K − 1 tiers are frame-level tiers since they operate at a temporal resolution lower than samples.

1) Frame-level tiers: The k-th tier (1 < k ≤ K) operates on frames composed of L^(k) samples. The range of time steps at the k-th tier, t^(k), is determined by L^(k). Denoting the quantized input waveforms as x = [x_1, x_2, ..., x_T] and assuming that L represents the sequence length of x after zero-padding so that L is divisible by L^(K), we can get

t^(k) ∈ T^(k) = {1, 2, ..., L / L^(k)},  1 < k ≤ K.  (4)
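A minimal sketch of this indexing (the frame sizes below are illustrative, not the paper's configuration): pad the sequence so the top-tier frame size divides its length, then enumerate the time steps of each frame-level tier.

```python
def padded_length(T, L_top):
    """Length after zero-padding so the top-tier frame size divides it."""
    return -(-T // L_top) * L_top  # ceil(T / L_top) * L_top

def tier_steps(L, L_k):
    """Time-step indices T^(k) = {1, ..., L / L_k} for frame size L_k."""
    return list(range(1, L // L_k + 1))

# Illustrative sizes: L^(2) = 2 and L^(3) = 8 samples per frame.
L = padded_length(13, 8)  # 13 samples padded to 16
print(L, len(tier_steps(L, 2)), len(tier_steps(L, 8)))  # -> 16 8 2
```

Padding to a multiple of the largest frame size guarantees that every tier sees a whole number of frames, which is what (4) requires.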
Furthermore, the relationship of temporal resolutions between the m-th tier and the n-th tier (1 < m < n ≤ K) can be described as

T^(n) = { t^(n) | t^(n) = ⌈ t^(m) / (L^(n)/L^(m)) ⌉, t^(m) ∈ T^(m) },  (5)

where ⌈·⌉ represents the rounding-up operation. It can be observed from (5) that one time step of the n-th tier corresponds to L^(n)/L^(m) time steps of the m-th tier. The frame inputs f^(k)_t at the k-th tier (1 < k ≤ K) and the t-th time step can be written by framing and concatenation operations as

f̃^(k)_t = [x_{(t−1)L^(k)+1}, ..., x_{tL^(k)}]^⊤,  (6)

f^(k)_t = [f̃^(k)⊤_t, ..., f̃^(k)⊤_{t+c^(k)−1}]^⊤,  (7)

where t ∈ T^(k), f̃^(k)_t denotes the t-th waveform frame at the k-th tier, and c^(k) is the number of concatenated frames at the k-th tier. We have c^(3) = c^(2) = 1 in the model structure shown in Fig. 3.

As shown in Fig. 3, the frame-level tiers are composed of LSTM layers. For the top tier (i.e., k = K), the LSTM units update their hidden states h^(K)_t based on the hidden states of the previous time step h^(K)_{t−1} and the input at the current time step f^(K)_t. If there is only one LSTM layer in the K-th tier, the calculation process can be formulated as

h^(K)_t = H(h^(K)_{t−1}, f^(K)_t),  t ∈ T^(K).  (8)

If the top tier is composed of multiple LSTM-RNN layers, the hidden states can be calculated layer by layer iteratively. Due to the different temporal resolutions at different tiers, the top tier generates r^(K) = L^(K)/L^(K−1) conditioning vectors for the (K−1)-th tier at each time step t ∈ T^(K). This is implemented by producing a set of r^(K) separate linear projections of h^(K)_t at each time step. For the intermediate tiers (i.e., 1 < k < K), the process of generating conditioning vectors is the same as that of the top tier.
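The framing and concatenation in (6) and (7) amount to slicing the padded input into non-overlapping frames of L^(k) samples and stacking c^(k) consecutive frames per input. A minimal sketch with illustrative sizes (the function name is ours):

```python
import numpy as np

def frame_input(x, L_k, c_k=1):
    """Build the frame inputs f^(k)_t from a padded sample sequence x.

    Frame t (1-indexed) covers samples (t-1)*L_k+1 .. t*L_k, as in
    eq. (6); each input concatenates c_k consecutive frames starting
    at frame t, as in eq. (7).
    """
    frames = x.reshape(-1, L_k)
    n = frames.shape[0] - c_k + 1
    return np.stack([frames[t:t + c_k].ravel() for t in range(n)])

x = np.arange(1, 9)                 # 8 samples, already padded
print(frame_input(x, 2))            # frames [1 2], [3 4], [5 6], [7 8]
print(frame_input(x, 2, c_k=2)[0])  # -> [1 2 3 4]
```

With c_k > 1, each frame input overlaps with the next, which is how future samples enter the prediction of the current output in the non-causal structure.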
Thus, we can describe the conditioning vectors uniformly as

d^(k)_{(t−1)r^(k)+j} = W^(k)_j h^(k)_t,  j = 1, 2, ..., r^(k),  t ∈ T^(k),  (9)

where 1 < k ≤ K and r^(k) = L^(k)/L^(k−1).

The input vectors of the LSTM layers at the intermediate tiers are different from that of the top tier. For the k-th tier (1 < k < K), the input vector i^(k)_t at the t-th time step is composed of a linear combination of the frame inputs f^(k)_t and the conditioning vectors d^(k+1)_t given by the (k+1)-th tier as

i^(k)_t = W^(k) f^(k)_t + d^(k+1)_t,  t ∈ T^(k).  (10)

Thus, the output of the LSTM layer at the k-th tier (1 < k < K) can be calculated as

h^(k)_t = H(h^(k)_{t−1}, i^(k)_t),  t ∈ T^(k).  (11)

2) Sample-level tier: The sample-level tier (i.e., Tier 1 in Fig. 3) gives the probability distribution of the output sample y_t conditioned on the current input sample x_t (i.e., L^(1) = 1) together with the conditioning vector d^(2)_t passed from the tier above, which encodes the history information of the input sequence, where t ∈ T^(1) = {1, 2, ..., L/L^(1)}. Since x_t and y_t are individual samples, it is convenient to model the correlation among them using a memoryless structure such as FF layers. First, x_t is mapped into a real-valued vector e_t by an embedding layer. These embedding vectors form the input at each time step of the sample-level tier, i.e.,

f^(1)_t = [e^⊤_t, ..., e^⊤_{t+c^(1)−1}]^⊤,  (12)

where t ∈ T^(1) and c^(1) is the number of concatenated sample embeddings at the sample-level tier. In the model structure shown in Fig. 3, c^(1) = 1. Then, the input of the FF layers is a linear combination of f^(1)_t and d^(2)_t as

i^(1)_t = W^(1) f^(1)_t + d^(2)_t,  t ∈ T^(1).  (13)

Finally, we can obtain the conditional probability distribution of the output sample y_t by passing i^(1)_t through the FF layers.
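As a toy illustration of (9), (10), and (13), with all dimensions and weights made up: each higher-tier hidden state is expanded into r separate linear projections, and each lower-tier time step adds the matching conditioning vector to a projection of its own frame input.

```python
import numpy as np

rng = np.random.default_rng(0)
D, D_in, r = 3, 4, 2  # hidden size, frame-input size, upsampling ratio

# r separate projection matrices, as in eq. (9).
Ws = [rng.standard_normal((D, D)) for _ in range(r)]
# Frame-input projection, as in eqs. (10)/(13).
W_f = rng.standard_normal((D, D_in))

h = rng.standard_normal((5, D))  # 5 higher-tier hidden states
# Eq. (9): step t yields conditioning vectors (t-1)*r+1 .. t*r.
d = np.concatenate([[W @ h_t for W in Ws] for h_t in h])  # (10, D)

f = rng.standard_normal((10, D_in))  # 10 lower-tier frame inputs
i = f @ W_f.T + d                    # eqs. (10)/(13): linear combination
print(i.shape)  # -> (10, 3)
```

Note how the r projections turn 5 slow time steps into 10 conditioning vectors, matching the 10 time steps of the faster tier below.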
The activation function of the last FF layer is a softmax function. The output of the FF layers describes the conditional distribution

p(y_t | x_1, x_2, ..., x_{(⌈t/L^(K)⌉ + c^(K) − 1)L^(K)}) = FF(i^(1)_t),  (14)

where t ∈ T^(1). It is worth mentioning that the structure shown in Fig. 3 is non-causal, which utilizes future input samples together with current and previous input samples to predict the current output sample (e.g., using x_1, ..., x_{L^(3)} to predict y_1 in Fig. 3). Generally speaking, at most c^(K)L^(K) − 1 input samples after the current time step are needed to predict the current output sample according to (14). This is also a difference between our HRNN model and SampleRNN, which has a causal and autoregressive structure.

Similar to SRNNs, the parameters of HRNNs are estimated using the cross-entropy cost function given a training set with parallel input and output sample sequences. At generation time, each y_t is predicted using the conditional probability distribution in (14).

C. Conditional Hierarchical Recurrent Neural Networks

Some frame-level auxiliary features extracted from input narrowband waveforms, such as bottleneck (BN) features [36], have shown their effectiveness in improving the performance of vocoder-based BWE [32].

Fig. 4. The structure of conditional HRNNs for BWE, where concentric circles represent LSTM layers and inverted trapezoids represent FF layers.

In order to combine such auxiliary inputs with the HRNN model introduced in Section III-B, a conditional HRNN structure is designed as shown in Fig. 4. Compared with HRNNs, conditional HRNNs add an additional tier, named the conditional tier, on the top.
The input features of the conditional tier are frame-level auxiliary feature vectors extracted from the input waveforms rather than waveform samples. Assume the total number of tiers in a conditional HRNN is K (e.g., K = 4 in Fig. 4) and let L^(K) denote the frame shift of the auxiliary input features. Equations (4) and (5) in Section III-B still work here. Similar to the introductions in Section III-B, the frame inputs at the conditional tier can be written as

c_t = [c_{t1}, c_{t2}, ..., c_{td}],  t ∈ T^(K),  (15)

where c_{td} represents the d-th dimension of the auxiliary feature vector at time t. Then the calculations in (8)–(13) for HRNNs follow. Finally, the conditional probability distribution for generating y_t can be written as

p(y_t | x_1, ..., x_{(⌈t/L^(K)⌉ + c^(K) − 1)L^(K)}, c_1, c_2, ..., c_{⌈t/L^(K)⌉}) = FF(i^(1)_t),  (16)

where t ∈ T^(1) and {c_1, c_2, ..., c_{⌈t/L^(K)⌉}} are the additional conditions introduced by the auxiliary input features.

D. BWE Using SRNNs and HRNNs

The flowchart of BWE using SRNNs or HRNNs is illustrated in Fig. 5. There are two mapping strategies. One is to map the narrowband waveforms towards their corresponding wideband counterparts (named the WB strategy in the rest of this paper) and the other is to map the narrowband waveforms towards the waveforms of the high-frequency components of wideband speech (named the HF strategy).

A database with wideband speech recordings is used for model training. At the training stage, the input narrowband waveforms are obtained by downsampling the wideband waveforms. To guarantee the length consistency between the input
and output sequences, the narrowband waveforms are then upsampled to the sampling rate of the wideband speech with zero high-frequency components. The upsampled narrowband waveforms are used as the model input. The output waveforms are either the unfiltered wideband waveforms (WB strategy) or the high-frequency waveforms (HF strategy). The high-frequency waveforms are obtained by sending the wideband speech into a high-pass filter and an amplifier for reducing quantization noise, as shown by the dotted lines in Fig. 5. Before the waveforms are used for model training, all the input and output waveform samples are discretized by 8-bit µ-law quantization. The model parameters of SRNNs or HRNNs are trained under the cross-entropy (CE) criterion, which optimizes the classification accuracy of the discrete output samples on the training set.

Fig. 5. The flowchart of our proposed BWE methods.

At the extension stage, the upsampled and quantized narrowband waveforms are fed into the trained SRNNs or HRNNs to generate the probability distributions of the output samples. Then each output sample is obtained by selecting the quantization level with the maximum posterior probability. Later, the quantized output samples are decoded into continuous values using the inverse mapping of µ-law quantization.
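The 8-bit µ-law quantization and its inverse mapping can be sketched as follows, using the standard companding formula with µ = 255; the particular rounding convention and level layout are assumptions of this sketch.

```python
import numpy as np

def mulaw_encode(x, mu=255):
    """Map samples in [-1, 1] to 8-bit levels 0..mu via mu-law companding."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)  # in [-1, 1]
    return np.round((y + 1) / 2 * mu).astype(int)             # in 0..mu

def mulaw_decode(q, mu=255):
    """Inverse mapping from 8-bit levels back to continuous samples."""
    y = 2 * q.astype(float) / mu - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

x = np.linspace(-1, 1, 11)
err = np.abs(mulaw_decode(mulaw_encode(x)) - x)
print(err.max() < 0.02)  # -> True: round-trip error is small
```

The logarithmic companding allocates more quantization levels to small amplitudes, which is why 8 bits suffice for speech waveforms here.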
A de-amplification process is conducted for the HF strategy in order to compensate for the effect of amplification at training time. Finally, the generated waveforms are high-pass filtered and added to the input narrowband waveforms to generate the final wideband waveforms.

Particularly for conditional HRNNs, BN features are used as auxiliary input in our implementation, as shown by the gray lines in Fig. 5. BN features can be regarded as a compact representation of both linguistic and acoustic information [36]. Here, BN features are extracted by a DNN-based state classifier which has a bottleneck layer with a smaller number of hidden units than the other hidden layers. The inputs of the DNN are mel-frequency cepstral coefficients (MFCCs) extracted from narrowband speech and the outputs are the posterior probabilities of HMM states. The DNN is trained under the cross-entropy (CE) criterion and is used as the BN feature extractor at extension time.

IV. EXPERIMENTS

A. Experimental Setup

The TIMIT corpus [45], which contains English speech from multiple speakers with 16 kHz sampling rate and 16-bit resolution, was adopted in our experiments. We chose 3696 and 1153 utterances to construct the training set and validation set, respectively. Another 192 utterances from speakers not included in the training set and validation set were used as the test set to evaluate the performance of different BWE methods. In our experiments, the narrowband speech waveforms sampled at 8 kHz were obtained by downsampling the wideband speech at 16 kHz. Five BWE systems¹ were constructed for comparison in our experiments. The descriptions of these systems are as follows.

• VRNN: Vocoder-based BWE method using LSTM-RNNs as introduced in Section II-A. The DRNN-BN system in [32] was used here for comparison, which predicted the LMS of high-frequency components using a deep LSTM-RNN with auxiliary BN features.
The backpropagation through time (BPTT) algorithm was used to train the LSTM-RNN model based on the minimum mean square error (MMSE) criterion. In this system, a DNN-based state classifier was built to extract BN features. 11 frames of 39-dimensional narrowband MFCCs were used as the input of the DNN classifier, and the posterior probabilities of 183 HMM states for 61 monophones were regarded as its output. The DNN classifier adopted 6 hidden layers, with 100 hidden units at the BN layer and 1024 hidden units at the other hidden layers. The BN layer was set as the fifth hidden layer so that the extractor could capture more linguistic information. This BN feature extractor was also used in the CHRNN system.

• DCNN: Waveform-based BWE method using stacked dilated CNNs, as introduced in Section II-B. The CNN2-HF system in [35] was used here for comparison, which predicted high-frequency waveforms using non-causal CNNs and performed better than other configurations.

• SRNN: Waveform-based BWE method using sample-level RNNs, as introduced in Section III-A. The built model had two LSTM layers and two FF layers. Both the LSTM layers and the FF layers had 1024 hidden units, and the embedding size was 256. The model was trained by stochastic gradient descent with a mini-batch size of 64 to minimize the cross entropy between the predicted and real probability distributions. Zero-padding was applied to make all the sequences in a mini-batch have the same length, and the cost values of the added zero samples were ignored when computing the gradients. An Adam optimizer [46] was used to update the parameters with an initial learning rate of 0.001. The truncated backpropagation through time (TBPTT) algorithm was employed to improve the efficiency of model training, and the truncation length was set to 480.

¹Examples of reconstructed speech waveforms in our experiments can be found at http://home.ustc.edu.cn/~ay8067/IEEEtran/demo.html.
• HRNN: Waveform-based BWE method using HRNNs, as introduced in Section III-B. The HRNN was composed of 3 tiers, with two FF layers in Tier 1 and one LSTM layer each in Tiers 2 and 3. Therefore, there were two LSTM layers and two FF layers in total, the same as in the SRNN system. The values of c(k), k = 1, 2, 3, in (14) and (19) were set to c(3) = c(2) = 2 and c(1) = L(2) in our experiments after tuning on the validation set. Other setups, such as the dimension of the hidden units and the training method, were the same as those of the SRNN system mentioned above. The frame size configurations of the HRNN model will be discussed in Section IV-B.

• CHRNN: Waveform-based BWE method using conditional HRNNs, as introduced in Section III-C. The BN features extracted by the DNN state classifier used by the VRNN system were adopted as auxiliary conditions. The model was composed of 4 tiers. The top conditional tier had one LSTM layer with 1024 hidden units, and the other three tiers were the same as in the HRNN system. The basic setups and the training method were the same as for the HRNN system. The setup of the conditional tier will be introduced in detail in Section IV-E.

In our experiments, we first investigated the influence of frame sizes and mapping strategies (i.e., the WB and HF strategies introduced in Section III-D) on the performance of the HRNN system. Then, a comparison between different waveform-based BWE methods, including the DCNN, SRNN and HRNN systems, was carried out. Next, the effect of introducing BN features into HRNNs was studied by comparing the HRNN system and the CHRNN system. Finally, our proposed waveform-based BWE method was compared with the conventional vocoder-based one.

B.
Effects of Frame Sizes on HRNN-Based BWE

As introduced in Section III-B, the frame sizes L(k) are key parameters that make an HRNN model different from the conventional sample-level RNN. In this experiment, we studied the effect of L(k) on the performance of HRNN-based BWE. HRNN models with several configurations of (L(3), L(2)) were trained and their accuracy and efficiency were compared, as shown in Fig. 6. Here, the classification accuracy of predicting discrete waveform samples in the validation set was used to measure the accuracy of the different models. The total time of generating the 1153 utterances in the validation set with a mini-batch size of 64 on a single Tesla K40 GPU was used to measure the run-time efficiency. Both the WB and HF mapping strategies were considered in this experiment.

From the results shown in Fig. 6, we can see that there was a conflict between the accuracy and the efficiency of the trained HRNN models. Using smaller frame sizes (L(3), L(2)) improved the accuracy of sample prediction while increasing the computational complexity at the extension stage, for both the WB and HF strategies. Finally, we chose (L(3), L(2)) = (16, 4) as a trade-off and used this configuration for building the HRNN system in the following experiments.

Fig. 6. Accuracy and efficiency comparison for HRNN-based BWE with different (L(3), L(2)) configurations and using (a) WB and (b) HF mapping strategies.

TABLE I
AVERAGE PESQ SCORES WITH 95% CONFIDENCE INTERVALS ON THE TEST SET WHEN USING WB AND HF MAPPING STRATEGIES FOR HRNN-BASED BWE.

              Narrowband       HRNN-WB         HRNN-HF
PESQ score    3.63 ± 0.0636    3.53 ± 0.0438   3.75 ± 0.0456

C. Effects of Mapping Strategy on HRNN-Based BWE

It can be observed from Fig. 6 that the HF strategy achieved much lower classification accuracy than the WB strategy.
This is reasonable since it is more difficult to predict the aperiodic and noise-like high-frequency waveforms than to predict wideband waveforms. Objective and subjective evaluations were conducted to investigate which strategy achieves better performance for HRNN-based BWE.

Since it is improper to compare the classification accuracies of these two strategies directly, the Perceptual Evaluation of Speech Quality (PESQ) score for wideband speech (ITU-T P.862.2) [47] was adopted as the objective measurement here. We used the clean wideband speech as the reference and calculated the PESQ scores of the 192 test-set utterances generated using the WB and HF strategies (i.e., the HRNN-WB system and the HRNN-HF system) respectively. For comparison, the PESQ scores of the upsampled narrowband utterances (i.e., with empty high-frequency components) were also calculated. The average PESQ scores and their 95% confidence intervals are shown in Table I. The differences between any two of the three systems were significant according to the results of paired t-tests (p < 0.001). From Table I, we can see that the HF strategy achieved a higher PESQ score than the WB strategy. The average PESQ of the HRNN-WB system was even lower than that of the upsampled narrowband speech. This may be attributed to the fact that the model in the HRNN-WB system aimed to reconstruct the whole wideband waveforms and was incapable of generating high-frequency components as accurately as the HRNN-HF system.

Fig. 7. Average CCR scores of comparing five system pairs, including (1) HRNN-HF vs. HRNN-WB, (2) HRNN vs. DCNN, (3) HRNN vs. SRNN, (4) CHRNN vs. HRNN, and (5) CHRNN vs. VRNN. The error bars represent 95% confidence intervals and the numerical values in parentheses represent the p-values of one-sample t-tests for the different system pairs.

A 3-point comparison category rating (CCR) [48] test
was conducted on the Amazon Mechanical Turk (AMT) crowdsourcing platform (https://www.mturk.com) to compare the subjective performance of the HRNN-WB and HRNN-HF systems. The wideband waveforms of 20 utterances randomly selected from the test set were reconstructed by the HRNN-WB and HRNN-HF systems. Each pair of generated wideband utterances was evaluated in random order by 15 native English listeners, after rejecting improper listeners based on anti-cheating considerations [49]. The listeners were asked to judge which utterance in each pair had better speech quality, or whether there was no preference. Here, the HRNN-WB system was used as the reference system. CCR scores of +1, -1, and 0 denoted that the wideband utterance reconstructed by the evaluated system, i.e., the HRNN-HF system, sounded better than, worse than, or equal to the sample generated by the reference system in each pair. We calculated the average CCR score and its 95% confidence interval over all pairs of utterances heard by all listeners. In addition, a one-sample t-test was conducted to judge whether there was a significant difference between the average CCR score and 0 (i.e., whether there was a significant difference between the two systems) by examining the p-value. The results are shown as the first system pair in Fig. 7, which suggests that the HRNN-HF system outperformed the HRNN-WB system significantly. This is consistent with the results of comparing these two strategies when dilated CNNs were used to model waveforms for the BWE task [35]. Therefore, the HF strategy was adopted in the following experiments for building waveform-based BWE systems.

D. Model Comparison for Waveform-Based BWE

The performance of the three waveform-based BWE systems, i.e., the DCNN, SRNN and HRNN systems, was compared by objective and subjective evaluations.
The accuracy and efficiency metrics used in Section IV-B and the PESQ score used in Section IV-C were adopted as objective measurements. In addition, two extra metrics were adopted here: the signal-to-noise ratio (SNR) [40], which measures the distortion of waveforms, and the log spectral distance (LSD) [40], which reflects the distortion in the frequency domain. The SNR and LSD for voiced frames (denoted by SNR-V and LSD-V) and unvoiced frames (denoted by SNR-U and LSD-U) were also calculated separately for each system. For the fairness of the efficiency comparison, we set the mini-batch size to 1 for all three systems when generating the utterances in the test set. The time taken to generate 1 second of speech (i.e., 16000 samples for 16 kHz speech) using a Tesla K40 GPU was recorded as the measurement of efficiency in this experiment.

TABLE II
OBJECTIVE PERFORMANCE OF THE DCNN, SRNN AND HRNN SYSTEMS ON THE TEST SET.

                     DCNN             SRNN             HRNN
Accuracy (%)         7.18 ± 0.336     7.40 ± 0.387     7.52 ± 0.388
PESQ score           3.62 ± 0.0532    3.70 ± 0.0477    3.75 ± 0.0456
SNR (dB)             19.06 ± 0.5983   18.95 ± 0.6053   19.00 ± 0.6099
SNR-V (dB)           26.14 ± 0.7557   26.06 ± 0.7648   26.21 ± 0.7716
SNR-U (dB)           10.49 ± 0.4094   10.32 ± 0.4126   10.26 ± 0.4124
LSD (dB)             8.46 ± 0.122     8.61 ± 0.136     8.30 ± 0.127
LSD-V (dB)           7.71 ± 0.172     8.09 ± 0.203     8.02 ± 0.194
LSD-U (dB)           9.34 ± 0.124     9.19 ± 0.124     8.57 ± 0.107
Generation time (s)  3.97             19.39            3.61

Table II shows the objective performance of the three systems on the test set. The 95% confidence intervals were also calculated for all metrics except the generation time. The results of paired t-tests indicated that the differences between any two of the three systems on all metrics were significant (p < 0.01). For accuracy and PESQ score, the DCNN system was not as good as the other two systems. The HRNN system achieved the best performance on both accuracy and PESQ score.
For SNR, the HRNN system and the DCNN system achieved the best performance on voiced segments and unvoiced segments respectively. For LSD, the HRNN system achieved the lowest overall LSD and the lowest LSD on unvoiced segments, while the DCNN system achieved the lowest LSD on voiced frames among the three systems. Considering that LSDs were calculated using only amplitude spectra while SNRs were influenced by both the amplitude and phase spectra of the reconstructed waveforms, it can be inferred from the SNR-V and LSD-V results in Table II that the HRNN system was better at restoring the phase spectra of voiced frames than the DCNN system. In terms of efficiency, the generation time of the SRNN system was more than 5 times longer than that of the HRNN system, due to the sample-by-sample calculation at all layers in the SRNN structure, as discussed in Section III-A. The efficiency of the DCNN system was also slightly worse than that of the HRNN system. These results reveal that HRNNs can improve both the accuracy and efficiency of SRNNs by modeling long-span dependencies among sequences using a hierarchical structure.

The spectrograms extracted from clean wideband speech and from the output of BWE using the DCNN, SRNN and HRNN systems for an example sentence in the test set are shown in Fig. 8.

Fig. 8. The spectrograms of clean wideband speech and the output of BWE using the five systems for an example sentence in the test set.

It can be observed that the high-frequency energy of some unvoiced segments generated by the DCNN system was much weaker than that of the natural speech and the outputs of the SRNN and HRNN systems. Compared with the SRNN and HRNN systems, the DCNN system was better at reconstructing the high-frequency harmonic structures of some voiced segments. These observations are in line with the LSD results discussed earlier.
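For reference, the two extra metrics can be sketched as follows. This is a minimal Python illustration of the standard definitions (the paper follows [40]; exact framing, windowing, and averaging details may differ, and the function names are ours):

```python
import math

def snr_db(ref, est):
    """Signal-to-noise ratio in dB between a reference waveform and a
    reconstruction; the 'noise' is the sample-wise reconstruction error."""
    signal_energy = sum(r * r for r in ref)
    noise_energy = sum((r - e) ** 2 for r, e in zip(ref, est))
    return 10 * math.log10(signal_energy / noise_energy)

def lsd_db(spec_ref, spec_est):
    """Log spectral distance in dB: per-frame RMS difference of log amplitude
    spectra, averaged over frames (each frame is a list of amplitude bins)."""
    total = 0.0
    for fr, fe in zip(spec_ref, spec_est):
        sq = sum((20 * math.log10(r) - 20 * math.log10(e)) ** 2
                 for r, e in zip(fr, fe))
        total += math.sqrt(sq / len(fr))
    return total / len(spec_ref)

# Toy check: a small reconstruction error gives a finite SNR, and identical
# amplitude spectra give zero LSD.
ref = [0.5, -0.25, 0.125, -0.0625]
est = [0.45, -0.2, 0.1, -0.05]
assert 17.0 < snr_db(ref, est) < 18.0
assert lsd_db([[1.0, 2.0, 4.0]], [[1.0, 2.0, 4.0]]) == 0.0
```

Since LSD depends only on amplitude spectra while SNR is computed on waveforms (and therefore reflects phase as well), the two metrics separate the amplitude and phase aspects of reconstruction quality discussed above.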
Furthermore, two 3-point CCR tests were carried out to evaluate the subjective performance of the HRNN system, using the DCNN system and the SRNN system as the reference systems respectively. The configurations of the tests were the same as those introduced in Section IV-C. The results are shown as the second and third system pairs in Fig. 7. We can see that our proposed HRNN-based method generated speech with significantly better quality than the dilated CNN-based method. Compared with the SRNN system, the HRNN system was slightly better, although the superiority was insignificant at the 0.05 significance level. However, the HRNN system was much more efficient than the SRNN system at generation time, as shown in Table II.

E. Effects of Additional Conditions on HRNN-Based BWE

We compared the HRNN system with the CHRNN system by objective and subjective evaluations to explore the effects of additional conditions on HRNN-based BWE. As introduced in Section IV-A, the BN features were used as additional conditions in the CHRNN system since they can provide linguistic-related information beyond the acoustic waveforms. The CHRNN system adopted the conditional HRNN structure introduced in Section III-C with 4 tiers. The dimension of the BN features was 100 and the frame size at the top conditional tier was L(4) = 160, because the frame shift of the BN features was 10 ms, corresponding to 160 samples for 16 kHz speech.

TABLE III
OBJECTIVE PERFORMANCE OF THE HRNN AND CHRNN SYSTEMS ON THE TEST SET TOGETHER WITH THE p-VALUES OF PAIRED t-TESTS.

                     HRNN             CHRNN            p-value
Accuracy (%)         7.52 ± 0.388     7.46 ± 0.385     < 0.001
PESQ score           3.75 ± 0.0456    3.79 ± 0.0394    < 0.001
SNR (dB)             19.00 ± 0.6099   18.99 ± 0.5946   0.322
SNR-V (dB)           26.21 ± 0.7716   26.13 ± 0.7539   < 0.001
SNR-U (dB)           10.26 ± 0.4124   10.34 ± 0.4097   < 0.001
LSD (dB)             8.30 ± 0.127     8.27 ± 0.123     0.301
LSD-V (dB)           8.02 ± 0.194     7.89 ± 0.185     < 0.001
LSD-U (dB)           8.57 ± 0.107     8.66 ± 0.103     < 0.01
Generation time (s)  3.61             4.17             –

The objective measurements used in Section IV-D were adopted here to compare the HRNN and CHRNN systems. The results are shown in Table III. The CHRNN system outperformed the HRNN system on PESQ score, while its prediction accuracy was not as good as that of the HRNN system. For SNR, the two systems achieved similar performance. The LSD results show that the CHRNN system was better at reconstructing voiced frames, whereas the HRNN system was better at unvoiced frames. In terms of efficiency, the generation time of the CHRNN system was higher than that of the HRNN system due to the extra conditional tier.

A 3-point CCR test was also conducted to evaluate the subjective performance of the CHRNN system, using the HRNN system as the reference system and following the evaluation configurations introduced in Section IV-C. The results are shown as the fourth system pair in Fig. 7, which reveals that utilizing BN features as additional conditions in HRNN-based BWE can significantly improve the subjective quality of the reconstructed wideband speech. Fig. 8 also shows the spectrogram of the wideband speech generated by the CHRNN system for the example sentence. Comparing the spectrograms produced by the HRNN system and the CHRNN system, we can observe that the high-frequency components generated by the CHRNN system were stronger than those of the HRNN system.
This may lead to the better speech quality shown in Fig. 7.

F. Comparison between Waveform-Based and Vocoder-Based BWE Methods

Finally, we compared the performance of the vocoder-based and waveform-based BWE methods by conducting objective and subjective evaluations between the VRNN system and the CHRNN system, since both systems adopted BN features as auxiliary input.

TABLE IV
OBJECTIVE PERFORMANCE OF THE VRNN AND CHRNN SYSTEMS ON THE TEST SET TOGETHER WITH THE p-VALUES OF PAIRED t-TESTS.

             VRNN             CHRNN            p-value
PESQ score   3.87 ± 0.0368    3.79 ± 0.0394    < 0.001
SNR (dB)     17.76 ± 0.6123   18.99 ± 0.5946   < 0.001
SNR-V (dB)   25.00 ± 0.7333   26.13 ± 0.7539   < 0.001
SNR-U (dB)   9.01 ± 0.424     10.34 ± 0.4097   < 0.001
LSD (dB)     6.69 ± 0.110     8.27 ± 0.123     < 0.001
LSD-V (dB)   6.86 ± 0.148     7.89 ± 0.185     < 0.001
LSD-U (dB)   6.45 ± 0.0972    8.66 ± 0.103     < 0.001

The objective results, including PESQ, SNR and LSD, are shown in Table IV. The CHRNN system achieved significantly better SNR than the VRNN system, which suggests that our proposed waveform-based method can restore the phase spectra more accurately than the conventional vocoder-based method. For PESQ and LSD, the CHRNN system was not as good as the VRNN system. This is reasonable considering that the VRNN system modeled and predicted the LMS directly, which was used in the calculation of PESQ and LSD.

A 3-point CCR test was also conducted to evaluate the subjective performance of the CHRNN system, using the VRNN system as the reference system and following the evaluation configurations introduced in Section IV-C. The results are shown as the fifth system pair in Fig. 7. We can see that the CCR score was significantly higher than 0, which indicates that the CHRNN system achieved significantly higher quality of reconstructed wideband speech than the VRNN system. Comparing the spectrograms produced by the VRNN system and the CHRNN system in Fig. 8, it can be observed that the CHRNN system performed better than the VRNN system in generating the high-frequency harmonics of voiced sounds. Besides, the high-frequency components generated by the CHRNN system were less over-smoothed and more natural than those of the VRNN system in unvoiced segments. Furthermore, there was a discontinuity between the low-frequency and high-frequency spectra of the speech generated by the VRNN system, which was also found in other vocoder-based BWE methods [26]. As shown in Fig. 8, the waveform-based systems alleviated this discontinuity effectively. These experimental results indicate the superiority of modeling and generating speech waveforms directly over utilizing vocoders for feature extraction and waveform reconstruction in the BWE task.

G. Analysis and Discussion

1) Maximal latency of different BWE systems: Some application scenarios have strict requirements on the latency of the BWE algorithm. We compared the maximal latencies of the five BWE systems listed in Section IV-A, and the results are shown in Table V. Here, the latency refers to the duration of future input samples that are necessary for predicting the current output sample.

TABLE V
MAXIMAL LATENCIES (ms) OF THE FIVE BWE SYSTEMS. THE SAMPLING RATE OF WIDEBAND WAVEFORMS IS fs = 16 kHz.

        Maximal Latency               Remarks
VRNN    WS = 25                       WS: window size in ms of the STFT for extracting spectral parameters.
DCNN    N/(2 fs) = 32                 N + 1: length of the receptive field.
SRNN    0                             None
HRNN    (c(3) L(3) − 1)/fs = 1.9375   c(3), L(3): number of concatenated frames and frame size at Tier 3.
CHRNN   WS = 25                       WS: window size in ms of the STFT for extracting spectral parameters.
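The HRNN and DCNN entries in Table V follow directly from the structural parameters; a quick check in Python (N = 1024 is our inference from the roughly 64 ms non-causal receptive field reported for the DCNN system, not a value stated explicitly):

```python
fs = 16000  # wideband sampling rate in Hz

# HRNN: Tier 3 consumes c(3) concatenated frames of L(3) samples each, so
# c(3)*L(3) - 1 future samples beyond the current one are needed.
c3, L3 = 2, 16
hrnn_latency_ms = (c3 * L3 - 1) / fs * 1000  # = 1.9375 ms

# DCNN: a non-causal receptive field of N + 1 samples is centered on the
# current sample, so N/2 future samples are needed.
N = 1024  # assumed from the ~64 ms receptive field
dcnn_latency_ms = N / (2 * fs) * 1000  # = 32.0 ms

assert abs(hrnn_latency_ms - 1.9375) < 1e-9
assert dcnn_latency_ms == 32.0
```

The SRNN entry is zero because it conditions only on past samples, and the VRNN/CHRNN entries are fixed by the 25 ms STFT analysis window rather than the network structure.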
The maximal latencies of the VRNN system and the CHRNN system were both determined by the window size of the STFT for extracting the LMS and MFCC parameters, which was 25 ms in our implementation. The maximal latencies of the other three systems depended on their structures. The SRNN system processes input waveforms and generates output waveforms sample by sample without latency, according to (3). Because the non-causal CNN structure shown in Fig. 1 was adopted by the DCNN system and its receptive field length was about 64 ms [35], it had the highest latency among the five systems. The latency of the HRNN system was relatively short because the number of concatenated frames and the frame size of the top tier were small (c(3) = 2 and L(3) = 16).

2) Run-time efficiency of waveform-based BWE: One deficiency of the waveform-based BWE methods is that they are very time-consuming at generation time. As shown in Table II and Table III, the HRNN system achieved the best run-time efficiency among the four waveform-based systems, but it still took 3.61 seconds to generate 1 second of speech in our current implementation. Therefore, accelerating the computation of HRNNs is an important task for our future work. As shown in Fig. 6, using longer frame sizes may help reduce the computational complexity of HRNNs. Another possible way is to reduce the number of hidden units and other model parameters, similar to the attempts at accelerating WaveNet for speech synthesis [39].

V. CONCLUSION

In this paper, we have proposed a novel waveform modeling and generation method using hierarchical recurrent neural networks (HRNNs) for the speech bandwidth extension (BWE) task. HRNNs adopt a hierarchy of recurrent modules to capture long-span dependencies between input and output waveform sequences.
Compared with the plain sample-level RNN and the stacked dilated CNN, the proposed HRNN model achieves better accuracy and efficiency in predicting high-frequency waveform samples. In addition, additional conditions, such as the bottleneck (BN) features extracted from narrowband speech, can further improve the subjective quality of the reconstructed wideband speech. The experimental results show that our proposed HRNN-based method achieves higher subjective preference scores than the conventional vocoder-based method using LSTM-RNNs. Evaluating the performance of our proposed methods on practical band-limited speech data, improving the efficiency of waveform generation using HRNNs, and utilizing other types of additional conditions will be the tasks of our future work.

REFERENCES

[1] K. Nakamura, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, "A mel-cepstral analysis technique restoring high frequency components from low-sampling-rate speech," in Proc. Interspeech, 2014.
[2] A. Albahri, C. S. Rodriguez, and M. Lech, "Artificial bandwidth extension to improve automatic emotion recognition from narrow-band coded speech," in Proc. ICSPCS, 2016, pp. 1–7.
[3] M. M. Goodarzi, F. Almasganj, J. Kabudian, Y. Shekofteh, and I. S. Rezaei, "Feature bandwidth extension for Persian conversational telephone speech recognition," in Proc. ICEE, 2012, pp. 1220–1223.
[4] S. Chennoukh, A. Gerrits, G. Miet, and R. Sluijter, "Speech enhancement via frequency bandwidth extension using line spectral frequencies," in Proc. ICASSP, vol. 1, 2001, pp. 665–668.
[5] F. Mustière, M. Bouchard, and M. Bolić, "Bandwidth extension for speech enhancement," in Proc. CCECE, 2010, pp. 1–4.
[6] J. Makhoul and M. Berouti, "High-frequency regeneration in speech coding systems," in Proc. ICASSP, vol. 4, 1979, pp. 428–431.
[7] S. Vaseghi, E. Zavarehei, and Q.
Yan, "Speech bandwidth extension: extrapolations of spectral envelope and harmonicity quality of excitation," in Proc. ICASSP, vol. 3, 2006, pp. III–III.
[8] H. Pulakka, U. Remes, K. Palomäki, M. Kurimo, and P. Alku, "Speech bandwidth extension using Gaussian mixture model-based estimation of the highband mel spectrum," in Proc. ICASSP, 2011, pp. 5100–5103.
[9] Y. Wang, S. Zhao, Y. Yu, and J. Kuang, "Speech bandwidth extension based on GMM and clustering method," in Proc. CSNT, 2015, pp. 437–441.
[10] Y. Ohtani, M. Tamura, M. Morita, and M. Akamine, "GMM-based bandwidth extension using sub-band basis spectrum model," in Proc. Interspeech, 2014, pp. 2489–2493.
[11] Y. Zhang and R. Hu, "Speech wideband extension based on Gaussian mixture model," Chinese Journal of Acoustics, no. 4, pp. 363–377, 2009.
[12] G.-B. Song and P. Martynovich, "A study of HMM-based bandwidth extension of speech signals," Signal Processing, vol. 89, no. 10, pp. 2036–2044, 2009.
[13] Z. Yong and L. Yi, "Bandwidth extension of narrowband speech based on hidden Markov model," in Proc. ICALIP, 2014, pp. 372–376.
[14] P. Bauer and T. Fingscheidt, "An HMM-based artificial bandwidth extension evaluated by cross-language training and test," in Proc. ICASSP, 2008, pp. 4589–4592.
[15] G. Chen and V. Parsa, "HMM-based frequency bandwidth extension for speech enhancement using line spectral frequencies," in Proc. ICASSP, vol. 1, 2004, pp. I–709.
[16] Z.-H. Ling, S.-Y. Kang, H. Zen, A. Senior, M. Schuster, X.-J. Qian, H. M. Meng, and L. Deng, "Deep learning for acoustic modeling in parametric speech generation: A systematic review of existing techniques and future trends," IEEE Signal Processing Magazine, vol. 32, no. 3, pp. 35–52, 2015.
[17] Z.-H. Ling, L. Deng, and D.
Yu, "Modeling spectral envelopes using restricted Boltzmann machines and deep belief networks for statistical parametric speech synthesis," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2129–2139, 2013.
[18] H. Zen, A. Senior, and M. Schuster, "Statistical parametric speech synthesis using deep neural networks," in Proc. ICASSP, 2013, pp. 7962–7966.
[19] L.-H. Chen, Z.-H. Ling, L.-J. Liu, and L.-R. Dai, "Voice conversion using deep neural networks with layer-wise generative training," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 22, no. 12, pp. 1859–1872, 2014.
[20] T. Nakashika, R. Takashima, T. Takiguchi, and Y. Ariki, "Voice conversion in high-order eigen space using deep belief nets," in Proc. Interspeech, 2013, pp. 369–372.
[21] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Speech enhancement based on deep denoising autoencoder," in Proc. Interspeech, 2013, pp. 436–440.
[22] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 23, no. 1, pp. 7–19, 2015.
[23] C. V. Botinhao, B. S. Carlos, L. P. Caloba, and M. R. Petraglia, "Frequency extension of telephone narrowband speech signal using neural networks," in Proc. CESA, vol. 2, 2006, pp. 1576–1579.
[24] J. Kontio, L. Laaksonen, and P. Alku, "Neural network-based artificial bandwidth expansion of speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 3, pp. 873–881, 2007.
[25] H. Pulakka and P. Alku, "Bandwidth extension of telephone speech using a neural network and a filter bank implementation for highband mel spectrum," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2170–2183, 2011.
[26] K. Li and C.-H. Lee, "A deep neural network approach to speech bandwidth expansion," in Proc. ICASSP, 2015, pp. 4395–4399.
[27] B. Liu, J. Tao, Z. Wen, Y. Li, and D. Bukhari, "A novel method of artificial bandwidth extension using deep architecture," in Proc. Interspeech, 2015, pp. 2598–2602.
[28] Y. Wang, S. Zhao, W. Liu, M. Li, and J. Kuang, "Speech bandwidth expansion based on deep neural networks," in Proc. Interspeech, 2015, pp. 2593–2597.
[29] J. Abel, M. Strake, and T. Fingscheidt, "Artificial bandwidth extension using deep neural networks for spectral envelope estimation," in Proc. IWAENC, 2016, pp. 1–5.
[30] Y. Gu and Z.-H. Ling, "Restoring high frequency spectral envelopes using neural networks for speech bandwidth extension," in Proc. IJCNN, 2015, pp. 1–8.
[31] Y. Wang, S. Zhao, J. Li, and J. Kuang, "Speech bandwidth extension using recurrent temporal restricted Boltzmann machines," IEEE Signal Processing Letters, vol. 23, no. 12, pp. 1877–1881, 2016.
[32] Y. Gu, Z.-H. Ling, and L.-R. Dai, "Speech bandwidth extension using bottleneck features and deep recurrent neural networks," in Proc. Interspeech, 2016, pp. 297–301.
[33] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint, 2016.
[34] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio, "SampleRNN: An unconditional end-to-end neural audio generation model," arXiv preprint, 2016.
[35] Y. Gu and Z.-H. Ling, "Waveform modeling using stacked dilated convolutional neural networks for speech bandwidth extension," in Proc. Interspeech, 2017, pp. 1123–1127.
[36] D. Yu and M. L. Seltzer, "Improved bottleneck features using pretrained deep neural networks," in Proc. Interspeech, 2011, pp. 237–240.
[37] Z. Wu, C. Valentini-Botinhao, O. Watts, and S. King, "Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis," in Proc. ICASSP, 2015, pp. 4460–4464.
[38] J. B.
Allen and L. R. Rabiner, "A unified approach to short-time Fourier analysis and synthesis," Proceedings of the IEEE, vol. 65, no. 11, pp. 1558–1564, 1977.
[39] S. O. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, J. Raiman, S. Sengupta et al., "Deep Voice: Real-time neural text-to-speech," arXiv preprint, 2017.
[40] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, "Speaker-dependent WaveNet vocoder," in Proc. Interspeech, 2017, pp. 1118–1122.
[41] Y.-J. Hu, C. Ding, L.-J. Liu, Z.-H. Ling, and L.-R. Dai, "The USTC system for Blizzard Challenge 2017," in Proc. Blizzard Challenge Workshop, 2017.
[42] ITU-T Recommendation G.711, "Pulse code modulation (PCM) of voice frequencies," International Telecommunication Union, 1988.
[43] Y. Fan, Y. Qian, F.-L. Xie, and F. K. Soong, "TTS synthesis with bidirectional LSTM based recurrent neural networks," in Proc. Interspeech, 2014, pp. 1964–1968.
[44] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, and Y. Bengio, "Char2Wav: End-to-end speech synthesis," in Proc. ICLR Workshop Track, 2017.
[45] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon Technical Report N, vol. 93, 1993.
[46] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[47] ITU-T Recommendation P.862.2, "Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs," International Telecommunication Union, 2007.
[48] A. O. Watson, "Assessing the quality of audio and video components in desktop multimedia conferencing," Ph.D. dissertation, University of London, 2001.
[49] S. Buchholz and J. Latorre, "Crowdsourcing preference tests, and how to detect cheating," in Proc. Interspeech, 2011, pp. 1118–1122.