WaveNet based low rate speech coding

W. Bastiaan Kleijn¹,³, Felicia S. C. Lim¹, Alejandro Luebs¹, Jan Skoglund¹, Florian Stimberg², Quan Wang¹, Thomas C. Walters²

¹Google Inc., San Francisco, CA; ²DeepMind, London, UK; ³Victoria University of Wellington, NZ

ABSTRACT

Traditional parametric coding of speech facilitates low rate but provides poor reconstruction quality because of the inadequacy of the model used. We describe how a WaveNet generative speech model can be used to generate high quality speech from the bit stream of a standard parametric coder operating at 2.4 kb/s. We compare this parametric coder with a waveform coder based on the same generative model and show that approximating the signal waveform incurs a large rate penalty. Our experiments confirm the high performance of the WaveNet based coder and show that the system additionally performs implicit bandwidth extension and does not significantly impair recognition of the original speaker for the human listener, even when that speaker has not been used during the training of the generative model.

Index Terms: Speech coding, parametric coding, WaveNet, generative model

1. INTRODUCTION

Speech coding found its first major application in secure communications, e.g., [1], and later enabled low-cost mobile and internet communications, e.g., [2, 3, 4]. With the continuously decreasing cost of bandwidth in most applications, the trade-off between rate and quality has gravitated to higher rates, typically over 16 kb/s, to ensure good quality. This state of the art may change with the advent of a new generation of coders that provides a significant leap in performance. In this paper we discuss an approach that can provide good quality at rates around 2-3 kb/s, with significant potential for further improvement in the rate-quality trade-off.
The redundancy in rate of existing speech coders can be determined from estimates of the information rate in speech. A recent rate estimate [5], based on comparing signals carrying the same message, is consistent with lexical information rates computed from phoneme statistics [6]. These estimates suggest that the true information rate is less than 100 b/s. Attributes of speech that identify the speaker and speaking style do not vary rapidly over time and hence do not change this rough estimate significantly. The common coding algorithms used in current communication systems require a rate that is roughly two orders of magnitude higher than the rate of the information conveyed.

Essentially all speech coding methods are based on an explicit model of the signal, which usually is time varying. In parametric coding, the signal is generated at the decoder based on the model parameters only. The quality of the signal reproduced by parametric coders is limited by the efficacy of the model. However, even poor signal models can be usefully exploited for high-quality coding. Waveform coding exploits the fact that conditioning information (the model and its parameters) reduces the minimum rate required to achieve a particular mean error for the signal waveform. The penalty paid is that the reproduced signal is an approximation of the original waveform. This requires the transmission of information that, at least in principle, is not needed for high-fidelity reconstruction, explaining in part the high rate of current speech coding schemes.

Most models used in speech coding have a statistical basis. A speech signal can be described as a discrete-time stochastic process {s_i} with a non-zero (differential) entropy rate. A discrete-time stochastic process can be characterized by a sequence of conditional probability density functions (PDFs) f(s_i | s_{i-1}, s_{i-2}, ...). If the memory is p samples, then the process is an order-p Markov process.
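These two standard definitions can be written compactly:

```latex
% Order-p Markov property: only the last p samples matter
f(s_i \mid s_{i-1}, s_{i-2}, \ldots) = f(s_i \mid s_{i-1}, \ldots, s_{i-p}).

% Chain rule: the joint PDF of a length-N sequence factorizes into
% the product of the conditional PDFs
f(s_1, \ldots, s_N) = \prod_{i=1}^{N} f(s_i \mid s_{i-1}, \ldots, s_1).
```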
Application of the chain rule relates the PDF of a sample sequence to the conditional PDFs.

Various generative models have been used to describe the speech process. The models generally assume speech to be a Markov process and provide a conditional distribution for the next signal sample given a set of past samples. Linear autoregressive (AR) models [7] and hidden Markov models (HMMs) [8] are ubiquitous. Refined generative models such as ARMA, e.g., [9], and kernel density estimation (KDE) HMMs, e.g., [10], can also be used. However, linear AR modeling remains the most commonly used generative model in speech coding. This may change with the recent introduction of the deep neural network (DNN) based WaveNet [11] generative model.

Recursive sampling of the conditional PDF of a speech model can be used to produce a speech signal. The sampling of the PDF corresponds to the generation of new information. To retain the perceptible attributes of an original speech signal in the coding application, the PDF must include conditioning variables (side information). These typically specify the short-term power spectral density, pitch, and periodicity level. Good signal quality can be guaranteed even for imperfect generative models by approximating the original signal waveform. However, no new information is then generated during reconstruction and this information must be transmitted instead. Efficiency can be increased by encoding blocks of samples simultaneously, for example using analysis-by-synthesis [12] in the context of generative models. The analysis-by-synthesis paradigm applied to AR models, introduced in [13], is universally used in mobile phone standards.

The contribution of this paper is the usage of WaveNet as a generic generative model for speech coding and an analysis thereof.
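The recursive sampling described above can be sketched as follows. The `model` callable is a placeholder assumption standing in for any trained generative model (e.g., a WaveNet); it is not the paper's implementation.

```python
import numpy as np

def generate(model, cond, n_samples, memory, rng):
    """Recursively sample s_i ~ f(s_i | past samples; theta_i).

    model: hypothetical callable mapping (past samples, conditioning
           vector) to a probability vector over 256 mu-law levels.
    cond:  per-sample conditioning vectors (the side information).
    """
    past = np.zeros(memory, dtype=np.int64)
    out = np.empty(n_samples, dtype=np.int64)
    for i in range(n_samples):
        probs = model(past, cond[i])       # conditional distribution
        out[i] = rng.choice(256, p=probs)  # sampling generates new information
        past = np.roll(past, -1)
        past[-1] = out[i]
    return out
```

With a uniform dummy model this produces random µ-law indices; a trained model concentrates its probability mass so the sampled sequence becomes speech-like.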
We describe two WaveNet based coding architectures: i) a parametric coder that encodes only the conditioning variables, and ii) a waveform coder that encodes the conditioning variables and also the observed waveform, exploiting the conditional distribution. The parametric coder differs from [14] in that it is not speaker dependent and that its bit stream can also be decoded with a conventional low-complexity decoder. Our coding architectures replace the text sequence used as conditioning information in the original text-to-speech (TTS) application of WaveNet by a sequence of quantized parameters of a parametric speech coder. This change is nontrivial, as in a coding scenario the generative model cannot be trained on the speakers that the coder encounters during operation. It will be shown in section 3 that WaveNet can be generalized to the multi-speaker case.

In the remainder of this paper, we first describe our approach in more detail in section 2, then discuss our experimental results in section 3, and finally provide conclusions in section 4.

2. ALGORITHM

In this section we discuss the parametric coding architecture, analyze the rate, and describe the waveform coding architecture.

2.1. Parametric WaveNet Coder

A parametric coder transmits only the conditioning variables of the generative model that generates the signal at the decoder. It is possible to train a neural architecture at the encoder to optimize the conditioning variables. In this paper, we opted for a conventional parametric encoder instead. The latter approach has a significantly lower complexity at the encoder and facilitates the use of a low-complexity decoder if computational resources fall short. We first motivate our conventional encoder choice and then describe the WaveNet decoder.

Traditional parametric coders almost always encode a similar set of parameters: spectral envelope, pitch, and voicing level.
The parameter set differs little between approaches based on a temporal perspective with glottal pulse trains [15, 16] and those based on a frequency-domain perspective with sinusoids [17, 18]. Any of these parameter sets can be used as a set of conditioning variables for WaveNet.

To illustrate that our architecture does not require special features for the encoder, we selected Codec 2 [19], an open-source speech coder. It belongs to the sinusoidal coder family and can run at various update rates. For example, at 2.4 kb/s, each 20 ms block encodes the short-time spectral envelope using line spectral frequencies [20] with 36 bits, the pitch with 7 bits, the signal power with 5 bits, and the voicing level with 2 bits. The voicing level is determined as in the multiband excitation vocoder [21].

We note that parametric coders (including Codec 2) almost universally operate on narrow-band speech with a sampling rate of 8 kHz, but for high quality output speech a wide-band signal (≥ 16 kHz) is preferred. Historically, wide-band extension is optionally applied after the decoder [22]. In our approach, we train our decoder with 8 kHz conditioning variables and 16 kHz speech signals such that it implicitly performs bandwidth extension.

For the parametric decoder, we use the WaveNet generative model [11]. Given the past output signal and the conditioning variables, WaveNet provides a discrete probability distribution of the next signal sample using the 8-bit ITU-T G.711 µ-law format. It then samples this distribution to select the output sample value. The WaveNet architecture is a multi-layer structure using dilated convolution with gated cells. The conditioning variables are supplied to all layers of the network. For the coder, we retained the standard WaveNet configuration of [11] but replaced the conditioning variables with the decoded Codec 2 bit stream.
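The temporal context available to such a dilated stack grows exponentially with depth. A minimal sketch of the receptive-field computation, assuming the dilation pattern of the original WaveNet paper (1, 2, ..., 512, repeated three times, with kernel size 2):

```python
def receptive_field(dilations, kernel_size=2):
    # Each dilated causal convolution layer extends the receptive
    # field by (kernel_size - 1) * dilation samples.
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Dilation pattern 1, 2, 4, ..., 512, repeated three times (an
# assumption based on the original WaveNet configuration):
stack = [2 ** k for k in range(10)] * 3
print(receptive_field(stack))  # 3070 samples, i.e. about 192 ms at 16 kHz
```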
The Codec 2 decoder provides its parameters to its sinusoidal renderer at 100 Hz; we used these parameters unchanged as the conditioning variables for the WaveNet decoder. As WaveNet requires conditioning for each output sample, we hold the conditioning variables constant over 10 ms intervals.

During training, WaveNet learns the parameters of a softmax function that represents the conditional discrete probability distribution. The training is subject to the same conditioning variables that are used at run time. In contrast to WaveNet training for TTS applications, we used a training database containing a large number of different talkers providing a wide variety of voice characteristics, all without conditioning on a label that identifies the talker.

2.2. Rate Analysis

The rate benefit of generating the waveform over approximating the original waveform can be estimated from the information rate generated at the decoder. Basic information theory allows us to estimate the relevant rates. Let us consider a generated signal sequence {S_i}, i ∈ A, where A is an index sequence, and a conditioning sequence {Θ_i}, i ∈ A, with the two sequences being presented at the same sampling rate. Let both sequences be discrete valued and of finite length |A|. Their entropy rates then satisfy

$$\frac{1}{|A|} H(\{S_i\}, \{\Theta_i\}) = \frac{1}{|A|} H(\{S_i\} \mid \{\Theta_i\}) + \frac{1}{|A|} H(\{\Theta_i\}), \quad (1)$$

where we omitted sequence subscripts to simplify notation. Hence, the overall information rate contained in the decoded signal is the sum of the information rate associated with the generative process and the rate of the encoded conditioning parameters. The rate required for the conditioning variables is upper bounded by the rate of the parametric coder. The information rate (1/|A|) H({S_i} | {Θ_i}) associated with the generative process is simple to evaluate for WaveNet.
We consider |A| → ∞ and make the additional assumption that speech is a short-time stationary and ergodic process and that, therefore, {S_i} and {Θ_i} are stationary.

At its output, the WaveNet decoder produces a probability distribution over a set N of 256 discrete values for a µ-law encoding of the next sample. We denote this distribution as q_n^{(i)}, i ∈ Z, n ∈ N. The distribution q^{(i)} is sampled to produce the scalar output signal value. The mean information in bits generated by this sampling operation is the conditional entropy of the distribution:

$$H(S_i \mid s_{i-1}, s_{i-2}, \ldots; \theta_i) = -\sum_{n \in N} q_n^{(i)} \log_2 q_n^{(i)}, \quad (2)$$

where θ_i is the current vector of conditioning variables, and where capital letters indicate random variables and lower-case letters realizations. Note that H(S_i | s_{i-1}, s_{i-2}, ...; θ_i) is simple to compute and that the evaluation is exact for the reconstruction process.

We can now find the generated signal rate using an approximation to the chain rule:

$$\lim_{|A| \to \infty} \frac{1}{|A|} H(\{S_i\} \mid \{\Theta_i\}) = H(S_i \mid S_{i-1}, S_{i-2}, \ldots; \Theta_i) \quad (3)$$
$$\approx \frac{1}{|A_0|} \sum_{i \in A_0} H(S_i \mid s_{i-1}, s_{i-2}, \ldots; \theta_i), \quad (4)$$

where A_0 is an observed finite sequence. We will evaluate the result in section 3.

2.3. WaveNet Waveform Coder

As will be seen in section 3, relation (4) implies that a coder that reproduces the signal waveform is inefficient compared to a coder relying on a generative model. However, a generative coder cannot guarantee its output quality; situations may occur where the modelled generative distribution is not a good description of the signal. Importantly, scenarios where the parametric WaveNet coder does not perform well can be detected within the WaveNet framework. Running a waveform WaveNet at the encoder allows evaluation of the log likelihood of the input signal under the WaveNet generative model, which can be compared to its expectation.
Thus, we can select between parametric and waveform coding (low and high rate). When generative performance is good, the decoder receives no waveform information and reverts to a generative mode. Resynchronization can be achieved with conventional techniques [23]. We will report on this mode-selection paradigm elsewhere and only discuss the basic WaveNet waveform coder here.

Waveform coding is commonly based on prediction, particularly at rates below 30 kb/s. While fixed-rate predictive coding is most common, variable-rate versions also exist. In predictive coders, the conditional probability distribution is generally considered fixed relative to its mean predicted value. The same approach can be used for WaveNet coding, but it is not natural. In WaveNet waveform coding, the prediction step can be omitted, as a conditional discrete distribution of the next sample is available without actually computing a predicted sample value.

We now describe our variable-rate WaveNet waveform coding structure. We first discuss the quantization step and the corresponding decoding step. Let Q : R → N be the mapping from the signal to its quantization index and Z : N → R be the corresponding decoding operation. Then, a signal x_i is encoded at the encoder as n_i = Q(x_i) and the quantized signal is obtained at the decoder as x̂_i = Z(n_i). The resolution of Q determines both the rate and the quality of the waveform coder. In conventional WaveNet, Q is a µ-law quantizer. Our proposed approach described below will therefore result in the exact same waveform as basic µ-law coding.

In the WaveNet waveform coder, the sequence of quantization indices {n_i}, i ∈ Z, is subject to entropy coding and subsequently transmitted over the channel along with the conditioning variables.
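In the basic configuration, the mappings Q and Z above are the 256-level µ-law companding quantizer and its inverse. A self-contained sketch, using the continuous µ-law formula rather than the segmented ITU-T G.711 tables (an approximation of the standard's codec, adequate for illustration):

```python
import numpy as np

MU = 255  # 8-bit mu-law: 256 discrete levels

def Q(x):
    """Map a signal in [-1, 1] to a quantization index n in {0..255}."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return np.floor((y + 1) / 2 * MU + 0.5).astype(np.int64)

def Z(n):
    """Decode an index back to the quantized amplitude x_hat."""
    y = 2 * (n.astype(np.float64) / MU) - 1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

x = np.linspace(-1, 1, 101)
x_hat = Z(Q(x))  # round trip: coarse cells near +-1, fine cells near 0
```

The logarithmic warping gives near-constant relative error over a wide amplitude range, which is why the squared error on µ-law warped amplitudes is the natural distortion measure for this coder.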
Identically trained WaveNet models are required at both the encoder and decoder to provide the conditional probability distribution q^{(i)} to the entropy encoder and decoder. Importantly, it is x̂_i that is used at the decoder as input for generating subsequent samples in the WaveNet model, thus creating a closed-loop coder. (In contrast, the parametric WaveNet coder takes the previously generated sample as input.) Although q^{(i)} is an approximate distribution for the original signal {x_i}, i ∈ A, it can be used to reduce the rate significantly with entropy coding. Techniques such as arithmetic coding [24, 25] can provide near-optimal coding for an entire sequence of indices. If the predictive distributions q^{(i)} are correct, then the rate of the waveform coder will be arbitrarily close to that of (4). Any mismatch in the conditional distribution results in an expected rate increase that is specified by the Kullback-Leibler divergence.

The described WaveNet waveform coding scheme is close to optimal for the squared error measure on samples with a µ-law warped amplitude. The cubic cell shape of scalar quantization imposes a penalty that is (asymptotically with increasing rate) at most 1.5 dB in mean squared error distortion or, equivalently, 0.25 bits per sample [26]. As conditional distributions for higher dimensionalities and the usage of analysis-by-synthesis suffer from high complexity, the removal of this penalty is non-trivial.

The WaveNet waveform coder can be characterized by two rates. On the one hand, we can compute the average entropy rate of the estimated conditional distribution:

$$\bar{H} = -\frac{1}{|A_0|} \sum_{i \in A_0} \sum_{n \in N} q_n^{(i)} \log_2 q_n^{(i)}, \quad (5)$$

which provides an estimate of the true average entropy rate. On the other hand, we can compute a lower bound on the real-world rate produced by the entropy coder:

$$R = -\frac{1}{|A_0|} \sum_{i \in A_0} \log_2 q_{n_i}^{(i)}, \quad (6)$$

which forms an upper bound on the average entropy rate.
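Both rates are straightforward to compute from the model's predictive distributions. A sketch, assuming the distributions q and the transmitted indices n are available as arrays:

```python
import numpy as np

def average_rates(q, n):
    """Estimate the two coder rates in bits per sample.

    q: array (T, 256), row i holds the predictive distribution q^(i)
    n: array (T,), the transmitted quantization indices n_i
    Returns (H_bar, R): the average predictive entropy (eq. 5) and
    the ideal entropy-coder rate (eq. 6).
    """
    eps = 1e-12  # guard against log2(0) for zero-probability symbols
    H_bar = -np.mean(np.sum(q * np.log2(q + eps), axis=1))
    R = -np.mean(np.log2(q[np.arange(len(n)), n] + eps))
    return H_bar, R
```

For a perfectly uniform q, both rates equal 8 bits per sample, the raw µ-law rate; sharper predictive distributions reduce both, which is how the rate of roughly 2.6 bits per sample reported in section 3 arises.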
If R and H̄ are close, we can be confident that the true average entropy rate is similar. The measured rates will be discussed in section 3.

Finally, we discuss the rate required for the conditioning variables of a WaveNet waveform coder. It was shown in [27] that, under reasonable conditions, the optimal rate for the conditioning variables does not depend on the mean signal distortion. This also applies to the WaveNet based waveform coder and implies that only the resolution of the quantizer Q has to be adjusted to vary the rate.

We did not include perceptual weighting in the current implementation of the waveform WaveNet coder. However, pre- and post-filtering structures can be introduced to enhance coder performance by exploiting perception.

3. EXPERIMENTAL RESULTS

In this section, we evaluate the information rates, signal quality, and speaker identifiability produced by the WaveNet coding schemes. Listening examples are available online.¹

3.1. Experimental Setup

The WaveNet system as described in [11] was used to develop our proposed WaveNet coders. At the encoder, we employed Codec 2 at 8 kHz and 2.4 kb/s. As previously discussed, its decoded bit stream was used as WaveNet conditioning variables, held constant over 10 ms intervals. The decoder then generated output samples at 16 kHz.

The training and test sets were derived from the Wall Street Journal speech corpus [28] with no overlapping speakers. The training set contained 32580 utterances by 123 speakers and the test set contained 2907 utterances by 8 speakers.

Quality was evaluated using POLQA and listening tests against a number of unmodified reference coders: Codec 2 (2.4 kb/s), MELP (2.4 kb/s) [29], Speex wideband (2.4 kb/s) [30], ITU-T G.711 µ-law at 16 kHz (128 kb/s), and ITU-T G.722.2 AMR-WB (23.05 kb/s).
Speaker identification performance of the parametric WaveNet coder was evaluated with listening tests and a neural network based model [31] trained on our dataset. As a reference, we trained a second parametric WaveNet coder with overlapping speakers in the training and test sets.

3.2. Speech Information Rates

We used (4) to estimate the rate of waveform generation on the test dataset (removing silence segments). The result was a mean rate of 2.65 bits per sample, or 42 kb/s at 16 kHz. Informal testing indicates that this rate is largely independent of the rate of the conditioning parameters. We also estimated the rates associated with (5) and (6) and obtained 2.61 and 2.62 bits per sample, respectively, or about 42 kb/s at 16 kHz. Since these are very similar, we can be confident that the true average entropy rate is similar.

Fig. 1 shows a speech segment of 0.2 seconds and the corresponding instantaneous information rate for the waveform coder. The early part of the signal is a fricative, which is relatively unstructured, so a higher rate is required. The latter part is a voiced segment, where the required rate is low. The rate additionally varies with the pitch cycle. The highest rate is not associated with the pitch pulse, implying that WaveNet predicts the pitch accurately, but the waveform is noisier across certain segments of the pitch cycle.

Fig. 1: Top: signal waveform. Bottom: instantaneous information computed as the terms in the sample average sums in (5) (blue) and (6) (orange).

¹ https://goo.gl/C14FFx

3.3. Quality Experiments

The objective quality of the reference coders and the two WaveNet coders was evaluated using POLQA [32]; the results are shown in Table 1. We noted that POLQA did not reflect informal listening impressions, in which the bandwidth extension and the absence of the distortions typical of a vocoder-based parametric coder were clearly heard.
This discrepancy was not unexpected, as the parametric WaveNet coder changes the signal waveform and the timing of the phones.

A subjective MUSHRA-type listening test [33] was performed in which 21 participants evaluated 8 utterances. The µ-law coder was omitted from the test as it is identical to the WaveNet waveform coder. The results are given in Fig. 2, where it can be seen that two distinct groups emerged: a low-quality group consisting of Speex, Codec 2, and MELP, and a high-quality group consisting of AMR-WB, the WaveNet waveform coder, and the parametric WaveNet coder. Thus, the parametric WaveNet coder has a subjective quality similar to that of existing waveform coders, with the benefit of significantly lower rates.

Table 1: POLQA mean opinion scores (MOS-LQO) for different coders operating at different rates (kb/s). WW: WaveNet waveform coder, WP: parametric WaveNet coder.

            Codec 2   MELP   Speex   AMR-WB   WW    WP
    Rate    2.4       2.4    2.4     23       42    2.4
    MOS     2.7       2.9    2.2     4.6      4.7   2.9

Fig. 2: Subjective quality (MUSHRA scores). WW: WaveNet waveform coder, WP: parametric WaveNet coder.

3.4. Speaker Identification Experiments

An objective speaker identification test was performed using a neural network based speaker identification model [31]. Two single-layer models were trained with µ-law coded and parametric WaveNet coded speech, respectively. There were no overlapping speakers between the training and test data. The training set contained 123 speakers and 3690 utterances; the same test set as before was used, but now split into enrollment and verification sets with overlapping speakers but non-overlapping utterances. The verification equal error rate (EER) results were 8.4% for the µ-law coded speech and 15.8% for the WaveNet coded speech. It is known that EER goes up after coding [34] and, in this case, it is expected that the spectral resolution of the low rate coder has restricted speaker identifiability.
A listening test was also carried out. For this, a second parametric WaveNet coder was trained with overlapping speakers between the training and test data. We term this model W_w and the first model, trained without overlapping speakers, W_wo. A triangle test was performed in which 15 listeners listened to 16 trials, each with 3 different utterances by the same speaker. Two utterances were taken from one of {W_w, W_wo} and one utterance from the other model. The total number of utterances from each model was equal over all trials. The listeners had to indicate the utterance that appeared to be spoken by a different speaker. On average, they correctly identified the odd utterance in 41% of the trials; if the two models were indistinguishable, the expected value would be 33%. This discrepancy is likely to diminish with more speakers in the training set. These experiments indicate that the current coder would be more suitable for human-facing applications, e.g., conference calls.

4. CONCLUSIONS

Our results show that the high fidelity of the conditional probability distribution of the speech waveform provided by WaveNet can be leveraged to create state-of-the-art speech and audio coding systems. We demonstrated that the quality of our 2.4 kb/s parametric speech coder is similar to that of waveform coders with much higher rates. We also showed how to build a waveform WaveNet coder and briefly discussed a method for switching from the parametric coder to the waveform coder based on a likelihood-based quality measure for the parametric coder.

The computational cost of both training and running WaveNet is high compared to conventional coders. The exception is the parametric WaveNet encoder, which is a conventional, low-complexity parametric encoder.

It is expected that performance can be improved further.
For example, the conditioning parameter set and its interpolation over time can be further refined, and pre-/post-filtering can be introduced to the waveform coder to improve perceived performance. To increase computational efficiency, it may be beneficial to study whether the long-lag memory components of the conditional probability distribution unnecessarily duplicate information that is also present in the bit stream.

5. REFERENCES

[1] T. Tremain, "Linear predictive coding systems," in ICASSP '76, IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, Apr 1976, pp. 474-478.

[2] P. Kroon, E. Deprettere, and R. Sluyter, "Regular-pulse excitation - a novel approach to effective and efficient multi-pulse coding of speech," IEEE Trans. Acoust., Speech, Signal Process., vol. 34, no. 5, pp. 1054-1063, Oct 1986.

[3] R. Salami, C. Laflamme, J. P. Adoul, and D. Massaloux, "A toll quality 8 kb/s speech codec for the personal communications system (PCS)," IEEE Transactions on Vehicular Technology, vol. 43, no. 3, pp. 808-816, Aug 1994.

[4] W. B. Kleijn, R. P. Ramachandran, and P. Kroon, "Interpolation of the pitch-predictor parameters in analysis-by-synthesis speech coders," IEEE Transactions on Speech and Audio Processing, vol. 2, no. 1, pp. 42-54, Jan 1994.

[5] S. Van Kuyk, W. B. Kleijn, and R. C. Hendriks, "On the information rate of speech communication," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, pp. 5625-5629.

[6] P. B. Denes, "On the statistics of spoken English," J. Acoust. Soc. Am., vol. 35, no. 6, pp. 892-904, 1963.

[7] B. S. Atal and M. R. Schroeder, "Adaptive predictive coding of speech signals," The Bell System Technical Journal, vol. 49, no. 8, pp. 1973-1986, Oct 1970.

[8] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, "Speech parameter generation algorithms for HMM-based speech synthesis," in 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, 2000, pp. 1315-1318.

[9] Y. Grenier, "Time-dependent ARMA modeling of nonstationary signals," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 31, no. 4, pp. 899-911, Aug 1983.

[10] M. Piccardi and Ó. Pérez, "Hidden Markov models with kernel density estimation of emission probabilities and their use in activity recognition," in Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on, 2007, pp. 1-8.

[11] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," ArXiv e-prints, Sep. 2016.

[12] C. Bell, H. Fujisaki, J. Heinz, K. Stevens, and A. House, "Reduction of speech spectra by analysis-by-synthesis techniques," J. Acoust. Soc. Am., vol. 33, no. 12, pp. 1725-1736, 1961.

[13] S. Singhal and B. Atal, "Improving performance of multi-pulse LPC coders at low bit rates," in ICASSP '84, IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, Mar 1984, pp. 9-12.

[14] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, "Speaker-dependent WaveNet vocoder," in Proceedings Interspeech, 2017.

[15] B. S. Atal and S. L. Hanauer, "Speech analysis and synthesis by linear prediction of the speech wave," J. Acoust. Soc. Am., vol. 50, no. 2B, pp. 637-655, 1971.

[16] J. D. Markel and A. H. Gray, Linear prediction of speech. Springer Verlag, 1976, vol. 12.

[17] P. Hedelin, "A tone oriented voice excited vocoder," in ICASSP '81, IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 6, Apr 1981, pp. 205-208.

[18] R. J. McAulay and T. F. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Trans. Acoust., Speech, Signal Process., vol. 34, pp. 744-754, 1986.

[19] D. Rowe. (2011) Codec 2 - open source speech coding at 2400 bits/s and below. [Online]. Available: http://www.tapr.org/pdf/DCC2011-Codec2-VK5DGR.pdf

[20] F. Itakura, "Line spectrum representation of linear predictive coefficients of speech signals," J. Acoust. Soc. Am., vol. 57, p. S35(A), 1975.

[21] D. W. Griffin and J. S. Lim, "Multiband excitation vocoder," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 36, no. 8, pp. 1223-1235, Aug 1988.

[22] B. Iser, G. Schmidt, and W. Minker, Bandwidth Extension of Speech Signals. Springer, 2008.

[23] W. Verhelst and M. Roelands, "An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech," in Acoustics, Speech, Signal Process., ICASSP-93, IEEE Int. Conf. on, vol. 2, 1993, pp. 554-557.

[24] R. Pasco, "Source coding algorithms for fast data compression," Ph.D. dissertation, Stanford University, 1976.

[25] J. J. Rissanen and G. Langdon, "Arithmetic coding," IBM J. Res. Devel., vol. 23, no. 2, pp. 149-162, 1979.

[26] T. Lookabaugh and R. Gray, "High-resolution quantization theory and the vector quantizer advantage," IEEE Trans. Information Theory, vol. 35, no. 5, pp. 1020-1033, 1989.

[27] W. B. Kleijn and A. Ozerov, "Rate distribution between model and signal," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Nov. 2007, pp. 243-246.

[28] E. Charniak et al., "BLLIP 1987-89 WSJ corpus release 1, LDC2000T43," Linguistic Data Consortium, Philadelphia, 2000.

[29] A. McCree, K. Truong, E. George, T. Barnwell, and V. Viswanathan, "A 2.4 kbit/s MELP coder candidate for the new U.S. federal standard," in Acoustics, Speech, Signal Process., IEEE Int. Conf. on, vol. 1, 1996, pp. 200-203.

[30] J. M. Valin. (2016) Speex. [Online]. Available: https://www.speex.org

[31] L. Wan, Q. Wang, A. Papir, and I. Lopez Moreno, "Generalized end-to-end loss for speaker verification," ArXiv e-prints, Oct. 2017.

[32] ITU, "Perceptual objective listening quality assessment," Rec. ITU-T P.863, 2011.

[33] ITU, "Method for the subjective assessment of intermediate sound quality (MUSHRA)," Rec. ITU-R BS.1534-1, 2003.

[34] R. B. Dunn, T. F. Quatieri, D. A. Reynolds, and J. P. Campbell, "Speaker recognition from coded speech and the effects of score normalization," in Conference Record of the Thirty-Fifth Asilomar Conference on Signals, Systems and Computers, vol. 2, Nov 2001, pp. 1562-1567.
