Epoch-Synchronous Overlap-Add (ESOLA) for Time- and Pitch-Scale Modification of Speech Signals
Authors: Sunil Rudresh, Aditya Vasisht, Karthika Vijayan, and Chandra Sekhar Seelamantula
Epoch-Synchronous Overlap-Add (ESOLA) for Time- and Pitch-Scale Modification of Speech Signals

Sunil Rudresh, Aditya Vasisht, Karthika Vijayan, and Chandra Sekhar Seelamantula, Senior Member, IEEE

Abstract—Time- and pitch-scale modifications of speech signals find important applications in speech synthesis, playback systems, voice conversion, learning/hearing aids, etc. There is a requirement for computationally efficient and real-time implementable algorithms. In this paper, we propose a high-quality and computationally efficient time- and pitch-scaling methodology based on the glottal closure instants (GCIs) or epochs in speech signals. The proposed algorithm, termed epoch-synchronous overlap-add time/pitch-scaling (ESOLA-TS/PS), segments speech signals into overlapping short-time frames, with the overlap between frames being dependent on the time-scaling factor. The adjacent frames are then aligned with respect to the epochs and are overlap-added to synthesize time-scale modified speech. Pitch scaling is achieved by resampling the time-scaled speech by a desired sampling factor. We also propose the concept of epoch embedding into speech signals, which facilitates the identification and time-stamping of samples corresponding to epochs and using them for time/pitch-scaling to multiple scaling factors whenever desired, thereby contributing to a faster and more efficient implementation. The results of perceptual evaluation tests reported in this paper indicate the superiority of ESOLA over state-of-the-art techniques. The proposed ESOLA significantly outperforms the conventional pitch-synchronous overlap-add (PSOLA) techniques in terms of perceptual quality and intelligibility of the modified speech.
Unlike the waveform similarity overlap-add (WSOLA) or synchronous overlap-add (SOLA) techniques, the ESOLA technique is capable of exact, high-quality time-scaling of speech to any desired modification factor within a range of 0.5 to 2. Compared with synchronous overlap-add with fixed synthesis (SOLAFS), ESOLA is computationally advantageous and at least three times faster.

I. INTRODUCTION

Time- and pitch-scaling of speech are important problems in speech signal processing. They are relevant in a myriad of applications including, but not limited to, speech synthesis, voice conversion, automatic learning aids, hearing aids, voice mail systems, and multimedia applications. Modifying the time duration, pitch, and loudness of speech signals in a controlled manner results in prosody alteration [1]. Duration expansion and compression are widely used in playback systems, tutorial learning aids, voice mail systems, etc., for slowing down speech for better comprehension or for fast scanning of recorded speech data [2]. Altering pitch finds applications in voice conversion systems, animation movie voiceovers, gaming, etc. Time- and pitch-scale modification are crucial in concatenative speech synthesis, where the pitch contours and durations of speech units must be manipulated before concatenating them, and later in their post-processing.

(S. Rudresh and C. S. Seelamantula are with the Department of Electrical Engineering, Indian Institute of Science (IISc), Bangalore-560012, India (Email: sunilr@iisc.ac.in, chandrasekhar@iisc.ac.in; Phone: +91 80 2293 2695, Fax: +91 80 2360 0444). A. Vasisht is now with Intel Technology India Pvt Ltd, Bangalore, India (Email: aditya.vasisht@gmail.com). K. Vijayan is now with the Department of Electrical and Computer Engineering, National University of Singapore (NUS), Singapore (Email: karthikavijayan@gmail.com).)
Hence, there is a need for reliable, computationally efficient, and real-time implementable time- and pitch-scale modification techniques in speech signal processing. The existing techniques in the literature can be broadly classified into two categories, namely, (i) pitch-blind and (ii) pitch-synchronous techniques. Next, we briefly present an overview of these two classes of techniques.

A. Pitch-Blind Overlap-Add Techniques

1) Overlap and Add (OLA): The early techniques relied on simple overlap-add (OLA) algorithms [3], [4], wherein the speech signal is segmented into overlapping frames with an analysis frame-shift of S_a. Subsequently, the time-scale modified speech is synthesized by overlap-adding the successive frames after altering the synthesis frame-shift to S_s = αS_a, where α is the time-scale modification factor. The analysis frame-shift S_a signifies the number of samples by which the analysis window advances per frame of the speech being processed. The synthesis frame-shift S_s represents the number of samples of time-scaled speech synthesized with each overlap-add. The major disadvantage of OLA is that it does not guarantee pitch consistency and hence introduces significant artifacts upon time-scale modification.

2) Synchronous Overlap and Add (SOLA): Synchronous OLA (SOLA) was proposed to introduce a criterion for choosing which portions of the speech segments are overlap-added. In SOLA, the successive frames are aligned with each other prior to overlap-add [5]. The alignment of frames is accomplished using autocorrelation analysis. The speech signal is segmented into overlapping frames with an analysis frame-shift S_a and a synthesis frame-shift S_s = αS_a, similar to the OLA algorithm. The synthesis frame-shift for each frame is computed such that the successive frames overlap at the locations of maximum waveform similarity between the overlapping frames.
That is, the synthesis frame-shift for the i-th frame is altered as S_s^(i) ← S_s^(i) + k_i − k_{i−1}, where k_i is the offset ensuring synchronous frame alignment for the i-th synthesis frame, computed as k_i = arg max_k R_i(k), with R_i(k) being the correlation between the analysis and synthesis frames under consideration. The drawback of the SOLA algorithm is its variable synthesis frame length, i.e., the amount of overlap between successive frames varies for each synthesis frame depending on the correlation between the overlapping frames. This variable length of synthesis frames may not allow exact time-scaling. Also, the SOLA algorithm necessitates computation of the correlation function at each synthesis frame, which is computationally expensive.

3) Variants of SOLA: Many variants of the SOLA algorithm have been proposed to reduce the computational complexity and execution time, mainly by replacing the correlation function with unbiased correlation [6], simplified normalized correlation [7], the average magnitude difference function (AMDF) [8], the mean-squared difference function [9], modified envelope matching [10], etc. Instead of computing a correlation function, a simple peak-alignment technique was used to locate the optimum overlap between successive frames of speech having maximum waveform similarity [11]. As peak amplitudes are easily affected by noise, the perceptual quality of the resulting time-scaled speech is highly susceptible to noise. Another variant of SOLA, called synchronized and adaptive overlap-add (SAOLA), allows a variable analysis frame-shift S_a, unlike SOLA. SAOLA adaptively chooses S_a as a function of the time-scale modification factor, thus reducing the computational load for lower time-scale modification factors [12]. These algorithms are generally faster than SOLA, but suffer from reduced quality of time-scaled speech [13].
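To make the SOLA family concrete, here is a minimal Python sketch (not the authors' implementation; the frame length, frame-shifts, and correlation search range are illustrative choices) that time-scales a signal by cross-correlation-aligned overlap-add:

```python
import numpy as np

def sola_time_scale(x, alpha, frame_len=512, sa=256, kmax=64):
    """SOLA-style time-scaling sketch (alpha in roughly [0.5, 2]):
    overlap-add successive analysis frames, aligning each one at the
    offset of maximum cross-correlation with the output so far."""
    ss = int(alpha * sa)                     # synthesis frame-shift S_s = alpha * S_a
    y = x[:frame_len].astype(float)          # first frame copied as-is
    m = 1
    while (m + 1) * sa + frame_len + kmax <= len(x):
        pos = m * ss
        tail = y[pos:]                       # overlap region of the output so far
        # search the offset k giving maximum correlation with the output tail
        best_k, best_r = 0, -np.inf
        for k in range(kmax):
            seg = x[m * sa + k : m * sa + k + len(tail)].astype(float)
            r = float(np.dot(tail, seg))
            if r > best_r:
                best_r, best_k = r, k
        frame = x[m * sa + best_k : m * sa + best_k + frame_len].astype(float)
        fade = np.linspace(0.0, 1.0, len(tail))   # linear cross-fade
        y[pos:] = (1.0 - fade) * tail + fade * frame[: len(tail)]
        y = np.concatenate([y, frame[len(tail):]])
        m += 1
    return y
```

With S_s = αS_a and a fixed frame length, the output duration is roughly α times the input; the inner loop over k is exactly the per-frame correlation cost that the epoch alignment proposed later in this paper avoids.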
4) SOLA with Fixed Synthesis (SOLAFS): A significant variant of SOLA, termed SOLA with fixed synthesis (SOLAFS), uses a fixed synthesis frame length instead of a variable one, resulting in improved quality of time-scale modification. SOLAFS segments the speech signal at an average rate of S_a [14]. It allows the beginning of each analysis frame to vary within a narrow interval, such that the adjacent frames of the output speech are aligned with each other in terms of waveform similarity. To be specific, the offset k_i corresponding to maximum waveform similarity determines the beginning point of each frame. This flexibility in altering the beginning points of analysis frames facilitates a fixed synthesis frame-shift S_s, which aids in attaining the exact time-scaling factor. Even though SOLAFS reduces the computational load of SOLA by keeping a fixed synthesis frame rate, it still relies on the correlation between two consecutive frames as a measure of waveform similarity.

B. Pitch-Synchronous Techniques

Another widely used class of techniques for time- and pitch-scaling is pitch-synchronous overlap-add (PSOLA) [15], which employs pitch-synchronous windowing to segment speech signals. The windowed segments, each containing at least one pitch period, are replicated or discarded appropriately to accomplish the required time-scaling. For pitch-scale modification, on the other hand, the pitch periods in the windowed segments are resampled by the required factor. The time/pitch-scaled speech signals are synthesized by overlap-adding the modified segments. For PSOLA to provide high-quality time/pitch-scaled speech, accurate pitch marks, on which the pitch-synchronous windows are centered, are essential. Inaccurate pitch marks result in spectral, pitch, and phase mismatches between adjacent frames [16].
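The replicate/discard idea can be illustrated with a toy sketch. This is not full PSOLA (which overlap-adds tapered, two-period windows centered on pitch marks); it only shows how whole pitch periods, delimited by hypothetical pitch marks, are repeated or dropped to change duration:

```python
import numpy as np

def psola_like_time_scale(x, marks, alpha):
    """Toy illustration of pitch-synchronous duration modification:
    each pitch period (between consecutive pitch marks) is replicated
    or discarded so that, on average, alpha output periods are emitted
    per input period."""
    periods = [x[marks[i]:marks[i + 1]] for i in range(len(marks) - 1)]
    out = []
    acc = 0.0
    for p in periods:
        acc += alpha          # desired output periods accumulated so far
        while acc >= 1.0:     # emit (or replicate) this period
            out.append(p)
            acc -= 1.0
        # if acc stays below 1, the period is discarded (compression)
    return np.concatenate(out) if out else np.zeros(0)
```

Because whole periods are repeated or dropped, the attained scale factor is quantized to period boundaries, which is one reason pitch-synchronous methods cannot guarantee exact time-scaling.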
While time-domain PSOLA (TD-PSOLA) methods operate on the speech waveform itself, frequency-domain PSOLA (FD-PSOLA) methods operate in the spectral domain and are employed only for pitch scaling [17].

Linear Prediction PSOLA (LP-PSOLA): Applying the PSOLA technique to the linear prediction (LP) residual [18] results in LP-PSOLA [15], [17]. Accurate pitch markers are required for LP-PSOLA to minimize pitch and phase discontinuities. A recent technique by Rao and Yegnanarayana [2] derives epochs from the LP residual of speech and modifies the epoch sequence according to a desired time-scale factor. A modified LP residual is then derived from the modified epoch sequence and passed through the LP filter to synthesize the time-scaled speech.

Waveform Similarity SOLA (WSOLA): Another technique that relies on pitch marks for overlap-add is waveform-similarity-based SOLA (WSOLA). In WSOLA, the instants of maximum waveform similarity are located using the signal autocorrelation and are used as pitch marks [19]. This technique is not capable of producing speech signals with an exact time-scale factor, owing to ambiguities in the replication/deletion of pitch periods chosen based on the autocorrelation function. Apart from the autocorrelation, the absolute differences between adjacent frames of speech at different frame-shifts have also been computed to identify points of maximum waveform similarity [20]–[22].

Other, considerably different, classes of algorithms for time- and pitch-scale modification represent speech in parametric form using a sinusoidal model [1], a harmonic-plus-noise model [23], phase-vocoder-based techniques [24], the speech representation and transformation using adaptive interpolation of weighted spectrum (STRAIGHT) model [25], etc.

II. THIS PAPER

In this paper, we propose an algorithm to perform time- and pitch-scaling of speech signals exactly to a given factor using the epoch-synchronous overlap-add (ESOLA) technique (Section IV).
A given speech signal is divided into short-time segments such that each segment contains at least three or four pitch periods. The analysis frame-shift is adaptively chosen depending upon the time- or pitch-scale modification factor. Pitch-scaling is performed by first time-scaling the speech signal and then appropriately resampling the resulting time-scaled speech. We also propose the concept of epoch embedding into speech signals, which is done by determining the epochs and coding the epoch/non-epoch information into the least-significant bit (LSB) of each sample in its 8/16-bit representation (Section V). Since epoch extraction has to be done only once and the resulting epoch information is embedded in the speech signal itself, epoch alignment and overlap-add are the only operations required for subsequent time/pitch-scaling. This minimizes the computational load and reduces the execution time compared with SOLA and its variants, which require computation of the correlation between two frames for each time-scale factor. The proposed technique delivers high-quality time-scale modified speech, while being unaffected by pitch, phase, and spectral mismatches.

TABLE I
AN OBJECTIVE COMPARISON OF THE PROPOSED METHOD WITH STATE-OF-THE-ART METHODS (N IS THE NUMBER OF SAMPLES IN A FRAME)

Technique      | Criteria for synchronization                                   | Is exact time-scaling attained? | Computational complexity | Output speech quality
OLA [3], [4]   | None                                                           | Yes                             | O(1)                     | Poor
SOLA [5]       | Cross-correlation                                              | No                              | O(N^2)                   | Moderate
TD-PSOLA [15]  | Alignment of individual pitch periods                          | No                              | O(N log N)               | Moderate
LP-PSOLA [17]  | Alignment of pitch marks from LP residual in frequency domain  | No                              | O(N^2)                   | Good
SOLAFS [14]    | Cross-correlation                                              | Yes                             | O(N^2)                   | High
ESOLA          | Epoch alignment in time domain                                 | Yes                             | O(N log N)               | Very high
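As a preview of the epoch-embedding idea introduced above (and detailed in Section V), here is a minimal sketch assuming 16-bit PCM samples: the epoch/non-epoch flag is written into the LSB of each sample, costing at most one quantization level of distortion:

```python
import numpy as np

def embed_epochs(samples, epoch_indices):
    """Write the epoch/non-epoch flag into the least-significant bit
    of each 16-bit sample: LSB = 1 at epoch locations, 0 elsewhere."""
    out = np.asarray(samples, dtype=np.int16).copy()
    flags = np.zeros(len(out), dtype=np.int16)
    flags[list(epoch_indices)] = 1
    out = (out & ~np.int16(1)) | flags   # clear every LSB, then set it at epochs
    return out

def extract_epochs(samples):
    """Recover the embedded epoch locations from the LSBs."""
    return np.flatnonzero(np.asarray(samples, dtype=np.int16) & 1)
```

Once embedded, any subsequent time- or pitch-scaling run recovers the epochs with a bitwise AND instead of re-running epoch extraction or per-frame correlations.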
In Section VI, we present a comparative study of the proposed algorithm with existing state-of-the-art algorithms, indicating the key differences in terms of perceptual quality of the resulting speech, computational cost, and execution time. Table I gives an objective comparison of different time/pitch-scaling techniques with the proposed ESOLA technique. Since all the techniques in Table I employ frame-based analysis and synthesis, N in the computational-complexity column denotes the number of speech samples in a frame. Section VII presents a detailed perceptual evaluation of the performances of different time/pitch-scaling algorithms vis-à-vis the proposed ESOLA technique. In Section , we discuss a variation of the ESOLA technique for continuously changing the time/pitch-scale factor of a speech signal. The proposed method has been implemented on various platforms such as MATLAB, Python, Praat, and Android. Time- and pitch-scaled speech signals for a few Indian and foreign languages, vocals, synthesized speech, and speech downloaded from YouTube are put up on the internet for the benefit of readers and can be accessed at http://spectrumee.wixsite.com/spectrumtts.

III. EPOCH EXTRACTION AND ITS ROLE IN TIME- AND PITCH-SCALING OF SPEECH SIGNALS

A. Role of Epochs in Time/Pitch-Scaling

Voiced speech is produced by exciting the time-varying vocal-tract system primarily by a sequence of glottal pulses. The excitation to the vocal-tract system is constituted by the airflow from the lungs, which is modulated into quasi-periodic puffs by the vocal folds at the glottis. The vibrations of the vocal folds (closing and opening the windpipe at the glottis) acoustically couple the supra-laryngeal vocal tract and the trachea. Although glottal pulses are the source of excitation, the significant excitation of the vocal-tract system occurs at the instant of glottal closure.
Such impulse-like excitations during the closing phase of a glottal cycle are termed epochs or glottal closure instants (GCIs) [26]. The speech thus produced is a quasi-periodic signal with pitch periods characterized by epochs. Pitch is a prominent speaker-specific property and does not vary largely with the rate of speaking. An analysis of the change in the distribution of the fundamental frequency (F0) with the change in speaking rate suggests that the variation in F0 is speaker specific [27]. That is, some speakers are able to maintain the same F0 at different speaking rates. In other words, they can produce speech at different speaking rates while maintaining intelligibility and naturalness, which is exactly what we seek in time-scale modification of speech. Since F0 inherently depends on epochs, the small variation in F0 is attributed to small variation in the pitch periods. This motivates us to use epochs as anchor points for synchronizing consecutive frames in time-scale modification.

B. Epoch Extraction Algorithms

Determining epochs from speech signals is a non-trivial task, and several algorithms have been proposed to solve the problem. Initial attempts were aimed at locating points of maximum short-time energy in segments of speech [28]–[30]. The estimates of the pitch marks obtained using these techniques were refined using dynamic-programming strategies, minimizing cost functions formulated based on waveform similarity and the sustainment of continuous pitch contours over successive frames of speech [20], [22], [29], [31]. The drawback of most of these algorithms is the use of several ad hoc parameters.
Epochs have also been obtained by identifying points of maximum energy in the Hilbert envelope [32], by using the group-delay function [2], [33], by the speech event detection using the residual excitation and a mean-based signal (SEDREAMS) technique [34], based on spectral zero crossings [35], from positive zero crossings of the zero-frequency-filtered (ZFF) signal [36], based on the dynamic plosion index (DPI) [37], etc. An extensive review of the various epoch extraction algorithms and their empirical computational complexity is given in [38]. Any of these algorithms could be used as long as they give reliable estimates of the epochs and are computationally efficient. As reviewed in [38] and [39], SEDREAMS, ZFF, and DPI give the most accurate estimates of epochs, and a version of SEDREAMS called fast SEDREAMS is computationally more efficient than the rest of the techniques [38].

In this paper, we use the zero-frequency resonator (ZFR) proposed by Murty and Yegnanarayana [36] for epoch extraction, as it gives reliable estimates and requires fewer computational resources. The ZFR filters speech signals in a very narrow frequency band around 0 Hz, as this low-frequency band of speech is not affected by the vocal-tract system. The resulting signal, termed the zero-frequency signal, exhibits discontinuities at epoch locations as positive zero crossings [36]. The procedure for obtaining epochs using the ZFR is summarized below.

• The speech signal s[n] is preprocessed to remove the low-frequency bias: x[n] = s[n] − s[n−1].

• The signal x[n] is passed through an ideal zero-frequency resonator (integrator) twice, to reduce the effect of the vocal tract on the resulting signal:

y_1[n] = −Σ_{k=1}^{2} a_k y_1[n−k] + x[n],
y_2[n] = −Σ_{k=1}^{2} a_k y_2[n−k] + y_1[n].

• The trend in y_2[n] is removed by successively applying a mean-subtraction operation:
y[n] = y_2[n] − (1/(2N+1)) Σ_{m=−N}^{N} y_2[n+m],

where the window length 2N+1 is chosen to lie between one and two times the average pitch period of the speaker under consideration.

• The positive zero crossings of y[n] indicate the epochs.

Next, we propose a new time- and pitch-scale modification technique (ESOLA) based on epoch alignment.

IV. ESOLA: EPOCH-SYNCHRONOUS OVERLAP-ADD TIME- AND PITCH-SCALE MODIFICATION OF SPEECH SIGNALS

Time-scale modification is generally performed by discarding or repeating short-time segments of speech, or by manipulating the amount of overlap between successive segments. Pitch-scale modification involves resampling of the speech signal. In this paper, we adopt pitch-blind windowing for segmentation of speech signals into overlapping frames (typically, with 50% overlap). Subsequently, the overlap between successive frames is increased or decreased for duration compression or expansion, respectively. Depending on the desired time-scale modification factor, the overlap between successive frames, or equivalently the frame-shift, is modified. The newly formed frames (analysis frames) with modified frame-shifts are overlap-added to synthesize duration-modified speech signals. To perform pitch-scaling, the speech signal is first time-scaled by an appropriate factor and then resampled to match the length of the original speech signal. A crucial requirement for time-scaling techniques is pitch consistency, i.e., the pitch of time-scaled speech signals should not vary with duration expansion or compression. We employ epoch alignment as a measure of synchronization between successive speech frames prior to overlap-add synthesis, to ensure pitch consistency.

Fig. 1. Illustration of epoch alignment between synthesis and analysis frames.

For time-scale modification, the speech signal x[n] is segmented into frames x_m[n] of length N. Generally, N is chosen such that each frame contains three or four pitch periods. The average frame-shift between successive frames is S_a; the exact analysis frame-shift is decided based on the desired time-scale modification factor. The analysis frames are selected such that the overlap between successive analysis frames is larger when the duration of the speech signal has to be increased (slower speaking rate) than when the duration has to be decreased (faster speaking rate). The m-th analysis frame of a speech signal x[n] is given by

x_m[n] = x[n + mS_a + k_m],  n ∈ ⟦0, N−1⟧,   (1)

where ⟦0, N−1⟧ denotes the integer set {0, 1, ..., N−1} and k_m is the additional frame-shift ensuring frame alignment for the m-th analysis frame. The synthesis frame-shift is chosen as S_s = αS_a, where α is the desired time-scale modification factor. The m-th synthesis frame is denoted by

y_m[n] = y[n + mS_s],  n ∈ ⟦0, N−1⟧,   (2)

where y is the time-scaled signal. Note that the length of both analysis and synthesis frames is N and is fixed. Next, we discuss the frame-alignment process, which in turn involves epoch alignment between the frames and determines k_m.

A. Epoch Alignment

The process of aligning frames with respect to epochs in order to compute k_m is illustrated in Fig. 1. The m-th analysis frame x_m[n], which begins at mS_a, is to be shifted and aligned with the m-th synthesis frame y_m[n] = y[n + mS_s], n ∈ ⟦0, N−1⟧, which begins at mS_s, and overlap-added to obtain the m-th output frame. Let ℓ_m^1 denote the location index of the first epoch in the m-th synthesis frame y_m.
Let {n_m^1, n_m^2, ..., n_m^P} be the indices of the P epochs in the m-th analysis frame x_m. Now, the analysis frame-shift k_m is computed as

k_m = min_{1 ≤ i ≤ P} (n_m^i − ℓ_m^1),  such that k_m ≥ 0.

The shift k_m ensures frame alignment by forcing the m-th analysis frame to begin at mS_a + k_m, according to (1), as shown in Fig. 2. Hence, the first epoch occurring in y_m after the instant mS_s is aligned with the nearest subsequent epoch in x_m. Thus, the epochs in the synthesis and modified analysis frames are aligned with each other, and any undesirable effects perceived as a result of pitch inconsistencies are mitigated in the time-scale modified speech.

Fig. 2. A schematic illustration of the ESOLA-TS technique.

B. Overlap and Add

In order to nullify possible artifacts due to a variable-length overlap-add region, we keep the synthesis length fixed. The additional k_m samples for the m-th analysis frame, which now begins at mS_a + k_m due to the shift k_m and ends at mS_a + k_m + N, are appended from the successive analysis frame prior to overlap-add synthesis. This ensures that y_m holds exactly the number of samples demanded by the time-scale modification factor, thereby delivering exact time-scale modification. The analysis shift k_m and the overlap-add are performed directly on the time-domain speech signal, as shown in Fig. 3. The modified analysis frame is overlap-added to the current output frame y_m using a pair of cross-fading functions β[n] and (1 − β[n]), as given in (3). The fading function β[n] could be a linear function or a raised-cosine function, employed to reduce audible artifacts due to the overlap of two frames during synthesis.
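In code, the epoch-alignment rule for k_m might be sketched as follows (a hypothetical helper operating on epoch indices; the clipping to k_max follows the parameter-selection discussion later in this section):

```python
def alignment_shift(synth_epochs, analysis_epochs, k_max):
    """Smallest nonnegative shift k_m aligning the first epoch of the
    synthesis frame with an epoch of the analysis frame; 0 when either
    frame has no epochs, clipped to k_max otherwise."""
    if len(synth_epochs) == 0 or len(analysis_epochs) == 0:
        return 0                       # no epochs: overlap-add without alignment
    l1 = synth_epochs[0]               # first epoch in the synthesis frame
    diffs = [n - l1 for n in analysis_epochs if n - l1 >= 0]
    if not diffs:
        return 0
    return min(min(diffs), k_max)      # clip to the maximum allowed shift
```

Unlike a correlation search over all candidate offsets, this is a single pass over the (few) epoch indices per frame.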
The time-scaled output signal is synthesized as

y[n + mS_s] = β[n] y_m[n] + (1 − β[n]) x_m[n],  0 ≤ n ≤ L−1,
y[n + mS_s] = x_m[n],  L ≤ n < N,   (3)

where L = N − S_s denotes the overlapping region between the frames. Fig. 4 shows a segment of speech and its time-scaled versions for two different scale factors. It is observed that the proposed ESOLA technique provides high-quality time-scaled speech signals with pitch consistency, thereby preserving speaker characteristics.

Fig. 3. [Color online] Illustration of the ESOLA-TS technique on a segment of a speech waveform.

C. Selection of Parameters

Range of k_m: In voiced regions, both analysis and synthesis frames contain valid epochs, and k_m is computed as described in Section IV-A. In unvoiced regions, epoch extraction algorithms may give spurious epochs, and k_m is computed in the same way as for voiced regions. Since there are no pitch periods in unvoiced regions, epoch alignment there is not meaningful; in other words, aligning the frames using spurious epochs in unvoiced regions does not create any pitch inconsistency in the time-scaled signal. However, there might be cases where no epochs are present in one or both of the frames. In these cases, the analysis shift k_m is set to zero, i.e., the frames are overlap-added without any alignment. At the other extreme, the maximum value of k_m is set to k_max, which is equal to the synthesis frame-shift S_s. Thus, the range of the analysis shift is 0 ≤ k_m ≤ k_max (= S_s).

Selection of N, S_a, and S_s: The length of both analysis and synthesis frames is fixed at N. Typically, the frame length N is chosen to contain at least three or four pitch periods/epochs.
In this paper, we have used a 20 ms frame length, which gives N = 20 × 10⁻³ F_s, where F_s is the sampling frequency. Since the length of the synthesis frame is set to N, the synthesis shift S_s depends on the amount of overlap L and is given by S_s = N − L. In our experiments, the overlap is chosen to be 50%, which gives L = N/2 and hence S_s = N/2. Also, the analysis frame rate is related to the synthesis frame rate as S_s = αS_a.

D. Pitch-Scale Modification

Resampling a speech signal alters both the pitch and the duration of the signal. As we have an efficient time-scaling technique, it can be employed for pitch-scaling. For a given pitch modification factor β, the speech signal is first time-scaled by the reciprocal of the pitch-scale factor (1/β), and then the time-scaled speech is appropriately resampled (by the factor F_s/β) to match the length of the resampled signal to that of the original speech signal. Thus, the pitch-scaled signal has a different pitch because of the resampling, while its length is unaltered due to the time-scaling.

Fig. 4. [Color online] Time-scale modification using the ESOLA technique. (a) Original speech signal; time-scaled signals for (b) α = 0.7 and (c) α = 1.3. The average pitch period in all three speech segments remains more or less the same.

Fig. 5. [Color online] Pitch-scale modification using the ESOLA technique. (a) Original speech signal; pitch-scaled signals for (b) β = 1.4 and (c) β = 0.75. The average pitch period changes according to the scaling factors, but the duration of all three segments remains the same.
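The two-step recipe (time-scale, then resample to the original length) can be sketched as follows. Note that the factor directions depend on the time-scaling convention: the hypothetical `time_scale(x, alpha)` below is assumed to multiply the duration by its factor, so stretching by β and resampling back to the original length raises the pitch by β. The linear-interpolation resampler is a crude stand-in for a proper resampling filter:

```python
import numpy as np

def resample_to(x, num):
    """Crude linear-interpolation resampler (a stand-in for a proper
    polyphase or FFT-based resampler)."""
    t_old = np.linspace(0.0, 1.0, len(x))
    t_new = np.linspace(0.0, 1.0, num)
    return np.interp(t_new, t_old, x)

def pitch_scale(x, beta, time_scale):
    """Pitch-scale x by beta using any pitch-preserving time-scaler:
    stretch the duration by beta, then resample back to len(x)."""
    stretched = time_scale(x, beta)        # duration x beta, pitch unchanged
    return resample_to(stretched, len(x))  # original duration, pitch x beta
```

Any pitch-preserving time-scaler (such as the ESOLA-TS procedure of this section) can be passed in as `time_scale`; the resampling step then trades the surplus duration for a pitch change.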
It has been observed that the output speech stays consistent even if the order of these two operations is reversed, i.e., the pitch-scale modification is invariant to the order in which time-scaling and resampling are performed. We observe that the resulting pitch-scaled signal is of high quality, devoid of artifacts, as compared with other techniques such as PSOLA. Fig. 5 shows the pitch-scale modification of a segment of a speech signal for two different modification factors. Figs. 6 and 7 show the spectrograms of the speech signal corresponding to the utterance "they never met, you know" and its time- and pitch-scaled versions. It is observed that the spectral contents in the time-scaled spectrograms are preserved and do not contain any significant artifacts. The proposed time- and pitch-scale modification techniques using ESOLA are summarized in the form of flowcharts in Fig. 8.

Fig. 6. [Color online] Spectrograms of the utterance "they never met, you know". (a) Original speech signal; time-scaled signals for (b) α = 0.625 and (c) α = 1.66.

Fig. 7. [Color online] Spectrograms of the utterance "they never met, you know". (a) Original speech signal; pitch-scaled signals for (b) β = 1.2 and (c) β = 0.8.

V. EPOCH EMBEDDING

The significant computation involved in time- and pitch-scaling methods is in the evaluation of measures providing synchronization between successive frames. Generally, the normalized autocorrelation function, spectral autocorrelation function, short-time energy, etc., are utilized as measures for synchronization [5], [8], [9], [14], [28].
These measures have to be computed for each analysis frame repeatedly for different time- and pitch-scale modification factors, as they change with varying frame lengths and shifts. But the epochs in a speech signal are invariant to changes in segmentation length and to different time-scale modification factors. Hence, epochs can be extracted once from the speech signal and used repeatedly for different tasks. The proposed method exploits this property to further reduce the computations involved.

Fig. 8. Block diagram of the ESOLA-TS/PS techniques.

To exploit the fact that epochs can be extracted once and used repeatedly for different scale factors, we propose the method of epoch embedding into the speech signal. Consider an array of all zeros whose length equals that of the signal, and set the values at the sample indices corresponding to the epochs to 1. This array of binary decisions about the presence or absence of epochs is used to change the least significant bit (LSB) in the 8/16-bit representation of the speech samples. If an epoch is not present at the speech sample under consideration, the LSB of that sample is set to 0; if the sample indeed represents an epoch, its LSB is set to 1. Thus, the epochs are computed once and saved for further use in time- and pitch-scale modifications to different factors. This strategy largely reduces the computational cost and execution time.

VI.
VI. COMPARISON OF DIFFERENT TIME- AND PITCH-SCALING TECHNIQUES

In this section, we discuss the key differences between the proposed methodology and state-of-the-art time- and pitch-scale modification techniques. We broadly divide the existing techniques in the literature into two classes: (i) pitch-synchronous windowing techniques and (ii) pitch-blind windowing techniques.

1) Pitch-Synchronous Windowing Techniques: PSOLA and its variants mainly constitute class (i). As discussed in Section I-B, this class of techniques employs pitch-synchronous windowing of speech signals, where each window is centered around a pitch marker and typically covers two pitch periods [15]. Generally, tapered windows such as Hamming or Hann windows are used for short-time segmentation of speech. Such windowed segments are replicated or deleted appropriately for time-scale modification, and are resampled for pitch-scale modification [15], [17].

The key philosophy behind the proposed ESOLA method differs from the PSOLA-based techniques in that we employ pitch-blind windowing to segment the speech signal, where each segment roughly holds three to four pitch periods. The overlap between adjacent segments is manipulated in a controlled fashion for time-scale modification, and the time-scaled speech is resampled appropriately for pitch-scale modification. Since segmentation in ESOLA is simpler and fewer frame manipulations are required than in PSOLA, the method is computationally efficient and produces time- and pitch-scaled speech of superior quality. Also, pitch-synchronous methods cannot produce exact time-scaled speech signals, unlike the proposed ESOLA method, which delivers exact time-scale modification owing to its fixed synthesis strategy.
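The two-step pitch-scaling route used here, time-scaling by the pitch factor followed by resampling back to the original duration, can be sketched as below. The sketch is illustrative: `time_scale` stands in for any time-scaling routine (e.g., an ESOLA-TS implementation), and linear interpolation is a simplification of proper band-limited resampling:

```python
import numpy as np

def pitch_scale(x, beta, time_scale):
    """Pitch-scale x by factor beta while preserving its duration:
    time-stretch by beta, then resample the result back to len(x),
    which moves every harmonic by the factor beta."""
    y = time_scale(x, beta)                    # length ~ beta * len(x)
    grid = np.linspace(0, len(y) - 1, len(x))  # uniform resampling grid
    return np.interp(grid, np.arange(len(y)), y)
```

Because resampling the time-stretched signal compresses it back in time, the net duration is unchanged while the pitch scales by beta; in practice a polyphase band-limited resampler would replace `np.interp`.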
2) Pitch-Blind Windowing Techniques: As detailed in Section I-A, SOLA and its variants adopt pitch-blind windowing of speech signals for short-time segmentation and adjust the overlap between successive frames based on some synchronization measure for time-scale modification [5]. The ESOLA method follows the same philosophy, yet the specific advantages it delivers in terms of computational requirements and execution time make it superior to the other methods.

The SOLA algorithm and a wide range of its variants use autocorrelation measures for frame synchronization [5]–[7], which have to be recomputed for different time- and pitch-scaling factors, adding to the total computational cost and execution time. The variants of SOLA employing frame synchronization based on the AMDF, mean-square differences, envelope matching, peak alignment, etc. [8]–[11] are highly susceptible to noise. The epoch-based synchronization and epoch embedding proposed in the ESOLA method reduce the overall computational requirements, and since epochs correspond to the high-energy content in speech signals, delivering a high signal-to-noise ratio in the regions around them [40], the resulting time- and pitch-scaled speech signals are relatively robust to noise and of superior perceptual quality.

To indicate the computational advantages rendered by the ESOLA algorithm over the existing techniques, we tabulate the execution times required by different time-scale modification algorithms in Table II. The reported execution times for the ESOLA method also include the computation involved in extracting the epoch locations; in this study, we used the ZFF algorithm [36] to estimate them. The codes were run in MATLAB R2015 on a Macintosh computer equipped with a 2.7 GHz Intel Core i5 processor and 8 GB of 1333 MHz DDR3 RAM.
The ESOLA algorithm is the fastest among the algorithms under consideration, bringing out its advantage for real-time systems.

TABLE II
EXECUTION TIME (IN SECONDS) OF DIFFERENT TIME-SCALE MODIFICATION TECHNIQUES FOR A SPEECH SIGNAL OF DURATION 12 SECONDS

Time-scale factor | TD-PSOLA [15] | LP-PSOLA [17] | WSOLA [19] | SOLAFS [14] | ESOLA
0.5  | 74.16 | 74.15 | 1.83 | 1.73 | 0.65
0.75 | 74.25 | 74.19 | 3.22 | 2.58 | 0.66
1.25 | 74.32 | 74.30 | 2.71 | 4.30 | 0.69
1.5  | 74.38 | 74.35 | 2.37 | 5.13 | 0.70
2    | 74.41 | 74.45 | 4.59 | 6.89 | 0.91

TABLE III
ATTRIBUTES CONSIDERED TO RATE A TIME/PITCH-SCALED SPEECH

No. | Attribute
1 | Intended changes, i.e., whether the duration/speed or pitch of the speech files has indeed changed
2 | Pitch consistency in time-scale modification
3 | Duration consistency in pitch-scale modification
4 | Perceptual quality and intelligibility
5 | Distortions or artifacts

TABLE IV
RATINGS USED FOR ASSESSING THE QUALITY OF TIME/PITCH-SCALED SPEECH [2]

Rating | Speech quality | Distortions
1 | Unsatisfactory | Very annoying and objectionable
2 | Poor | Annoying, but not objectionable
3 | Fair | Perceptible and slightly annoying
4 | Good | Just perceptible, but not annoying
5 | Excellent | Imperceptible

VII. PERCEPTUAL EVALUATION

To evaluate the performance of the proposed ESOLA method in comparison with other state-of-the-art techniques, we conducted detailed perceptual evaluation tests. Three English speech utterances of 3–4 seconds duration, two spoken by a male speaker and one by a female speaker, were chosen from the CMU Arctic database [41]. The speech signals were downsampled to 16 kHz and segmented into frames of 20 ms duration with a synthesis rate (S_s) of 10 ms for time- and pitch-scale modification. The time and pitch scaling were performed for five different modification factors, as listed in Tables V and
VI, respectively. All three sentences were time- and pitch-scaled for the chosen modification factors, making three sets of speech files for perceptual evaluation. Twenty-five listeners with a basic understanding of speech signal processing and of notions such as pitch, duration, and playback rate were chosen for the evaluation test. Each listener was asked to listen carefully to one set of speech files, and the three listening sets were randomly distributed among the 25 listeners in order to remove any bias toward a particular speaker or utterance. The listeners rated each speech file on a scale of 1 to 5 based on the attributes given in Table III; each point of the rating represents the speech quality and level of distortion given in Table IV [2]. Each listener took approximately 45 minutes to complete the evaluation. For the perceptual evaluation, we included four prominent time- and pitch-scaling methods reported in the literature, namely, TD-PSOLA [15], LP-PSOLA [2], [17], WSOLA [19], and SOLAFS [14]. The performances are computed as mean opinion scores (MOS) over all three listening sets from the 25 listeners.

TABLE V
MOS FOR DIFFERENT TIME-SCALE MODIFICATION ALGORITHMS

Time-scale factor | TD-PSOLA | LP-PSOLA | WSOLA | SOLAFS | ESOLA
0.5  | 1.58 | 2.11 | 4.15 | 4.15 | 4.21
0.75 | 1.58 | 2.16 | 3.84 | 4.26 | 4.53
1.25 | 1.80 | 2.35 | 3.45 | 4.30 | 4.80
1.5  | 1.55 | 2.20 | 3.25 | 4.10 | 4.65
2    | 1.60 | 2.20 | 3.10 | 3.85 | 4.15

TABLE VI
MOS FOR DIFFERENT PITCH-SCALE MODIFICATION ALGORITHMS

Pitch-scale factor | TD-PSOLA | LP-PSOLA | WSOLA | SOLAFS | ESOLA
0.5  | 1.40 | 1.50 | 2.40 | 3.95 | 4.25
0.75 | 1.60 | 1.45 | 2.60 | 4.25 | 4.35
1.25 | 1.75 | 1.50 | 3.30 | 4.20 | 4.60
1.5  | 2.20 | 1.85 | 3.00 | 4.25 | 4.40
2    | 2.40 | 1.60 | 2.35 | 4.30 | 4.40

The time- and pitch-scaling performances of the different algorithms are given in Tables V and VI, respectively. Figs.
9 and 10 show the performance results as bar graphs, along with the variances in the MOS of the 25 listeners. The ESOLA algorithm consistently delivers better MOS values than the rest, indicating the better quality of the time/pitch-scaled speech produced by the proposed technique. Next, we list the observations based on the results of the perceptual evaluation.
• The ESOLA method significantly outperforms the PSOLA-based techniques in both time and pitch scaling. This can be attributed to the simpler and more efficient frame manipulations in the ESOLA algorithm. The performance of LP-PSOLA is poorer than that of TD-PSOLA in pitch scaling because of the filtering in the LP-residue domain.
• The ESOLA method achieves exact time scaling of speech signals, whereas WSOLA is not capable of producing speech exactly for a specified time-scale factor and loses duration consistency in pitch-scale modification, which degrades its time- and pitch-scaling performance.
• The SOLAFS and ESOLA algorithms provide comparable performance, with ESOLA having an edge over SOLAFS. This can be attributed to the fact that ESOLA performs frame alignment based on epoch information, which is more accurate than the cross-correlation-based frame alignment of SOLAFS.
• One of the strengths of the ESOLA method is its computational efficiency over the other methods, as discussed in Section VI (Table II).

Fig. 9. [Color online] MOS of various time-scaling techniques for five different scaling factors. The variance of the MOS for a TSM technique and a particular scaling factor is also plotted as a vertical line on top of the bar.

Fig. 10. [Color online] MOS of various pitch-scaling techniques for five different scaling factors.
The variance of the MOS for a PSM technique and a particular scaling factor is likewise plotted as a vertical line on top of the bar. The proposed method requires the least execution time among the prominent methods.

VIII. CONCLUSIONS

In this paper, we proposed a computationally efficient, real-time implementable, and superior-quality time- and pitch-scale modification algorithm. The proposed technique (ESOLA) employs short-time segmentation of speech signals using pitch-blind windowing and manipulates the overlap between successive frames for time-scale modification; appropriate resampling of the time-scaled speech yields the pitch-scale modification. The key features of the ESOLA algorithm are the use of epochs for synchronizing successive frames to remove pitch inconsistencies, the deletion or insertion of samples in synthesis frames to ensure fixed synthesis, and the technique of epoch embedding to significantly reduce the computational cost. Subjective experiments conducted to study the performance of different time- and pitch-scaling algorithms revealed the superiority of the proposed technique: ESOLA significantly outperforms the PSOLA-based techniques due to its simpler and more efficient frame manipulations, and SOLAFS due to its accurate epoch-based frame alignment. Also, the ESOLA algorithm is computationally efficient and requires the least execution time among the prominent time- and pitch-scaling techniques.

IX. ACKNOWLEDGEMENTS

The authors would like to thank Jitendra Dhiman (C++), Pavan Kulkarni (Android application), Aishwarya Selvaraj (Praat plugin), and Vinu Sankar (Python) for implementing the ESOLA algorithm with nice GUIs on various platforms.

REFERENCES

[1] T. F. Quatieri and R. J. McAulay, "Shape invariant time-scale and pitch modification of speech," IEEE Transactions on Signal Processing, vol. 40, no.
3, pp. 497–510, March 1992.
[2] K. S. Rao and B. Yegnanarayana, "Prosody modification using instants of significant excitation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 3, pp. 972–980, May 2006.
[3] J. B. Allen and L. R. Rabiner, "A unified approach to short-time Fourier analysis and synthesis," Proceedings of the IEEE, vol. 65, no. 11, pp. 1558–1564, 1977.
[4] J. Allen, "Short-term spectral analysis, synthesis, and modification by discrete Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 25, no. 3, pp. 235–238, 1977.
[5] S. Roucos and A. Wilgus, "High quality time-scale modification for speech," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 10, April 1985, pp. 493–496.
[6] J. Laroche, "Autocorrelation method for high-quality time/pitch-scaling," in Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, October 1993, pp. 131–134.
[7] B. Lawlor and A. D. Fagan, "A novel high quality efficient algorithm for time-scale modification of speech," in Proceedings of Eurospeech, September 1999, pp. 2785–2788.
[8] W. Verhelst and M. Roelands, "An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, April 1993, pp. 554–557.
[9] P. H. W. Wong and O. C. Au, "Fast SOLA-based time-scale modification using envelope matching," KAP Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, pp. 75–90, 2003.
[10] ——, "Fast SOLA-based time-scale modification using modified envelope matching," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, May 2002, pp. 3188–3191.
[11] D. Dorran, R. Lawlor, and E.
Coyle, "High quality time-scale modification of speech using a peak alignment overlap-add algorithm (PAOLA)," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, April 2003, pp. 700–703.
[12] D. Dorran, R. Lawlor, and E. Coyle, "Time-scale modification of speech using a synchronised and adaptive overlap-add (SAOLA) algorithm," in Proceedings of Audio Engineering Society, March 2003.
[13] D. Dorran, R. Lawlor, and E. Coyle, "A comparison of time-domain time-scale modification algorithms," in Proceedings of 120th Convention of the Audio Engineering Society, Tech. Rep., 2006.
[14] D. Hejna and B. R. Musicus, "The SOLAFS time-scale modification algorithm," BBN, Tech. Rep., July 1991.
[15] E. Moulines and F. Charpentier, "Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones," Speech Communication, vol. 9, no. 5, pp. 453–467, 1990.
[16] T. Dutoit and H. Leich, "MBR-PSOLA: Text-to-speech synthesis based on an MBE re-synthesis of the segments database," Speech Communication, vol. 13, no. 3, pp. 435–440, 1993.
[17] E. Moulines and J. Laroche, "Non-parametric techniques for pitch-scale and time-scale modification of speech," Speech Communication, vol. 16, no. 2, pp. 175–205, 1995.
[18] J. Makhoul, "Linear prediction: A tutorial review," Proceedings of the IEEE, vol. 63, no. 4, pp. 561–580, 1975.
[19] W. Verhelst and M. Roelands, "An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, April 1993, pp. 554–557.
[20] Y. Laprie and V. Colotte, "Automatic pitch marking for speech transformations via TD-PSOLA," in Proceedings of 9th European Signal Processing Conference, September 1998, pp. 1–4.
[21] R.
Veldhuis, "Consistent pitch marking," in Proceedings of International Conference on Spoken Language Processing, October 2000, pp. 207–210.
[22] W. Mattheyses, W. Verhelst, and P. Verhoeve, "Robust pitch marking for prosodic modification of speech using TD-PSOLA," in Proceedings of IEEE Benelux/DSP Valley Signal Processing Symposium (SPS-DARTS), 2006, pp. 43–46.
[23] Y. Stylianou, "Applying the harmonic plus noise model in concatenative speech synthesis," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 1, pp. 21–29, January 2001.
[24] J. Laroche and M. Dolson, "Improved phase vocoder time-scale modification of audio," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 3, pp. 323–332, May 1999.
[25] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3–4, pp. 187–207, 1999.
[26] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, 4th ed. Pearson Education Inc., 2009.
[27] B. Yegnanarayana and S. V. Gangashetty, "Epoch-based analysis of speech signals," Sadhana, vol. 36, no. 5, pp. 651–697, 2011.
[28] V. Colotte and Y. Laprie, "Higher precision pitch marking for TD-PSOLA," in Proceedings of 11th European Signal Processing Conference, September 2002, pp. 1–4.
[29] C.-Y. Lin and J.-S. R. Jang, "A two-phase pitch marking method for TD-PSOLA synthesis," in Proceedings of INTERSPEECH-ICSLP, October 2004, pp. 1189–1192.
[30] T. Ewender and B. Pfister, "Accurate pitch marking for prosodic modification of speech segments," in Proceedings of INTERSPEECH, September 2010, pp. 178–181.
[31] A. Chalamandaris, P. Tsiakoulis, S. Karabetsos, and S.
Raptis, "An efficient and robust pitch marking algorithm on the speech waveform for TD-PSOLA," in Proceedings of IEEE International Conference on Signal and Image Processing Applications, November 2009, pp. 397–401.
[32] F. M. G. de los Galanes, M. H. Savoji, and J. M. Pardo, "New algorithm for spectral smoothing and envelope modification for LP-PSOLA synthesis," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, April 1994, pp. 573–576.
[33] K. S. Rao, S. R. M. Prasanna, and B. Yegnanarayana, "Determination of instants of significant excitation in speech using Hilbert envelope and group delay function," IEEE Signal Processing Letters, vol. 14, no. 10, pp. 762–765, October 2007.
[34] T. Drugman and T. Dutoit, "Glottal closure and opening instant detection from speech signals," in Proceedings of INTERSPEECH, 2009, pp. 2891–2894.
[35] R. R. Shenoy and C. S. Seelamantula, "Spectral zero-crossings: Localization properties and applications," IEEE Transactions on Signal Processing, vol. 63, no. 12, pp. 3177–3190, June 2015.
[36] K. S. R. Murty and B. Yegnanarayana, "Epoch extraction from speech signals," IEEE Transactions on Audio, Speech and Language Processing, vol. 16, no. 8, pp. 1602–1613, November 2008.
[37] A. P. Prathosh, T. V. Ananthapadmanabha, and A. G. Ramakrishnan, "Epoch extraction based on integrated linear prediction residual using plosion index," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 12, pp. 2471–2480, December 2013.
[38] T. Drugman, M. Thomas, J. Gudnason, P. Naylor, and T. Dutoit, "Detection of glottal closure instants from speech signals: A quantitative review," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 3, pp. 994–1006, March 2012.
[39] N. Adiga, D. Govind, and S. R. M.
Prasanna, "Significance of epoch identification accuracy for prosody modification," in Proceedings of International Conference on Signal Processing and Communications (SPCOM), July 2014, pp. 1–6.
[40] B. Yegnanarayana and P. S. Murty, "Enhancement of reverberant speech using LP residual signal," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 3, pp. 267–281, May 2000.
[41] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in Proceedings of 5th ISCA Workshop on Speech Synthesis, 2004.