Limitations of Source-Filter Coupling In Phonation

Debasish Ray Mohapatra ∗1 and Sidney Fels 1
1 The University of British Columbia, Vancouver, BC, V6T 1Z4

1 Introduction

According to the traditional source-filter theory, every acoustic speech synthesizer requires a voice source model to produce acoustic energy and a filter that modulates that energy to produce speech-like sound. Voice source models can be grouped into three categories: parametric glottal flow models, kinematic vocal fold models, and self-oscillating biomechanical vocal fold models. The parametric glottal flow model assumes that the voice source and the vocal tract filter are linearly separable. However, current research in speech science illustrates the impact of acoustic loading on the dynamic behaviour of vocal fold vibration as well as on the shape of the glottal flow pulses. Both the kinematic and the self-oscillating vocal fold models account for source-filter interaction.

This study outlines the source-filter theory and reviews various low-dimensional lumped-mass models of the acoustic source and computational models of the vocal tract used for articulation. To understand the limitations of the source-filter interaction associated with each of these models, we considered their mechanical design, acoustic and physiological properties, and aerodynamic simulation.

2 Nonlinear Source-Filter Interaction

The acoustic interaction between the vocal cords and the vocal tract is of growing interest in the study of articulatory speech production. Source-filter interaction refers to the properties of the vocal tract model that affect the self-oscillating characteristics of the acoustic source. These properties play a significant role in the design of an articulatory speech synthesizer.
In the literature, source-filter interaction has been demonstrated through the following effects: skewness of the glottal flow wave, truncation, dispersion and superposition.

In retrospect, the classic linear source-filter theory [1] treats human speech generation as a two-stage independent process. First, the glottal source produces air pulses at multiples of the fundamental frequency F0, which travel through the vocal tract (filter). The vocal tract acts as an acoustic modulator that resonates only at the formant frequencies F1, producing a time-varying glottal flow as output. This process is illustrated in Fig. 1. Linearly separable source-filter models can be represented mathematically as the convolution of the source and filter functions in the time domain, or as their multiplication in the frequency domain.

Though the linearity assumption for source-filter coupling is useful for building an oversimplified model, physiological systems are generally nonlinear. Linear coupling is suitable only where the fundamental frequencies of the source do not cross over the formant frequencies, as in male speech. For female or child speech, and even in singing, the fundamental frequencies have been observed to lie in the proximity of the formant frequencies. In that case, there is a high degree of variation in the vocal tract impedance, which may cause intense interaction between the source and filter models. That could lead to bifurcations in the dynamics of vocal fold vibration, sudden F0 jumps and variation in the source energy [2]. Hence, to produce realistic speech-like sound, we should consider the nonlinear and time-varying characteristics of source-filter coupling.

∗ d.mohapatra@alumni.ubc.ca

Figure 1: Flow diagram showing the source-filter process

3 Speech Synthesizer Models

Many source-filter models have been successfully implemented in the literature.
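
Before turning to the individual models, the linear baseline of Section 2 can be made concrete. The following is a minimal sketch, not the authors' implementation: it assumes an idealised pulse-train source and a single decaying-sinusoid resonance standing in for the vocal tract. The sampling rate, F0, formant frequency and bandwidth are illustrative values, and the function names are hypothetical. It checks numerically that convolving source and filter in the time domain matches multiplying their spectra in the frequency domain.

```python
import numpy as np

def pulse_train(f0_hz, fs, n):
    """Idealised glottal source: one unit pulse per fundamental period."""
    source = np.zeros(n)
    period = int(round(fs / f0_hz))
    source[::period] = 1.0
    return source

def damped_resonance(freq_hz, bandwidth_hz, fs, n):
    """Impulse response of one idealised formant: a decaying sinusoid."""
    t = np.arange(n) / fs
    return np.exp(-np.pi * bandwidth_hz * t) * np.sin(2 * np.pi * freq_hz * t)

# Illustrative values only (assumptions, not taken from the paper).
fs = 16000   # sampling rate in Hz
n = 1024     # number of samples per signal
source = pulse_train(120.0, fs, n)
vocal_tract = damped_resonance(500.0, 80.0, fs, n)

# Time domain: linear convolution of source and filter.
speech_time = np.convolve(source, vocal_tract)

# Frequency domain: multiply the spectra, zero-padded to the full
# convolution length so the circular product equals the linear one.
n_full = 2 * n - 1
spectrum = np.fft.rfft(source, n_full) * np.fft.rfft(vocal_tract, n_full)
speech_freq = np.fft.irfft(spectrum, n_full)

max_difference = np.max(np.abs(speech_time - speech_freq))
print("largest difference between the two routes:", max_difference)
```

Under these assumptions the two routes agree to within floating-point rounding. This equivalence holds only because the source here is unaffected by the filter, which is exactly the assumption that nonlinear source-filter coupling removes.
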
Most of these self-oscillating source models can be organized into two categories: lumped-element and continuum models. Though continuum models provide a better representation of the vocal fold, they are complex and computationally expensive. For simplicity, we analyzed only the lumped-element models of the vocal fold. Despite their simplicity, these models have been shown to reproduce many characteristics of actual vocal fold oscillation. They represent the vocal fold as point masses connected to a rigid wall through springs [3]. Fig. 2 shows the lumped-element models. To analyze the effect of acoustic loading, various computational models of the vocal tract are coupled with the acoustic source models.

3.1 Source Models

The one-mass model of the vocal fold was designed as a single mass-spring oscillator driven by airflow from the lungs. The point mass is assigned a fixed mass that emulates the vocal fold, and the spring lets the point mass oscillate, creating air pulses at a particular glottal frequency [4]. Although the model can simulate acceptable voiced sound for an inductive acoustic load, it fails to sustain self-oscillation of the source for a capacitive vocal tract load, i.e. when the fundamental frequencies are just above the formants [3]. Because the one-mass model has only one degree of freedom, it cannot produce the phase difference between the upper and lower vocal fold edges during oscillation.

Unlike the one-mass model, the two-mass model of the vocal fold [5] has successfully demonstrated the self-oscillating characteristics of the vocal fold, and the oscillation is sustained for both inductive and capacitive vocal tract loads. The two-mass model uses two mass elements per vocal cord and provides the degree of freedom needed to introduce the vertical phase difference between the vocal fold edges during oscillation.
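
The mechanical core of the one-mass model described above is a single mass-spring oscillator. The following is a minimal sketch of that core only, with no aerodynamic driving force and no vocal tract load. The mass, stiffness and damping ratio are illustrative assumptions rather than values from the paper, and the function names are hypothetical. It computes the natural frequency and integrates the free response, which decays because nothing feeds energy back into the oscillator.

```python
import math

def natural_frequency_hz(mass_kg, stiffness_n_per_m):
    """Undamped natural frequency of a mass-spring oscillator."""
    return math.sqrt(stiffness_n_per_m / mass_kg) / (2.0 * math.pi)

def free_oscillation(mass_kg, stiffness_n_per_m, damping_ratio,
                     x0_m=1e-3, duration_s=0.1, dt_s=1e-5):
    """Integrate m*x'' + r*x' + k*x = 0 with semi-implicit Euler steps."""
    damping = 2.0 * damping_ratio * math.sqrt(mass_kg * stiffness_n_per_m)
    x, v = x0_m, 0.0
    positions = []
    for _ in range(int(round(duration_s / dt_s))):
        accel = -(damping * v + stiffness_n_per_m * x) / mass_kg
        v += accel * dt_s  # update the velocity first ...
        x += v * dt_s      # ... then the position, using the new velocity
        positions.append(x)
    return positions

def estimate_frequency_hz(positions, dt_s):
    """Estimate the frequency from the spacing of upward zero crossings."""
    crossings = [i for i in range(1, len(positions))
                 if positions[i - 1] < 0.0 <= positions[i]]
    if len(crossings) < 2:
        return 0.0
    return (len(crossings) - 1) / ((crossings[-1] - crossings[0]) * dt_s)

# Illustrative parameters (assumptions, not taken from the paper).
MASS_KG = 1.25e-4
STIFFNESS_N_PER_M = 80.0
DAMPING_RATIO = 0.1

f_natural = natural_frequency_hz(MASS_KG, STIFFNESS_N_PER_M)
trace = free_oscillation(MASS_KG, STIFFNESS_N_PER_M, DAMPING_RATIO)
f_simulated = estimate_frequency_hz(trace, 1e-5)
print("natural frequency (Hz):", f_natural)
print("frequency of the simulated free response (Hz):", f_simulated)
```

In a self-oscillating source model the glottal airflow has to replace the energy lost to damping. Whether it can do so depends on the acoustic load, which is why the one-mass model behaves differently for inductive and capacitive loads.
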
However, the main disadvantage of both of these models is that their tissue discretization in the coronal plane does not capture the layered structure of the vocal folds, and there is no direct correlation between the spring stiffness and the effects of muscle contraction. From a speech synthesis point of view, the most significant factor in the two-mass model is its performance when it is coupled with a transmission-line model of the vocal tract. To investigate this, Ishizaka et al. measured the onset frequency of the F0 jumps for a large acoustic load on the model and compared the result with the same acoustic load in human voicing. For the two-mass model, the jump in the fundamental frequency occurs at the first resonant frequency of the tube, whereas in human voicing the onset frequency of the jumps is higher than the first formant frequency [5].

The limited representation of the layered structure in two-mass type models motivated the design of the body-cover model of the vocal folds. It is mostly a three-mass model that adds a "body" mass lateral to the two cover masses. Though its structural representation is different, this model still preserves the self-oscillation principle of the two-mass model. By controlling the body stiffness constant, the body-cover model of the vocal fold can be reduced to a two-mass model.

Figure 2: Lumped-element models of the vocal fold. Image by Peter Birkholz [3]

3.2 Filter Models

The vocal tract acoustic load contributes both positive and negative damping terms. The negative damping, which helps vocal fold oscillation, is provided by the vocal tract inertance. Numerous articulatory models in the literature can produce this favourable condition.
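
Whether the load is inertive (inductive) or compliant (capacitive) depends on where F0 lies relative to the tract resonances. The following is a rough sketch under strong assumptions that are not taken from the paper: the vocal tract is idealised as a lossless uniform tube with an ideal open end at the lips, so its input impedance seen from the glottis is purely reactive, j (rho c / A) tan(kL). The tube length, area and air constants are illustrative, and the function names are hypothetical.

```python
import math

# Illustrative constants (assumptions, not values from the paper).
AIR_DENSITY = 1.14    # kg/m^3
SOUND_SPEED = 350.0   # m/s

def first_resonance_hz(length_m):
    """First resonance of the idealised tube: a quarter wavelength."""
    return SOUND_SPEED / (4.0 * length_m)

def input_reactance(freq_hz, length_m, area_m2):
    """Imaginary part of the input impedance, (rho*c/A) * tan(k*L)."""
    wavenumber = 2.0 * math.pi * freq_hz / SOUND_SPEED
    characteristic_impedance = AIR_DENSITY * SOUND_SPEED / area_m2
    return characteristic_impedance * math.tan(wavenumber * length_m)

def load_type(freq_hz, length_m, area_m2):
    """Positive reactance is inertive, negative reactance is compliant."""
    reactance = input_reactance(freq_hz, length_m, area_m2)
    return "inertive" if reactance > 0.0 else "compliant"

TRACT_LENGTH_M = 0.175
TRACT_AREA_M2 = 3.0e-4

print("first resonance (Hz):", first_resonance_hz(TRACT_LENGTH_M))
for f0_hz in (200.0, 450.0, 550.0, 600.0):
    print(f0_hz, "Hz ->", load_type(f0_hz, TRACT_LENGTH_M, TRACT_AREA_M2))
```

With these values the first resonance sits near 500 Hz. The load is inertive for the two frequencies below it and compliant for the two just above it, which corresponds to the capacitive case in Section 3.1 where the one-mass model fails to sustain oscillation.
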
The Kelly-Lochbaum model of the vocal tract is traditionally constructed by approximating the cross-sectional area of the vocal tract with a cascade of cylindrical tube sections [6]. The resulting tube system can then be interpreted as a digital waveguide or digital wave model of the vocal tract. The severe pitfalls of this model are: 1) the length of each cylindrical tube section has to be equal; 2) the junction between two tube sections is not smooth. These limitations affect the formant frequencies of the articulatory model, which in turn prevents an exact match to a given speech spectrum.

4 Conclusions

The limitations of source-filter interaction in various models have been discussed. One of the significant challenges in designing a speech synthesizer model is to address the degree of interaction between the vocal fold and the vocal tract, which varies with the type of articulatory gesture, such as singing, breathy voice, male or female voice, and high- or low-pitched voice. Accurate measurement of the shape changes of the vocal tract during articulation, and precise simulation of the vocal fold and vocal tract interaction using a feedback channel, are therefore much needed. For a specific change in the vocal tract shape, there could be a notable impact on the vocal fold oscillation and on the glottal flow, which is the input to the resonator. A fruitful direction could be the use of a 2D Finite-Difference Time-Domain wave solver and excitation mechanism, while maintaining the stability of the solver when the domain boundaries (i.e., the vocal tract walls) are dynamically modified, as in the case of articulation.

5 Acknowledgement

This work was funded by the Natural Sciences and Engineering Research Council (NSERC) of Canada and the Canadian Institutes of Health Research (CIHR).

References

[1] Gunnar Fant. The acoustic theory of speech production. 's-Gravenhage, Mouton, 1960.
[2] Ingo R. Titze. Nonlinear source-filter coupling in phonation: Theory. The Journal of the Acoustical Society of America, 123(4):1902–1915, 2008.
[3] Peter Birkholz, BJ Kröger, and P Birkholz. A survey of self-oscillating lumped-element models of the vocal folds. Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung, pages 47–58, 2011.
[4] J Flanagan and Lois Landgraf. Self-oscillating source for vocal-tract synthesizers. IEEE Transactions on Audio and Electroacoustics, 16(1):57–64, 1968.
[5] Kenzo Ishizaka and James L Flanagan. Synthesis of voiced sounds from a two-mass model of the vocal cords. Bell System Technical Journal, 51(6):1233–1268, 1972.
[6] Vesa Välimäki and Matti Karjalainen. Improving the Kelly-Lochbaum vocal tract model using conical tube sections and fractional delay filtering techniques. In Third International Conference on Spoken Language Processing, 1994.
