An Algorithm Based on Empirical Methods, for Text-to-Tuneful-Speech Synthesis of Sanskrit Verse

The rendering of Sanskrit poetry from text to speech is a problem that has not been solved before. One reason may be the complications in the language itself. We present unique algorithms based on extensive empirical analysis, to synthesize speech from a given text input of Sanskrit verses. Using a pre-recorded audio units database which is itself tremendously reduced in size compared to the colossal size that would otherwise be required, the algorithms work on producing the best possible, tunefully rendered chanting of the given verse. This would enable the visually impaired and those with reading disabilities to easily access the contents of Sanskrit verses otherwise available only in writing.

Authors: Rama N., Meenakshi Lakshmanan

IJCSNS International Journal of Computer Science and Network Security, VOL.10 No.1, January 2010
Manuscript received January 5, 2010; manuscript revised January 20, 2010.

Rama N.† and Meenakshi Lakshmanan††
† Department of Computer Science, Presidency College, Chennai, India
†† Department of Computer Science, Meenakshi College for Women, Chennai, India and Research Scholar, Mother Teresa Women’s University, Kodaikanal, India

Key words: Sanskrit, verse, text-to-speech, musical tones, speech synthesis, sandhi, metre.

1. Introduction

Speech synthesis systems have proved to be extremely useful in improving the lives of the visually impaired and those with reading disabilities across the globe. However, such systems that cater to Western languages are not applicable in the Indian context, because of the huge difference in the structure and pronunciation schemes of Indian languages. Work has been done to bring Indian vernaculars to the people through speech synthesis [3, 16, 18], but there is a dearth of such work in the context of Sanskrit.
Even a cursory glance at randomly chosen works in Sanskrit literature reveals that poetry hugely dominates it. The volume of the extant literature is vast and its contents profound, with topics ranging from grammar to spirituality, from medicine to geography. Listening to verses being chanted and committing them to memory has been a traditional practice, as has chanting them tunefully. Obviously, rather good familiarity with the Sanskrit script is required to read the verses, and that too continuously, with reasonable speed and with a tune. Thus, the visually impaired and those with reading disabilities would find themselves at a serious disadvantage, as would those who do not know how to read Sanskrit but would like to know or memorize verses. Further, in today’s fast-moving world in which time is at a premium, a piece of software that reads out any desired Sanskrit poetical text would be welcome. We propose a comprehensive method based on empirical analysis to convert Sanskrit poetical text to speech. This method is new, effective and produces output that is tuneful.

2. The Problem

The most important qualities of a speech synthesis system are naturalness and intelligibility. Naturalness describes how closely the output sounds like human speech, while intelligibility is the ease with which the output is understood [7]. The most popular and simplest method of speech synthesis is the concatenative method. Formant synthesis, the other major speech synthesis method, would inevitably compromise on the naturalness of the voice output [5, 9, 12, 14]. We deal with only the concatenative method in this work. Sanskrit is a highly phonetic language, which adheres completely to the “what you see is what you hear” rule. Further, it is highly structured, with stringent rules at the phonemic and morphological levels, but is extremely versatile at the higher syntactic, semantic and pragmatic levels.
As such, the limit for framing new compound words in Sanskrit is only the poet’s imagination and linguistic skill. This fact, coupled with the complications posed by sandhi-s and case-inflectional forms or vibhakti-s, ensures that the standard method of text-to-speech synthesis, viz. creating a voice database of words in the dictionary and concatenating these stored audio files while parsing the verse, is well nigh impossible to apply in the case of Sanskrit, in spite of the highly phonetic nature of the language. Secondly, creating just the pronounced individual letters as audio files and then concatenating them while parsing the verse would render a rather poorly pronounced verse, and in fact an incorrectly pronounced one. Consider the sample verse:

vande gurūṇāṁ caraṇāravinde
sandarśitassvātmasukhāvabodhe |
janasya ye jāṅgalikāyamāne
saṁsārahālāhalamohaśāntyai ||

The large compound word sandarśitassvātmasukhāvabodhe is actually sandarśitaḥ + svātmasukhāvabodhe, to which a sandhi rule has been applied. There is no way one could have stored a priori this entire compound word created on the fly by the poet. Similarly, the word janasya comes from the word janaḥ, which, unlike janasya, would be found in a dictionary. The reason is that the sixth of the eight case-endings has been applied to the root word janaḥ, meaning “people”, resulting in janasya, meaning “of people”. There are 24 such case-inflectional forms in total for every noun, and nine or eighteen for verbs in each of six tenses and four moods. Considering that the count of nouns and verbs runs into the thousands, the storing and retrieval of audio snippets of all such case-inflectional forms would be prohibitive in terms of space and time for creation and retrieval.
Pronouncing the verse letter by letter would give “v + a + n + d + e”, etc., which is incorrect pronunciation. Even if the unit of pronunciation were taken to be a consonant with its succeeding vowel alone, that would be insufficient, for “va” would be pronounced correctly, but there would again be a problem with streaming the “n” separately, which would result in weird pronunciation. It is thus clear that none of the methods of speech synthesis outlined above would be effective in the case of Sanskrit text-to-speech processing. We present algorithms that make the output intelligible, constituting a correct reading of the verse with pauses as per the caesura data, and tuneful as well.

3. The Precursor to this Work

Euphonic conjunctions or sandhi-s in Sanskrit are points between adjacent words or sub-words at which letters coalesce and transform. The application of sandhi is compulsory in Sanskrit verse, though the rules are not as stringent in prose. A novel computational approach to sandhi processing based on building sandhi-s rather than splitting them was developed by the authors [10]. This was done in accordance with the grammatical rules laid down by the ancient Sanskrit grammarian-genius Pāṇini in his magnum opus, the Aṣṭādhyāyī, and forms a comprehensive sandhi-building engine. An example of sandhi is: namaḥ + śivāya = namaśśivāya. The visarga letter (ḥ) gets transformed into ś because of the presence of the letter ś after it. This is an example of the consecutive application of a visarga sandhi rule and the ścutva sandhi rule [10]. Though the original words namaḥ and śivāya are independent words and are per se correct as they are, the rules of verse demand that the sandhi at their junction be applied and the transformation done as shown.
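As an illustration of the kind of transformation the sandhi-building engine of [10] performs (this is our own minimal sketch covering only the visarga-before-ś case, not the authors’ engine), the rule can be expressed as a simple string rewrite:

```python
def join_with_visarga_sandhi(first, second):
    """Join two words, applying the visarga sandhi of Section 3:
    a final visarga (ḥ) before an initial ś becomes ś, giving the
    doubled śś that is pronounced with stress."""
    if first.endswith("ḥ") and second.startswith("ś"):
        return first[:-1] + "ś" + second
    # Other sandhi rules are not modeled in this sketch.
    return first + second
```

For example, `join_with_visarga_sandhi("namaḥ", "śivāya")` yields `namaśśivāya`, matching the example in the text.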
This becomes relevant in the context of speech synthesis because, after the application of the sandhi rule, the two words become one compound word which, as a result of the doubling of the letter ś, is read with a stress on that letter. Secondly, verses in Sanskrit are classified into metres according to the number and type of syllables in the four quarters of the verse. Algorithms to efficiently parse and classify verses into more than 700 metres, and to gather information about the caesura, or points in the verse where a pause must be introduced while reading, were developed by the authors [11]. Verses of different metres are read in different tempos, with pauses at different caesurae and with different tunes. Hence the information provided by the metrical classification algorithm already developed by the authors, which handles input verses both in Sanskrit Unicode and as E-text in the Latin character set, is an important input for this work.

4. Text Pre-processing

4.1 Unicode Representation of Sanskrit Text

The Unicode (UTF-8) standard has been adopted universally for the purpose of encoding Indian language texts in digital format. The Unicode Consortium has assigned the hexadecimal range 0900-097F for Sanskrit (Devanagari) characters. The characters, including the diacritical characters, used to represent Sanskrit letters in E-texts are dispersed across the Basic Latin (0000-007F), Latin-1 Supplement (0080-00FF), Latin Extended-A (0100-017F) and Latin Extended Additional (1E00-1EFF) Unicode ranges. The Latin character set has been employed in this paper to represent Sanskrit letters as E-text. Text given as E-text using the Unicode Latin character set is taken as input for processing. Unicode Sanskrit font is also accepted as input, but is converted to the Latin character form before processing begins, as already presented by the authors in [11].
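The block checks of Section 4.1 can be expressed directly on code points. The helper names below are ours, and the conversion from Devanagari to Latin E-text itself (described in [11]) is not shown:

```python
def is_devanagari(ch):
    """True if ch lies in the Devanagari block, U+0900-U+097F."""
    return 0x0900 <= ord(ch) <= 0x097F

def script_of(text):
    """Crude detector for the two accepted input forms: Devanagari
    Unicode text versus Latin-script E-text with diacritics."""
    return "devanagari" if any(is_devanagari(c) for c in text) else "latin"
```

A Devanagari input such as वन्दे would be routed through the conversion step first, while Latin E-text such as "vande gurūṇāṁ" (its diacritics drawn from the Latin Extended ranges) is processed directly.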
4.2 Sandhi Correction

The following sandhi rules are specifically relevant, because the transformations wrought by them have a bearing on the pronunciation of the words.

1. For the letter combination “hn”, such as in the word “vahni”, the normal pronunciation is actually “nh”, i.e. as “vanhi”. Thus, when the combination “hn” is encountered in a word, it is replaced by “nh”.
2. Sandhi rules with respect to the anusvāra are applied. For example, the word “saṁnyāsa”, split as “saṁ + nyāsa”, is normally not pronounced this way. Instead, it is pronounced as “sannyāsa”, the transformation being governed by a sandhi category called the parasavarṇa sandhi.
3. Sandhi rules involving the transformation of the visarga are applied. The example “namaḥ + śivāya” discussed in Section 3 is a typical one.
4. Sandhi rules for the jihvāmūlīya (the aspirate sound produced near the base of the tongue while pronouncing a visarga that is followed by ‘k’ or ‘kh’) and the upadhmānīya (the sound of ‘f’ while pronouncing a visarga that is followed by ‘p’ or ‘ph’) have to be applied for correct pronunciation. At the text pre-processing stage, such visarga-s are replaced appropriately: by ‘z’ for the jihvāmūlīya and by ‘f’ for the upadhmānīya.

The algorithm in [10] serves to effect these corrections on the given verse.

4.3 Identifying the Syllabic Units of the Verse

At the pre-processing stage, the text being processed has to be divided into single-syllabic units in such a way that each unit is pronounceable. This is riddled with problems and cannot be handled as for European languages [1, 2, 4, 6, 8], as already discussed in Section 2 above. We refer to such pronounceable units simply as ‘units’. A unit can have a maximum of three components:

1. Vowel component
2. Pre-vowel component
3. Post-vowel component

The vowel component is central and indispensable to the unit. A unit has the vowel component and optionally one or both of the other two components. Further, the pre-vowel and post-vowel components may consist of one or more characters. For our purposes, we use the categorization of the Sanskrit alphabet given in Table 1.

Table 1: The Sanskrit alphabet categorized
1. Vowels: a, ā, i, ī, u, ū, ṛ, ṝ, ḷ, e, ai, o, au
2. Short vowels: a, i, u, ṛ, ḷ
3. Long vowels: ā, ī, ū, ṝ, e, ai, o, au
4. Consonants:
   k,  kh, g,  gh, ṅ
   c,  ch, j,  jh, ñ
   ṭ,  ṭh, ḍ,  ḍh, ṇ
   t,  th, d,  dh, n
   p,  ph, b,  bh, m
5. Semi-vowels: y, r, l, v
6. Sibilants: ś, ṣ, s
7. Aspirate: h
8. Anusvāra: ṁ
9. Visarga: ḥ

The following empirically determined cases constitute all the possibilities that arise while parsing a verse:

1. We parse the verse starting from the end of the last unit identified, until we encounter the first vowel. As we do so, we include all the letters on the way in the unit.
2. If a consonant is encountered and it happens to be from the first or third columns of the Consonants category shown in Table 1, then we have to examine the following letter to see if it is an ‘h’. If it is, then the two letters together constitute a consonant belonging to the second or fourth columns of the Consonants category. This is important in order to correctly determine the next letter while parsing.
3. If the vowel encountered is ‘a’ and is followed by ‘i’ or ‘u’, then the vowel is really ‘ai’ or ‘au’ respectively.
4. If a visarga or anusvāra follows the vowel of the unit, then it is included in the unit and the unit is closed.
5. If the vowel is followed by a consonant that is in turn followed by a vowel, then the unit is closed with the vowel itself. E.g.: in the word “gurūṇāṁ”, first ‘g’ is taken, and then the vowel ‘u’ is encountered.
Now, since the vowel of the unit is followed by a consonant (‘r’) and then a vowel (‘ū’), the unit is closed as “gu”. The next unit will begin with the ‘r’.
6. If the vowel is followed by a consonant and then by a non-vowel, then the consonant is also included in the unit and the unit closes with it. E.g.: in the word “vande”, after parsing up to “va”, it is found that the vowel ‘a’ of the unit is followed by a consonant (‘n’) that is in turn followed by a non-vowel (‘d’). Hence, the unit includes ‘n’ also and is closed as “van”.
7. If the vowel is followed by the letter ‘r’, followed by a non-vowel and again a non-vowel (and any number of such non-vowels), then the unit is taken to include the letter ‘r’ and the non-vowel following it. E.g.: in the word “kārtsnyaṁ”, after “kā” is parsed, we encounter ‘r’ followed by a non-vowel (‘t’) and again a non-vowel (‘s’). Thus, the unit is taken as “kārt”. Indeed, the word is pronounced as kārt-snyaṁ.
8. If the vowel is followed by ‘r’, then a non-vowel and then a vowel, then the unit is closed with the ‘r’. E.g.: in the word “kāryaṁ”, “kā” is parsed, and after the following ‘r’, we have a non-vowel (‘y’) and then a vowel (‘a’). Hence the unit is closed as “kār”. The word is pronounced in Sanskrit as kār-yaṁ, and hence this is valid.

Exceptions to Rule 6:

a. Whenever the vowel is followed by the consonant pairs “jñ” or “kṣ”, the unit is closed with the vowel itself. E.g.: in the word “ajñā”, the pronunciation is a-jñā and not aj-ñā as Rule 6 would have it.
b. In cases where the consonant pairs “pr”, “br” or “kr”, or the consonant ‘h’, follow a short vowel, the unit is closed with the vowel itself. E.g.: the word “sapriyaḥ” is to be pronounced sa-priyaḥ and not sap-riyaḥ as would be required by Rule 6.
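The core of these cases can be sketched compactly in code. The following is our own simplification, not the authors’ full algorithm: it assumes the verse has already been tokenized into phonemes (so digraphs such as “dh” or “ai” arrive as single tokens, covering Cases 2 and 3), and it implements only Cases 1 and 4-6, omitting the ‘r’ cases (7 and 8) and the exceptions:

```python
VOWELS = {"a", "ā", "i", "ī", "u", "ū", "ṛ", "ṝ", "ḷ", "e", "ai", "o", "au"}
MARKS = {"ṁ", "ḥ"}  # anusvāra and visarga

def split_units(phonemes):
    """Group a phoneme sequence into pronounceable units (simplified)."""
    units, cur, i, n = [], [], 0, len(phonemes)
    while i < n:
        # Case 1: consume letters up to and including the first vowel
        while i < n and phonemes[i] not in VOWELS:
            cur.append(phonemes[i]); i += 1
        if i < n:
            cur.append(phonemes[i]); i += 1
        if i < n and phonemes[i] in MARKS:
            # Case 4: a trailing anusvāra/visarga joins the unit, closing it
            cur.append(phonemes[i]); i += 1
        elif i + 1 < n and phonemes[i] not in VOWELS and phonemes[i + 1] not in VOWELS:
            # Case 6: a consonant followed by a non-vowel joins the unit
            cur.append(phonemes[i]); i += 1
        # Case 5 (consonant then vowel): nothing extra joins the unit
        units.append("".join(cur)); cur = []
    return units
```

On the sample words from the text, this yields `["van", "de"]` for “vande” and `["gu", "rū", "ṇāṁ"]` for “gurūṇāṁ”, matching the splits given above.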
We propose the following parsing algorithm, which parses a given verse, handles all the above cases including the exceptions, and recognizes the units in the verse.

Algorithm SplitVerseIntoUnits
// strVerse contains the entire verse; strVerse(i) denotes its i-th character.
// strUnit stores the unit being built and is initialized to the empty string.
// "Closing" strUnit emits the unit and re-initializes strUnit to the empty string.
i = 0;
while i < strVerse.length() do
    while strVerse(i) is not a vowel do            // Case 1: parse till the first vowel
        append strVerse(i) to strUnit;  i = i + 1;
    end while
    k = strVerse(i);                               // the vowel of the unit
    if k = 'a' and (strVerse(i+1) = 'i' or strVerse(i+1) = 'u') then
        k = k + strVerse(i+1);  i = i + 1;         // Case 3: the vowel is really 'ai' or 'au'
    end if
    // Case 2: an unaspirated stop followed by 'h' counts as a single aspirate consonant
    k1 = strVerse(i+1);  d1 = 1;
    if k1 is one of k, g, c, j, ṭ, ḍ, t, d, p, b and strVerse(i+2) = 'h' then
        k1 = k1 + 'h';  d1 = 2;
    end if
    k2 = strVerse(i+d1+1);  d2 = 1;
    if k2 is one of k, g, c, j, ṭ, ḍ, t, d, p, b and strVerse(i+d1+2) = 'h' then
        k2 = k2 + 'h';  d2 = 2;
    end if
    k3 = strVerse(i+d1+d2+1);
    append k to strUnit;
    if k1 is the visarga or the anusvāra then              // Case 4
        append k1 to strUnit;  i = i + d1;
        close strUnit;
    else if k1 is not a vowel then
        if k2 is a vowel then                              // Case 5
            close strUnit;
        else if k1k2 = "jñ" or k1k2 = "kṣ" then            // Exception a
            close strUnit;
        else if k is a short vowel and
                (k1k2 = "pr" or k1k2 = "br" or k1k2 = "kr" or k1 = 'h') then  // Exception b
            close strUnit;
        else if k1 = 'r' then
            if k3 is not a vowel then                      // Case 7
                append k1 and k2 to strUnit;  i = i + d1 + d2;
            else                                           // Case 8
                append k1 to strUnit;  i = i + d1;
            end if
            close strUnit;
        else                                               // Case 6
            append k1 to strUnit;  i = i + d1;
            close strUnit;
        end if
    end if
    i = i + 1;                                     // move past the vowel
end while
end Algorithm

5. The Audio Units Database

Determining the units to be recorded as audio files and stored in a database is a non-trivial task, as is clear from the discussions in Sections 2 and 4 above. The possible number of such recordable units is huge. There are a total of 34 letters in the consonant, semi-vowel, sibilant and aspirate categories, and 13 vowels. As such, for units with one consonant and one vowel, we would have 34 x 13 = 442 possible recordable units, which seems feasible. But for units having two consonants, a vowel and a consonant, there are theoretically 34 x 13 x 34 x 34 = 510,952 possible recordable units! The number of cases with more than two consonants in close succession would obviously be much greater. Thus, we see that there is a combinatorial explosion in the number of recordable units required under this scheme. This problem was surmounted by extensively analyzing the words present in a comprehensive Sanskrit dictionary [13]. Pronounceable units were gathered through this exercise, and the total number of units for practical use was thus substantially reduced. Further, the glyphs of the Sanskrit 2003 font were also studied, and some possible units were eliminated as unpronounceable. This font was chosen in particular because it provides a plethora of composite glyphs as well. In this manner, the number of recordable units was significantly reduced, to yield a database of just approximately 2000 recorded unit clips. The recording of each unit was done in the same vocal pitch.
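As a quick check of the counts quoted earlier in this section (using the paper’s own figures of 34 consonant-like letters and 13 vowels):

```python
consonant_like = 34  # consonants + semi-vowels + sibilants + aspirate (paper's count)
vowels = 13          # vowels listed in Table 1

cv_units = consonant_like * vowels          # consonant + vowel units: 442
ccvc_units = consonant_like ** 3 * vowels   # two consonants + vowel + consonant: 510952

print(cv_units, ccvc_units)
```

The second figure already dwarfs the roughly 2000 clips the empirical pruning arrives at, which is the point of the dictionary-based reduction.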
However, the length of intoning was varied according to whether the unit being intoned has a long (guru) vowel or a short (laghu) one. All long vowels are guru and all short vowels are laghu. The exception is that laghu vowels become guru if they are

1. followed by double consonants (this being optional in the case of pr, br, kr and h), or
2. followed by the anusvāra or visarga.

All laghu units were recorded in one time unit, and all guru ones in two. For example, in the sample verse given in Section 2 above, the word “vande” has two units, “van” and “de”, which are respectively of the laghu and guru kinds. As such, “van” was recorded in one time unit (2 seconds) and “de” in two time units (4 seconds). The reason for using 2 and 4 seconds rather than, say, 1 and 2 seconds, is that units have different numbers of letters in them and yet have to be intoned in the same time span. For example, the units “kā” and “kārt” are both guru units, and hence must each be intoned in two time units. However, it would clearly take a little longer to pronounce the whole of “kārt” than to pronounce “kā”. Hence “kā” would have to be elongated while being recorded. It is to provide for this that the longer time spans of 2 and 4 seconds were fixed and followed during the recording process. The purpose of assigning the duration of an audio clip based on whether the unit being pronounced is laghu or guru is to help make the synthesized chanting of the verse follow a beat. This factor is important in order to achieve a near-human reproduction of verse-chanting. The beat to be followed for a verse varies according to the metre and associated caesura of the verse [11].

6. The Musical Component of the Speech Synthesizer

Indian music is similar to its Western counterpart in the theory of notes and octaves. There are a total of 12 notes in each octave, with adjacent notes a semitone apart.
The basic notes bear the names sa, ri (soft), ri, ga (soft), ga, ma, ma (sharp), pa, dha (soft), dha, ni (soft), ni. These 12 notes repeat themselves at higher and higher frequencies, and thus octaves are formed. To make the chanting of the verse tuneful, it is necessary to introduce these musical notes. The fundamental idea is that slight changes made to the frequency of a recorded audio unit file result in a change of the musical note at which the unit is heard when played. The changes must necessarily be slight, for otherwise it is found that the very texture of the voice changes with vast changes in frequency.

Table 2: Values given for musical notes
sa:         -7
ri (soft):  -6
ri:         -5
ga (soft):  -4
ga:         -3
ma:         -2
ma (sharp): -1
pa:          0
dha (soft):  1
dha:         2
ni (soft):   3
ni:          4

Let u_1, u_2, …, u_n be the n units constituting the given verse. The two variables associated with each such unit are the syllable intoned and the pitch of the sound. We assume that the same octave is maintained throughout. Let p_i denote the pitch at which u_i should be intoned in order for the verse to be chanted tunefully. The final output for the i-th unit is then a function f of u_i and p_i. Thus, the final speech output is not just

    u_1 ⊕ u_2 ⊕ … ⊕ u_n    but    f(u_1, p_1) ⊕ f(u_2, p_2) ⊕ … ⊕ f(u_n, p_n),

where ⊕ stands for concatenation. The pitch levels p_1, p_2, …, p_n for the n syllables of a verse’s quarter were fixed for the categories of metres enumerated and stored in the database created for the earlier work on metre classification outlined in Section 3 [11]. Here, n may vary from 1 to 26 for equal-quarter metres, as well as for metres with half-equal and unequal quarters. The values of p_i were fixed according to the general scheme presented in Table 2.
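One natural way to realize f(u_i, p_i) — an assumption on our part, since the paper specifies only “slight changes” to the frequency — is to treat each step in Table 2 as one equal-tempered semitone, so that a note value v scales the recorded unit’s playback frequency by 2^(v/12):

```python
# Note values from Table 2; pa is the reference note (value 0).
NOTE_VALUES = {
    "sa": -7, "ri (soft)": -6, "ri": -5, "ga (soft)": -4, "ga": -3,
    "ma": -2, "ma (sharp)": -1, "pa": 0, "dha (soft)": 1, "dha": 2,
    "ni (soft)": 3, "ni": 4,
}

def frequency_factor(value):
    """Factor by which to scale a unit's base frequency for a note value,
    assuming 12-tone equal temperament (one value step = one semitone)."""
    return 2.0 ** (value / 12.0)
```

Under this scheme, `frequency_factor(NOTE_VALUES["pa"])` is 1.0 (no shift), and a full 12-step rise would double the frequency, i.e. move up one octave. The values in Table 2 span under an octave, which keeps the shifts small enough to preserve the texture of the recorded voice, as the text requires.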
By tradition, the note pa is considered the middle note and is therefore assigned the value 0. We create an array p[] containing the p_i values for each category of metres from 1 to 26 syllables per quarter, and also for the half-equal and unequal metres. Consider the two examples depicted in Tables 3 and 4.

Table 3: Values of p[] for the Anuṣṭhup metre (8 syllables per quarter)
Quarters 1, 3:  0  1  1  2  2  0  1  1
Quarters 2, 4:  0  1 -1  0  0  1  1  1

Table 4: Values of p[] for the Indravajrā, Upendravajrā and Upajāti metres (11 syllables per quarter)
Quarters 1, 3:  0  0  1  2  2  0  0  1 -1  0 -1
Quarters 2, 4:  0  1  0  0  0  0 -1  0  1  1  1

7. The Algorithm for Text-to-Tuneful-Speech Synthesis

Initially, the verse is parsed by the metre classification algorithm and converted to its binary representation, with laghu syllables taking the value 0 and guru syllables taking the value 1 [11]. For the sample verse given in Section 2 above, Table 5 depicts the binary representation thus obtained. However, this may not match the binary representation of individual units in the current context. Table 6 depicts the scenario generated by the consideration of individual units. The reason for the discrepancy between Tables 5 and 6 is that in the current context, depicted by Table 6, we consider units as individual entities and independently categorize them as laghu or guru. In the case of Table 5, by contrast, we consider each syllable in conjunction with the following one and correct the laghu-guru status accordingly. For instance, “van” has a short vowel and is hence valued at 0 in Table 6. However, the full context of “van” is “vande”, and so the double consonant “nd” after the vowel ‘a’ forces the syllable “van” to be considered guru and not laghu (as discussed in Section 5). Hence it is valued at 1 in Table 5.
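The laghu-to-guru promotion that distinguishes the two views of the verse can be sketched as follows. This is our simplification: a unit is guru at the unit level if it carries a long vowel, and a short unit is promoted in context when letters follow its vowel inside the unit — a closing consonant (which implies a consonant cluster), anusvāra or visarga:

```python
LONG_VOWELS = ("ai", "au", "ā", "ī", "ū", "ṝ", "e", "o")
SHORT_VOWELS = set("aiuṛḷ")

def unit_value(unit):
    """Unit-level value: 1 (guru) if the unit carries a long vowel, else 0."""
    return 1 if any(v in unit for v in LONG_VOWELS) else 0

def contextual_value(unit):
    """Contextual value: a short unit becomes guru when extra letters follow
    its vowel inside the unit (closing consonant, anusvāra or visarga)."""
    if unit_value(unit):
        return 1
    return 0 if unit[-1] in SHORT_VOWELS else 1
```

Applied to the eleven units of the sample quarter, `unit_value` reproduces the independent (unit-level) values and `contextual_value` the corrected ones: “van” and “vin” end in a consonant and so are promoted from 0 to 1.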
Table 5: Actual binary representation of one quarter of the sample verse
Syllable:  van  de  gu  rū  ṇāṁ  ca  ra  ṇā  ra  vin  de
Value (v):  1   1   0   1   1    0   0   1   0   1    1

Table 6: Binary representation of one quarter of the sample verse when split into pronounceable units
Syllable:  van  de  gu  rū  ṇāṁ  ca  ra  ṇā  ra  vin  de
Value (t):  0   1   0   1   1    0   0   1   0   0    1

Now, since we assign 1 unit of time to a laghu syllable and 2 units to a guru one, we adopt the following notation. Let v_i be the actual value of unit u_i (as per Table 5, used for metre identification), and let t_i be its unit-level value (as per Table 6). We denote by T_E the expected total units of time to chant the quarter of the verse under consideration, and by T_A the actual total units of time to chant the concerned quarter. Therefore, we have

    T_E = Σ_{i=1}^{n} (v_i + 1)
    T_A = Σ_{i=1}^{n} (t_i + 1)

Clearly, T_A ≤ T_E, because short syllables in the units may be converted to long when factors regarding the adjacent unit are taken into consideration, whereas long syllables are never converted to short. We present a general algorithm to adjust the beat of the chant, the chanting being achieved by concatenating the pre-recorded unit audio clips. One solution to making sure the beat is maintained is to insert single units of silence wherever a unit is intoned as laghu instead of guru. However, since a period of silence cannot be introduced in the middle of a word, in such cases the syllabic unit concerned is instead stretched to cover one more time unit.
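For the sample quarter, the timing totals work out as follows (a small sketch using the values from Tables 5 and 6; the variable names are ours):

```python
v = [1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1]  # Table 5: contextual (actual) values
t = [0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1]  # Table 6: unit-level values

T_E = sum(vi + 1 for vi in v)  # expected chanting time: 18 time units
T_A = sum(ti + 1 for ti in t)  # time before adjustment:  16 time units
deficit = T_E - T_A            # time units of silence/stretching needed: 2
```

The deficit of 2 corresponds to the two promoted units, “van” and “vin”, each of which must gain one time unit through silence or stretching for the chant to hold its beat.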
Algorithm ConcatenateUnits
nTotalUnits = n;
if T_A < T_E then
    k = 1;
    while k <= nTotalUnits do
        if t_k < v_k then
            if k is the end of a word then
                insert 1 time unit of silence at position k+1;
            else
                stretch the k-th audio unit by 1 time unit;
            end if
            nTotalUnits = nTotalUnits + 1;
        end if
        k = k + 1;
    end while
end if
n = nTotalUnits;
concatenate f(u_1, p_1), f(u_2, p_2), …, f(u_n, p_n);
append 1 time unit of silence at the caesura of the metre;
end Algorithm

8. The Overall Synthesis Algorithm

We now present the overall algorithm incorporating all the factors discussed above.

Algorithm VerseTextToTunefulSpeechSynthesizer
Step 1: Parse the verse, identify its metre and caesura, and retrieve the stored pitch array values;
Step 2: Apply sandhi rules to correct the specific cases in which the pronunciation would change;
Step 3: Call Algorithm SplitVerseIntoUnits to identify the pronounceable units in the verse;
Step 4: Retrieve the appropriate audio file for each unit from the file collection;
Step 5: Call Algorithm ConcatenateUnits to adjust the beat of the chanting and to apply the note (pitch) variations that make the chanting tuneful;
Step 6: Play the concatenated file;
end Algorithm

The audio unit files were stored as .wav files, and the concatenation was done by streaming them consecutively into a target .wav file. It was found that the function f(u_i, p_i) used to change the frequency, and hence the musical tone, of an audio file may be realized both through the free APIs provided with the versatile open-source software Audacity [15] and through the Microsoft DirectX SDK [17]. Clearly, the accuracy of the output depends on the tone and time duration of the recorded audio units. As such, during testing, changes had to be made to these audio unit files in terms of the intoning pitch and, more importantly, the time for which individual units were recorded. This drastically improved the output quality.

9. Conclusions

Text-to-speech synthesis of Sanskrit verse is a hitherto unsolved problem. Such synthesis is problem-ridden owing to the numerous complexities inherent in the Sanskrit language in general and in versification in particular. This work presents a beguilingly simple, yet comprehensive and effective solution based on the concatenative method of text-to-speech synthesis. The novel method presented here does not suffer from any performance bottlenecks. This work utilizes earlier work by the authors on the metrical classification of Sanskrit verses and on the Pāṇinian method of sandhi processing, and builds a text-to-tuneful-speech synthesizer for Sanskrit verse. Empirical methods of analysis were used to create algorithms for splitting the verse into pronounceable units of text and to significantly reduce the audio corpus required for the algorithm to function well. Furthermore, since Sanskrit verses are always tunefully chanted rather than uttered in a prosaic way, this solution incorporates a unique musical element, achieving a tuneful rendering of verses of various metres through manipulation of the frequency of the sound at appropriate places. The work would be of tremendous use to those with visual impairments or reading disabilities who would want to listen to, and even memorize, Sanskrit verses from any text they wish.

References
[1] Alistair Conkie, Mark Beutnagel, Ann Syrdal and Philip Brown, "Preselection of candidate units in a unit selection-based text-to-speech synthesis system", Proceedings of ICSLP, volume 3, pages 314-317, 2000.
[2] Andrew Hunt and Alan Black, "Unit selection in a concatenative speech synthesis system", Proceedings of ICASSP '96, volume 1, pages 373-376, Atlanta, GA, 1996.
[3] Anirban Lahiri, Satya Jyoti Chattopadhyay and Anupam Basu, "Sparsha: A Comprehensive Indian Language Toolset for the Blind", Proceedings of the 7th International ACM SIGACCESS Conference on Computers and Accessibility, October 2005, pages 114-120 (ISBN 1-59593-159-7).
[4] Cyril Allauzen, Mehryar Mohri and Michael Riley, "Statistical modeling for unit selection in speech synthesis", Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, 2004, Article No. 55.
[5] Daniel Jurafsky and James H. Martin, Speech and Language Processing, Pearson Education, 2000 (reprinted 2005).
[6] Hideyuki Mizuno and Satoshi Takahashi, "Unit selection using k-nearest neighbor search for concatenative speech synthesis", Proceedings of the 3rd International Universal Communication Symposium, 2009, pages 379-382 (ISBN 978-1-60558-641-0).
[7] Jonathan Allen, M. Sharon Hunnicutt and Dennis Klatt, From Text to Speech: The MITalk System, Cambridge University Press, 1987 (ISBN 0521306418).
[8] Mark Beutnagel, Mehryar Mohri and Michael Riley, "Rapid unit selection from a large speech corpus for concatenative speech synthesis", Proceedings of Eurospeech, volume 2, pages 607-610, 1999.
[9] Ignatius G. Mattingly, "Speech synthesis for phonetic and phonological models", in Thomas A. Sebeok (ed.), Current Trends in Linguistics, Volume 12, Mouton, The Hague, pages 2451-2487, 1974.
[10] Rama N. and Meenakshi Lakshmanan, "A New Computational Schema for Euphonic Conjunctions in Sanskrit Processing", IJCSI International Journal of Computer Science Issues, Vol. 5, 2009, pages 43-51 (ISSN Online 1694-0784, ISSN Print 1694-0814; http://ijcsi.org/papers/IJCSI-5-43-51.pdf, last accessed on 20.01.2010).
[11] Rama N. and Meenakshi Lakshmanan, "A Computational Algorithm for Metrical Classification of Verse", submitted to IJCSI International Journal of Computer Science Issues, Vol. 5, 2009 (ISSN Online 1694-0784, ISSN Print 1694-0814).
[12] P. Rubin, T. Baer and P. Mermelstein, "An articulatory synthesizer for perceptual research", Journal of the Acoustical Society of America, 70, pages 321-328, 1981.
[13] Vaman Shivram Apte, Practical Sanskrit-English Dictionary, Motilal Banarsidass Publishers Pvt. Ltd., Delhi, 1998; Revised and Enlarged Edition, 2007.
[14] Van Santen, P. H., Richard William Sproat, Joseph P. Olive and Julia Hirschberg, Progress in Speech Synthesis, Springer, 1997 (ISBN 0387947019).

Websites
[15] Audacity Sound Editor and Recorder, http://audacity.sourceforge.net (last accessed on 17.01.2010).
[16] Indian Institute of Technology, Madras, http://acharya.iitm.ac.in/disabilities/mbrola.php (last accessed on 17.01.2010).
[17] Microsoft DirectX Developer Center, http://msdn.microsoft.com/en-us/directx/default.aspx (last accessed on 17.01.2010).
[18] Technology Development for Indian Languages, Department of Information Technology, Ministry of Communication & Information Technology, Government of India, http://tdil.mit.gov.in/standards.htm (last accessed on 17.01.2010).

Dr. Rama N. completed her B.Sc. (Mathematics), Master of Computer Applications and Ph.D. (Computer Science) at the University of Madras, India. She served in faculty positions at Anna Adarsh College, Chennai, and as Head of the Department of Computer Science at Bharathi Women’s College, Chennai, before moving to Presidency College, Chennai, where she currently serves as Associate Professor. She has 20 years of teaching experience, including 10 years of postgraduate (PG) teaching, and has guided 15 M.Phil. students.
She has been the Chairperson of the Board of Studies in Computer Science for UG, and a Member of the Board of Studies in Computer Science for PG and Research, at the University of Madras. Her current research interest is program security. She is a Member of the Editorial cum Advisory Board of the Oriental Journal of Computer Science and Technology.

Meenakshi Lakshmanan completed her B.Sc. (Mathematics) and Master of Computer Applications at the University of Madras and her M.Phil. (Computer Science), and is currently pursuing a Ph.D. (Computer Science) at Mother Teresa Women’s University, Kodaikanal, India. She is also pursuing Level 4 Sanskrit (Samartha) of the Samskṛta Bhāṣā Pracāriṇī Sabhā, Chittoor, India. Having started her career as an executive at SRA Systems Pvt. Ltd., she switched to academics and currently heads the Department of Computer Science at Meenakshi College for Women, Chennai, India. She is a professional member of the ACM and the IEEE.
