Entropy of Telugu

June 30, 2011


📝 Original Info

Title: Entropy of Telugu
ArXiv ID: 1106.5973
Date: 2011-06-30
Authors: Venkata Ravinder Paruchuri

📝 Abstract

This paper presents an investigation of the entropy of the Telugu script. Since this script is syllabic, and not alphabetic, the computation of entropy is somewhat complicated.


📄 Full Content

Indian languages are highly phonetic; i.e., the pronunciation of new words can be reliably predicted from their written form. A background to Indian scripts is given in [1]-[6]. Indian scripts are highly systematic in their arrangement of sounds, and the milieu in which they arose is described in papers on early Indian science [7]-[11]. From the evidence available at this time, it may be assumed that the 3rd millennium BC Indus script evolved into the Brahmi script of the late centuries BC, which in turn evolved into the different modern Indian scripts. Structurally, all the Indian scripts are thus closely related, although their forms may look quite different. The Brahmi script is also the parent of the Southeast Asian scripts.

The letters of Indian languages are classified into consonants, vowels and other symbols. An akshara, or syllable, consists of 0, 1, 2 or 3 consonants and a vowel or other symbol. Each akshara can be pronounced independently. The alphabets of all the Indian languages are derived from the Brahmi script. The Indian languages share 33 common consonants and 15 common vowels. In addition, there are 3-4 consonants and 2-3 vowels that are specific to each language but not very significant in practice. Words are made up of one or more aksharas. An akshara that contains more than one consonant is called a samyuktakshara.

This similarity in alphabet does not extend to the graphical form used for printing. Each language uses a different script, which consists of different graphemes. There are about 10-12 major scripts in India, among which Devanagari is the most widely used. Different languages have different statistical characteristics. Some scripts have an overline across the entire word, while others have graphemes that do not touch. The vowels and the supporting consonants in a samyuktakshara can appear to the left of, to the right of, above, or below the main consonant, or in combinations of these positions.

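The entropy of a language is a measure of the disorder associated with the system. For a random variable X with n outcomes x_1, ..., x_n, the Shannon entropy, a measure of uncertainty denoted by H(X), is defined as

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_b p(x_i)$$

where p(x_i) is the probability of outcome x_i and b is the base of the logarithm (b = 2 gives the entropy in bits).

The entropy of the English language is calculated by taking into consideration the 26 letters and the space character, leaving out punctuation.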
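The Shannon entropy definition can be illustrated with a minimal Python sketch. The function names `shannon_entropy` and `letters_and_space` are hypothetical, and base 2 is assumed as the default because the text does not state which logarithm base was used.

```python
import math
import string
from collections import Counter

ALLOWED = set(string.ascii_lowercase + " ")  # 26 letters plus space


def shannon_entropy(symbols, base=2):
    """Return -sum p(x) * log_base p(x) over the observed symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        p = count / total  # relative frequency used as the probability estimate
        entropy -= p * math.log(p) / math.log(base)
    return entropy


def letters_and_space(text):
    """Keep only the 26 letters and the space character, dropping punctuation."""
    return [ch for ch in text.lower() if ch in ALLOWED]


sample = letters_and_space("Entropy of the language, measured on a sample.")
print(round(shannon_entropy(sample), 3))
```

Probabilities here are estimated from relative frequencies in the sample, so the result depends on the text chosen and its length.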

The entropy of Telugu is computed by first transliterating Telugu text into English (Roman) letters using software and then applying the above-mentioned formula. The entropy is computed in two ways:

• By converting it to English and then considering the characters as English letters.

• By converting it to English and then considering the characters as Telugu letters.

To understand the conversion of Telugu into English font, here are some of the examples:

In the first method, after conversion into English font, the text is treated as a sequence of English letters; i.e., padmavibhUShaN^ is considered as the sequence of characters p, a, d, m, a, v, i, b, h, U, S, h, a, N, ^. The transliterated letters are case sensitive, since the upper- and lower-case forms of a letter have different meanings. The frequency of each letter is shown in Table 1. These results are based on 10,000 characters, and the frequencies are rounded off to 2 decimal places.
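The letter-level counting of the first method might look like the following sketch. The helper name `letter_frequencies` is hypothetical, and skipping whitespace is an assumption, since the text does not say whether spaces were counted for Telugu.

```python
from collections import Counter


def letter_frequencies(transliterated, digits=2):
    """Relative frequency of each character, keeping upper and lower case distinct."""
    # Assumption: whitespace is ignored; the source does not specify this.
    chars = [ch for ch in transliterated if not ch.isspace()]
    counts = Counter(chars)
    total = len(chars)
    return {ch: round(n / total, digits) for ch, n in counts.items()}


freqs = letter_frequencies("padmavibhUShaN^")
for ch, freq in sorted(freqs.items(), key=lambda item: -item[1]):
    print(ch, freq)
```

Because no case folding is applied, `U` and `u` (or `S` and `s`) are counted as separate symbols, matching the case-sensitive treatment described above.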

In the second method, after conversion into English font, the text is treated as a sequence of Telugu syllables rather than English letters. The words are partitioned on the basis of Telugu syllables. Some examples are shown below.

padmavibhUShaN^: pa, dma, vi, bhU, Sha, N^

kAryAlayaM: kA, ryA, la, yaM

sAdhyamainaMta: sA, dhya, mai, naM, ta
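One way to reproduce these partitions is a regular expression over the transliterated word. This is only a sketch under stated assumptions: the vowel inventory, the treatment of `M` and `H` as syllable-final modifiers, and the name `split_syllables` are all guesses made for illustration, since the text does not give the actual partitioning rules or name the software used.

```python
import re

# Assumed vowel inventory of the transliteration (not specified in the source).
VOWEL = r"(?:ai|au|[aAiIuUeEoO])"

# Assumed syllable shape: optional consonant cluster + vowel + optional M/H,
# or, failing that, a vowel-less cluster such as "N^".
SYLLABLE = re.compile(rf"[^aAiIuUeEoOMH^\s]*{VOWEL}[MH]?|[^aAiIuUeEoO\s]+")


def split_syllables(word):
    """Split one transliterated word into syllables under the assumptions above."""
    return SYLLABLE.findall(word)


for word in ["padmavibhUShaN^", "kAryAlayaM", "sAdhyamainaMta"]:
    print(word, split_syllables(word))
```

The pattern was chosen only so that it agrees with the three examples given; other words or other transliteration schemes may need different rules.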

The frequency of each syllable in the given text is computed, and the entropy of the language is then calculated.

In this case we consider one syllable at a time, and the calculated entropy of the language is approximately 5.98 for the 10,000 characters considered.

We continued this method by considering two syllables at a time, which decreases the computed entropy of the language.

In this approach, the words are partitioned into overlapping pairs of consecutive Telugu syllables. Some examples are shown below.

padmavibhUShaN^: padma, dmavi, vibhU, bhUSha, ShaN^

kAryAlayaM: kAryA, ryAla, layaM

sAdhyamainaMta: sAdhya, dhyamai, mainaM, naMta
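The two-syllable examples correspond to a sliding window over each word's syllable list. A minimal sketch follows, with the hypothetical helper `syllable_blocks` and the syllable lists copied from the one-syllable examples; the same helper with a larger `n` would give longer blocks.

```python
def syllable_blocks(syllables, n):
    """Join every run of n consecutive syllables into one overlapping block."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return ["".join(syllables[i:i + n]) for i in range(len(syllables) - n + 1)]


words = {
    "padmavibhUShaN^": ["pa", "dma", "vi", "bhU", "Sha", "N^"],
    "kAryAlayaM": ["kA", "ryA", "la", "yaM"],
    "sAdhyamainaMta": ["sA", "dhya", "mai", "naM", "ta"],
}

for word, syllables in words.items():
    print(word, syllable_blocks(syllables, 2))
```

Windows are formed within a word only; whether the original computation also formed blocks across word boundaries is not stated in the text.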

In this case the entropy of the language is calculated to be approximately 3.98 for the same 10,000 characters.

In the next step, we found the entropy of Telugu by considering three syllables at a time. In this approach, the words are partitioned into overlapping triples of consecutive Telugu syllables. Some examples are shown below.

padmavibhUShaN^: padmavi, dmavibhU, vibhUSha, bhUShaN^

kAryAlayaM: kAryAla, ryAlayaM

sAdhyamainaMta: sAdhyamai, dhyamainaM, mainaMta

In this case the entropy of the language is calculated to be approximately 2.739 for the same 10,000 characters.

In the next step, we found the entropy of Telugu language by considering four syllables at a time. In this approach, the words are partitioned on the


Reference

This content is AI-processed based on open access ArXiv data.
