Indowordnet's help in Indian Language Machine Translation

📝 Original Info

  • Title: Indowordnet's help in Indian Language Machine Translation
  • ArXiv ID: 1710.02086
  • Date: 2017-10-09
  • Authors: Sreelekha S., Pushpak Bhattacharyya (Indian Institute of Technology (IIT) Bombay, India)

📝 Abstract

Because Indian languages are low-resource languages, the development of Indian-Indian and English-Indian MT systems faces difficulty in translating various lexical phenomena. In this paper, we present a comparative study of 440 phrase-based statistical models trained for 110 language pairs across 11 Indian languages. We developed 110 baseline Statistical Machine Translation (SMT) systems. We then augmented the training corpus with Indowordnet synset word entries from the lexical database and trained a further 110 models on top of the baseline systems. We carried out a detailed performance comparison using evaluation metrics such as BLEU, METEOR and TER, and observed significant improvement in translation quality across all 440 models after using Indowordnet. These experiments give detailed insight in two ways: (1) the use of a lexical database with synset mapping for resource-poor languages, and (2) the efficient use of Indowordnet synset mapping. Moreover, synset-mapped lexical entries helped the SMT system handle ambiguity to a great extent during translation.

📄 Full Content

Indowordnet’s help in Indian Language Machine Translation

Sreelekha S, Pushpak Bhattacharyya
Indian Institute of Technology (IIT) Bombay, India
{sreelekha, pb}@cse.iitb.ac.in

Abstract

Because Indian languages are low-resource languages, the development of Indian-Indian and English-Indian MT systems faces difficulty in translating various lexical phenomena. In this paper, we present a comparative study of 440 phrase-based statistical models trained for 110 language pairs across 11 Indian languages. We developed 110 baseline Statistical Machine Translation (SMT) systems. We then augmented the training corpus with Indowordnet synset word entries from the lexical database and trained a further 110 models on top of the baseline systems. We carried out a detailed performance comparison using evaluation metrics such as BLEU, METEOR and TER, and observed significant improvement in translation quality across all 440 models after using Indowordnet. These experiments give detailed insight in two ways: (1) the use of a lexical database with synset mapping for resource-poor languages, and (2) the efficient use of Indowordnet synset mapping. Moreover, synset-mapped lexical entries helped the SMT system handle ambiguity to a great extent during translation.

Keywords: Indowordnet, Machine Translation

  1. Introduction

Machine Translation (MT) faces difficulty when dealing with morphologically complex languages. Being a country with rich linguistic diversity, India has 22 scheduled languages and 30 Indian languages. These languages are spread across four language families: Indo-Aryan, Dravidian, Tibeto-Burman and Austro-Asiatic, with 10 major scripts. Of these languages, Hindi, which belongs to the Indo-Aryan family, is the most prominent. Most official documents are in either Hindi or English, yet 95% of the population is not literate in English. Thus, for proper functioning, there is a large requirement to translate these official documents into regional languages. Moreover, media and news agencies need to translate news received in English from international news agencies into the respective regional languages. Hence, there is a huge requirement for automatic MT systems from English to Indian languages and between Indian languages.

The major challenge in developing MT systems between Indian languages is handling this linguistic diversity and rich morphology with a lack of proper resources. Many MT systems are being developed for Indian languages using rule-based, statistical and hybrid approaches (Antony P. J., 2013; Ashan et al., 2010; Brown et al., 1993; Nair et al., 2012; Sreelekha et al., 2013; Sreelekha et al., 2015; Sreelekha et al., 2017; Sreelekha et al., 2018). Of these, the Statistical MT (SMT) approach is the most promising because of its flexibility and its ease of development. In this work, we describe the phrase-based SMT systems we developed for 110 language pairs and our further attempts to improve translation quality on top of these baseline systems. After analyzing the developed SMT systems, we observed that they fail to handle various linguistic phenomena and inflected word forms. Hence, we decided to use Indowordnet in SMT system development as a lexical database that covers dictionary words, transliterations, short phrases and coined words.
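The idea of using a synset-linked lexical database as extra training data can be sketched as follows. This is only a minimal illustration, not the authors' pipeline: the `SYNSETS` table, its IDs and entries, and the function names `synset_pairs` and `augment_corpus` are all hypothetical, and the excerpt does not say how the entries were actually extracted or formatted. The sketch assumes that words sharing a synset ID in two languages can be treated as translation pairs and appended to the parallel corpus.

```python
from itertools import product

# Hypothetical synset table: synset ID -> language code -> member words.
# The IDs and entries are invented for illustration only.
SYNSETS = {
    101: {"en": ["water"], "hi": ["पानी", "जल"]},
    102: {"en": ["sun"], "hi": ["सूर्य", "सूरज"]},
}


def synset_pairs(synsets, src, tgt):
    """Return (source word, target word) pairs for words sharing a synset ID."""
    pairs = []
    for sid in sorted(synsets):
        entry = synsets[sid]
        # Skip synsets that have no members in one of the two languages.
        if src not in entry or tgt not in entry:
            continue
        pairs.extend(product(entry[src], entry[tgt]))
    return pairs


def augment_corpus(corpus, synsets, src, tgt):
    """Append synset word pairs that are not already in the parallel corpus."""
    seen = set(corpus)
    extra = [p for p in synset_pairs(synsets, src, tgt) if p not in seen]
    return list(corpus) + extra


baseline = [("water", "पानी")]
augmented = augment_corpus(baseline, SYNSETS, "en", "hi")
```

Here `augmented` keeps the baseline pair and adds the remaining synset pairs, so a word with several synonyms in the target language contributes one extra parallel line per synonym.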

  2. Indowordnet

IndoWordnet (Bhattacharyya, 2010) is a lexical database for various Indian languages, in which the Hindi wordnet is the root and all other Indian language wordnets are linked to it through the expansion approach. Words and their concepts are stored in a structure called the Lexical Matrix, where rows represent word meanings and columns represent word forms. IndoWordnet stores words and the relations between them, mainly Lexical Relations and Semantic Relations.

The Lexical Relations include Gradation (for state, size, light, gender, temperature, color, time, quality, action, manner), Antonymy (for action, amount, direction, gender, personality, place, quality, size, state, time, color, manner), Compound (for nouns) and Conjunction (for verbs).

The Semantic Relations include Hypernymy (for nouns and verbs), Holonymy (for nouns), Meronymy (for component object, member collection, feature, activity, place, area, face, state, portion, mass, resource, process, position, area), Troponymy (for verbs), Similar Attribute (between noun and adjective), Function verb (between noun and verb), Ability verb (between noun and verb), Capability verb (between noun and verb), Adverb modifies verb (between adverb and verb), Causative (for verbs), Entailment (for verbs), Near synset, and Adjective modifies noun (between adjective and noun).
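The Lexical Matrix described above can be pictured as a sparse table indexed by meaning and word form. The sketch below is a toy model under that reading, not IndoWordnet's actual storage format or API: the class `LexicalMatrix`, its method names, the meaning IDs `M1` and `M2`, and the English example words are all assumptions made for illustration.

```python
from collections import defaultdict


class LexicalMatrix:
    """Toy lexical matrix: rows are word meanings, columns are word forms."""

    def __init__(self):
        self._forms_by_meaning = defaultdict(set)
        self._meanings_by_form = defaultdict(set)

    def add(self, meaning_id, form):
        """Mark the cell (meaning_id, form) as filled."""
        self._forms_by_meaning[meaning_id].add(form)
        self._meanings_by_form[form].add(meaning_id)

    def synonyms(self, meaning_id):
        """Read one row: all forms that express the same meaning."""
        return sorted(self._forms_by_meaning.get(meaning_id, ()))

    def senses(self, form):
        """Read one column: more than one meaning marks an ambiguous form."""
        return sorted(self._meanings_by_form.get(form, ()))


matrix = LexicalMatrix()
matrix.add("M1", "bank")        # hypothetical meaning: financial institution
matrix.add("M1", "depository")
matrix.add("M2", "bank")        # hypothetical meaning: side of a river
matrix.add("M2", "shore")
```

Reading a row with `matrix.synonyms("M1")` gives the forms grouped under one meaning, while reading a column with `matrix.senses("bank")` shows that the form is ambiguous between two meanings, which is the kind of ambiguity the synset mapping is meant to help with.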

There are three principles the synset construction process must adhere to. Minimality principle insists on capturing that minimal set of the words in the

…(Full text truncated)…
