Role of Morphology Injection in Statistical Machine Translation

📝 Original Info

  • Title: Role of Morphology Injection in Statistical Machine Translation
  • ArXiv ID: 1709.05487
  • Date: 2017-09-19
  • Authors: Sreelekha S., Piyush Dungarwal, Pushpak Bhattacharyya

📝 Abstract

Phrase-based statistical models are widely used because they perform well in terms of both translation quality and system complexity. Hindi, and Indian languages in general, are morphologically richer than English. Hence, although phrase-based systems perform very well for less divergent language pairs, English-to-Indian-language translation needs more linguistic information (such as morphology, parse trees, and part-of-speech tags) on the source side. Factored models are useful in this case because they treat a word as a vector of factors. These factors can carry any information about the surface word, and the model can use that information while translating. The objective of this work is therefore to handle morphological inflections in Hindi and Marathi using factored translation models when translating from English. SMT approaches face the problem of data sparsity when translating into a morphologically rich language: a parallel corpus is very unlikely to contain all morphological forms of its words. We propose a solution that generates these unseen morphological forms and injects them into the original training corpora. In this paper, we study factored models and the problem of sparseness in the context of translation into morphologically rich languages. The proposed solution is simple and effective, and is based on enriching the input with various morphological forms of words. We observe that morphology injection improves translation quality in terms of both adequacy and fluency. We verify this with experiments on two morphologically rich languages, Hindi and Marathi, translating from English.

📄 Full Content

Role of Morphology Injection in Statistical Machine Translation SREELEKHA. S, Indian Institute of Technology Bombay, India PUSHPAK BHATTACHARYYA, Indian Institute of Technology Bombay, India

Morphology Injection; a case study on Indian Language perspective

  • Computing methodologies → Artificial intelligence → Natural language processing → Machine translation

  • Computing methodologies → Artificial intelligence → Natural language processing → Phonology / morphology

Additional Key Words and Phrases: Statistical Machine Translation, Factored Machine Translation Models, Morphology Injection.

  1. INTRODUCTION

Formally, machine translation is a subfield of computational linguistics that investigates the use of software to translate text or speech from one natural language to another¹. Languages do not encode the same information in the same way, which makes machine translation a difficult task. Machine translation methods are classified as transfer-based, rule-based, example-based, interlingua-based, statistics-based, etc. Statistical machine translation (SMT) is a machine translation paradigm in which translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora².
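As a point of reference for the definition of SMT above, the classical noisy-channel formulation can be written as follows. This is the standard textbook form, not an equation taken from this excerpt; the symbols $f$ (source sentence), $e$ (candidate target sentence) and $\hat{e}$ (chosen translation) are notation assumed here for illustration.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f) \;=\; \arg\max_{e} P(f \mid e)\, P(e)
```

Under this reading, $P(f \mid e)$ is a translation model whose parameters are estimated from the bilingual corpus, and $P(e)$ is a language model of the target language.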

This work is supported by Department of Science & Technology, Government of India under Woman Scientist Scheme (WOS-A) with the project code – “SR/WOS-A/ET/1075/2014”.

Authors' addresses: Sreelekha. S, DST-Woman Scientist, Dept. of Computer Science and Engineering, Indian Institute of Technology Bombay, India, Email: sreelekha@cse.iitb.ac.in; Piyush Dungarwal, Indian Institute of Technology Bombay, India, Email: piyushdd@cse.iitb.ac.in; Pushpak Bhattacharyya, Vijay & Sita Vashee Chair Professor, Indian Institute of Technology Bombay, India, Email: pb@cse.iitb.ac.in.

1 http://en.wikipedia.org/wiki/Machine_translation

2 http://en.wikipedia.org/wiki/Statistical_machine_translation

  • Word-based models: The basic unit of translation is a word. IBM Models 1 to 5 describe these models. Even though these models are simple, their biggest disadvantage is that they do not consider context while modeling.

  • Phrase-based models: The aim is to reduce the restrictions of word-based models by translating contiguous chunks of words, also called phrases. Note that these phrases need not be linguistic phrases. The length of a phrase is variable. Phrase-based models are currently the most widely used models for SMT.

  • Syntax-based models: Syntax-based translation is based on the idea of translating syntactic units, rather than single words or strings of words as in phrase-based MT. These models make use of syntactic features of a sentence such as parse trees, part-of-speech (POS) tags, etc.

  • Hierarchical phrase-based models: Hierarchical phrase-based translation combines the strengths of phrase-based and syntax-based translation. It uses phrases (segments or blocks of words) as units of translation and uses synchronous context-free grammars as rules (as in syntax-based translation).

  • Factored phrase-based model
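The abstract describes factored models as treating a word as a vector of factors, and proposes generating unseen morphological forms and injecting them into the training corpus. The following is a minimal Python sketch of those two ideas under stated assumptions, not the authors' implementation. The factor set (surface, lemma, number, case), the suffix table, the romanized target forms, and the names `FactoredWord`, `generate_forms` and `inject` are all hypothetical and chosen only for illustration. The pipe-separated token layout follows the convention used for factored input in the Moses toolkit.

```python
from dataclasses import dataclass

# Hypothetical suffix table for a single toy noun class (romanized endings).
# Illustrative only: a real system would need a full morphological generator.
SUFFIXES = {
    ("singular", "direct"): "A",
    ("singular", "oblique"): "e",
    ("plural", "direct"): "e",
    ("plural", "oblique"): "oM",
}


@dataclass(frozen=True)
class FactoredWord:
    """A source word represented as a vector of factors."""
    surface: str
    lemma: str
    number: str
    case: str

    def to_factored_token(self) -> str:
        # Join the factors with '|' into one token.
        return "|".join((self.surface, self.lemma, self.number, self.case))


def generate_forms(source_lemma, target_stem):
    """Build (factored source token, inflected target form) pairs."""
    pairs = []
    for (number, case), suffix in SUFFIXES.items():
        # Toy assumption: the English plural is formed by appending 's'.
        surface = source_lemma + "s" if number == "plural" else source_lemma
        source = FactoredWord(surface, source_lemma, number, case)
        pairs.append((source.to_factored_token(), target_stem + suffix))
    return pairs


def inject(corpus, source_lemma, target_stem):
    """Return the corpus extended with generated pairs it does not contain yet."""
    seen = set(corpus)
    new_pairs = [p for p in generate_forms(source_lemma, target_stem)
                 if p not in seen]
    return list(corpus) + new_pairs


if __name__ == "__main__":
    corpus = [("boy|boy|singular|direct", "laDakA")]
    for source_token, target_form in inject(corpus, "boy", "laDak"):
        print(source_token, "->", target_form)
```

The sketch keeps the original corpus entries in place and appends only the pairs that were missing, so running the injection a second time adds nothing.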

…(Full text truncated)…
