Genetic Algorithm (GA) in Feature Selection for CRF Based Manipuri Multiword Expression (MWE) Identification

📝 Original Info

  • Title: Genetic Algorithm (GA) in Feature Selection for CRF Based Manipuri Multiword Expression (MWE) Identification
  • ArXiv ID: 1111.2399
  • Date: 2012-03-30
  • Authors: Kishorjit Nongmeikapam (Manipur Institute of Technology, Manipur University, India), kishorjit.nongmeikapa@gmail.com; Sivaji Bandyopadhyay (Jadavpur University, Kolkata, India), sivaji_cse_ju@yahoo.com

📝 Abstract

This paper deals with the identification of Multiword Expressions (MWEs) in Manipuri, a highly agglutinative Indian language. Manipuri is listed in the Eighth Schedule of the Indian Constitution. MWEs play an important role in Natural Language Processing (NLP) applications such as Machine Translation, Part of Speech tagging, Information Retrieval and Question Answering. Feature selection is an important factor in the recognition of Manipuri MWEs using a Conditional Random Field (CRF). The drawback of manually selecting appropriate features for running the CRF motivates the use of a Genetic Algorithm (GA). Using the GA, we are able to find the optimal features for running the CRF. We ran fifty generations of feature selection, with three-fold cross-validation as the fitness function. The model achieved a Recall (R) of 64.08%, a Precision (P) of 86.84% and an F-measure (F) of 73.74%, an improvement over CRF based Manipuri MWE identification without the GA.
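The procedure summarized in the abstract (a GA searching over feature subsets, scored by three-fold cross-validation, run for fifty generations) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the excerpt does not give the chromosome encoding, the genetic operators, the population size, or the metric used inside cross-validation, so the bit-mask encoding, tournament selection, single-point crossover, bit-flip mutation, elitism and the F-measure fitness are all assumptions. `toy_fold_score`, `USEFUL` and `N_FEATURES` are hypothetical stand-ins for training and evaluating a real CRF on one fold.

```python
import random


def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical stand-in for "train a CRF on two folds, score on the third".
# A real run would train and evaluate a CRF here; this toy rewards a fixed,
# made-up set of "useful" feature indices so the sketch runs on its own.
N_FEATURES = 10
USEFUL = {0, 2, 5, 7}


def toy_fold_score(mask, fold):
    selected = {i for i, bit in enumerate(mask) if bit}
    if not selected:
        return 0.0
    hits = len(selected & USEFUL)
    precision = hits / len(selected)
    recall = hits / len(USEFUL)
    # Small fold-dependent factor so the three folds are not identical.
    return f_measure(precision, recall) * (1 - 0.01 * fold)


def fitness(mask, folds=3):
    """Mean score over the folds (assumed form of the CV fitness)."""
    return sum(toy_fold_score(mask, k) for k in range(folds)) / folds


def tournament(pop, rng, k=3):
    """Pick the fittest of k randomly sampled individuals."""
    return max(rng.sample(pop, k), key=fitness)


def crossover(a, b, rng):
    """Single-point crossover of two bit masks."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(mask, rng, rate=0.05):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in mask]


def run_ga(generations=50, pop_size=20, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(N_FEATURES)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        children = [best[:]]  # elitism: carry the best mask forward
        while len(children) < pop_size:
            child = crossover(tournament(pop, rng), tournament(pop, rng), rng)
            children.append(mutate(child, rng))
        pop = children
        best = max(pop, key=fitness)
    return best, fitness(best)


if __name__ == "__main__":
    best_mask, best_score = run_ga()
    print("selected features:", [i for i, b in enumerate(best_mask) if b])
    print("fitness:", round(best_score, 4))
```

As a side check on the reported figures, and assuming the F-measure is the balanced harmonic mean, `f_measure(0.8684, 0.6408)` gives about 0.7374, which agrees with the 73.74% stated in the abstract.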

📄 Full Content

GENETIC ALGORITHM (GA) IN FEATURE SELECTION FOR CRF BASED MANIPURI MULTIWORD EXPRESSION (MWE) IDENTIFICATION

Kishorjit Nongmeikapam (1) and Sivaji Bandyopadhyay (2)

(1) Department of Computer Sc. and Engineering, Manipur Institute of Technology, Manipur University, Imphal, India. kishorjit.nongmeikapa@gmail.com
(2) Department of Computer Sc. & Engg., Jadavpur University, Jadavpur, Kolkata, India. sivaji_cse_ju@yahoo.com

ABSTRACT

This paper deals with the identification of Multiword Expressions (MWEs) in Manipuri, a highly agglutinative Indian language. Manipuri is listed in the Eighth Schedule of the Indian Constitution. MWEs play an important role in Natural Language Processing (NLP) applications such as Machine Translation, Part of Speech tagging, Information Retrieval and Question Answering. Feature selection is an important factor in the recognition of Manipuri MWEs using a Conditional Random Field (CRF). The drawback of manually selecting appropriate features for running the CRF motivates the use of a Genetic Algorithm (GA). Using the GA, we are able to find the optimal features for running the CRF. We ran fifty generations of feature selection, with three-fold cross-validation as the fitness function. The model achieved a Recall (R) of 64.08%, a Precision (P) of 86.84% and an F-measure (F) of 73.74%, an improvement over CRF based Manipuri MWE identification without the GA.

KEYWORDS

CRF, MWE, Manipuri, GA, Features

1. INTRODUCTION

MWEs are an important topic in Natural Language Processing (NLP) applications such as Part of Speech Tagging, Information Retrieval, Question Answering, Summarization and Machine Translation. An MWE is composed of an ordered group of words which can stand independently and which carries a meaning different from that of its constituent words. Examples in English include 'to and fro', 'bye bye' and 'kick the bucket'.
MWEs include compounds (both word-compounds and phrasal compounds), fixed expressions and technical terms. A fixed expression MWE is one whose constituent words cannot be moved randomly or substituted without distorting the overall meaning or allowing a literal interpretation. Fixed expressions range from word-compounds and collocations to idioms. Some proverbs and quotations can also be considered fixed expressions.

Manipuri is a highly agglutinative Indian language spoken mainly in Manipur and in some other parts of North Eastern India. Outside India it is also spoken in some parts of Bangladesh and Myanmar. It belongs to the Tibeto-Burman language family and is a Scheduled Language of the Indian Constitution. Manipuri uses two scripts: the borrowed Bengali Script and its original script, Meitei Mayek. Manipuri in Bengali Script is adopted in this work since a corpus of Manipuri in Meitei Mayek is so far hard to collect.

The development of an automatic MWE system requires either a comprehensive set of linguistically motivated rules or a large amount of annotated corpora in order to achieve reasonable performance. CRF has been used in several attempts to identify Multiword Expressions in other languages and even in Manipuri, but the main handicap is the selection of features for the CRF. For this reason a hybrid model for identifying MWEs is designed, using CRF as in [1] together with GA [2]. Feature selection for a CRF model is not an easy task. So far, feature selection for CRF based MWE identification has been manual, or simply a matter of trial and error. A new approach to this issue is to apply a GA. A simple Genetic Algorithm (GA) model is adopted to choose the features, as in [3].

The paper is organized as follows. Related work on MWEs in Manipuri and other languages is covered in Section 2, followed by the motivation and challenges of Manipuri in Section 3.
Section 4 and Section 5 present the basic concepts of CRF and the Genetic Algorithm, and Section 6 describes a simple stemming rule for Manipuri. Section 7 lists all the features for running the CRF and discusses the hybrid model of Manipuri MWE identification. Section 8 describes the experiments and the evaluation. The conclusion and the road map for future work are given in Section 9.

2. RELATED WORKS

Work on MWEs can be seen in [4]-[7]. Work on identifying MWEs in Indian languages is also under way [8]-[12]. Published work on Named Entity Recognition (NER) and MWE identification in Manipuri also exists. NER work is reported in [13]-[15]. Manipuri MWE work is reported in [1], and reduplicated MWEs in [16] and [17]. The identification of Manipuri MWEs is quite difficult since the root words in Manipuri are only nouns and verbs; words of other parts of speech are derived from them. Stemming work in Manipuri can be found in [18] and [19].

3. CHALLENGES AND MOTIVATION

3.1. Challenges in identification of MWE NEs in Indian languages

The notable work of [13] give

Reference

This content is AI-processed based on open access ArXiv data.
