📝 Original Info
- Title: Genetic Algorithm (GA) in Feature Selection for CRF Based Manipuri Multiword Expression (MWE) Identification
- ArXiv ID: 1111.2399
- Date: 2012-03-30
- Authors: Kishorjit Nongmeikapam (Manipur Institute of Technology, Manipur University, India), kishorjit.nongmeikapa@gmail.com; Sivaji Bandyopadhyay (Jadavpur University, Kolkata, India), sivaji_cse_ju@yahoo.com
📝 Abstract
This paper deals with the identification of Multiword Expressions (MWEs) in Manipuri, a highly agglutinative Indian language. Manipuri is listed in the Eighth Schedule of the Indian Constitution. MWEs play an important role in Natural Language Processing (NLP) applications such as Machine Translation, Part of Speech tagging, Information Retrieval and Question Answering. Feature selection is an important factor in the recognition of Manipuri MWEs using Conditional Random Field (CRF). The drawbacks of manually choosing appropriate features for running the CRF motivate the use of a Genetic Algorithm (GA). Using the GA, we are able to find the optimal features to run the CRF. We ran fifty generations of feature selection, with three-fold cross validation as the fitness function. The model achieved a Recall (R) of 64.08%, Precision (P) of 86.84% and F-measure (F) of 73.74%, an improvement over CRF-based Manipuri MWE identification without the GA.
📄 Full Content
GENETIC ALGORITHM (GA) IN FEATURE
SELECTION FOR CRF BASED MANIPURI
MULTIWORD EXPRESSION (MWE)
IDENTIFICATION
Kishorjit Nongmeikapam¹ and Sivaji Bandyopadhyay²
¹Department of Computer Sc. and Engineering, Manipur Institute of Technology,
Manipur University, Imphal, India
kishorjit.nongmeikapa@gmail.com
²Department of Computer Sc. & Engg., Jadavpur University,
Jadavpur, Kolkata, India
sivaji_cse_ju@yahoo.com
ABSTRACT
This paper deals with the identification of Multiword Expressions (MWEs) in Manipuri, a highly
agglutinative Indian language. Manipuri is listed in the Eighth Schedule of the Indian Constitution. MWEs
play an important role in Natural Language Processing (NLP) applications such as Machine
Translation, Part of Speech tagging, Information Retrieval and Question Answering. Feature selection is
an important factor in the recognition of Manipuri MWEs using Conditional Random Field (CRF). The
drawbacks of manually choosing appropriate features for running the CRF motivate the use of a
Genetic Algorithm (GA). Using the GA, we are able to find the optimal features to run the CRF.
We ran fifty generations of feature selection, with three-fold cross validation as the fitness
function. The model achieved a Recall (R) of 64.08%, Precision (P) of 86.84% and F-measure (F)
of 73.74%, an improvement over CRF-based Manipuri MWE identification without the GA.
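The three reported scores can be checked against one another. The sketch below assumes the F-measure is the balanced harmonic mean of precision and recall (F1); the abstract does not state which weighting is used, so this is an assumption, and `f_measure` is a name chosen here for illustration.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall (assumed F1)."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, in percent.
f = f_measure(86.84, 64.08)
print(round(f, 2))
```

Under that assumption the computed value agrees with the reported 73.74% to two decimal places.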
KEYWORDS
CRF, MWE, Manipuri, GA, Features
1. INTRODUCTION
MWE identification is an important topic in Natural Language Processing (NLP) applications such as
Part of Speech Tagging, Information Retrieval, Question Answering, Summarization and Machine
Translation. An MWE is an ordered group of words that can stand
independently and carries a meaning different from that of its constituent words. Examples in
English include ‘to and fro’, ‘bye bye’ and ‘kick the bucket’. MWEs include compounds (both
word-compounds and phrasal compounds), fixed expressions and technical terms. A fixed expression
is an MWE whose constituent words cannot be moved randomly or substituted without
distorting the overall meaning or allowing a literal interpretation. Fixed expressions range from
word-compounds and collocations to idioms. Some proverbs and quotations can also be
considered fixed expressions.
Manipuri is a highly agglutinative Indian language spoken mainly in Manipur and in some
other parts of North-East India. Outside India, it is also spoken in parts of
Bangladesh and Myanmar. It belongs to the Tibeto-Burman language family and
is a Scheduled Language of the Indian Constitution.
Manipuri uses two scripts: the borrowed Bengali script and its original
Meitei Mayek script. This work adopts Manipuri in Bengali script, since a
Meitei Mayek corpus is so far hard to collect. Developing an automatic
MWE identification system requires either a comprehensive set of linguistically motivated rules or a large
amount of annotated corpora in order to achieve reasonable performance.
CRF has been applied to identify Multiword Expressions in other
languages and even in Manipuri, but the main handicap is the selection of features for the
CRF. For this reason, a hybrid model that identifies MWEs using CRF as in [1] and GA [2] is
designed. Feature selection for a CRF model is not an easy task. So far, feature selection
for CRF-based MWE identification has been done manually, by trial and error. A new approach to this
issue is to apply a GA. A simple Genetic Algorithm (GA) model is adopted to
choose the features, as in [3].
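The idea of letting a GA choose the feature subset can be sketched as follows. This is a minimal illustration, not the authors' implementation: each chromosome is a bit mask over the candidate features, and `toy_fitness` is a hypothetical stand-in for the real fitness, which in the paper is obtained by three-fold cross validation of the CRF trained with the selected features. The tournament selection, one-point crossover, elitism, population size and rates are all assumptions made for this sketch; only the fifty generations come from the paper.

```python
import random


def run_ga(n_features, fitness, generations=50, pop_size=20,
           crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Evolve a bit mask over n_features; bit i == 1 means feature i is used."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    def select(pool):
        # Binary tournament selection (an assumed choice for this sketch).
        a, b = rng.sample(pool, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: the best mask survives unchanged
        while len(children) < pop_size:
            p1, p2 = select(population), select(population)
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_features - 1)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation, applied independently to each gene.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best


def toy_fitness(mask):
    # Hypothetical fitness: rewards three "useful" features and penalizes
    # the rest. A real run would instead train the CRF on the selected
    # features and return its cross-validated F-measure.
    useful = {0, 2, 5}
    hits = sum(mask[i] for i in useful)
    noise = sum(mask) - hits
    return hits - 0.5 * noise


best_mask = run_ga(n_features=8, fitness=toy_fitness)
print(best_mask, toy_fitness(best_mask))
```

Because the best mask is copied into every new generation, the best fitness never decreases from one generation to the next. With a real CRF, each fitness call is expensive, so caching the score per mask would be a sensible addition.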
The paper is organized as follows. Section 2 reviews related work on MWEs in Manipuri and other
languages, followed by the motivation and challenges of Manipuri in Section 3. Sections 4 and 5
present the basic concepts of CRF and the Genetic Algorithm, and Section 6 describes a simple
stemming rule for Manipuri. Section 7 lists the features for running the CRF and discusses the hybrid
model of Manipuri MWE identification. Section 8 describes the experiments and
the evaluation. Section 9 draws the conclusion and outlines future work.
2. RELATED WORKS
Work on MWEs can be seen in [4]-[7]. Work on identifying MWEs in Indian languages
is reported in [8]-[12]. Published work on Named Entity Recognition (NER) and MWE
identification in Manipuri also exists: NER work is reported in [13]-[15], Manipuri MWE work
in [1], and reduplicated MWEs in [16] and [17]. Identifying Manipuri
MWEs is quite difficult because the only root words in Manipuri are nouns and verbs; words of
other parts of speech are derived from them. Stemming work for Manipuri can be found in
[18] and [19].
3. CHALLENGES AND MOTIVATION
3.1. Challenges in identification of MWE NEs in Indian languages
The notable work of [13] give