Text document classification aims to associate one or more predefined categories with a document, based on the likelihood suggested by a training set of labeled documents. Many machine learning algorithms can be used to train a system on predefined categories. Among them, Naïve Bayes is attractive because it is simple, easy to implement, and achieves better accuracy on large datasets in spite of its naïve independence assumption. Because of the importance of the Naïve Bayes machine learning approach, this study examines its use for text document classification and the statistical event models available. The survey also discusses and compares the various feature selection methods, along with the metrics related to text document classification.
(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 7, No. 2, 2010
A Survey of Naïve Bayes Machine Learning approach in
Text Document Classification
Vidhya.K.A
Department of Computer Science
Pondicherry University
Pondicherry, India
G.Aghila
Department of Computer Science
Pondicherry University
Pondicherry, India
Abstract: Text document classification aims to associate one
or more predefined categories with a document, based on the
likelihood suggested by a training set of labeled documents. Many
machine learning algorithms can be used to train a system on
predefined categories. Among them, Naïve Bayes is attractive
because it is simple, easy to implement, and achieves better
accuracy on large datasets in spite of its naïve independence
assumption.
Because of the importance of the Naïve Bayes machine learning
approach, this study examines its use for text document
classification and the statistical event models available. The
survey also discusses and compares the various feature selection
methods, along with the metrics related to text document
classification.
Keywords: Text Mining; Naïve Bayes; Event Models; Metrics;
Probability Distribution.
I. INTRODUCTION
Text document classification is the task of assigning a
document to predefined categories based on the contents of
the document. A document is represented by a piece of text
expressed as phrases or words. Traditionally, text
categorization is done by human experts, which usually
requires a large amount of time. In recent years, text
categorization has become an important research topic in
machine learning, information retrieval, and e-mail spam
filtering. It has also become an important research topic in
text mining, which analyses and extracts useful information
from texts. More learning techniques have been studied for
dealing with text categorization. The existing text
classification methods can be grouped into the following six
categories [11], [12], [13]:
(1) Based on Rocchio's method (Dumais, Platt, Heckerman,
& Sahami, 1998; Hull, 1994; Joachims, 1998; Lam & Ho,
1998).
(2) Based on K-nearest neighbors (KNN) (Hull, 1994; Lam &
Ho, 1998; Tan, 2005; Tan, 2006; Yang & Liu, 1999).
(3) Based on regression models (Yang, 1999; Yang & Liu,
1999).
(4) Based on Naïve Bayes and Bayesian nets (Dumais et al.,
1998; Hull, 1994; Yang & Liu, 1999; Sahami, 1996).
(5) Based on decision trees (Fuhr & Buckley, 1991; Hull,
1994).
(6) Based on decision rules (Apté, Damerau, & Weiss, 1994;
Cohen & Singer, 1999).
Among these six types, this survey aims at an intuitive
understanding of the Naïve Bayes approach. The application of
various machine learning techniques to the text categorization
problem has been explored in fields such as medicine and e-mail
filtering, including rule learning for knowledge-base systems.
The survey is oriented towards the various probabilistic
approaches of the Naïve Bayes machine learning algorithm, with
which text categorization aims to classify documents with
optimal accuracy.
The Naïve Bayes model works with conditional probability,
which originates from the well-known statistical result "Bayes'
theorem", whereas "naïve" refers to the assumption that all the
attributes of the examples are independent of each other given
the context of the category. Because of the independence
assumption, the parameters for each attribute can be learned
separately, and this greatly simplifies learning, especially when
the number of attributes is large [15]. In the context of text
classification, the probability that a document d belongs to class
c is calculated by Bayes' theorem as follows:
$$P(c/d) = \frac{P(d/c)\,P(c)}{P(d)}$$
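As a small worked illustration of Bayes' theorem in this setting, the following Python sketch computes P(c/d) for two classes from P(c) and P(d/c). The class names, the probability values, and the function name `posterior` are hypothetical and chosen only for illustration; they do not come from the survey.

```python
def posterior(priors, likelihoods):
    """Return P(c/d) for every class c, given P(c) and P(d/c)."""
    # Numerator of Bayes' theorem for each class: P(d/c) * P(c)
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    # Denominator P(d): the sum of the numerators over all classes
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}


# Hypothetical values, used only to exercise the formula
priors = {"spam": 0.4, "ham": 0.6}          # P(c)
likelihoods = {"spam": 0.05, "ham": 0.01}   # P(d/c) for one fixed document d
result = posterior(priors, likelihoods)
```

Since P(d) is the same for every class, it can be dropped when only the most probable class is needed.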
The estimation of P(d/c) is difficult since the number of
possible vectors d is too high. This difficulty is overcome by
using the naïve assumption that any two coordinates of the
document are statistically independent given the category. Using
this assumption, the most probable category c can be estimated.
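Under the naïve assumption, P(d/c) factorizes into a product of per-coordinate terms, so the most probable category is the c that maximizes P(c) multiplied by the product of P(w/c) over the words w of the document. The Python sketch below illustrates this under stated assumptions: word-count features, add-one smoothing, log-probabilities to avoid underflow, and a toy training set. None of these choices, nor the names `train` and `classify`, are taken from the survey, and this is not meant to be any specific event model it describes.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs: list of (tokens, label). Returns log P(c), log P(w/c), vocabulary."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        # Add-one smoothing: an assumption of this sketch, so that words
        # unseen in a class do not force a zero probability.
        log_like[c] = {
            w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
            for w in vocab
        }
    return log_prior, log_like, vocab


def classify(tokens, log_prior, log_like, vocab):
    """Return the class maximizing log P(c) + sum of log P(w/c)."""
    scores = {}
    for c in log_prior:
        # Independence assumption: per-word log-probabilities simply add up.
        # Words outside the training vocabulary are ignored.
        scores[c] = log_prior[c] + sum(
            log_like[c][w] for w in tokens if w in vocab
        )
    return max(scores, key=scores.get)


# Toy training data, invented for illustration
docs = [
    ("cheap pills offer".split(), "spam"),
    ("cheap offer now".split(), "spam"),
    ("meeting agenda attached".split(), "ham"),
    ("project meeting notes".split(), "ham"),
]
model = train(docs)
label_a = classify("cheap offer".split(), *model)
label_b = classify("meeting notes".split(), *model)
```

Each P(w/c) is estimated separately from simple counts, which is the simplification the independence assumption buys.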
The survey is organized as follows: Section II presents the
survey work, where the probabilistic event models are
discussed; Section III covers the data characteristics
affecting the Naïve Bayes model; Section IV gives the results
of the Naïve Bayes text classification method; and Section V
concludes.
II. SURVEY WORK
Despite its popularity, there has been some confusion in the
document classification community about the "Naïve Bayes"
classifier, because there are two different generative models in
common use, both of which make the Naïve Bayes
assumption. One model specifies that a document is
represented by a vector of binary attributes indicating which
words occur and do not occur in the document. The number of
times a word occurs in a document is not captured. When
calculating the probability of a document, one multiplies t
…(Full text truncated)…