A Survey of Naïve Bayes Machine Learning approach in Text Document Classification


📝 Original Info

  • Title: A Survey of Naïve Bayes Machine Learning approach in Text Document Classification
  • ArXiv ID: 1003.1795
  • Date: 2010-03-10
  • Authors: Vidhya K. A. and G. Aghila, Department of Computer Science, Pondicherry University, Pondicherry, India

📝 Abstract

Text document classification aims to assign a document to one or more predefined categories based on the likelihood suggested by a training set of labeled documents. Many machine learning algorithms play a vital role in training the system with predefined categories. Among them, Naïve Bayes is notable because it is simple, easy to implement, and achieves better accuracy on large datasets in spite of its naïve independence assumption. Given the importance of the Naïve Bayes machine learning approach, this study examines its use in text document classification and the statistical event models available. The survey also discusses and compares various feature selection methods, along with the metrics related to text document classification.


📄 Full Content

(IJCSIS) International Journal of Computer Science and Information Security, Vol. 7, No. 2, 2010

A Survey of Naïve Bayes Machine Learning approach in Text Document Classification

Vidhya K. A., Department of Computer Science, Pondicherry University, Pondicherry, India

G. Aghila, Department of Computer Science, Pondicherry University, Pondicherry, India

Abstract: Text document classification aims to assign a document to one or more predefined categories based on the likelihood suggested by a training set of labeled documents. Many machine learning algorithms play a vital role in training the system with predefined categories. Among them, Naïve Bayes is notable because it is simple, easy to implement, and achieves better accuracy on large datasets in spite of its naïve independence assumption. Given the importance of the Naïve Bayes machine learning approach, this study examines its use in text document classification and the statistical event models available. The survey also discusses and compares various feature selection methods, along with the metrics related to text document classification.

Keywords: Text mining, Naïve Bayes, event models, metrics, probability distribution.

I. INTRODUCTION

Text document classification is the task of classifying a document into predefined categories based on the contents of the document. A document is represented by a piece of text expressed as phrases or words. Traditionally, text categorization has been done by human experts, which usually requires a large amount of time. In recent years, text categorization has become an important research topic in machine learning, information retrieval, and e-mail spam filtering. It has also become an important research topic in text mining, which analyses and extracts useful information from texts. Many learning techniques have been investigated for text categorization. The existing text classification methods can be classified into the following six categories [11],[12],[13]:

(1) Based on Rocchio's method (Dumais, Platt, Heckerman, & Sahami, 1998; Hull, 1994; Joachims, 1998; Lam & Ho, 1998).
(2) Based on K-nearest neighbors (KNN) (Hull, 1994; Lam & Ho, 1998; Tan, 2005; Tan, 2006; Yang & Liu, 1999).
(3) Based on regression models (Yang, 1999; Yang & Liu, 1999).
(4) Based on Naïve Bayes and Bayesian nets (Dumais et al., 1998; Hull, 1994; Yang & Liu, 1999; Sahami, 1996).
(5) Based on decision trees (Fuhr & Buckley, 1991; Hull, 1994).
(6) Based on decision rules (Apté, Damerau, & Weiss, 1994; Cohen & Singer, 1999).

Among these six types, this survey aims at an intuitive understanding of the Naïve Bayes approach, exploring the application of various machine learning techniques to text categorization problems in fields such as medicine and e-mail filtering, including rule learning for knowledge base systems. The survey is oriented towards the various probabilistic approaches of the Naïve Bayes machine learning algorithm, in which text categorization aims to classify documents with optimal accuracy.

The Naïve Bayes model works with conditional probability, which originates from the well-known statistical result "Bayes' theorem", whereas "naïve" refers to the assumption that all the attributes of the examples are independent of each other given the context of the category. Because of the independence assumption, the parameters for each attribute can be learned separately, which greatly simplifies learning, especially when the number of attributes is large [15]. In the context of text classification, the probability that a document d belongs to class c is calculated by Bayes' theorem as follows:

$$P(c \mid d) = \frac{P(d \mid c)\, P(c)}{P(d)}$$

The estimation of $P(d \mid c)$ is difficult since the number of possible vectors d is too high. This difficulty is overcome by using the naïve assumption that any two coordinates of the document are statistically independent.
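To make the role of the independence assumption concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that a document is treated as a sequence of word tokens, so that P(d | c) factorizes into a product of per-word probabilities P(w | c), and it scores each class by log P(c) plus the sum of log P(w | c). The function names (`train`, `log_posterior`, `classify`), the toy training set, and the add-one smoothing are all assumptions introduced for illustration; none of them come from the paper.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count classes and per-class word occurrences from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def log_posterior(tokens, label, model):
    """Return log P(c) + sum_i log P(w_i | c), the log of P(d | c) P(c).

    P(d) is left out because it is identical for every class and so does
    not change which class scores highest.
    """
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / total_docs)
    for word in tokens:
        if word not in vocab:
            continue  # skip words never seen during training
        # Add-one smoothing (an assumption of this sketch) avoids log(0).
        p_word = (word_counts[label][word] + 1) / (total_words + len(vocab))
        score += math.log(p_word)
    return score


def classify(tokens, model):
    """Pick the class with the highest unnormalized log posterior."""
    class_counts = model[0]
    return max(class_counts, key=lambda c: log_posterior(tokens, c, model))


# Hypothetical toy corpus, invented purely for illustration.
TOY_DOCS = [
    ("cheap pills buy now".split(), "spam"),
    ("buy cheap watches".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("project meeting notes".split(), "ham"),
]
model = train(TOY_DOCS)
```

Calling `classify("buy cheap pills".split(), model)` compares the two class scores and returns the larger one. Working in log space is a common design choice because multiplying many small probabilities directly can underflow.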
Using the independence assumption, the most probable category c can be estimated. The survey is organized as follows: Section II presents the survey work, where the probabilistic event models are discussed; Section III covers the data characteristics affecting the Naïve Bayes model; Section IV gives the results of the Naïve Bayes text classification method; and Section V concludes.

II. SURVEY WORK

Despite its popularity, there has been some confusion in the document classification community about the "Naïve Bayes" classifier because there are two different generative models in common use, both of which make the Naïve Bayes assumption. One model specifies that a document is represented by a vector of binary attributes indicating which words occur and do not occur in the document. The number of times a word occurs in a document is not captured. When calculating the probability of a document, one multiplies t

…(Full text truncated)…

Reference

This content is AI-processed based on ArXiv data.
