Semi-Supervised Text Categorization using Recursive K-means Clustering
Harsha S Gowda, Mahamad Suhil, D S Guru and Lavanya Narayana Raju
Department of Studies in Computer Science, University of Mysore, Mysore, India.
harshasgmysore@gmail.com, mahamad45@yahoo.co.in,
dsg@compsci.uni-mysore.ac.in and swaralavz@gmail.com
Abstract. In this paper, we present a semi-supervised learning algorithm for
classifying text documents, together with a method for labeling unlabeled text
documents. The method follows a divide and conquer strategy: it uses a
recursive K-means algorithm to partition a collection containing both labeled
and unlabeled documents. K-means is applied recursively to each partition until
every partition contains labeled documents of a single class. Once the
desired clusters are obtained, their centroids are taken as cluster
representatives, and the nearest neighbor rule is used to classify an
unknown text document. A series of experiments on the 20Newsgroups dataset
has been conducted to demonstrate the superiority of the proposed model over
other recent state-of-the-art models.
Keywords: Unlabeled Text Documents, Recursive K-means Algorithm, Semi-
supervised Learning, Text Categorization.
1 Introduction
The amount of text content available on the web is so abundant that anybody can
get information related to any topic. However, the performance of retrieval systems
is still far below expectations. One of the major reasons is that the amount of
labeled text available is negligibly small compared to that of unlabeled text.
Thus, automatic text categorization is in high demand in many applications
that need to organize a huge collection of text content. To this end,
many machine learning algorithms are being developed that make use of
unlabeled text in addition to the available labeled text to draw clear boundaries
between the different classes of documents present in the corpus (Nigam et al., 2011;
Zhang et al., 2015; Su et al., 2011), so that further categorization of unlabeled
samples can be done effectively. The process of using unlabeled samples along with a
small set of labeled samples to better understand the class structure is known as
semi-supervised clustering (Zhu, 2008).
Generally, there are two approaches to semi-supervised learning: a similarity
based approach and a search based approach. In the similarity based
approach, an existing clustering algorithm that uses a similarity metric is employed,
while in the search based approach, the clustering algorithm itself is modified so that
the user-provided labels are used to bias the search for an appropriate partition. A
detailed comparative analysis of these two approaches can be found in (Basu et al., 2003).
For more details, readers can refer to a review of semi-supervised clustering methods
(Bair, 2013).
A method based on hierarchical clustering is proposed by Zhang et al. (2015),
where labeled and unlabeled texts are used, respectively, for capturing
silhouettes and adapting centroids of text clusters. A simple semi-supervised
extension of multinomial Naive Bayes has been proposed (Su et al., 2011); this
model improves the results when unlabeled data are added. The notion of weakly
related unlabeled data is introduced in (Yang et al., 2009). The strength of this model
is that it works even with a small training pool. A model that combines
expectation maximization with a Naive Bayes classifier is introduced by Nigam et al.
(2013). This algorithm first trains a classifier using the available labeled documents and
subsequently labels the unlabeled documents probabilistically. A variant of
expectation maximization that integrates Bayesian regression has also been proposed
(Zhang et al., 2015). A fuzzy semi-supervised model based on the co-clustering concept
can also be found in the literature (Yan et al., 2013).
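The idea of training on the labeled documents first and then labeling the unlabeled documents probabilistically can be illustrated with a small sketch. The code below is a generic expectation maximization loop around a multinomial Naive Bayes model, written under our own assumptions; it is not the implementation of any of the cited works. The function names, the Laplace smoothing, the fixed number of iterations and the toy token lists are all hypothetical choices made for illustration.

```python
import math

def train_nb(docs, weights, classes, vocab):
    """Multinomial Naive Bayes with Laplace smoothing, fitted from soft labels.

    weights[i][c] is the (possibly fractional) membership of document i in class c.
    """
    prior, cond = {}, {}
    total = sum(sum(w.values()) for w in weights)
    for c in classes:
        mass = sum(w[c] for w in weights)
        prior[c] = math.log((mass + 1.0) / (total + len(classes)))
        counts = {t: 1.0 for t in vocab}  # ASSUMPTION: add-one smoothing
        for doc, w in zip(docs, weights):
            for t in doc:
                counts[t] += w[c]
        norm = sum(counts.values())
        cond[c] = {t: math.log(n / norm) for t, n in counts.items()}
    return prior, cond

def posterior(doc, prior, cond):
    """Class posterior P(c | doc); terms outside the vocabulary are ignored."""
    logp = {c: prior[c] + sum(cond[c][t] for t in doc if t in cond[c]) for c in prior}
    top = max(logp.values())
    expd = {c: math.exp(v - top) for c, v in logp.items()}
    z = sum(expd.values())
    return {c: v / z for c, v in expd.items()}

def em_naive_bayes(labeled, unlabeled, n_iter=5):
    """Fit on labeled data, then alternate soft labeling (E) and refitting (M)."""
    classes = sorted({c for _, c in labeled})
    vocab = {t for d, _ in labeled for t in d} | {t for d in unlabeled for t in d}
    lab_docs = [d for d, _ in labeled]
    hard = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    prior, cond = train_nb(lab_docs, hard, classes, vocab)  # labeled documents only
    for _ in range(n_iter):  # ASSUMPTION: fixed iteration count, no convergence test
        soft = [posterior(d, prior, cond) for d in unlabeled]  # E-step
        prior, cond = train_nb(lab_docs + unlabeled, hard + soft, classes, vocab)  # M-step
    return prior, cond

# Hypothetical toy corpus: each document is a list of tokens.
labeled = [(["goal", "match", "team"], "sport"), (["cpu", "code", "chip"], "tech")]
unlabeled = [["goal", "league", "team"], ["chip", "kernel", "code"],
             ["league", "match"], ["kernel", "cpu"]]
prior, cond = em_naive_bayes(labeled, unlabeled)
print(posterior(["league"], prior, cond))
```

In this toy corpus the token "league" never occurs in a labeled document, so any preference the final model shows for one class on that token comes from the probabilistically labeled documents alone.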
Based on the above literature survey, we observe that semi-supervised
learning has received increasing attention from researchers in recent years. It is also
noted that considering unlabeled data along with labeled data at the time of learning
tends to improve the results. In this direction, in this work, we
attempt text categorization through semi-supervised learning
and, as a result, propose a search based approach. The proposed approach is
based on partitional clustering, wherein the K-means algorithm is adapted to be
recursive. The proposed model follows a divide and conquer strategy.
Initially, the K-means clustering algorithm is used to partition the sample space into as
many clusters as there are classes; subsequently, on each obtained cluster, we employ the
K-means algorithm recursively till the partitions meet a pre-defined cr
…(Full text truncated)…
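The recursive partitioning outlined in the abstract and the introduction can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the stopping rule (each partition holds labeled documents of a single class) and the nearest neighbor rule over centroids come from the abstract, while the seeding of K-means from one labeled document per class, the handling of partitions without labeled documents, the majority-class fallback and the 2-D toy vectors standing in for document term vectors are our own hypothetical choices.

```python
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def kmeans(vectors, seeds, max_iter=100):
    """Lloyd's K-means started from the given seeds; returns cluster assignments."""
    centers = [list(s) for s in seeds]
    assign = None
    for _ in range(max_iter):
        new_assign = [min(range(len(centers)), key=lambda j: sq_dist(v, centers[j]))
                      for v in vectors]
        if new_assign == assign:  # assignments stable: converged
            break
        assign = new_assign
        for j in range(len(centers)):
            members = [v for v, a in zip(vectors, assign) if a == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = centroid(members)
    return assign

def recursive_kmeans(vectors, labels):
    """Return (centroid, class) pairs; labels[i] is a class name or None."""
    known = sorted({l for l in labels if l is not None})
    if not known:  # ASSUMPTION: partitions without labeled documents are dropped
        return []
    if len(known) == 1:  # stopping rule: labeled documents of a single class
        return [(centroid(vectors), known[0])]
    # ASSUMPTION: one seed per class, taken from its first labeled document
    seeds = [vectors[labels.index(c)] for c in known]
    assign = kmeans(vectors, seeds)
    if len(set(assign)) == 1:  # ASSUMPTION: no split possible, use majority class
        majority = Counter(l for l in labels if l is not None).most_common(1)[0][0]
        return [(centroid(vectors), majority)]
    leaves = []
    for j in range(len(known)):
        idx = [i for i, a in enumerate(assign) if a == j]
        leaves += recursive_kmeans([vectors[i] for i in idx], [labels[i] for i in idx])
    return leaves

def classify(vector, leaves):
    """Nearest neighbor rule over the cluster centroids."""
    return min(leaves, key=lambda leaf: sq_dist(vector, leaf[0]))[1]

# Hypothetical toy data: "sport" has two separate groups, so one split is not enough.
docs = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 4.9], [10.0, 0.0], [9.8, 0.2]]
labels = ["sport", None, "tech", None, "sport", None]
leaves = recursive_kmeans(docs, labels)
print(len(leaves), classify([9.5, 0.5], leaves))
```

In the toy data the first split leaves one partition containing labeled documents of both classes, so the recursion splits it again. A class can therefore end up represented by more than one centroid, which is what allows the nearest neighbor rule to handle classes that are not a single compact cluster.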