Semi-supervised Text Categorization Using Recursive K-means Clustering

📝 Original Info

  • Title: Semi-supervised Text Categorization Using Recursive K-means Clustering
  • ArXiv ID: 1706.07913
  • Date: 2017-06-27
  • Authors: Harsha S Gowda, Mahamad Suhil, D S Guru, Lavanya Narayana Raju

📝 Abstract

In this paper, we present a semi-supervised learning algorithm for the classification of text documents, including a method for labeling unlabeled text documents. The method follows a divide-and-conquer strategy: it uses a recursive K-means algorithm to partition a collection containing both labeled and unlabeled documents. K-means is applied recursively to each partition until a desired level of partitioning is reached, such that each partition contains labeled documents of a single class. Once the desired clusters are obtained, their centroids are taken as representatives of the clusters, and the nearest neighbor rule is used to classify an unknown text document. A series of experiments on the 20Newsgroups dataset has been conducted to demonstrate the superiority of the proposed model over other recent state-of-the-art models.
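The procedure summarized in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`kmeans`, `recursive_kmeans`, `classify`), the pure-Python K-means, the Euclidean distance, the choice of K as the number of labeled classes present in a partition, the majority-label fallback, and the decision to drop clusters without any labeled document are all assumptions made for illustration. The excerpt does not specify the document representation or the exact stopping criterion, so the toy 2-D vectors below stand in for real document features.

```python
import random
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's K-means; returns one cluster id per row of X."""
    rng = random.Random(seed)
    centers = rng.sample(X, k)  # assumption: random initial centers
    assign = None
    for _ in range(iters):
        new_assign = [min(range(k), key=lambda c: sq_dist(x, centers[c]))
                      for x in X]
        if new_assign == assign:  # assignments stable: converged
            break
        assign = new_assign
        for c in range(k):
            members = [x for x, a in zip(X, assign) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = mean(members)
    return assign

def recursive_kmeans(X, y, seed=0):
    """Split until each partition holds labeled documents of one class.

    X: list of vectors; y: list of labels, None for unlabeled documents.
    Returns a list of (centroid, label) cluster representatives.
    """
    classes = {lab for lab in y if lab is not None}
    if not classes:
        return []  # assumption: ignore clusters with no labeled document
    if len(classes) == 1:
        return [(mean(X), next(iter(classes)))]
    # assumption: K equals the number of labeled classes in this partition
    assign = kmeans(X, len(classes), seed=seed)
    if len(set(assign)) < 2:
        # assumption: if K-means cannot split, fall back to majority label
        majority = Counter(l for l in y if l is not None).most_common(1)[0][0]
        return [(mean(X), majority)]
    reps = []
    for c in sorted(set(assign)):
        idx = [i for i, a in enumerate(assign) if a == c]
        reps += recursive_kmeans([X[i] for i in idx],
                                 [y[i] for i in idx], seed + 1)
    return reps

def classify(x, reps):
    """Nearest neighbor rule over the cluster centroids."""
    return min(reps, key=lambda r: sq_dist(x, r[0]))[1]

# Toy data: two well-separated groups, one labeled document in each.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
     [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
y = ["sport", None, None, "tech", None, None]

reps = recursive_kmeans(X, y)
print(classify([0.3, 0.0], reps))
print(classify([4.5, 5.5], reps))
```

In this sketch the unlabeled documents influence the result only through the centroid positions, which is one simple way to let unlabeled data shape the class regions; each split produces at least two non-empty clusters, so the recursion always terminates.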

📄 Full Content

Semi-Supervised Text Categorization Using Recursive K-means Clustering

Harsha S Gowda, Mahamad Suhil, D S Guru and Lavanya Narayana Raju
Department of Studies in Computer Science, University of Mysore, Mysore, India.
harshasgmysore@gmail.com, mahamad45@yahoo.co.in, dsg@compsci.uni-mysore.ac.in and swaralavz@gmail.com

Abstract. In this paper, we present a semi-supervised learning algorithm for the classification of text documents, including a method for labeling unlabeled text documents. The method follows a divide-and-conquer strategy: it uses a recursive K-means algorithm to partition a collection containing both labeled and unlabeled documents. K-means is applied recursively to each partition until a desired level of partitioning is reached, such that each partition contains labeled documents of a single class. Once the desired clusters are obtained, their centroids are taken as representatives of the clusters, and the nearest neighbor rule is used to classify an unknown text document. A series of experiments on the 20Newsgroups dataset has been conducted to demonstrate the superiority of the proposed model over other recent state-of-the-art models.

Keywords: Unlabeled Text Documents, Recursive K-means Algorithm, Semi-supervised Learning, Text Categorization.

1 Introduction

The amount of text available on the web is so large that anybody can find information on almost any topic. However, the performance of retrieval systems is still far below expectations. One major reason is that the amount of labeled text available is negligibly small compared to the amount of unlabeled text. Automatic text categorization is therefore in high demand in many applications as a way to organize large collections of text.
To this end, many machine learning algorithms have been developed that use unlabeled text in addition to the available labeled text to draw clear boundaries between the classes of documents in a corpus (Nigam et al., 2011; Zhang et al., 2015; Su et al., 2011), so that unlabeled samples can subsequently be categorized effectively. The process of using unlabeled samples along with a small set of labeled samples to better understand the class structure is known as semi-supervised clustering (Zhu, 2008). There are generally two approaches to semi-supervised learning: a similarity-based approach and a search-based approach. In the similarity-based approach, an existing clustering algorithm that uses a similarity metric is employed, while in the search-based approach, the clustering algorithm itself is modified so that the user-provided labels bias the search for an appropriate partition. A detailed comparative analysis of these two approaches can be found in (Basu et al., 2003), and a review of semi-supervised clustering methods is given in (Bair, 2013).

A method based on hierarchical clustering is proposed by (Zhang et al., 2015), where labeled and unlabeled texts are used, respectively, for capturing silhouettes and adapting centroids of text clusters. A simple semi-supervised extension of multinomial Naive Bayes has been proposed by (Su et al., 2011); this model improves its results when unlabeled data are added. The notion of weakly related unlabeled data is introduced in (Yang et al., 2009); the strength of this model is that it works even with a small training pool. A model that combines expectation maximization with a Naive Bayes classifier is introduced by (Nigam et al., 2013). This algorithm first trains a classifier using the available labeled documents and subsequently labels the unlabeled documents probabilistically.
A variant of expectation maximization that integrates Bayesian regression has also been proposed (Zhang et al., 2015). A fuzzy semi-supervised model based on the co-clustering concept can also be found in the literature (Yan et al., 2013).

From the above literature survey, we note that semi-supervised learning has received growing attention from researchers in recent years, and that considering unlabeled data together with labeled data at learning time tends to improve results. In this direction, we address text categorization through semi-supervised learning and propose a search-based approach. The proposed approach is based on partitional clustering, in which the K-means algorithm is adapted to run recursively, and it follows a divide-and-conquer strategy. Initially, the K-means clustering algorithm is used to partition the sample space into as many clusters as there are classes, and subsequently, on each obtained cluster, we employ the K-means algorithm recursively till the partitions meet a pre-defined cr

…(Full text truncated)…
