A WL-SPPIM Semantic Model for Document Classification
📝 Abstract
In this paper, we explore an SPPIM-based text classification method. Experiments reveal that the SPPIM method is equal or even superior to the SGNS method on text classification tasks over three standard international text datasets, namely 20newsgroups, Reuters52 and WebKB. Compared with SGNS, although SPPIM provides a better solution, it is not necessarily better in text classification tasks. By our analysis, SGNS takes weight calculation into consideration during its decomposition process, so it performs better than SPPIM on some standard datasets. Inspired by this, we propose a WL-SPPIM semantic model based on the SPPIM model, and experiments show that the WL-SPPIM approach achieves better classification and higher scalability in text classification tasks than the LDA, SGNS and SPPIM approaches.
📄 Content
1 A WL-SPPIM Semantic Model for Document Classification
Ming Li, Peilun Xiao, and Ju Zhang
Index Terms: LDA; SPPIM; word embedding; low frequency; document classification
1 INTRODUCTION
Distribution of semantic vectors is widely used in text semantic representation, including text classification, text clustering, semantic retrieval, automatic question answering, dictionary generation, semantic disambiguation, query expansion, text advertising and machine translation, and especially for measuring semantic relevance [1, 2]. We divide distributional semantic models (DSMs) into two categories: count-based models, to which many traditional DSMs belong, and prediction-based models, which are based on neural embeddings. Among the traditional count-based models, the best known is Latent Semantic Analysis (LSA). LSA builds a low-dimensional semantic space for texts and derives document vectors from co-occurrence information between words [3]. More recently, LDA has received increasing attention as a semantic model within the DSM family [4, 5].
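The LSA construction described above can be sketched in a few lines: build a term-document count matrix and take a truncated SVD to obtain low-dimensional document vectors. This is a minimal illustration, not the paper's implementation; the toy corpus and the latent dimension k=2 are assumptions for demonstration.

```python
# Minimal LSA sketch: document vectors via truncated SVD of a
# term-document co-occurrence matrix. Toy corpus is illustrative only.
import numpy as np

docs = [
    "cat sat on the mat",
    "dog sat on the log",
    "cats and dogs are pets",
]
vocab = sorted({w for d in docs for w in d.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Term-document count matrix X of shape (|V|, |D|)
X = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        X[idx[w], j] += 1

# Truncated SVD: X ≈ U_k S_k V_k^T; the columns of S_k V_k^T give
# k-dimensional document vectors in the latent semantic space.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_vecs = (np.diag(S[:k]) @ Vt[:k]).T  # shape (|D|, k) = (3, 2)
```

Cosine similarity between rows of `doc_vecs` then serves as a semantic relatedness measure between documents.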
LDA is a three-layer Bayesian probability model proposed by Blei et al. in 2003, with a layered structure of document, topic and word: the document-topic distribution follows a Dirichlet distribution, and the topic-word distribution follows a multinomial distribution. The LDA semantic model usually performs very well on NLP tasks, partly because it projects documents into a low-dimensional topic semantic space. However, the topics inferred by LDA are largely determined by word frequencies, so less frequent but important topics cannot be effectively captured. Pointwise mutual information (PMI) has been extensively used as a count-based measure in distributional semantic models; it is a popular word co-occurrence based measure [6, 7]. PMI has a well-known tendency to assign overly high scores to low-frequency words [2]. To address this limitation, many variants of PMI have been proposed. PPMI is the simplest of these variants: all PMI values below zero are set to zero [8]. Bullinaria and Levy demonstrate that PPMI outperforms various other weighting methods when measuring semantic similarity through word-context matrices [9]. Traditional distributional semantic models achieve considerable effectiveness on various NLP tasks, including semantic relevance and text classification. The last few years have seen the development of prediction-based neural embedding models, in which words are embedded into a low-dimensional space. Word embedding models can efficiently learn word vectors from large amounts of unstructured text and can effectively capture syntactic and semantic relations between words [10]. The pioneering work was started by Bengio and his colleagues, who generated word vectors in the course of studying neural language models [11], followed by a large body of subsequent work on word embedding models and efficient learning algorithms [12-16].
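The PMI and PPMI definitions above can be sketched directly from a word-context count matrix. This is a generic illustration of the standard formulas, PMI(w, c) = log(P(w, c) / (P(w)P(c))) with PPMI clipping negative values to zero; the toy count matrix C is an assumption for demonstration.

```python
# PPMI sketch from a toy word-context count matrix C (rows = words,
# columns = contexts). Negative PMI values are clipped to zero, which
# mitigates PMI's bias toward unreliable negative estimates.
import numpy as np

C = np.array([
    [4.0, 1.0, 0.0],
    [1.0, 3.0, 1.0],
    [0.0, 1.0, 2.0],
])
total = C.sum()
Pw = C.sum(axis=1, keepdims=True) / total   # marginal P(w)
Pc = C.sum(axis=0, keepdims=True) / total   # marginal P(c)
Pwc = C / total                             # joint P(w, c)

with np.errstate(divide="ignore"):          # log(0) -> -inf, clipped below
    pmi = np.log(Pwc / (Pw * Pc))
ppmi = np.maximum(pmi, 0.0)                 # PPMI: keep only positive PMI
```

Zero co-occurrence counts yield PMI of negative infinity, which the clipping step maps to zero, so the resulting matrix is sparse and non-negative.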
In particular, we note the conclusion, drawn in a sequence of papers by Mikolov and colleagues [14, 17], that the SGNS (skip-gram with negative sampling) model efficiently provides competitive results on various NLP tasks. The SGNS model maximizes the conditional probability of the observed contexts given the current word while scanning through the corpus; however, it is not clear what information the embedding vectors really convey, so prediction-based neural embeddings are considered opaque. Researchers have reached different conclusions about how the various types of distributional semantic models perform on NLP tasks; some of these conclusions
(Author footnote: The first author is with the High Performance Computing Application R&D Center, Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing, China; the University of Chinese Academy of Sciences, Beijing, China; and the Center for Speech and Language Technology, Research Institute of Information Technology, Tsinghua University, Beijing, China. E-mail: liming@c
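The connection between SGNS and the count-based models discussed here can be made concrete: Levy and Goldberg showed that SGNS with k negative samples implicitly factorizes a PMI matrix shifted by log k, and clipping negatives gives the shifted positive PMI matrix, SPPMI(w, c) = max(PMI(w, c) - log k, 0), the matrix underlying the SPPIM approach in this paper. The sketch below is a generic illustration of that formula, not the authors' code; the toy count matrix and k=2 are assumptions.

```python
# Shifted positive PMI (SPPMI) sketch: the matrix SGNS implicitly
# factorizes, clipped at zero. C is a toy word-context count matrix.
import numpy as np

def sppmi(C, k=5):
    """Return max(PMI - log k, 0) for count matrix C (words x contexts)."""
    total = C.sum()
    Pw = C.sum(axis=1, keepdims=True) / total
    Pc = C.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore"):      # log(0) -> -inf, clipped below
        pmi = np.log((C / total) / (Pw * Pc))
    return np.maximum(pmi - np.log(k), 0.0)

C = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
M = sppmi(C, k=2)
# A low-rank factorization of M (e.g. via truncated SVD) yields word
# vectors comparable to SGNS embeddings.
```

Larger shifts (larger k) zero out more entries, which is one reason the factorized SPPMI matrix can behave differently from SGNS, whose objective additionally weights frequent pairs during optimization.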