Word and Document Embeddings based on Neural Network Approaches
📝 Abstract
Data representation is a fundamental task in machine learning, and the representation of data affects the performance of the whole machine learning system. For a long time, data representation was achieved through feature engineering: researchers aimed to design better features for specific tasks. Recently, the rapid development of deep learning and representation learning has brought new inspiration to various domains. In natural language processing, the most widely used feature representation is the bag-of-words model, which suffers from data sparsity and cannot preserve word order. Other features, such as part-of-speech tags or more complex syntactic features, usually benefit only specific tasks. This thesis focuses on word representation and document representation. We compare existing systems and present our new models. First, for generating word embeddings, we make comprehensive comparisons among existing word embedding models. In terms of theory, we establish the relationship between the two most important models, i.e., Skip-gram and GloVe. In our experiments, we analyze three key points in generating word embeddings: the model construction, the training corpus, and the parameter design. We evaluate word embeddings with three types of tasks, and we argue that they cover the existing uses of word embeddings. Through theory and practical experiments, we present guidelines for how to generate a good word embedding. Second, for Chinese character and word representation, we introduce the joint training of Chinese characters and words. … Third, for document representation, we analyze the existing document representation models, including recursive neural networks, recurrent neural networks, and convolutional neural networks. We point out the drawbacks of these models and present our new model, the recurrent convolutional neural network. …
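The bag-of-words drawbacks mentioned above can be seen in a few lines. This is an illustrative sketch, not code from the thesis: two sentences with opposite meanings receive identical sparse count vectors because word order is discarded.

```python
# Minimal bag-of-words illustration: sparse counts, word order lost.
from collections import Counter

def bow_vector(tokens, vocab):
    """Map a token list to a count vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

s1 = "the dog bit the man".split()
s2 = "the man bit the dog".split()
vocab = sorted(set(s1) | set(s2))  # ['bit', 'dog', 'man', 'the']

v1 = bow_vector(s1, vocab)
v2 = bow_vector(s2, vocab)
assert v1 == v2  # different meanings, identical representation
```

Over a realistic vocabulary of tens of thousands of words, almost every entry of such a vector is zero, which is the data sparsity problem the thesis sets out to address with dense embeddings.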
📄 Content
Classification level: ____

Doctoral Dissertation

Research on Neural-Network-Based Semantic Vector Representations of Words and Documents

Author: Siwei Lai. Supervisor: Prof. Jun Zhao, Institute of Automation, Chinese Academy of Sciences. Degree: Doctor of Engineering. Major: Pattern Recognition and Intelligent Systems. Institution: Institute of Automation, Chinese Academy of Sciences. January 2016.

Word and Document Embeddings based on Neural Network Approaches. By Siwei Lai. A Dissertation Submitted to The University of Chinese Academy of Sciences, in partial fulfillment of the requirement for the degree of Doctor of Engineering. Institute of Automation, Chinese Academy of Sciences. January, 2016.

Declaration of Originality: I declare that the submitted dissertation presents research carried out by me under my supervisor's guidance and the results thereby obtained. To the best of my knowledge, except where specifically noted and acknowledged in the text, the dissertation contains no research results previously published or written by others. Any contribution to this research made by colleagues who worked with me has been explicitly acknowledged in the dissertation. Signature: ____ Date: ____

Statement on Authorization: I fully understand the regulations of the Institute of Automation, Chinese Academy of Sciences, on retaining and using dissertations: the Institute has the right to keep copies of the submitted dissertation and to allow it to be consulted and borrowed; it may publish all or part of the dissertation's content and may preserve the dissertation by photocopying, reduced-scale printing, or other means of reproduction. (Classified dissertations shall follow these regulations after declassification.) Signature: ____ Supervisor's signature: ____ Date: ____

摘要 (Abstract in Chinese, translated)

Data representation is fundamental in machine learning, and its quality directly affects the performance of the whole system. Under the traditional machine-learning paradigm, data were represented mainly through hand-crafted features; for a long time, performance gains in text, speech, and image tasks came from designing better features by hand. In recent years, with the rise of deep learning and representation learning, neural-network-based data representation has come to the fore across these fields.

In natural language processing, the most common semantic representation is the bag-of-words model, which suffers from data sparsity and cannot preserve word order. Complex features proposed in earlier work, such as part-of-speech tags and syntactic structures, usually bring improvements only on specific tasks. This thesis systematically reviews and analyzes semantic representation techniques for text at the word and document levels and proposes its own representation techniques, as follows.

1. Theoretical and experimental analysis of word embedding techniques. This part systematically compares existing word embedding models both theoretically and experimentally. On the theoretical side, the thesis explains the connections among existing models, compares them in terms of model structure and objective, and proves the relationship between the two most important models, Skip-gram and GloVe. On the experimental side, it analyzes the key techniques for training word embeddings from three angles: the model, the corpus, and the training parameters. Word embeddings are evaluated on eight benchmarks in three categories, which together cover the existing uses of word embeddings. This is the first work to evaluate word embeddings systematically; through theoretical and experimental comparison, it offers practical guidelines for generating good word embeddings.

2. Chinese representation based on joint character-word training, and its applications. Existing Chinese representation techniques usually follow the English approach and build text representations directly at the word level. Exploiting the characteristics of Chinese, this thesis proposes a representation based on joint training of characters and words. The method incorporates words into the context space of characters, using the semantic space of words to model Chinese characters better, and using the smoothing effect of characters to model words better. Character and word representations are evaluated on word segmentation, word similarity, and text classification tasks; experiments show that jointly trained character and word embeddings significantly outperform character or word embeddings trained separately.

3. Document representation based on recurrent convolutional networks, and its applications. This part analyzes existing document representation techniques: those based on recurrent networks, recursive networks, and convolutional networks. To address the shortcomings of these three approaches, the thesis proposes a document representation based on recurrent convolutional networks. The method overcomes the excessive complexity of recursive networks, the semantic bias of recurrent networks, and the difficulty of choosing window sizes in convolutional networks. The proposed representation is compared against existing techniques on text classification; experiments show that it achieves better performance.

Keywords: natural language processing, word embeddings, neural networks, representation learning, distributed representation

Abstract

Data representation is a fundamental task in machine learning. The representation of data affects the performance of the whole machine learning system.
For a long time, data representation was achieved through feature engineering: researchers aimed to design better features for specific tasks. Recently, the rapid development of deep learning and representation learning has brought new inspiration to various domains. In natural language processing, the most widely used feature representation is the bag-of-words model, which suffers from data sparsity and cannot preserve word order. Other features, such as part-of-speech tags or more complex syntactic features, usually benefit only specific tasks. This thesis focuses on word representation and document representation. We compare existing systems and present our new models. First, for generating word embeddings, we make comprehensive comparisons among existing word embedding models. In terms of theory, we establish the relationship between the two most important models, i.e., Skip-gram and GloVe. In our experiments, we analyze three key points in generating word embeddings: the model construction, the training corpus, and the parameter design. We evaluate word embeddings with three types of tasks, and we argue that they cover the existing uses of word embeddings. Through theory and practical experiments, we present guidelines for how to generate a good word embedding. Second, for Chinese character and word representation, we find that existing models usually apply word embedding models directly. We introduce the joint training of Chinese characters and words. This method incorporates context words into the representation space of a Chinese character, which leads to better representations of both Chinese characters and words. On Chinese word segmentation and document classification tasks, the joint training outperforms existing methods that train characters or words with traditional word embedding algorithms.
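To make the Skip-gram model discussed above concrete, here is a toy NumPy implementation with negative sampling. This is an illustrative sketch under simplifying assumptions, not the thesis's implementation; all dimensions, learning rates, and the unigram-free negative sampler are made up for the example.

```python
# Toy Skip-gram with negative sampling (illustrative sketch, not the thesis code).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_skipgram(corpus, dim=16, window=2, neg=3, lr=0.05, epochs=50, seed=0):
    """corpus: list of token lists. Returns a {word: embedding} dict."""
    rng = np.random.default_rng(seed)
    vocab = sorted({w for sent in corpus for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    W_in = rng.normal(0.0, 0.1, (V, dim))   # target ("input") vectors
    W_out = rng.normal(0.0, 0.1, (V, dim))  # context ("output") vectors
    for _ in range(epochs):
        for sent in corpus:
            ids = [idx[w] for w in sent]
            for pos, c in enumerate(ids):
                # context words within the window on both sides of position pos
                ctx = ids[max(0, pos - window):pos] + ids[pos + 1:pos + 1 + window]
                for o in ctx:
                    # one observed pair (label 1) plus `neg` sampled pairs (label 0);
                    # real implementations sample negatives from a unigram distribution
                    pairs = [(o, 1.0)] + [(int(rng.integers(V)), 0.0) for _ in range(neg)]
                    for t, label in pairs:
                        grad = sigmoid(W_in[c] @ W_out[t]) - label
                        g_in = grad * W_out[t].copy()  # cache before updating W_out
                        W_out[t] -= lr * grad * W_in[c]
                        W_in[c] -= lr * g_in
    return {w: W_in[idx[w]] for w in vocab}

emb = train_skipgram([["the", "cat", "sat"], ["the", "dog", "sat"]], dim=8, epochs=5)
```

The sketch also hints at the Skip-gram/GloVe connection analyzed in the thesis: both ultimately fit the dot products of target and context vectors to corpus co-occurrence statistics.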
Third, for document representation, we analyze the existing document representation models, including recursive neural networks, recurrent neural networks, and convolutional neural networks. We point out the drawbacks of these models and present our new model, the recurrent convolutional neural network. In the text classification task, the experimental results show that our model outperforms the existing models.

Keywords: Natural Language Processing, Word Embedding, Neural Network, Representation Learning, Distributional Representation

目录 (Contents)
- 摘要 (Chinese abstract), p. i
- Abstract, p. iii
- 目录 (Contents), p. v
- 术语与符号 (Terminology and Notation), p. xiii
- 0.1 术语 (Terminology) …
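The recurrent convolutional idea summarized in the abstract can be sketched as follows: each word is represented by its embedding together with recurrently computed left and right contexts, and max-pooling over time produces a fixed-size document vector. This is a minimal forward-pass sketch with untrained random parameters and made-up dimensions, not the thesis's trained model.

```python
# Forward-pass sketch of a recurrent convolutional text representation:
# recurrent left/right contexts + tanh projection + max-pooling over time.
import numpy as np

def rcnn_features(E, d_ctx=8, d_hidden=10, seed=0):
    """E: (seq_len, d_emb) word embeddings -> (d_hidden,) document vector."""
    T, d_emb = E.shape
    rng = np.random.default_rng(seed)
    # Simple RNN parameters for the left and right context scans (random here).
    W_l = rng.normal(0, 0.1, (d_ctx, d_ctx)); W_sl = rng.normal(0, 0.1, (d_ctx, d_emb))
    W_r = rng.normal(0, 0.1, (d_ctx, d_ctx)); W_sr = rng.normal(0, 0.1, (d_ctx, d_emb))
    W2 = rng.normal(0, 0.1, (d_hidden, d_ctx + d_emb + d_ctx))
    cl = np.zeros((T, d_ctx))  # left context: scan left-to-right
    cr = np.zeros((T, d_ctx))  # right context: scan right-to-left
    for t in range(1, T):
        cl[t] = np.tanh(W_l @ cl[t - 1] + W_sl @ E[t - 1])
    for t in range(T - 2, -1, -1):
        cr[t] = np.tanh(W_r @ cr[t + 1] + W_sr @ E[t + 1])
    # Per-word representation: [left context; embedding; right context] -> tanh.
    X = np.tanh(np.concatenate([cl, E, cr], axis=1) @ W2.T)  # (T, d_hidden)
    # Max-pooling over time gives a fixed-size vector regardless of length.
    return X.max(axis=0)

doc_vec = rcnn_features(np.random.default_rng(1).normal(size=(5, 6)), d_ctx=4, d_hidden=7)
```

Because the contexts are built recurrently rather than with fixed convolution windows, no window size needs to be chosen, which is the drawback of convolutional models the abstract highlights; a softmax classifier over the pooled vector would complete the text classifier.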