Unsupervised Document Embedding With CNNs


We propose a new model for unsupervised document embedding. Leading existing approaches either require complex inference or use recurrent neural networks (RNNs) that are difficult to parallelize. We take a different route and develop a convolutional neural network (CNN) embedding model. Our CNN architecture is fully parallelizable, resulting in an over 10x speedup in inference time over RNN models. The parallelizable architecture also enables training deeper models, where each successive layer has an increasingly larger receptive field and models longer-range semantic structure within the document. We additionally propose a fully unsupervised learning algorithm to train this model based on stochastic forward prediction. Empirical results on two public benchmarks show that our approach achieves accuracy comparable to the state of the art at a fraction of the computational cost.


💡 Research Summary

This paper introduces a novel convolutional neural network (CNN) architecture for unsupervised document embedding, addressing key limitations of existing approaches. The primary goal is to learn a function that maps variable-length text documents into fixed-length, dense vector representations that encapsulate their semantic content, without requiring any labeled data.

The authors identify two major drawbacks in prevailing methods: 1) Recurrent Neural Network (RNN) based models are inherently sequential, making them difficult to parallelize and slow to train and infer on modern hardware like GPUs. 2) Alternative models like Doc2Vec require iterative optimization during inference for each new document, which is computationally expensive for large-scale applications. Furthermore, RNNs with gating mechanisms (e.g., LSTM) can develop a bias towards later words in a sequence, potentially neglecting important information from the beginning of a document.

The proposed CNN-based model offers a fundamentally different and more efficient architecture. Its core advantage is full parallelizability; convolutional operations across the input sequence can be computed simultaneously. This leads to a reported speedup of over 10x in inference time compared to RNN baselines. The speed also enables the exploration of deeper networks. The architecture stacks multiple convolutional layers with Gated Linear Unit (GLU) activations. Each successive layer has a larger receptive field, allowing the model to capture longer-range semantic dependencies and document-level structure. GLUs help mitigate vanishing gradients and selectively control information flow.
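The paper does not include code, but the stacked-convolution-with-GLU idea can be sketched in a few lines of NumPy. The layer count, kernel width, and random toy weights below are illustrative assumptions, not the paper's configuration; the point is how GLU gating works and how the receptive field grows with depth.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_valid(x, W):
    """'Valid' 1-D convolution over time. x: (T, C_in), W: (k, C_in, C_out)."""
    k = W.shape[0]
    return np.stack([np.tensordot(x[t:t + k], W, axes=([0, 1], [0, 1]))
                     for t in range(x.shape[0] - k + 1)])

def glu(h):
    """Gated Linear Unit: split channels into (a, b), return a * sigmoid(b)."""
    a, b = np.split(h, 2, axis=-1)
    return a * (1.0 / (1.0 + np.exp(-b)))

T, emb, hidden, k, n_layers = 50, 32, 64, 3, 3   # toy sizes (assumptions)
x = rng.normal(size=(T, emb))                     # word embeddings for one document

# Each conv layer emits 2*hidden channels; the GLU halves them via gating.
weights = [rng.normal(scale=0.1, size=(k, emb, 2 * hidden))]
weights += [rng.normal(scale=0.1, size=(k, hidden, 2 * hidden))
            for _ in range(n_layers - 1)]

h = x
for W in weights:
    h = glu(conv1d_valid(h, W))

# Each layer of kernel width k widens the receptive field by k - 1,
# so deeper stacks see longer-range structure.
receptive_field = 1 + n_layers * (k - 1)
print(h.shape, receptive_field)  # (44, 64) 7
```

Because the positions of each convolution are independent, the inner `tensordot` calls can all run in parallel on a GPU, which is the source of the speedup over sequential RNN updates.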

To handle documents of varying lengths, the model employs an aggregating layer (e.g., max pooling or top-k pooling) after the final convolutional layer. This layer converts the variable-sized activation map into a fixed-length vector, elegantly solving the input size problem without resorting to arbitrary padding or truncation. This pooled representation is then passed through fully connected layers to produce the final document embedding.
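A minimal sketch of the aggregation step described above: max pooling and top-k pooling over the time axis both collapse an activation map of any length into a fixed-size vector. The channel count and k value here are arbitrary assumptions for illustration.

```python
import numpy as np

def max_pool(h):
    """h: (T, C) activation map -> (C,) fixed-length vector."""
    return h.max(axis=0)

def top_k_pool(h, k):
    """Keep the k largest activations per channel, concatenated -> (k*C,)."""
    top = np.sort(h, axis=0)[-k:]   # (k, C)
    return top.reshape(-1)

rng = np.random.default_rng(1)
for T in (10, 37, 200):             # documents of different lengths
    h = rng.normal(size=(T, 64))
    assert max_pool(h).shape == (64,)      # length-independent
    assert top_k_pool(h, 3).shape == (192,)

# h.argmax(axis=0) recovers which time step won each channel,
# which is what makes max pooling interpretable.
winners = h.argmax(axis=0)
print(winners.shape)  # (64,)
```

The `argmax` trace at the end is the mechanism behind the interpretability property mentioned later: each embedding dimension can be attributed to a specific input position, and hence to a specific word.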

For training, the authors devise a novel unsupervised learning algorithm based on “stochastic multiple word forward prediction.” The core learning signal comes from a document’s own word order. Given a document, the model randomly selects a position i. It uses the word subsequence from the start to position i as input to the CNN. The resulting embedding is then used to predict the next h words (from i+1 to i+h). Prediction is formulated as computing the dot product between the document embedding and the word vectors of the target words, passed through a sigmoid function to represent probability. The model parameters (CNN weights and word vectors) are jointly optimized to maximize the probability of the actual subsequent words, while minimizing it for randomly sampled negative words. This approach requires minimal preprocessing and creates a rich training signal by generating multiple subsequence samples from each document.
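The objective above can be sketched as a single negative-sampling loss computation. This is a simplified toy, not the paper's implementation: the `embed_prefix` mean-of-word-vectors encoder is a hypothetical stand-in for the actual CNN, and the vocabulary, window size h, and negative-sample count are made-up values.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

V, dim, h_window, n_neg = 1000, 64, 5, 10        # toy hyperparameters
word_vecs = rng.normal(scale=0.1, size=(V, dim))  # jointly trained word vectors
doc_ids = rng.integers(0, V, size=80)             # a toy "document" of word ids

def embed_prefix(word_ids):
    # Hypothetical stand-in for the CNN encoder from the previous section.
    return word_vecs[word_ids].mean(axis=0)

# Randomly pick a split point i; the prefix embedding must predict
# the next h words (stochastic multiple word forward prediction).
i = int(rng.integers(h_window, len(doc_ids) - h_window))
d = embed_prefix(doc_ids[:i])
targets = doc_ids[i:i + h_window]
negatives = rng.integers(0, V, size=n_neg)

# Dot product + sigmoid gives a probability; maximize it for true next
# words, minimize it for randomly sampled negatives.
loss = -np.sum(np.log(sigmoid(word_vecs[targets] @ d)))
loss -= np.sum(np.log(sigmoid(-word_vecs[negatives] @ d)))
print(float(loss))
```

In training, gradients of this loss would flow into both the encoder weights and the word vectors, and each document yields many (prefix, target-window) samples, which is the "rich training signal" the summary refers to.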

Empirical evaluation on standard public benchmarks for text classification and information retrieval tasks demonstrates that the proposed CNN embedding model achieves accuracy comparable to state-of-the-art RNN models. Crucially, it attains this performance at a fraction of the computational cost, highlighting its practical efficiency. The paper also notes that the max pooling operation provides a degree of interpretability, as one can trace which input words contributed most to the generated embedding. In summary, this work presents a fast, scalable, and accurate alternative for unsupervised document representation learning, effectively leveraging the parallel processing power of CNNs for an NLP task traditionally dominated by sequential models.
