Learning Multilingual Word Representations using a Bag-of-Words Autoencoder
Recent work on learning multilingual word representations usually relies on word-level alignments (e.g., inferred with the help of GIZA++) between translated sentences in order to align the word embeddings across languages. In this workshop paper, we investigate an autoencoder model that learns multilingual word representations without such word-level alignments. The autoencoder is trained to reconstruct the bag-of-words representation of a given sentence from an encoded representation extracted from its translation. We evaluate our approach on a multilingual document classification task, where labeled data is available only for one language (e.g. English) while classification must be performed in a different language (e.g. French). In our experiments, we observe that our method compares favorably with a previously proposed method that exploits word-level alignments to learn word representations.
💡 Research Summary
The paper introduces a multilingual word‑embedding learning method that completely avoids word‑level alignments such as those produced by GIZA++. Instead, it relies only on sentence‑level parallel corpora. The core of the approach is a bag‑of‑words (BoW) autoencoder. Each sentence is represented as a BoW vector; the encoder aggregates the embeddings of the words present in the sentence (by simple summation) to obtain a fixed‑dimensional sentence representation φ. A decoder then tries to reconstruct the BoW of the other language from this representation.
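The BoW encoder described above can be sketched in a few lines; note that the variable names (`W_x`, `c`, `embed_dim`) are illustrative choices, not taken from the paper's code:

```python
import numpy as np

# Minimal sketch of the bag-of-words encoder: the sentence representation phi
# is the sum of the embeddings of the words it contains, plus a shared bias.
rng = np.random.default_rng(0)

vocab_size, embed_dim = 10, 4
W_x = rng.normal(scale=0.1, size=(vocab_size, embed_dim))  # embedding matrix for language X
c = np.zeros(embed_dim)                                    # shared encoder bias

def encode(word_ids, W, c):
    """Sum the embeddings of the words in the sentence (BoW encoding)."""
    return W[word_ids].sum(axis=0) + c

phi = encode([1, 3, 3, 7], W_x, c)  # a repeated word contributes once per occurrence
print(phi.shape)  # (4,)
```

Because summation ignores word order, two sentences with the same multiset of words get the same representation, which is exactly the bag-of-words assumption the paper makes.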
To keep the reconstruction tractable for large vocabularies, the authors adopt a binary tree decomposition of the output distribution. Every word occupies a leaf in a full binary tree; the probability of a word is the product of left/right branching probabilities along the path from the root to that leaf. Each branching probability is modeled by a logistic regression (sigmoid) that takes a non‑linear transformation of φ as input. Because the tree depth is O(log V), the decoder needs only O(|sentence| log V) outputs, far fewer than the O(V) required by a naïve softmax.
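The tree decomposition can be illustrated concretely. The sketch below builds a fixed full binary tree over a toy 8-word vocabulary and computes each word's probability as the product of sigmoid branch probabilities along its root-to-leaf path; the parameter names (`V`, `b`) and the bit-based leaf assignment are assumptions for the illustration, not the paper's exact construction:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
embed_dim, vocab_size = 4, 8
depth = int(np.log2(vocab_size))                  # 3 branching decisions per word
V = rng.normal(size=(vocab_size - 1, embed_dim))  # one logistic unit per internal node
b = np.zeros(vocab_size - 1)

def word_log_prob(word_id, h_phi):
    """log p(word | phi) as a sum of log branch probabilities along the path."""
    node, logp = 0, 0.0
    for d in reversed(range(depth)):
        go_right = (word_id >> d) & 1             # bit d of the word id picks the branch
        p_right = sigmoid(V[node] @ h_phi + b[node])
        logp += np.log(p_right if go_right else 1.0 - p_right)
        node = 2 * node + 1 + go_right            # heap-style child index
    return logp

h_phi = rng.normal(size=embed_dim)                # non-linear transform of phi
probs = np.exp([word_log_prob(w, h_phi) for w in range(vocab_size)])
print(round(probs.sum(), 6))  # 1.0: the tree defines a valid distribution over words
```

Since the two branches at every internal node sum to one, the leaf probabilities sum to one by construction, so no expensive normalization over the full vocabulary is needed.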
For the multilingual extension, two separate embedding matrices (Wₓ for language X and Wᵧ for language Y) and two language‑specific decoder parameter sets (bₓ, Vₓ and bᵧ, Vᵧ) are introduced, while the encoder bias c and the non‑linear transform h are shared. Training on a parallel sentence pair (x, y) simultaneously optimizes four reconstruction tasks: (i) reconstruct y from x, (ii) reconstruct x from y, (iii) reconstruct x from itself, and (iv) reconstruct y from itself. The last two tasks can be performed on monolingual data, opening the door to leveraging abundant unlabeled corpora.
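The four-way objective can be sketched as follows. For brevity this toy version substitutes a linear decoder with squared-error reconstruction for the paper's tree-structured decoder; all names (`W_x`, `D_x`, etc.) and sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
Vx, Vy, dim = 6, 5, 3
W_x, W_y = rng.normal(size=(Vx, dim)), rng.normal(size=(Vy, dim))  # per-language embeddings
D_x, D_y = rng.normal(size=(dim, Vx)), rng.normal(size=(dim, Vy))  # per-language decoders
c = np.zeros(dim)                                                  # shared encoder bias

def encode(bow, W):
    return np.tanh(bow @ W + c)       # shared non-linear transform h

def recon_loss(bow_src, W_src, D_tgt, bow_tgt):
    """Encode the source BoW, decode toward the target BoW, measure the error."""
    return np.mean((encode(bow_src, W_src) @ D_tgt - bow_tgt) ** 2)

x = np.array([1., 0., 2., 0., 1., 0.])   # BoW counts of an X-language sentence
y = np.array([0., 1., 1., 0., 1.])       # BoW counts of its Y-language translation

loss = (recon_loss(x, W_x, D_y, y)    # (i)   y from x
        + recon_loss(y, W_y, D_x, x)  # (ii)  x from y
        + recon_loss(x, W_x, D_x, x)  # (iii) x from itself (monolingual)
        + recon_loss(y, W_y, D_y, y)) # (iv)  y from itself (monolingual)
```

Terms (iii) and (iv) need no translation of the sentence, which is why they can also be computed on monolingual data.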
The authors evaluate the quality of the learned embeddings on a cross‑lingual document classification task. They first train the multilingual autoencoder on the English‑French and English‑German portions of Europarl‑v7 (≈2 M sentence pairs). Then they use the Reuters RCV1/RCV2 news corpus, which provides documents in English, French and German, each labeled with one of four top‑level categories. Documents are represented as TF‑IDF weighted BoW vectors; each vector is multiplied by the corresponding language’s embedding matrix to obtain a document‑level representation. A linear SVM is trained on English documents and directly applied to French or German test documents, using the respective embeddings.
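The transfer setup hinges on one detail: a document's TF-IDF vector multiplied by its language's embedding matrix lands in the shared space, so a classifier trained on English applies directly to French. A minimal sketch, using a trivial nearest-centroid rule in place of the paper's linear SVM (all matrices and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V_en, V_fr, dim = 8, 7, 4
W_en, W_fr = rng.normal(size=(V_en, dim)), rng.normal(size=(V_fr, dim))

def doc_embed(tfidf_bow, W):
    """Project a TF-IDF weighted BoW vector into the shared embedding space."""
    return tfidf_bow @ W

# Toy "training": one labeled English document per class defines a class centroid.
en_docs = np.abs(rng.normal(size=(2, V_en)))   # stand-ins for TF-IDF vectors
centroids = doc_embed(en_docs, W_en)

# An unlabeled French document is classified directly in the shared space.
fr_doc = np.abs(rng.normal(size=V_fr))
z = doc_embed(fr_doc, W_fr)
pred = int(np.argmin(np.linalg.norm(centroids - z, axis=1)))
```

No French labels are used anywhere; the cross-lingual alignment of the embedding matrices does all the work.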
Results are compared against the method of Klementiev et al. (2012), which learns bilingual embeddings by jointly training neural language models with a regularizer that forces aligned words (obtained from word‑level alignments) to have similar vectors. Despite not using any word‑level alignment, the proposed autoencoder achieves comparable or better error rates: for English→French, Klementiev reports 34.9 % (train) → 49.2 % (test) whereas the autoencoder yields 27.7 % → 32.4 %; for English→German, Klementiev’s numbers are 42.7 % → 59.5 % versus 29.8 % → 37.7 % for the autoencoder.
Qualitative analysis with t‑SNE visualizations shows that words with similar meanings in the two languages cluster together in the shared embedding space, and nearest‑neighbor queries confirm that French words such as “roi” (king) are close to their English counterparts.
The paper acknowledges limitations: the random assignment of words to tree leaves may be sub‑optimal, and the BoW representation discards word order and syntactic information. Future work is suggested to extend the model to bags‑of‑ngrams or phrase representations, and to explore more informed tree constructions (e.g., frequency‑ or semantics‑based clustering).
In summary, this work demonstrates that multilingual word embeddings can be learned efficiently without any word‑level alignment, using a tree‑based BoW autoencoder, and that the resulting embeddings are sufficiently aligned to enable successful cross‑lingual transfer in document classification.