Texts in, meaning out: neural language models in semantic similarity task for Russian
Distributed vector representations of natural language vocabulary receive a lot of attention in contemporary computational linguistics. This paper summarizes our experience of applying neural network language models to the task of calculating semantic similarity for Russian. The experiments were performed in the course of the Russian Semantic Similarity Evaluation shared task, where our models took from the 2nd to the 5th position, depending on the subtask. We introduce the tools and corpora used, comment on the nature of the shared task, and describe the achieved results. We found that the Continuous Skip-gram and Continuous Bag-of-Words models, previously applied successfully to English material, can be used for semantic modeling of Russian as well. Moreover, we show that texts in the Russian National Corpus (RNC) provide excellent training material for such models, outperforming other, much larger corpora. This is especially true for semantic relatedness tasks (although stacking models trained on larger corpora on top of RNC models improves performance even more). High-quality semantic vectors learned in this way can be used in a variety of linguistic tasks and promise an exciting field for further study.
💡 Research Summary
The paper investigates the applicability of neural language models, specifically the continuous Skip‑gram and Continuous Bag‑of‑Words (CBOW) architectures, for measuring semantic similarity in Russian. While these models have become standard for English and many other languages, Russian poses additional challenges due to its rich inflectional morphology and complex word formation. The authors therefore set out to answer two primary questions: (1) can the same Skip‑gram/CBOW models that work well for English be directly transferred to Russian, and (2) how do the size and quality of training corpora affect performance on semantic similarity tasks?
To address these questions, the authors assembled several corpora. The main training resource is the Russian National Corpus (RNC), a balanced, genre‑diverse collection of roughly 300 million tokens that includes literature, newspapers, academic texts, and spoken language. In addition, they gathered much larger but less curated corpora: a Russian Wikipedia dump, a news archive, and a web‑crawled dataset amounting to several hundred million tokens. All models were trained with identical hyper‑parameters (300‑dimensional vectors, window size 5, minimum word frequency 5, 5 negative samples, initial learning rate 0.025, five epochs) using the original Word2Vec implementation.
Evaluation was performed within the Russian Semantic Similarity Evaluation track, which provides two distinct subtasks. The first subtask, Word‑Similarity, consists of 353 Russian word pairs with human‑rated similarity scores. The second subtask, Sentence‑Relatedness, contains about 500 sentence pairs also annotated by humans. Model predictions are compared to the human scores using Pearson’s correlation coefficient and Spearman’s rank correlation coefficient, the standard metrics for this shared task.
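Both evaluation metrics are available in scipy. The sketch below computes them for a handful of invented human ratings and model similarity scores (toy values, not the shared-task data): Pearson's r measures linear agreement with the raw scores, while Spearman's rho compares only the rankings the two sources induce.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical human similarity judgements for five word pairs (0..1 scale).
human = np.array([0.95, 0.80, 0.35, 0.10, 0.60])
# Illustrative model cosine similarities for the same five pairs.
model = np.array([0.90, 0.70, 0.40, 0.05, 0.55])

r, _ = pearsonr(human, model)      # linear correlation of raw scores
rho, _ = spearmanr(human, model)   # rank correlation

# Here the model ranks the pairs exactly as the humans do, so rho is 1.0,
# while r stays slightly below 1 because the raw values differ.
print(round(r, 3), rho)
```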
Results show that models trained on the RNC consistently outperform those trained on the much larger but noisier corpora. For the Word‑Similarity task, the RNC‑based Skip‑gram achieved a Pearson correlation of 0.71 and a Spearman of 0.68, compared with 0.66/0.63 for the Wikipedia‑trained model. In the Sentence‑Relatedness task, the RNC‑based CBOW obtained 0.62 (Pearson) and 0.59 (Spearman), again surpassing the large‑scale counterparts. These findings indicate that corpus quality—balanced genre representation, editorial oversight, and linguistic diversity—has a stronger impact on semantic vector quality than sheer token count.
The authors also explored a simple stacking (ensemble) technique: vectors from an RNC‑trained model were concatenated with vectors from a large‑scale model, and a linear regression layer was trained to map the combined representation to similarity scores. This approach yielded modest but consistent gains, especially on the sentence‑level task (Pearson 0.64, Spearman 0.61), demonstrating that complementary information from different corpora can be leveraged effectively.
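A minimal sketch of this stacking step, under synthetic data: per-pair features derived from two different models are concatenated and a linear map to the gold scores is fit by ordinary least squares. The one-dimensional per-model features and the synthetic targets are assumptions for illustration, not the paper's actual feature construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for per-pair features from the two models, e.g. a similarity
# score from the RNC-trained model and one from the large web-corpus model.
n_pairs = 200
feats_rnc = rng.normal(size=(n_pairs, 1))
feats_web = rng.normal(size=(n_pairs, 1))

# The "stacking" step: concatenate the two representations per pair.
X = np.hstack([feats_rnc, feats_web])

# Synthetic gold similarity scores: mostly driven by the RNC feature,
# with a smaller contribution from the web-corpus feature plus noise.
y = 0.7 * feats_rnc[:, 0] + 0.3 * feats_web[:, 0] \
    + rng.normal(scale=0.05, size=n_pairs)

# Ordinary least squares with an intercept column: learn the linear map
# from the combined representation to the human scores.
A = np.hstack([X, np.ones((n_pairs, 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ w

print(w[:2])  # learned weights, recovering roughly [0.7, 0.3]
```

The fitted weights show how much each corpus's representation contributes to the final prediction, which is exactly the complementarity the ensemble exploits.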
An important discussion point concerns morphological preprocessing. Russian’s extensive inflection could suggest the need for explicit lemmatization or sub‑word segmentation before training. However, the experiments deliberately omitted any morphological analysis, relying on raw tokenization. The strong performance of the RNC models suggests that a sufficiently large, high‑quality corpus implicitly captures morphological variation, reducing the necessity for costly preprocessing steps in many practical scenarios.
In conclusion, the paper makes three key contributions. First, it empirically validates that the Skip‑gram and CBOW architectures, originally devised for English, work equally well for Russian semantic modeling. Second, it demonstrates that a well‑curated, moderate‑size corpus such as the RNC can outperform vastly larger but noisier datasets, highlighting the primacy of data quality. Third, it shows that simple model stacking can further improve performance, opening a path for hybrid systems that combine the strengths of multiple corpora.
These results have immediate practical implications for a range of Russian NLP applications, including semantic search, text classification, machine translation, and conversational agents. Future work could extend the analysis to sub‑word models (FastText, Byte‑Pair Encoding), contextual embeddings (BERT, ELMo), and domain‑adapted transfer learning, as well as a systematic comparison with morphology‑aware embeddings to further refine Russian semantic representation.