Learning Multilingual Embeddings for Cross-Lingual Information Retrieval in the Presence of Topically Aligned Corpora
Introduction
Cross-lingual information retrieval, where multiple languages are used simultaneously in an information retrieval (IR) task, is an important area of research. The increasing amount of non-English data available on the Internet, and processed by many modern IR and natural language processing (NLP) tasks, has greatly magnified the importance of cross-lingual IR. In particular, we address the general ad-hoc retrieval task where the query may be in any of $`n`$ languages and retrieval may be from any of the remaining languages. In countries such as India, where multiple languages are used officially and regularly by a large number of computer-literate citizens, this task is particularly important and can be a game changer for many of the digital initiatives that governments across the world are actively promoting.
Such queries can be quite common. For example, during nation-wide events such as general elections, emergencies, or sporting events, queries like “How many seats has party X won in state Y?” are issued in several languages. The proposed system should be able to retrieve the answer from documents written in any language.
Most previous work on cross-lingual IR requires sentence-aligned parallel data and other language-specific resources such as dictionaries. Vulic et al. removed this extremely constraining requirement and learned bilingual word embeddings using only document-aligned comparable corpora. However, even such aligned corpora are not always readily available and require considerable effort to build. Resource-poor languages, such as the Indian languages, suffer particularly from this setback.
To this end, we present a multi-lingual setup in which we build a cross-lingual IR system that requires no such aligned corpora or language-specific resources. It automatically learns cross-lingual embeddings using merely TREC-style test collections. We also propose to build a multi-lingual embedding in the same setup. This eliminates the need to build embeddings for every collection pair in a cross-lingual retrieval paradigm, as well as the need to train bilingual embeddings for all possible language pairs. Instead, this single multi-lingual embedding enables automatic cross-lingual retrieval between any pair of languages. The proposed setup is particularly useful in online settings in multi-lingual countries such as India.
To the best of our knowledge, this is the first cross-lingual IR work that operates on more than $`2`$ languages directly and simultaneously, without targeting each pair separately.
Our proposed method yields considerable improvements over Vulic et al. in the bilingual setting on standard Indian language test collections. We further demonstrate the efficacy of our method by using a trilingual embedding.
Conclusions
In this paper, we have proposed a cross-lingual IR setup that works in the absence of aligned comparable corpora. Our method used a common embedding for all the languages and produced better performance than the closest state of the art. In the future, we would like to experiment with other embedding methods.
Experiments
| Collection | Language | #Docs | #Queries | Mean rel. docs per query |
|---|---|---|---|---|
| FIRE 2010 | English | 125,586 | 50 | 13.06 |
| | Hindi | 149,482 | 50 | 18.30 |
| | Bangla | 123,047 | 50 | 10.02 |
| FIRE 2011 | English | 89,286 | 50 | 55.22 |
| | Hindi | 331,599 | 50 | 57.70 |
| | Bangla | 377,104 | 50 | 55.56 |
| FIRE 2012 | English | 89,286 | 50 | 70.78 |
| | Hindi | 331,599 | 50 | 46.18 |
| | Bangla | 377,111 | 50 | 51.62 |
Statistics of the FIRE test collections (Table 1).
Setup
Datasets: We use the FIRE (http://fire.irsi.res.in/fire/static/data) cross-lingual datasets in English, Hindi and Bangla (details given in Table 1). The documents were collected from the following newspapers: The Telegraph (http://www.telegraphindia.com) for English, Dainik Jagran (http://www.jagran.com) and Amar Ujala (http://www.amarujala.com) for Hindi, and Anandabazar Patrika (http://www.anandabazar.com) for Bangla. The query sets were created such that queries with the same identifier are translations of each other. For each language and collection, we randomly choose $`10`$ queries for testing; the rest are used for training in a 5-fold cross-validation manner.
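The train/test split described above can be sketched as follows; the random seed and the exact fold-partitioning scheme are not specified in the paper, so both are illustrative assumptions:

```python
import random

def split_queries(query_ids, n_test=10, n_folds=5, seed=0):
    """Hold out n_test queries for testing and partition the remainder
    into n_folds cross-validation folds (illustrative sketch)."""
    rng = random.Random(seed)
    ids = list(query_ids)
    rng.shuffle(ids)
    test, train = ids[:n_test], ids[n_test:]
    folds = [train[i::n_folds] for i in range(n_folds)]
    return test, folds

# 50 queries per collection, as in the FIRE datasets above.
test_ids, cv_folds = split_queries(range(50))
```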
For retrieval, only the title field of the queries was used, and stopwords were removed. We use the default Dirichlet language model implemented in the Terrier IR toolkit (http://terrier.org/) for all our retrieval experiments.
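The Dirichlet-smoothed query-likelihood scoring behind Terrier's DirichletLM model can be sketched as below; this is a minimal illustration, not Terrier's actual implementation, and mu = 2500 is assumed to mirror Terrier's default smoothing parameter:

```python
import math
from collections import Counter

def dirichlet_lm_score(query_terms, doc_terms, collection_tf, collection_len, mu=2500):
    """log p(q|d) = sum_w log( (tf(w,d) + mu * p(w|C)) / (|d| + mu) )."""
    tf = Counter(doc_terms)
    dlen = len(doc_terms)
    score = 0.0
    for w in query_terms:
        p_wc = collection_tf.get(w, 0) / collection_len
        if p_wc == 0:
            continue  # term unseen in the whole collection: contributes nothing here
        score += math.log((tf[w] + mu * p_wc) / (dlen + mu))
    return score
```

Documents are then ranked per query by this score in decreasing order.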
Baseline: We compare our method for cross-lingual IR with bilingual embeddings against that of Vulic et al. The shuffling code used was obtained from the authors.
Mono-lingual: In the monolingual setup, the actual target-language queries are used for retrieval on the target collection. The resulting performance sets the upper bound achievable by a multi-lingual setup.
Training
The Gensim implementation of Word2Vec (https://radimrehurek.com/gensim/models/word2vec.html) was used. The skip-gram model was trained with the following parameters: (i) vector dimensionality: 100, (ii) learning rate: 0.01, (iii) minimum word count: 1. The context window size was varied from 5 to 50 in steps of 5. For the bilingual embedding, a window size of 25 produced the best results on the training set and was subsequently used on the test queries, while for the trilingual embedding the best window size was 50. The parameters $`\kappa`$ and $`\tau`$ were tuned on the training set over the values {5, 10, 15, 20} and {5, 10, 15} respectively.
Results
To assess quality, we report the Mean Average Precision (MAP), R-Precision (R-Prec) and Binary Preference (BPref).
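Per query, two of these measures can be computed as sketched below (a minimal illustration; MAP is the mean of average precision over all queries, and BPref is omitted for brevity):

```python
def average_precision(ranked_ids, relevant):
    """Average precision for one query: mean of precision@k over ranks
    at which a relevant document is retrieved."""
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def r_precision(ranked_ids, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return sum(1 for d in ranked_ids[:r] if d in relevant) / r if r else 0.0
```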
We report our retrieval results in Table [tab:bi] and Table [tab:tri]. We uniformly use the cross-lingual retrieval convention source language $`\to`$ target language. For example, B$`\to`$E indicates that Bangla is the source language while English is the target language.
Bilingual Embeddings: Table [tab:bi] shows the results for bilingual retrieval, i.e., when the embedding space is built using only 2 languages. For all language pairs, the proposed method significantly outperforms Vulic et al.; the differences are statistically significant at the 5% level ($`p < 0.05`$, Wilcoxon signed-rank test). We also report the Monolingual results, which require no cross-lingual IR, as an upper bound on performance. Interestingly, our proposed method produces comparable MAP for H$`\to`$B (FIRE 2010). It exhibits better BPref than Monolingual B$`\to`$B for H$`\to`$B (FIRE 2010), and the difference is statistically significant at the 5% level by the Wilcoxon signed-rank test. It is also comparable with Monolingual H$`\to`$H for B$`\to`$H (FIRE 2010), with Monolingual B$`\to`$B for E$`\to`$B and H$`\to`$B (FIRE 2011), and with Monolingual E$`\to`$E for H$`\to`$E (FIRE 2012). In terms of R-Prec, H$`\to`$B (FIRE 2010) is slightly better than Monolingual B$`\to`$B. This shows that the proposed method is competitive even against a strong baseline like Monolingual.
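The significance testing used above can be sketched with SciPy's Wilcoxon signed-rank test; the per-query scores below are hypothetical, for illustration only:

```python
from scipy.stats import wilcoxon

# Hypothetical per-query AP scores for the two systems (not the paper's data).
proposed = [0.31, 0.22, 0.45, 0.18, 0.40, 0.27, 0.35, 0.29, 0.33, 0.21]
baseline = [0.00] * 10   # the baseline scores near zero on these runs

stat, p = wilcoxon(proposed, baseline)
significant = p < 0.05   # reject the null at the 5% significance level
```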
Time requirements: A comparison of time requirements with Vulic et al. is reported in Table 2. Our pre-retrieval time comprises the indexing time using Terrier (http://terrier.org) and the cross-lingual query generation time. The pre-retrieval time for Vulic is the time taken to create the document vectors for all the documents in a corpus. Our retrieval time is the time taken by Terrier to produce the ranked list for the test queries only. The retrieval time for Vulic comprises computing the cosine score between the query vector of each test query and all the documents in the collection, followed by sorting the whole collection in decreasing order of this score for each query. Our proposed method clearly outperforms Vulic in terms of time requirements.
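The baseline's retrieval step described above, scoring every document by cosine similarity and then sorting, can be sketched as:

```python
import numpy as np

def rank_by_cosine(query_vec, doc_matrix):
    """Rank all documents by cosine similarity to the query vector;
    returns document indices, best first (sketch of the baseline's step)."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return np.argsort(-(d @ q))
```

Because this touches every document vector per query, it is far slower than answering the generated query from an inverted index, which explains the gap in Table 2.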
Trilingual Embeddings: We report the retrieval performance of the trilingual setting in Table [tab:tri]. We do not compare with Vulic et al. further, since our superiority in the bilingual setting has already been established. For FIRE 2010, our proposed method is superior in both MAP and BPref to Monolingual B$`\to`$B for both E$`\to`$B and H$`\to`$B, and the differences are statistically significant at the 5% level by the Wilcoxon signed-rank test. In R-Prec, for FIRE 2010, E$`\to`$B is considerably better than Monolingual B$`\to`$B ($`p < 0.05`$, Wilcoxon signed-rank test). For FIRE 2011, our proposed method achieves better BPref than Monolingual B$`\to`$B for E$`\to`$B and than Monolingual H$`\to`$H for E$`\to`$H. This shows that our proposed method maintains its performance relative to Monolingual even in a trilingual setting.
| Method | Language | Pre-retrieval time | Retrieval time |
|---|---|---|---|
| Proposed | English | 175.67s | 5.51s |
| Vulic | English | 10559.37s | 13.37s |
| Proposed | Hindi | 2066.27s | 0.92s |
| Vulic | Hindi | 26206.04s | 36.19s |
| Proposed | Bangla | 1627.92s | 3.60s |
| Vulic | Bangla | 15220.24s | 118.86s |
Time requirements, averaged over three datasets.
| Method | Retrieval | FIRE 2010 | | | FIRE 2011 | | | FIRE 2012 | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | | MAP | R-Prec | BPref | MAP | R-Prec | BPref | MAP | R-Prec | BPref |
| Monolingual | E→E | 0.4256 | 0.4044 | 0.3785 | 0.2836 | 0.3098 | 0.3528 | 0.4868 | 0.4785 | 0.4507 |
| Proposed | B→E | 0.1761 | 0.2041 | 0.2297 | 0.1148 | 0.1164 | 0.2204 | 0.2890 | 0.2899 | 0.3418 |
| | H→E | 0.2991 | 0.2548 | 0.2787 | 0.1461 | 0.1810 | 0.2548 | 0.3565 | 0.3861 | 0.4424 |
| Vulic | B→E | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| | H→E | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0009 | 0.0000 | 0.0000 | 0.0006 |
| Monolingual | B→B | 0.3354 | 0.2951 | 0.2593 | 0.2127 | 0.2677 | 0.2164 | 0.3093 | 0.3188 | 0.3203 |
| Proposed | E→B | 0.1964 | 0.2429 | 0.2017 | 0.1302 | 0.1797 | 0.2105 | 0.2114 | 0.2409 | 0.2522 |
| | H→B | 0.3108 | 0.3044 | 0.3185 | 0.1098 | 0.1410 | 0.2089 | 0.2058 | 0.2202 | 0.2383 |
| Vulic | E→B | 0.0000 | 0.0000 | 0.0016 | 0.0000 | 0.0000 | 0.0032 | 0.0000 | 0.0000 | 0.0032 |
| | H→B | 0.0001 | 0.0000 | 0.0206 | 0.0000 | 0.0000 | 0.0008 | 0.0000 | 0.0000 | 0.0008 |
| Monolingual | H→H | 0.3169 | 0.2872 | 0.2691 | 0.2408 | 0.2756 | 0.2637 | 0.4221 | 0.4407 | 0.4226 |
| Proposed | E→H | 0.1497 | 0.1663 | 0.1681 | 0.1526 | 0.1806 | 0.2038 | 0.3094 | 0.3093 | 0.3325 |
| | B→H | 0.1791 | 0.2113 | 0.2530 | 0.1244 | 0.1568 | 0.1768 | 0.2751 | 0.2925 | 0.3398 |
| Vulic | E→H | 0.0000 | 0.0000 | 0.0030 | 0.0000 | 0.0000 | 0.0089 | 0.0000 | 0.0000 | 0.0000 |
| | B→H | 0.0000 | 0.0000 | 0.0080 | 0.0000 | 0.0000 | 0.0012 | 0.0000 | 0.0000 | 0.0000 |
Bilingual retrieval results [tab:bi].
| Method | Retrieval | FIRE 2010 | | | FIRE 2011 | | | FIRE 2012 | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | | MAP | R-Prec | BPref | MAP | R-Prec | BPref | MAP | R-Prec | BPref |
| Monolingual | E→E | 0.4256 | 0.4044 | 0.3785 | 0.2836 | 0.3098 | 0.3528 | 0.4868 | 0.4785 | 0.4507 |
| Proposed | B→E | 0.2096 | 0.2528 | 0.2243 | 0.1434 | 0.1873 | 0.2497 | 0.3632 | 0.3830 | 0.4370 |
| | H→E | 0.3039 | 0.2981 | 0.2973 | 0.1762 | 0.1887 | 0.2387 | 0.3632 | 0.3854 | 0.4426 |
| Monolingual | B→B | 0.3354 | 0.2951 | 0.2593 | 0.2127 | 0.2677 | 0.2164 | 0.3093 | 0.3188 | 0.3203 |
| Proposed | E→B | 0.3950 | 0.3651 | 0.3489 | 0.1843 | 0.2062 | 0.2324 | 0.2960 | 0.3198 | 0.3120 |
| | H→B | 0.3558 | 0.2945 | 0.3237 | 0.1566 | 0.1954 | 0.2127 | 0.1941 | 0.2094 | 0.2314 |
| Monolingual | H→H | 0.3169 | 0.2872 | 0.2691 | 0.2408 | 0.2756 | 0.2637 | 0.4221 | 0.4407 | 0.4226 |
| Proposed | E→H | 0.1759 | 0.2139 | 0.2060 | 0.2259 | 0.2178 | 0.2882 | 0.2774 | 0.2614 | 0.2702 |
| | B→H | 0.1377 | 0.1570 | 0.1833 | 0.1484 | 0.1827 | 0.2160 | 0.2402 | 0.2494 | 0.3082 |
Trilingual retrieval results [tab:tri].
Analysis
Figure 1 shows some example queries generated by our proposed method using the bilingual and trilingual embeddings. For the query surrender of sanjay dutt (on the conviction of Bollywood actor Sanjay Dutt in relation to the 1993 terrorist attacks in Bombay), the generated query contains important words such as sanjay, dutt, salem (Abu Salem, a terrorist), ak (AK-47, a firearm), and munnabhai (a popular screen name of Sanjay Dutt). The generated query for cervical cancer awareness treatment vaccine contains useful terms like cervical, hpv (human papillomavirus), infection, pregnant, and silvia (Silvia de Sanjose, a leading researcher in cancer epidemiology). The generated query for death of Yasser Arafat contains the terms arafat, yasser, ramallah (the headquarters of Yasser Arafat), palestine, suha (Suha Arafat, Yasser Arafat’s wife) and plo (Palestine Liberation Organization). The generated query for taj mahal controversy (regarding whether the Taj Mahal is a Waqf property, as claimed by the Uttar Pradesh Sunni Wakf Board, and subsequent statements by the Archaeological Survey of India) contains vital terms such as taj, mahal, wakf, archaeological and sunni. These examples clearly demonstrate the effectiveness of our target-query generation.