- Title: Learning Multilingual Embeddings for Cross-Lingual Information Retrieval in the Presence of Topically Aligned Corpora
- ArXiv ID: 1804.04475
- Date: 2018-04-13
- Authors: Mitodru Niyogi, Kripabandhu Ghosh, Arnab Bhattacharya
📝 Abstract
Cross-lingual information retrieval is a challenging task in the absence of aligned parallel corpora. In this paper, we address this problem by considering topically aligned corpora designed for evaluating an IR setup. To emphasize, we use neither sentence-aligned nor document-aligned corpora, nor any language-specific resources such as a dictionary, thesaurus, or grammar rules. Instead, we use an embedding into a common space and learn word correspondences directly from there. We test our proposed approach for bilingual IR on standard FIRE datasets for Bangla, Hindi and English. The proposed method is superior to the state-of-the-art method not only on IR evaluation measures but also in terms of time requirements. We extend our method successfully to the trilingual setting.
💡 Summary & Analysis
This paper proposes a method to perform effective cross-lingual information retrieval (IR) without the need for parallel corpora. It introduces an approach that uses embeddings in a common space to learn word correspondences across languages, thereby enabling better IR performance. The problem addressed is the challenge of performing IR tasks when parallel corpora are not available or limited, which particularly affects resource-poor languages like those used in India. The solution involves training multilingual embeddings directly from comparable corpora without relying on parallel texts, dictionaries, or language-specific resources. This method has been tested and shown to outperform state-of-the-art methods both in terms of IR evaluation metrics and time efficiency across bilingual and trilingual setups using standard datasets like FIRE. This work is significant as it provides a way to enhance cross-lingual information retrieval without the need for extensive parallel data, making it especially useful for resource-constrained languages.
📄 Full Paper Content (ArXiv Source)
# Introduction
Cross-lingual information retrieval, where multiple languages are used
simultaneously in an information retrieval (IR) task, is an important
area of research. The increasing amount of non-English data available
through the Internet and processed by several modern age IR/NLP (natural
language processing) tasks has magnified the importance of cross-lingual
IR manifold. In particular, we address the general ad-hoc information
retrieval task where the query is in any of the $`n`$ languages, and
retrieval can be from any of the remaining languages. In countries such
as India, where multiple languages are used officially and regularly by a
large number of computer-literate citizens, the above task is
particularly important, and can be a game changer for many of the
digital initiatives that governments across the world are actively
promoting.
Such queries can be quite common. For example, during nation-wide
events such as general elections, emergencies, or sports events,
queries like “How many seats has party X won in state Y?” will be
issued in several languages. The proposed system should be able to
retrieve the answer from documents written in any language.
Most of the previous work on cross-lingual IR requires sentence-aligned
parallel data and other language-specific resources such as
dictionaries. Vulic et al. removed this extremely constraining
requirement and learned bilingual word embeddings using only
document-aligned comparable corpora. However, such aligned corpora are
not always readily available and need considerable effort to build.
Resource-poor languages such as the Indian languages specifically suffer
from this setback.
To this end, we present a multi-lingual setup where we build a
cross-lingual IR system that requires no such aligned corpora or
language specific resources. It automatically learns cross-lingual
embeddings using merely TREC-style test collections. We also propose to
build a single multi-lingual embedding in the same setup. This eliminates
the need to build embeddings for collection pairs in a cross-lingual
retrieval paradigm, i.e., to train a bilingual embedding for every
possible language pair. Instead, the single multi-lingual embedding
enables automatic cross-lingual retrieval between any pair of
languages. The proposed setup is particularly useful in online
settings in multi-lingual countries such as India.
To the best of our knowledge, this is the first cross-lingual IR work
that works on more than $`2`$ languages directly and simultaneously
without targeting each pair separately.
Our proposed method yields considerable improvements over Vulic et al.
in the bilingual setting on standard Indian language test collections.
We further demonstrate the efficacy of our method by using a trilingual
embedding.
# Conclusions
In this paper, we have proposed a cross-lingual IR setup in the absence
of aligned comparable corpora. Our method used a common embedding for
all the languages and produced better performance than the closest
state-of-the-art. In the future, we would like to experiment with other
embedding methods.
# Experiments

| Dataset   | Language | #Docs    | #Queries | Mean rel. docs per query |
|-----------|----------|----------|----------|--------------------------|
| FIRE 2010 | English  | 1,25,586 | 50       | 13.06                    |
| FIRE 2010 | Hindi    | 1,49,482 | 50       | 18.30                    |
| FIRE 2010 | Bangla   | 1,23,047 | 50       | 10.02                    |
| FIRE 2011 | English  | 89,286   | 50       | 55.22                    |
| FIRE 2011 | Hindi    | 3,31,599 | 50       | 57.70                    |
| FIRE 2011 | Bangla   | 3,77,104 | 50       | 55.56                    |
| FIRE 2012 | English  | 89,286   | 50       | 70.78                    |
| FIRE 2012 | Hindi    | 3,31,599 | 50       | 46.18                    |
| FIRE 2012 | Bangla   | 3,77,111 | 50       | 51.62                    |

Table 1: Statistics of the FIRE datasets.
## Setup

Datasets: We use the FIRE cross-lingual datasets
(http://fire.irsi.res.in/fire/static/data) in English, Hindi and Bangla
(details given in Table 1). The documents were collected from
the following newspapers: The Telegraph (http://www.telegraphindia.com)
for English, Dainik Jagran (http://www.jagran.com) and Amar Ujala
(http://www.amarujala.com) for Hindi, and Anandabazar Patrika
(http://www.anandabazar.com) for Bangla. Query sets were created such
that queries with the same identifier are translations of each other.
For each language and collection, we randomly choose $`10`$ queries for
testing. The rest are used for training in a 5-fold cross-validation
manner.

For retrieval, only the title field of the queries was used, and
stopwords were removed. We use the default Dirichlet Language Model
implemented in the Terrier IR toolkit (http://terrier.org/) for all our
retrieval experiments.
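For illustration, the sketch below runs a Dirichlet language model over an existing Terrier index using the later PyTerrier bindings; this is a convenience swap, not the authors' setup (the paper used the Terrier toolkit directly), and the index path is hypothetical.

```python
import pyterrier as pt

if not pt.started():
    pt.init()

# Hypothetical path to a Terrier index built over one FIRE collection.
index = pt.IndexFactory.of("./fire_index/data.properties")

# Dirichlet language model, the weighting model named above.
dlm = pt.BatchRetrieve(index, wmodel="DirichletLM")

# Title-only query, mirroring the setup described in this section.
results = dlm.search("surrender of sanjay dutt")
print(results[["docno", "score"]].head(10))
```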
Baseline: We compare our cross-lingual IR method with bilingual
embeddings against Vulic et al. The shuffling code used was obtained
from the authors.
Mono-lingual: In the monolingual setup, the actual target-language
queries are used for retrieval on the target collection; the resulting
performance sets the upper bound of what can be achieved with a
multi-lingual setup.
## Training

The Gensim implementation of Word2Vec
(https://radimrehurek.com/gensim/models/word2vec.html) was used. The
skip-gram model was trained with the following parameters: (i) vector
dimensionality: 100, (ii) learning rate: 0.01, (iii) min word count: 1.
The context window size was varied from 5 to 50 in intervals of 5. For
the bilingual embedding, window size 25 produced the best results on the
training set and was subsequently used on the test queries, while for
the trilingual embedding, the best window size was 50.
The parameters $`\kappa`$ and $`\tau`$ were tuned on the training set
over the values {5, 10, 15, 20} and {5, 10, 15} respectively.
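A minimal sketch of this training configuration is given below, assuming gensim >= 4.0 parameter names (`vector_size` was `size` in the 3.x releases contemporary with the paper) and a toy stand-in for the shuffled multilingual input, which the paper builds with the shuffling code from Vulic et al.:

```python
from gensim.models import Word2Vec

# Toy stand-in for the shuffled multilingual corpus; in the paper the
# input sentences come from the FIRE collections after shuffling.
mixed_corpus = [
    ["election", "चुनाव", "নির্বাচন", "seats"],
    ["নির্বাচন", "party", "सीट", "results"],
]

model = Word2Vec(
    sentences=mixed_corpus,
    sg=1,             # skip-gram model
    vector_size=100,  # (i) vector dimensionality
    alpha=0.01,       # (ii) learning rate
    min_count=1,      # (iii) min word count
    window=25,        # best bilingual window found on the training set
)
model.save("bilingual_embedding.model")
```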
## Results

To assess quality, we report the Mean Average Precision (MAP),
R-Precision (R-Prec) and Binary Preference (BPref); a sketch of how
these measures can be computed is given after this paragraph.
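As an illustration only (not the authors' evaluation pipeline), all three measures can be computed with the pytrec_eval library; the qrels and run below are toy, hypothetical data:

```python
import pytrec_eval

# Toy relevance judgments (qrels) and retrieval scores (run) for one query.
qrels = {"q1": {"doc1": 1, "doc2": 0, "doc3": 1}}
run = {"q1": {"doc1": 1.9, "doc2": 1.2, "doc3": 0.7}}

# 'map', 'Rprec' and 'bpref' are the standard trec_eval measure names.
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"map", "Rprec", "bpref"})
for qid, measures in evaluator.evaluate(run).items():
    print(qid, measures)
```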
We report our retrieval results in Table
[tab:bi] and Table
[tab:tri]. We uniformly use the
cross-lingual retrieval convention source language $`\to`$ target
language. For example, B$`\to`$E indicates that Bangla is the source
language while English is the target language.
Bilingual Embeddings: Table
[tab:bi] shows the results for bilingual
retrieval, i.e., when the embedding space is built using only 2
languages. For all the language pairs, the proposed method outperforms
Vulic et al. significantly; the differences are statistically
significant at the 5% level ($`p < 0.05`$) by the Wilcoxon
signed-rank test. We have reported the Monolingual results, which do
not require any cross-lingual IR, as an upper bound of performance.
Interestingly, our proposed method produces comparable MAP results for
H$`\to`$B (FIRE 2010). It exhibits better BPref than Monolingual
B$`\to`$B for H$`\to`$B (FIRE 2010) and the difference is statistically
significant at the 5% level by the Wilcoxon signed-rank test. It
is also comparable with Monolingual H$`\to`$H for B$`\to`$H (FIRE 2010),
with Monolingual B$`\to`$B for E$`\to`$B, H$`\to`$B (FIRE 2011) and with
Monolingual E$`\to`$E for H$`\to`$E (FIRE 2012). While evaluating with
R-Prec, H$`\to`$B (FIRE 2010) is slightly better than Monolingual
B$`\to`$B. This shows that the proposed method produces competitive
performance even when compared with a strong baseline like Monolingual.
Time requirements: The comparison of time requirements with Vulic et
al. is reported in Table 2. Our pre-retrieval time comprises the
indexing time using Terrier (http://terrier.org) and the cross-lingual
query generation time. Pre-retrieval time for Vulic is the time taken to
create the document vectors for all the documents in a corpus. Retrieval
time for us is the time taken by Terrier to produce the ranked list for
only the test queries. Retrieval time for Vulic comprises calculating
the cosine score between the query vectors of the test queries and all
the documents in the collection, followed by sorting the documents of the
whole collection in decreasing order of this score for each query.
Our proposed method clearly outperforms Vulic in terms of time
requirements.
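To illustrate why the baseline's retrieval step is slower, the following sketch (not the authors' code) reproduces the brute-force ranking just described: one cosine score per document, then a full sort of the collection, for every query. The vectors here are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy document and query vectors, L2-normalized so the dot product is cosine.
doc_vecs = rng.normal(size=(100_000, 100))
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
query_vec = rng.normal(size=100)
query_vec /= np.linalg.norm(query_vec)

scores = doc_vecs @ query_vec    # O(|D| * dim) cosine scores per query
ranking = np.argsort(-scores)    # O(|D| log |D|) sort of the whole collection
print(ranking[:10])              # top-10 document indices
```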
Trilingual Embeddings: We report the retrieval performance of the
trilingual setting in Table [tab:tri]. We chose not to compare with
Vulic et al. any further since we have already established our
superiority over the latter in the bilingual setting. For FIRE 2010, our
proposed method produces superior performance in both MAP and BPref over
Monolingual B$`\to`$B for both E$`\to`$B and H$`\to`$B and the
differences are statistically significant at the 5% level by the
Wilcoxon signed-rank test. Using R-Prec, for FIRE 2010, E$`\to`$B is
considerably better than Monolingual B$`\to`$B ($`p < 0.05`$ by Wilcoxon
signed-rank test). For FIRE 2011, our proposed method produces better
results in BPref over Monolingual B$`\to`$B for E$`\to`$B and over
Monolingual H$`\to`$H for E$`\to`$H. This shows that our proposed method
is able to maintain its performance when compared with Monolingual even
in a trilingual setting.
| Method   | Language | Pre-retrieval time | Retrieval time |
|----------|----------|--------------------|----------------|
| Proposed | English  | 175.67s            | 5.51s          |
| Vulic    | English  | 10559.37s          | 13.37s         |
| Proposed | Hindi    | 2066.27s           | 0.92s          |
| Vulic    | Hindi    | 26206.04s          | 36.19s         |
| Proposed | Bangla   | 1627.92s           | 3.60s          |
| Vulic    | Bangla   | 15220.24s          | 118.86s        |

Table 2: Time requirements, averaged over the three datasets.
| Method      | Retrieval | FIRE 2010 (MAP / R-Prec / BPref) | FIRE 2011 (MAP / R-Prec / BPref) | FIRE 2012 (MAP / R-Prec / BPref) |
|-------------|-----------|----------------------------------|----------------------------------|----------------------------------|
| Monolingual | E→E       | 0.4256 / 0.4044 / 0.3785         | 0.2836 / 0.3098 / 0.3528         | 0.4868 / 0.4785 / 0.4507         |
| Proposed    | B→E       | 0.1761 / 0.2041 / 0.2297         | 0.1148 / 0.1164 / 0.2204         | 0.2890 / 0.2899 / 0.3418         |
| Proposed    | H→E       | 0.2991 / 0.2548 / 0.2787         | 0.1461 / 0.1810 / 0.2548         | 0.3565 / 0.3861 / 0.4424         |
| Vulic       | B→E       | 0.0000 / 0.0000 / 0.0000         | 0.0000 / 0.0000 / 0.0000         | 0.0000 / 0.0000 / 0.0000         |
| Vulic       | H→E       | 0.0000 / 0.0000 / 0.0000         | 0.0000 / 0.0000 / 0.0009         | 0.0000 / 0.0000 / 0.0006         |
| Monolingual | B→B       | 0.3354 / 0.2951 / 0.2593         | 0.2127 / 0.2677 / 0.2164         | 0.3093 / 0.3188 / 0.3203         |
| Proposed    | E→B       | 0.1964 / 0.2429 / 0.2017         | 0.1302 / 0.1797 / 0.2105         | 0.2114 / 0.2409 / 0.2522         |
| Proposed    | H→B       | 0.3108 / 0.3044 / 0.3185         | 0.1098 / 0.1410 / 0.2089         | 0.2058 / 0.2202 / 0.2383         |
| Vulic       | E→B       | 0.0000 / 0.0000 / 0.0016         | 0.0000 / 0.0000 / 0.0032         | 0.0000 / 0.0000 / 0.0032         |
| Vulic       | H→B       | 0.0001 / 0.0000 / 0.0206         | 0.0000 / 0.0000 / 0.0008         | 0.0000 / 0.0000 / 0.0008         |
| Monolingual | H→H       | 0.3169 / 0.2872 / 0.2691         | 0.2408 / 0.2756 / 0.2637         | 0.4221 / 0.4407 / 0.4226         |
| Proposed    | E→H       | 0.1497 / 0.1663 / 0.1681         | 0.1526 / 0.1806 / 0.2038         | 0.3094 / 0.3093 / 0.3325         |
| Proposed    | B→H       | 0.1791 / 0.2113 / 0.2530         | 0.1244 / 0.1568 / 0.1768         | 0.2751 / 0.2925 / 0.3398         |
| Vulic       | E→H       | 0.0000 / 0.0000 / 0.0030         | 0.0000 / 0.0000 / 0.0089         | 0.0000 / 0.0000 / 0.0000         |
| Vulic       | B→H       | 0.0000 / 0.0000 / 0.0080         | 0.0000 / 0.0000 / 0.0012         | 0.0000 / 0.0000 / 0.0000         |

Table [tab:bi]: Bilingual retrieval results (MAP / R-Prec / BPref per FIRE collection).
| Method      | Retrieval | FIRE 2010 (MAP / R-Prec / BPref) | FIRE 2011 (MAP / R-Prec / BPref) | FIRE 2012 (MAP / R-Prec / BPref) |
|-------------|-----------|----------------------------------|----------------------------------|----------------------------------|
| Monolingual | E→E       | 0.4256 / 0.4044 / 0.3785         | 0.2836 / 0.3098 / 0.3528         | 0.4868 / 0.4785 / 0.4507         |
| Proposed    | B→E       | 0.2096 / 0.2528 / 0.2243         | 0.1434 / 0.1873 / 0.2497         | 0.3632 / 0.3830 / 0.4370         |
| Proposed    | H→E       | 0.3039 / 0.2981 / 0.2973         | 0.1762 / 0.1887 / 0.2387         | 0.3632 / 0.3854 / 0.4426         |
| Monolingual | B→B       | 0.3354 / 0.2951 / 0.2593         | 0.2127 / 0.2677 / 0.2164         | 0.3093 / 0.3188 / 0.3203         |
| Proposed    | E→B       | 0.3950 / 0.3651 / 0.3489         | 0.1843 / 0.2062 / 0.2324         | 0.2960 / 0.3198 / 0.3120         |
| Proposed    | H→B       | 0.3558 / 0.2945 / 0.3237         | 0.1566 / 0.1954 / 0.2127         | 0.1941 / 0.2094 / 0.2314         |
| Monolingual | H→H       | 0.3169 / 0.2872 / 0.2691         | 0.2408 / 0.2756 / 0.2637         | 0.4221 / 0.4407 / 0.4226         |
| Proposed    | E→H       | 0.1759 / 0.2139 / 0.2060         | 0.2259 / 0.2178 / 0.2882         | 0.2774 / 0.2614 / 0.2702         |
| Proposed    | B→H       | 0.1377 / 0.1570 / 0.1833         | 0.1484 / 0.1827 / 0.2160         | 0.2402 / 0.2494 / 0.3082         |

Table [tab:tri]: Trilingual retrieval results (MAP / R-Prec / BPref per FIRE collection).
Figure 1: Target and generated query examples. [figure omitted]
## Analysis
Figure 1 shows some example queries generated
by our proposed method using bilingual and trilingual embeddings. For
the query surrender of sanjay dutt (on the conviction of Bollywood
actor Sanjay Dutt in relation to the 1993 terrorist attacks in Bombay),
the generated query contains important words such as sanjay, dutt,
salem (Abu Salem, a terrorist), ak (AK-47, a firearm), munnabhai
(a popular screen name of Sanjay Dutt). The generated query for cervical
cancer awareness treatment vaccine contains useful terms like
cervical, hpv (Human papillomavirus), infection, pregnant,
silvia (Silvia De Sanjose, a leading researcher in Cancer
Epidemiology). The generated query for death of Yasser Arafat contains
the terms arafat, yasser, ramallah (the headquarters of Yasser Arafat),
palestine, suha (Suha Arafat, Yasser Arafat’s wife) and plo
(Palestine Liberation Organization). The generated query for taj mahal
controversy (regarding whether the Taj Mahal is a Waqf property as claimed by
Uttar Pradesh Sunni Wakf Board and subsequent statements by the
Archaeological Survey of India) contains vital terms such as taj,
mahal, wakf, archaeological and sunni. These examples clearly
portray the effectiveness of our target query generation.
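As an illustration of how such target queries can be generated from the shared embedding space, the sketch below expands each source term with its nearest neighbours; the top-k expansion rule and the helper function are hypothetical simplifications, not the authors' released code:

```python
from gensim.models import Word2Vec

# Embedding trained as sketched in the Training section (hypothetical path).
model = Word2Vec.load("bilingual_embedding.model")

def generate_query(source_terms, k=10):
    """Expand source-language terms with their k nearest neighbours
    in the shared embedding space (hypothetical generation rule)."""
    target_terms = []
    for term in source_terms:
        if term in model.wv:
            target_terms.extend(w for w, _ in model.wv.most_similar(term, topn=k))
    return target_terms

print(generate_query(["cervical", "cancer", "vaccine"]))
```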