WSDM Cup 2026 Multilingual Retrieval: A Low-Cost Multi-Stage Retrieval Pipeline
We present a low-cost retrieval system for the WSDM Cup 2026 multilingual retrieval task, where English queries are used to retrieve relevant documents from a collection of approximately ten million news articles in Chinese, Persian, and Russian, and to output the top-1000 ranked results for each query. We follow a four-stage pipeline that combines LLM-based GRF-style query expansion with BM25 candidate retrieval, dense ranking using long-text representations from jina-embeddings-v4, and pointwise re-ranking of the top-20 candidates using Qwen3-Reranker-4B while preserving the dense order for the remaining results. On the official evaluation, the system achieves nDCG@20 of 0.403 and Judged@20 of 0.95. We further conduct extensive ablation experiments to quantify the contribution of each stage and to analyze the effectiveness of query expansion, dense ranking, and top-$k$ reranking under limited compute budgets.
💡 Research Summary
The paper presents a cost‑effective, fully automated four‑stage retrieval pipeline designed for the WSDM Cup 2026 multilingual retrieval task, where English informational queries must retrieve relevant news articles written in Chinese, Persian, and Russian from a corpus of roughly ten million documents. To avoid the complexities of cross‑language indexing, the authors first translate every non‑English document into English using a high‑quality machine‑translation system and index the English renderings (denoted d̃). All subsequent retrieval operations are performed on these translated texts, while the final output consists of the original multilingual document IDs in the required TREC run format.
Stage 1 – LLM‑based GRF‑style query expansion.
Given an English query q, the pipeline prompts the deepseek‑chat large language model to generate a short, news‑style pseudo‑document g(q). After lemmatization and stop‑word removal with spaCy, the top θ = 30 terms by term frequency (excluding those already present in q) are extracted and concatenated to the original query, forming an expanded query q′. This step implements a generative relevance feedback (GRF) mechanism without needing any relevance judgments. Experiments show that adding GRF to BM25 raises nDCG@20 from 0.3306 to 0.4020 and improves recall at 1000 (R@1000) from 0.5867 to 0.6526, demonstrating that the expansion substantially enriches the lexical match space.
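The term-selection step can be sketched as follows. This is a minimal illustration, not the paper's implementation: a toy stopword set stands in for spaCy's lemmatization and stop-word filtering, and the example query and pseudo-document are invented.

```python
from collections import Counter

# Toy stopword list standing in for spaCy's stop-word removal (illustrative only).
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "for", "after"}

def expand_query(query: str, pseudo_doc: str, theta: int = 30) -> str:
    """GRF-style expansion: append the top-theta pseudo-document terms
    (by term frequency) that are not already present in the query."""
    query_terms = set(query.lower().split())
    tokens = [t for t in pseudo_doc.lower().split()
              if t.isalpha() and t not in STOPWORDS and t not in query_terms]
    top_terms = [term for term, _ in Counter(tokens).most_common(theta)]
    return query + " " + " ".join(top_terms)

# Hypothetical query and LLM-generated pseudo-document g(q).
expanded = expand_query(
    "earthquake relief",
    "Rescue teams delivered aid after the earthquake. Rescue crews and aid "
    "convoys reached survivors in the disaster zone.",
    theta=3,
)
print(expanded)  # query followed by its top-3 expansion terms
```

In the actual pipeline θ = 30 and the pseudo-document is generated by deepseek-chat; only the selection logic is shown here.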
Stage 2 – Sparse candidate generation with BM25.
The expanded query q′ is run against a single BM25 index built on the English translations, retrieving the top 2,000 candidate documents (C2000). Default BM25 parameters are used, and the candidate set is later mapped back to the original multilingual IDs.
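As a reference for what this stage computes, here is a self-contained BM25 scorer over pre-tokenized documents. The k1 = 1.2, b = 0.75 values are the textbook defaults; the paper says only that default parameters were used, and a production system would use an inverted-index engine rather than this linear scan.

```python
import math
from collections import Counter

def bm25_rank(query_terms, docs, k1=1.2, b=0.75, top_k=2000):
    """Minimal BM25: rank tokenized docs against query terms, return
    the indices of the top_k highest-scoring documents."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter()                       # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for i, d in enumerate(docs):
        tf = Counter(d)                  # term frequency within this doc
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = 1 - b + b * len(d) / avgdl
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * norm)
        scores.append((s, i))
    scores.sort(key=lambda x: (-x[0], x[1]))
    return [i for _, i in scores[:top_k]]

# Toy corpus; in the pipeline, docs are the English translations.
docs = [
    "flood damage in the river valley".split(),
    "election results announced".split(),
    "severe flood hits valley towns flood warnings issued".split(),
]
ranking = bm25_rank("flood valley".split(), docs, top_k=3)
print(ranking)  # doc 2 scores highest (flood twice, valley once)
```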
Stage 3 – Dense ranking with jina‑embeddings‑v4.
Each candidate document’s English text (title + body) is encoded with the multilingual, multimodal jina‑embeddings‑v4 model. The model produces 1,024‑dimensional vectors (truncated from the original 2,048) after L2‑normalization; queries are encoded in the same way. Candidates are re‑ordered by cosine similarity, yielding a dense top‑1,000 list. Using only this dense stage (BM25 + jina‑embeddings‑v4) lifts nDCG@20 to 0.4975, confirming that semantic representations are crucial for early precision in a multilingual news domain where lexical overlap is limited.
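The truncate-normalize-rank pattern of this stage can be illustrated with toy vectors. The 4-dimensional embeddings below are invented stand-ins for jina-embeddings-v4 outputs, and the truncation to 2 dims mirrors the paper's 2,048 to 1,024 truncation in miniature; after L2 normalization, cosine similarity reduces to a dot product.

```python
import math

def normalize(v):
    """L2-normalize a vector."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dense_rerank(query_vec, cand_vecs, dim=2):
    """Truncate embeddings to `dim` dimensions, L2-normalize, and rank
    candidates by cosine similarity (dot product of unit vectors)."""
    q = normalize(query_vec[:dim])
    sims = []
    for i, v in enumerate(cand_vecs):
        c = normalize(v[:dim])
        sims.append((sum(a * b for a, b in zip(q, c)), i))
    sims.sort(key=lambda x: (-x[0], x[1]))
    return [i for _, i in sims]

# Toy 4-d "embeddings"; real vectors come from jina-embeddings-v4.
query = [1.0, 0.2, 0.9, -0.3]
candidates = [
    [0.1, 1.0, 0.0, 0.0],    # nearly orthogonal to the query
    [0.9, 0.1, 0.5, 0.5],    # points in the query's direction
    [-1.0, -0.2, 0.3, 0.1],  # opposite direction
]
order = dense_rerank(query, candidates)
print(order)  # candidate 1 ranks first
```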
Stage 4 – Top‑k re‑ranking with Qwen3‑Reranker‑4B.
To capture the remaining performance headroom without incurring prohibitive inference costs, only the top k = 20 documents from the dense list are passed to the Qwen3‑Reranker‑4B model (4 B parameters). Each query–document pair is formatted as an instruction‑style prompt, and the model outputs a binary relevance probability P(yes) derived from the logits of the “yes” and “no” tokens via a two‑way softmax. The re‑ranked top‑20 replace the original dense order, while documents ranked 21–1,000 retain their dense positions. This selective re‑ranking adds roughly 0.006–0.008 to nDCG@20 while cutting the total LLM inference cost by more than an order of magnitude compared to re‑ranking the full candidate set.
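The scoring and splicing logic of this stage reduces to a two-way softmax over the "yes"/"no" token logits plus a partial re-sort. The sketch below assumes the reranker's logits are already available as plain floats; the doc IDs and logit values are invented for illustration.

```python
import math

def relevance_prob(yes_logit: float, no_logit: float) -> float:
    """Two-way softmax over the 'yes'/'no' token logits:
    P(yes) = exp(yes) / (exp(yes) + exp(no))."""
    m = max(yes_logit, no_logit)       # subtract max for numerical stability
    ey = math.exp(yes_logit - m)
    en = math.exp(no_logit - m)
    return ey / (ey + en)

def rerank_top_k(dense_ranking, logits, k=20):
    """Re-order the top-k of the dense ranking by P(yes); documents below
    rank k keep their dense positions. `logits` maps doc id -> (yes, no)."""
    head = sorted(dense_ranking[:k],
                  key=lambda d: -relevance_prob(*logits[d]))
    return head + dense_ranking[k:]

# Hypothetical dense ranking and reranker logits for the top-3 docs.
dense = ["d1", "d2", "d3", "d4"]
logits = {"d1": (0.2, 1.0), "d2": (2.0, -1.0), "d3": (1.0, 1.0)}
final = rerank_top_k(dense, logits, k=3)
print(final)  # top-3 re-sorted by P(yes); d4 keeps its dense position
```

Note that a two-way softmax over (yes, no) is equivalent to a sigmoid over the logit difference, so equal logits yield P(yes) = 0.5.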
Evaluation and Ablation.
On the official test set, the full pipeline achieves nDCG@20 = 0.403 and Judged@20 = 0.95, outperforming all organizer‑reported baselines (e.g., BGE‑M3 Sparse 0.054, Arctic‑Embed Large 0.352, MILCO 0.395). Detailed ablation experiments isolate each component:
- BM25 alone (no expansion) → nDCG@20 0.3306.
- BM25 + GRF → 0.4020.
- BM25 + jina‑embeddings‑v4 → 0.4975.
- Adding GRF to the dense stage (jina + GRF) → 0.5015.
- The best configuration (GRF applied both before BM25 and before dense ranking) reaches 0.5158 but slightly reduces R@1000, indicating a precision‑recall trade‑off where expanded queries concentrate high‑specificity entities at the expense of diversity.
Efficiency considerations.
The pipeline is deliberately lightweight: the only expensive operation is the top‑20 re‑ranking, which fits comfortably within modest GPU budgets. The use of pre‑computed dense embeddings for 2,000 candidates per query keeps latency low, and the reliance on a single English index simplifies deployment.
Conclusions and future directions.
The authors demonstrate that a modest combination of LLM‑driven query expansion, classic BM25 retrieval, modern multilingual dense embeddings, and a tightly scoped re‑ranking stage can deliver strong early‑ranking effectiveness under strict compute constraints. Future work is suggested in three areas: (i) more sophisticated cross‑language query rewriting beyond GRF, (ii) end‑to‑end cross‑lingual retrieval models that jointly learn from English queries and multilingual documents (or their translations) to reduce dependence on MT quality, and (iii) multi‑agent architectures where specialized modules (rewriting, sparse, dense, reranking) collaborate dynamically to improve robustness and scalability.