NeuCLIRTech: Chinese Monolingual and Cross-Language Information Retrieval Evaluation in a Challenging Domain


Measuring advances in retrieval requires test collections with relevance judgments that can faithfully distinguish systems. This paper presents NeuCLIRTech, an evaluation collection for cross-language retrieval over technical information. The collection consists of technical documents written natively in Chinese, along with machine translations of those same documents into English. It includes 110 queries with relevance judgments and supports two retrieval scenarios: monolingual retrieval in Chinese, and cross-language retrieval with English as the query language. NeuCLIRTech combines the TREC NeuCLIR track topics of 2023 and 2024; the 110 queries with 35,962 document judgments provide strong statistical power for distinguishing retrieval approaches. A fusion baseline of strong neural retrieval systems is included so that developers of reranking algorithms are not reliant on BM25 as their first-stage retriever. The dataset and artifacts are released on HuggingFace Datasets.


💡 Research Summary

NeuCLIRTech introduces a new, large‑scale evaluation collection for cross‑language information retrieval (CLIR) focused on technical documents. The collection draws from the Chinese Scientific Literature (CSL) dataset, comprising 396,209 abstracts from 1,980 Chinese academic journals published between 2010 and 2020. Each abstract is provided in its original Chinese form and in an English translation generated by Google Translate in June 2023, yielding a bilingual corpus that enables direct comparison of monolingual Chinese retrieval and English‑to‑Chinese CLIR.

The authors assembled 110 queries with a total of 35,962 relevance judgments. Queries were created by 22 graduate students and one post‑doctoral researcher from Johns Hopkins University and the University of Maryland, who were selected for strong Chinese language skills and familiarity with scientific research. Each participant generated 5‑8 TREC‑style topics (title, description, narrative) in both Chinese and English, performed an exploratory search on the CSL collection, and refined the topics to ensure a manageable number of relevant documents. After internal review, 146 initial topics were reduced to 110 high‑quality queries.
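The TREC-style topic structure described above (title, description, narrative, authored in parallel Chinese and English) can be sketched as a simple record. The field names follow the standard TREC convention; the example strings and topic id below are invented for illustration and do not come from the collection:

```python
from dataclasses import dataclass

@dataclass
class Topic:
    """One TREC-style topic, authored in parallel Chinese/English versions."""
    topic_id: str
    title: str        # short keyword form of the information need
    description: str  # one-sentence statement of the need
    narrative: str    # guidance to assessors on what counts as relevant

# Invented illustration; real topic ids and text come from the collection.
topic = Topic(
    topic_id="201",
    title="graphene battery electrodes",
    description="Find abstracts on graphene-based electrode materials for batteries.",
    narrative="Relevant abstracts describe graphene electrodes; survey papers also count.",
)
```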

Relevance judgments were obtained by pooling the top-ranked documents from all runs submitted to the TREC NeuCLIR track for the years in which the topics were created. For queries 1‑199, the top 20 documents per run were judged; for queries 200 and above, the top 35 were judged. Annotators evaluated each document in two steps: (1) whether the abstract contained information central to the query, and (2) the value of that information on a graded scale (3 = very valuable, 1 = somewhat valuable, 0 = not valuable). This “deep judgment” approach resulted in near‑complete coverage of relevant material for each topic, a notable improvement over many existing CLIR collections in which only a small fraction of the pool is judged.
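The pooling step amounts to taking, per topic, the union of each submitted run's top-ranked documents. A minimal sketch (run names, document ids, and the toy pool depth are invented; the collection used depths of 20 or 35):

```python
def build_pool(runs, depth):
    """Union of the top-`depth` documents from each run, for one topic.

    `runs` maps a run name to that run's ranked list of document ids.
    Every document in the returned pool is then judged by an annotator.
    """
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool

# Toy example with two runs and pool depth 3.
runs = {
    "run_a": ["d1", "d2", "d3", "d4"],
    "run_b": ["d3", "d5", "d6", "d7"],
}
pool = build_pool(runs, depth=3)
```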

The paper provides two primary retrieval scenarios: (a) monolingual retrieval where both query and documents are in Chinese, and (b) cross‑language retrieval where the query is in English and the documents are in Chinese (or their English translations). To support research beyond the first‑stage retrieval, the authors release a fusion baseline that combines three modern first‑stage retrievers: a single‑vector dual‑encoder, a multi‑vector dual‑encoder, and a learned‑sparse (LSR) model. This baseline is intended to replace the traditional BM25 first‑stage, allowing rerankers to be evaluated on documents that may lack lexical overlap with the query.
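The released fusion baseline is described later as a simple average of three first-stage models. One common way to average scores from systems with different score ranges is to min-max normalize each run first; the normalization choice here is an assumption of this sketch, not a detail confirmed by the paper:

```python
def fuse(runs):
    """Average min-max-normalized scores across first-stage runs.

    `runs` is a list of dicts mapping doc id -> retrieval score, one dict
    per first-stage system. Normalization per run is an assumption; the
    paper only states that the fusion is a simple average of three models.
    """
    fused = {}
    for scores in runs:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero on constant runs
        for doc, s in scores.items():
            fused[doc] = fused.get(doc, 0.0) + (s - lo) / span
    n = len(runs)
    return sorted(((d, v / n) for d, v in fused.items()),
                  key=lambda kv: kv[1], reverse=True)

# Toy example: doc "a" is ranked first by both systems.
result = fuse([{"a": 2.0, "b": 1.0, "c": 0.0},
               {"a": 3.0, "c": 1.0, "d": 0.0}])
```

A reranker evaluated on top of such a fusion sees candidates surfaced by dense and sparse models alike, including documents with little lexical overlap with the query.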

Experimental results are reported using nDCG@20 and Judged@20. The strongest first‑stage retrievers come from the Qwen‑3 embedding family (0.6B, 4B, and 8B variants); the best of these reaches nDCG@20 of 0.480 (monolingual) and 0.472 (cross‑language), with Judged@20 of 0.87 in both settings. Surprisingly, the recent multilingual learned‑sparse model MILCO underperforms BM25, highlighting that technical‑domain vocabulary and translation noise remain challenging for current LSR approaches. The fusion baseline, despite being a simple average of three models, consistently outperforms the individual baselines, reaching nDCG@20 of 0.438 (monolingual) and 0.431 (cross‑language) with very high Judged@20 scores (0.92‑0.96).
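nDCG@20 with graded judgments can be computed in the standard way; note that any document absent from the judgments contributes a gain of 0, which is exactly the behavior Judged@20 is meant to expose:

```python
import math

def ndcg_at_k(ranking, qrels, k=20):
    """Standard nDCG@k; unjudged documents are treated as non-relevant.

    `ranking` is a list of doc ids; `qrels` maps doc id -> graded gain.
    """
    gains = [qrels.get(doc, 0) for doc in ranking[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0

# Toy qrels: "d1" is very valuable (3), "d2" somewhat valuable (1).
qrels = {"d1": 3, "d2": 1}
```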

Reranking experiments show that large language model (LLM) based rerankers (e.g., Qwen‑3‑8B Rerank, Rank‑Qwen‑32B) can improve over the first‑stage scores, but not uniformly. Some rerankers (e.g., Jina Reranker) maintain performance in the monolingual setting yet degrade sharply in the cross‑language scenario, underscoring the difficulty of transferring reranking capabilities across languages. The authors also note that not all rerankers surpass the initial retrieval performance, especially when the first‑stage already yields high-quality rankings.

A key methodological contribution is the inclusion of Judged@20 as a complementary metric. Because relevance judgments are limited to pooled top documents, systems that retrieve many unjudged yet potentially relevant documents may be unfairly penalized in nDCG@20. Reporting Judged@20 helps researchers gauge the extent of possible underestimation.
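Judged@20 itself is straightforward: the fraction of a system's top-20 documents that received any relevance judgment, regardless of the grade assigned. A minimal sketch:

```python
def judged_at_k(ranking, judged_docs, k=20):
    """Fraction of the top-k retrieved documents that were judged at all.

    `judged_docs` is the set of doc ids with any judgment (including
    judged-non-relevant); a low value warns that nDCG@k may underestimate
    the system's true effectiveness.
    """
    top = ranking[:k]
    return sum(doc in judged_docs for doc in top) / len(top) if top else 0.0
```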

The paper acknowledges a limitation: relevance judgments are based only on documents retrieved by systems that participated in the TREC NeuCLIR track at the time of collection creation. Consequently, novel systems that surface unjudged relevant documents could be under‑scored. The authors recommend always reporting Judged@20 alongside nDCG@20 to mitigate this risk.

In conclusion, NeuCLIRTech is, to the authors’ knowledge, the first modern CLIR benchmark that focuses on technical documents. By providing a bilingual corpus, deep relevance judgments, and strong baselines (including a fusion first‑stage and several LLM rerankers), the collection offers a comprehensive testbed for investigating translation quality, multilingual embeddings, and reranking strategies in a domain where current models still struggle. The dataset, code, and artifacts are publicly released on HuggingFace Datasets, facilitating reproducibility and future extensions.

