WebFAQ 2.0: A Multilingual QA Dataset with Mined Hard Negatives for Dense Retrieval


We introduce WebFAQ 2.0, a new version of the WebFAQ dataset containing 198 million FAQ-based natural question-answer pairs across 108 languages. Compared to the previous version, it significantly expands multilingual coverage and grows the number of aligned bilingual QA pairs to over 14.3M, making it the largest FAQ-based resource to date. Unlike the original release, WebFAQ 2.0 uses a novel data collection strategy that directly crawls and extracts relevant web content, yielding a substantially more diverse and multilingual dataset with richer context from page titles and descriptions. In response to community feedback, we also release a hard-negative dataset for training dense retrievers, covering 1.25M queries across 20 languages. These hard negatives were mined with a two-stage retrieval pipeline and include cross-encoder scores for 200 negatives per query. We further show how this resource enables two primary fine-tuning strategies for dense retrievers: contrastive learning with MultipleNegativesRanking loss, and knowledge distillation with MarginMSE loss. WebFAQ 2.0 is not a static resource but part of a long-term effort: since late 2025, structured FAQs have been released regularly through the Open Web Index, enabling continuous expansion and refinement. We publish the datasets and training scripts to facilitate further research in multilingual and cross-lingual IR. The dataset and all related resources are publicly available on GitHub and HuggingFace.


💡 Research Summary

WebFAQ 2.0 is a substantially upgraded version of the original WebFAQ dataset, aimed at supporting multilingual dense retrieval research. The authors collected 198 million natural question‑answer (QA) pairs spanning 108 languages, more than doubling the size of the first release. The new data were obtained by directly crawling the live web: URLs from the original dataset were supplemented with additional URLs mined from the 2025 Common Crawl dump, and pages containing the schema.org “FAQPage” markup were fetched using the OWLer distributed crawler. This approach yields three major benefits: (1) a far larger corpus than the yearly Web Data Commons dumps, (2) automatic extraction of hreflang links that expose multilingual versions of the same FAQ page, and (3) preservation of page titles and meta‑descriptions, which provide crucial context for otherwise ambiguous questions.
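The extraction step described above can be sketched as follows. This is a minimal, illustrative parser that pulls question-answer pairs out of a page's schema.org "FAQPage" JSON-LD block; the paper's actual pipeline uses the OWLer distributed crawler and a production-grade extractor, so the function name `extract_faq_pairs`, the regex-based script scanning, and the assumption that `mainEntity` is a flat list are ours, not the authors'.

```python
import json
import re

# Locate <script type="application/ld+json"> blocks, where schema.org
# markup is most commonly embedded on FAQ pages.
LDJSON_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def extract_faq_pairs(html: str) -> list[tuple[str, str]]:
    """Return (question, answer) pairs from schema.org FAQPage markup.

    Simplifying assumption: the JSON-LD root is a single FAQPage object
    whose mainEntity is a list of Question objects.
    """
    pairs = []
    for block in LDJSON_RE.findall(html):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed markup rather than failing the page
        if data.get("@type") != "FAQPage":
            continue
        for entity in data.get("mainEntity", []):
            question = entity.get("name", "")
            answer = entity.get("acceptedAnswer", {}).get("text", "")
            if question and answer:
                pairs.append((question, answer))
    return pairs
```

A real crawler would additionally follow the page's `hreflang` links to discover parallel-language versions of the same FAQ, which is the mechanism the authors credit for the dataset's multilingual alignments.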

The language distribution is considerably more balanced than before: English now accounts for only 27.9% of the QA pairs (≈55M), while the remaining 72.1% are spread across 107 other languages, including substantial growth for lower-resource languages such as Hindi, Ukrainian, and Polish. Topic labels were generated with a GPT‑5‑mini‑based classifier fine‑tuned on 79k manually verified samples and then applied to the whole corpus, yielding seven high‑level categories; "Travel and Hospitality" dominates (≈60% of the data), reflecting the web‑derived nature of the source. The question‑type taxonomy follows Bolotova et al., extended to multilingual settings via an ensemble of large language models (LLaMA 3.1, Gemma 2, Qwen 2.5) and a subsequent XLM‑R fine‑tuning that achieves ~88% F1.

A key contribution is the release of a hard‑negative dataset for dense‑retriever training. For 1.25M queries in 20 languages, the authors first retrieved 1,000 candidates with BM25, re‑ranked them with the multilingual cross‑encoder BGE‑m3, and selected the top 200 as hard negatives. Each negative is accompanied by its cross‑encoder score, enabling two fine‑tuning strategies: (i) contrastive learning with MultipleNegativesRanking loss, and (ii) knowledge distillation using MarginMSE loss. Experiments show that hard negatives improve performance over random negatives in several configurations, especially for non‑English languages, but also reveal persistent challenges: false negatives remain common, and random negatives sometimes outperform hard negatives in pure contrastive setups. Knowledge distillation offers more robust gains across languages but can slightly degrade English performance, suggesting cross‑encoder bias.

Bilingual QA alignments (bitexts) were constructed by embedding all questions and answers with LaBSE and retrieving nearest neighbors under a strict similarity threshold of 0.9. This process yields over 14.3M aligned QA pairs across 3,970 language pairs, a dramatic increase from the 1.5M pairs in the original release. Notably, high‑resource pairs (e.g., English‑German) are complemented by numerous other combinations such as Marathi‑Telugu, German‑Spanish, and Russian‑Ukrainian. Alignment quality was validated with the GEMBA metric, which relies on LLM judgments, and the bitext task has been added to the MTEB benchmark, providing a new multilingual sentence‑embedding evaluation.
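The alignment step reduces to nearest-neighbor search with a cosine cutoff. A toy sketch follows, using hand-made 3-dimensional vectors in place of real LaBSE embeddings; the 0.9 threshold is the one reported above, but the brute-force search and the `mine_bitexts` helper are illustrative assumptions (a corpus-scale pipeline would use an approximate-nearest-neighbor index instead).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_bitexts(src_vecs, tgt_vecs, threshold=0.9):
    """Align each source item to its nearest target iff cosine >= threshold.

    Returns (source_index, target_index) pairs; items whose best match
    falls below the threshold are left unaligned.
    """
    pairs = []
    for i, u in enumerate(src_vecs):
        best_j, best_sim = max(
            ((j, cosine(u, v)) for j, v in enumerate(tgt_vecs)),
            key=lambda t: t[1],
        )
        if best_sim >= threshold:
            pairs.append((i, best_j))
    return pairs
```

The strict 0.9 cutoff trades recall for precision: many genuine translations are discarded, but the pairs that survive are reliable enough to serve as bitext training and evaluation data.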

WebFAQ 2.0 is positioned as a living resource. Since November 2025, structured FAQ data are published daily through the Open Web Index, and the authors’ pipeline continuously harvests, cleans, and integrates new dumps. All data, metadata, and training scripts are openly released on GitHub and HuggingFace, ensuring reproducibility and encouraging community contributions. The dataset’s scale, multilingual breadth, inclusion of contextual metadata, hard‑negative signals, and ongoing expansion make it a valuable foundation for future work on multilingual dense retrieval, cross‑lingual question answering, and robust negative‑sampling strategies.

