Mix, MinHash, and Match: Cross-Source Agreement for Multilingual Pretraining Datasets
Multilingual data from the web is essential for LLM pretraining. Yet, scraping it is expensive, and research groups repeatedly crawl the same content. For example, we found that over 40% of tokens across major Arabic web corpora are duplicated between sources. In this work, we propose to use this wasteful redundancy as a quality signal to create high-quality pretraining datasets. Our key insight is that cross-source agreement functions as a free, model-free quality filter: content retained by multiple independent pipelines is more likely to represent high-quality text. Crucially, this signal requires no additional computation beyond standard deduplication, which is already performed at scale when pretraining language models. We therefore propose MixMinMatch, a method that combines multiple existing web corpora, performs cross-dataset MinHash deduplication, and identifies documents independently recovered by multiple sources. We apply MixMinMatch to Arabic, Turkish, and Hindi, producing corpora that match or exceed the quality of the best single-source baselines, while providing up to 4$\times$ more unique tokens. On Arabic, our matched subset achieves a 4.5% relative improvement over ArabicWeb24, while on Turkish, we improve over FineWeb-2 by 5.5%. We release the datasets at: https://huggingface.co/collections/AdaMLLab/mixminmatch
💡 Research Summary
The paper tackles a pervasive inefficiency in multilingual large‑scale language model pre‑training: many research groups independently crawl the web for the same languages, resulting in massive overlap across publicly released corpora. The authors observe that over 40 % of tokens in major Arabic web corpora are duplicated between sources, and they argue that this redundancy can be turned into a free quality signal rather than being treated merely as waste.
The central hypothesis is that when multiple independent pipelines—each with its own crawl schedule, filtering heuristics, and quality thresholds—retain the same document, that document is more likely to be high‑quality. This mirrors the well‑known principle that independent agreement reduces uncertainty, as seen in crowdsourcing (inter‑annotator agreement) and ensemble learning (bagging, boosting).
To exploit this insight, the authors introduce MixMinMatch, a three‑stage pipeline:
- **Mix** – Aggregate a collection of publicly available multilingual web corpora (C4, CulturaX, HPLT 2.0, FinePDFs, FineWeb‑2, etc.) together with language‑specific resources (ArabicWeb24, Sangraha‑U, VNGRS‑Web). Each document is stored as a tuple (text, source_id) so that source provenance is preserved throughout processing.
- **MinHash** – Perform cross‑dataset near‑duplicate detection using locality‑sensitive hashing (LSH). For each document, a set of 5‑gram character shingles is built, a MinHash signature of length 112 is computed, and the signature is split into 14 bands of 8 hashes each. Documents that share a band become candidate pairs; pairs whose estimated Jaccard similarity exceeds τ = 0.8 are considered near‑duplicates. A Union‑Find clustering step groups all such pairs into connected components, each representing a duplicate cluster. One representative (the earliest document in a deterministic order) is kept per cluster, guaranteeing reproducibility.
- **Match** – Examine the source IDs within each MinHash cluster. If a cluster contains documents from at least two distinct sources, the representative is added to the “matched” subset. The number of distinct sources per cluster is stored as metadata, allowing downstream users to tighten the agreement threshold (e.g., require ≥ 3 sources). This step incurs essentially zero extra computation because the source labels are already attached to the clusters produced by MinHash.
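The three stages can be sketched in a few dozen lines of Python. This is an illustrative toy, not the authors' implementation: the tiny corpus, the choice of hash function, and the bucket-pairing strategy are assumptions, while the parameters (5‑gram shingles, 112 hashes, 14 bands of 8 rows, τ = 0.8) follow the description above.

```python
import hashlib
import itertools
from collections import defaultdict

NUM_HASHES, BANDS, ROWS = 112, 14, 8   # 14 bands * 8 rows = 112 hashes
TAU = 0.8                              # estimated-Jaccard threshold

def shingles(text, n=5):
    """Set of character n-grams (5-gram shingles)."""
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def minhash(text):
    """MinHash signature: one minimum over keyed 64-bit hashes per 'permutation'."""
    sig = []
    for seed in range(NUM_HASHES):
        key = seed.to_bytes(4, "big")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, key=key).digest(),
                "big")
            for s in shingles(text)))
    return sig

def est_jaccard(s1, s2):
    """Jaccard similarity estimated from signature agreement."""
    return sum(a == b for a, b in zip(s1, s2)) / NUM_HASHES

# --- Mix: documents stored as (text, source_id) tuples ------------------
docs = [
    ("the quick brown fox jumps over the lazy dog", "corpusA"),
    ("the quick brown fox jumps over the lazy dog", "corpusB"),  # duplicate
    ("an entirely different document about other things", "corpusA"),
]
sigs = [minhash(text) for text, _ in docs]

# --- MinHash: LSH banding plus Union-Find clustering --------------------
parent = list(range(len(docs)))
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

buckets = defaultdict(list)
for i, sig in enumerate(sigs):
    for b in range(BANDS):
        buckets[(b, tuple(sig[b * ROWS:(b + 1) * ROWS]))].append(i)

for members in buckets.values():
    for a, b in itertools.combinations(members, 2):
        if est_jaccard(sigs[a], sigs[b]) >= TAU:
            parent[find(a)] = find(b)  # union the near-duplicates

# --- Match: keep one representative per multi-source cluster ------------
clusters = defaultdict(list)
for i in range(len(docs)):
    clusters[find(i)].append(i)

matched = sorted(
    min(members)                                 # deterministic representative
    for members in clusters.values()
    if len({docs[i][1] for i in members}) >= 2)  # >= 2 distinct sources
print(matched)  # indices of the matched representatives -> [0]
```

The toy re-checks the same candidate pair once per colliding band; a production pipeline would deduplicate candidate pairs before verification, and would stream the banding step rather than hold all signatures in memory.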
The authors formalize the intuition: each pipeline’s retention decision can be viewed as a noisy annotator that keeps document x with probability q_s(x), which is positively correlated with an unobserved quality variable Q(x). The expected quality of a document retained by many pipelines (large |S(x)|) is therefore higher than that of a document retained by only one. Thus, cross‑source agreement acts as an ensemble filter that suppresses false positives without any model inference.
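A quick Monte-Carlo sketch makes this concrete under loudly simplified assumptions of my own (not the paper's model): quality Q(x) is Uniform(0, 1) and each of S independent pipelines retains a document with probability q_s(x) = Q(x). Under exactly these assumptions the conditional mean quality given k retaining pipelines is (k + 1)/(S + 2) by the standard Beta-Binomial posterior, and the simulation reproduces that monotone rise:

```python
import random

random.seed(0)
S = 4                                  # number of independent pipelines
kept_quality = {k: [] for k in range(S + 1)}

for _ in range(200_000):
    q = random.random()                # latent quality Q(x) ~ Uniform(0, 1)
    k = sum(random.random() < q for _ in range(S))  # pipelines retaining x
    kept_quality[k].append(q)

for k in range(S + 1):
    mean_q = sum(kept_quality[k]) / len(kept_quality[k])
    print(k, round(mean_q, 3))         # close to (k + 1) / (S + 2)
```

The exact correlation between retention and quality in real pipelines is of course unknown; the point of the sketch is only that any positive correlation makes agreement count informative.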
Empirical evaluation focuses on three typologically diverse languages—Arabic (right‑to‑left script), Turkish (Latin script), and Hindi (Devanagari). For each language, the MixMinMatch pipeline produces two releases: a fully deduplicated corpus and a smaller “matched” subset. Token statistics (Table 2) show that starting from roughly 300 B tokens per language, quality filtering reduces the corpus to a few hundred billion tokens, MinHash deduplication further cuts it to ~180 B tokens, and the final matched subset contains 54 B (Arabic), 56 B (Turkish), and 27 B (Hindi) tokens. Although the matched subset is a fraction of the original size, it retains 2–4× more unique tokens than any single source corpus because it aggregates content that survived multiple independent pipelines.
To assess downstream impact, the authors pre‑train a Llama‑3.2‑3B‑size model on each matched dataset and compare it against models trained on the strongest single‑source baselines (ArabicWeb24 for Arabic, FineWeb‑2 for Turkish). Results show relative improvements of 4.5 % (Arabic) and 5.5 % (Turkish) on standard evaluation suites, confirming that cross‑source agreement correlates with better downstream performance.
The paper also discusses practical considerations: language‑specific quality filters (minimum length, character repetition, script consistency) are kept lightweight, while the cross‑source agreement provides a language‑agnostic signal that works even for morphologically rich or non‑Latin scripts where English‑centric heuristics fail. Moreover, the computational cost of MixMinMatch is negligible beyond the standard MinHash deduplication already required for large‑scale corpora; in contrast, model‑based quality scoring would demand thousands of GPU‑hours.
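As a concrete illustration of such lightweight filters, a minimal sketch might look like the following; the function name, thresholds, and the Unicode-name trick for script checking are all my illustrative choices, not values from the paper:

```python
import unicodedata
from collections import Counter

def passes_filters(text, min_chars=200, max_char_ratio=0.2,
                   expected_script="ARABIC", min_script_ratio=0.5):
    """Cheap, model-free document checks: length, repetition, script."""
    if len(text) < min_chars:                      # minimum length
        return False
    top_count = Counter(text).most_common(1)[0][1]
    if top_count / len(text) > max_char_ratio:     # character repetition
        return False
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    in_script = sum(expected_script in unicodedata.name(c, "")
                    for c in letters)
    return in_script / len(letters) >= min_script_ratio  # script consistency
```

Matching script names via `unicodedata.name` is a simple stand-in; production pipelines typically rely on codepoint-range tables or language-ID models, but the structure of the checks is the same.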
Contributions are summarized as:
- A systematic quantification of redundancy across major multilingual web corpora.
- The MixMinMatch algorithm that turns cross‑source overlap into a free, ensemble‑style quality filter.
- Release of the AraMix, TurMix, and HinMix multilingual pre‑training corpora, each annotated with per‑document source counts.
- Empirical evidence that the matched subsets achieve equal or superior performance to the best single‑source baselines while offering substantially more unique data.
The authors suggest future work on scaling the approach to more languages, exploring higher agreement thresholds (≥ 3 sources) to trade off quality versus diversity, and extending the concept of cross‑source agreement to other modalities such as code or image metadata.
In summary, MixMinMatch demonstrates that “redundancy is not waste but a signal” and provides a simple, cost‑free method to harvest higher‑quality, higher‑diversity multilingual data for the next generation of large language models.