SoftMatcha 2: A Fast and Soft Pattern Matcher for Trillion-Scale Corpora
We present an ultra-fast and flexible search algorithm that enables search over trillion-scale natural language corpora in under 0.3 seconds while handling semantic variations (substitution, insertion, and deletion). Our approach employs string matching based on suffix arrays that scales well with corpus size. To mitigate the combinatorial explosion induced by the semantic relaxation of queries, our method is built on two key algorithmic ideas: fast exact lookup enabled by a disk-aware design, and dynamic corpus-aware pruning. We theoretically show that the proposed method suppresses exponential growth in the search space with respect to query length by leveraging statistical properties of natural language. In experiments on FineWeb-Edu (Lozhkov et al., 2024) (1.4T tokens), we show that our method achieves significantly lower search latency than existing methods: infini-gram (Liu et al., 2024), infini-gram mini (Xu et al., 2025), and SoftMatcha (Deguchi et al., 2025). As a practical application, we demonstrate that our method identifies benchmark contamination in training corpora, unidentified by existing approaches. We also provide an online demo of fast, soft search across corpora in seven languages.
💡 Research Summary
SoftMatcha 2 introduces a novel “soft” search engine capable of handling trillion‑token corpora (e.g., 1.4 trillion tokens in FineWeb‑Edu) with sub‑second latency (≈0.3 s). The system addresses two fundamental challenges that arise when extending exact‑match suffix‑array indexing to semantic‑aware search: (1) the prohibitive cost of random disk accesses on massive suffix‑array indexes, and (2) the exponential blow‑up of candidate patterns when allowing token‑level substitutions, insertions, and deletions.
To solve (1), the authors design a disk‑aware staged suffix array inspired by Google's BigTable. The index is built in two stages: the first stage quickly narrows the search to a small range of suffixes, and the second stage performs a precise lookup within that range. This architecture guarantees a single disk I/O per exact‑match query, dramatically reducing latency. Run‑length compression further shrinks the index from 56 TB to 21.6 TB for the FineWeb‑Edu corpus, making the structure feasible on modern SSD arrays.
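A minimal Python sketch of the two-stage idea may help. This is a toy in-memory version under illustrative assumptions: all helper names are hypothetical, the "stage 1" index is simply every k-th suffix (in the real system it would live in memory while the full array stays on disk), and the narrowed range is assumed to contain all matches:

```python
import bisect

# Toy sketch of a two-stage suffix-array lookup (hypothetical names).
# Stage 1: a sparse index of every k-th suffix narrows the query to a
#   small block of the full suffix array.
# Stage 2: an exact check within that block, which on disk would touch
#   only one contiguous region (hence a single I/O per exact query).

def build_suffix_array(text):
    return sorted(range(len(text)), key=lambda i: text[i:])

def build_sparse_index(text, sa, k):
    # Stage-1 index: the text of every k-th suffix, kept in memory.
    return [text[sa[i]:] for i in range(0, len(sa), k)]

def staged_lookup(text, sa, sparse, k, pattern):
    # Stage 1: locate the block that could contain the pattern.
    block = bisect.bisect_left(sparse, pattern)
    lo = max(0, (block - 1) * k)
    hi = min(len(sa), (block + 1) * k)
    # Stage 2: exact check inside the narrowed range only.
    # (Simplified: assumes all matches fall within this one block.)
    matches = [p for p in sa[lo:hi] if text[p:p + len(pattern)] == pattern]
    return sorted(matches)

text = "banana_bandana"
sa = build_suffix_array(text)
sparse = build_sparse_index(text, sa, k=4)
print(staged_lookup(text, sa, sparse, 4, "ban"))  # → [0, 7]
```

The key property is that stage 1 never touches the full array, so the expensive structure is read only once, over one contiguous block.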
For (2), they propose dynamic corpus‑aware pruning. By exploiting the well‑known Zipfian (power‑law) distribution of n‑grams in natural language, the algorithm iteratively discards candidate patterns that are unlikely to appear in the corpus. At each step of candidate generation, the partial pattern is checked for existence; if absent, the entire branch is pruned. This iterative pruning reduces the effective search space from exponential in query length to near‑linear in practice.
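The pruning loop can be sketched as follows. This is a simplified illustration, not the paper's implementation: the corpus is a short token list, `ngram_exists` stands in for a suffix-array lookup, and the per-position alternative sets are invented for the example:

```python
# Sketch of corpus-aware pruning for soft pattern search.
# Each query token expands to a set of semantically similar tokens.
# Rather than enumerating all combinations (exponential in query
# length), we extend partial patterns left to right and discard any
# prefix that does not occur in the corpus.

def ngram_exists(corpus_tokens, pattern):
    # Stand-in for a suffix-array lookup: linear scan for the demo.
    n = len(pattern)
    return any(corpus_tokens[i:i + n] == pattern
               for i in range(len(corpus_tokens) - n + 1))

def pruned_candidates(corpus_tokens, alternatives_per_position):
    partials = [[]]
    for alts in alternatives_per_position:
        extended = []
        for partial in partials:
            for tok in alts:
                cand = partial + [tok]
                # Zipfian sparsity: most extensions never occur, so the
                # frontier stays small instead of growing exponentially.
                if ngram_exists(corpus_tokens, cand):
                    extended.append(cand)
        partials = extended
    return partials

corpus = "the quick brown fox jumps over the lazy dog".split()
# Query "fast brown fox", with hypothetical similar-token alternatives:
alts = [["fast", "quick", "rapid"], ["brown"], ["fox", "wolf"]]
print(pruned_candidates(corpus, alts))  # → [['quick', 'brown', 'fox']]
```

Because a pattern can only occur if every one of its prefixes occurs, pruning an absent prefix safely eliminates the entire branch of candidates built on it.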
The similarity metric is also refined. Instead of the plain minimum cosine similarity used in the original SoftMatcha, SoftMatcha 2 employs a smooth‑minimum function parameterized by β = 10⁴, which aggregates token‑level similarities while still emphasizing the weakest link. Insertions and deletions are penalized by an exponential factor exp(−v/γ), where v is the squared norm of the inserted/deleted token after Zipfian whitening; this keeps penalties low for high‑frequency function words and higher for content‑rich tokens.
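The two scoring ingredients can be sketched numerically. The log-sum-exp form of the soft minimum below is one standard formulation and may differ in detail from the paper's; the γ value and the toy token vector are illustrative:

```python
import math

# Sketch of the two score components described above (values illustrative).

def smooth_min(values, beta=1e4):
    # Log-sum-exp soft minimum: always <= min(values), and converges
    # to the hard minimum as beta grows (the summary cites beta = 1e4).
    m = min(values)  # subtracted for numerical stability
    return m - (1.0 / beta) * math.log(
        sum(math.exp(-beta * (v - m)) for v in values))

def indel_penalty(token_vec, gamma=1.0):
    # exp(-v / gamma), where v is the squared norm of the inserted or
    # deleted token's vector after Zipfian whitening. Frequent function
    # words end up with small norms, so editing them costs little.
    v = sum(x * x for x in token_vec)
    return math.exp(-v / gamma)

sims = [0.92, 0.85, 0.97]            # token-level cosine similarities
score = smooth_min(sims)             # ~0.85: dominated by the weakest link
score *= indel_penalty([0.1, 0.2])   # mild penalty for a small-norm token
```

With β = 10⁴ the soft minimum is numerically indistinguishable from the hard minimum for typical similarity gaps, while remaining differentiable in the token similarities.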
Experimental evaluation spans multiple massive corpora: English (FineWeb‑Edu, 1.4 T tokens), Chinese (38.3 B tokens), and Japanese (169 B tokens). The 95th‑percentile latency for soft search is 278 ms on FineWeb‑Edu and under 400 ms on the other corpora, while exact search is as low as 0.34 ms. Compared to prior state‑of‑the‑art systems—infini‑gram, infini‑gram‑mini, and the original SoftMatcha—SoftMatcha 2 achieves 2–3× lower latency and comparable or higher recall.
A practical application demonstrated is benchmark contamination detection. By querying evaluation‑set sentences, SoftMatcha 2 uncovers exact or near‑duplicate occurrences in the training corpora that were missed by earlier tools, highlighting its utility for data‑quality audits in large‑scale language model training.
Finally, the authors release an online demo running on a 100 B‑token English corpus, showcasing real‑time soft search across seven languages. The paper’s contributions are threefold: (1) a disk‑aware staged suffix‑array index that scales to multi‑terabyte corpora, (2) a statistically grounded pruning strategy that tames exponential candidate growth, and (3) an enhanced similarity formulation that balances token‑level semantics with insertion/deletion penalties.
Future work may explore extending the approach to syntactic‑level matching, multimodal corpora, and distributed indexing to further improve scalability and robustness. SoftMatcha 2 thus represents a significant step toward practical, real‑time, semantically aware search over the massive text collections that underpin modern LLMs.