FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Recent multilingual named entity recognition (NER) work has shown that large language models (LLMs) can provide effective synthetic supervision, yet such datasets have mostly appeared as by-products of broader experiments rather than as systematic, reusable resources. We introduce FiNERweb, a dataset-creation pipeline that scales the teacher-student paradigm to 91 languages and 25 scripts. Building on FineWeb-Edu, our approach trains regression models to identify NER-relevant passages and annotates them with multilingual LLMs, resulting in about 225k passages with 235k distinct entity labels. Our experiments show that the regression model achieves more than 84 F1, and that models trained on FiNERweb obtain comparable or improved performance in zero-shot transfer settings on English, Thai, and Swahili, despite being trained on 19x less data than strong baselines. In addition, we assess annotation quality using LLM-as-a-judge and observe consistently high scores for both faithfulness (3.99 out of 5) and completeness (4.05 out of 5), indicating reliable and informative annotations. Further, we release the dataset with both English labels and translated label sets in the respective target languages, because we observe that the performance of current state-of-the-art models drops by 0.02 to 0.09 F1 when evaluated with target-language labels instead of English ones. We release FiNERweb together with all accompanying artifacts to the research community in order to facilitate more effective teacher-student training for multilingual named entity recognition.


💡 Research Summary

This paper introduces FiNERweb, a scalable pipeline for creating a large-scale multilingual Named Entity Recognition (NER) dataset, along with the resulting dataset itself. The work addresses a critical gap in current NER resources: while recent studies show Large Language Models (LLMs) can generate useful synthetic supervision data, such datasets often lack systematic construction and reusability. Furthermore, existing multilingual NER datasets typically offer either broad language coverage with limited label types (e.g., PAN-X) or rich label sets for only a handful of languages (e.g., DynamicNER). FiNERweb aims to provide both broad language coverage and rich label sets simultaneously.

The core contribution is a three-stage, scalable pipeline.

Stage 1: High-Quality Passage Selection Model. The process begins by building a filter to identify passages useful for NER training. The authors sample 1k passages each for 91 languages (aligned with XLM-RoBERTa’s pretraining languages) from the FineWeb-2 corpus. Two multilingual LLMs (GPT-4o mini and Gemma3-27B) are prompted to rate each passage on a 1-4 scale for its NER utility. This preference data is used to train a regression model based on XLM-RoBERTa, which learns to predict this usefulness score. The best-performing model achieves over 84 F1 in identifying high-quality (score >= 3) passages.

Stage 2: Filtering FineWeb-2. The trained regression model is applied to score passages from FineWeb-2 across all 91 languages. Passages predicted to be high-quality are retained, resulting in a curated, unlabeled dataset of approximately 2,500 passages per language. An additional language identification check filters out embedded English advertisements.

Stage 3: LLM Annotation and Merging. The filtered passages are annotated by both GPT-4o mini and Gemma3-27B, which are instructed to extract entity mentions and their types. The outputs from the two LLMs are then merged: annotations must be exact substrings of the passage and are processed sequentially to avoid propagation errors. When spans from the two models overlap significantly (>= 50%), the semantic similarity of their labels is computed using sentence embeddings; if sufficiently similar (similarity > 0.75), the entity type labels are concatenated (e.g., “person / human”). This merging strategy enriches the final label set. Finally, all English entity type labels are translated into their respective target languages using the Google Translate API. The outcome is the FiNERweb dataset: ~225k passages across 91 languages and 25 scripts, containing ~235k distinct entity type labels.
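The Stage 3 merging step can be sketched roughly as follows. This is a minimal illustration, not the authors' code: spans are assumed to be character offsets, overlap is measured relative to the shorter span, and `similarity` stands in for the paper's sentence-embedding comparison (here it is an injected callable, since the exact embedding model is not specified in the summary).

```python
def span_overlap(a, b):
    """Fractional character overlap of two (start, end) spans,
    relative to the shorter span (an assumption of this sketch)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    shorter = min(a[1] - a[0], b[1] - b[0])
    return inter / shorter if shorter else 0.0

def merge_annotations(spans_a, spans_b, similarity,
                      overlap_thresh=0.5, sim_thresh=0.75):
    """Merge entity annotations from two teacher LLMs.

    spans_a, spans_b: lists of ((start, end), label) tuples.
    similarity: callable returning the semantic similarity of two
    label strings (a stand-in for sentence-embedding cosine similarity).
    Overlapping, semantically similar spans get concatenated labels
    (e.g. "person / human"); unmatched spans are kept as-is.
    """
    merged = list(spans_a)
    for span_b, label_b in spans_b:
        matched = False
        for i, (span_a, label_a) in enumerate(merged):
            if span_overlap(span_a, span_b) >= overlap_thresh:
                matched = True
                if (similarity(label_a, label_b) > sim_thresh
                        and label_b not in label_a.split(" / ")):
                    merged[i] = (span_a, f"{label_a} / {label_b}")
                break
        if not matched:
            merged.append((span_b, label_b))
    return merged
```

Processing the second model's spans one at a time against the running merged list mirrors the sequential handling described above, which avoids one bad match propagating into later decisions.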

The paper provides extensive evaluation to validate the dataset’s utility. Downstream Performance: Models (specifically the Binder architecture) fine-tuned on subsets of FiNERweb (English, Swahili, Thai) achieve comparable or improved zero-shot transfer performance on standard benchmarks (CoNLL, ThaiNER, HA NER) against strong baselines like GLiNER-multi-v2.1 and GLiNER-X, despite using up to 19x less training data. Annotation Quality: Using an LLM-as-a-judge approach (Qwen3-235B), the annotations are evaluated for faithfulness (3.99/5) and completeness (4.05/5), indicating high quality. Label Language Effect: A notable analysis reveals that the performance of current state-of-the-art models drops by 0.02 to 0.09 F1 when evaluated using target-language labels instead of English ones, underscoring the value of FiNERweb’s inclusion of both label sets and highlighting a current limitation of multilingual NER models.
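The F1 numbers above (the passage-selection regressor's 84+ F1 and the downstream NER scores) all reduce to the same precision/recall computation. A minimal sketch for the binary passage-usefulness case, assuming the 1-4 regression score is thresholded at 3 as described in the pipeline (the function name and threshold handling are illustrative, not the authors' evaluation code):

```python
def binary_f1(y_true, y_pred, threshold=3.0):
    """Score a 1-4 passage-usefulness regressor as a binary
    classifier: scores >= threshold count as NER-useful."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        true_pos = t >= threshold
        pred_pos = p >= threshold
        if pred_pos and true_pos:
            tp += 1
        elif pred_pos:
            fp += 1
        elif true_pos:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

The same harmonic mean of precision and recall underlies the span-level NER scores reported against the GLiNER baselines, with true/false positives counted over predicted entity spans instead of passages.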

In conclusion, the authors release the complete FiNERweb dataset, the trained regression models, and all creation artifacts to the community. FiNERweb represents a significant step towards scalable, high-quality resource creation for multilingual NER by systematically leveraging the synthetic data generation capabilities of modern LLMs within a robust pipeline. It facilitates more effective knowledge distillation from large teachers to efficient student models across a wide array of languages.

