Automatic register identification for the open web using multilingual deep learning
This article presents multilingual deep learning models for identifying web registers – text varieties such as news reports and discussion forums – across 16 languages. We introduce the Multilingual CORE corpora, which contain over 72,000 documents annotated with a hierarchical taxonomy of 25 registers designed to cover the entire open web. Using multi-label classification, our best model achieves 79% F1 averaged across languages, matching or exceeding previous studies that used simpler classification schemes. This demonstrates that models can perform well even with a complex register scheme at multilingual scale. However, we observe a consistent performance ceiling across all models and configurations. When we remove documents with uncertain labels through data pruning, performance increases to over 90% F1, suggesting that this ceiling stems from inherent ambiguity in web registers rather than model limitations. Analysis of hybrid texts (those combining multiple registers) reveals that the main challenge lies not in classifying hybrids themselves, but in distinguishing hybrid from non-hybrid documents. Multilingual models consistently outperform monolingual ones, particularly for languages with limited training data. Zero-shot performance on unseen languages drops by an average of 7%, though this varies by language (3–8%), indicating that while registers share features across languages, they also retain language-specific characteristics.
💡 Research Summary
The paper tackles the problem of automatically identifying web registers—situational text varieties such as news reports, discussion forums, tutorials, and machine‑generated content—across a broad multilingual landscape. Recognizing that curating training data for large‑scale language models benefits from metadata about the contexts in which texts are used, the authors construct the Multilingual CORE corpora, a new resource comprising 72,504 documents in 16 typologically diverse languages, each annotated with a hierarchical 25‑class register taxonomy (derived from the original CORE scheme but expanded to include sub‑registers).
Methodologically, the authors adopt a multi‑label classification framework, reflecting the reality that many web documents blend multiple registers (so‑called hybrids). They train transformer‑based models, primarily XLM‑R Large, on the first 512 tokens of each document (truncating longer texts) and evaluate using micro‑averaged F1. Experiments are conducted on five large training languages (English, Finnish, French, Swedish, Turkish) and eleven smaller evaluation languages (including Farsi, Japanese, Norwegian).
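The micro‑averaged F1 used for evaluation pools true positives, false positives, and false negatives over all register labels and documents before computing precision and recall, which is why it suits multi‑label settings with hybrid texts. A minimal sketch (the label sets below are hypothetical examples, not drawn from the corpus):

```python
def micro_f1(gold: list[set[str]], pred: list[set[str]]) -> float:
    """Micro-averaged F1 over multi-label predictions:
    counts are pooled across all documents and labels."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)  # labels correctly predicted
        fp += len(p - g)  # spurious labels
        fn += len(g - p)  # missed labels
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Hypothetical documents: a plain news report, and a hybrid
# news+opinion text for which the model misses one label.
gold = [{"news"}, {"news", "opinion"}]
pred = [{"news"}, {"news"}]
print(micro_f1(gold, pred))  # ≈ 0.8 (precision 1.0, recall 2/3)
```

Because counts are pooled globally, frequent registers dominate the score, which matches the paper's document-level evaluation across an imbalanced taxonomy.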
Key results: the multilingual model achieves an average 79% F1 across all languages, matching or surpassing prior work that used simpler 9‑class schemes (X‑GENRE). A consistent performance ceiling is observed; however, when documents with uncertain or ambiguous labels are pruned, F1 rises above 90%, indicating that the ceiling stems from label noise and inherent register ambiguity rather than model capacity. Hybrid documents constitute roughly a quarter of the data; the main difficulty lies in distinguishing hybrid from non‑hybrid texts, not in classifying the specific hybrid type.
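The pruning step described above can be approximated as filtering out documents whose predicted register probabilities sit near the decision boundary. A hedged sketch of this idea (the threshold band, function name, and data are illustrative assumptions, not the paper's exact procedure):

```python
def prune_uncertain(doc_ids, probs, low=0.35, high=0.65):
    """Keep only documents for which every register probability is
    confidently on or off, i.e. outside the [low, high] ambiguity band.
    `doc_ids` is a list of ids; `probs` is a parallel list of
    per-register probability dicts from a trained classifier."""
    kept = []
    for doc_id, p in zip(doc_ids, probs):
        if all(v <= low or v >= high for v in p.values()):
            kept.append(doc_id)
    return kept

doc_ids = ["d1", "d2", "d3"]
probs = [
    {"news": 0.95, "opinion": 0.02},  # confident: kept
    {"news": 0.55, "opinion": 0.48},  # ambiguous: pruned
    {"news": 0.10, "opinion": 0.90},  # confident: kept
]
print(prune_uncertain(doc_ids, probs))  # ['d1', 'd3']
```

Evaluating only on the kept subset is how pruning can lift F1 above 90% without changing the model: the discarded documents are exactly those whose register membership is inherently borderline.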
Comparisons reveal that multilingual training consistently outperforms monolingual baselines, especially for low‑resource languages. Zero‑shot evaluation on unseen languages incurs an average 7% drop (range 3–8%), confirming that while registers share cross‑lingual features, language‑specific nuances remain. Among the tested encoders (XLM‑R Base, mBERT, etc.), XLM‑R Large offers the best trade‑off between speed and accuracy.
The authors enumerate several contributions: (1) release of the Multilingual CORE corpora and the best‑performing multilingual classifier; (2) demonstration that a detailed 25‑class hierarchical scheme can be learned as effectively as a coarse 9‑class scheme; (3) empirical evidence that label uncertainty is the primary bottleneck, not model architecture; (4) validation that multilingual deep learning benefits register identification, even for languages with limited training data.
Limitations include the reliance on the first 512 tokens (potentially discarding useful information), the ongoing challenge of accurately labeling hybrid documents, and modest zero‑shot performance for completely unseen languages. Future work is suggested in areas such as full‑document processing via sliding windows, prototype‑based modeling of register continua, and integration of register metadata into data‑curation pipelines for large language model training.
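The sliding‑window idea mentioned as future work would split a long token sequence into overlapping fixed‑size chunks so that material beyond the first 512 tokens is not simply discarded; per‑chunk predictions could then be pooled into a document‑level label set. A minimal sketch (the window and stride values are illustrative assumptions):

```python
def sliding_windows(tokens, window=512, stride=256):
    """Yield overlapping chunks of at most `window` tokens,
    advancing by `stride` tokens each step, so that every token
    appears in at least one chunk."""
    if len(tokens) <= window:
        yield tokens
        return
    for start in range(0, len(tokens) - stride, stride):
        yield tokens[start:start + window]

# A 1000-token document becomes three overlapping chunks.
chunks = list(sliding_windows(list(range(1000)), window=512, stride=256))
print([len(c) for c in chunks])  # [512, 512, 488]
```

With a 50% overlap (stride = window / 2), each chunk shares half its context with its neighbor, which is a common compromise between coverage and compute.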
Overall, the study provides a robust, scalable solution for web register identification at multilingual scale, offering valuable resources and insights for both corpus linguistics and NLP practitioners interested in building more balanced and context‑aware web corpora.