Mind the Gap: Assessing Wiktionary's Crowd-Sourced Linguistic Knowledge on Morphological Gaps in Two Related Languages
Morphological defectivity is an intriguing and understudied phenomenon in linguistics. Addressing defectivity, where expected inflectional forms are absent, is essential for improving the accuracy of NLP tools in morphologically rich languages. However, traditional linguistic resources often lack coverage of morphological gaps, as such knowledge requires significant human expertise and effort to document and verify. For scarce linguistic phenomena in under-explored languages, Wikipedia and Wiktionary are often among the few accessible resources. Despite their extensive reach, their reliability has been a subject of controversy. This study customizes a novel neural morphological analyzer to annotate Latin and Italian corpora. Using this large body of annotated data, crowd-sourced lists of defective verbs compiled from Wiktionary are validated computationally. Our results indicate that while Wiktionary provides a highly reliable account of Italian morphological gaps, 7% of Latin lemmata listed as defective show strong corpus evidence of being non-defective. This discrepancy highlights potential limitations of crowd-sourced wikis as definitive sources of linguistic knowledge, particularly for less-studied phenomena and languages, despite their value as resources for rare linguistic features. By providing scalable tools and methods for quality assurance of crowd-sourced data, this work advances computational morphology and expands linguistic knowledge of defectivity in non-English, morphologically rich languages.
💡 Research Summary
The paper tackles the under‑explored phenomenon of morphological defectivity – the systematic absence or extreme rarity of expected inflectional forms – and investigates how reliable crowd‑sourced resources are for documenting such gaps. The authors focus on Latin and Italian, two morphologically rich languages for which Wiktionary hosts relatively extensive lists of "defective" verbs. To evaluate the accuracy of these lists, they first build a state‑of‑the‑art neural morphological analyzer by fine‑tuning the UDTube architecture with a multilingual BERT (mBERT) encoder on the largest available Universal Dependencies treebanks for each language (UD Latin ITTB and UD Italian VIT). Hyper‑parameter optimization via Weights & Biases yields very high tagging accuracies (98 % for Latin, 96 % for Italian).
The trained analyzer is then applied to massive raw corpora extracted from the Common Crawl (≈390 M tokens for Latin, ≈5 B tokens for Italian). After preprocessing with UDPipe, each token is annotated with lemma and a full set of morphosyntactic features, producing a CoNLL‑U‑style corpus and a frequency database that records how often each possible inflected form occurs.
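The frequency database described above can be sketched as a straightforward count over the CoNLL-U output. The sketch below is a minimal illustration, not the authors' exact pipeline; it assumes standard 10-field CoNLL-U token lines and tracks the three counts the validation step later needs (per inflected form, per lemma, and per feature bundle):

```python
from collections import Counter

def build_frequency_db(conllu_lines):
    """Count (lemma, feature-bundle) pairs from CoNLL-U annotated lines.

    Each CoNLL-U token line has 10 tab-separated fields; here we use
    LEMMA (field index 2) and FEATS (field index 5).
    """
    form_counts = Counter()   # (lemma, feats) -> occurrences of that inflected form
    lemma_counts = Counter()  # lemma -> total occurrences across all forms
    feats_counts = Counter()  # feature bundle -> total occurrences across all lemmas
    total = 0
    for line in conllu_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 10 or not fields[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        lemma, feats = fields[2], fields[5]
        form_counts[(lemma, feats)] += 1
        lemma_counts[lemma] += 1
        feats_counts[feats] += 1
        total += 1
    return form_counts, lemma_counts, feats_counts, total
```

At Common Crawl scale these counters would of course live in an on-disk store rather than in memory, but the aggregation logic is the same.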
For validation, the authors adopt the principle of Indirect Negative Evidence (INE) from language acquisition theory: if a form is truly defective, it should be absent or occur only extremely rarely in natural usage. They operationalize this with two statistical measures: (1) absolute frequency – forms occurring ≤10 times are deemed “likely defective”; (2) log‑odds ratio – the observed probability of a form (p_w) is compared to the product of its lemma probability (p_l) and feature‑bundle probability (p_f). A log‑odds ratio ≥ 1.9 is taken as a “large divergence”, indicating the form is far more frequent than expected and therefore unlikely to be defective.
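The two criteria can be made concrete with a short sketch. This is one plausible reading of the measures as summarized above (the exact formula and log base are assumptions): the log-odds ratio compares the observed form probability p_w against the independence expectation p_l · p_f, so a large positive value means the form occurs far more often than chance would predict.

```python
import math

ABS_FREQ_THRESHOLD = 10   # forms occurring at or below this count: "likely defective"
LOG_ODDS_THRESHOLD = 1.9  # divergence at or above this: "unlikely to be defective"

def log_odds_ratio(form_count, lemma_count, feats_count, total):
    """log( p_w / (p_l * p_f) ): observed form probability vs. the
    product of lemma and feature-bundle probabilities (natural log assumed)."""
    p_w = form_count / total
    p_l = lemma_count / total
    p_f = feats_count / total
    return math.log(p_w / (p_l * p_f))

def classify(form_count, lemma_count, feats_count, total):
    """Apply the two INE criteria in order: rarity first, then divergence."""
    if form_count <= ABS_FREQ_THRESHOLD:
        return "likely defective"
    if log_odds_ratio(form_count, lemma_count, feats_count, total) >= LOG_ODDS_THRESHOLD:
        return "likely not defective"
    return "inconclusive"
```

For example, a form attested 846 times whose lemma and feature bundle would jointly predict only a handful of occurrences yields a ratio well above 1.9, flagging the lemma as a probable false positive on the defectivity list.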
Applying these criteria to the Wiktionary‑derived lists (1,190 Latin verbs, 124 Italian verbs) yields the following results:
- Latin – 88 % of the listed lemmas appear in the corpus. Of those, 67 % satisfy the “likely defective” condition, while about 7 % show high absolute frequencies or large log‑odds, suggesting they are not defective (e.g., excommunico has a perfect form occurring 846 times). Some lemmas are not attested at all, reflecting either true rarity or gaps in the corpus.
- Italian – 83 % of the lemmas are attested. 79 % meet the defectivity criteria, and only 4 % are flagged as likely non‑defective (e.g., consumere and concernere appear frequently).
The discrepancy between the two languages is interpreted as a consequence of (1) the larger contemporary speaker base and richer modern corpora for Italian, which provide clearer statistical signals, and (2) Latin’s more complex inflectional paradigm combined with the historical nature of its texts, which makes automatic analysis more error‑prone and leaves many rare forms under‑represented.
The authors acknowledge several limitations: mBERT may not be optimal for Latin/Italian, and the corpora lack balanced representation of all historical periods and registers, potentially biasing frequency estimates. They suggest future work with newer multilingual models (e.g., XLM‑RoBERTa), more diverse corpora, and dynamic thresholding tailored to each language’s frequency distribution.
Beyond defectivity, the proposed pipeline could be extended to other morphological phenomena (e.g., irregular noun declensions, adjective agreement patterns) and to additional languages, offering a scalable method for quality‑checking crowd‑sourced linguistic resources.
In conclusion, the study demonstrates that Wiktionary, despite being a user‑generated platform, provides a surprisingly reliable source of morphological gap information, especially for Italian. For Latin, the error rate is higher but still modest, and can be mitigated through computational validation. By bridging computational morphology, large‑scale corpus analysis, and language‑acquisition theory, the paper offers a robust framework for assessing and improving the trustworthiness of crowd‑sourced linguistic data, with direct implications for downstream NLP applications such as morphological tagging, parsing, and machine translation in morphologically rich languages.