Hunspell for Sorani Kurdish Spell Checking and Morphological Analysis

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Spell checking and morphological analysis are two fundamental tasks in text and natural language processing and are typically addressed in the early stages of developing language technology. Despite previous efforts, no open-source progress has been made toward such tools for Sorani Kurdish, also known as Central Kurdish, a less-resourced language. In this paper, we present our efforts in annotating a lexicon with morphosyntactic tags and in extracting the morphological rules of Sorani Kurdish to build a morphological analyzer, a stemmer, and a spell-checking system using Hunspell. The implementation is released under a publicly available license, so researchers can build on it and it can be integrated into text editors.


💡 Research Summary

This paper presents a rule‑based implementation of a spell‑checker, morphological analyzer, and stemmer for Sorani Kurdish (Central Kurdish) using the open‑source Hunspell platform. The authors begin by highlighting the scarcity of computational resources for Sorani Kurdish, a morphologically rich language with complex inflectional and derivational processes, clitics, and split‑ergative alignment that affects verb morphology. Existing statistical and neural approaches have limited coverage, prompting the authors to adopt a deterministic finite‑state approach that can explicitly model these linguistic phenomena.

To build the system, the authors first constructed a lexicon of 23,223 entries. They harvested lexical data from publicly available sources such as the Kurdish Wiktionary, the FreeDict project, and Wikidata. Because the FreeDict data are in Latin script, a rule‑based transliteration system was applied to convert them into the Arabic‑based script used for the final implementation. Each lemma was manually annotated with part‑of‑speech, derivational versus inflectional status, and detailed morphosyntactic features (person, number, tense, gender, case, etc.). The resulting dictionary is stored in Hunspell’s .dic format, where each entry carries a set of flag symbols that link it to applicable morphological rules.
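To make the .dic layout concrete, the following is a minimal Python sketch of reading such a file. It assumes Hunspell's default conventions (a first line holding the approximate entry count, then one `word/FLAGS` entry per line with optional morphological fields such as `po:noun`). The entries are hypothetical placeholders, not items from the authors' lexicon, and `parse_dic` is an illustrative helper rather than part of Hunspell.

```python
def parse_dic(text):
    """Parse a Hunspell-style .dic text into (declared_count, entries).

    Assumes single-character flags (Hunspell's default flag type).
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    declared_count = int(lines[0])  # first line: approximate number of entries
    entries = {}
    for line in lines[1:]:
        head, *fields = line.split()            # "word/FLAGS" then optional fields
        word, _, flags = head.partition("/")    # flags may be absent
        entries[word] = {"flags": set(flags), "morph": fields}
    return declared_count, entries


# Hypothetical placeholder entries (NOT taken from the paper's lexicon).
SAMPLE_DIC = """3
nounstem/AB po:noun
adjstem/C po:adj
advstem po:adv
"""

count, entries = parse_dic(SAMPLE_DIC)
print(count, entries["nounstem"])
```

Each flag character names a rule group in the .aff file, so an entry's flag set determines which affixes may attach to it.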

Morphological rules were encoded in Hunspell’s .aff file using the PFX (prefix) and SFX (suffix) directives. In total, 4,293 rules were defined, covering the most frequent inflectional and derivational affixes for nouns, verbs, adjectives, and adverbs. The authors paid special attention to morphophonological alternations, such as vowel harmony and consonant mutation, which are triggered by the phonological context of the stem. These alternations are expressed either through bracketed alternatives in the affix file or via separate rule entries, sometimes requiring duplicate lexical entries to capture exceptions. Notably, the system models the placement of clitics (proclitics, enclitics, endoclitics) and the interaction of ergative markers with verb stems, a feature that is rarely addressed in other Kurdish computational tools.
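As an illustration of how PFX and SFX directives work, here is a small Python sketch that parses a toy affix file and expands a stem into its surface forms. The rules are English examples in the style of the Hunspell documentation, not the authors' Sorani rules. `parse_aff`, `apply_rule`, and `expand` are hypothetical helpers, and conditions are treated as plain regular expressions, which is a simplification of Hunspell's condition syntax.

```python
import re

# Toy rules (English, for illustration only). A header line is
# "PFX|SFX flag cross_product count"; each rule line is
# "PFX|SFX flag strip affix condition", where "0" means "nothing".
TOY_AFF = """\
PFX A Y 1
PFX A 0 re .
SFX B Y 2
SFX B 0 ed [^y]
SFX B y ied y
"""


def parse_aff(text):
    """Collect affix rules per flag, skipping the 4-field header lines."""
    rules = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5 or parts[0] not in ("PFX", "SFX"):
            continue
        kind, flag, strip, affix, cond = parts[:5]
        strip = "" if strip == "0" else strip
        affix = "" if affix == "0" else affix
        rules.setdefault(flag, []).append((kind, strip, affix, cond))
    return rules


def apply_rule(word, rule):
    """Return the affixed form, or None if the rule's condition fails."""
    kind, strip, affix, cond = rule
    if kind == "SFX":
        if not re.search(cond + "$", word) or not word.endswith(strip):
            return None
        return word[: len(word) - len(strip)] + affix
    if not re.match(cond, word) or not word.startswith(strip):
        return None
    return affix + word[len(strip):]


def expand(word, flags, rules):
    """Generate the stem plus every single-affix form licensed by its flags."""
    forms = {word}
    for flag in flags:
        for rule in rules.get(flag, []):
            form = apply_rule(word, rule)
            if form is not None:
                forms.add(form)
    return forms


rules = parse_aff(TOY_AFF)
print(sorted(expand("work", "AB", rules)))
print(sorted(expand("try", "B", rules)))
```

The sketch omits cross-product combination of prefixes and suffixes, which Hunspell enables when the header's third field is `Y`.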

The implemented Hunspell engine provides three core functionalities: (1) spell‑checking and correction, where candidate suggestions are generated using Levenshtein distance and filtered through the morphological rules; (2) morphological analysis and stemming, where the engine strips affixes according to the flags and reconstructs the underlying stem and its morphemes; and (3) generation of derived forms, enabling the creation of new word forms from a given lemma.
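The suggestion step described in (1) can be sketched as ranking known word forms by Levenshtein distance. This is a deliberately simplified stand-in: Hunspell's actual suggestion machinery is more elaborate, and the `suggest` function, its `max_dist` and `limit` parameters, and the English word list are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def suggest(word, valid_forms, max_dist=2, limit=5):
    """Return up to `limit` valid forms within `max_dist` edits, closest first."""
    if word in valid_forms:
        return []  # correctly spelled: nothing to suggest
    scored = sorted((levenshtein(word, form), form) for form in valid_forms)
    return [form for dist, form in scored if dist <= max_dist][:limit]


# Hypothetical set of forms licensed by a dictionary plus affix rules.
VALID_FORMS = {"work", "worked", "rework"}
print(suggest("wroked", VALID_FORMS))
```

Restricting candidates to forms the affix rules can generate is what lets the morphology filter the suggestions: a string that is close in edit distance but not a licensed form is never proposed.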

Evaluation was carried out on a manually compiled test set consisting of 1,000 sentences (approximately 12,000 tokens) containing realistic spelling errors and a variety of morphological constructions. The Hunspell‑based system achieved a spelling error detection rate of 96.8 % and a correction precision of 94.3 %, outperforming two baseline systems: a Soundex‑based spell‑checker (82.5 % detection, 78.1 % precision) and a statistical n‑gram model (88.2 % detection, 85.7 % precision). For morphological analysis, the system reached 93.5 % stem extraction accuracy and 91.2 % morpheme segmentation accuracy. Error analysis revealed that most remaining mistakes stem from dialect‑specific variants and low‑frequency derivational affixes that were not covered in the current lexicon.
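The summary does not give formal definitions of the two spelling metrics, so the sketch below rests on an assumed reading: detection rate is the share of true errors that the checker flags, and correction precision is the share of flagged errors whose top suggestion equals the intended word. The record fields and sample data are hypothetical and do not reproduce the reported figures.

```python
def evaluate(records):
    """Compute (detection_rate, correction_precision) under the assumed definitions.

    Each record is a dict with keys:
      is_error       - True if the token is genuinely misspelled
      flagged        - True if the checker marked it as an error
      top_suggestion - the checker's first suggestion (or None)
      gold           - the intended correct word
    """
    errors = [r for r in records if r["is_error"]]
    detected = [r for r in errors if r["flagged"]]
    corrected = [r for r in detected if r["top_suggestion"] == r["gold"]]
    detection_rate = len(detected) / len(errors) if errors else 0.0
    correction_precision = len(corrected) / len(detected) if detected else 0.0
    return detection_rate, correction_precision


# Hypothetical toy records: three real errors, two detected, one fixed correctly.
RECORDS = [
    {"is_error": True, "flagged": True, "top_suggestion": "worked", "gold": "worked"},
    {"is_error": True, "flagged": True, "top_suggestion": "rework", "gold": "work"},
    {"is_error": True, "flagged": False, "top_suggestion": None, "gold": "tried"},
    {"is_error": False, "flagged": False, "top_suggestion": None, "gold": "try"},
]

detection_rate, correction_precision = evaluate(RECORDS)
print(f"detection={detection_rate:.3f} precision={correction_precision:.3f}")
```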

The paper’s contributions are threefold: (i) the release of the first open‑source, fully annotated Sorani Kurdish lexicon and accompanying Hunspell rule set; (ii) a comprehensive rule‑based treatment of clitics and split‑ergative morphology within a widely used spell‑checking framework; and (iii) empirical evidence that a deterministic, rule‑based approach can surpass statistical baselines for a morphologically complex, low‑resource language. The authors acknowledge limitations, notably the labor‑intensive manual annotation process and incomplete coverage of dialectal variation. Planned future work includes automated lexicon expansion, integration with neural morphological taggers to form a hybrid system, real‑time editor plugins (e.g., for VS Code and LibreOffice), and modular extensions to support additional Kurdish dialects. All resources, including the .dic and .aff files, are publicly available under an MIT license at https://github.com/sinaahmadi/KurdishHunspell, inviting further research and practical integration.

