Shona spaCy: A Morphological Analyzer for an Under-Resourced Bantu Language

📝 Original Info

  • Title: Shona spaCy: A Morphological Analyzer for an Under-Resourced Bantu Language
  • ArXiv ID: 2511.16680
  • Date: 2025-11-24
  • Authors: Researchers from original ArXiv paper

📝 Abstract

Despite rapid advances in multilingual natural language processing (NLP), the Bantu language Shona remains under-served in terms of morphological analysis and language-aware tools. This paper presents Shona spaCy, an open-source, rule-based morphological pipeline for Shona built on the spaCy framework. The system combines a curated JSON lexicon with linguistically grounded rules to model noun-class prefixes (Mupanda 1-18), verbal subject concords, tense-aspect markers, ideophones, and clitics, integrating these into token-level annotations for lemma, part-of-speech, and morphological features. The toolkit is available via pip install shona-spacy, with source code at https://github.com/HappymoreMasoka/shona-spacy and a PyPI release at https://pypi.org/project/shona-spacy/0.1.4/. Evaluation on formal and informal Shona corpora yields 90% POS-tagging accuracy and 88% morphological-feature accuracy, while maintaining transparency in its linguistic decisions. By bridging descriptive grammar and computational implementation, Shona spaCy advances NLP accessibility and digital inclusion for Shona speakers and provides a template for morphological analysis tools for other under-resourced Bantu languages.

📄 Full Content

The development of robust Natural Language Processing (NLP) systems has advanced rapidly over the past decade, yet many African languages remain underrepresented in both datasets and computational tools (Orife et al., 2020; Emezue & Dossou, 2021). Among these, Shona, a major Bantu language spoken by over 10 million people in Zimbabwe and neighboring regions, exemplifies the challenges of low-resource morphology-rich languages. While Shona has a well-documented grammatical tradition (Fortune, 1984; Hannan, 1984; Chimhundu, 2001), its computational modeling remains severely limited, with few available tokenizers, part-of-speech (POS) taggers, or morphological analyzers tailored to its linguistic structure.

Shona's agglutinative morphology and extensive noun class system complicate the direct application of models designed for English or other high-resource languages. A single Shona verb can encode multiple grammatical features such as tense, subject, object, and polarity. For instance, ndichakupai decomposes into ndi- (subject: I) + -cha- (future tense) + kupa (give) + -i (object: you [plural]) → "I will give you." Conventional tokenizers trained on Indo-European languages typically segment this incorrectly (e.g., ["ndi", "cha", "kup", "ai"]), resulting in the loss of essential syntactic and semantic information.

Further complicating Shona NLP is the prevalence of code-mixing and slang, especially in digital communication. Everyday speech frequently blends Shona with English, producing expressions such as "Ndiri kufara big time!" ("I'm super happy") or "Mhoro bro, wakadini zvako?" ("Hello bro, how are you?"). Most multilingual models, even those trained on large corpora, misclassify such utterances because they lack sensitivity to morphological variation, non-standard orthography, and informal syntax (Masoka, 2025).
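The decomposition of ndichakupai given above can be illustrated with a small rule-based segmenter. This is a minimal sketch, not the shona-spacy implementation: the lookup tables, the `VerbAnalysis` type, and the `analyze_verb` function are hypothetical names, and the tables hold only the morphemes named in the example.

```python
from typing import NamedTuple, Optional

# Tiny illustrative tables (assumed for this sketch); a real analyzer
# would load much larger inventories from its lexicon.
SUBJECT_CONCORDS = {"ndi": "1sg"}
TENSE_MARKERS = {"cha": "future"}
OBJECT_SUFFIXES = {"i": "2pl"}
STEMS = {"kupa": "give"}


class VerbAnalysis(NamedTuple):
    subject: str
    tense: str
    stem: str
    obj: Optional[str]


def analyze_verb(word: str) -> Optional[VerbAnalysis]:
    """Peel off subject concord, tense marker, and optional suffix."""
    word = word.lower()
    for concord, subject in SUBJECT_CONCORDS.items():
        if not word.startswith(concord):
            continue
        rest = word[len(concord):]
        for marker, tense in TENSE_MARKERS.items():
            if not rest.startswith(marker):
                continue
            core = rest[len(marker):]
            # Case 1: the remainder is a known stem with no suffix.
            if core in STEMS:
                return VerbAnalysis(subject, tense, core, None)
            # Case 2: the remainder is a known stem plus a known suffix.
            for suffix, obj in OBJECT_SUFFIXES.items():
                stem = core[: -len(suffix)]
                if core.endswith(suffix) and stem in STEMS:
                    return VerbAnalysis(subject, tense, stem, obj)
    return None  # no rule matched; the caller decides how to fall back


print(analyze_verb("ndichakupai"))
```

Because each morpheme is matched against an explicit table, every analysis can be traced back to the rule that produced it, which is the kind of transparency a rule-based design aims for.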

Although a Shona Universal Dependencies Treebank exists, it primarily covers formal written text (≈7,000 sentences) and lacks annotations for slang, morphologically complex verbs, and entity marking (e.g., muHarare → Harare [LOC]). This scarcity of annotated data inhibits the development of high-accuracy downstream tasks such as Named Entity Recognition (NER), syntactic parsing, and machine translation. The resulting gap contributes to digital exclusion, in which native speakers are underrepresented in conversational AI, educational platforms, and translation systems.
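The entity-marking example (muHarare → Harare [LOC]) can be sketched as a prefix rule. The snippet below is an illustration under stated assumptions, not the toolkit's code: it assumes the locative prefixes mu-, ku-, and pa- attach directly to a capitalized place name, and the function name `split_locative` is hypothetical.

```python
import re
from typing import Optional, Tuple

# Locative prefix followed directly by a capitalized name. The
# capitalization requirement is an assumption of this sketch; it keeps
# ordinary nouns such as "munhu" from being split.
_LOCATIVE = re.compile(r"^([Mm]u|[Kk]u|[Pp]a)([A-Z]\w+)$")


def split_locative(token: str) -> Optional[Tuple[str, str, str]]:
    """Return (prefix, place name, label), or None if the rule does not apply."""
    match = _LOCATIVE.match(token)
    if match is None:
        return None
    return match.group(1).lower(), match.group(2), "LOC"


print(split_locative("muHarare"))
```

A rule this simple will miss lowercase or misspelled place names in informal text, so in practice it would need a gazetteer or lexicon lookup behind it.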

In earlier work, Masoka (2025) addressed this gap through “Advancing Conversational AI with Shona Slang”, which proposed a hybrid dataset and transformer-based model to improve NLP robustness across formal and informal Shona. That research introduced the ShonaNLP v1 toolkit, which integrated morphology-aware tokenization, rule-based lemmatization, and POS tagging across formal and slang domains. While the results demonstrated improved performance in slang-influenced text, the system lacked a dedicated morphological analyzer capable of representing Shona’s noun class structure, verbal extensions, and derivational morphology in a linguistically interpretable format. This limitation motivated the development of Shona spaCy, a rule-based morphological pipeline integrated into the spaCy ecosystem. Unlike purely neural approaches, this model encodes the grammatical logic of Shona morphology through explicit linguistic rules and a structured lexical JSON database, supporting transparent feature extraction and interpretability.
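To make the idea of a structured lexical JSON database concrete, the following sketch shows one way a lexicon entry could drive token-level annotations for lemma, POS, and morphological features. The JSON schema, the `annotate` function, and the feature names are assumptions made for illustration; the actual shona-spacy lexicon format is not described in this excerpt.

```python
import json
from typing import Dict, List

# Hypothetical lexicon schema: surface form -> lemma, POS, features.
LEXICON_JSON = """
{
  "munhu": {"lemma": "munhu", "pos": "NOUN",
            "feats": {"NounClass": "Bantu1", "Number": "Sing"}},
  "vanhu": {"lemma": "munhu", "pos": "NOUN",
            "feats": {"NounClass": "Bantu2", "Number": "Plur"}}
}
"""
LEXICON: Dict[str, dict] = json.loads(LEXICON_JSON)


def annotate(tokens: List[str]) -> List[dict]:
    """Attach lemma, POS, and a feature string to each token."""
    annotated = []
    for token in tokens:
        entry = LEXICON.get(token.lower())
        if entry is None:
            # Unknown word: keep the surface form and mark it as unanalyzed.
            annotated.append({"text": token, "lemma": token, "pos": "X", "feats": "_"})
            continue
        feats = "|".join(f"{k}={v}" for k, v in sorted(entry["feats"].items()))
        annotated.append(
            {"text": token, "lemma": entry["lemma"], "pos": entry["pos"], "feats": feats}
        )
    return annotated


for row in annotate(["Vanhu", "xyz"]):
    print(row)
```

In a spaCy pipeline the same lookup would typically sit inside a custom component that writes to each token's lemma, POS, and morphology attributes; that wiring is omitted here to keep the sketch dependency-free.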

This paper builds upon the foundation established in Masoka (2025) by extending Shona NLP beyond tokenization and tagging toward deep morphological parsing and grammatical reasoning. Specifically, the objectives are to:

  1. Develop a hybrid rule-based morphological analyzer for Shona integrated into the spaCy framework.

  2. Model Shona’s noun class system, verbal morphology, clitics, and ideophones using linguistic principles from Bantu morphology.

  3. Evaluate the analyzer on both formal and informal Shona datasets for accuracy, coverage, and interpretability.

  4. Release the analyzer as an open-source Python package (shona-spacy) and document it for integration in downstream NLP applications.
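The accuracy and coverage criteria in objective 3 can be computed at the token level along the following lines. This is a generic sketch, not the paper's evaluation script: the function names are hypothetical, the label sequences are made-up toy data, and treating the tag "X" as "unanalyzed" is an assumption.

```python
from typing import Sequence


def token_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of tokens whose predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def coverage(predicted: Sequence[str], unknown_label: str = "X") -> float:
    """Fraction of tokens that received any analysis at all."""
    if not predicted:
        return 0.0
    analyzed = sum(p != unknown_label for p in predicted)
    return analyzed / len(predicted)


# Toy labels for illustration only; these are not results from the paper.
gold_tags = ["NOUN", "VERB", "NOUN", "ADV"]
predicted_tags = ["NOUN", "VERB", "X", "ADV"]
print(token_accuracy(gold_tags, predicted_tags), coverage(predicted_tags))
```

Reporting coverage alongside accuracy separates two failure modes of a rule-based system: words the rules never match, and words the rules match incorrectly.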

The Shona spaCy project represents the first open-source morphological analyzer for Shona, combining linguistic fidelity with computational scalability. Its design contributes to:

• Digital inclusion, by enabling Shona speakers to access language-aware AI systems;

• Linguistic preservation, through explicit encoding of Shona grammar in machine-readable form; and

• Cross-lingual transfer, providing a blueprint for morphological analyzers in related Bantu languages (e.g., Ndebele, Kalanga, Swahili).

By embedding indigenous linguistic structures within modern NLP frameworks, this work advances the broader goal of computational decolonization: creating AI systems that reflect and respect African linguistic diversity.

2 Literature Review

The Shona language, a member of the Bantu family (Guthrie Zone S10), exhibits a rich

…(Full text truncated)…

Reference

This content is AI-processed based on ArXiv data.
