Two Nahuatl CFGs for automatic corpora expansion
The aim of this article is to introduce two Context-Free Grammars (CFGs) for Nawatl corpora expansion. Nawatl is an Amerindian language (a National Language of Mexico) of the π-language type, i.e. a language with few digital resources. For this reason the corpora available for training Large Language Models (LLMs) are virtually non-existent, which poses a significant challenge. The goal is to produce a substantial number of syntactically valid artificial Nawatl sentences and thereby expand the corpora available for learning non-contextual embeddings. To this end, we introduce two new Nawatl CFGs and use them in generative mode. With these grammars it is possible to expand the Nawatl corpus significantly, then to learn embeddings on the expanded corpus and to evaluate their relevance on a sentence semantic similarity task. The results show an improvement over those obtained with the original corpus alone, and also show that inexpensive embeddings often perform better than some LLMs.
💡 Research Summary
The paper addresses the critical shortage of digital resources for Nahuatl, an indigenous Mexican language spoken by roughly 1.65 million people and classified as a “π‑language” (a language with extremely limited corpora). To mitigate this scarcity, the authors design two context‑free grammars (CFGs), denoted µG NAW⊕0 and µG NAW⊕1, that generate syntactically valid Nahuatl sentences without relying on large annotated datasets.
µG NAW⊕0 is a minimalist micro‑grammar that models only noun phrases (N) and verb phrases (V). It excludes recursion, limits the system to the first three grammatical persons, singular present‑tense verbs, and a small set of lexical items. This design enables rapid generation of a large number of sentences while keeping the rule set tractable.
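The flat, recursion-free design can be sketched in a few lines of Python. The rule slots and lexicon below are illustrative stand-ins (not the paper's actual µG NAW⊕0 productions): person prefixes attach directly to the verb, reflecting Nahuatl's agglutinative morphology.

```python
import itertools

# Toy grammar in the spirit of µG NAW⊕0: flat, non-recursive,
# three persons, singular present tense only. The lexicon below is
# an illustrative assumption, not the paper's actual rule set.
PERSON = ["ni", "ti", ""]        # 1sg / 2sg / 3sg (zero marker)
VERBS  = ["kochi", "choka"]      # 'sleeps', 'cries' (intransitive)
NOUNS  = ["tlakatl", "siwatl"]   # 'person', 'woman'

def generate():
    """Enumerate every sentence the flat grammar derives.
    Enumeration terminates because there is no recursion."""
    for p, v in itertools.product(PERSON, VERBS):
        yield p + v                     # agglutinated verb form, e.g. 'nikochi'
    for n, v in itertools.product(NOUNS, VERBS):
        yield f"{v} {n}"                # verb-initial pattern

sentences = list(generate())
print(len(sentences))   # 3*2 + 2*2 = 10 distinct artificial sentences
```

Because the grammar has no recursive productions, the derivable set is finite and can be enumerated exhaustively rather than sampled.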
µG NAW⊕1 expands on the first grammar by incorporating a richer set of markers that reflect Nahuatl’s agglutinative nature: person markers (MV), object markers (MO), possessive markers (POS), temporal markers (MT), quantity markers for nouns (MCS), intensity markers for verbs (MIV), and place markers (ML). It also adds terminals for negation (NEG), adjectives (ADJ), nouns (N), and verbs (V). Like the first grammar, µG NAW⊕1 avoids recursive productions, but it captures a broader range of syntactic patterns, including the dominant VSO order and less frequent SV, VO, and VOS constructions.
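A minimal sketch of how such optional marker slots multiply coverage: each slot contributes a small set of prefixes, and the cross-product yields many distinct agglutinated verb forms from a handful of rules. The slot names follow the summary above, but the concrete prefixes and stems are illustrative assumptions, not the paper's lexicon.

```python
import itertools

# Marker slots in the spirit of µG NAW⊕1. Slot names follow the
# summary (NEG negation, MT temporal, MV person, MO object); the
# fillers are illustrative placeholders, not the paper's lexicon.
SLOTS = {
    "NEG": ["", "amo "],       # optional negation particle
    "MT":  ["", "o"],          # optional temporal (past) prefix
    "MV":  ["ni", "ti", ""],   # person marker: 1sg / 2sg / 3sg (zero)
    "MO":  ["", "k"],          # optional 3sg object marker
    "V":   ["kwa", "itta"],    # verb stems: 'eat', 'see'
}

# Cross-product of all slots: every combination is one surface form.
forms = ["".join(combo) for combo in itertools.product(*SLOTS.values())]
print(len(forms))   # 2*2*3*2*2 = 48 verb forms
```

This combinatorial growth is what lets a small, non-recursive rule set still cover a broad range of syntactic patterns.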
Both grammars are used in a generative pipeline to produce an artificial corpus called π‑YALL‑IA. The pipeline includes two crucial filtering stages: (1) a semantic filter that discards sentences that are grammatically correct but semantically implausible (e.g., “The big corn cob eats a rabbit”), and (2) a redundancy‑reduction mechanism that replaces repeated lexical variations with symbolic labels during generation, later substituting them stochastically with actual words. Additionally, a paragraph‑segmentation step inserts end‑of‑paragraph tags, turning a flat list of sentences into documents that more closely resemble authentic texts.
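The redundancy-reduction step described above can be sketched as template generation followed by stochastic substitution: sentences are first produced with symbolic labels, and each label is later replaced by a randomly chosen lexical item. The label names, templates, and word lists here are hypothetical.

```python
import random

# Sketch of the redundancy-reduction mechanism: generate templates
# with symbolic labels, then substitute each label stochastically.
# Labels and word lists are illustrative assumptions.
LEXICON = {
    "<N>": ["tochtli", "sintli", "tlakatl"],   # 'rabbit', 'corn', 'person'
    "<V>": ["kikwa", "kitta"],                 # 'eats it', 'sees it'
}
TEMPLATES = ["<V> <N>", "<N> <V> <N>"]         # VO and SVO-like patterns

def realize(template, rng):
    """Replace each symbolic label with a random word; pass other
    tokens through unchanged."""
    return " ".join(rng.choice(LEXICON.get(tok, [tok]))
                    for tok in template.split())

rng = random.Random(0)                          # seeded for reproducibility
corpus = [realize(t, rng) for t in TEMPLATES for _ in range(3)]
```

In the paper's pipeline a semantic filter would additionally discard implausible realizations (e.g. an inanimate subject for 'eats'); the sketch omits that stage.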
The artificial corpus is merged with the existing real‑world Nahuatl corpus π‑YALL‑I, yielding a combined dataset of roughly one hundred million tokens. The authors train non‑contextual word embeddings (Word2Vec, FastText) on both the original and the expanded corpora. Evaluation is performed on a sentence‑level semantic similarity task, where pairs of sentences are judged for meaning overlap. Results show that embeddings trained on the expanded dataset achieve a 3–5 % absolute improvement in accuracy and F1 score over embeddings trained on the original data alone. Notably, these lightweight embeddings outperform large language models such as GPT‑3 on the same task, highlighting the cost‑effectiveness of the approach for π‑languages.
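The sentence-similarity evaluation with non-contextual embeddings can be sketched as mean-pooled word vectors compared by cosine similarity. The tiny hand-written 2-dimensional vectors below stand in for embeddings actually trained with Word2Vec or FastText; they are illustrative, not learned from any corpus.

```python
import math

# Hand-made stand-ins for trained non-contextual word embeddings
# (illustrative assumption; real vectors would come from Word2Vec
# or FastText trained on the expanded corpus).
EMB = {
    "tochtli": [0.9, 0.1],   # 'rabbit'
    "sintli":  [0.2, 0.8],   # 'corn'
    "kikwa":   [0.5, 0.5],   # 'eats it'
}

def sent_vec(sentence):
    """Sentence vector = mean of the word vectors (mean pooling)."""
    vecs = [EMB[w] for w in sentence.split() if w in EMB]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

sim = cosine(sent_vec("tochtli kikwa sintli"),
             sent_vec("tochtli kikwa"))
print(round(sim, 3))   # high similarity: the sentences share two words
```

A pair of sentences is then judged similar when this score exceeds a threshold tuned on held-out data; the threshold choice is not specified in the summary.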
The study acknowledges several limitations. The exclusion of recursive rules prevents the generation of complex, multi‑clausal sentences, and the rule‑based semantic filter cannot capture all nuances of Nahuatl dialectal variation. Consequently, the generated text, while syntactically valid, may still lack the full stylistic richness of natural discourse. The authors propose future work that integrates probabilistic CFGs or neural grammar induction to introduce recursion and richer lexical diversity, as well as human evaluation to assess naturalness and dialect coverage.
In summary, the paper demonstrates that carefully crafted CFGs can substantially augment scarce linguistic resources, enabling the creation of large, syntactically sound corpora for under‑resourced languages. The resulting artificial data improve the quality of word embeddings and can even surpass state‑of‑the‑art LLMs in specific semantic tasks, offering a pragmatic pathway for computational work on many other π‑languages worldwide.