Automated Quality Control for Language Documentation: Detecting Phonotactic Inconsistencies in a Kokborok Wordlist
Lexical data collection in language documentation often contains transcription errors and undocumented borrowings that can mislead linguistic analysis. We present unsupervised anomaly detection methods to identify phonotactic inconsistencies in wordlists, applying them to a multilingual dataset of Kokborok varieties alongside Bangla, the contact language. Using character-level and syllable-level phonotactic features, our algorithms identify potential transcription errors and borrowings. While precision and recall remain modest due to the subtle nature of these anomalies, syllable-aware features significantly outperform character-level baselines. The high-recall approach provides fieldworkers with a systematic method to flag entries requiring verification, supporting data quality improvement in low-resourced language documentation.
💡 Research Summary
The paper tackles a pervasive problem in language documentation: lexical wordlists collected in the field often contain transcription errors and undocumented borrowings that compromise phonological analysis. The authors propose an unsupervised anomaly‑detection pipeline that flags phonotactic inconsistencies in a Kokborok wordlist, using both phoneme‑level and syllable‑level n‑gram language models.
Data come from a sociolinguistic survey (Kim et al., 2025) covering 306 basic concepts across 20 Kokborok varieties, three Garo varieties, and standard Bangla as the contact language. After converting the data to the Cross‑Linguistic Data Format (CLDF) and removing morphological affixes, the authors retain 3,055 unique wordforms. A hand‑annotated gold standard identifies 555 entries as borrowings from Bangla; transcription errors are not explicitly labeled because of the difficulty in distinguishing them from dialectal variation.
The methodology consists of two parallel modeling streams. At the phoneme level, bigram and trigram models are trained on the Kokborok corpus. Each word is padded with start/end markers, and diacritics are treated as separate symbols. The negative log‑likelihood (NLL) of a word under the model is computed and aggregated in four ways: arithmetic mean, harmonic mean, minimum, and maximum NLL. The intuition is that anomalous words will contain rare phoneme sequences, yielding high NLL scores.
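The phoneme-level scoring described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the add-α smoothing, the `#` padding symbol, and all function names are our assumptions.

```python
import math
from collections import Counter

PAD = "#"  # assumed start/end marker; the paper does not specify the symbol

def train_bigram(words):
    """Count phoneme bigrams and their left-context unigrams over a corpus."""
    bigrams, contexts = Counter(), Counter()
    for w in words:
        seq = [PAD] + list(w) + [PAD]
        for a, b in zip(seq, seq[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    return bigrams, contexts

def transition_nlls(word, bigrams, contexts, alpha=1.0):
    """Per-transition negative log-likelihoods under an add-alpha smoothed model."""
    vocab = len({b for _, b in bigrams}) + 1
    seq = [PAD] + list(word) + [PAD]
    return [-math.log((bigrams[(a, b)] + alpha) / (contexts[a] + alpha * vocab))
            for a, b in zip(seq, seq[1:])]

def aggregate(nlls):
    """The four aggregation strategies; high values suggest anomalous words."""
    n = len(nlls)
    return {"mean": sum(nlls) / n,
            "harmonic": n / sum(1.0 / x for x in nlls),
            "min": min(nlls),
            "max": max(nlls)}
```

The four aggregates highlight different anomaly profiles: the maximum flags a single rare transition, the mean rewards words that are odd throughout, and the harmonic mean sits between them.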
At the syllable level, the authors first perform automatic syllabification using a sonority hierarchy and the maximum‑onset principle, inserting a period “.” as a syllable‑boundary marker. Three analysis types are then applied: (1) Within‑syllable n‑grams that stay inside a single syllable, (2) Cross‑boundary n‑grams that span syllable borders, and (3) Boundary‑as‑phoneme n‑grams that treat the “.” as a regular phoneme, thereby capturing positional information. The same NLL aggregation strategies are used.
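A toy sketch of this syllable-level pipeline is given below. The sonority scale and the orthographic vowel set are simplified placeholders rather than the paper's actual Kokborok inventory, and the function names are ours.

```python
VOWELS = set("aeiou")  # placeholder nuclei; a real system would use the IPA inventory
SONORITY = {"a": 5, "e": 5, "i": 5, "o": 5, "u": 5,   # vowels
            "w": 4, "j": 4,                            # glides
            "l": 3, "r": 3,                            # liquids
            "m": 2, "n": 2}                            # nasals

def sonority(ph):
    # obstruents (stops, fricatives) default to the lowest rank
    return SONORITY.get(ph, 1)

def syllabify(word, boundary="."):
    """Insert '.' boundaries via the maximum-onset principle with rising sonority."""
    phones = list(word)
    nuclei = [i for i, p in enumerate(phones) if p in VOWELS]
    if len(nuclei) < 2:
        return word
    cuts = []
    for prev, nxt in zip(nuclei, nuclei[1:]):
        cut = nxt
        # extend the onset leftward while sonority keeps rising toward the nucleus
        for i in reversed(range(prev + 1, nxt)):
            if i + 1 == nxt or sonority(phones[i]) < sonority(phones[i + 1]):
                cut = i
            else:
                break
        cuts.append(cut)
    return "".join(boundary + p if i in cuts else p for i, p in enumerate(phones))

def extract_ngrams(syllabified, n=2, boundary="."):
    """Return the three n-gram inventories: within-syllable, cross-boundary,
    and boundary-as-phoneme (where '.' is treated as a regular symbol)."""
    symbols = list(syllabified)
    as_phoneme = [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]
    flat, syl_idx, k = [], [], 0   # map each phoneme to its syllable index
    for ch in syllabified:
        if ch == boundary:
            k += 1
        else:
            flat.append(ch)
            syl_idx.append(k)
    within, cross = [], []
    for i in range(len(flat) - n + 1):
        gram = tuple(flat[i:i + n])
        (within if syl_idx[i] == syl_idx[i + n - 1] else cross).append(gram)
    return within, cross, as_phoneme
```

For example, `syllabify("kokborok")` yields `"kok.bo.rok"`, whose bigrams split into within-syllable (`ko`, `ok`, …), cross-boundary (`kb`, `or`), and boundary-as-phoneme (`k.`, `.b`, …) inventories.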
Evaluation uses precision@K and recall@K for K = 100, 500, 1,000, comparing model rankings against the gold borrowing list. Results show that trigram models consistently outperform bigrams. The harmonic‑mean aggregation yields the best recall (0.52 at K = 1,000) and a respectable precision of 0.46 at K = 100. Syllable‑level “Within‑syllable” analysis achieves the highest precision (0.47 at K = 100) and recall (0.49 at K = 1,000) among the syllable approaches, while “Boundary‑as‑phoneme” performs well at low K values. In contrast, “Cross‑boundary” analysis lags behind, indicating that violations are more strongly signaled by internal syllable structure than by transitions across syllable edges. Random baselines (uniform and length‑stratified sampling) perform markedly worse, confirming that the n‑gram models capture genuine phonotactic signals rather than noise.
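The evaluation metrics above reduce to a few lines. The word forms and gold set below are invented for illustration; only the metric definitions follow the paper.

```python
def precision_recall_at_k(ranked, gold, k):
    """Score the top-K of a ranked anomaly list against a gold set of borrowings."""
    hits = len(set(ranked[:k]) & set(gold))
    return hits / k, hits / len(gold)

# hypothetical ranking (most anomalous first) and gold borrowing set
ranked = ["dʒala", "kok", "ʈopi", "borok", "mai"]
gold = {"dʒala", "ʈopi"}
p, r = precision_recall_at_k(ranked, gold, k=3)  # → (2/3, 1.0)
```

Note that recall@K is bounded by K / |gold|, so recall comparisons are only meaningful at a fixed K.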
Qualitative inspection of top‑ranked anomalies reveals that the models flag words containing rare aspirated consonants, retroflex clusters, or phoneme sequences typical of Bangla borrowings (e.g., /dʒ/). Some flagged items are clear loans, while others are borderline cases that exhibit partial phonological integration, illustrating the system’s ability to prioritize both obvious and subtle inconsistencies.
The authors conclude that simple phoneme‑level n‑gram models already provide a practical, computationally cheap solution for early‑stage quality control in low‑resource documentation projects. Adding syllable‑aware modeling contributes interpretability by pinpointing specific phonotactic constraints that are violated. Limitations include the modest size of the wordlist, which precludes the use of data‑hungry neural language models, and the lack of fine‑grained annotation distinguishing transcription errors from borrowings. Future work is suggested to explore larger corpora, deep learning approaches, and multi‑label anomaly detection to further improve precision and to separate error types automatically. Overall, the study offers a concrete, reproducible toolkit that can help field linguists quickly identify entries requiring manual verification, thereby enhancing the reliability of downstream linguistic analyses.