Mići Princ -- A Little Boy Teaching Speech Technologies the Chakavian Dialect

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

This paper documents our efforts in releasing the printed and audio book of the translation of the famous novel The Little Prince into the Chakavian dialect as a computer-readable, AI-ready dataset, with the textual and audio components of the two releases now aligned at the level of each written and spoken word. Our motivation for this release is threefold. The first is our wish to preserve this highly valuable and specific content beyond the small editions of the printed and audio books. With the dataset published in the CLARIN.SI repository, the content is now at the fingertips of any interested individual. The second is to make the data available for various artificial-intelligence usage scenarios, such as the one we already pursue in this paper: adapting Whisper-large-v3, an open automatic speech recognition model with decent performance on standard Croatian, to Chakavian dialectal speech. We can happily report that by adapting the model, the word error rate on the selected test data has been reduced by half, while up to two thirds of the character-level error has been removed. We envision many more uses of this dataset beyond the experiments we have already performed, in artificial-intelligence research and applications as well as in dialectal research. The third motivation is our hope that this now highly structured dataset will be transformed into a digital online edition of the work, allowing individuals beyond the research and technology communities to enjoy the beauty of the message of the little boy in the desert, told through the spectacular prism of the Chakavian dialect.


💡 Research Summary

The paper presents the creation, processing, and release of a novel speech‑text dataset derived from a translation of Antoine de Saint‑Exupéry’s “The Little Prince” into the Chakavian dialect of Croatian, titled “Mići Princ”. The authors’ motivations are threefold: (1) to preserve a culturally valuable work that exists only in limited printed and audio editions, (2) to provide a rare open‑source resource for dialectal speech that can be used in artificial‑intelligence research, and (3) to enable broader public access through a future digital edition.

The dataset comprises a 113‑minute audio book (79 minutes of speech after music removal) and a corresponding text of 60,129 characters, 11,591 words, and 547 speaker turns. Sixteen translators and sixteen voice actors each contributed a distinct micro‑dialect, resulting in a highly diverse linguistic resource. The authors follow FAIR principles and deposit the data in the CLARIN.SI repository as well as on HuggingFace, making it findable, accessible, interoperable, and reusable.

Data processing involves several stages. First, the material is segmented by chapter. Voice Activity Detection (VAD) removes music and silence, after which a diarisation model identifies speaker segments. The resulting EXB files are manually inspected in the EXMARaLDA Partitur Editor, correcting mis-diarised turns and assigning speaker labels (e.g., Mići Princ, Autor, Geograf, Dilavac). Text normalization replaces dialect-specific characters (î, ï, etc.) with standard Croatian equivalents, expands numerals to words, and strips punctuation. Word-level alignment is performed with Kaldi, yielding timestamps for every word. The aligned data are stored in JSON and EXB formats.
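The normalization steps described above can be sketched as follows. The character map and the numeral expander here are illustrative assumptions, not the authors' exact rules or word forms:

```python
import re

# Illustrative character map: dialect-specific accented letters are
# collapsed to their standard Croatian base letters (assumed mapping).
CHAR_MAP = {"î": "i", "ï": "i", "ȃ": "a", "ȅ": "e"}

# Tiny illustrative numeral expander (the paper expands numerals to words;
# the exact word forms used here are assumptions).
NUMERALS = {"1": "jedan", "2": "dva", "3": "tri"}

def normalize(text: str) -> str:
    # 1. Replace dialect-specific characters with standard equivalents.
    for src, dst in CHAR_MAP.items():
        text = text.replace(src, dst)
    # 2. Expand digits to words.
    text = re.sub(r"\d", lambda m: " " + NUMERALS.get(m.group(), m.group()) + " ", text)
    # 3. Strip punctuation, collapse whitespace, lower-case.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.lower().split())

print(normalize("Vidil je 3 zvîzde!"))  # -> "vidil je tri zvizde"
```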

For automatic speech recognition (ASR) experiments, the authors create a “flavored” version of the dataset: audio segments are re‑segmented to a maximum of 30 seconds, and the text is further normalized (bullet points and newlines removed). Chapters 13 and 15 are reserved as a test set, containing both speakers seen during training (Autor, Mići Princ) and unseen speakers (Geograf, Dilavac), allowing evaluation of speaker generalisation.
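A minimal sketch of the re-segmentation step, greedily grouping word-level timestamps (as produced by the forced alignment) into chunks of at most 30 seconds; the `(token, start, end)` layout is an assumption for illustration, not the released JSON schema:

```python
# Each aligned word: (token, start_sec, end_sec).
Word = tuple[str, float, float]

def resegment(words: list[Word], max_len: float = 30.0) -> list[list[Word]]:
    """Greedily group consecutive aligned words into segments whose
    total span (first start to last end) stays within max_len seconds."""
    segments: list[list[Word]] = []
    current: list[Word] = []
    for w in words:
        # Adding this word would push the segment past max_len: start a new one.
        if current and w[2] - current[0][1] > max_len:
            segments.append(current)
            current = []
        current.append(w)
    if current:
        segments.append(current)
    return segments

words = [("mići", 0.0, 0.4), ("princ", 0.5, 1.0), ("gre", 29.8, 30.4)]
print([[t for t, _, _ in seg] for seg in resegment(words)])
# [['mići', 'princ'], ['gre']]
```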

The ASR component fine‑tunes OpenAI’s Whisper‑large‑v3 model, chosen for its solid baseline performance on standard Croatian. Hyper‑parameters are set to 80 epochs, learning rate 1e‑5, and batch size 16. Evaluation uses word error rate (WER) and character error rate (CER) computed with the “evaluate” library after lower‑casing and punctuation removal. Results show a dramatic improvement: overall WER is roughly halved, and CER is reduced by about two‑thirds compared with the vanilla Whisper model. Notably, the model improves on the two speakers that were not present in the fine‑tuning data, indicating successful dialect adaptation.
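The paper computes WER and CER with the `evaluate` library; as a self-contained illustration, both metrics reduce to Levenshtein edit distance over words and characters respectively, normalized by reference length (a stdlib sketch, not the library's implementation):

```python
def levenshtein(a, b) -> int:
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def wer(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    return levenshtein(r, h) / len(r)

def cer(ref: str, hyp: str) -> float:
    return levenshtein(ref, hyp) / len(ref)

# A segmentation error like the one in the paper's error analysis: one
# reference word split into two hypothesis words costs one substitution
# plus one insertion.
print(wer("med zvizdami", "med zvizda mi"))  # 1.0
```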

Error analysis reveals three primary sources of residual mistakes: (1) incorrect segmentation (e.g., “zvizdami” transcribed as “zvizda mi”), (2) regression to the standard language (e.g., a standard Croatian form produced in place of the dialectal “š njimi”), and (3) legitimate orthographic variation in the dialect, where multiple transcriptions could be considered correct. The model also surfaced a mismatch between the printed and audio versions of the book (priti vs. arivat), prompting a correction in the dataset.

The authors acknowledge limitations: the dataset’s modest size (79 minutes of speech) may restrict scalability for larger neural models, and the lack of a standardized orthography for Chakavian introduces ambiguities during normalization. Future work is outlined as expanding the corpus to cover more Chakavian micro‑dialects, integrating multimodal alignment (text‑speech‑image), establishing benchmark baselines for dialectal ASR, and developing a publicly accessible digital edition of “Mići Princ”.

In conclusion, the paper demonstrates that even a relatively small, meticulously aligned dialectal corpus can substantially improve state‑of‑the‑art speech recognition systems for under‑resourced language varieties. By releasing the data under FAIR guidelines and providing a concrete ASR adaptation pipeline, the authors contribute a valuable resource for both linguistic preservation and AI research, bridging the gap between dialectology and modern speech technology.

