Arabic Little STT: Arabic Children Speech Recognition Dataset
The performance of Artificial Intelligence (AI) systems fundamentally depends on high-quality training data. However, low-resource languages like Arabic suffer from severe data scarcity, and the absence of child-specific speech corpora is a critical gap that poses significant challenges. To address this gap, we present Arabic Little STT, a dataset of Levantine Arabic child speech recorded in classrooms, containing 355 utterances from 288 children (ages 6 - 13). We further conduct a systematic assessment of Whisper, a state-of-the-art automatic speech recognition (ASR) model, on this dataset and compare its performance with adult Arabic benchmarks. Our evaluation across eight Whisper variants reveals that even the best-performing model (Large_v3) struggles significantly, achieving a 0.66 word error rate (WER) on child speech, in stark contrast with its sub-0.20 WER on adult datasets. These results align with findings on English child speech and highlight the critical need for dedicated child speech benchmarks and inclusive training data in ASR development. We emphasize that such data must be governed by strict ethical and privacy frameworks to protect sensitive child information. We hope that this study provides an initial step toward equitable speech technologies for Arabic-speaking children, and that our publicly available dataset enriches children's demographic representation in ASR datasets.
💡 Research Summary
The paper addresses two critical gaps in Arabic speech‑recognition research: the scarcity of high‑quality Arabic data and the complete lack of child‑specific corpora. To fill this void, the authors created “Arabic Little STT,” a publicly available dataset consisting of 355 utterances recorded in classroom settings from 288 Levantine Arabic speakers aged 6‑13. Data collection was performed in collaboration with schools, using 16 kHz, 16‑bit PCM recordings captured at a consistent distance from each child. Informed consent was obtained from parents or guardians, and all recordings were anonymized to meet strict ethical standards for handling minors’ data.
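The 16 kHz, 16-bit PCM format mentioned above can be verified programmatically; below is a minimal sketch using Python's standard `wave` module (the file name and the `check_recording` helper are illustrative assumptions, not part of the authors' pipeline):

```python
import wave

EXPECTED_RATE = 16_000   # 16 kHz sample rate used for the corpus
EXPECTED_WIDTH = 2       # 16-bit PCM -> 2 bytes per sample

def check_recording(path: str) -> bool:
    """Return True if a WAV file matches the 16 kHz / 16-bit PCM spec."""
    with wave.open(path, "rb") as wav:
        return (wav.getframerate() == EXPECTED_RATE
                and wav.getsampwidth() == EXPECTED_WIDTH)

# Write a short dummy file to demonstrate the check.
with wave.open("demo.wav", "wb") as wav:
    wav.setnchannels(1)                            # mono
    wav.setsampwidth(EXPECTED_WIDTH)
    wav.setframerate(EXPECTED_RATE)
    wav.writeframes(b"\x00\x00" * EXPECTED_RATE)   # one second of silence
```

Such a check is a cheap sanity gate before transcription or model evaluation, catching files re-encoded at the wrong rate.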
After recording, the audio was processed with voice‑activity detection to trim silence, and the transcriptions were verified by native‑speaker linguists using standardized Arabic orthography, achieving a labeling error rate below 1.2 %. The utterances average 3.2 seconds in length, providing a range of phonetic and prosodic variability. The dataset is split into training, validation, and test subsets (70 %/10 %/20 %) with balanced representation across age, gender, and dialectal nuances.
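The voice-activity-detection step described above can be approximated with a simple energy threshold; the sketch below is a crude stand-in for a real VAD (the frame length and threshold values are illustrative assumptions, not the authors' settings):

```python
def trim_silence(samples, frame_len=160, threshold=0.01):
    """Drop leading and trailing frames whose mean absolute
    amplitude falls below `threshold` (energy-based pseudo-VAD)."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples), frame_len)]

    def is_speech(frame):
        return sum(abs(s) for s in frame) / len(frame) >= threshold

    # Advance past silent frames at the start ...
    start = 0
    while start < len(frames) and not is_speech(frames[start]):
        start += 1
    # ... and back past silent frames at the end.
    end = len(frames)
    while end > start and not is_speech(frames[end - 1]):
        end -= 1

    out = []
    for frame in frames[start:end]:
        out.extend(frame)
    return out
```

A production pipeline would use a trained VAD model rather than a fixed amplitude threshold, which is easily fooled by classroom background noise.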
For benchmarking, the authors evaluated eight Whisper model variants (Tiny, Base, Small, Medium, Large, and their v2/v3 updates) in a zero‑shot configuration, i.e., without any fine‑tuning on the child data. Both Word Error Rate (WER) and Character Error Rate (CER) were computed and compared against adult Arabic benchmarks such as Arabic Common Voice. While Whisper achieves sub‑0.20 WER on adult speech, performance on the child dataset drops dramatically: the best model, Whisper Large_v3, records a WER of 0.66, and even the smallest models exceed 0.85. This performance gap mirrors findings in English child‑speech research and underscores that large multilingual models, despite their size, are not inherently robust to child speech.
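The WER figures above are word-level edit distances normalized by reference length; a self-contained, library-free sketch of the metric (for illustration only, not the authors' evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions)
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)
```

A WER of 0.66, as reported for Whisper Large_v3 on the child data, means roughly two errors for every three reference words, versus fewer than one in five on adult benchmarks.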
Error analysis reveals three primary contributors to the degradation. First, children exhibit greater phonetic variability and non‑canonical intonation patterns, leading to acoustic mismatches with the adult‑centric pre‑training data. Second, classroom environments introduce background noises (movement, ambient chatter, equipment sounds) that challenge the models’ noise‑robustness. Third, the limited lexical and syntactic diversity of the dataset hampers the language model component, which was never exposed to child‑appropriate discourse during pre‑training.
The authors draw several technical and ethical implications. Technically, they argue that even for low‑resource languages like Arabic, dedicated child corpora are essential for equitable ASR development. Model scaling alone does not resolve the child‑speech problem; targeted fine‑tuning or the creation of child‑specific acoustic and language models is necessary. Moreover, high‑quality, well‑balanced annotations can partially offset the small dataset size, especially when the data captures a broad range of ages and dialectal variations. Ethically, the paper stresses the importance of transparent consent procedures, strict anonymization, and compliance with child‑protection regulations throughout data collection, distribution, and downstream use.
Future work outlined includes expanding the dataset to cover more Arabic dialects and a wider age range, integrating advanced noise‑reduction and voice‑activity detection tailored for child speech, and experimenting with fine‑tuning Whisper or building new models that incorporate child‑speech data from the outset. By releasing Arabic Little STT, the authors aim to catalyze research that improves speech‑technology accessibility for Arabic‑speaking children and promotes fairness in AI systems across demographic groups.