I can tell whether you are a Native Hawlêri Speaker! How ANN, CNN, and RNN perform in NLI-Native Language Identification

Notice: This research summary and analysis were automatically generated using AI technology. For complete accuracy, please refer to the original arXiv source.

Native Language Identification (NLI) is a task in Natural Language Processing (NLP) that determines the native language of an author from their writing or of a speaker from their speech. It has applications in various areas, such as forensic linguistics and general linguistics studies. Although considerable research has been conducted on NLI across distinct languages, such as English and German, the literature shows a significant gap regarding NLI for dialects and subdialects. The gap widens further for less-resourced languages such as Kurdish. This research focuses on NLI within the context of a subdialect of Sorani (Central) Kurdish. It investigates NLI for Hewlêri, a subdialect spoken in Hewlêr (Erbil), the capital of the Kurdistan Region of Iraq. We collected about 24 hours of speech by recording interviews with 40 native or non-native Hewlêri speakers, 17 female and 23 male. We created three neural network-based models: an Artificial Neural Network (ANN), a Convolutional Neural Network (CNN), and a Recurrent Neural Network (RNN), which were evaluated through 66 experiments covering time frames from 1 to 60 seconds, undersampling, oversampling, and cross-validation. The RNN model showed the highest accuracy of 95.92% on 5-second audio segments, using an 80:10:10 data splitting scheme. The created dataset is the first speech dataset for NLI on the Hewlêri subdialect of Sorani Kurdish, which can benefit various research areas.


💡 Research Summary

This paper addresses the under‑explored problem of native language identification (NLI) at the sub‑dialect level for a low‑resource language, focusing on the Hewlêri sub‑dialect of Central Kurdish (Sorani). The authors collected a novel speech corpus by conducting semi‑structured interviews with 40 participants (19 native Hewlêri speakers and 21 non‑native speakers who are fluent in the sub‑dialect). The recordings amount to 23 hours 27 minutes 22 seconds of audio, captured with a high‑quality condenser microphone at 192 kHz/24‑bit and later down‑sampled to 44.1 kHz/16‑bit WAV files. After manual cleaning (noise reduction, removal of interviewer’s voice, normalization), the audio was segmented into fixed‑length chunks of 1, 3, 5, 10, 20, 30, and 60 seconds using the Pydub library. For each chunk, 13‑dimensional Mel‑Frequency Cepstral Coefficients (MFCC) were extracted and used as input features.
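The fixed-length segmentation step above can be sketched without any audio library: the paper uses Pydub, but the underlying operation is just cutting a waveform array into equal-sized chunks. This is a minimal sketch (function name and the remainder-dropping policy are assumptions, not taken from the paper):

```python
import numpy as np

def segment_audio(samples: np.ndarray, sample_rate: int, chunk_seconds: float) -> list:
    """Split a mono waveform into fixed-length chunks, dropping any
    trailing remainder shorter than one chunk."""
    chunk_len = int(sample_rate * chunk_seconds)
    n_chunks = len(samples) // chunk_len
    return [samples[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]

# Example: 23 seconds of (silent) audio at the paper's 44.1 kHz rate,
# cut into 5-second segments -> 4 full chunks of 220,500 samples each
sr = 44_100
audio = np.zeros(23 * sr, dtype=np.float32)
chunks = segment_audio(audio, sr, 5)
print(len(chunks), len(chunks[0]))  # 4 220500
```

Each resulting chunk would then be passed to an MFCC extractor (e.g. `librosa.feature.mfcc` with `n_mfcc=13`) to produce the 13-dimensional features the models consume.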

Because the dataset is imbalanced (native vs. non‑native), the authors applied random oversampling to duplicate the minority class and random undersampling to reduce the majority class, achieving a balanced set of 39,083 samples per class. Three neural network architectures were built in TensorFlow/Keras: (1) a fully‑connected Artificial Neural Network (ANN) with ReLU activations and dropout; (2) a one‑dimensional Convolutional Neural Network (CNN) with multiple conv‑pool blocks; and (3) a Long Short‑Term Memory (LSTM) based Recurrent Neural Network (RNN). Training was performed on Google Colab GPUs and a local workstation (Intel i7‑10610U, 16 GB RAM, 8 GB GPU). For ANN the data split was 80 % training / 20 % testing; for CNN and RNN it was 80 % training / 10 % validation / 10 % testing. Early stopping with a patience of 10 epochs was used to avoid over‑fitting. Model performance was evaluated using classification accuracy and cross‑entropy loss, and 10 % of the data was always held out for final testing.
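The random over/undersampling described above can be sketched in a few lines of NumPy. The function name and toy data below are illustrative, not from the paper; the paper balanced to 39,083 samples per class:

```python
import numpy as np

rng = np.random.default_rng(42)

def balance_classes(X, y, mode="oversample"):
    """Equalise class counts by randomly duplicating minority-class samples
    (oversample) or randomly dropping majority-class samples (undersample)."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max() if mode == "oversample" else counts.min()
    picked = []
    for c in classes:
        pool = np.flatnonzero(y == c)
        # Sample with replacement only when the class must grow
        picked.append(rng.choice(pool, size=target, replace=target > pool.size))
    idx = np.concatenate(picked)
    return X[idx], y[idx]

# Toy example: 7 "native" (1) vs. 3 "non-native" (0) feature vectors
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
X_bal, y_bal = balance_classes(X, y, mode="oversample")
print(np.bincount(y_bal))  # [7 7]
```

Undersampling (`mode="undersample"`) instead shrinks both classes to the minority count, here 3 per class.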

A total of 66 experiments were conducted, varying the segment length, sampling strategy (oversampled, undersampled, original), and data split. The key findings are:

  • Shorter audio segments consistently yielded higher accuracies, confirming that fine‑grained temporal cues are crucial for distinguishing native from non‑native speakers.
  • The RNN achieved the highest overall accuracy of 95.92 % on the oversampled dataset with 5‑second segments, outperforming the ANN (82.92 %) and CNN (94.47 % on 10‑second segments).
  • Accuracy dropped markedly for longer segments (60 seconds), with RNN at 70.54 % and CNN at 79.65 %, indicating that long windows dilute discriminative dialectal features.
  • K‑fold cross‑validation showed only a 0.8 % decrease for the RNN, suggesting good generalisation despite the limited sample size.
  • Training time was shortest for ANN, moderate for CNN, and longest for RNN, reflecting the trade‑off between computational cost and performance.
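
The k-fold validation mentioned in the findings can be sketched with scikit-learn's `StratifiedKFold`, which keeps the native/non-native ratio constant in every fold. The 10-fold setting and the toy data below are assumptions for illustration; the summary does not state the exact fold count:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Illustrative data: 100 clips of 13 MFCC means, 60 native / 40 non-native
X = np.random.default_rng(0).normal(size=(100, 13))
y = np.array([1] * 60 + [0] * 40)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Each held-out fold preserves the 60:40 ratio: 6 native, 4 non-native
    n_native = int(y[test_idx].sum())
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test, "
          f"{n_native} native in test")
```

Averaging the per-fold test accuracies gives a more robust estimate than a single 80:10:10 split, which is why only a small drop (0.8 %) under cross-validation is a good sign for generalisation.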

The authors argue that the RNN’s sequential modeling capability captures evolving prosodic and phonetic patterns that are especially informative in short time frames, whereas CNN’s convolutional filters are robust to class imbalance but less adept at modeling long‑range dependencies. The ANN, while computationally efficient, lacks the capacity to exploit temporal dynamics fully.
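A Keras sketch of the kind of LSTM-based RNN described here, consuming per-frame MFCC sequences, might look as follows. The layer widths, frame count, and dropout rate are assumptions (the summary does not report exact hyperparameters); only the 13 MFCC coefficients, the binary native/non-native output, and early stopping with patience 10 come from the text:

```python
import tensorflow as tf

# Illustrative input shape: ~431 MFCC frames for a 5-second clip at a
# typical hop length; 13 coefficients per frame as in the paper.
N_FRAMES, N_MFCC = 431, 13

model = tf.keras.Sequential([
    tf.keras.Input(shape=(N_FRAMES, N_MFCC)),
    tf.keras.layers.LSTM(64, return_sequences=True),  # frame-level dynamics
    tf.keras.layers.LSTM(64),                         # utterance-level summary
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # native vs. non-native
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Early stopping with a patience of 10 epochs, as in the paper.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=100, callbacks=[early_stop])
```

The recurrent layers are what let the model exploit the short-window temporal cues the authors credit for the RNN's edge over the ANN, which sees each segment's features without any notion of frame order.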

Beyond the experimental results, the paper contributes a publicly‑available “Hewlêri Speech Corpus,” the first NLI dataset for this Kurdish sub‑dialect. The authors discuss potential applications in forensic linguistics, dialectology, and low‑resource speech technology. They acknowledge limitations: the participant pool is relatively small (40 speakers), the non‑native group is heterogeneous in terms of first language and proficiency, and only MFCC features were used, omitting other acoustic cues such as pitch, intensity, or linguistic transcriptions.

Future work suggested includes expanding the corpus with more speakers across age, gender, and socioeconomic backgrounds; exploring Transformer‑based models (e.g., wav2vec 2.0, Conformer) for end‑to‑end speech representation; integrating textual features from automatic speech recognition to create multimodal NLI systems; and evaluating the models in real‑world forensic or law‑enforcement scenarios.

In summary, this study demonstrates that deep learning, particularly recurrent architectures, can effectively identify native versus non‑native speakers of a low‑resource Kurdish sub‑dialect when trained on carefully segmented MFCC features. The findings provide a solid baseline for subsequent research on dialect‑level NLI and contribute valuable resources to the broader community working on under‑represented languages.

