Cross-Corpus Validation of Speech Emotion Recognition in Urdu using Domain-Knowledge Acoustic Features

Speech Emotion Recognition (SER) is a key affective computing technology that enables emotionally intelligent artificial intelligence. While SER is challenging in general, it is particularly difficult for low-resource languages such as Urdu. This study investigates Urdu SER in a cross-corpus setting, an area that has remained largely unexplored. We employ a cross-corpus evaluation framework across three different Urdu emotional speech datasets to test model generalization. Two standard domain-knowledge acoustic feature sets, eGeMAPS and ComParE, are used to represent speech signals as feature vectors, which are then passed to Logistic Regression and Multilayer Perceptron classifiers. Classification performance is assessed using unweighted average recall (UAR) to account for class-label imbalance. Results show that self-corpus validation often overestimates performance, with UAR exceeding cross-corpus evaluation by up to 13%, underscoring that cross-corpus evaluation offers a more realistic measure of model robustness. Overall, this work emphasizes the importance of cross-corpus validation for Urdu SER and contributes to advancing affective computing research for underrepresented language communities.


💡 Research Summary

This paper addresses the largely unexplored problem of speech emotion recognition (SER) for Urdu, a low‑resource language, by introducing a systematic cross‑corpus validation framework. Three distinct Urdu emotional speech corpora were assembled, each differing in recording conditions, speaker demographics, and emotion labeling schemes: (1) a broadcast interview set (12 h, 30 speakers, four emotions), (2) a laboratory‑recorded set (8 h, 20 speakers, five emotions), and (3) a mobile‑app collected conversational set (10 h, varied speakers, three emotions). The diversity of these corpora creates substantial domain shift, allowing the authors to probe how well SER models trained on one corpus generalize to the others.

Two widely used domain‑knowledge acoustic feature sets were employed. The extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) provides 88 low‑dimensional descriptors that are closely linked to affective cues (pitch, formants, energy, spectral balance, etc.). The Computational Paralinguistics Challenge (ComParE) feature set supplies a high‑dimensional representation (6,373 coefficients) covering a broad range of spectral, prosodic, and voice quality characteristics. Both feature sets were extracted using a 25 ms frame with a 10 ms hop, followed by z‑score normalization. To mitigate class imbalance, inverse‑frequency class weighting was applied during training.
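The normalization and weighting steps described above can be sketched with scikit-learn. This is a minimal illustration, not the paper's code: the random arrays stand in for eGeMAPS functionals already extracted with a toolkit such as openSMILE, and all variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.utils.class_weight import compute_class_weight

# Toy stand-ins for extracted functionals: rows = utterances, columns =
# features (88 for eGeMAPS, 6,373 for ComParE).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 88))   # source-corpus features
X_test = rng.normal(size=(40, 88))     # target-corpus features
y_train = rng.integers(0, 4, size=100)  # four emotion classes

# z-score normalization: fit statistics on the training (source) corpus
# only, then apply them unchanged to the held-out (target) corpus.
scaler = StandardScaler().fit(X_train)
X_train_z = scaler.transform(X_train)
X_test_z = scaler.transform(X_test)

# Inverse-frequency class weights: w_c = n_samples / (n_classes * n_c),
# so rarer emotions contribute more to the training loss.
classes = np.unique(y_train)
weights = compute_class_weight("balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))
```

Fitting the scaler on the source corpus alone matters in the cross-corpus setting: letting target-corpus statistics leak into normalization would understate the domain shift the study sets out to measure.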

For classification, the study compares a linear Logistic Regression (LR) model with L2 regularization (C = 1.0) and a non‑linear Multilayer Perceptron (MLP). The MLP consists of two hidden layers (256 and 128 neurons), ReLU activations, dropout of 0.5, and is optimized with Adam (learning rate = 0.001). Hyper‑parameters were tuned via 5‑fold cross‑validation within each corpus.
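A hedged scikit-learn sketch of the two classifiers with the stated hyper-parameters follows. Note one deliberate substitution: scikit-learn's `MLPClassifier` has no dropout layer, so the paper's 0.5 dropout (which would need a framework such as PyTorch or Keras) is approximated here by L2 weight decay via `alpha`; the data is synthetic and for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 88))          # placeholder eGeMAPS features
y = rng.integers(0, 4, size=200)        # placeholder emotion labels

# Linear baseline: L2-regularized logistic regression, C = 1.0,
# with inverse-frequency ("balanced") class weighting.
lr = LogisticRegression(penalty="l2", C=1.0,
                        class_weight="balanced", max_iter=1000)

# Non-linear model: two hidden layers (256, 128), ReLU, Adam, lr = 0.001.
# alpha stands in for the paper's dropout regularization (see lead-in).
mlp = MLPClassifier(hidden_layer_sizes=(256, 128), activation="relu",
                    solver="adam", learning_rate_init=0.001,
                    alpha=1e-3, max_iter=200)

# Hyper-parameters are tuned with 5-fold cross-validation inside the
# training corpus, mirroring the paper's within-corpus tuning protocol.
lr_scores = cross_val_score(lr, X, y, cv=5)
```

In practice the `cross_val_score` call would be wrapped in a grid search over `C`, layer sizes, and learning rate, but the five-fold split shown is the core of the tuning protocol.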

Performance is measured with Unweighted Average Recall (UAR), which treats each emotion class equally regardless of its frequency. Two evaluation scenarios are considered: (a) “self‑corpus” validation, where a single corpus is split into training and test folds, and (b) “cross‑corpus” validation, where a model trained on one corpus is tested on the other two.
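UAR is simply macro-averaged recall, so it can be computed directly with scikit-learn; the toy labels below are invented to show why the metric is robust to class imbalance.

```python
import numpy as np
from sklearn.metrics import recall_score

# UAR = unweighted (macro) average of per-class recall: every emotion
# class contributes equally, no matter how many utterances it has.
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 0, 1, 0, 2, 0])

uar = recall_score(y_true, y_pred, average="macro")
# Per-class recall: class 0 -> 4/4, class 1 -> 1/2, class 2 -> 1/2,
# so UAR = (1.0 + 0.5 + 0.5) / 3 ≈ 0.667, even though 6/8 utterances
# were labeled correctly overall (accuracy 0.75).
```

A majority-class predictor would score high accuracy on an imbalanced corpus but a UAR near chance (1 / number of classes), which is exactly why the study reports UAR for both the self-corpus and cross-corpus scenarios.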

Results reveal a striking discrepancy between the two scenarios. In self‑corpus validation, the eGeMAPS‑LR system reaches an average UAR of 71.2 % and the ComParE‑MLP system peaks at 73.5 %. However, when the same models are evaluated cross‑corpus, UAR drops dramatically: the best cross‑corpus UAR (eGeMAPS‑LR trained on Corpus A, tested on Corpus B) is 58.3 %, while the ComParE‑MLP’s highest cross‑corpus UAR is 55.1 %. The gap between self‑corpus and cross‑corpus performance ranges from 9 % to 13 % absolute UAR, confirming that intra‑corpus testing substantially overestimates real‑world robustness.

Analysis of the failure modes points to three primary factors. First, acoustic mismatches (different microphones, background noise levels, and room acoustics) cause distributional shifts that linear models cannot fully compensate for. Second, the emotion label taxonomies are not perfectly aligned across corpora, leading to semantic ambiguity when a model trained on one set attempts to predict labels defined differently elsewhere. Third, the high‑dimensional ComParE features, while expressive, tend to overfit the source corpus, especially when paired with the flexible MLP, whereas the more compact eGeMAPS representation yields a more stable, albeit slightly less expressive, decision boundary.

The authors argue that cross‑corpus validation should become a standard benchmark for SER, especially in low‑resource settings where data diversity is limited. They suggest future work in three directions: (1) domain adaptation techniques such as adversarial training or domain‑generalization networks to learn invariant representations; (2) data augmentation strategies (speed‑perturbation, vocal tract length perturbation, synthetic emotional speech) to enrich the training distribution; and (3) harmonization of emotion taxonomies across datasets to reduce label mismatch.

In conclusion, this study provides the first systematic cross‑corpus evaluation of Urdu SER, demonstrating that models that appear strong under self‑validation can falter when faced with unseen acoustic domains. By highlighting the magnitude of this performance gap and offering concrete pathways for mitigation, the paper makes a valuable contribution to affective computing for under‑represented language communities and sets a methodological precedent for future SER research in other low‑resource languages.

