Detection and Analysis of Emotion From Speech Signals

Recognizing emotion from speech has become one of the active research themes in speech processing and in applications based on human-computer interaction. This paper presents an experimental study on recognizing emotions from human speech. The emotions considered in the experiments are neutral, anger, joy, and sadness. The distinguishability of emotional features in speech was studied first, followed by emotion classification performed on a custom dataset with several different classifiers. One of the main feature attributes in the prepared dataset was the peak-to-peak distance obtained from the graphical representation of the speech signals. After classification tests on a dataset built from 30 different subjects, it was found that better accuracy is obtained by training on data collected from a single person rather than on data pooled from a group of people.


💡 Research Summary

The paper presents an experimental investigation into automatic emotion recognition from human speech, focusing on four basic affective states: neutral, anger, joy, and sadness. A custom corpus was collected from thirty volunteers, each producing ten utterances per emotion under controlled indoor conditions. Recordings were captured at 16 kHz, 16‑bit PCM, and subsequently pre‑processed to remove DC offset and normalize amplitude. The central acoustic feature introduced is the peak‑to‑peak distance (P2P distance), defined as the temporal interval between successive local maxima and minima in the waveform. This time‑domain metric is intended to capture variations in vocal fold tension and respiratory patterns that differ across emotional states. In addition to P2P distance, conventional descriptors such as pitch, short‑term energy, and zero‑crossing rate were extracted, forming a multi‑dimensional feature vector for each utterance.
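The feature definitions above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the P2P distance is taken as the interval between successive local extrema as defined in the summary, and the toy 100 Hz sinusoid, the helper names (`p2p_distances`, `short_term_energy`, `zero_crossing_rate`), and the framing choices are all assumptions for demonstration.

```python
import numpy as np
from scipy.signal import find_peaks

def p2p_distances(signal, fs):
    """Intervals (in seconds) between successive local maxima and minima."""
    maxima, _ = find_peaks(signal)
    minima, _ = find_peaks(-signal)
    extrema = np.sort(np.concatenate([maxima, minima]))
    return np.diff(extrema) / fs

def short_term_energy(frame):
    """Sum of squared samples over a frame."""
    return float(np.sum(np.asarray(frame, dtype=float) ** 2))

def zero_crossing_rate(frame):
    """Fraction of sample pairs whose sign changes."""
    return float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2)

# Toy input: a 100 Hz sinusoid sampled at 16 kHz (matching the paper's rate).
fs = 16000
t = np.arange(0, 0.1, 1 / fs)
x = np.sin(2 * np.pi * 100 * t)
d = p2p_distances(x, fs)
# For a pure sinusoid, successive extrema are half a period apart (~5 ms here).
```

On real speech the waveform is noisy, so in practice one would smooth the signal or pass a `prominence` threshold to `find_peaks` before measuring extrema spacing.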

Four supervised classifiers were evaluated: Support Vector Machine (SVM), k‑Nearest Neighbors (k‑NN), Decision Tree, and a Multi‑Layer Perceptron (MLP). Hyper‑parameters for each model were tuned via five‑fold cross‑validation, and performance was measured using accuracy, precision, recall, and F1‑score. Two experimental scenarios were considered. In the “speaker‑dependent” scenario, a separate model was trained and tested for each individual speaker using only that speaker’s data. In the “speaker‑independent” scenario, all speakers’ data were pooled to train a single global model.
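The four-classifier comparison with five-fold cross-validation can be sketched with scikit-learn. The feature matrix below is a synthetic stand-in for the utterance-level vectors (P2P distance, pitch, energy, ZCR); the class structure, sample counts, and default hyper-parameters are assumptions for illustration, not the paper's tuned settings.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic 4-dimensional features for 4 emotion classes, 50 utterances each;
# each class mean is shifted so the classes are partially separable.
X = rng.normal(size=(200, 4)) + np.repeat(np.arange(4), 50)[:, None]
y = np.repeat(np.arange(4), 50)

models = {
    "SVM": SVC(),
    "k-NN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "MLP": MLPClassifier(max_iter=2000, random_state=0),
}

# Mean accuracy over five folds, with per-fold feature standardization.
scores = {
    name: cross_val_score(make_pipeline(StandardScaler(), clf), X, y, cv=5).mean()
    for name, clf in models.items()
}
```

For the speaker-dependent scenario the same loop would simply be run once per speaker on that speaker's utterances; for the speaker-independent scenario, a grouped split (e.g. `GroupKFold` keyed on speaker ID) avoids leaking a speaker's data between train and test folds.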

Results demonstrate a clear advantage for speaker‑dependent modeling. The SVM achieved an average accuracy exceeding 92 % in the speaker‑dependent case, while k‑NN and MLP attained around 85 % accuracy. By contrast, the speaker‑independent models peaked at roughly 78 % accuracy, indicating that inter‑speaker variability in vocal production significantly hampers a universal classifier. Feature importance analysis revealed that the P2P distance contributed the most discriminative power, with pitch and energy providing supplementary information. Principal Component Analysis (PCA) for dimensionality reduction showed a steep performance drop when the feature space was compressed, suggesting that the raw P2P metric already encapsulates most of the relevant variance.
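The PCA observation above, that compressing an already-compact feature set costs accuracy, can be illustrated by inspecting explained variance ratios. The feature scales below are invented to mimic one dominant feature (standing in for the P2P metric); this is a sketch of the diagnostic, not the paper's analysis.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Four synthetic features with very different scales: the first dominates,
# loosely mimicking a single highly informative descriptor.
X = rng.normal(size=(120, 4)) * np.array([3.0, 1.0, 0.5, 0.1])

pca = PCA().fit(X)
ratios = pca.explained_variance_ratio_
# When one feature carries most of the variance, the first component dominates,
# and discarding the remaining components removes genuinely discriminative
# (if lower-variance) information.
```

Note that PCA ranks directions by variance, not by class separability, which is one reason aggressive compression of a small, hand-crafted feature vector can hurt classification.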

The authors acknowledge several limitations. The dataset is modest in size (30 speakers), which restricts the statistical robustness and generalizability of the findings. Emotion labels were self‑reported, introducing potential subjectivity. Moreover, only four discrete emotions were examined, leaving out more nuanced or mixed affective states. The reliance on hand‑crafted features also entails a non‑trivial engineering effort and may miss latent patterns present in the raw signal.

Future work is proposed along three main axes. First, expanding the corpus to include a larger, more diverse speaker pool and additional emotional categories would improve external validity. Second, integrating multimodal inputs—such as facial video, physiological signals, or lexical content—could enrich the representation of affect and boost recognition rates. Third, transitioning to end‑to‑end deep learning architectures (e.g., LSTM, Transformer, or convolutional networks) would enable automatic feature learning directly from waveforms, potentially mitigating speaker‑specific biases and achieving better cross‑speaker generalization. The paper concludes that while the peak‑to‑peak distance is a promising, computationally inexpensive cue for emotion detection, optimal performance currently relies on personalized models, and broader applicability will require more sophisticated modeling strategies and richer data.

