Divide-and-Conquer based Ensemble to Spot Emotions in Speech using MFCC and Random Forest


📝 Abstract

Besides spoken words, speech signals also carry information about speaker gender, age, and emotional state which can be used in a variety of speech analysis applications. In this paper, a divide and conquer strategy for ensemble classification has been proposed to recognize emotions in speech. Intrinsic hierarchy in emotions has been utilized to construct an emotions tree, which assisted in breaking down the emotion recognition task into smaller sub tasks. The proposed framework generates predictions in three phases. Firstly, emotions are detected in the input speech signal by classifying it as neutral or emotional. If the speech is classified as emotional, then in the second phase, it is further classified into positive and negative classes. Finally, individual positive or negative emotions are identified based on the outcomes of the previous stages. Several experiments have been performed on a widely used benchmark dataset. The proposed method was able to achieve improved recognition rates as compared to several other approaches.
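The three-phase pipeline in the abstract can be sketched as a simple routing function over the emotion tree. This is a minimal illustration, not the paper's implementation: the stage classifiers are passed in as callables, and the label names are assumptions standing in for whatever emotion set the trained models use.

```python
def classify_emotion(features, is_emotional, is_positive, pick_emotion):
    """Route a feature vector through the hierarchical emotion tree.

    is_emotional, is_positive, and pick_emotion are stand-ins for the
    trained per-stage classifiers.
    """
    # Phase 1: neutral vs. emotional.
    if not is_emotional(features):
        return "neutral"
    # Phase 2: positive vs. negative valence.
    valence = "positive" if is_positive(features) else "negative"
    # Phase 3: identify the individual emotion within that valence group.
    return pick_emotion(features, valence)


# Usage with trivial stub classifiers (hypothetical thresholds):
label = classify_emotion(
    [0.2, 0.7],
    is_emotional=lambda f: f[1] > 0.5,
    is_positive=lambda f: f[0] > 0.5,
    pick_emotion=lambda f, v: "happy" if v == "positive" else "angry",
)
# f[1] = 0.7 -> emotional; f[0] = 0.2 -> negative; so label == "angry".
```

Because each stage only sees inputs that survived the previous stage, every classifier solves a smaller, easier sub-problem than full multi-class emotion recognition.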


📄 Content

Published as a conference paper in The 2nd International Integrated Conference & Concert on Convergence (2016).
Abdul Malik Badshah, Jamil Ahmad, Mi Young Lee, Sung Wook Baik*

College of Electronics and Information Engineering, Sejong University, Seoul, South Korea
*Corresponding author email: sbaik@sejong.ac.kr
Keywords: speech emotions, divide-and-conquer, emotion classification, random forest

1. Introduction
Body poses, facial expressions, and speech are the most common ways to express emotions [1]. However, certain emotions like disgust and boredom cannot be identified from gestures or facial expressions, but they can be effectively recognized from speech tone due to differences in energy, speaking rate, and linguistic and semantic information [2]. Speech emotion recognition (SER) has attracted increasing attention in the past few decades due to the growing use of emotions in affective computing and human-computer interaction. It has played a tremendous role in changing the way we interact with computers over the last few years [3]. Affective computing techniques use emotions to interact in more natural ways. Similarly, automatic voice response (AVR) systems can become more adaptive and user-friendly if the affective states of users can be identified during interactions. The performance of such systems depends heavily on the ability to accurately recognize emotions. SER is a challenging task due to the inherent complexity of emotional expression.
Information regarding the emotions of a person can help in several speech analysis applications such as human-computer interaction, humanoid robots, mobile communication, and call centers [4]. Emergency call centers all around the world deal with fake calls on a daily basis, so it has become necessary to avoid wasting precious resources in responding to them. Utilizing information like age, gender, emotional state, environmental sounds, and the speech transcript, the seriousness of a situation can be assessed effectively. If a caller reports an abnormal situation to an emergency call center, SER can be used to assess whether the person is under stress or fear, which helps gauge the caller's credibility and supports the call center in making an effective decision. The emotion detection work presented here is a portion of a larger project on lie detection from speech in emergency calls.


This paper describes a divide-and-conquer (DC) approach to recognize emotions from speech using acoustic features. Three classifiers, namely support vector machine (SVM) [5], Random Forest (RF) [6], and Decision Tree (DT) [7], were evaluated. For each classifier, four models were used, one at each of the four stages of the proposed method.
2. Related Work
Speech emotion recognition has been studied extensively in the past decade. Several different strategies involving a variety of feature extraction methods and classification schemes have been proposed. Some recent work in this area is presented here.
Vogt and Andre [8] proposed a gender-dependent model for emotion detection from speech. They used 20 features, consisting of 17 MFCCs, 2 energy coefficients, and 1 pitch value, for the gender-independent case, and 30 features, consisting of 22 MFCCs, 3 energy coefficients, and 5 pitch values, for male and female emotions, to train Naïve Bayes classifiers. They evaluated their framework on two different datasets, Berlin and SmartKom, and achieved improvements of 2% and 4% in overall recognition rates, respectively.
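The MFCC features that both the related work and the title rely on follow a standard recipe: frame the signal, take the power spectrum, pool it through a triangular mel filterbank, and decorrelate the log energies with a DCT. Below is a minimal NumPy sketch of that recipe under assumed parameter values (16 kHz audio, 25 ms frames with 10 ms hops, 26 mel bands, 13 coefficients); it illustrates the feature type, not the exact configuration used in the paper.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, n_mels=26, n_coeffs=13,
         frame_len=400, hop=160):
    # Split the signal into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)

    # Power spectrum of each frame (zero-padded to n_fft).
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Triangular mel filterbank spanning 0 Hz to the Nyquist frequency.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fbank[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - c, 1)

    # Log filterbank energies, then a DCT-II to decorrelate them.
    feat = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs),
                                  (2 * n + 1) / (2 * n_mels)))
    return feat @ dct.T  # shape: (n_frames, n_coeffs)
```

Each row of the result is a 13-dimensional frame descriptor; per-utterance statistics over these rows (means, variances) are the kind of fixed-length vectors a classifier such as Random Forest can consume.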
