A breakthrough in Speech emotion recognition using Deep Retinal Convolution Neural Networks
📝 Abstract
Speech emotion recognition (SER) studies the formation and change of a speaker's emotional state from the perspective of the speech signal, so as to make interaction between human and computer more intelligent. SER is a challenging task that suffers from scarce training data and low prediction accuracy. Here we propose a data augmentation algorithm based on the imaging principle of the retina and a convex lens: by changing the distance between the spectrogram and the lens, we acquire spectrograms of different sizes and thereby increase the amount of training data. Combining this with deep learning to extract high-level features, we propose Deep Retinal Convolution Neural Networks (DRCNNs) for SER and achieve an average accuracy above 99%. The experimental results indicate that DRCNNs outperform previous studies in terms of both the number of emotions recognized and recognition accuracy. Predictably, our results will dramatically improve human-computer interaction.
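The lens-based augmentation described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the focal length, the set of object distances, and the nearest-neighbor resizing are all hypothetical choices used only to show the idea of generating differently sized spectrograms from the thin-lens magnification.

```python
import numpy as np

def lens_magnification(u, f):
    """Thin-lens equation 1/u + 1/v = 1/f gives image distance
    v = u*f/(u - f), so magnification m = v/u = f/(u - f).
    Requires u > f for a real image."""
    return f / (u - f)

def resize_nearest(img, scale):
    """Nearest-neighbor rescaling of a 2-D spectrogram by `scale`."""
    h, w = img.shape
    nh, nw = max(1, round(h * scale)), max(1, round(w * scale))
    rows = np.minimum((np.arange(nh) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(nw) / scale).astype(int), w - 1)
    return img[np.ix_(rows, cols)]

def augment(spectrogram, f=1.0, distances=(1.5, 2.0, 3.0)):
    """One resized copy per object distance u (in units of f):
    u < 2f enlarges the spectrogram, u > 2f shrinks it."""
    return [resize_nearest(spectrogram, lens_magnification(u, f))
            for u in distances]
```

Note that at u = 2f the magnification is exactly 1 (same size); moving the spectrogram closer to the lens enlarges it and moving it farther away shrinks it, which is how varying the distance yields multiple training samples from one spectrogram.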
📄 Content
NIU et al.: A breakthrough in Speech emotion recognition using Deep Retinal Convolution Neural Networks
- College of Computer Science, Chongqing University, Chongqing 400044, China. {dszou@cqu.edu.cn}
- School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China

Index Terms—speech emotion recognition; deep learning; speech spectrogram; deep retinal convolution neural networks

I. INTRODUCTION

SER uses a computer to analyze the speaker's voice signal and its change process, to uncover the speaker's inner emotions and thoughts, and ultimately to achieve more intelligent and natural human-computer interaction (HCI), which is of great significance for developing new HCI systems and realizing artificial intelligence [1]-[3]. Until now, SER methods can be divided into two categories: traditional machine learning methods and deep learning methods. The key to the traditional machine learning approach to SER is feature selection, which directly determines recognition accuracy.
By far the most common feature extraction methods include pitch-frequency features, energy-related features, formant features, spectral features, etc. After the features are extracted, a machine learning method is used for training and prediction: Artificial Neural Networks (ANN) [4]-[7], Bayesian network models [8], Hidden Markov Models (HMM) [9]-[12], Support Vector Machines (SVM) [13], [14], Gaussian Mixture Models (GMM) [15], and multi-classifier fusion [16], [17]. The primary advantage of this approach is that models can be trained without very large datasets. The disadvantage is that it is difficult to judge the quality of the chosen features, and key features may be lost, which decreases recognition accuracy. It is also difficult to ensure good results across a variety of databases.

Compared with traditional machine learning, deep learning can extract high-level features [18], [19], and it has been shown to exceed human performance on visual tasks [20], [21]. Currently, deep learning has been applied to SER by many researchers. Yelin Kim et al. [22] proposed and evaluated a suite of Deep Belief Network (DBN) models that can capture nonlinear features; those models improve emotion classification performance over baselines that do not employ deep learning. However, the accuracy is only 60%-70%. W. Zheng et al. [23] proposed a DBN-HMM model that improves emotion classification accuracy in comparison with state-of-the-art methods. Q. Mao et al. [24] proposed learning affect-salient features for SER using a CNN, which leads to stable and robust recognition performance in complex scenes. Z. Huang et al. [25] trained a semi-CNN model that is stable and robust in complex scenes and outperforms several well-established SER features.
However, the accuracy is only 78% on the SAVEE database and 84% on the Emo-DB database. K. Han et al. [26] proposed a DNN-ELM model that yields a 20% relative accuracy improvement over an HMM model. Sathit Prasomphan [27] detected emotions using information inside the spectrogram, then used a neural network to classify the emotions of the EMO-DB database, achieving accuracy of up to 83.28% on five emotions. W. Zheng [28] also used spectrograms with DCNNs and achieved about 40% accuracy on the IEMOCAP database. H. M. Fayek [29] provided a method to augment training data, but the accuracy is below 61% on the ENTERFACE and SAVEE databases. Jinkyu Lee [30] extracted high-level features and used a recurrent neural network (RNN) to predict emotions on the IEMOCAP database, obtaining about 62% accuracy, higher than a DNN model. Q. Jin [31] generated acoustic and lexical features to classify the emotions of the IEMOCAP database and performed four-class emotion recognition.
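Several of the systems surveyed above, including the proposed DRCNNs, operate on speech spectrograms rather than hand-crafted features. A minimal sketch of turning a waveform into a log-power spectrogram suitable as CNN input is shown below; the sampling rate, window length, and overlap are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
from scipy.signal import spectrogram

def log_spectrogram(wave, fs=16000, nperseg=512, noverlap=384):
    """Return (freqs, times, log-power spectrogram) for a 1-D waveform.
    The 2-D log-power array (freq_bins x frames) is what a CNN consumes."""
    f, t, sxx = spectrogram(wave, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return f, t, 10.0 * np.log10(sxx + 1e-10)  # dB scale, small floor

# Example: one second of a 1 kHz tone; energy concentrates near 1 kHz.
fs = 16000
tone = np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)
f, t, s = log_spectrogram(tone, fs=fs)
```

With nperseg=512 the frequency axis has 257 bins at 31.25 Hz resolution, so a pure 1 kHz tone lands in a single bin; real speech fills the time-frequency plane with the formant and pitch structure that the surveyed CNN models learn from.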
This content is AI-processed based on ArXiv data.