Research on several key technologies in practical speech emotion recognition


📝 Abstract

This dissertation studies practical speech emotion recognition, covering several cognition-related emotion types: fidgetiness, confidence and tiredness. High-quality naturalistic emotional speech data are the basis of the research. Practical emotional speech is induced with the following techniques: cognitive tasks, computer games, noise stimulation, sleep deprivation and movie clips. A practical speech emotion recognition system based on the Gaussian mixture model is studied. A set of two-class classifiers is adopted to improve performance in the small-sample case. To exploit context information in continuous emotional speech, a Gaussian mixture model embedded with Markov networks is proposed. System robustness is then studied further. First, a noise reduction algorithm based on auditory masking properties is introduced to practical speech emotion recognition for the first time. Second, to deal with the complicated unknown emotion types encountered in real situations, an emotion recognition method with rejection ability is proposed, which enhances the system's compatibility with unknown emotion samples. Third, to cope with the difficulties brought by a large number of unknown speakers, an emotional feature normalization method based on speaker-sensitive feature clustering is proposed. Fourth, by adding an electrocardiogram channel, a bi-modal emotion recognition system based on speech and electrocardiogram signals is introduced for the first time. The methods studied in this dissertation may be extended to cross-language and whispered speech emotion recognition.
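The abstract names Gaussian mixture model scoring and a rejection option for unknown emotion types, but gives no formulas. The sketch below is one common way to combine the two, not necessarily the dissertation's method: each emotion has its own diagonal-covariance mixture, the label with the highest log-likelihood wins, and the sample is rejected when even the best score falls below a threshold. The function names, the two-dimensional toy parameters in `MODELS`, and the value of `THRESHOLD` are all assumptions made for illustration.

```python
import math

def diag_gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_loglik(x, gmm):
    """Log-likelihood of x under a GMM given as (weight, mean, var) components."""
    terms = [math.log(w) + diag_gaussian_logpdf(x, m, v) for w, m, v in gmm]
    peak = max(terms)  # log-sum-exp trick for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def classify_with_rejection(x, models, threshold):
    """Return the best-scoring label, or None if no model explains x well enough."""
    scores = {label: gmm_loglik(x, gmm) for label, gmm in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

# Hypothetical 2-D models; a real system would fit these to acoustic features.
MODELS = {
    "fidgetiness": [(0.5, [2.0, 2.0], [1.0, 1.0]), (0.5, [3.0, 2.5], [1.0, 1.0])],
    "confidence": [(1.0, [-2.0, 0.0], [1.0, 1.0])],
    "tiredness": [(1.0, [0.0, -3.0], [1.0, 1.0])],
}
THRESHOLD = -6.0  # assumed value; would normally be tuned on held-out data

print(classify_with_rejection([2.2, 2.1], MODELS, THRESHOLD))
print(classify_with_rejection([10.0, 10.0], MODELS, THRESHOLD))
```

A fixed likelihood threshold is the simplest rejection rule. It trades missed detections against false acceptances, so in practice the threshold would be chosen from validation data rather than set by hand.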

📄 Content

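This work was supported by the National Natural Science Foundation of China project "Research on Emotional Feature Analysis and Recognition Methods for Whispered Speech" (No. 60975017) and the National Natural Science Foundation of China project "Research on Key Technologies and Applications of Speaker-Independent Practical Speech Emotion Feature Analysis and Recognition" (No. 61273266).

Doctoral Dissertation

Research on Several Key Technologies in Practical Speech Emotion Recognition

Major: Information and Communication Engineering
Candidate: Huang Chengwei (黄程韦)
Supervisor: Zhao Li (赵力)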

RESEARCH ON SEVERAL KEY TECHNOLOGIES IN PRACTICAL SPEECH EMOTION RECOGNITION

A Dissertation Submitted to Southeast University For the Academic Degree of Doctor of Engineering

School of Information Science and Engineering Southeast University 2013

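Southeast University Dissertation Declaration of Originality

I declare that the dissertation submitted is the result of research I carried out under the guidance of my supervisor. To the best of my knowledge, except where specifically marked and acknowledged in the text, the dissertation contains no research results that have been published or written by others, and no material that has been used to obtain a degree or certificate from Southeast University or any other educational institution. Any contribution made to this research by colleagues who worked with me has been clearly stated and acknowledged in the dissertation.

Candidate's signature:

Date:

Southeast University Dissertation Authorization Statement

Southeast University, the Institute of Scientific and Technical Information of China, and the National Library of China have the right to retain copies and electronic versions of the dissertation I have submitted, and may preserve it by photocopying, reduced-size printing or other means of reproduction. The content of my electronic version is consistent with that of the printed dissertation. Except for confidential dissertations within their confidentiality period, the dissertation may be consulted and borrowed, and all or part of its content may be made public (including publication). The Graduate School of Southeast University is authorized to handle such publication.

Candidate's signature __________ Supervisor's signature __________

Date _________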

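Abstract (Chinese)

Speech emotion recognition technology identifies a speaker's psychological and emotional state from the speech signal, and is of considerable theoretical and practical value. Previous research has focused mainly on basic emotion categories, which does not meet the needs of real applications. This dissertation studies practical speech emotion recognition, covering three emotions related to cognitive processes: fidgetiness, confidence and tiredness.

Acquiring highly naturalistic emotional speech is the basis of the research. Practical emotional speech data are induced by the following means: cognitive tasks, computer games, noise stimulation, sleep deprivation and watching movie clips. A practical speech emotion database is built through human annotation and selection. By adding emotional electrocardiogram signals, a bi-modal database of speech and electrocardiogram data is built.

Emotional speech feature analysis is an important step. A total of 481 static statistical features are extracted, including voice quality features and prosodic features. Feature analysis is carried out for the practical speech emotions of fidgetiness, confidence and tiredness, and the distribution of the harmonic-to-noise ratio across these emotions is studied. Feature dimensionality reduction and recognition experiments verify the effectiveness of the extracted features.

On the highly naturalistic emotional corpus obtained, a practical speech emotion recognition system based on the Gaussian mixture model is studied. First, satisfactory recognition performance is obtained when sufficient data are available. Second, to address the shortcomings of the Gaussian mixture model, system performance under small-sample conditions is improved by combining it with a set of two-class classifiers. Third, considering the contextual relationships in continuous emotional speech, a Gaussian mixture model embedded with Markov networks is proposed, which models the continuity of emotion and improves the recognition rate.

With a reasonably complete Gaussian mixture model speech emotion recognition system in place, system robustness is studied further. First, the influence of noise on the speech emotion recognition system is studied, and a noise reduction algorithm based on the auditory masking effect is applied to practical speech emotion recognition for the first time, with satisfactory results. Second, to handle the complicated unknown emotion types found under real conditions, an emotion recognition method with rejection ability is proposed, which enhances the system's compatibility with unknown emotion samples. Third, to address the recognition difficulties caused by a large number of unspecified speakers, an emotional feature normalization method based on speaker feature clustering is proposed, which improves the system's robustness to such speakers. Fourth, by adding an electrocardiogram channel, bi-modal emotion recognition that fuses speech and electrocardiogram signals is carried out for the first time, which improves the system's recognition rate and its resilience to failure.

The speech emotion recognition system in this dissertation can be extended to cross-language and whispered speech emotion recognition. The dissertation analyzes the difficulties of cross-language speech emotion recognition, studies features that generalize well across Chinese and German, and performs cross-language emotion recognition. Experimental results show that the anger model in this dissertation has strong commonality across the two languages. The dissertation also analyzes the characteristics and difficulties of whispered speech emotion recognition, extracts effective whispered speech emotion features, and validates the proposed Gaussian mixture model embedded with Markov networks on whispered speech emotion recognition, obtaining satisfactory recognition results.

Finally, the dissertation analyzes and discusses applications of practical speech emotion recognition technology and gives an outlook on the development of speech emotion computing.

Keywords: emotion recognition; noise; speaker independence; Gaussian mixture model; Markov network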
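The abstract says that context in continuous emotional speech is modeled by embedding Markov networks in the Gaussian mixture model, without giving the formulation. As a rough illustration of why context helps, the sketch below uses a deliberately simplified stand-in: a chain over consecutive utterances, decoded with the Viterbi algorithm, where each utterance contributes a per-label log-score (for example, a Gaussian mixture log-likelihood) and neighboring labels are coupled by a transition log-potential. The chain structure, `decode_sequence`, and all numbers in `UNARY` and `TRANSITION` are assumptions for illustration and are not taken from the dissertation.

```python
import math

def decode_sequence(unary, transition):
    """Viterbi decoding over a chain of utterances.

    unary: list of dicts mapping label -> log-score for one utterance.
    transition: dict mapping (prev_label, label) -> log-potential.
    Returns the label sequence with the highest total score.
    """
    labels = list(unary[0])
    best = dict(unary[0])  # best score of any path ending in each label
    backpointers = []
    for scores in unary[1:]:
        step, new_best = {}, {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[p] + transition[(p, cur)])
            step[cur] = prev
            new_best[cur] = best[prev] + transition[(prev, cur)] + scores[cur]
        backpointers.append(step)
        best = new_best
    # Trace back from the best final label.
    path = [max(labels, key=best.get)]
    for step in reversed(backpointers):
        path.append(step[path[-1]])
    return path[::-1]

# Hypothetical per-utterance log-scores; the middle utterance is ambiguous.
UNARY = [
    {"tiredness": -1.0, "confidence": -3.0},
    {"tiredness": -2.2, "confidence": -2.0},
    {"tiredness": -1.0, "confidence": -3.0},
]
STAY, SWITCH = math.log(0.8), math.log(0.2)  # assumed: emotions tend to persist
TRANSITION = {
    (a, b): STAY if a == b else SWITCH
    for a in ("tiredness", "confidence")
    for b in ("tiredness", "confidence")
}

independent = [max(u, key=u.get) for u in UNARY]
print(independent)
print(decode_sequence(UNARY, TRANSITION))
```

Scored independently, the ambiguous middle utterance flips to a different label; with the persistence prior the decoded sequence stays consistent. This is the kind of effect that modeling emotional continuity is meant to capture.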

ABSTRACT

Speech emotion recognition technology is designed to recognize the speaker’s emotional states from speech signals, and is of both theoretical and practical value. Past research has concentrated on basic emotion types, which may not satisfy real-world requirements. This dissertation studies practical speech emotion recognition, covering several cognition-related emotion types: fidgetiness, confidence and tiredness.

High-quality naturalistic emotional speech data are the basis of this research. The following techniques are used for inducing practical emotional speech: cognitive tasks, computer games, noise stimulation, sleep deprivation and movie clips. A practical speech emotion database is built through human annotation and selection. By adding electrocardiogram signals, a bi-modal emotional database is built.

Emotional speech feature analysis is an important step. In this dissertation 481 static features are extracted, including voice quality features and prosodic features. Feature analysis is carried out on practical speech emotions such as fidgetiness, confidence and tiredness. The harmonic-to-noise ratio is studied for its distribution characteristics across the practical speech emotions. Feature reduction and recognition experiments verify the effectiveness of the features proposed in this dissertation.

With the high-quality naturalistic emotional speech data obtained, a practical speech emotion recognition system based on the Gaussian mixture model is studied. First, promising recognition results are achieved with sufficient training data. Second, considering the drawbacks of the Gaussian mixture model, a set of two-class classifiers is adopted to improve performance in the small-sample case. Third, considering the context information in continuous emotional speech, a Gaussian mixture model embedded with Markov networks is proposed. The recognition rate is improved by modeling the continuity of emotions.

Having established a satisfactory speech emotion recognition system based on the Gaussian mixture model, a further study of system robustness is carried out. First, the influence of noise on the speech emotion recognition system is studied. A noise reduction algorithm based on auditory masking properties is introduced to practical speech emotion recognition for the first time, and achieves satisfactory results. Second, to deal with the complicated unknown emotion types encountered in real situations, an emotion recognition method with rejection ability is proposed, which enhances the system's compatibility with unknown emotion samples. Third, coping with the difficulties brought by a large number of unknown speakers, an emotional feature normalizatio

This content is AI-processed based on ArXiv data.
