Speaker Identification in Shouted Talking Environments Based on Novel Third-Order Hidden Markov Models


📝 Abstract

In this work, we propose, implement, and evaluate novel models called Third-Order Hidden Markov Models (HMM3s) to address the low performance of text-independent speaker identification in shouted talking environments. The proposed models have been tested on our collected speech database using Mel-Frequency Cepstral Coefficients (MFCCs). Our results demonstrate that HMM3s significantly improve speaker identification performance in such talking environments, by 11.3% and 166.7% relative to second-order hidden Markov models (HMM2s) and first-order hidden Markov models (HMM1s), respectively. The results achieved with the proposed models are close to those obtained in subjective assessment by human listeners.


📄 Content

Speaker Identification in Shouted Talking Environments Based on Novel Third-Order Hidden Markov Models

Ismail Shahin Department of Electrical and Computer Engineering University of Sharjah Sharjah, United Arab Emirates E-mail: ismail@sharjah.ac.ae

Abstract - In this work, we propose, implement, and evaluate novel models called Third-Order Hidden Markov Models (HMM3s) to address the low performance of text-independent speaker identification in shouted talking environments. The proposed models have been tested on our collected speech database using Mel-Frequency Cepstral Coefficients (MFCCs). Our results demonstrate that HMM3s significantly improve speaker identification performance in such talking environments, by 11.3% and 166.7% relative to second-order hidden Markov models (HMM2s) and first-order hidden Markov models (HMM1s), respectively. The results achieved with the proposed models are close to those obtained in subjective assessment by human listeners.
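The paper does not include code for its MFCC front end, but the standard pipeline (framing, windowing, power spectrum, mel filterbank, DCT) can be sketched with plain NumPy. The frame size, hop length, and filterbank parameters below are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=12):
    """Minimal MFCC extraction sketch (numpy only). Parameter values
    are illustrative defaults, not the paper's settings."""
    # frame the signal and apply a Hamming window
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft]
                       for i in range(n_frames)])
    frames = frames * np.hamming(n_fft)
    # power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # triangular mel filterbank, equally spaced on the mel scale
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # log filterbank energies, then DCT-II keeping the lowest coefficients
    log_energy = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1)
                   / (2 * n_mels))
    return log_energy @ basis.T  # shape: (n_frames, n_ceps)
```

Each row of the returned matrix is one MFCC feature vector; in a system like the one described here, these vectors would be the observations fed to the speakers' HMMs.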

Keywords - First-order hidden Markov models; second-order hidden Markov models; shouted talking environments; speaker identification; third-order hidden Markov models

I. INTRODUCTION

Speaker recognition comprises two tasks: speaker identification and speaker verification (authentication). Speaker identification is the process of automatically deciding who is speaking from a set of known speakers. Speaker verification is the process of automatically accepting or rejecting the identity claimed by a speaker. Speaker identification can be used in criminal investigations to determine which suspects uttered the voice captured at the scene of a crime; it can also be used in civil cases or by the media. Speaker verification is widely used for secure access to services via telephone, including home shopping, home banking transactions over a telephone network, security control for private information areas, remote access to computers, and many telecommunication services [1]. Based on the text to be spoken, speaker recognition is categorized into text-dependent and text-independent cases. In the text-dependent case, the speaker must utter the same text in both training and testing; in the text-independent case, recognition does not depend on the text being spoken.
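The two decision rules described above can be sketched in a few lines. The scores and threshold are hypothetical log-likelihoods for illustration, not values from the paper:

```python
def identify(log_likelihoods):
    """Closed-set identification: choose the enrolled speaker whose
    model assigns the test utterance the highest log-likelihood."""
    return max(range(len(log_likelihoods)), key=lambda i: log_likelihoods[i])

def verify(claimed_log_likelihood, threshold):
    """Verification: accept the claimed identity only when the claimed
    speaker's model scores at or above a decision threshold."""
    return claimed_log_likelihood >= threshold

# hypothetical log-likelihoods of one utterance under three speaker models
scores = [-1042.7, -998.3, -1100.5]
print(identify(scores))          # index of the most likely speaker
print(verify(scores[1], -1050))  # claimed identity accepted or rejected
```

Identification scales with the number of enrolled speakers (one score per model), whereas verification evaluates a single claimed model against a threshold regardless of population size.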

II. LITERATURE REVIEW

Many studies in the speech recognition and speaker recognition areas focus on speech uttered in neutral talking environments [1], [2], [3], [4] and on speech produced in stressful talking environments [5], [6], [7], [8]. In the literature, a number of the studies addressing these two areas under stress also examine them in shouted talking environments [8], [9], [10], [11], [12], [13], [14]. Some talking environments are designed to simulate speech generated by different speakers under real stressful talking conditions. Hansen, Cummings, and Clements employed the Speech Under Simulated and Actual Stress (SUSAS) database, in which eight talking conditions are used to simulate speech uttered under real stressful talking conditions, alongside three real talking conditions [5], [6], [7]. The eight simulated conditions are neutral, loud, soft, angry, fast, slow, clear, and question; the three real conditions are 50% task, 70% task, and Lombard. Chen used six talking environments to simulate speech under real stressful talking environments: neutral, fast, loud, Lombard, soft, and shouted [8]. Shouted talking environments are those in which speakers shout with the intention of producing a very loud acoustic signal, either to increase its range of transmission or its ratio to background noise. Chen [8] studied talker-stress-induced intraword variability and an algorithm that compensates for the systematic changes observed, based on hidden Markov models (HMMs) trained with speech tokens from different talking conditions. Raja and Dandapat [9] studied speaker recognition under stressed conditions with the aim of improving the degraded performance under such conditions. They used four distinct stressed conditions of the SUSAS database: neutral, angry, Lombard, and question. They concluded that speaker identification performance is lowest when speakers talk in angry talking environments [9].

Angry talking environments are used as alternatives to shouted talking environments, since the two cannot be entirely separated in real life [11], [12], [13], [14]. Zhang and Hansen [10] reported on the analysis of speech characteristics in five different vocal modes (whispered, soft, neutral, loud, and shouted) and on identifying features that discriminate among these speech modes. Shahin focused in four of his earlier studies [11], [12], [13], [14] on improving speaker identification performance in shouted talking environments.
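The excerpt does not give implementation details for the HMM3s, but a standard way to realize a higher-order HMM is to expand it into an equivalent first-order model over tuples of states, so that the usual HMM1 algorithms (forward, Viterbi, Baum-Welch) apply unchanged. The sketch below applies that general technique to a third-order transition tensor; it is an illustrative assumption, not the author's implementation:

```python
import numpy as np

def expand_hmm3_to_hmm1(A3):
    """Expand a third-order transition tensor into an equivalent
    first-order transition matrix over state triples.

    A3[i, j, k, l] = P(q_t = l | q_{t-1} = k, q_{t-2} = j, q_{t-3} = i).
    The expanded chain has one state per triple (i, j, k); one step
    moves from (i, j, k) to (j, k, l) with probability A3[i, j, k, l].
    """
    N = A3.shape[0]
    A1 = np.zeros((N ** 3, N ** 3))
    idx = lambda i, j, k: (i * N + j) * N + k  # flatten a triple
    for i in range(N):
        for j in range(N):
            for k in range(N):
                for l in range(N):
                    A1[idx(i, j, k), idx(j, k, l)] = A3[i, j, k, l]
    return A1
```

The expanded matrix is sparse (each triple has at most N successors), which is why higher-order models remain tractable despite the N^3 state count; the same construction with state pairs yields an HMM2.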

This content is AI-processed based on ArXiv data.
