Speaker Recognition for Children’s Speech

📝 Abstract

This paper presents results on Speaker Recognition (SR) for children’s speech, using the OGI Kids corpus and GMM-UBM and GMM-SVM SR systems. Regions of the spectrum containing important speaker information for children are identified by conducting SR experiments over 21 frequency bands. As for adults, the spectrum can be split into four regions, with the first (containing primary vocal tract resonance information) and third (corresponding to high-frequency speech sounds) being most useful for SR. However, the frequencies at which these regions occur are from 11% to 38% higher for children. It is also noted that sub-band SR rates are lower for younger children. Finally, results are presented for SR experiments that identify a child in a class (30 children, similar age) and in a school (288 children, varying ages). Class performance depends on age, with accuracy varying from 90% for young children to 99% for older children. The identification rate achieved for a child in a school is 81%.

📄 Content

Speaker Recognition for Children’s Speech

Saeid Safavi, Maryam Najafian, Abualsoud Hanani, Martin Russell, Peter Jančovič, Michael Carey

School of Electronic, Electrical and Computer Engineering, University of Birmingham, Birmingham, B15 2TT, England

{sxs796, mxn978, m.j.russell, p.jancovic, m.carey}@bham.ac.uk, ahanani@birzeit.edu

Abstract

This paper presents results on Speaker Recognition (SR) for children’s speech, using the OGI Kids corpus and GMM-UBM and GMM-SVM SR systems. Regions of the spectrum containing important speaker information for children are identified by conducting SR experiments over 21 frequency bands. As for adults, the spectrum can be split into four regions, with the first (containing primary vocal tract resonance information) and third (corresponding to high-frequency speech sounds) being most useful for SR. However, the frequencies at which these regions occur are from 11% to 38% higher for children. It is also noted that sub-band SR rates are lower for younger children. Finally, results are presented for SR experiments that identify a child in a class (30 children, similar age) and in a school (288 children, varying ages). Class performance depends on age, with accuracy varying from 90% for young children to 99% for older children. The identification rate achieved for a child in a school is 81%.

Index Terms: speaker verification, speaker identification, child speech, Gaussian mixture model, support vector machine, bandwidth

1. Introduction

As human interaction with computers becomes more pervasive, and its applications become more private and sensitive, the value of automatic Speaker Recognition (SR) based on vocal characteristics increases. The employment of SR technology for children could be beneficial in several application areas, including child security and protection, and education. For instance, social networking sites are most popular with teenagers and young adults, with almost half of children aged from 8 to 17 who use the internet having set up their own profile on a social networking site [1].
An SR system that identifies a child based on his or her voice, and confirms the identity of the individual with whom the child is communicating, could be a valuable safeguard for a child engaged in social networking. Other possible applications are in education. For example, an interactive educational tutor that could identify each child in a class could automatically continue a previous lesson, adapt its content to suit the child, and log the child’s responses appropriately without the child needing to go through a formal login process.

Although automatic recognition of children’s speech has been the subject of considerable research effort, there is little published work on issues and algorithms related to automatic verification of a child’s identity from his or her speech. For example, we do not know how the increased inter- and intra-speaker variability of children’s speech [4] will affect SR performance. Variability is highest for young children, converging to adult values when children reach the age of 13. Even for young children, there is some evidence that the degree of variability varies significantly between individuals [6].
It has been shown that the acoustic and linguistic characteristics of children’s speech differ from those of adults’ speech [3-5]. For example, children’s speech is characterized by higher pitch, and perceptually important features such as formants occur at higher frequencies [4]. Consequently, the impact of bandwidth reduction on speech recognition accuracy is greater for children’s speech than for adults’ speech [6, 7]. However, we do not know the significance of different frequency bands for SR for children, although corresponding studies for adult SR have been reported [2].

The success of Gaussian Mixture Model - Universal Background Model (GMM-UBM) and GMM-Support Vector Machine (GMM-SVM) approaches to adult SR motivated us to apply these techniques to our child SR task. The distribution of acoustic feature vectors for a population of speakers is typically captured using a UBM (a speaker-independent GMM constructed using data from a variety of speakers and background conditions) [8, 9]. Speaker-dependent GMMs are then built by MAP adaptation of the UBM [10]. Alternatively, discriminative approaches such as SVMs can be used, which have been shown to obtain comparable, and in some cases better, performance than GMM-based systems. The combination of GMM supervectors, comprising the stacked parameters of the GMM components, with SVMs has also been successful [11]. SR systems usually employ score normalization to cope with score variability and to simplify decision threshold tuning.
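The GMM-UBM pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes diagonal covariances, adapts only the component means, uses a relevance factor of 16 (a common choice, not stated in the text), and runs on an invented one-dimensional toy UBM and enrolment set. All function and variable names are hypothetical.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def component_log_terms(x, gmm):
    """log(weight_k) + log N(x | mean_k, var_k) for every component k."""
    return [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(gmm["weights"], gmm["means"], gmm["vars"])
    ]

def log_sum_exp(values):
    """Numerically stable log(sum(exp(values)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def avg_log_likelihood(frames, gmm):
    """Average per-frame log-likelihood of the frames under the GMM."""
    return sum(log_sum_exp(component_log_terms(x, gmm)) for x in frames) / len(frames)

def map_adapt_means(frames, ubm, relevance=16.0):
    """Means-only MAP adaptation of a UBM (assumed variant; weights and variances are kept)."""
    k_count, dim = len(ubm["means"]), len(ubm["means"][0])
    counts = [0.0] * k_count                       # soft frame counts per component
    sums = [[0.0] * dim for _ in range(k_count)]   # posterior-weighted feature sums
    for x in frames:
        logs = component_log_terms(x, ubm)
        norm = log_sum_exp(logs)
        for k, lg in enumerate(logs):
            gamma = math.exp(lg - norm)            # responsibility of component k
            counts[k] += gamma
            for d in range(dim):
                sums[k][d] += gamma * x[d]
    # alpha * E_k + (1 - alpha) * mu_k with alpha = n_k / (n_k + r),
    # written as a single fraction so that n_k = 0 needs no special case.
    new_means = [
        [(sums[k][d] + relevance * ubm["means"][k][d]) / (counts[k] + relevance)
         for d in range(dim)]
        for k in range(k_count)
    ]
    return {"weights": ubm["weights"], "means": new_means, "vars": ubm["vars"]}

def supervector(gmm):
    """Stack the component means into one vector (a possible SVM input)."""
    return [value for mean in gmm["means"] for value in mean]

# Toy data, invented for illustration only.
ubm = {"weights": [0.5, 0.5], "means": [[0.0], [4.0]], "vars": [[1.0], [1.0]]}
enrol = [[0.8], [1.0], [1.2], [4.9], [5.0], [5.1]] * 10
speaker = map_adapt_means(enrol, ubm)
# Verification score: log-likelihood ratio of speaker model against the UBM.
llr = avg_log_likelihood(enrol, speaker) - avg_log_likelihood(enrol, ubm)
print(supervector(speaker), llr)
```

In this sketch the adapted means move from the UBM values towards the enrolment data, and the log-likelihood ratio plays the role of the raw verification score that score normalization would then post-process. A larger relevance factor keeps the speaker model closer to the UBM.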
This paper presents the results of experiments in SR for children’s speech and is organized as follows. Section 2 describes the OGI ‘Kid’s’ corpus of children’s speech, which

This content is AI-processed based on ArXiv data.
