Speaker Recognition for Children's Speech
📝 Abstract
This paper presents results on Speaker Recognition (SR) for children’s speech, using the OGI Kids corpus and GMM-UBM and GMM-SVM SR systems. Regions of the spectrum containing important speaker information for children are identified by conducting SR experiments over 21 frequency bands. As for adults, the spectrum can be split into four regions, with the first (containing primary vocal tract resonance information) and third (corresponding to high-frequency speech sounds) being most useful for SR. However, the frequencies at which these regions occur are from 11% to 38% higher for children. It is also noted that subband SR rates are lower for younger children. Finally, results are presented for SR experiments to identify a child in a class (30 children, similar age) and in a school (288 children, varying ages). Class performance depends on age, with accuracy varying from 90% for young children to 99% for older children. The identification rate achieved for a child in a school is 81%.
📄 Content
Speaker Recognition for Children’s Speech
Saeid Safavi, Maryam Najafian, Abualsoud Hanani, Martin Russell, Peter Jančovič, Michael Carey
School of Electronic, Electrical and Computer Engineering, University of Birmingham, Birmingham, B15 2TT, England
{sxs796, mxn978, m.j.russell, p.jancovic, m.carey}@bham.ac.uk, ahanani@birzeit.edu
Abstract
This paper presents results on Speaker Recognition (SR) for
children’s speech, using the OGI Kids corpus and GMM-UBM
and GMM-SVM SR systems. Regions of the spectrum
containing important speaker information for children are
identified by conducting SR experiments over 21 frequency
bands. As for adults, the spectrum can be split into four
regions, with the first (containing primary vocal tract
resonance information) and third (corresponding to
high-frequency speech sounds) being most useful for SR.
However, the frequencies at which these regions occur are
from 11% to 38% higher for children. It is also noted that
subband SR rates are lower for younger children. Finally, results
are presented for SR experiments to identify a child in a class
(30 children, similar age) and school (288 children, varying
ages). Class performance depends on age, with accuracy
varying from 90% for young children to 99% for older
children. The identification rate achieved for a child in a
school is 81%.
Index Terms: speaker verification, speaker identification,
child speech, Gaussian mixture model, support vector machine,
bandwidth
1. Introduction
As human interaction with computers becomes more
pervasive, and its applications become more private and
sensitive, the value of automatic Speaker Recognition (SR)
based on vocal characteristics increases.
The employment of SR technology for children could be
beneficial in several application areas, including child security
and protection, and education. For instance, social networking
sites are most popular with teenagers and young adults, with
almost half of children aged from 8 to 17 who use the internet
having set up their own profile on a social networking site [1].
An SR system that identifies a child based on his or her voice,
and confirms the identity of the individual with whom the
child is communicating, could be a valuable safeguard for a
child engaged in social networking. Other possible
applications are in education. For example, an interactive
educational tutor that could identify each child in a class could
automatically continue a previous lesson, adapt its content to
suit the child, and log the child’s responses appropriately
without the child needing to go through a formal login process.
Although automatic recognition of children’s speech has
been the subject of considerable research effort, there is little
published work on issues and algorithms related to automatic
verification of a child’s identity from his or her speech. For
example, we do not know how the increased inter- and
intra-speaker variability of children’s speech [4] affects SR
performance. Variability is highest for young children,
converging to adult values when children reach the age of 13.
Even for young children there is some evidence that the degree
of variability differs significantly between individuals [6].
It has been shown that the acoustic and linguistic
characteristics of children’s speech differ from those of
adults’ speech [3-5]. For example, children’s speech is characterized
by higher pitch, and perceptually important features such as
formants occur at higher frequencies [4]. Consequently, the
impact of bandwidth reduction on speech recognition accuracy
is greater for children’s speech than for adults [6, 7].
However, the significance of different frequency bands for
SR for children is not known, although corresponding
studies for adult SR have been reported [2].
The success of Gaussian Mixture Model - Universal
Background Model (GMM-UBM) and GMM-Support Vector
Machine (GMM-SVM) approaches to adult SR motivated us
to apply these techniques to our child SR task. The distribution
of acoustic feature vectors for a population of speakers is
typically captured using a UBM (a speaker-independent GMM
constructed using data from a variety of speakers and
background conditions) [8, 9]. Speaker-dependent GMMs are
then built by MAP adaptation of the UBM [10]. Alternatively,
discriminative approaches such as SVMs can be used, which
have been shown to obtain comparable, and in some cases
better, performance than GMM-based systems. The
combination of GMM supervectors, comprising the stacked
parameters of the GMM components, with SVMs has also
been successful [11]. SR systems usually employ score
normalization to cope with score variability and to simplify
decision threshold tuning.
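As an illustration of the pipeline described above, the following is a minimal sketch, not the authors' implementation. It assumes a diagonal-covariance UBM, mean-only MAP adaptation with a relevance factor of 16, a supervector built from the stacked adapted means, and zero normalization (z-norm) as one possible form of score normalization. None of these choices is specified in the text above, and all function and parameter names are hypothetical.

```python
import math
import statistics


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def posteriors(x, weights, means, variances):
    """Posterior probability of each UBM component given one frame x."""
    logs = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]


def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of the UBM towards one speaker's frames.

    The relevance factor (assumed value: 16) controls how much data a
    component needs before its mean moves away from the UBM mean.
    """
    n_comp, dim = len(means), len(means[0])
    counts = [0.0] * n_comp                      # soft frame counts
    sums = [[0.0] * dim for _ in range(n_comp)]  # posterior-weighted sums
    for x in frames:
        for c, p in enumerate(posteriors(x, weights, means, variances)):
            counts[c] += p
            for d in range(dim):
                sums[c][d] += p * x[d]
    adapted = []
    for c in range(n_comp):
        alpha = counts[c] / (counts[c] + relevance)
        # Components that saw no data keep the UBM mean.
        data_mean = [s / counts[c] for s in sums[c]] if counts[c] > 0 else means[c]
        adapted.append(
            [alpha * e + (1.0 - alpha) * m for e, m in zip(data_mean, means[c])]
        )
    return adapted


def supervector(means):
    """Stack the component means into one vector (assumed supervector form)."""
    return [value for mean in means for value in mean]


def z_norm(score, impostor_scores):
    """Zero normalization: scale a raw score by impostor score statistics.

    Assumes the impostor scores are not all identical (non-zero deviation).
    """
    mu = statistics.mean(impostor_scores)
    sigma = statistics.pstdev(impostor_scores)
    return (score - mu) / sigma
```

In a GMM-SVM system of this kind, the supervector of each adapted speaker model would serve as the input feature vector to the SVM, while in a GMM-UBM system the adapted model is scored directly against the UBM.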
This paper presents the results of experiments in SR for
children’s speech and is organized as follows. Section 2
describes the OGI Kids corpus of children’s speech, which