KU-ISPL SPEAKER RECOGNITION SYSTEMS
UNDER LANGUAGE MISMATCH CONDITION
FOR NIST 2016 SPEAKER RECOGNITION EVALUATION
Suwon Shon, Hanseok Ko School of Electrical Engineering, Korea University, South Korea swshon@ispl.korea.ac.kr, hsko@korea.ac.kr
ABSTRACT
Korea University Intelligent Signal Processing Lab. (KU-ISPL) developed a speaker recognition system for the SRE16 fixed training condition. The data for the evaluation trials were collected outside North America and are spoken in Tagalog and Cantonese, while the training data are spoken only in English; thus, the main issue for SRE16 is compensating for the discrepancy between the languages. Using the development dataset, which is spoken in Cebuano and Mandarin, we could prepare for the evaluation trials through preliminary experiments that compensate for the language-mismatch condition. Our team developed 4 different approaches to extract i-vectors and applied state-of-the-art techniques as the backend. To compensate for the language mismatch, we investigated unique methods such as unsupervised language clustering, inter-language variability compensation, and gender/language-dependent score normalization.
Index Terms— SRE16, i-vector, language mismatch
1. INTRODUCTION
This document describes the Korea University Intelligent Signal Processing Laboratory (KU-ISPL) speaker recognition system for the NIST 2016 speaker recognition evaluation (SRE16). Under the i-vector framework, new approaches have been introduced using Bottleneck Features (BNF) and Deep Neural Networks (DNN), whose performance improvements were validated successfully on ASR. In this study, we developed state-of-the-art i-vector systems to validate their performance under the language-mismatch condition using the SRE16 dataset. Based on prior studies on domain adaptation and compensation, Inter-Dataset Variability Compensation (IDVC) and unsupervised domain adaptation using interpolated PLDA are also applied.
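The core idea of IDVC is to estimate the subspace spanned by the per-dataset means of i-vectors and project it out, removing dataset- (here, language-) dependent shifts. The following is a minimal sketch of that mean-subspace variant, not the paper's exact implementation; the function name and toy data are illustrative assumptions.

```python
import numpy as np

def idvc_projection(ivectors_by_dataset):
    """Estimate an IDVC projection: remove the subspace spanned by
    the (centered) per-dataset mean i-vectors.

    ivectors_by_dataset: list of (n_i, d) arrays, one per dataset."""
    # Per-dataset mean i-vectors, centered around their global mean.
    means = np.stack([x.mean(axis=0) for x in ivectors_by_dataset])
    means -= means.mean(axis=0)
    # Orthonormal basis of the inter-dataset subspace via SVD.
    _, s, vt = np.linalg.svd(means, full_matrices=False)
    basis = vt[s > 1e-10]  # drop numerically degenerate directions
    # Projection removing the inter-dataset directions: I - W^T W.
    return np.eye(means.shape[1]) - basis.T @ basis

# Toy usage: two "datasets" offset along one coordinate direction.
rng = np.random.default_rng(0)
d = 5
shift = np.zeros(d)
shift[0] = 3.0
a = rng.normal(size=(100, d))
b = rng.normal(size=(100, d)) + shift
P = idvc_projection([a, b])
# After projection, the dataset-mean offset is removed.
gap = (a @ P).mean(axis=0) - (b @ P).mean(axis=0)
```

With only two datasets, the centered means span exactly the direction of their difference, so the projected means coincide; with more datasets the removed subspace grows accordingly.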
Building on these prior works, we propose additional techniques for compensating the language-mismatch condition to obtain robust performance on the SRE16 dataset. For the official evaluation, we submitted a total of 3 systems, including 1 primary system and 2 contrastive systems, in the fixed training data condition. We carefully followed the SRE16 rules and requirements during the training and test processes.
In the following, we introduce the SRE16 dataset in Section 2. In Sections 3 and 4, the system components for developing state-of-the-art i-vector extraction are described.
2. DATASET PREPARATION FOR FIXED TRAINING CONDITION
For the fixed training condition, we use the Fisher English, SRE 2004–2010, and Switchboard-2 (Phases 1–3, Cellular 1–2) datasets as the training set. The language of every dataset in the training set is English. The data for the SRE16 evaluation trials were collected from speakers located outside North America who spoke Tagalog and Cantonese (referred to as the major languages). Before the evaluation dataset became available, a development dataset mirroring the evaluation conditions was provided to prepare for the language-mismatch condition on the evaluation set. The development dataset was collected from speakers located outside North America who spoke Cebuano and Mandarin (referred to as the minor languages). Additionally, unlabeled minor- and major-language datasets were also given to participants as a development set. The development set is free to use for any purpose, and detailed statistics of the evaluation and development datasets are shown in Table 1.
Table 1. Statistics of the development and evaluation datasets. * denotes information from the SRE16 evaluation plan documents.

Dataset  Category    Language  Labels (metadata)  Utt.  Spk.  Calls
Dev.     Enrollment  Minor     Available           120    20     60
Dev.     Test        Minor     Available          1207    20    140
Dev.     Unlabeled   Minor     X                   200   20*   200*
Dev.     Unlabeled   Major     X                  2272     X      X
Eval.    Enrollment  Major     X                  1202   802    602
Eval.    Test        Major     X                  9294     X   1408
3. SYSTEM COMPONENT DESCRIPTION
3.1. Acoustic features
For training the speaker recognition systems in this paper, Mel-Frequency Cepstral Coefficients (MFCC) are used to generate 60-dimensional acoustic features. They consist of 20 cepstral coefficients, including the log-energy C0, appended with their delta and acceleration coefficients. For training the DNN-based acoustic model, which is inspired by the Automatic Speech Recognition (ASR) field, a different configuration was adopted that generates 40 cepstral coefficients without the energy component, for higher-resolution acoustic features (ASR-MFCC). For feature normalization, Cepstral Mean Normalization (CMN) is applied with a 3-second sliding window.
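Sliding-window CMN subtracts, from each frame, the mean feature vector computed over a window centered on that frame. A minimal sketch follows; the 300-frame window is an assumption corresponding to 3 s at a typical 10 ms frame shift, and the function name is ours.

```python
import numpy as np

def sliding_cmn(feats, win=300):
    """Sliding-window cepstral mean normalization.

    feats: (T, D) feature matrix (frames x coefficients).
    win:   window length in frames; 300 frames ~ 3 s at a
           10 ms frame shift (an assumed configuration)."""
    T = feats.shape[0]
    half = win // 2
    out = np.empty_like(feats, dtype=float)
    for t in range(T):
        # Window is truncated at the utterance boundaries.
        lo = max(0, t - half)
        hi = min(T, t + half + 1)
        out[t] = feats[t] - feats[lo:hi].mean(axis=0)
    return out
```

A constant feature track normalizes to zero, while variation faster than the window length is preserved, which is what makes short-term CMN robust to slowly varying channel effects.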
After extracting acoustic features, a Voice Activity Detection (VAD) algorithm was adopted to remove silence and low-energy segments from the speech dataset. A simple energy-based VAD was used with a log-mean scaled threshold: using the log-energy (C0) component of the MFCC, the mean log-energy of each segment is calculated and scaled to half its value to serve as the threshold.
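The energy-based VAD described above can be sketched as follows, assuming the threshold is half the segment's mean log-energy; the function name and the 0.5 scale parameter exposed here are illustrative.

```python
import numpy as np

def energy_vad(log_energy, scale=0.5):
    """Simple energy-based VAD with a log-mean scaled threshold.

    log_energy: (T,) array of per-frame log-energies (MFCC C0).
    Returns a boolean mask; True marks frames kept as speech."""
    threshold = scale * log_energy.mean()
    return log_energy > threshold

# Toy usage: high-energy frames survive, low-energy frames are dropped.
mask = energy_vad(np.array([10.0, 10.0, 1.0, 1.0, 10.0]))
```

In practice such frame-level masks are often smoothed (e.g. with a median filter or hangover scheme) before use, though the source text describes only the thresholding step.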