KU-ISPL Speaker Recognition Systems under Language mismatch condition for NIST 2016 Speaker Recognition Evaluation

Reading time: 5 minutes

📝 Abstract

Korea University Intelligent Signal Processing Lab. (KU-ISPL) developed a speaker recognition system for the SRE16 fixed training condition. Data for the evaluation trials were collected outside North America and spoken in Tagalog and Cantonese, while the training data are spoken only in English. Thus, the main issue for SRE16 is compensating for the discrepancy between the languages. Using the development dataset, spoken in Cebuano and Mandarin, we could prepare for the evaluation trials through preliminary experiments that compensate for the language-mismatched condition. Our team developed four different approaches to extracting i-vectors and applied state-of-the-art techniques as the backend. To compensate for language mismatch, we investigated unique methods such as unsupervised language clustering, inter-language variability compensation, and gender/language-dependent score normalization.


📄 Content

KU-ISPL SPEAKER RECOGNITION SYSTEMS
UNDER LANGUAGE MISMATCH CONDITION FOR NIST 2016 SPEAKER RECOGNITION EVALUATION

Suwon Shon, Hanseok Ko
School of Electrical Engineering, Korea University, South Korea
swshon@ispl.korea.ac.kr, hsko@korea.ac.kr

ABSTRACT

Korea University – Intelligent Signal Processing Lab. (KU-ISPL) developed a speaker recognition system for the SRE16 fixed training condition. Data for the evaluation trials were collected outside North America and spoken in Tagalog and Cantonese, while the training data are spoken only in English. Thus, the main issue for SRE16 is compensating for the discrepancy between the languages. Using the development dataset, spoken in Cebuano and Mandarin, we could prepare for the evaluation trials through preliminary experiments that compensate for the language-mismatched condition. Our team developed four different approaches to extracting i-vectors and applied state-of-the-art techniques as the backend. To compensate for language mismatch, we investigated unique methods such as unsupervised language clustering, inter-language variability compensation, and gender/language-dependent score normalization.

Index Terms— SRE16, i-vector, language mismatch

  1. INTRODUCTION

This document describes the Korea University – Intelligent Signal Processing Laboratory (KU-ISPL) speaker recognition system for the NIST 2016 speaker recognition evaluation (SRE16). Under the i-vector framework, new approaches are introduced using Bottleneck Features (BNF) and Deep Neural Networks (DNN), whose performance improvements have been successfully validated in ASR. In this study, we developed state-of-the-art i-vector systems to validate their performance under the language mismatch condition using the SRE16 dataset. Based on prior studies on domain adaptation and compensation, Inter Dataset Variability Compensation (IDVC) and unsupervised domain adaptation using interpolated PLDA are also applied.
Building on these prior works, we propose additional techniques for compensating the language mismatch condition to obtain robust performance on the SRE16 dataset. For the official evaluation, we submitted a total of 3 systems, 1 primary and 2 contrastive, under the fixed training data condition. We carefully followed the SRE16 rules and requirements during the training and test processes. In the following, Section 2 introduces the SRE16 dataset; Sections 3 and 4 describe the system components for developing state-of-the-art i-vector extraction.
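As a rough illustration of the IDVC idea mentioned above, one can estimate the subspace spanned by per-dataset mean i-vectors and project it out. The following NumPy sketch is an assumption-laden toy (the function name, the two synthetic "datasets", and the choice of one removed direction are illustrative, not taken from the paper):

```python
import numpy as np

def idvc_projection(ivectors_by_dataset, n_directions=1):
    """Estimate an IDVC-style projection that removes inter-dataset shifts.

    ivectors_by_dataset: list of (n_i, d) arrays, one per source dataset.
    Returns a (d, d) projection matrix P; apply as x @ P.
    """
    # Mean i-vector of each dataset captures its "dataset shift".
    means = np.stack([x.mean(axis=0) for x in ivectors_by_dataset])
    means -= means.mean(axis=0)            # center the dataset means
    # Principal directions of the centered means span the unwanted subspace.
    _, _, vt = np.linalg.svd(means, full_matrices=False)
    w = vt[:n_directions]                  # (n_directions, d) basis
    # Project onto the orthogonal complement of that subspace.
    return np.eye(means.shape[1]) - w.T @ w

# Toy usage: two "datasets" offset along the first axis.
rng = np.random.default_rng(0)
d = 4
a = rng.normal(size=(50, d)); a[:, 0] += 5.0
b = rng.normal(size=(50, d)); b[:, 0] -= 5.0
P = idvc_projection([a, b], n_directions=1)
shift = (a @ P).mean(axis=0) - (b @ P).mean(axis=0)  # ~zero after IDVC
```

With only two datasets, their mean difference lies exactly in the removed direction, so the residual shift vanishes up to floating-point error.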

  2. DATASET PREPARATION FOR FIXED TRAINING CONDITION

For the fixed training condition, we use the Fisher English, SRE 04–10, and SWB-2 (Phases 1–3, Cellular 1–2) datasets as the training set. The language of every dataset in the training set is English. The SRE16 evaluation trials are collected from speakers located outside North America who spoke Tagalog and Cantonese (referred to as the major languages). Before the evaluation dataset became available, a development dataset that mirrors the evaluation conditions was provided to prepare for the language mismatch condition on the evaluation set. The development dataset is collected from speakers located outside North America who spoke Cebuano and Mandarin (referred to as the minor languages). Additionally, unlabeled minor- and major-language datasets are given to participants as part of the development set. The development set is free to use for any purpose, and detailed statistics of the evaluation and development datasets are shown in Table 1.

Table 1. Statistics of the development and evaluation datasets. * indicates information from the SRE16 evaluation plan documents.

| Data set | Category   | Language | Labels (metadata) | Utt. | Spk. | Calls |
|----------|------------|----------|-------------------|------|------|-------|
| Dev.     | Enrollment | Minor    | Available         | 120  | 20   | 60    |
| Dev.     | Test       | Minor    | Available         | 1207 | 20   | 140   |
| Dev.     | Unlabeled  | Minor    | X                 | 200  | 20*  | 200*  |
| Dev.     | Unlabeled  | Major    | X                 | 2272 | X    | X     |
| Eval     | Enrollment | Major    | X                 | 1202 | 802  | 602   |
| Eval     | Test       | Major    | X                 | 9294 | X    | 1408  |

  3. SYSTEM COMPONENT DESCRIPTION

3.1. Acoustic features

For training the speaker recognition systems in this paper, Mel-Frequency Cepstral Coefficients (MFCC) are used to generate 60-dimensional acoustic features. They consist of 20 cepstral coefficients, including the log-energy C0, appended with their delta and acceleration coefficients. For training the DNN-based acoustic model, which is inspired by the Automatic Speech Recognition (ASR) area, a different configuration was adopted to generate 40 cepstral coefficients without the energy component, for higher-resolution acoustic features (ASR-MFCC). For feature normalization, Cepstral Mean Normalization is applied with a 3-second sliding window.
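The sliding-window CMN step above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation; the 10 ms frame shift (so 300 frames ≈ 3 s) is an assumption:

```python
import numpy as np

def sliding_cmn(features, window_frames=300):
    """Sliding-window cepstral mean normalization.

    features: (n_frames, n_dims) array, e.g. 60-dim MFCC+delta+accel.
    window_frames: window length in frames; 300 frames ~= 3 s at an
    assumed 10 ms frame shift, matching the paper's 3-second window.
    """
    n = features.shape[0]
    half = window_frames // 2
    out = np.empty_like(features, dtype=float)
    for t in range(n):
        lo = max(0, t - half)          # window is clipped at utterance edges
        hi = min(n, t + half + 1)
        out[t] = features[t] - features[lo:hi].mean(axis=0)
    return out

# A constant feature track normalizes to all zeros.
flat = sliding_cmn(np.full((50, 3), 2.0))
```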
After extracting acoustic features, a Voice Activity Detection (VAD) algorithm was adopted to remove silence and low-energy segments from the speech dataset. A simple energy-based VAD was used with a log-mean-scaled threshold. Using the log-energy (C0) component of the MFCC, the mean log-energy of each segment can be calculated; it is scaled to half its value and t
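An energy-based VAD of this kind can be sketched as follows. Since the source text is truncated, the exact rule is not recoverable; the half-scaled-mean threshold used here is an assumption for illustration:

```python
import numpy as np

def energy_vad(log_energy, scale=0.5):
    """Keep frames whose log-energy (MFCC C0) exceeds a fraction of the
    segment's mean log-energy.

    log_energy: (n_frames,) array of per-frame C0 values.
    scale: threshold factor; 0.5 assumes the "scaled to half" rule.
    Returns a boolean speech/non-speech mask.
    """
    threshold = scale * np.mean(log_energy)
    return log_energy > threshold

# Toy segment: high-energy speech frames survive, low-energy ones drop.
mask = energy_vad(np.array([10.0, 9.0, 1.0, 0.5, 8.0]))
```

A real system would typically smooth this mask (e.g. with a minimum-duration constraint) rather than gate individual frames.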

This content is AI-processed based on ArXiv data.
