Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System

Weicheng Cai², Jinkun Chen² and Ming Li¹
¹Data Science Research Center, Duke Kunshan University, Kunshan, China
²School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, China
ming.li369@dukekunshan.edu.cn

Abstract

In this paper, we explore the encoding/pooling layer and loss function in the end-to-end speaker and language recognition system. First, a unified and interpretable end-to-end system for both speaker and language recognition is developed. It accepts variable-length input and produces an utterance-level result. In the end-to-end system, the encoding layer plays the role of aggregating the variable-length input sequence into an utterance-level representation. Besides the basic temporal average pooling, we introduce a self-attentive pooling layer and a learnable dictionary encoding layer to obtain the utterance-level representation. In terms of the loss function for open-set speaker verification, center loss and angular softmax loss are introduced into the end-to-end system to obtain more discriminative speaker embeddings. Experimental results on the Voxceleb and NIST LRE 07 datasets show that the performance of the end-to-end learning system can be significantly improved by the proposed encoding layers and loss functions.

1. Introduction

Language recognition (LR), text-independent speaker recognition (SR), and many other paralinguistic speech attribute recognition tasks can be defined as utterance-level "sequence-to-one" learning problems, in contrast to automatic speech recognition, which is a "sequence-to-sequence" tagging task. They are problems in which we try to retrieve information about an entire utterance rather than specific word content [1]. Moreover, there is no constraint on the lexicon, so the training utterances and testing segments may have completely different contents [2].
The goal, therefore, boils down to finding a robust and time-invariant utterance-level vector representation describing the distribution of the given variable-length input data sequence.

In recent decades, the classical GMM i-vector approach and its variants have dominated multiple paralinguistic speech attribute recognition fields for their superior performance, simplicity, and efficiency [3, 4]. As shown in Fig. 1, the conventional processing pipeline contains four main steps:

• Local feature descriptors, which manifest as a variable-length feature sequence. These include hand-crafted acoustic-level features, such as log mel-filterbank energies (Fbank), mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP), and shifted delta coefficients (SDC) features [2, 5], as well as automatically learned phoneme-discriminant features from deep neural networks (DNN), such as bottleneck features [6, 7, 8], phoneme posterior probability (PPP) features [9], and tandem features [10, 11].

• Dictionary, which contains several temporally orderless center components (or units, words, clusters, etc.). Examples include vector quantization (VQ) codebooks learned by K-means [12], a universal background model (UBM) learned as a Gaussian mixture model (GMM) [13, 14], or a supervised phonetically-aware acoustic model learned by a DNN [10, 15].

• Vector encoding. This procedure aggregates the variable-length feature sequence into an utterance-level vector representation, based on the statistics accumulated over the dictionary mentioned above. Typical examples are the GMM supervector/i-vector [1, 3] or the recently proposed DNN i-vector [15, 16].

• Decision generator, which includes logistic regression (LogReg), support vector machines (SVM), and neural networks for closed-set identification, and cosine similarity or probabilistic linear discriminant analysis (PLDA) [17, 18] for open-set verification.
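To make the dictionary and vector-encoding steps concrete, here is a minimal NumPy sketch of GMM supervector extraction via relevance-MAP adaptation of the UBM means, in the spirit of [1, 14]. The relevance factor r and the toy dimensions are assumptions for illustration only, not the configuration used in the experiments below.

```python
import numpy as np

def gmm_supervector(frames, means, variances, weights, r=16.0):
    """Map a variable-length frame sequence (L, D) to a fixed supervector (C*D,).

    means: (C, D) UBM component means; variances: (C, D) diagonal covariances;
    weights: (C,) mixture weights; r: MAP relevance factor (an assumed default).
    """
    # Log-likelihood of each frame under each diagonal-covariance component.
    diff = frames[None, :, :] - means[:, None, :]                # (C, L, D)
    log_p = (-0.5 * (diff ** 2 / variances[:, None, :]).sum(-1)
             - 0.5 * np.log(2 * np.pi * variances).sum(-1)[:, None]
             + np.log(weights)[:, None])                         # (C, L)
    # Frame posteriors (responsibilities) via a numerically stable softmax.
    log_p -= log_p.max(axis=0, keepdims=True)
    post = np.exp(log_p) / np.exp(log_p).sum(axis=0, keepdims=True)
    # Baum-Welch statistics: zeroth order n_c and first order f_c.
    n = post.sum(axis=1)                                         # (C,)
    f = post @ frames                                            # (C, D)
    # Relevance-MAP adaptation of the means toward the utterance statistics.
    alpha = (n / (n + r))[:, None]
    adapted = alpha * (f / np.maximum(n, 1e-8)[:, None]) + (1 - alpha) * means
    return adapted.reshape(-1)                                   # (C*D,)
```

Note how utterances of any length L map to the same C*D-dimensional vector, which is what the encoding step must guarantee.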
The GMM i-vector based approaches comprise a series of hand-crafted or ad-hoc algorithmic components, and they show strong generalization ability and robustness when data and computational resources are limited. In recent years, with the merit of large labeled datasets, enormous computation capability, and effective network architectures, emerging progress towards end-to-end learning opens up a new area for exploration [19, 20, 21, 22, 23]. In our previous works [24, 25], we proposed a learnable dictionary encoding (LDE) layer, which connects the conventional GMM supervector procedure and state-of-the-art end-to-end neural networks together. In the end-to-end learning scheme, a general encoding layer is employed on top of the front-end convolutional neural network (CNN), so that it can encode the variable-length input sequence into an utterance-level representation automatically. We have shown its success for the closed-set LR task.

However, when we move forward to the SR task, the situation becomes much more complicated. Typically, SR can be categorized into speaker identification and speaker verification. The former classifies a speaker to a specific identity, while the latter determines whether a pair of utterances belongs to the same person. In terms of the testing protocol, SR can be evaluated under closed-set or open-set settings, as illustrated in Fig. 2. For the closed-set protocol, all testing identities are enrolled in the training set, so it is natural to classify a testing utterance to a given identity. Therefore, closed-set language or speaker identification can be well addressed as a classification problem. For the open-set protocol, speaker identities in the testing set are usually disjoint from the ones in the training set, which makes speaker verification more challenging yet closer to practice.
Since it is impossible to classify testing utterances to known identities in the training set, we need to map speakers to a discriminative feature space.

Figure 1: Four main steps in the conventional processing pipeline.

Figure 2: Comparison of the closed-set identification and open-set verification problems. Closed-set identification is equivalent to a classification task, while open-set verification can be considered a metric learning task.

In this scenario, open-set speaker verification is essentially a metric learning problem, where the key is to learn discriminative large-margin features.

Considering the aforementioned challenges, we generalize the learning scheme for closed-set LR in [24] and build a unified end-to-end system for both LR and SR. The whole pipeline contains five key modules: input data sequence, frame-level feature extractor, encoding layer, loss function, and similarity metric. In this paper, we focus on investigating how to enhance the system performance by exploring different kinds of encoding layers and loss functions.

2. End-to-End System Overview

The speech signal naturally has variable length, and we usually do not know exactly how long the testing speech segment will be.
Therefore, a flexible processing method should have the ability to accept speech segments of arbitrary duration. Motivated by [21, 22, 24], the whole end-to-end framework in this paper is shown in Fig. 3. It accepts variable-length input and produces an utterance-level result. The additional similarity metric module is specifically designated for the open-set verification task. Given an input feature sequence such as log mel-filterbank energies (Fbank), we employ a deep convolutional neural network (CNN) as our frame-level feature extractor. It can learn high-level abstract local patterns from the raw input automatically. The frame-level representation after the front-end convolutional layers is still in temporal order. The remaining issue is to aggregate it over the entire sequence. In this way, the encoding layer plays the role of extracting a fixed-dimensional utterance-level representation from a variable-length input sequence. The utterance-level representation is further processed through a fully-connected (FC) layer and finally connected to an output layer. Each unit in the output layer represents a target speaker/language label. All the components in the pipeline are jointly learned in an end-to-end manner with a unified loss function.

3. Encoding layer

3.1. Temporal average pooling layer

Recently, in both [21, 22], a similar temporal average pooling (TAP) layer is adopted in the neural network architecture. As shown in Fig. 5, the TAP layer is inherently designated in the end-to-end network, and it pools the front-end learned features equally over time.

3.2. Self-attentive pooling layer

The TAP layer pools the CNN-extracted features equally over time.
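Concretely, TAP is just a mean over the time axis; a minimal sketch (the D × L layout follows Fig. 5):

```python
import numpy as np

def temporal_average_pooling(features):
    """features: (D, L) array of frame-level feature maps.
    Returns the (D,) utterance-level representation: a plain mean over time,
    so every frame contributes with equal weight 1/L."""
    return np.asarray(features).mean(axis=1)
```

Because the mean is taken over the time axis, the output dimension D is independent of the input length L.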
However, not all frames contribute equally to the utterance-level representation. We introduce a self-attentive pooling (SAP) layer to pay attention to the frames that are important to the classification, and to aggregate those informative frames into the utterance-level representation.

In [26], an attention-based recurrent neural network (RNN) is introduced to obtain an utterance-level representation for the closed-set LR task. However, the work in [26] relies on a non-trivial pre-training procedure to obtain the language category embedding, and the authors only report results on the 3 s short-duration task. Different from [26], the attention mechanism in our network architecture is self-contained, with no need for extra guiding source information.

We implement the SAP layer similar to [27, 28, 29]. That is, we first feed the utterance-level feature maps {x_1, x_2, ..., x_L} into a multi-layer perceptron (MLP) to get {h_1, h_2, ..., h_L} as a hidden representation. In this paper, we simply adopt a one-layer perceptron,

    h_t = \tanh(W x_t + b)    (1)

Then we measure the importance of each frame as the similarity of h_t with a learnable context vector u, and obtain a normalized importance weight w_t through a softmax function,

    w_t = \frac{\exp(h_t^\top u)}{\sum_{t=1}^{L} \exp(h_t^\top u)}    (2)

Figure 3: End-to-end framework for both LR and SR. It accepts an input data sequence with variable length and produces an utterance-level result. The whole pipeline contains five key modules: input data sequence, frame-level feature extractor, encoding layer, loss function, and similarity metric. The additional similarity metric module is specifically designated for the open-set verification task.
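The SAP computation of Eqs. (1)-(2), followed by the weighted sum of Eq. (3), can be sketched in NumPy as follows; the parameters W, b, and u stand in for quantities that would be jointly learned with the network:

```python
import numpy as np

def self_attentive_pooling(x, W, b, u):
    """x: (L, D) frame-level features; W: (D, D), b: (D,), u: (D,) parameters.
    Implements h_t = tanh(W x_t + b), w_t = softmax(h_t^T u), e = sum_t w_t x_t."""
    h = np.tanh(x @ W.T + b)                     # hidden representation, Eq. (1)
    scores = h @ u                               # similarity with context vector u
    scores -= scores.max()                       # numerical stability
    w = np.exp(scores) / np.exp(scores).sum()    # normalized weights, Eq. (2)
    return w @ x                                 # weighted sum over frames, Eq. (3)
```

With u = 0 every frame receives weight 1/L and SAP degenerates to TAP, which makes the relation between the two layers explicit.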
The context vector u can be seen as a high-level representation of a fixed query, "what is the informative frame over the whole utterance" [27]. It is randomly initialized and jointly learned during the training process.

After that, the utterance-level representation e can be generated as a weighted sum of the frame-level CNN feature maps based on the learned weights,

    e = \sum_{t=1}^{L} w_t x_t    (3)

3.3. Learnable dictionary encoding layer

In a conventional speaker verification system, we always rely on a dictionary learning procedure such as K-means/GMM/DNN to accumulate statistics. Inspired by this, we introduce a novel LDE layer to accumulate statistics on more detailed units. It combines the dictionary learning and vector encoding steps into a single layer for end-to-end learning. As demonstrated in Fig. 5, given an input temporally ordered feature sequence of size D × L (where D denotes the feature dimension and L denotes the temporal duration), the LDE layer aggregates it over time. More specifically, it transforms the sequence into an utterance-level, temporally orderless D × C vector representation, which is independent of the length L. The LDE layer imitates the mechanism of the GMM supervector but is learned directly from the loss function.

The LDE layer is a directed acyclic graph and all the components are differentiable w.r.t. the input X and the learnable parameters. Therefore, the LDE layer can be trained in an end-to-end manner by standard stochastic gradient descent with back-propagation. Fig. 4 illustrates the forward diagram of the LDE layer. Here, we introduce two groups of learnable parameters. One is the dictionary component centers, denoted µ = {µ_1, µ_2, ..., µ_C}. The other is the assigned weights, denoted w.

Consider assigning weights from the features to the dictionary components.
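As a preview of the computation defined in Eqs. (4)-(6) below, a minimal NumPy sketch of the full LDE forward pass (soft assignment, residuals, aggregation) might look like this; the centers µ and smoothing factors s stand in for parameters that would be learned end-to-end:

```python
import numpy as np

def lde_layer(x, centers, s):
    """x: (L, D) frame-level features; centers: (C, D) dictionary centers mu;
    s: (C,) learnable smoothing factors. Returns E: (C, D), the aggregated
    residual vectors, using the simplified 1/L aggregation of Eq. (6)."""
    L = x.shape[0]
    r = x[:, None, :] - centers[None, :, :]          # residuals r_tc, (L, C, D)
    dist = (r ** 2).sum(-1)                          # squared norms, (L, C)
    logits = -s[None, :] * dist
    logits -= logits.max(axis=1, keepdims=True)      # stable softmax over C
    w = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # Eq. (4)
    return (w[:, :, None] * r).sum(axis=0) / L       # aggregation, Eq. (6)
```

The output E is C × D regardless of L, mirroring the length-independence of the GMM supervector.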
Similar to the soft-weight assignment in a GMM, the features are independently assigned to each dictionary component, and the non-negative assigning weight is given by a softmax function,

    w_{tc} = \frac{\exp(-s_c \| r_{tc} \|^2)}{\sum_{m=1}^{C} \exp(-s_m \| r_{tm} \|^2)}    (4)

where the smoothing factor s_c for each dictionary center µ_c is learnable.

Figure 4: The forward diagram within the LDE layer.

Given a set of L frames of features {x_1, x_2, ..., x_L} and a learned dictionary center µ = {µ_1, µ_2, ..., µ_C}, each frame of features x_t can be assigned a weight w_{tc} with respect to each component µ_c, and the corresponding residual vector is denoted by r_{tc} = x_t − µ_c, where t = 1, 2, ..., L and c = 1, 2, ..., C. Given the assignments and the residual vectors, similar to the conventional GMM supervector, the residual encoding model applies an aggregation operation for every dictionary component center µ_c:

    e_c = \sum_{t=1}^{L} e_{tc} = \frac{\sum_{t=1}^{L} (w_{tc} \cdot r_{tc})}{\sum_{t=1}^{L} w_{tc}}    (5)

In order to facilitate the derivation, we simplify it as

    e_c = \frac{\sum_{t=1}^{L} (w_{tc} \cdot r_{tc})}{L}    (6)

The LDE layer concatenates the aggregated residual vectors with assigned weights. The resulting encoder outputs a fixed-dimensional representation E = {e_1, e_2, ..., e_C}.

Figure 5: Comparison of different encoding procedures.

4. Loss function

4.1. Loss function for closed-set identification

In the conventional LR or SR problem, the processing stream is explicitly separated into a front-end and a back-end. The i-vector extracting front-end is comprised of multiple unsupervised generative models.
They are optimized through the Expectation-Maximization (EM) algorithm under a negative complete-data log-likelihood loss. Since they are all generative models, we refer to their loss functions as a kind of generative negative log-likelihood (GNLL) loss for simplicity. Once the front-end model is trained and the i-vector is extracted, a back-end LogReg or SVM is commonly adopted for classification. Their loss function is the softmax or hinge loss.

As illustrated in Fig. 6, for an end-to-end closed-set identification system, the front-end feature extractor and back-end classifier can be jointly learned. In this way, the whole identification system can be optimized with a unified softmax loss:

    \ell_s = -\frac{1}{M} \sum_{i=1}^{M} \log \frac{e^{W_{y_i}^\top f(x_i) + b_{y_i}}}{\sum_{j=1}^{C} e^{W_j^\top f(x_i) + b_j}}    (7)

where M is the training batch size, x_i is the i-th input data sequence in the batch, f(x_i) is the corresponding output of the penultimate layer of the end-to-end neural network, y_i is the corresponding target label, and W and b are the weights and bias of the last layer of the network, which acts as a classifier.

4.2. Loss function for open-set verification

Once the front-end model is trained and the i-vector is extracted, PLDA is commonly adopted in the state-of-the-art open-set speaker verification system. PLDA is a Bayesian generative model, so its loss function is still a GNLL.

We believe that PLDA is not necessary, and a completely end-to-end system should have the ability to learn this kind of open-set problem with a unified loss function. However, for the open-set speaker verification task, the learned feature embedding needs to be not only separable but also discriminative. Since it is impractical to pre-collect all possible testing identities for training, the label prediction goal and the corresponding basic softmax loss are not always applicable. Therefore, as illustrated in Fig. 6, a unified discriminative loss function is needed to achieve better generalization than in closed-set identification.

Figure 6: Conventional explicitly separated front-end and back-end losses are unified into a single end-to-end loss.

In [21, 22], pairwise losses such as the contrastive loss [30, 31] or triplet loss [32] are adopted for open-set speaker verification. They all explicitly treat the open-set speaker verification problem as a metric learning problem. However, a neural network trained with a pairwise loss requires a carefully designed pair/triplet mining procedure. This procedure is non-trivial, both time-consuming and performance-sensitive [33]. In this paper, we focus on the general classification network. This means the number of units in the output layer equals the number of speakers in the training set. Here we introduce two discriminative losses first proposed in the computer vision community.

4.2.1. Center loss

The basic softmax loss encourages only the separability of features. In [34], the authors propose a center loss that simultaneously learns a center for the deep features of each class and penalizes the distances between the deep features and their corresponding class centers. The learning goal is to minimize the within-class variation while keeping the features of different classes separable. The joint supervision of softmax loss and center loss is adopted for discriminative feature learning:

    \ell = \ell_s + \lambda \ell_C = -\frac{1}{M} \sum_{i=1}^{M} \log \frac{e^{W_{y_i}^\top f(x_i) + b_{y_i}}}{\sum_{j=1}^{C} e^{W_j^\top f(x_i) + b_j}} + \frac{\lambda}{2} \sum_{i=1}^{M} \| f(x_i) - c_{y_i} \|_2^2    (8)

Here c_{y_i} ∈ R^d denotes the y_i-th class center of the deep features.
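A minimal NumPy sketch of the joint objective in Eq. (8); W and b stand in for the classifier parameters, and centers holds the class centers c_y. In [34] the centers themselves are updated with a separate learning rate during training, which this forward-only sketch omits:

```python
import numpy as np

def softmax_center_loss(feats, labels, W, b, centers, lam=0.001):
    """feats: (M, d) embeddings f(x_i); labels: (M,) target classes y_i;
    W: (C, d), b: (C,) classifier; centers: (C, d) class centers c_y;
    lam: balance weight lambda. Returns the scalar joint loss of Eq. (8)."""
    logits = feats @ W.T + b
    logits -= logits.max(axis=1, keepdims=True)               # stable log-softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels].mean()    # softmax loss l_s
    center = 0.5 * ((feats - centers[labels]) ** 2).sum(axis=1).sum()  # l_C
    return ce + lam * center
```

Setting lam=0 recovers the plain softmax loss, matching the special case noted below.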
The formulation effectively characterizes the intra-class variation. A scalar λ is used for balancing the two loss terms. The conventional softmax loss can be considered a special case of this joint supervision if λ is set to 0. With a proper λ, the discriminative power of the deep features can be significantly enhanced [34].

4.2.2. Angular softmax loss

In [33], the authors propose a natural way to learn an angular margin. The angular softmax (A-Softmax) loss is defined as

    \ell = -\frac{1}{M} \sum_{i=1}^{M} \log \frac{e^{\|f(x_i)\| \phi(\theta_{y_i,i})}}{e^{\|f(x_i)\| \phi(\theta_{y_i,i})} + \sum_{j \neq y_i} e^{\|f(x_i)\| \cos(\theta_{j,i})}}    (9)

where \phi(\theta_{y_i,i}) = (-1)^k \cos(m \theta_{y_i,i}) - 2k, \theta_{y_i,i} \in [\frac{k\pi}{m}, \frac{(k+1)\pi}{m}], and k ∈ [0, m − 1]. Here m ≥ 1 is an integer that controls the size of the angular margin; when m = 1, it becomes the modified softmax loss.

The A-Softmax loss has a clear geometric interpretation. Supervised by the A-Softmax loss, the learned features construct a discriminative angular distance metric that is equivalent to the geodesic distance on a hypersphere manifold, which intrinsically matches the prior that speakers also lie on a manifold. The A-Softmax loss imposes stronger requirements for correct classification when m ≥ 2, which generates an angular classification margin between the learned features of different classes [33].

5. Experiments

5.1. Data description

5.1.1. Voxceleb

Voxceleb is a large-scale text-independent SR dataset collected "in the wild", which contains over 100,000 utterances from 1251 celebrities. It can be used for both speaker identification and verification [35]. We pool the official training and validation splits together as our development dataset.

For the speaker verification task, there are in total 1211 celebrities in the development dataset. The testing dataset contains 4715 utterances from the remaining 40 celebrities. There are in total 37720 trial pairs, including 18860 true trial pairs.
Two key performance metrics, C_det [36] and EER, are used to evaluate the system performance for the verification task, as shown in Table 2. For the speaker identification task, there are in total 1251 celebrities in the development dataset. The testing dataset contains 8251 utterances from these 1251 celebrities. We report top-1 and top-5 accuracies in Table 3.

5.1.2. NIST LRE07

The training corpus includes the Callfriend datasets, the LRE 2003, LRE 2005, and SRE 2008 datasets, and the development data for LRE07, about 37000 utterances in total. The task of interest is closed-set language detection. There are in total 14 target languages in the testing corpus, which includes 7530 utterances split among three nominal durations: 30, 10, and 3 seconds. Two key performance metrics, the average detection cost C_avg [37] and the equal error rate (EER), are used to evaluate system performance, as shown in Table 4.

5.2. i-vector system

For general usage, we focus on the comparison of systems that do not require additional transcribed speech data or an extra DNN acoustic model.

Table 1: Our end-to-end baseline network configuration

Layer   | Output size                   | Downsample | Channels | Blocks
Conv1   | 64 × L_in                     | False      | 16       | -
Res1    | 64 × L_in                     | False      | 16       | 3
Res2    | 32 × L_in/2                   | True       | 32       | 4
Res3    | 16 × L_in/4                   | True       | 64       | 6
Res4    | 8 × L_in/8                    | True       | 128      | 3
Avgpool | 1 × L_in/8                    | -          | 128      | -
Reshape | 128 × L_out (L_out = L_in/8)  | -          | -        | -

For the baseline i-vector system, raw audio is converted to 56-dimensional 7-1-3-7 SDC features for the LR task. For the SR task, 20-dimensional MFCCs are augmented with their delta and double-delta coefficients, making 60-dimensional MFCC feature vectors. A frame-level energy-based voice activity detection (VAD) selects the features corresponding to speech frames. A 2048-component full-covariance GMM UBM is trained, along with a 600-dimensional i-vector extractor.

For closed-set speaker/language identification, a multi-class LogReg is adopted as the back-end classifier.
For open-set verification, cosine similarity or PLDA with full rank is adopted.

5.3. End-to-end system

Audio is converted to 64-dimensional Fbank features with a frame length of 25 ms, mean-normalized over a sliding window of up to 3 seconds. The same VAD processing as in the i-vector baseline system is used here. We fix the front-end deep CNN module based on the well-known ResNet-34 architecture [38]. The detailed architecture is described in Table 1. The front-end feature extractor has about 1.35 million parameters in total.

In the CNN-TAP system, a simple average pooling layer is built on top of the front-end CNN. In the CNN-LDE system, the TAP layer is replaced with an LDE layer. The number of dictionary components in the CNN-LDE system is 64. The loss weight λ of the center loss is set to 0.001 in our experiments. For the A-Softmax loss, we use the angular margin m = 4.

The model is trained with mini-batches whose size varies from 96 to 256 depending on the dataset and model parameters. The network is trained using typical stochastic gradient descent with momentum 0.9 and weight decay 1e-4. The learning rate is set to 0.1, 0.01, 0.001 and is switched when the training loss plateaus. Training finishes at 40 epochs for the Voxceleb dataset and 90 epochs for the LRE 07 dataset. Since we have no separate validation set, the converged model after the last optimization step is used for evaluation. For each training step, an integer L within the interval [300, 800] is randomly generated, and each example in the mini-batch is cropped or extended to L frames.

For open-set speaker verification, the 128-dimensional speaker embedding is extracted from the penultimate layer of the neural network. An additional similarity metric such as cosine similarity or PLDA is adopted to generate the final pairwise score.

In the testing stage, all the testing utterances with different durations are tested on the same model.
Since the duration is arbitrary, we feed the testing speech utterances to the trained neural network one by one.

Table 2: Results for verification on VoxCeleb (lower is better)

ID | System Description | Encoding    | Loss Function      | Similarity | C_det  | EER (%)
1  | i-vector + cosine  | Supervector | GNLL               | cosine     | 0.829  | 20.63
2  | i-vector + PLDA    | Supervector | GNLL + GNLL        | PLDA       | 0.639  | 7.95
3  | TAP-Softmax        | TAP         | softmax            | cosine     | 0.553  | 5.48
4  | TAP-Softmax        | TAP         | softmax + GNLL     | PLDA       | 0.545  | 5.21
5  | TAP-CenterLoss     | TAP         | center loss        | cosine     | 0.522  | 4.75
6  | TAP-CenterLoss     | TAP         | center loss + GNLL | PLDA       | 0.5155 | 4.59
7  | TAP-ASoftmax       | TAP         | A-Softmax          | cosine     | 0.439  | 5.27
8  | TAP-ASoftmax       | TAP         | A-Softmax + GNLL   | PLDA       | 0.577  | 4.46
9  | SAP-Softmax        | SAP         | softmax            | cosine     | 0.522  | 5.51
10 | SAP-Softmax        | SAP         | softmax + GNLL     | PLDA       | 0.545  | 5.08
11 | SAP-CenterLoss     | SAP         | center loss        | cosine     | 0.540  | 4.98
12 | SAP-CenterLoss     | SAP         | center loss + GNLL | PLDA       | 0.571  | 4.89
13 | SAP-ASoftmax       | SAP         | A-Softmax          | cosine     | 0.509  | 4.90
14 | SAP-ASoftmax       | SAP         | A-Softmax + GNLL   | PLDA       | 0.622  | 4.40
15 | LDE-Softmax        | LDE         | softmax            | cosine     | 0.516  | 5.21
16 | LDE-Softmax        | LDE         | softmax + GNLL     | PLDA       | 0.519  | 5.07
17 | LDE-CenterLoss     | LDE         | center loss        | cosine     | 0.496  | 4.98
18 | LDE-CenterLoss     | LDE         | center loss + GNLL | PLDA       | 0.632  | 4.87
19 | LDE-ASoftmax       | LDE         | A-Softmax          | cosine     | 0.441  | 4.56
20 | LDE-ASoftmax       | LDE         | A-Softmax + GNLL   | PLDA       | 0.576  | 4.48

Table 3: Results for identification on VoxCeleb (higher is better)

ID | System Description | Top-1 (%) | Top-5 (%)
1  | i-vector + LogReg  | 65.8      | 81.4
2  | CNN-TAP            | 88.5      | 94.9
3  | CNN-SAP            | 89.2      | 94.1
4  | CNN-LDE            | 89.9      | 95.7

Table 4: Performance on the 2007 NIST LRE closed-set task (lower is better), C_avg (%) / EER (%)

ID | System Description | 3s          | 10s       | 30s
1  | i-vector + LogReg  | 20.46/17.71 | 8.29/7.00 | 3.02/2.27
2  | CNN-TAP            | 9.98/11.28  | 3.24/5.76 | 1.73/3.96
3  | CNN-SAP            | 8.59/9.89   | 2.49/4.27 | 1.09/2.38
4  | CNN-LDE            | 8.25/7.75   | 2.61/2.31 | 1.13/0.96

5.4. Evaluation

As expected, the end-to-end learning systems outperform the conventional i-vector approach significantly for both SR and LR tasks (see Tables 2-4). For the encoding layer, as can be observed in Tables 2-4, both the SAP layer and the LDE layer outperform the baseline TAP layer. Moreover, the LDE layer shows superior performance compared with the SAP layer. Considering the loss functions in Table 2, in most cases, systems trained with a discriminative loss function such as the center loss or A-Softmax loss achieve better results than those trained with the softmax loss. In terms of the similarity metric, we find that PLDA yields a significant error reduction for the conventional i-vector approach. However, for the end-to-end systems, especially those trained with a discriminative loss function, PLDA achieves little gain and sometimes makes the results worse.

Finally, the CNN-LDE based end-to-end systems achieve the best results on the speaker/language identification tasks. Compared with the CNN-TAP baseline system, the CNN-LDE system achieves 25%, 45%, and 63% relative error reduction on the corresponding NIST LRE 07 3s, 10s, and 30s duration tasks. For the Voxceleb speaker identification task, the system trained with the LDE layer obtains a 12% relative error reduction compared with the CNN-TAP system.

For the speaker verification task, the speaker embeddings extracted from the neural network trained in the LDE-ASoftmax system perform best. In the testing stage, a simple cosine similarity achieves C_det 0.441 and EER 4.56%, a relative 20% error reduction compared with the TAP-Softmax baseline system.

6. Conclusions

In this paper, a unified and interpretable end-to-end system is developed for both SR and LR. It accepts variable-length input and produces an utterance-level result. We investigate how to enhance the system by exploring different kinds of encoding layers and loss functions.
Besides the basic TAP layer, we introduce a SAP layer and an LDE layer to obtain the utterance-level representation. In terms of the loss function for open-set speaker verification, the center loss and A-Softmax loss are introduced to obtain more discriminative speaker embeddings. Experimental results show that the performance of the end-to-end learning system can be significantly improved by designing a suitable encoding layer and loss function.

7. Acknowledgements

The authors would like to acknowledge Yandong Wen from Carnegie Mellon University for his insightful advice on the implementation of the end-to-end discriminative loss. This research was funded in part by the National Natural Science Foundation of China (61401524, 61773413), the Natural Science Foundation of Guangzhou City (201707010363), the Science and Technology Development Foundation of Guangdong Province (2017B090901045), and the National Key Research and Development Program (2016YFC0103905).

8. References

[1] W.M. Campbell, D.E. Sturim, and D.A. Reynolds, "Support vector machines using GMM supervectors for speaker verification," IEEE Signal Processing Letters, vol. 13, no. 5, pp. 308-311, 2006.
[2] T. Kinnunen and H. Li, "An overview of text-independent speaker recognition: From features to supervectors," Speech Communication, vol. 52, no. 1, pp. 12-40, 2010.
[3] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2011.
[4] N. Dehak, P.A. Torres-Carrasquillo, D. Reynolds, and R. Dehak, "Language recognition via i-vectors and dimensionality reduction," in INTERSPEECH 2011, pp. 857-860.
[5] H. Li, B. Ma, and K. Lee, "Spoken language recognition: From fundamentals to practice," Proceedings of the IEEE, vol. 101, no. 5, pp. 1136-1159, 2013.
[6] P. Matejka, L. Zhang, T. Ng, H. Mallidi, O. Glembek, J. Ma, and B. Zhang, "Neural network bottleneck features for language identification," in Proc. IEEE Odyssey, 2014, pp. 299-304.
[7] Y. Song, X. Hong, B. Jiang, R. Cui, I. McLoughlin, and L. Dai, "Deep bottleneck network based i-vector representation for language identification," 2015.
[8] F. Richardson, D. Reynolds, and N. Dehak, "A unified deep neural network for speaker and language recognition," arXiv preprint, 2015.
[9] M. Li, L. Liu, W. Cai, and W. Liu, "Generalized i-vector representation with phonetic tokenizations and tandem features for both text independent and text dependent speaker verification," Journal of Signal Processing Systems, vol. 82, no. 2, pp. 207-215, 2016.
[10] M. Li and W. Liu, "Speaker verification and spoken language identification using a generalized i-vector framework with phonetic tokenizations and tandem features," in INTERSPEECH 2014.
[11] F. Richardson, D. Reynolds, and N. Dehak, "Deep neural network approaches to speaker and language recognition," IEEE Signal Processing Letters, vol. 22, no. 10, pp. 1671-1675, 2015.
[12] F. Soong, A.E. Rosenberg, B.-H. Juang, and L.R. Rabiner, "Report: A vector quantization approach to speaker recognition," AT&T Technical Journal, vol. 66, no. 2, pp. 387-390, 1985.
[13] D.A. Reynolds and R.C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, 1995.
[14] D.A. Reynolds, T.F. Quatieri, and R.B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, 2000, pp. 19-41.
[15] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, "A novel scheme for speaker recognition using a phonetically-aware deep neural network," in ICASSP 2014.
[16] D. Snyder, D. Garcia-Romero, and D. Povey, "Time delay deep neural network-based universal background models for speaker recognition," in ASRU 2016, pp. 92-97.
[17] S.J.D. Prince and J.H. Elder, "Probabilistic linear discriminant analysis for inferences about identity," in ICCV 2007, pp. 1-8.
[18] P. Kenny, "Bayesian speaker verification with heavy-tailed priors," in ODYSSEY 2010.
[19] I. Lopez-Moreno, J. Gonzalez-Dominguez, O. Plchot, D. Martinez, J. Gonzalez-Rodriguez, and P. Moreno, "Automatic language identification using deep neural networks," in ICASSP 2014.
[20] J. Gonzalez-Dominguez, I. Lopez-Moreno, H. Sak, J. Gonzalez-Rodriguez, and P.J. Moreno, "Automatic language identification using long short-term memory recurrent neural networks," in Proc. INTERSPEECH 2014.
[21] D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, "Deep neural network-based speaker embeddings for end-to-end speaker verification," in SLT 2017, pp. 165-170.
[22] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, "Deep Speaker: an end-to-end neural speaker embedding system," 2017.
[23] M. Jin, Y. Song, I. McLoughlin, and L. Dai, "LID-senones and their statistics for language identification," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 1, pp. 171-183, 2018.
[24] W. Cai, Z. Cai, W. Liu, X. Wang, and M. Li, "Insights into end-to-end learning scheme for language identification," in ICASSP 2018.
[25] W. Cai, Z. Cai, X. Zhang, X. Wang, and M. Li, "A novel learnable dictionary encoding layer for end-to-end language identification," in ICASSP 2018.
[26] W. Geng, W. Wang, Y. Zhao, X. Cai, and B. Xu, "End-to-end language identification using attention-based recurrent neural networks," in INTERSPEECH 2016, pp. 2944-2948.
[27] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1480-1489.
[28] G. Bhattacharya, J. Alam, and P. Kenny, "Deep speaker embeddings for short-duration speaker verification," in Proc. Interspeech 2017, pp. 1517-1521.
[29] F.A. Chowdhury, Q. Wang, I.L. Moreno, and L. Wan, "Attention-based models for text-dependent speaker verification," arXiv preprint, 2017.
[30] R. Hadsell, S. Chopra, and Y. LeCun, "Dimensionality reduction by learning an invariant mapping," in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2006, pp. 1735-1742.
[31] Y. Sun, Y. Chen, X. Wang, and X. Tang, "Deep learning face representation by joint identification-verification," in Advances in Neural Information Processing Systems, vol. 27, pp. 1988-1996, 2014.
[32] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815-823.
[33] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, "SphereFace: Deep hypersphere embedding for face recognition," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, vol. 1.
[34] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, "A discriminative feature learning approach for deep face recognition," in European Conference on Computer Vision. Springer, 2016, pp. 499-515.
[35] A. Nagrani, J.S. Chung, and A. Zisserman, "VoxCeleb: a large-scale speaker identification dataset," in INTERSPEECH 2017.
[36] C.S. Greenberg, "The NIST 2012 speaker recognition evaluation plan," NIST Technical Report, 2012.
[37] NIST, "The 2007 NIST language recognition evaluation plan," NIST Technical Report, 2007.
[38] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR 2016, pp. 770-778.
Sun, “Deep residual learn- ing for image recognition, ” in CVPR 2016 , 2016, pp. 770– 778.
