THE CORAL+ ALGORITHM FOR UNSUPERVISED DOMAIN ADAPTATION OF PLDA

Kong Aik Lee, Qiongqiong Wang, Takafumi Koshinaka
Biometrics Research Laboratories, NEC Corporation, Japan
kongaik.lee, q-wang, koshinak@nec.com

ABSTRACT

State-of-the-art speaker recognition systems comprise an x-vector (or i-vector) speaker embedding front-end followed by a probabilistic linear discriminant analysis (PLDA) backend. The effectiveness of these components relies on the availability of a large collection of labeled training data. In practice, it is common that the domains (e.g., language, demographic) in which the system is deployed differ from those in which it was trained. To close the gap due to this domain mismatch, we propose an unsupervised PLDA adaptation algorithm that learns from a small amount of unlabeled in-domain data. The proposed method was inspired by prior work on a feature-based domain adaptation technique known as correlation alignment (CORAL). We refer to the model-based adaptation technique proposed in this paper as CORAL+. The efficacy of the proposed technique is experimentally validated on the recent NIST 2016 and 2018 Speaker Recognition Evaluation (SRE'16, SRE'18) datasets.

Index Terms — Speaker recognition, domain adaptation, unsupervised, discriminant analysis

1. INTRODUCTION

Speaker recognition is the task of recognizing a person from his/her voice given a small amount of speech from the speaker [1]. Recent progress has shown successful application of deep neural networks to derive deep speaker embeddings from speech utterances [2, 3]. Analogous to word embeddings [4, 5], a speaker embedding is a fixed-length continuous-valued vector that provides a succinct characterization of a speaker's voice rendered in a speech utterance. Similar to classical i-vectors [6], deep speaker embeddings live in a simpler Euclidean space where distance can be measured easily, compared to the much more complex input patterns.
Techniques like within-class covariance normalization (WCCN) [7], linear discriminant analysis (LDA) [8], and probabilistic LDA (PLDA) [9, 10, 11] can then be applied. Systems comprising an x-vector (or i-vector) speaker embedding front-end followed by PLDA have shown state-of-the-art performance on speaker verification tasks [12]. Training an x-vector PLDA system typically requires over a hundred hours of training data with speaker labels, with the requirement that the training set contain multiple recordings of each speaker under different settings (recording devices, transmission channels, noise, reverberation, etc.). These knowledge sources contribute to the robustness of the system against such nuisance factors. The challenging problem of domain mismatch arises when a speaker recognition system is used in a domain (e.g., different languages, demographics, etc.) other than that of the training data. Its performance then degrades considerably.

It is impractical to re-train the system for each and every domain, as the effort of collecting large labeled datasets is expensive and time consuming. A more viable solution is to adapt the already trained model using a smaller, and possibly unlabeled, set of in-domain data. Domain adaptation could be accomplished at different stages of the x-vector (or i-vector) PLDA pipeline. PLDA adaptation is preferable in practice since the same feature extraction and speaker embedding front-end can be used, while domain-adapted PLDA backends cater for the condition of each specific deployment.

PLDA adaptation involves the adaptation of its mean vector^1 and covariance matrices. In the case of unsupervised adaptation (i.e., no labels are given), the major challenge is how the adaptation can be performed on the within-class and between-class covariance matrices, given that only the total covariance matrix can be estimated directly from the in-domain data.
In this paper, we show that this can be accomplished by applying a principle similar to that of feature-based correlation alignment (CORAL) [14], from which pseudo-in-domain within-class and between-class covariance matrices can be computed. We further improve robustness by introducing an additional adaptation parameter and regularization into the adaptation equation. The proposed unsupervised adaptation method is referred to as CORAL+.

2. DOMAIN ADAPTATION OF PLDA

This section presents a brief description of probabilistic linear discriminant analysis (PLDA), widely used in state-of-the-art speaker recognition systems. We then draw attention to the domain mismatch issue and how the correlation alignment (CORAL) [14, 15] technique deals with it via feature transformation.

^1 Mean shift due to domain mismatch could be solved by centralizing the datasets to a common origin [13].

2.1. Probabilistic LDA

Let the vector φ be a speaker embedding (e.g., x-vector, i-vector, etc.). We assume that φ is generated from a linear Gaussian model [8], as follows [9, 16]:

    p(φ | h, x) = N(φ | µ + F h + G x, Σ)    (1)

The vector µ represents the global mean, F and G are the speaker and channel loading matrices, and the diagonal matrix Σ models the residual variances. The variables h and x are the latent speaker and channel variables, respectively.

A PLDA model is essentially a Gaussian distribution in the speaker embedding space. This can be seen more clearly in the form of the marginal density:

    p(φ) = N(φ | µ, Φ_b + Φ_w)    (2)

The main idea is to account for the speaker and channel variability with a between-class and a within-class covariance matrix,

    Φ_b = F F^T,    Φ_w = G G^T + Σ    (3)

respectively. We refer the reader to [9, 10, 16] for details of the model training procedure. In a speaker verification task, the PLDA model serves as a backend classifier.
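The two-covariance view in (2)–(3) can be checked numerically. The following is a minimal numpy sketch (toy dimensions and loading matrices are illustrative, not from the paper): it samples embeddings from the linear Gaussian model (1) and verifies that their total covariance approaches Φ_b + Φ_w.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_spk, n_utt = 8, 500, 20               # toy sizes (illustrative)

# Toy model parameters: Phi_b = F F^T, Phi_w = G G^T + Sigma, as in (3)
F = 0.5 * rng.standard_normal((dim, dim))
G = 0.3 * rng.standard_normal((dim, dim))
Sigma = np.diag(rng.uniform(0.05, 0.15, dim))
Phi_b, Phi_w = F @ F.T, G @ G.T + Sigma
mu = np.zeros(dim)

# Sample phi = mu + F h + G x + eps, per the linear Gaussian model (1)
sd = np.sqrt(np.diag(Sigma))                 # Sigma is diagonal
h = rng.standard_normal((n_spk, dim))        # one latent speaker factor per speaker
phis = np.array([mu + F @ h[s] + G @ rng.standard_normal(dim)
                 + sd * rng.standard_normal(dim)
                 for s in range(n_spk) for _ in range(n_utt)])

# The empirical total covariance should approach Phi_b + Phi_w, as in (2)
C_emp = np.cov(phis.T, bias=True)
C_model = Phi_b + Phi_w
rel_err = np.linalg.norm(C_emp - C_model) / np.linalg.norm(C_model)
print(f"relative error: {rel_err:.3f}")
```

With a few hundred toy speakers the empirical total covariance matches the model's Φ_b + Φ_w to within sampling error, which is the decomposition the adaptation in later sections operates on.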
For a given pair of enrollment and test utterances, i.e., their speaker embeddings φ_1 and φ_2, we compute the log-likelihood ratio score

    l(φ_1, φ_2) = p(φ_1, φ_2) / [ p(φ_1) p(φ_2) ]    (4)

corresponding to the hypothesis test of whether the two belong to the same or different speakers. The denominator is evaluated by substituting φ_1 and φ_2 in turn into (2). The numerator is computed using

    p(φ_1, φ_2) = N( [φ_1; φ_2] | [µ; µ], [C, Φ_b; Φ_b, C] )    (5)

where C = Φ_b + Φ_w is the total covariance matrix. The assumption is that the unseen data follow the same distribution as given by the within-class and between-class covariance matrices derived from the training set (i.e., the dataset used to train the PLDA). A problem arises when the training set was drawn from a domain (out-of-domain) different from that of the enrollment and test utterances (in-domain).

2.2. Correlation Alignment

Correlation alignment (CORAL) [14] aims to align the second-order statistics, i.e., covariance matrices, of the out-of-domain (OOD) features to match the in-domain (InD) features. No class (i.e., speaker) label is used, and it therefore belongs to the class of unsupervised adaptation techniques. The algorithm consists of two steps, namely, whitening followed by re-coloring. Let C_o and C_I be the covariance matrices of the OOD and InD data, respectively. Denoting by φ an OOD vector, domain adaptation is performed by first whitening and then re-coloring, as follows:

    φ' = C_I^{1/2} C_o^{-1/2} φ    (6)

where C_o^{-1/2} = Q_o Λ_o^{-1/2} Q_o^T whitens the input vector, and C_I^{1/2} = Q_I Λ_I^{1/2} Q_I^T does the re-coloring. Here, Q and Λ are the eigenvectors and eigenvalues pertaining to the covariance matrices^2. This simple, "frustratingly easy" approach [15] has been shown to outperform a more complicated non-linear transformation reported in [18].
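As a concrete illustration, the whitening-and-re-coloring transform in (6) can be sketched in a few lines of numpy. The data here are toy samples (not from the paper); the symmetric square roots are computed by EVD as in the ZCA formulation.

```python
import numpy as np

def zca_sqrt(C, inverse=False):
    """Symmetric (ZCA) square root C^{1/2}, or C^{-1/2} if inverse=True, via EVD."""
    vals, Q = np.linalg.eigh(C)
    p = -0.5 if inverse else 0.5
    return Q @ np.diag(vals ** p) @ Q.T

def coral_transform(C_out, C_in):
    """Matrix A = C_in^{1/2} C_out^{-1/2}: whitens w.r.t. OOD, re-colors w.r.t. InD (eq. 6)."""
    return zca_sqrt(C_in) @ zca_sqrt(C_out, inverse=True)

# Toy OOD and InD samples with different covariance structures (illustrative)
rng = np.random.default_rng(1)
X_out = rng.standard_normal((5000, 4)) @ rng.standard_normal((4, 4))
X_in = rng.standard_normal((5000, 4)) @ (2.0 * rng.standard_normal((4, 4)))
C_o, C_i = np.cov(X_out.T), np.cov(X_in.T)

A = coral_transform(C_o, C_i)
X_adapt = X_out @ A.T                 # apply phi' = A phi row-wise
# The covariance of the adapted OOD vectors now matches the in-domain covariance
print(np.allclose(np.cov(X_adapt.T), C_i, atol=1e-6))
```

Since A C_o A^T = C_I by construction, the match is exact up to floating-point error when C_o is the empirical covariance of the transformed set.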
In [15], CORAL is performed on the OOD x-vector (or i-vector) embeddings, and the transformed (pseudo-in-domain) vectors are used to re-train the PLDA. Note that the speaker labels of the OOD training data remain the same.

3. THE CORAL+ ALGORITHM

CORAL is a feature-based domain adaptation technique [14]. We propose integrating CORAL into PLDA, leading to a model-based domain adaptation.

3.1. Domain adaptation

It is commonly known that a linear transformation of a normally distributed vector leads to an equivalent transformation of the mean vector and covariance matrix of its density function. Let A = C_I^{1/2} C_o^{-1/2} be the transformation matrix and φ' = A^T φ the transformed vector. The covariance matrix of the pseudo-in-domain vector φ' is given by

    C'_o = A^T C_o A = A^T Φ_{w,o} A + A^T Φ_{b,o} A    (7)

Here, we have considered a PLDA model trained on OOD data, with a total covariance matrix C_o = Φ_{w,o} + Φ_{b,o} given by the sum of the within-class and between-class covariance matrices, as noted in Section 2.1. The above equation shows that training a PLDA on the transformed vectors φ', as proposed in [15], is equivalent to transforming the within-class, between-class, and total covariance matrices of a PLDA trained on OOD data.

^2 The whitening and re-coloring procedures are better known as the zero-phase component analysis (ZCA) transformation [17]. As opposed to principal component analysis (PCA) and Cholesky whitening (and re-coloring), ZCA preserves the maximal similarity of the transformed features to the original space.

[Fig. 1. The effects of regularization. Elements with negative variances are removed automatically.]

3.2. Model-level adaptation

Instead of replacing the covariance matrices of an OOD PLDA with the pseudo-in-domain matrices, model-level adaptation allows us to consider their interpolation:

    Φ_b^+ = (1 - β) Φ_{b,o} + β A^T Φ_{b,o} A
    Φ_w^+ = (1 - γ) Φ_{w,o} + γ A^T Φ_{w,o} A

where {β, γ} are adaptation parameters constrained to lie between zero and one. Notice that the first term on the right-hand side of each equation is the OOD between/within-class covariance matrix, while the second term is the pseudo-in-domain covariance matrix. For clarity, we further simplify the adaptation equations as follows:

    Φ_b^+ = Φ_{b,o} + β ( A^T Φ_{b,o} A - Φ_{b,o} )
    Φ_w^+ = Φ_{w,o} + γ ( A^T Φ_{w,o} A - Φ_{w,o} )    (8)

The second term on the right-hand side of each equation represents the new information seen in the in-domain data, to be added to the PLDA model.

3.3. Regularized adaptation

The central idea of domain adaptation is to propagate the uncertainty seen in the in-domain data to the PLDA model. The adaptation equations in (8) do not guarantee that the variances, and therefore the uncertainty, increase. In this section, we achieve this goal in a transform space where the OOD and pseudo-in-domain matrices are simultaneously diagonalized.

Let B be a matrix such that B^T Φ B = I and B^T (A^T Φ A) B = E, where E is a diagonal matrix. This procedure is referred to as simultaneous diagonalization. The transformation matrix B is obtained by performing eigenvalue decomposition (EVD) twice: on the matrix Φ, and then on A^T Φ A after the first transformation has been applied. The procedure is illustrated in Algorithm 1. By applying simultaneous diagonalization to (8), the following adaptation is obtained:

    Φ_b^+ = Φ_{b,o} + β B_b^{-T} (E_b - I) B_b^{-1}
    Φ_w^+ = Φ_{w,o} + γ B_w^{-T} (E_w - I) B_w^{-1}    (9)

[Algorithm 1: The CORAL+ algorithm for unsupervised adaptation of PLDA.]
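A numpy sketch of this procedure follows (toy matrices; names are illustrative). The `simdiag` helper performs the double EVD described above, and `coral_plus` applies the update (9), with an optional non-negativity clipping of (E - I) anticipating the regularization discussed next.

```python
import numpy as np

def simdiag(Phi, Phi_pseudo):
    """Find B with B^T Phi B = I and B^T Phi_pseudo B = diag(e), via two EVDs."""
    d, Q = np.linalg.eigh(Phi)
    W = Q @ np.diag(d ** -0.5)                   # first EVD: W^T Phi W = I
    e, P = np.linalg.eigh(W.T @ Phi_pseudo @ W)  # second EVD on the whitened matrix
    return W @ P, e                              # B and the diagonal of E

def coral_plus(Phi, A, alpha, regularize=True):
    """Adapt one covariance (between- or within-class) as in (9)/(10).
    A stands in for the CORAL transform; alpha is the interpolation weight."""
    B, e = simdiag(Phi, A.T @ Phi @ A)
    delta = e - 1.0                              # diagonal of (E - I)
    if regularize:
        delta = np.maximum(delta, 0.0)           # keep only variance increases
    Binv = np.linalg.inv(B)
    return Phi + alpha * Binv.T @ np.diag(delta) @ Binv

# Toy OOD within-class covariance and a toy transform (illustrative only)
rng = np.random.default_rng(5)
M = rng.standard_normal((4, 4))
Phi_w_o = M @ M.T + np.eye(4)
A = np.eye(4) + 0.3 * rng.standard_normal((4, 4))

Phi_w_plus = coral_plus(Phi_w_o, A, alpha=0.8)
# With clipping, the adapted covariance never shrinks: Phi+ - Phi is PSD
print(np.all(np.linalg.eigvalsh(Phi_w_plus - Phi_w_o) >= -1e-8))
```

The positive-semidefiniteness of Φ^+ - Φ follows directly from the clipping: in the B basis the difference is a non-negative diagonal matrix scaled by the adaptation weight.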
As before, the between-class and within-class covariance matrices are adapted separately. Notice that the term (E - I) will end up with negative variances if any diagonal element of E is less than one. We propose the following regularized adaptation:

    Φ_b^+ = Φ_{b,o} + β B_b^{-T} max(E_b - I, 0) B_b^{-1}
    Φ_w^+ = Φ_{w,o} + γ B_w^{-T} max(E_w - I, 0) B_w^{-1}    (10)

The element-wise max(·, 0) operator ensures that the variances do not decrease. We refer to the regularized adaptation in (10) as the CORAL+ algorithm, while (9) corresponds to CORAL+ without regularization. Algorithm 1 summarizes the CORAL+ algorithm. Figure 1 shows a plot of the diagonal elements of the term (E_b - I) in (10). Entries with negative variances are removed automatically by the max(·, 0) operator, which ensures that the uncertainty increases (or stays the same) in the adaptation process.

It is worth noting that one could recover the subspace matrices {F, G} via EVD. Nevertheless, this is not generally required, as scores can be computed by plugging the adapted covariance matrices Φ_b^+, Φ_w^+, and C^+ = Φ_b^+ + Φ_w^+ into (2) and (5).

4. EXPERIMENT

Experiments were conducted on the recent SRE'16 and SRE'18 datasets. Performance was measured in terms of equal error rate (EER) and minimum detection cost (MinCost) [19, 20]. The latest SREs organized by NIST have been focusing on domain mismatch as one of the technical challenges. In both SRE'16 and SRE'18, the training set consists primarily of English speech corpora collected over multiple years in North America. This dataset encompasses the Switchboard, Fisher, and MIXER corpora used in SREs 04–06, 08, 10, and 12. The enrollment and test segments are in Tagalog and Cantonese for SRE'16, and in Tunisian Arabic for SRE'18. Domain adaptation was performed using the unlabeled subsets provided for the evaluation.
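For reference, the EER used here is the operating point at which the miss and false-alarm rates are equal; it can be computed from target and non-target score lists with a simple threshold sweep. The Gaussian toy scores below are illustrative only.

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Equal error rate: find the threshold where miss rate equals false-alarm rate."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    order = np.argsort(scores)                    # sweep thresholds low -> high
    labels = labels[order]
    # miss: fraction of targets at or below the threshold
    miss = np.cumsum(labels) / labels.sum()
    # false alarm: fraction of non-targets above the threshold
    fa = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()
    i = np.argmin(np.abs(miss - fa))              # closest crossing point
    return 0.5 * (miss[i] + fa[i])

# Toy well-separated score distributions (illustrative)
rng = np.random.default_rng(4)
tgt = rng.normal(2.0, 1.0, 10000)
non = rng.normal(-2.0, 1.0, 10000)
e = eer(tgt, non)
print(f"{100 * e:.1f}% EER")
```

MinCost additionally weights misses and false alarms by the priors and costs defined in the NIST evaluation plans [19, 20]; the sweep is the same, only the objective at each threshold differs.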
The enrollment utterances have a nominal duration of 60 seconds, while the test durations range from 10 to 60 seconds. We used x-vector speaker embeddings, which have been shown to be very effective for speaker verification over short utterances. (Recent results show that i-vectors are more effective for longer utterances of over 2 minutes.) The x-vector extractor follows the same configuration and was trained using the same setup as the Kaldi recipe^3. A slight difference is that we used an attention model in the pooling layer and extended the data augmentation [21]. In our experiments, the dimension of the x-vector was 512.

As in most state-of-the-art systems, LDA was used to reduce the dimensionality. We investigated 150- and 200-dimensional x-vectors after LDA projection. The CORAL [14] transformation was applied to the raw x-vectors before LDA. The transformed, and then projected, x-vectors were used to train a PLDA for the CORAL PLDA baseline. It is worth noting that the LDA projection matrix was computed from the raw x-vectors, from which the CORAL transformation was also derived. We find that this gives better performance than that reported in [15].

The proposed CORAL+ is a model-based adaptation technique. Domain adaptation is achieved by adapting the parameters (i.e., covariance matrices) of the OOD PLDA as in (10) and Algorithm 1, using the unlabeled in-domain dataset. The adaptation parameters were set empirically to 0.80 in the experiments. Tables 1 and 2 show the performance of the baseline PLDA model trained on the out-of-domain English dataset (OOD PLDA), the PLDA trained on x-vectors adapted using CORAL (CORAL PLDA), and the OOD PLDA adapted to the in-domain data with the CORAL+ algorithm (CORAL+ PLDA). Also shown in the tables is CORAL+ adaptation without regularization (w/o reg), which corresponds to using (9) in place of (10) in Algorithm 1.
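All of the PLDA variants in Tables 1 and 2 score trials as in (4)–(5), differing only in the covariance matrices plugged in (OOD, CORAL-retrained, or CORAL+-adapted). A minimal numpy sketch of that verification score, with toy spherical covariances standing in for trained (or adapted) Φ_b and Φ_w:

```python
import numpy as np

def gauss_logpdf(x, mean, cov):
    """Log-density of a multivariate Gaussian, via slogdet and a linear solve."""
    d = x - mean
    sign, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(cov, d))

def plda_llr(phi1, phi2, mu, Phi_b, Phi_w):
    """Log of the likelihood ratio (4): joint density (5) over product of marginals (2)."""
    C = Phi_b + Phi_w                            # total covariance
    joint_mean = np.concatenate([mu, mu])
    joint_cov = np.block([[C, Phi_b], [Phi_b, C]])
    num = gauss_logpdf(np.concatenate([phi1, phi2]), joint_mean, joint_cov)
    den = gauss_logpdf(phi1, mu, C) + gauss_logpdf(phi2, mu, C)
    return num - den

# Toy model: strong speaker variability relative to channel variability (illustrative)
rng = np.random.default_rng(3)
dim = 4
mu = np.zeros(dim)
Phi_b, Phi_w = 4.0 * np.eye(dim), 1.0 * np.eye(dim)

def draw_pair(same):
    h1 = rng.multivariate_normal(mu, Phi_b)      # latent speaker factor
    h2 = h1 if same else rng.multivariate_normal(mu, Phi_b)
    return (h1 + rng.multivariate_normal(np.zeros(dim), Phi_w),
            h2 + rng.multivariate_normal(np.zeros(dim), Phi_w))

same = np.mean([plda_llr(*draw_pair(True), mu, Phi_b, Phi_w) for _ in range(200)])
diff = np.mean([plda_llr(*draw_pair(False), mu, Phi_b, Phi_w) for _ in range(200)])
print(same > diff)    # target trials score higher on average
```

An adapted system would simply pass Φ_b^+ and Φ_w^+ from (10) in place of the out-of-domain matrices; no re-derivation of the scoring function is needed.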
The results on both SRE'16 and SRE'18 show consistent improvements of CORAL+ PLDA over the OOD PLDA baseline. The relative improvement amounts to 36.6% and 22.35% reductions in EER, and 32.0% and 23.0% reductions in MinCost, on SRE'16 and SRE'18 respectively, for an LDA dimension of 200. Also shown in the tables is an unsupervised adaptation method implemented in Kaldi^3 (Kaldi PLDA). The proposed CORAL+ PLDA consistently outperforms this baseline on both SRE'16 and SRE'18, though the improvement is more apparent on SRE'18. At an LDA dimension of 200, the relative improvement amounts to a 10.5% reduction in EER and a 6.0% reduction in MinCost on SRE'18.

^3 https://github.com/kaldi-asr/kaldi/tree/master/egs/sre16/v2

Table 1. Performance comparison on SRE'16 (CMN). The dimension of the x-vector after LDA is 150 and 200. Boldface denotes the best performance for each column.

                    LDA 150               LDA 200
                    EER (%)   MinCost     EER (%)   MinCost
    OOD PLDA        9.69      0.783       9.94      0.813
    Kaldi PLDA      6.82      0.552       6.57      0.558
    CORAL PLDA      6.50      0.539       6.31      0.543
    CORAL+ PLDA     6.62      0.540       6.30      0.553
    w/o reg         6.93      0.544       6.51      0.547

Table 2. Performance comparison on SRE'18 (CMN2). The dimension of the x-vector after LDA is 150 and 200. Boldface denotes the best performance for each column.

                    LDA 150               LDA 200
                    EER (%)   MinCost     EER (%)   MinCost
    OOD PLDA        7.19      0.538       7.47      0.569
    Kaldi PLDA      6.25      0.435       6.48      0.466
    CORAL PLDA      6.22      0.449       6.42      0.482
    CORAL+ PLDA     5.95      0.421       5.80      0.438
    w/o reg         6.49      0.441       6.33      0.460

Compared with the feature-based CORAL (CORAL PLDA), the benefit of CORAL+ (CORAL+ PLDA) is more apparent on SRE'18. We obtained relative reductions of 9.7% in EER and 9.1% in MinCost at an LDA dimension of 200. It is worth mentioning that SRE'16 has an unlabeled set of about the same size as that of SRE'18. Nevertheless, the SRE'18 unlabeled set exhibits less variability (speaker and channel).
This also explains the benefit of regularized adaptation on SRE'18, where a smaller and more constrained unlabeled dataset is available for domain adaptation.

5. CONCLUSION

We have presented the CORAL+ algorithm for unsupervised adaptation of a PLDA backend, to deal with the domain mismatch issue in practical applications. Similar to the feature-based correlation alignment (CORAL) technique, CORAL+ domain adaptation is accomplished by matching the out-of-domain statistics to those of the in-domain data. We show that statistics matching can be applied directly to the PLDA model. We further improve robustness by introducing an additional adaptation parameter and regularization into the adaptation equation. The proposed method shows significant improvement over the PLDA baseline. Results also show the benefit of model-based adaptation, especially when the data available for adaptation is relatively small and constrained.

6. REFERENCES

[1] J. H. L. Hansen and T. Hasan, "How humans and machines recognize voices: a tutorial review," IEEE Signal Processing Magazine, vol. 32, no. 6, pp. 74–99, 2015.
[2] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, "Deep neural network embeddings for text-independent speaker verification," in Proc. Interspeech, 2017, pp. 999–1003.
[3] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in Proc. ICASSP, 2014, pp. 4052–4056.
[4] Y. Bengio, R. Ducharme, and P. Vincent, "A neural probabilistic language model," in NIPS, 2000.
[5] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in NIPS, 2013.
[6] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front end factor analysis for speaker verification," IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 4, pp. 788–798, 2010.
[7] A. O. Hatch, S. Kajarekar, and A. Stolcke, "Within-class covariance normalization for SVM-based speaker recognition," in Proc. Interspeech, 2006, pp. 1471–1474.
[8] C. Bishop, Pattern Recognition and Machine Learning, Springer, New York, 2006.
[9] S. J. D. Prince and J. H. Elder, "Probabilistic linear discriminant analysis for inferences about identity," in Proc. ICCV, 2007, pp. 1–8.
[10] S. Ioffe, "Probabilistic linear discriminant analysis," in Proc. ECCV, Part IV, LNCS 3954, 2006, pp. 531–542.
[11] P. Kenny, "Bayesian speaker verification with heavy-tailed priors," in Proc. Odyssey: Speaker and Language Recognition Workshop, 2010.
[12] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in Proc. ICASSP, 2018, pp. 5329–5333.
[13] K. A. Lee, V. Hautamaki, T. Kinnunen, et al., "The I4U mega fusion and collaboration for NIST speaker recognition evaluation 2016," in Proc. Interspeech, 2017, pp. 1328–1332.
[14] B. Sun, J. Feng, and K. Saenko, "Return of frustratingly easy domain adaptation," in Proc. AAAI, vol. 6, 2016, p. 8.
[15] J. Alam, G. Bhattacharya, and P. Kenny, "Speaker verification in mismatched conditions with frustratingly easy domain adaptation," in Proc. Odyssey, 2018, pp. 176–180.
[16] S. J. D. Prince, Computer Vision: Models, Learning, and Inference, Cambridge University Press, 2012.
[17] A. Kessy, A. Lewin, and K. Strimmer, "Optimal whitening and decorrelation," The American Statistician, pp. 1–6, 2018.
[18] W. Lin, M.-W. Mak, L. Li, and J.-T. Chien, "Reducing domain mismatch by maximum mean discrepancy based autoencoders," in Proc. Odyssey, 2018, pp. 162–167.
[19] National Institute of Standards and Technology, "NIST 2016 Speaker Recognition Evaluation Plan," NIST SRE, 2016.
[20] National Institute of Standards and Technology, "NIST 2018 Speaker Recognition Evaluation Plan," NIST SRE, 2018.
[21] Koji Okabe, Takafumi Koshinaka, and Koichi Shinoda, "Attentive statistics pooling for deep speaker embedding," in Proc. Interspeech, 2018, pp. 2252–2256.
