Speaker Recognition using Deep Belief Networks

Short time spectral features such as mel frequency cepstral coefficients(MFCCs) have been previously deployed in state of the art speaker recognition systems, however lesser heed has been paid to short term spectral features that can be learned by ge…

Authors: Adrish Banerjee, Akash Dubey, Abhishek Menon

Speaker Recognition using Deep Belief Networks
Speaker Recognition using Deep Belief Networks to CCIS Proceedings Adrish Banerjee, Akash Dubey, Abhishek Menon, Shubham Nanda, Gora Chand Nandi, Robotics and Artificial Intelligence Laborotary, Indian Institute of Information Technology, Allahabad, India Abstract.  Short time spectral features such as mel frequency cepstral coefficients(MFCCs) have been previously deployed in state of the art speaker recognition systems, however lesser heed has been paid to short term spectral features that can be learned by generative learning models from speech signals. Higher dimensional encoders such as deep belief networks (DBNs) could improve performance in speaker recognition tasks by better modelling the statistical structure of sound waves. In this paper, we use short term spectral features learnt from the DBN augmented with MFCC features to perform the task of speaker recognition. Using our features, we achieved a recognition accuracy of 0.95 as compared to 0.90 when using standalone MFCC features on the ELSDSR dataset. Keywords:  Speaker Recogntion, Deep Belief Network, Gaussian Mixture Models - Universal Background Model 1 Introduction Speaker recognition poses the major problem of understanding how to recognise and represent complex, high dimensional audio data. The solution used by most present day applications involve the use of mel frequency cepstral coefficients(MFCCs), highlighting the extensive dependence on domain specific feature engineering. We would therefore like to propose the use of unsupervised feature learning using a DBN[1] to represent the audio data in the form of spectral features. Work by Ng et al. [2], Schwarz et al.[3] and Lei et al. [4] have shown significant improvement in the speech and speaker recognition tasks whilst deploying a phoneme based approach to learning. DBNs have also been known to work very effectively in the case of learning features from music audio [5], again capturing features based on the temporal aspects of data. But, keeping in mind the text independent nature of the problem, we have decided to approach the problem by learning hierarchical short term spectral features instead of hierarchical temporal features. Smith et al. [6] and Olshausen et al. [7] showed that sparse representation of audio signals were very similar to those of mammalian auditory processing neurons. For instance, the learned representations of natural sounds greatly resembled the cochlear filters in the auditory cortex. However, these methods have been used to learn shallow (1 layer) representations. It is our hypothesis that learning more abstract representations of the speech signal would improve speaker recognition performance by producing better statistical models of the speech signal.These features, by encoding additional speaker dependent information, can hence be augmented with MFCCs to better model speaker characteristics. In this paper, we show that combining features learnt using DBNs with MFCCs produces better accuracy in speaker recognition systems. 2  Unsupervised Feature Learning using Deep Belief Networks Deep belief networks are generative models with numerous layers of latent variables, which are typically binary. There is no intra-layer connectivity whereas the inter layer connections are undirected. Learning in these network turns out to be very difficult due to the intractability of inferring the posterior distribution from the hidden (latent) layers. Sampling from the posterior may be accomplished with deployment of MCMC (Markov Chain Monte Carlo) methods [8], however, these methods are extensively time consuming. [9] Presents a training procedure (using complementary priors to eliminate the explaining away effects) for the deep belief networks, which is equivalent to training a stack or sequence of Restricted Boltzmann Machines (RBM) as depicted in Fig. 1. Fig. 1.  Structure of the DBN used for extraction of short term spectral features, with two hidden layers, can be visualised as a stack of 2 RBMs. 2.1  Restricted Boltzmann Machines (RBM) An RBM [10] (refer Fig. 2) is a type of Markov Random Field, with a layer of stochastic visible units and a layer of stochastic hidden units. There is no intra layer connectivity, but the all units of the visible layer are connected to all units of the hidden layer; the connections are weighted, undirected in nature. The weights of the connections along with the biases define a probability distribution over the state of the visible layer units. The energy function of a configuration of visible units  v and hidden units  h  is given by,  (1) Where E is the energy,  b and  c are the visible and hidden biases respectively and W is the weight matrix representing the strength of connections between visible and hidden units. The joint probability distribution and the conditional distribution of the j  th unit of the hidden layer given the visible layer can be written as:  (2)  (3) Where Z is the normalization or parition function and  :j  refers to all rows and the j  th column of  W  . Fig. 2.  Restricted Boltzmann Machine, where  v  1 .... v  m are visible units and  h  1 .... h  n are hidden (latent) units;  w  ij   is the corresponding weight of the undirected connection between  v  i   and  h  j  . 2.2  Generative Training of RBMs The RBMs can be trained in a greedy layer by layer fashion, training the layer below a layer first and then using the hidden units of the lower layer as visible units for the current layer. The gradient of the log likelihood function of the parameters (W,b,c) can be shown to be (4)  Where the first term is the expectation under the distribution of data and the second term is the expectation under the model. Computing the above expression can be computationally intensive, but as shown by [11] the expectation under the model can be replaced by a point estimate obtained after k Gibbs sampling. Usually, k equal to 1 is sufficient. 2.3  Discriminative fine tuning of DBN using back propagation Following generative pre training of the DBN, the weights were fine-tuned by constructing a neural network with the same architecture of the DBN. The weights of the neurons were initialized to the weights obtained after pre training. The weights were then fine-tuned using backpropagation on the labelled training examples. After fine tuning the outputs of the 1st and 2nd layer of the network corresponded to the learnt representations. These are referred to as L1 and L2 features in this paper, respectively. 3  Deep belief network for speaker recognition The objective of the DBN is learning abstract hierarchical representations of (unlabelled) input data. The deep belief network attempts to learn from frame wise spectral data of the speech signal, which is known to carry speaker dependent information (primarily about the vocal tract shape) [12]. Assuming the speaker to be a vibrating body which produces certain spectral features, we believe that deep generative models (like DBN) can be used to recognise and emphasise the patterns within the spectral data which correspond to the characteristic vocal tract shape of the speaker. The spectrogram (refer Fig. 2) is extracted from the speech data with 25 ms window size with a 10 ms step. To reduce the dimensionality of each frame, PCA whitening is applied on each frame, taking 128 components. We therefore have 128 visible units in the DBN followed by 200 units each in the first and the second hidden layers. Since the inputs to the DBN are real valued, ReLU (Rectified Linear Unit) activation is used for visible units. The activations of the 1st and 2nd hidden layers then form the L1 and L2 features. Fig. 2.  Spectrogram of voice sample, sampled frames of which are used for training the DBN. We append the features learnt by the DBN for each frame with MFCC features (keeping the 2-14 MFCC coefficients) for that frame and train a GMM-UBM [12] model for classification. This augmented feature vector can better represent the speaker dependent features carried by the spectral frame, than standalone MFCC features. Fig. 3. Creation of augmented feature vectors which accurately represent the statistical model of speech. 4  Speaker recognition using GMM-UBM framework For the task of speaker recognition using our learned representations we use the state of the art Gaussian Mixture Model - Universal Background Model framework developed by Reynolds et al. [12]. Gaussian Mixture models [13] are probabilistic models used to represent arbitrary probability distributions. For a GMM with M Gaussians, the probability of a vector x can be expressed as:  (5) Where refers to the collective parameters for the M gaussians. The { w , μ, Σ} λ = probability is the weighted linear sum of M individual Gaussians probabilities p  i  (  x  ),  (6) Where is the mean and is the covariance matrix of the gaussian  i.  Further μ i Σ i the sum of all weights  w  i  is 1. Reynolds et al.[12] formulated the basic task of speech recognition as a basic hypothesis test. Given a speech segment,  X  , and a hypothesized speaker,  Y  , the test to determine if X was spoken by Y reduces to the hypothesis test, H  0  :  X  was spoken by  Y  ; H  1  :  X  was not spoken by  Y They also formulated the optimal test to decide betw een the two hypotheses as the ratio If the ratio is greater than a threshold accept H  0 else reject  H  0  . The . P ( X | H ) 1 P ( X | H ) 0 θ task is then to model  p(X|H  0  ) and  p(X|H  1  )  . For each speaker a GMM is fit to augmented features (MFCC + DBN features) with maximum likelihood parameters estimated using the EM [15] algorithm. Each GMM then models the probability of a speech feature being said by a speaker.  p(X|H  1  ) is modelled using a single, speaker independent GMM known as the UBM. This GMM is fit with features extracted from speech of all speakers in the dataset. This models the probability of a expected alternative speech which can be encountered. After generating the models for the speaker  P(X|H  0  ) and the background model  P(X|H  1  )  , we can use the logarithm of the likelihood:  Log-likelihood = log(p(X|H  0  ) - log(p(X|H  1  ))    (7) The log likelihood can be use as the statistic to verify the speaker. The log likelihood can be normalized using a standard score normalization technique to obtain a confidence measure between 0 and 1. 5 Results We evaluated the performance of our approach on the ELSDSR[15] dataset. We vary the number of speakers and train a GMM-UBM model (with the number of Gaussians as 64 for each GMM). We use different feature vectors for the frames of each sample during training, the results reported are averaged after 15 trials, are as follows: Table 1  .  Performance of different feature vectors used to train a GMM-UBM model with increase in the number of speakers. No of Speakers MFCC MFCC + L1 MFCC + L2 MFCC + L1 + L2 2 0.98 0.99 0.98 0.99 5 0.94 0.97 0.95 0.98 12 0.92 0.96 0.91 0.96 15 0.90 0.95 0.91 0.96 22 0.90 0.94 0.90 0.95 The GMM-UBM model trained using only MFCC features for each frame is beaten by the GMM-UBM models trained by appending the L1 and L2 features obtained from the DBN for the frame. Fig. 4.  Plot of Accuracy v/s No. of speakers shows superior performance of MFCC + L1+ L2. We also evaluate our approach by varying the number of utterances per speaker whilst training the GMM. This highlights the dependence of our approach on the availability of labelled data (refer Table 2 and Fig. 5). Table 1  .  Performance of different feature vectors used to train a GMM-UBM model with increase in the number of utterances per speaker. No of Utterances MFCC MFCC + L1 MFCC + L2 MFCC + L1 + L2 1 0.76 0.80 0.78 0.85 2 0.78 0.82 0.84 0.86 3 0.85 0.85 0.84 0.92 This highlights the superiority of our approach in situations where there is lack of labelled data. Fig. 5. Plot of Accuracy v/s No. of utterances per speaker for 22 speakers shows the superiority of MFCC + L1 + L2 feature vectors. 7 Conclusion Our work therefore highlights the potency of deep belief networks in extracting short term spectral features which are capable of capturing speaker dependent features. We clearly observe that in dearth of labelled training examples, the presence of features learned using the DBN (which can be trained with unlabeled data) greatly enhances the accuracy of speaker recognition in such settings. Future work can be done with the attempt to find more accurate representation of short term spectral data, to completely remove the use of engineered features such as MFCC. References 1. Hinton, G. E. : Deep belief networks. Scholarpedia, 4(5), 5947 (2009). 2. Lee, H., Grosse, R., Ranganath, R., Ng, A. Y.: Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In: Proceedings of the 26th Annual International Conference on Machine Learning , pp. 609--616. ACM, Montreal (2009) 3. Lei, Y., Ferrer, L., McLaren, M.: A novel scheme for speaker recognition using a phonetically-aware deep neural network. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1695--1699. IEEE, Florence (2014) 4. Schwarz, P., Matejka, P., Cernocky, J.: Hierarchical structures of neural networks for phoneme recognition. In: 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, pp. I-325--I-328. IEEE, Toulouse (2006) 5. Hamel, P., Eck, D.: Learning Features from Music Audio with Deep Belief Networks. In: ISMIR, pp. 339--344. (2010) 6. Smith, E. C., Lewicki, M. S.: Efficient auditory coding. Nature, 439(7079), 978-982 (2006) 7. Olshausen, B. A.: Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607-609 (1996) 8.  Neal, R. M.: Connectionist learning of belief networks.  Artificial intelligence  ,  56  (1), 71-113 (1992) 9.  Hinton, G. E., Osindero, S., Teh, Y. W.: A fast learning algorithm for deep belief nets. Neural computation, 18(7), 1527-1554 (2006) 10.Welling, M., Rosen-Zvi, M., Hinton, G. E.: Exponential family harmoniums with an application to information retrieval. In: Advances in neural information processing systems, pp. 1481--1488, Vancouver (2004) 11.Hinton, G. E.: Training products of experts by minimizing contrastive divergence. Neural computation, 14(8), 1771-1800 (2002) 12.Reynolds, D. A., Quatieri, T. F., Dunn, R. B.: Speaker verification using adapted Gaussian mixture models. Digital signal processing, 10(1), 19-41 (2000) 13.Reynolds, D.: Gaussian mixture models. Encyclopedia of Biometrics, 827-832 (2015) 14.Dempster, A. P., Laird, N. M., Rubin, D. B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the royal statistical society. Series B (methodological), 1-38 (1977) 15.Feng, L., Hansen, L. K.: A new database for speaker recognition. (2005)

Original Paper

Loading high-quality paper...

Comments & Academic Discussion

Loading comments...

Leave a Comment