Learning Speaker Representations with Mutual Information


Authors: Mirco Ravanelli, Yoshua Bengio

Mirco Ravanelli, Yoshua Bengio*
Mila, Université de Montréal; *CIFAR Fellow
mirco.ravanelli@gmail.com

Abstract

Learning good representations is of crucial importance in deep learning. Mutual Information (MI) or similar measures of statistical dependence are promising tools for learning these representations in an unsupervised way. Even though the mutual information between two random variables is hard to measure directly in high-dimensional spaces, some recent studies have shown that an implicit optimization of MI can be achieved with an encoder-discriminator architecture similar to that of Generative Adversarial Networks (GANs). In this work, we learn representations that capture speaker identities by maximizing the mutual information between the encoded representations of chunks of speech randomly sampled from the same sentence. The proposed encoder relies on the SincNet architecture and transforms the raw speech waveform into a compact feature vector. The discriminator is fed by either positive samples (from the joint distribution of encoded chunks) or negative samples (from the product of the marginals) and is trained to separate them. We report experiments showing that this approach effectively learns useful speaker representations, leading to promising results on speaker identification and verification tasks. Our experiments consider both unsupervised and semi-supervised settings and compare the performance achieved with different objective functions.

Index Terms: Deep Learning, Speaker Recognition, Mutual Information, Unsupervised Learning, SincNet.

1. Introduction

Deep learning has shown remarkable success in numerous speech tasks, including speech recognition [1–4] and speaker recognition [5, 6]. The deep learning paradigm aims to describe data by means of a hierarchy of representations that are progressively combined to model higher-level abstractions [7].
Most commonly, deep neural networks are trained in a supervised way, while learning meaningful representations in an unsupervised fashion is more challenging but could be useful, especially in semi-supervised settings.

Several approaches have been proposed for deep unsupervised learning in the last decade. Notable examples are deep autoencoders [8], Restricted Boltzmann Machines (RBMs) [9], variational autoencoders [10] and, more recently, Generative Adversarial Networks (GANs) [11]. GANs are often used in the context of generative modeling, where they attempt to minimize a measure of discrepancy between a distribution generated by a neural network and the data distribution. Beyond generative modeling, some works have extended this framework to learn features that are invariant to different domains [12] or to noise conditions [13]. Moreover, we have recently witnessed some remarkable attempts to learn unsupervised representations by minimizing or maximizing Mutual Information (MI) [14–17]. This measure is a fundamental quantity for estimating the statistical dependence between random variables and is defined as the Kullback-Leibler (KL) divergence between the joint distribution over these random variables and the product of their marginal distributions [18]. As opposed to other metrics, such as correlation, MI can capture complex non-linear relationships between random variables [19]. MI, however, is difficult to compute directly, especially in high-dimensional spaces [20]. The aforementioned works found that it is possible to maximize or minimize the MI within a framework that closely resembles that of GANs. Additionally, [15] has further proved that it is even possible to explicitly compute it by exploiting its Donsker-Varadhan bound.

Here we attempt to learn good speaker representations by maximizing the mutual information between two encoded random chunks of speech sampled from the same sentence.
Our architecture employs both an encoder, which transforms raw speech samples into a compact feature vector, and a discriminator. The latter is alternately fed by samples from the joint distribution (i.e., two local encoded vectors randomly drawn from the same speech sentence) and from the product of marginal distributions (i.e., two local encoded vectors coming from different utterances). The discriminator is jointly trained with the encoder to maximize the separability of the two distributions. We called our approach Local Info Max (LIM) to highlight the fact that it is simply based on randomly sampled local speech chunks. Our encoder is based on SincNet [21, 22], which efficiently processes the raw input waveforms with learnable band-pass filters based on sinc functions.

The experimental results show that our approach learns useful speaker features, leading to promising results on speaker identification and verification tasks. Our experiments are conducted in both unsupervised and semi-supervised settings and compare different objective functions for the discriminator. We release the code of LIM within the PyTorch-Kaldi toolkit [23].

2. Speaker Representations based on MI

The mutual information between two random variables z_1 and z_2 is defined as follows:

MI(z_1, z_2) = \int_{z_1} \int_{z_2} p(z_1, z_2) \log\left(\frac{p(z_1, z_2)}{p(z_1)\,p(z_2)}\right) dz_1\, dz_2 = D_{KL}\big(p(z_1, z_2)\,\|\,p(z_1)\,p(z_2)\big),   (1)

where D_{KL} is the Kullback-Leibler (KL) divergence between the joint distribution p(z_1, z_2) and the product of marginals p(z_1) p(z_2). The MI is minimized when the random variables z_1 and z_2 are statistically independent (i.e., the joint distribution is equal to the product of marginals) and is maximized when the two variables contain the same information (in which case the MI is simply the entropy of either one of the variables).

Our LIM system, depicted in Fig. 1, aims to derive a compact representation z.
The encoder f_Θ, with f: R^N → R^M, is fed by N speech samples and outputs a vector composed of M real values, while the discriminator g_Φ, with g: R^{2M} → R, is fed by two speaker representations and outputs a real scalar.

[Figure 1: Architecture of the proposed system for unsupervised learning of speaker representations. The speech chunks c_1 and c_2 are sampled from the same sentence, while c_rnd is sampled from a different utterance.]

We learn the parameters Θ and Φ of the encoder and the discriminator such that we maximize the mutual information MI(z_1, z_2):

(\hat{Θ}, \hat{Φ}) = \arg\max_{Θ, Φ} MI(z_1, z_2),   (2)

where the two representations z_1 and z_2 are obtained by encoding the speech chunks c_1 and c_2 that are randomly sampled from the same sentence. Note that one reliable factor that is shared across chunks within each utterance is the speaker identity. The maximization of MI(z_1, z_2) should thus be able to properly disentangle this constant factor from the other variables (e.g., phonemes) that characterize the speech signal but are not shared across chunks of the same utterance.

As shown in Alg. 1, the maximization of MI relies on a sampling strategy that draws positive and negative samples from the joint distribution and the product of marginals, respectively. As discussed so far, the positive samples (z_1, z_2) are simply derived by randomly sampling speech chunks from the same sentence. Negative samples (z_1, z_rnd), instead, are obtained by randomly sampling from another utterance. The underlying assumptions considered here are the following: (1) two random utterances likely belong to different speakers; (2) each sentence contains a single speaker only.
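To make Eq. (1) concrete, here is a small NumPy sketch (not from the paper) that evaluates the MI of a discrete joint distribution directly from the KL definition. It exhibits the two limiting cases described above: a product-of-marginals joint gives MI = 0, while a perfectly correlated joint gives the entropy of either variable.

```python
import numpy as np

def mutual_information(p_joint):
    """MI(z1, z2) = KL(p(z1, z2) || p(z1) p(z2)) for a discrete joint distribution."""
    p1 = p_joint.sum(axis=1, keepdims=True)   # marginal p(z1)
    p2 = p_joint.sum(axis=0, keepdims=True)   # marginal p(z2)
    mask = p_joint > 0                        # convention: 0 * log(0) = 0
    return float((p_joint[mask] * np.log(p_joint[mask] / (p1 @ p2)[mask])).sum())

# Independent variables: the joint equals the product of marginals -> MI = 0.
independent = np.outer([0.5, 0.5], [0.3, 0.7])
print(mutual_information(independent))        # ~0.0

# Perfectly correlated variables: MI equals the entropy of either variable.
correlated = np.array([[0.5, 0.0], [0.0, 0.5]])
print(mutual_information(correlated))         # ~log(2) ≈ 0.693
```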
Under these assumptions, which naturally hold in most of the available speech datasets, our method can be regarded as unsupervised (or, more precisely, self-supervised), because no speaker labels are explicitly used.

A set of N_samp positive and negative examples is sampled to form a minibatch X = {X_p, X_n}. The minibatch X feeds the discriminator g_Φ, which is jointly trained with the encoder. Given z_1, the discriminator g_Φ has to decide whether its other input (z_2 or z_rnd) comes from the same sentence or from a different one (and generally a different speaker). Differently from the GAN framework, the encoder and the discriminator are not adversarial here but must cooperate to maximize the discrepancy between the joint and the product of marginal distributions.

Algorithm 1: Learning speaker representations with MI
1:  while Not Converged do
2:    for i = 1 to N_samp do
3:      Sample a chunk c_1 from a random utterance.
4:      Sample another chunk c_2 from the same utterance.
5:      Sample a chunk c_rnd from another utterance.
6:      Process the chunks with the encoder:
7:        z_1 = f_Θ(c_1), z_2 = f_Θ(c_2), z_rnd = f_Θ(c_rnd).
8:      Create positive and negative samples:
9:        X_p[i] = (z_1, z_2), X_n[i] = (z_1, z_rnd).
10:   Compute discriminator outputs: g(X_p), g(X_n).
11:   Compute loss L(Θ, Φ).
12:   Compute gradients ∂L/∂Θ, ∂L/∂Φ.
13:   Update Θ and Φ to maximize L.

In other words, we play a max-max game rather than a min-max one, which makes it easier to monitor the progress of training (compared to GAN training), simply as the average loss of the discriminator.

Different objective functions can be used for the discriminator. The simplest solution, adopted in [14], [17] and [24], consists in using the standard binary cross-entropy (BCE) loss¹:

L(Θ, Φ) = E_{X_p}[log(g(z_1, z_2))] + E_{X_n}[log(1 − g(z_1, z_rnd))],   (3)

where E_{X_p} and E_{X_n} denote the expectation over positive and negative samples, respectively.
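The sampling strategy of Alg. 1 can be sketched as follows. This is an illustrative NumPy mock-up, not the released LIM code: the corpus is random noise and `encoder` is a stand-in projection for f_Θ, whose only purpose is to show how positive and negative pairs are assembled into the minibatch X = {X_p, X_n}.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in corpus: 10 utterances, each a stack of 20 raw 200-sample chunks.
corpus = [rng.standard_normal((20, 200)) for _ in range(10)]

def encoder(chunk):
    """Stand-in for f_Theta: any map from N raw samples to an M-dim vector."""
    return np.tanh(chunk[:64])                          # placeholder, M = 64

def sample_minibatch(corpus, n_samp):
    """One pass of the sampling loop in Alg. 1 (lines 2-9)."""
    X_p, X_n = [], []
    for _ in range(n_samp):
        utt = rng.integers(len(corpus))                 # a random utterance
        other = (utt + rng.integers(1, len(corpus))) % len(corpus)  # a different one
        c1, c2 = corpus[utt][rng.integers(20)], corpus[utt][rng.integers(20)]
        c_rnd = corpus[other][rng.integers(20)]
        z1, z2, z_rnd = encoder(c1), encoder(c2), encoder(c_rnd)
        X_p.append(np.concatenate([z1, z2]))            # positive: joint distribution
        X_n.append(np.concatenate([z1, z_rnd]))         # negative: product of marginals
    return np.stack(X_p), np.stack(X_n)

X_p, X_n = sample_minibatch(corpus, n_samp=128)
print(X_p.shape, X_n.shape)                             # (128, 128) (128, 128)
```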
Such a metric estimates the Jensen-Shannon divergence between the two distributions rather than the KL divergence. Consequently, this loss does not optimize the exact KL-based definition of MI, but a similar divergence between two distributions. Differently from standard MI, this metric is bounded (i.e., its maximum is zero), making the convergence of the architecture more numerically stable.

As an alternative, it is possible to directly optimize the MI with the MINE objective [15]:

L(Θ, Φ) = E_{X_p}[g(z_1, z_2)] − log(E_{X_n}[e^{g(z_1, z_rnd)}]).   (4)

MINE explicitly computes MI by exploiting a lower bound based on the Donsker-Varadhan representation of the KL divergence. The third alternative explored in this work is the Noise Contrastive Estimation (NCE) loss proposed in [16], defined as follows:

L(Θ, Φ) = E_X[ g(z_1, z_2) − log( e^{g(z_1, z_2)} + Σ_{X_n} e^{g(z_1, z_rnd)} ) ],   (5)

where the minibatch X is composed of a single positive sample and N − 1 negative samples. In [16] it is proved that maximizing this loss maximizes a lower bound on MI. All the aforementioned objectives are based on the idea of maximizing a discrepancy between the joint and the product of marginal distributions. Nevertheless, such losses might be more or less easy to optimize within the proposed framework.

The unsupervised representations z are then used to train a speaker-id classifier in a standard supervised way. Beyond unsupervised learning, this paper explores two semi-supervised variations for learning speaker representations. The first one is based on pre-training the encoder with the unsupervised parameters and fine-tuning it together with the speaker-id classifier. As an alternative, we jointly train the encoder, discriminator, and speaker-id networks from scratch.

¹ The output layer must be based on a sigmoid when using BCE.
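The three objectives can be compared on raw discriminator scores. The NumPy sketch below (an illustration, not the released implementation) evaluates Eqs. (3)–(5) given batches of positive and negative scores; note that BCE first squashes the score with a sigmoid, as the footnote requires, while MINE and NCE operate on the unbounded scores.

```python
import numpy as np

def bce_objective(s_pos, s_neg):
    """Eq. (3): log(g) on positives plus log(1 - g) on negatives, g = sigmoid(score)."""
    g_pos = 1.0 / (1.0 + np.exp(-s_pos))
    g_neg = 1.0 / (1.0 + np.exp(-s_neg))
    return np.mean(np.log(g_pos)) + np.mean(np.log(1.0 - g_neg))

def mine_objective(s_pos, s_neg):
    """Eq. (4): Donsker-Varadhan lower bound on the MI."""
    return np.mean(s_pos) - np.log(np.mean(np.exp(s_neg)))

def nce_objective(s_pos, s_neg):
    """Eq. (5): a single positive score contrasted against the negative scores."""
    return s_pos[0] - np.log(np.exp(s_pos[0]) + np.sum(np.exp(s_neg)))

rng = np.random.default_rng(0)
good = (rng.normal(3.0, 1.0, 128), rng.normal(-3.0, 1.0, 128))  # separable scores
bad = (rng.normal(0.0, 1.0, 128), rng.normal(0.0, 1.0, 128))    # chance-level scores

# A discriminator that separates joint samples from product-of-marginals
# samples scores higher under every objective.
for objective in (bce_objective, mine_objective, nce_objective):
    assert objective(*good) > objective(*bad)
```

The bounded BCE objective peaks at zero, which is one way to see the training-stability argument made above, whereas the MINE and NCE values can grow without limit as the scores separate.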
This way, the gradient computed within the encoder depends not only on the supervised loss but also on the unsupervised objective. The latter approach turned out to be very effective, since the unsupervised gradient acts as a powerful regularizer.

Similarly to [25–29], we propose to directly process raw waveforms rather than using standard MFCC or FBANK features. The latter hand-crafted features were originally designed from perceptual evidence, and there are no guarantees that such inputs are optimal for all speech-related tasks. Standard features, in fact, smooth the speech spectrum, possibly hindering the extraction of crucial narrow-band speaker characteristics such as pitch and formants, which are important clues to the speaker identity. To better process raw audio, the encoder is based on SincNet [21, 22], a novel Convolutional Neural Network (CNN) that encourages the first layer to discover more meaningful filters. In contrast to standard CNNs, which learn all the elements of each filter, only the low and high cutoff frequencies of band-pass sinc-based filters are directly learned from data, making SincNet suitable for processing high-dimensional audio.

3. Related Work

Similarly to this work, other attempts have recently been made to learn unsupervised representations with mutual information. In [14], a GAN that minimizes MI using positive and negative samples has been proposed for Independent Component Analysis (ICA). A similar approach can be used to maximize MI. In [16], the authors proposed a method called Contrastive Predictive Coding (CPC), which learns representations by predicting the future in a latent space. It uses an autoregressive model optimized with a probabilistic contrastive loss. In [17], the authors introduced DeepInfoMax (DIM), an architecture that learns representations based on both local and high-level global information.
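The band-pass parametrization behind SincNet can be sketched as the difference of two low-pass sinc filters, so that only the two cutoff frequencies f1 and f2 are learnable. The Hamming window and the exact normalization below are illustrative assumptions, not necessarily the paper's exact recipe; the filter length of 251 samples matches the setup described later, and 16 kHz audio is assumed.

```python
import numpy as np

def sinc_bandpass(f1, f2, length=251, fs=16000):
    """Band-pass FIR filter with cutoffs f1 < f2 (Hz): the difference of two
    low-pass sinc filters, tapered here with a Hamming window."""
    n = np.arange(length) - (length - 1) / 2             # symmetric time axis
    low = 2 * (f1 / fs) * np.sinc(2 * (f1 / fs) * n)     # low-pass at f1
    high = 2 * (f2 / fs) * np.sinc(2 * (f2 / fs) * n)    # low-pass at f2
    return (high - low) * np.hamming(length)

h = sinc_bandpass(f1=300.0, f2=3400.0)                   # a telephone-band filter
print(h.shape, bool(np.allclose(h, h[::-1])))            # (251,) True — linear phase
```

In SincNet, f1 and f2 are the trainable parameters of each first-layer filter; everything else about the impulse response is fixed by this closed form, which is why the layer has so few parameters compared with a standard CNN.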
The proposed LIM differs from the aforementioned works in the following way: DIM performs a maximization of MI between local and global representations, CPC relies on future predictions, while our method is simply based on random local sampling. Note that training with local embeddings only is very efficient, since it does not require the expensive computation of a global representation as in DIM. LIM is also related to the recently proposed methods based on the triplet loss [30, 31]. Most of the previous works on the triplet loss (with the exception of [32]) rely on speaker labels [31]. Moreover, they simply maximize the Euclidean or cosine distance between speaker embeddings. LIM, instead, is based on maximizing the mutual information, thus considering a more meaningful divergence that can also capture complex non-linear relationships between the variables. Maximum Mutual Information (MMI) is often used in HMM-DNN speech recognition as a loss function [33]. This loss maximizes the MI between the acoustic probabilities and the targeted word sequence in a standard supervised framework, while LIM is used in a totally different unsupervised context that relies on local speech embeddings. Our work also uses SincNet [21, 22] (here used for the first time in an unsupervised framework) and extends the previous works by also addressing semi-supervised learning, where encoder, discriminator, and speaker-id classifier are jointly trained from scratch. Moreover, to the best of our knowledge, this paper is the first that compares several objective functions for MI optimization in a speech task.

4. Experimental Setup

The proposed method has been evaluated using different corpora. In the following, an overview of the experimental setting is provided.

4.1. Corpora

This paper considered the TIMIT (462 spks, train chunk) [34], Librispeech (2484 spks), and VoxCeleb1 (1251 spks) [35] corpora.
To make the TIMIT and Librispeech speaker recognition tasks more challenging, we only employed 12-15 seconds of randomly selected training material for each speaker. Moreover, a set of TIMIT and Librispeech experiments has also been performed in distant-talking reverberant conditions. In this case, all the clean signals were convolved with a different impulse response, sampled from the DIRHA dataset [36, 37]. The DIRHA corpus contains high-quality multi-room and multi-microphone impulse responses, measured in a domestic environment with a considerable reverberation time of T60 = 0.7 s. This way, we are able to provide experimental evidence in a much more challenging acoustic scenario, and we can introduce a channel effect that is not natively present in the clean TIMIT and Librispeech corpora. To study our approach on a more standard speaker recognition dataset, we also employed the VoxCeleb1 corpus (using the provided lists).

4.2. DNN Setup

The waveform of each speech sentence was split into chunks of 200 ms (with 10 ms overlap), which were fed into the SincNet encoder. The first layer of the encoder performs sinc-based convolutions, using 80 filters of length L = 251 samples. The architecture then employs two standard convolutional layers, both using 60 filters of length 5. Layer normalization [38] was used for both the input samples and all convolutional layers. Next, two fully-connected leaky-ReLU layers [39] composed of 2048 and 1024 neurons (normalized with batch normalization [40, 41]) were applied. Both the discriminator and the speaker-id classifier are fed by the encoder output and consist of MLPs based on a single ReLU layer. Frame-level speaker classification was obtained from the speaker-id network by applying a softmax output layer, which provides a set of posterior probabilities over the targeted speakers.
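The chunking step can be sketched as below. The 16 kHz sampling rate is an assumption (it is the standard rate for TIMIT and Librispeech), and "200 ms with 10 ms overlap" is read as a 190 ms hop between consecutive 200 ms windows.

```python
import numpy as np

def split_into_chunks(waveform, fs=16000, win_ms=200, overlap_ms=10):
    """Split a 1-D waveform into fixed-size chunks for the SincNet encoder."""
    win = int(fs * win_ms / 1000)                    # 3200 samples per 200 ms chunk
    hop = int(fs * (win_ms - overlap_ms) / 1000)     # 190 ms hop -> 10 ms overlap
    n = 1 + max(0, (len(waveform) - win) // hop)
    return np.stack([waveform[i * hop : i * hop + win] for i in range(n)])

chunks = split_into_chunks(np.zeros(16000 * 3))      # a 3-second sentence
print(chunks.shape)                                  # (15, 3200)
```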
A sentence-level classification was derived by averaging the frame predictions and voting for the speaker who maximizes the average posterior. Training used the RMSprop optimizer, with a learning rate lr = 0.001, α = 0.95, ε = 10^-7, and minibatches of size 128. All the hyperparameters of the architecture were tuned on TIMIT and then inherited for Librispeech and VoxCeleb as well.

The speaker verification system was derived from the speaker-id neural network using the d-vector technique. The d-vector [35, 42] was extracted from the last hidden layer of the speaker-id network. A speaker-dependent d-vector was computed and stored for each enrollment speaker by performing an L2 normalization and averaging all the d-vectors of the different speech chunks. The cosine distance between enrollment and test d-vectors was then calculated, and a threshold was applied to it to reject or accept the speaker. Note that, to assess our approach on a standard open-set speaker verification task, all the enrollment and test utterances were taken from a speaker pool different from that used for training the speaker-id DNN.

5. Results

This section summarizes our experimental activity on speaker identification and verification.

5.1. Speaker Identification

Tab. 1 reports the sentence-level classification error rates achieved with binary cross-entropy (BCE), MINE, Noise Contrastive Estimation (NCE), and the triplet loss used in [31].

                            TIMIT            Librispeech
                          CNN   SincNet     CNN   SincNet
Unsupervised-Trip. Loss   2.84    2.22      1.46    1.33
Unsupervised-MINE         2.15    1.36      1.43    0.94
Unsupervised-NCE          2.05    1.29      1.14    0.82
Unsupervised-BCE          1.98    1.21      1.12    0.75

Table 1: Classification Error Rate (CER%) obtained on TIMIT (462 spks) and Librispeech (2484 spks) speaker-id tasks using LIM embeddings learned with various objective functions.
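The d-vector enrollment and cosine scoring described above can be sketched as follows. The helper names, the synthetic embeddings, and the 0.5 threshold are illustrative placeholders; in practice the d-vectors come from the 1024-neuron last hidden layer of the speaker-id network, and the threshold is tuned to the desired operating point.

```python
import numpy as np

def enrollment_dvector(chunk_embeddings):
    """Speaker model: L2-normalize each chunk's d-vector, then average."""
    norms = np.linalg.norm(chunk_embeddings, axis=1, keepdims=True)
    return (chunk_embeddings / norms).mean(axis=0)

def cosine_score(enroll_dvec, test_dvec):
    """Cosine similarity between enrollment and test d-vectors."""
    return float(enroll_dvec @ test_dvec /
                 (np.linalg.norm(enroll_dvec) * np.linalg.norm(test_dvec)))

rng = np.random.default_rng(0)
speaker = rng.standard_normal(1024)                  # a speaker's "true" direction
enroll = enrollment_dvector(speaker + 0.1 * rng.standard_normal((50, 1024)))
same = speaker + 0.1 * rng.standard_normal(1024)     # test from the same speaker
other = rng.standard_normal(1024)                    # test from another speaker

threshold = 0.5                                      # placeholder operating point
print(cosine_score(enroll, same) > threshold)        # accept (same speaker)
print(cosine_score(enroll, other) > threshold)       # reject (different speaker)
```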
The table highlights that our LIM embeddings contain information on the speaker identity, leading to a CER(%) ranging from 2.84% to 1.21% in all the considered settings. It is worth noting that the mutual information losses (i.e., MINE, NCE, BCE) outperform the triplet loss. This result suggests that better embeddings can be derived with a divergence measure more meaningful than the simple cosine distance. The best performance is achieved with the standard binary cross-entropy. Similarly to [17], we have observed that this bounded metric is more stable and easier to optimize. Both the MINE and NCE objectives are unbounded, and their value can grow indefinitely during training, eventually causing numerical issues. The performance achieved with Librispeech is better than that observed for TIMIT. Even though the former is based on more speakers, its utterances are on average longer than the TIMIT ones. The table also shows that SincNet outperforms a standard CNN. This confirms the promising achievements obtained in [21, 22] in a standard supervised setting. SincNet, in fact, converges faster and to a better solution, thanks to the compact sinc filters that make learning from high-dimensional raw samples easier.

Tab. 2 extends the previous speaker-id results to other training modalities, including supervised and semi-supervised learning in both clean and reverberant acoustic conditions.

                           TIMIT           Librispeech
                        Clean   Rev      Clean   Rev
Supervised               0.85   34.8      0.80   17.1
Unsupervised-BCE         1.21   28.2      0.75   15.2
Semi-supervised-pretr.   0.69   25.4      0.56    9.6
Semi-supervised-joint    0.65   24.6      0.52    9.3

Table 2: Classification Error Rate (CER%) obtained on speaker-id with supervised, unsupervised, and semi-supervised modalities in clean and reverberant conditions.
From the table, it emerges that the results achieved when feeding the classifier with our speaker embeddings (Unsupervised-BCE) are often better than those obtained with standard supervised training (Supervised). The gap becomes more evident when we pass from unsupervised to semi-supervised learning. In particular, the joint semi-supervised framework (i.e., the approach that jointly trains encoder, discriminator, and speaker classifier from scratch) yields the best performance, surpassing that obtained when pre-training the encoder and then fine-tuning it on the supervised task (Semi-supervised-pretr.). The internal representations discovered in this way are influenced by both the supervised and the unsupervised loss. The latter acts as a powerful regularizer that allows the neural network to find robust features. The results also show a significant performance degradation in distant-talking acoustic conditions. The presence of considerable reverberation and the introduction of channel/microphone variabilities, in fact, make speaker-id particularly challenging.

5.2. Speaker Verification

We finally extend our validation to speaker verification on the VoxCeleb corpus. Table 3 compares the Equal Error Rate (EER%) achieved using our best system (Semi-supervised-pretr.) with some previous works on the same dataset.

                                        EER (%)
GMM-UBM [35]                              15.0
I-vectors + PLDA [35]                      8.8
CNN [35]                                   7.8
CNN + intra-class + triplet loss [43]      7.9
SincNet [21]                               7.2
SincNet+LIM (proposed)                     5.8

Table 3: Equal Error Rate (EER%) obtained on speaker verification (using the VoxCeleb corpus).

The proposed model reaches an EER(%) of 5.8% and outperforms other systems such as an i-vector baseline [35, 44], a standard CNN [35], and a CNN based on a combination of intra-class and triplet losses [43]. Finally, LIM outperforms a standard SincNet model trained in a fully supervised way [21]. This result confirms the effectiveness of the proposed approach even in an open-set, text-independent speaker verification setting.

6. Conclusion

This paper proposed a method for learning speaker embeddings by maximizing mutual information. The experiments have shown promising performance on speaker recognition and have highlighted better results when adopting the standard binary cross-entropy loss, which turned out to be more stable and easier to optimize than the other metrics. They also highlighted the importance of using SincNet, confirming its effectiveness when processing raw audio waveforms. The best results are obtained with end-to-end semi-supervised learning, where an ecosystem of neural networks composed of an encoder, a discriminator, and a speaker-id classifier must cooperate to derive good speaker embeddings. Our achievement can be easily combined with other recent findings in speaker recognition. For instance, it is possible to use LIM to extract semi-supervised x-vectors. We can also improve it by employing an attention mechanism that weights the contribution of each time frame, or by combining our semi-supervised costs with other losses, such as the center loss.

7. Acknowledgment

We would like to thank D. Hjelm, T. Parcollet, and M. Omologo for helpful comments. This research was enabled by support provided by Calcul Québec and Compute Canada.

8. References

[1] D. Yu and L. Deng, Automatic Speech Recognition - A Deep Learning Approach. Springer, 2015.
[2] G. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
[3] M. Ravanelli, Deep Learning for Distant Speech Recognition. PhD Thesis, Unitn, 2017.
[4] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio, "A network of deep neural networks for distant speech recognition," in Proc. of ICASSP, 2017, pp. 4880–4884.
[5] M. McLaren, Y. Lei, and L. Ferrer, "Advances in deep neural network approaches to speaker recognition," in Proc. of ICASSP, 2015, pp. 4814–4818.
[6] F. Richardson, D. Reynolds, and N. Dehak, "Deep neural network approaches to speaker and language recognition," IEEE Signal Processing Letters, vol. 22, no. 10, pp. 1671–1675, 2015.
[7] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[8] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," in Proc. of NIPS, 2007, pp. 153–160.
[9] G. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, 2006, pp. 1527–1554.
[10] D. P. Kingma and M. Welling, "Auto-Encoding Variational Bayes," CoRR, vol. abs/1312.6114, 2013.
[11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. of NIPS, 2014, pp. 2672–2680.
[12] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, "Domain-adversarial training of neural networks," J. Mach. Learn. Res., vol. 17, no. 1, pp. 2096–2030, Jan. 2016.
[13] D. Serdyuk, P. Brakel, B. Ramabhadran, S. Thomas, Y. Bengio, and K. Audhkhasi, "Invariant representations for noisy speech recognition," arXiv e-prints, vol. abs/1612.01928, 2016.
[14] P. Brakel and Y. Bengio, "Learning independent features with adversarial nets for non-linear ICA," arXiv e-prints, vol. abs/1710.05050, 2017.
[15] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and R. D. Hjelm, "Mutual information neural estimation," in Proc. of ICML, 2018, pp. 531–540.
[16] A. van den Oord, Y. Li, and O. Vinyals, "Representation learning with contrastive predictive coding," CoRR, vol. abs/1807.03748, 2018.
[17] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, A. Trischler, and Y. Bengio, "Learning deep representations by mutual information estimation and maximization," arXiv e-prints, vol. abs/1808.06670, 2018.
[18] D. Applebaum, Probability and Information: An Integrated Approach, 2nd ed. Cambridge University Press, 2008.
[19] J. B. Kinney and G. S. Atwal, "Equitability, mutual information, and the maximal information coefficient," Proceedings of the National Academy of Sciences, vol. 111, no. 9, pp. 3354–3359, 2014.
[20] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191–1253, Jun. 2003.
[21] M. Ravanelli and Y. Bengio, "Speaker recognition from raw waveform with SincNet," in Proc. of SLT, 2018.
[22] M. Ravanelli and Y. Bengio, "Interpretable convolutional filters with SincNet," in Proc. of NIPS@IRASL, 2018.
[23] M. Ravanelli, T. Parcollet, and Y. Bengio, "The PyTorch-Kaldi speech recognition toolkit," in Submitted to ICASSP, 2019.
[24] P. Velickovic, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm, "Deep graph infomax," CoRR, vol. abs/1809.10341, 2018.
[25] D. Palaz, M. Magimai-Doss, and R. Collobert, "Analysis of CNN-based speech recognition system using raw speech as input," in Proc. of Interspeech, 2015.
[26] T. N. Sainath, R. J. Weiss, A. W. Senior, K. W. Wilson, and O. Vinyals, "Learning the speech front-end with raw waveform CLDNNs," in Proc. of Interspeech, 2015.
[27] Z. Tüske, P. Golik, R. Schlüter, and H. Ney, "Acoustic modeling with deep neural networks using raw time signal for LVCSR," in Proc. of Interspeech, 2014.
[28] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller, and S. Zafeiriou, "Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network," in Proc. of ICASSP, 2016, pp. 5200–5204.
[29] H. Muckenhirn, M. Magimai-Doss, and S. Marcel, "Towards directly modeling raw speech signal for speaker verification using CNNs," in Proc. of ICASSP, 2018.
[30] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," CoRR, vol. abs/1503.03832, 2015.
[31] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, "Deep Speaker: an end-to-end neural speaker embedding system," CoRR, vol. abs/1705.02304, 2017.
[32] A. Jati and P. G. Georgiou, "Neural predictive coding using convolutional neural networks towards unsupervised learning of speaker characteristics," CoRR, vol. abs/1802.07860, 2018.
[33] L. Bahl, P. Brown, P. de Souza, and R. Mercer, "Maximum mutual information estimation of hidden Markov model parameters for speech recognition," in Proc. of ICASSP, vol. 11, 1986, pp. 49–52.
[34] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, "DARPA TIMIT Acoustic Phonetic Continuous Speech Corpus CDROM," 1993.
[35] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: a large-scale speaker identification dataset," in Proc. of Interspeech, 2017.
[36] M. Ravanelli, L. Cristoforetti, R. Gretter, M. Pellin, A. Sosi, and M. Omologo, "The DIRHA-ENGLISH corpus and related tasks for distant-speech recognition in domestic environments," in Proc. of ASRU, 2015, pp. 275–282.
[37] M. Ravanelli, A. Sosi, P. Svaizer, and M. Omologo, "Impulse response estimation for robust speech recognition in a reverberant environment," in Proc. of EUSIPCO, 2012, pp. 1668–1672.
[38] J. Ba, R. Kiros, and G. E. Hinton, "Layer normalization," CoRR, vol. abs/1607.06450, 2016.
[39] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in Proc. of ICML, 2013.
[40] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. of ICML, 2015, pp. 448–456.
[41] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio, "Batch-normalized joint training for DNN-based distant speech recognition," in Proc. of SLT, 2016.
[42] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in Proc. of ICASSP, 2014, pp. 4052–4056.
[43] N. Le and J. Odobez, "Robust and discriminative speaker embedding via intra-class distance variance regularization," in Proc. of Interspeech, 2018, pp. 2257–2261.
[44] A. K. Sarkar, D. Matrouf, P. Bousquet, and J. Bonastre, "Study of the effect of i-vector modeling on short and mismatch utterance duration for speaker verification," in Proc. of Interspeech, 2012, pp. 2662–2665.
