Unsupervised Adaptation with Interpretable Disentangled Representations for Distant Conversational Speech Recognition


Authors: Wei-Ning Hsu, Hao Tang, James Glass

Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
{wnhsu,haotang,glass}@mit.edu

Abstract

The current trend in automatic speech recognition is to leverage large amounts of labeled data to train supervised neural network models. Unfortunately, obtaining data for a wide range of domains to train robust models can be costly. However, it is relatively inexpensive to collect large amounts of unlabeled data from domains that we want the models to generalize to. In this paper, we propose a novel unsupervised adaptation method that learns to synthesize labeled data for the target domain from unlabeled in-domain data and labeled out-of-domain data. We first learn without supervision an interpretable latent representation of speech that encodes linguistic and nuisance factors (e.g., speaker and channel) using different latent variables. To transform a labeled out-of-domain utterance without altering its transcript, we transform the latent nuisance variables while maintaining the linguistic variables. To demonstrate our approach, we focus on a channel mismatch setting, where the domain of interest is distant conversational speech, and labels are only available for close-talking speech. Our proposed method is evaluated on the AMI dataset, outperforming all baselines and bridging the gap between unadapted and in-domain models by over 77% without using any parallel data.

Index Terms: unsupervised adaptation, distant speech recognition, unsupervised data augmentation, variational autoencoder

1. Introduction

Distant speech recognition has greatly improved due to the recent advances in neural network-based acoustic models, which facilitate the integration of automatic speech recognition (ASR) systems into hands-free human-machine interaction scenarios [1]. To build a robust acoustic model, previous work primarily focused on collecting labeled in-domain data for fully supervised training [2, 3, 4]. However, in practice, it is expensive and laborious to collect labeled data for all possible testing conditions. In contrast, collecting large amounts of unlabeled in-domain data and labeled out-of-domain data can be fast and economical. Hence, an important question arises for this scenario: how can we do unsupervised adaptation for acoustic models by utilizing labeled out-of-domain data and unlabeled in-domain data, in order to achieve good performance on in-domain data?

Research on unsupervised adaptation for acoustic models can be roughly divided into three categories: (1) constrained model adaptation [5, 6, 7], (2) domain-invariant feature extraction [8, 9, 10], and (3) labeled in-domain data augmentation by synthesis [11, 12, 13]. Among these approaches, data augmentation-based adaptation is favorable, because it does not require extra hyperparameter tuning for acoustic model training, and can utilize full model capacity by training a model with as much and as diverse a dataset as possible. Another benefit of this approach is that data in their original domain are more intuitive to humans; in other words, it is easier for us to inspect and manipulate the data. Furthermore, with the recent progress on domain translation [13, 14, 15], conditional synthesis of in-domain data without parallel data has become achievable, which makes data augmentation-based adaptation a more promising direction to investigate.
Variational autoencoder-based data augmentation (VAE-DA) is a domain adaptation method proposed in [13], which pools in-domain and out-of-domain data to train a VAE that learns factorized latent representations of speech segments. To disentangle linguistic factors from nuisance ones in the latent space, statistics of the latent representations for each utterance are computed. By properly altering the latent representations of the segments from a labeled out-of-domain utterance according to the computed statistics, one can synthesize an in-domain utterance without changing the linguistic content using the trained VAE decoder. This approach shows promising results on synthesizing noisy read speech from clean speech. However, it is non-trivial to apply this approach to conversational speech, because utterances tend to be shorter, which makes estimating the statistics of a disentangled representation difficult.

In this paper, we extend VAE-DA and address this issue by learning interpretable and disentangled representations using a variant of VAEs designed for sequential data, named factorized hierarchical variational autoencoders (FHVAEs) [15]. Instead of estimating the latent representation statistics on short utterances, we use a loss that considers the statistics across utterances in the entire corpus. Therefore, we can safely alter the latent part that models non-linguistic factors in order to synthesize in-domain data from out-of-domain data. Our proposed methods are evaluated on the AMI [16] dataset, which contains close-talking and distant-talking recordings in a conference room meeting scenario. We treat close-talking data as out-of-domain data and distant-talking data as in-domain data.
In addition to outperforming all baseline methods, our proposed methods successfully close the gap between an unadapted model and a fully-supervised model by more than 77% in terms of word error rate, without the presence of any parallel data.

2. Limitations of Previous Work

In this section, we briefly review VAE-based data augmentation and its limitations.

2.1. VAE-Based Data Augmentation

Generation of speech data often involves many independent factors, such as linguistic content, speaker identity, and room acoustics, that are often unobserved, or only partially observed. One can describe such a generative process using a latent variable model, where a vector z ∈ Z describing generating factors is first drawn from a prior distribution, and a speech segment x ∈ X is then drawn from a distribution conditioned on z. VAEs [17, 18] are among the most successful latent variable models, which parameterize a conditional distribution, p(x | z), with a decoder neural network, and introduce an encoder neural network, q(z | x), to approximate the true posterior, p(z | x).

In [19], a VAE is proposed to model a generative process of speech segments. A latent vector in the latent space is assumed to be a linear combination of orthogonal vectors corresponding to the independent factors, such as phonetic content and speaker identity. In other words, we assume that z = z_ℓ + z_n, where z_ℓ encodes the linguistic/phonetic content, z_n encodes the nuisance factors, and z_ℓ ⊥ z_n. To augment the data set while reusing the labels, for any pair of an utterance and its corresponding label sequence (X, y) in the data set, we generate (X̂, y) by altering the nuisance part of X in the latent space.

2.2. Estimating Latent Nuisance Vectors

A key observation made in [13] is that nuisance factors, such as speaker identity and room acoustics, are generally constant over segments within an utterance, while linguistic content changes from segment to segment. In other words, latent nuisance vectors z_n are relatively consistent within an utterance, while the distribution of z_ℓ conditioned on an utterance can be assumed to have the same distribution as the prior. Therefore, suppose the prior is a diagonal Gaussian with zero mean. Given an utterance X = {x^(n)}_{n=1..N} of N segments, we have:

  (1/N) Σ_{n=1..N} z^(n) = (1/N) Σ_{n=1..N} z_n^(n) + (1/N) Σ_{n=1..N} z_ℓ^(n)   (1)
                         ≈ z_n + E_{p(z)}[z_ℓ] = z_n + 0.                        (2)

That is to say, the latent nuisance vector would stand out, and the rest would cancel out, when we take the average of latent vectors over segments within an utterance.

This approach shows great success in transforming clean read speech into noisy read speech. However, in a conversational scenario, the portion of short utterances is much larger than in a reading scenario. For instance, in the Wall Street Journal corpus [20], a read speech corpus, the average duration on the training set is 7.6s (±2.9s), with no utterance shorter than 1s. On the other hand, in the AMI corpus [16], a distant conversational speech meeting corpus, the average duration on the training set is 2.6s (±2.7s), with over 35% of the utterances shorter than 1s. The small number of segments in a conversational scenario can lead to unreliable estimation of latent nuisance vectors, because the sampled mean of latent linguistic vectors would exhibit large variance from the population mean. The estimation under such a condition can contain information about not only nuisance factors, but also linguistic factors.
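The variance argument in Eq. (1)-(2) can be checked numerically. The following is a minimal sketch (not the paper's code), assuming latents decompose as z = z_ℓ + z_n with a standard normal prior on z_ℓ; sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

# Hypothetical latent decomposition z = z_l + z_n (Sec. 2.1):
# z_n is fixed within an utterance, z_l ~ N(0, I) per segment.
z_n = rng.normal(size=dim)

def estimate_nuisance(num_segments):
    """Estimate z_n by averaging segment latents, as in Eq. (1)-(2)."""
    z_l = rng.normal(size=(num_segments, dim))  # linguistic factors
    z = z_l + z_n                               # observed latents
    return z.mean(axis=0)

# The estimation error shrinks roughly as 1/sqrt(N): long utterances
# (many segments) give reliable estimates, short ones do not.
err_long = np.linalg.norm(estimate_nuisance(200) - z_n)
err_short = np.linalg.norm(estimate_nuisance(3) - z_n)
print(err_short > err_long)  # short utterances give far noisier estimates
```

This mirrors the paper's point: with only a few segments, the sample mean of z_ℓ is far from its zero population mean, so the estimate leaks linguistic information.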
Indeed, we illustrate in Figure 1 that modifying the estimated latent nuisance vector of a short utterance can result in undesirable changes to its linguistic content.

3. Methods

In this section, we describe the formulation of FHVAEs and explain how they can overcome the limitations of vanilla VAEs.

[Figure 1: Comparison between VAE-DA and the proposed FHVAE-DA applied to a short utterance using nuisance factor replacement. The same source and two target utterances are used for both methods. Our proposed FHVAE-DA can successfully transform only nuisance factors, while VAE-DA cannot.]

3.1. Learning Interpretable Disentangled Representations

To avoid estimating nuisance vectors on short utterances, one can leverage statistics at the corpus level, instead of at the utterance level, to disentangle generating factors. An FHVAE [15] is a variant of VAEs that models a generative process of sequential data with a hierarchical graphical model. Specifically, an FHVAE imposes sequence-independent priors and sequence-dependent priors on two sets of latent variables, z1 and z2, respectively. We now formulate the process of generating a sequence X = {x^(n)}_{n=1..N} composed of N sub-sequences:

1. An s-vector µ2 is drawn from p(µ2) = N(µ2 | 0, σ²_µ2 I).
2. N i.i.d. latent segment variables Z1 = {z1^(n)}_{n=1..N} are drawn from a global prior p(z1) = N(z1 | 0, σ²_z1 I).
3. N i.i.d. latent sequence variables Z2 = {z2^(n)}_{n=1..N} are drawn from a sequence-dependent prior p(z2 | µ2) = N(z2 | µ2, σ²_z2 I).
4. N i.i.d. sub-sequences X = {x^(n)}_{n=1..N} are drawn from p(x | z1, z2) = N(x | f_µx(z1, z2), diag(f_σ²x(z1, z2))), where f_µx(·,·) and f_σ²x(·,·) are parameterized by a decoder neural network.
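The four-step generative process above can be simulated directly. The sketch below uses a fixed linear map as a toy stand-in for the decoder network f_µx (the actual model uses an LSTM decoder); all sizes and variance values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
z_dim, x_dim, N = 32, 80, 10  # 32-dim latents as in the paper; N segments

# Toy stand-in for the decoder (assumption: a fixed linear map with
# constant output noise, instead of the paper's LSTM decoder).
W1 = rng.normal(size=(x_dim, z_dim))
W2 = rng.normal(size=(x_dim, z_dim))
sigma_mu2, sigma_z1, sigma_z2, sigma_x = 1.0, 1.0, 0.5, 0.1

# Step 1: draw the utterance-level s-vector mu2.
mu2 = rng.normal(scale=sigma_mu2, size=z_dim)
# Step 2: draw N i.i.d. latent segment variables z1 from the global prior.
Z1 = rng.normal(scale=sigma_z1, size=(N, z_dim))
# Step 3: draw N i.i.d. latent sequence variables z2 around mu2.
Z2 = mu2 + rng.normal(scale=sigma_z2, size=(N, z_dim))
# Step 4: decode each sub-sequence x(n) conditioned on (z1, z2).
X = Z1 @ W1.T + Z2 @ W2.T + rng.normal(scale=sigma_x, size=(N, x_dim))

print(X.shape)  # (10, 80): N segments of 80-dim features
```

Note how every z2^(n) is tied to the shared µ2, which is what pushes utterance-level (nuisance) factors into z2 and leaves the residual, segment-varying factors to z1.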
The joint probability for a sequence is formulated as follows:

  p(µ2) ∏_{n=1..N} p(x^(n) | z1^(n), z2^(n)) p(z1^(n)) p(z2^(n) | µ2).   (3)

With such a formulation, z2 is encouraged to capture generating factors that are relatively consistent within a sequence, and z1 will then capture the residual generating factors. Therefore, when we apply an FHVAE to model speech sequence generation, it is clear that z2 will capture the nuisance generating factors that are in general consistent within an utterance.

Since exact posterior inference is intractable, FHVAEs introduce an inference model q(Z1, Z2, µ2 | X) to approximate the true posterior, which is factorized as follows:

  q(µ2) ∏_{n=1..N} q(z1^(n) | x^(n), z2^(n)) q(z2^(n) | x^(n)),   (4)

where q(µ2), q(z1 | x, z2), and q(z2 | x) are all diagonal Gaussian distributions. Two encoder networks are introduced in FHVAEs to parameterize the mean and variance values of q(z1 | x, z2) and q(z2 | x), respectively. As for q(µ2), for testing utterances we parameterize its mean with an approximated maximum a posteriori (MAP) estimation, Σ_{n=1..N} ẑ2^(n) / (N + σ²_z2 / σ²_µ2), where ẑ2^(n) is the inferred posterior mean of q(z2^(n) | x^(n)); during training, we initialize a lookup table of the posterior mean of µ2 for each training utterance with the approximated MAP estimation, and treat the lookup table as trainable parameters. This avoids computing the MAP estimation of each segment for each mini-batch, and utilizes the discriminative loss proposed in [15] to encourage disentanglement.

3.2. FHVAE-Based Data Augmentation

With a trained FHVAE, we are able to infer disentangled latent representations that capture linguistic factors z1 and nuisance factors z2. To transform the nuisance factors of an utterance X without changing the corresponding transcript, one only needs to perturb Z2.
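The approximated MAP estimation of µ2 described above amounts to a prior-shrunk average of the inferred z2 posterior means. A minimal sketch (the variance values are assumptions, not the paper's settings):

```python
import numpy as np

def map_estimate_mu2(z2_hat, var_z2=0.25, var_mu2=1.0):
    """Approximate MAP estimate of the s-vector mu2 (Sec. 3.1):
    sum_n z2_hat(n) / (N + sigma_z2^2 / sigma_mu2^2).
    z2_hat: (N, dim) posterior means of q(z2 | x(n)).
    var_z2, var_mu2: prior variances (illustrative assumptions here).
    """
    N = z2_hat.shape[0]
    return z2_hat.sum(axis=0) / (N + var_z2 / var_mu2)

# With many segments the estimate approaches the plain average of z2_hat;
# with few segments the zero-mean prior shrinks it toward zero.
z2_hat = np.ones((4, 3))
print(map_estimate_mu2(z2_hat))  # each entry 4 / (4 + 0.25) ≈ 0.9412
```

The shrinkage term σ²_z2/σ²_µ2 is precisely what makes this estimator better behaved on short utterances than the plain segment average of Eq. (1)-(2).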
Furthermore, since each z2 within an utterance is generated conditioned on a Gaussian whose mean is µ2, we can regard µ2 as the representation of the nuisance factors of an utterance. We now derive two data augmentation methods similar to those proposed in [13], named nuisance factor replacement and nuisance factor perturbation.

3.2.1. Nuisance Factor Replacement

Given a labeled out-of-domain utterance (X_out, y_out) and an unlabeled in-domain utterance X_in, we want to transform X_out to X̂_out such that it exhibits the same nuisance factors as X_in, while maintaining the original linguistic content. We can then add the synthesized labeled in-domain data (X̂_out, y_out) to the ASR training set. From the generative modeling perspective, this implies that the z2 of X_in and X̂_out are drawn from the same distribution. We carry out the same modification for the latent sequence variable of each segment of X_out as follows: ẑ2,out = z2,out − µ2,out + µ2,in, where µ2,out and µ2,in are the approximate MAP estimations of µ2.

3.2.2. Nuisance Factor Perturbation

Alternatively, we are also interested in synthesizing an utterance conditioned on unseen nuisance factors, for example, the interpolation of nuisance factors between two utterances. We propose to draw a random perturbation vector p and compute ẑ2,out = z2,out + p for each segment in an utterance, in order to synthesize an utterance with perturbed nuisance factors. Naively, we may want to sample p from a centered isotropic Gaussian. However, in practice, VAE-type models suffer from an over-pruning issue [21], in which some latent variables become inactive; we do not want to perturb these. Instead, we only want to perturb the linear subspace which models the variation of nuisance factors between utterances. Therefore, we adopt a soft perturbation scheme similar to that in [13].
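The two transformations of Sec. 3.2 can be sketched as follows. This is a hypothetical numpy sketch, not the paper's implementation; in particular, treating the σ_d of the PCA step as the standard deviations of the principal components is our assumption:

```python
import numpy as np

def replace_nuisance(Z2_out, mu2_out, mu2_in):
    """Nuisance factor replacement (Sec. 3.2.1):
    z2_hat = z2_out - mu2_out + mu2_in, applied to every segment."""
    return Z2_out - mu2_out + mu2_in

def perturb_nuisance(Z2_out, mu2_all, gamma=1.0, rng=None):
    """Nuisance factor perturbation (Sec. 3.2.2): draw
    p = gamma * sum_d psi_d * sigma_d * e_d, psi_d ~ N(0, 1),
    using PCA over the corpus-level s-vectors mu2_all (M, dim),
    then shift every segment's z2 by p."""
    if rng is None:
        rng = np.random.default_rng()
    centered = mu2_all - mu2_all.mean(axis=0)
    # PCA via SVD: rows of Vt are the principal directions e_d.
    _, s, Vt = np.linalg.svd(centered, full_matrices=False)
    sigma = s / np.sqrt(len(mu2_all))  # singular values -> component stds
    psi = rng.normal(size=sigma.shape)
    p = gamma * (psi * sigma) @ Vt     # one perturbation per utterance
    return Z2_out + p
```

Because sigma is near zero along over-pruned (inactive) latent dimensions, the perturbation automatically stays inside the subspace that actually models nuisance variation between utterances.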
First, {µ2^(i)}_{i=1..M} for all M utterances are estimated with the approximated MAP. Principal component analysis is performed to obtain D pairs of eigenvalues σ_d and eigenvectors e_d, where D is the dimension of µ2. Lastly, one random perturbation vector p is drawn for each utterance to perturb, as follows:

  p = γ Σ_{d=1..D} ψ_d σ_d e_d,   ψ_d ∼ N(0, 1),   (5)

where γ is used to control the perturbation scale.

4. Experimental Setup

We evaluate our proposed method on the AMI meeting corpus [16]. The AMI corpus consists of 100 hours of meeting recordings in English, recorded in three different meeting rooms with different acoustic properties, and with three to five participants for each meeting, who are mostly non-native speakers. Multiple microphones are used for recording, including individual headset microphones (IHM) and far-field microphone arrays. In this paper, we regard IHM recordings as out-of-domain data, whose transcripts are available, and single distant microphone (SDM) recordings as in-domain data, whose transcripts are not available, but on which we will evaluate our model. The recommended partition of the corpus is used, which contains an 80-hour training set, and 9 hours each for a development set and a test set. FHVAE and VAE models are trained using both the IHM and SDM training sets, which do not require transcripts. ASR acoustic models are trained using augmented data and transcripts based on only the IHM training set. The performance of all ASR systems is evaluated on the SDM development set. The NIST asclite tool [22] is used for scoring.

4.1. VAE and FHVAE Configurations

Speech segments of 20 frames, represented with 80-dimensional log Mel filterbank coefficients (FBank), are used as inputs. We configure the VAE and FHVAE models such that they have comparable modeling capacity. The VAE latent variable dimension is 64, whereas the dimensions of z1 and z2 in FHVAEs are both 32.
Both models have a two-layer LSTM decoder with 256 memory cells that predicts one frame of x at a time. Since an FHVAE model has two encoders, while a VAE model has only one, we use a two-layer LSTM encoder with 256 memory cells for the former, and with 512 memory cells for the latter. All the LSTM encoders take one frame as input at each step, and the output from the last step is passed to an affine transformation layer that predicts the mean and the log variance of the latent variables. The VAE model is trained to maximize the variational lower bound, and the FHVAE model is trained to maximize the discriminative segment variational lower bound proposed in [15] with a discriminative weight α = 10. In addition, the original FHVAE training [15] is not scalable to hundreds of thousands of utterances; we therefore use the hierarchical sampling-based training algorithm proposed in [23] with batches of 5,000 utterances. Adam [24] with β1 = 0.95 and β2 = 0.999 is used to optimize all models. TensorFlow [25] is used for implementation.

4.2. ASR Configuration

Kaldi [26] is used for feature extraction, forced alignment, decoding, and training of the initial HMM-GMM models on the IHM training set. The Microsoft Cognitive Toolkit [27] is used for neural network acoustic model training. For all experiments, the same 3-layer LSTM acoustic model [28] with the architecture proposed in [2] is adopted, which has 1024 memory cells and a 512-node linear projection layer for each LSTM layer. Following the setup in [29], LSTM acoustic models are trained with cross-entropy loss, truncated back-propagation through time [30], and mini-batches of 40 parallel utterances and 20 frames. A momentum of 0.9 is used starting from the second epoch [2]. Ten percent of the training data is held out for validation, and the learning rate is halved if no improvement is observed on the validation set after an epoch.

Table 1: Baseline WERs for the AMI IHM/SDM task.
                                       WER (%)
ASR Training Set                  SDM-dev         IHM-dev
IHM                               70.8            27.0
SDM                               46.8 (-24.0)    42.5 (+15.5)
IHM, FHVAE-DI, (z1) [10]          64.8 (-6.0)     29.0 (+2.0)
IHM, VAE-DA, (repl) [13]          62.2 (-8.0)     31.8 (+4.8)
IHM, VAE-DA, (p, γ = 1.0) [13]    61.1 (-9.7)     30.0 (+3.0)
IHM, VAE-DA, (p, γ = 1.5) [13]    61.9 (-8.9)     31.4 (+4.4)

Table 2: WERs of the proposed and the alternative methods.

                                       WER (%)
ASR Training Set                    SDM-dev         IHM-dev
IHM                                 70.8            27.0
IHM, FHVAE-DA, (repl)               59.0 (-11.8)    31.3 (+4.3)
IHM, FHVAE-DA, (p, γ = 1.0)         58.6 (-12.2)    30.1 (+3.1)
IHM, FHVAE-DA, (p, γ = 1.5)         58.7 (-12.1)    31.4 (+4.4)
IHM, FHVAE-DA, (rev-p, γ = 1.0)     70.9 (+0.1)     30.2 (+3.2)
IHM, FHVAE-DA, (uni-p, γ = 1.0)     66.6 (-4.2)     30.9 (+3.9)

5. Results and Discussion

We first establish baseline results and report the SDM (in-domain) and IHM (out-of-domain) development set word error rates (WERs) in Table 1. To avoid constantly querying the test set, we only report WERs on the development set. Unless otherwise mentioned, the data augmentation-based systems are evaluated on reconstructed features, and trained on a transformed IHM set, where each utterance is transformed only once, without the original copy of the data.

The first two rows of results show that the WER gap between the unadapted model and the model trained on in-domain data is 24%. The third row reports the results of training with the domain-invariant feature, z1, extracted with an FHVAE as is done in [10]. It improves over the baseline by 6% absolute. VAE-DA [13] results with nuisance factor replacement (repl) and latent nuisance perturbation (p) are shown in the last three rows.

We then examine the effectiveness of our proposed method and show the results in the second, third, and fourth rows of Table 2. We observe about a 12% WER reduction on the in-domain development set for both nuisance factor perturbation (p) and nuisance factor replacement (repl), with little degradation on the out-of-domain development set.
Both augmentation methods outperform their VAE counterparts and the domain-invariant feature baseline using the same FHVAE model. We attribute the improvement to the better quality of the transformed IHM data, which covers the nuisance factors of the SDM data without altering the original linguistic content.

To verify the superiority of the proposed method of drawing random perturbation vectors, we compare two alternative sampling methods, rev-p and uni-p, similar to [13], with the same expected squared Euclidean norm as the proposed method. The rev-p method replaces σ_d in Eq. 5 with σ_{D−d}, where [σ_1, ..., σ_D] is sorted, while uni-p replaces it with sqrt(Σ_{d=1..D} σ_d² / D). Results shown in the last two rows of Table 2 confirm that the proposed sampling method is more effective than the alternative methods under the same perturbation scale γ = 1.0, as expected.

Due to imperfect reconstruction by FHVAE models, some linguistic information may be lost in this process. Furthermore, since VAE models tend to have overly-smoothed outputs, one can easily tell an original utterance from a reconstructed one. In other words, there is another layer of domain mismatch between original data and reconstructed data. In Table 3, we investigate the performance of models trained with different data on both original data and reconstructed data.

Table 3: WERs on reconstructed data and original data.

                      SDM-dev WER (%)      IHM-dev WER (%)
ASR Training Set      recon.    ori.       recon.    ori.
reconstruction        73.8      79.5       30.1      32.1
repl                  59.0      71.4       31.3      34.4
  +ori. IHM           59.4      61.4       30.5      26.2
p, γ = 1.0            58.6      71.8       30.1      31.5
  +ori. IHM           58.0      66.2       29.0      25.9

Table 4: Models trained on disjoint partitions of the IHM/SDM data.

                                         WER (%)
ASR Training Set                    SDM-dev         IHM-dev
IHM-a                               86.5            31.8
SDM-b                               55.4 (-31.1)    51.0 (+19.2)
IHM-a, FHVAE-DA, (pert, γ = 1.0)    62.4 (-24.1)    33.4 (+1.6)
The first row, a model trained on the reconstructed IHM data, serves as the baseline, from which we observe a 3.0%/3.1% WER increase on SDM/IHM when tested on the reconstructed data, and a further 5.7%/2.0% WER increase when tested on the original data.

Compared to the reconstruction baseline, the proposed perturbation and replacement methods both show about a 15% improvement on the reconstructed SDM data, and 8% on the original SDM data. Results on the reconstructed or original IHM data are comparable to the baseline. The performance difference between the original and reconstructed SDM data shows that FHVAEs are able to transform the IHM acoustic features closer to the reconstructed SDM data. We then explore adding the original IHM training data to the two transformed sets (+ori. IHM). This significantly improves the performance on the original data for both the SDM and IHM data sets. We even see an improvement from 27.0% to 25.9% on the IHM development set compared to the model trained on the original IHM data.

Finally, to demonstrate that FHVAEs are not exploiting the parallel connection between the IHM and SDM data sets, we create two disjoint sets of recordings of roughly the same size, such that IHM-a and SDM-b each contain only one set of recordings. Results are shown in Table 4, where the FHVAE model is trained without any parallel utterances. In this setting, we observe an even more significant 24.1% absolute WER improvement over the baseline IHM-a model, which bridges the gap to the fully supervised model by over 77%.

6. Conclusions and Future Work

In this paper, we marry the VAE-based data augmentation method with interpretable disentangled representations learned from FHVAE models for transforming data from one domain to another.
The proposed method outperforms both baselines, and demonstrates the ability to reduce the gap between an unadapted model and a fully supervised model by over 77% without the presence of any parallel data. For future work, we plan to investigate unsupervised data augmentation techniques for a wider range of tasks. In addition, data augmentation is inherently inefficient, because the training time grows linearly in the amount of data we have. We plan to explore model-space unsupervised adaptation to combat this limitation.

7. References

[1] B. Li, T. Sainath, A. Narayanan, J. Caroselli, M. Bacchiani, A. Misra, I. Shafran, H. Sak, G. Pundak, K. Chin et al., "Acoustic modeling for Google Home," in Interspeech, 2017.
[2] Y. Zhang, G. Chen, D. Yu, K. Yao, S. Khudanpur, and J. Glass, "Highway long short-term memory RNNs for distant speech recognition," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.
[3] W.-N. Hsu, Y. Zhang, and J. Glass, "A prioritized grid long short-term memory RNN for speech recognition," in IEEE Workshop on Spoken Language Technology (SLT), 2016.
[4] J. Kim, M. El-Khamy, and J. Lee, "Residual LSTM: Design of a deep recurrent architecture for distant speech recognition," arXiv:1701.03360, 2017.
[5] M. J. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech & Language, vol. 12, no. 2, 1998.
[6] P. Swietojanski and S. Renals, "Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models," in IEEE Workshop on Spoken Language Technology (SLT), 2014.
[7] P. Swietojanski, J. Li, and S. Renals, "Learning hidden unit contributions for unsupervised acoustic model adaptation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 24, no. 8, 2016.
[8] S. Sun, B. Zhang, L. Xie, and Y. Zhang, "An unsupervised deep domain adaptation approach for robust speech recognition," Neurocomputing, 2017.
[9] Z. Meng, Z. Chen, V. Mazalov, J. Li, and Y. Gong, "Unsupervised adaptation with domain separation networks for robust speech recognition," 2017.
[10] W.-N. Hsu and J. Glass, "Extracting domain invariant features by unsupervised learning for robust automatic speech recognition," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
[11] N. Jaitly and G. E. Hinton, "Vocal tract length perturbation (VTLP) improves speech recognition," in ICML Workshop on Deep Learning for Audio, Speech, and Language Processing, 2013.
[12] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, "Audio augmentation for speech recognition," in Interspeech, 2015.
[13] W.-N. Hsu, Y. Zhang, and J. Glass, "Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation," in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2017.
[14] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," arXiv:1703.10593, 2017.
[15] W.-N. Hsu, Y. Zhang, and J. Glass, "Unsupervised learning of disentangled and interpretable representations from sequential data," in Advances in Neural Information Processing Systems (NIPS), 2017.
[16] J. Carletta, "Unleashing the killer corpus: experiences in creating the multi-everything AMI meeting corpus," Language Resources and Evaluation, vol. 41, no. 2, 2007.
[17] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv:1312.6114, 2013.
[18] D. J. Rezende, S. Mohamed, and D. Wierstra, "Stochastic backpropagation and approximate inference in deep generative models," 2014.
[19] W.-N. Hsu, Y. Zhang, and J. Glass, "Learning latent representations for speech generation and transformation," in Interspeech, 2017.
[20] J. Garofalo, D. Graff, D. Paul, and D. Pallett, "CSR-I (WSJ0) complete," Linguistic Data Consortium, 2007.
[21] S. Yeung, A. Kannan, Y. Dauphin, and L. Fei-Fei, "Tackling over-pruning in variational autoencoders," 2017.
[22] J. G. Fiscus, J. Ajot, N. Radde, and C. Laprun, "Multiple dimension Levenshtein edit distance calculations for evaluating automatic speech recognition systems during simultaneous speech," in International Conference on Language Resources and Evaluation (LREC), 2006.
[23] W.-N. Hsu and J. Glass, "Scalable factorized hierarchical variational autoencoder training," arXiv preprint, 2018.
[24] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014.
[25] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "TensorFlow: A system for large-scale machine learning," in OSDI, vol. 16, 2016.
[26] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011.
[27] D. Yu, A. Eversole, M. Seltzer, K. Yao, Z. Huang, B. Guenter, O. Kuchaiev, Y. Zhang, F. Seide, H. Wang et al., "An introduction to computational networks and the computational network toolkit," Microsoft Research, Tech. Rep., 2014.
[28] H. Sak, A. W. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Interspeech, 2014.
[29] W.-N. Hsu, Y. Zhang, A. Lee, and J. R. Glass, "Exploiting depth and highway connections in convolutional recurrent deep neural networks for speech recognition," in Interspeech, 2016.
[30] R. J. Williams and J. Peng, "An efficient gradient-based algorithm for on-line training of recurrent network trajectories," Neural Computation, vol. 2, no. 4, 1990.
