Extracting Domain Invariant Features by Unsupervised Learning for Robust Automatic Speech Recognition
Authors: Wei-Ning Hsu, James Glass
MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139, USA
{wnhsu,glass}@mit.edu

ABSTRACT

The performance of automatic speech recognition (ASR) systems can be significantly compromised by previously unseen conditions, which is typically due to a mismatch between training and testing distributions. In this paper, we address robustness by studying domain invariant features, such that domain information becomes transparent to ASR systems, resolving the mismatch problem. Specifically, we investigate a recent model, called the Factorized Hierarchical Variational Autoencoder (FHVAE). FHVAEs learn to factorize sequence-level and segment-level attributes into different latent variables without supervision. We argue that the set of latent variables that contains segment-level information is our desired domain invariant feature for ASR. Experiments are conducted on Aurora-4 and CHiME-4, which demonstrate 41% and 27% absolute word error rate reductions respectively on mismatched domains.

Index Terms — robust speech recognition, factorized hierarchical variational autoencoder, domain invariant representations

1. INTRODUCTION

Recently, neural network-based acoustic models [1, 2, 3] have greatly improved the performance of automatic speech recognition (ASR) systems. Unfortunately, it is well known (e.g., [4]) that ASR performance can degrade significantly when testing in a domain that is mismatched from training. A major reason is that speech data have complex distributions and contain information about not only linguistic content, but also speaker identity, background noise, room characteristics, etc.
Among these sources of variability, only a subset is relevant to ASR, while the rest can be considered nuisances that hurt performance when their distributions are mismatched between training and testing. To alleviate this issue, some robust ASR research focuses on mapping out-of-domain data to in-domain data using enhancement-based methods [5, 6, 7], which generally require parallel data from both domains. Another popular strategy is to train an ASR system with as large and as diverse a dataset as possible [8, 9]; however, this strategy is not feasible when labeled data are not available for all domains. Alternatively, robustness can also be achieved by training on features that are domain invariant [10, 11, 12, 13, 14]. In this case, domain mismatch is no longer an issue, because domain information becomes transparent to the ASR system.

In this paper, we consider the same highly adverse scenario as in [4], where both clean and noisy speech are available, but transcripts are only available for clean speech. We study the use of a recently proposed model, the Factorized Hierarchical Variational Autoencoder (FHVAE) [15], for learning domain invariant ASR features without supervision. FHVAE models learn to factorize sequence-level attributes and segment-level attributes into different latent variables. By training an ASR system on the latent variables that encode segment-level attributes, and testing the ASR system in mismatched domains, we demonstrate that these latent variables contain linguistic information and are more domain invariant. Comprehensive experiments study the effect of different FHVAE architectures, training strategies, and the use of derived domain features on the robustness of ASR systems.
Our proposed method is evaluated on the Aurora-4 [16] and CHiME-4 [17] datasets, which contain artificially corrupted noisy speech and real noisy speech respectively. The proposed FHVAE-based feature reduces the absolute word error rate (WER) by 27% to 41% compared to filter bank features, and by 14% to 16% compared to variational autoencoder-based features. We have released the code of the FHVAEs described in this paper at https://github.com/wnhsu/FactorizedHierarchicalVAE.

The rest of the paper is organized as follows. In Section 2, we introduce the FHVAE model and a method to extract domain invariant features. Section 3 describes the experimental setup, while Section 4 presents results and discussion. We conclude our work in Section 5.

2. LEARNING DOMAIN INVARIANT FEATURES

2.1. Modeling a Generative Process of Speech Segments

As mentioned above, generation of speech data often involves many independent factors, which are unobserved in the unsupervised setting. It is therefore natural to describe such a generative process using a latent variable model, where a latent variable z is first sampled from a prior distribution, and a speech segment x is then sampled from a distribution conditioned on z. In [18], a convolutional variational autoencoder (VAE) is proposed to model such a process; by assuming the prior to be a diagonal Gaussian, it is shown that the VAE automatically learns to model independent generative attributes, such as speaker identity and linguistic content, using orthogonal latent subspaces. This result provides a mechanism for potentially learning domain invariant features for ASR by discovering latent variables that do not contain domain information.

2.2. Extracting Domain Invariant Features from FHVAEs

The generation of sequential data often involves multiple independent factors operating at different scales.
For instance, speaker identity affects the fundamental frequency (F0) at the utterance level, while phonetic content affects spectral characteristics at the segment level. As a result, sequence-level attributes, such as F0 and volume, tend to have a smaller amount of variation within an utterance than between utterances, while other attributes, such as spectral contours, tend to have similar amounts of variation within and between utterances.

Based on this observation, FHVAEs [15] formulate the generative process of sequential data with a factorized hierarchical graphical model that imposes sequence-dependent priors and sequence-independent priors on different sets of latent variables. Specifically, given a dataset D = {X^(i)}_{i=1}^M consisting of M i.i.d. sequences, where X^(i) = {x^(i,n)}_{n=1}^{N^(i)} is a sequence of N^(i) segments (sub-sequences), a sequence X of N segments is assumed to be generated from a random process involving latent variables Z_1 = {z_1^(n)}_{n=1}^N, Z_2 = {z_2^(n)}_{n=1}^N, and µ_2, as follows: (1) an s-vector µ_2 is drawn from a prior distribution p_θ(µ_2) = N(µ_2 | 0, σ²_µ2 I); (2) N i.i.d. latent segment variables {z_1^(n)}_{n=1}^N and latent sequence variables {z_2^(n)}_{n=1}^N are drawn from a sequence-independent prior p_θ(z_1) = N(z_1 | 0, σ²_z1 I) and a sequence-dependent prior p_θ(z_2 | µ_2) = N(z_2 | µ_2, σ²_z2 I) respectively; (3) N i.i.d. speech segments {x^(n)}_{n=1}^N are drawn from a conditional distribution p_θ(x | z_1, z_2) = N(x | f_µx(z_1, z_2), diag(f_σ²x(z_1, z_2))), whose mean and diagonal variance are parameterized by neural networks. The joint probability for a sequence is formulated in Eq. (1):

    p_θ(µ_2) ∏_{n=1}^N p_θ(x^(n) | z_1^(n), z_2^(n)) p_θ(z_1^(n)) p_θ(z_2^(n) | µ_2).   (1)

Based on this formulation, µ_2 can be regarded as a summarization of the sequence-level attributes of a sequence, and z_2 is encouraged to encode sequence-level attributes of a segment that are similar within an utterance. Consequently, z_1 encodes the residual segment-level attributes of a segment, such that z_1 and z_2 together provide sufficient information for generating a segment.

Since exact posterior inference is intractable, FHVAEs introduce an inference model q_φ(Z_1^(i), Z_2^(i), µ_2^(i) | X^(i)), formulated in Eq. (2), that approximates the true posterior p_θ(Z_1^(i), Z_2^(i), µ_2^(i) | X^(i)):

    q_φ(µ_2^(i)) ∏_{n=1}^{N^(i)} q_φ(z_1^(i,n) | x^(i,n), z_2^(i,n)) q_φ(z_2^(i,n) | x^(i,n)),   (2)

from which we observe that inference of z_1^(i,n) and z_2^(i,n) depends only on the corresponding segment x^(i,n). In particular, the posteriors q_φ(z_1 | x, z_2) = N(z_1 | g_µz1(x, z_2), diag(g_σ²z1(x, z_2))) and q_φ(z_2 | x) = N(z_2 | g_µz2(x), diag(g_σ²z2(x))) are approximated with diagonal Gaussian distributions whose mean and diagonal variance are also parameterized by neural networks. On the other hand, q_φ(µ_2^(i)) is modeled as an isotropic Gaussian, N(µ_2^(i) | g_µ2(i), σ²_µ̃2 I), where g_µ2(i) is a trainable lookup table of the posterior mean of µ_2 for each training sequence. Estimation of µ_2 for testing sequences can be found in [15].

As pointed out in [4], nuisance attributes for ASR, such as speaker identity, room geometry, and background noise, are generally consistent within an utterance. If we treat each utterance as a sequence, these attributes then become sequence-level attributes, which would be encoded by z_2 and µ_2. As a result, z_1 encodes the residual linguistic information and is invariant to these nuisance attributes, which is our desired domain invariant ASR feature.
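The three-step generative process above can be sketched in plain NumPy. The dimensionalities, prior variances, and the linear stand-in for the neural decoder below are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions and variances (assumptions, not the paper's settings).
D_Z1, D_Z2, D_X = 32, 32, 80    # latent segment, latent sequence, observed frame dims
N = 5                           # number of segments in the sequence
sigma_mu2, sigma_z1, sigma_z2 = 1.0, 1.0, 0.5

# Stand-in for the neural decoder f_mu_x(z1, z2); a real FHVAE uses a network.
W1 = rng.standard_normal((D_Z1, D_X)) * 0.1
W2 = rng.standard_normal((D_Z2, D_X)) * 0.1

def decoder_mean(z1, z2):
    # Linear toy decoder combining segment and sequence latent variables.
    return z1 @ W1 + z2 @ W2

# (1) Draw the s-vector mu_2 once per sequence.
mu2 = rng.normal(0.0, sigma_mu2, size=D_Z2)

segments = []
for n in range(N):
    # (2) Sequence-independent prior for z1, sequence-dependent prior for z2.
    z1 = rng.normal(0.0, sigma_z1, size=D_Z1)
    z2 = rng.normal(mu2, sigma_z2)
    # (3) Draw the segment x from the conditional p(x | z1, z2).
    x = rng.normal(decoder_mean(z1, z2), 0.1)
    segments.append(x)

X = np.stack(segments)   # one generated sequence of N segments
print(X.shape)           # (5, 80)
```

Note how every z2 is drawn around the shared µ_2, so sequence-level attributes vary little within the sequence, while each z1 is drawn independently of it.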
2.3. Training FHVAEs and Preventing S-Vector Collapse

As with other generative models, FHVAEs aim to maximize the marginal likelihood of the observed dataset; due to the intractability of the exact posterior, FHVAEs optimize the segment variational lower bound, L(θ, φ; x^(i,n)), formulated as follows:

    L(θ, φ; x^(i,n)) =
        E_{q_φ(z_1^(i,n), z_2^(i,n) | x^(i,n))} [ log p_θ(x^(i,n) | z_1^(i,n), z_2^(i,n)) ]
      − E_{q_φ(z_2^(i,n) | x^(i,n))} [ D_KL( q_φ(z_1^(i,n) | x^(i,n), z_2^(i,n)) || p_θ(z_1^(i,n)) ) ]
      − D_KL( q_φ(z_2^(i,n) | x^(i,n)) || p_θ(z_2^(i,n) | g_µ2(i)) )
      + (1 / N^(i)) log p_θ(g_µ2(i)).

Notice that if µ_2 were the same for all utterances, an FHVAE would degenerate to a vanilla VAE. To prevent µ_2 from collapsing, we can add an additional discriminative objective, log p(i | z_2^(i,n)), that encourages the discriminability of z_2 with respect to which utterance the segment is drawn from. Specifically, we define it as log p_θ(z_2^(i,n) | g_µ2(i)) − log Σ_{j=1}^M p_θ(z_2^(i,n) | g_µ2(j)). By combining the two objectives with a weighting parameter α, we obtain the discriminative segment variational lower bound:

    L_dis(θ, φ; x^(i,n)) = L(θ, φ; x^(i,n)) + α log p(i | z_2^(i,n)).   (3)

3. EXPERIMENT SETUP

To evaluate the effectiveness of the proposed method for extracting domain invariant features, we consider domain mismatched ASR scenarios. Specifically, we train an ASR system on a clean set, and test the system on both a clean and a noisy set. The idea is that one would observe a smaller performance discrepancy between different domains if the feature representation is more domain invariant. We next introduce the datasets, as well as the model architectures and training configurations for the experiments.

3.1. Dataset

We use Aurora-4 [16] as the primary dataset for our experiments.
Aurora-4 is a broadband corpus designed for noisy speech recognition tasks, based on the Wall Street Journal (WSJ0) corpus [19]. Two microphone types, clean/channel, are included, and six noise types are artificially added to both microphone types, which results in four conditions: clean (A), channel (B), noisy (C), and channel+noisy (D). We use the multi-condition development set for training the VAE and FHVAE models, because the development set contains both noise labels and speaker labels for each utterance, which are used in Exp. Index 5, while the training set only contains speaker labels. The ASR system is trained on the clean train_si84_clean set and evaluated on the multi-condition test_eval92 set.

To verify our proposed method on a non-artificial dataset, we repeat our experiments on the CHiME-4 [17] dataset, which contains real distant-talking recordings in noisy environments. We use the original 7,138 clean utterances and the 1,600 single-channel real noisy utterances in the training partition to train the VAE and FHVAE models. The ASR system is trained on the original clean training set and evaluated on the CHiME-4 development set.

3.2. VAE/FHVAE Setup and Training

The VAE is trained with stochastic gradient descent using a mini-batch size of 128 without clipping to minimize the negative variational lower bound plus an L2-regularization with weight 10^-4.

Exp.  Feature    #Layers  #Units    α    Seq. Label | Avg.  | A     B      C      D
1     FBank      -        -         -    -          | 65.64 | 3.21  61.61  51.78  82.39
      z          1/1      256/256   -    -          | 44.79 | 4.22  38.16  36.11  59.63
      z          1/1      512/256   -    -          | 40.31 | 4.35  33.83  34.43  53.77
      z1         1/1      256/256   10   uttid      | 26.58 | 4.54  19.28  20.85  38.50
2     z1         1/1      256/256   10   uttid      | 26.58 | 4.54  19.28  20.85  38.50
      z1         2/2      256/256   10   uttid      | 25.54 | 4.11  16.90  20.62  38.58
      z1         3/3      256/256   10   uttid      | 24.30 | 4.91  15.44  22.83  36.63
3     z1         1/1      128/128   10   uttid      | 34.66 | 5.06  26.70  25.39  49.09
      z1         1/1      256/256   10   uttid      | 26.58 | 4.54  19.28  20.85  38.50
      z1         1/1      512/512   10   uttid      | 26.97 | 5.32  18.18  23.13  40.01
4     z1         1/1      256/256   0    uttid      | 33.30 | 4.86  25.67  25.46  46.97
      z1         1/1      256/256   5    uttid      | 30.55 | 4.63  22.66  23.33  43.96
      z1         1/1      256/256   10   uttid      | 26.58 | 4.54  19.28  20.85  38.50
      z1         1/1      256/256   15   uttid      | 29.92 | 5.01  20.82  24.79  44.03
      z1         1/1      256/256   20   uttid      | 32.64 | 5.57  25.48  24.53  45.66
5     z1         1/1      256/256   10   uttid      | 26.58 | 4.54  19.28  20.85  38.50
      z1         1/1      256/256   10   noise      | 32.27 | 4.33  23.89  28.96  45.86
      z1         1/1      256/256   10   speaker    | 34.95 | 4.39  27.27  32.22  48.20
6     z1         1/1      256/256   10   uttid      | 26.58 | 4.54  19.28  20.85  38.50
      z1 + µ2    1/1      256/256   10   uttid      | 43.61 | 5.08  42.47  27.55  53.85

Table 1. Aurora-4 test_eval92 set word error rate (%) of acoustic models trained on different features.

The Adam [20] optimizer is used with β1 = 0.95, β2 = 0.999, ε = 10^-8, and an initial learning rate of 10^-3. Training is terminated if the lower bound on the development set does not improve for 50 epochs. The FHVAE is trained with the same configuration and optimization method, except that the loss function is replaced with the negative discriminative segment variational lower bound.

Seq2Seq-VAE [4] and Seq2Seq-FHVAE [15] architectures with LSTM units are used for all experiments. We let the latent space of the VAEs contain 64 dimensions. Since the FHVAE models have two latent spaces, we let each of them be 32 dimensional. Other hyperparameters are explored in our experiments.
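For concreteness, the discriminative term log p(i | z_2) of the training objective in Eq. (3) can be computed from the lookup table of posterior means g_µ2. The toy dimensions and the placeholder value for the non-discriminative bound L below are assumptions for illustration:

```python
import numpy as np

def log_gauss(z, mu, sigma2):
    # Log density of a diagonal Gaussian N(z | mu, sigma2 * I).
    d = z.shape[-1]
    return -0.5 * (d * np.log(2 * np.pi * sigma2)
                   + np.sum((z - mu) ** 2, axis=-1) / sigma2)

def discriminative_objective(z2, mu2_table, seq_idx, sigma2_z2=0.5):
    # log p(i | z2) = log p(z2 | g_mu2(i)) - log sum_j p(z2 | g_mu2(j)),
    # summed over all M training sequences.
    log_probs = log_gauss(z2[None, :], mu2_table, sigma2_z2)  # shape (M,)
    m = log_probs.max()                                       # log-sum-exp trick
    log_denom = m + np.log(np.exp(log_probs - m).sum())
    return log_probs[seq_idx] - log_denom

rng = np.random.default_rng(0)
M, D = 10, 32                                   # toy number of sequences / z2 dims
mu2_table = rng.standard_normal((M, D))         # stand-in for the lookup table g_mu2
z2 = mu2_table[3] + 0.1 * rng.standard_normal(D)  # a z2 near sequence 3's s-vector

alpha = 10.0
L_seg = -42.0   # placeholder for the segment variational lower bound (assumption)
L_dis = L_seg + alpha * discriminative_objective(z2, mu2_table, seq_idx=3)
```

Because z_2 here lies near its own sequence's s-vector, log p(i | z_2) is close to zero; a z_2 attributed to the wrong sequence is heavily penalized, which is what keeps the per-sequence µ_2 estimates from collapsing to a single point.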
Inputs to the VAE/FHVAE, x, are chunks of 20 consecutive speech frames randomly drawn from utterances, where each frame is represented as 80-dimensional filter bank (FBank) energies. To extract features from the VAE and FHVAE for ASR training, for each utterance we compute and concatenate the posterior means and variances of chunks shifted by one frame, which generates a sequence of new features that is 19 frames shorter than the original sequence. We pad the first frame and the last frame at each end to match the original length.

3.3. ASR Setup and Training

Kaldi [21] is used for feature extraction, decoding, forced alignment, and training of an initial HMM-GMM model on the original clean utterances. The recipe provided by the CHiME-4 challenge (run_gmm.sh) and the Kaldi Aurora-4 recipe are adapted by only changing the training data being used. The Computational Network Toolkit (CNTK) [22] is used for neural network-based acoustic model training. For all experiments, the same LSTM acoustic model [23] with the architecture proposed in [24] is applied, which has 1,024 memory cells and a 512-node projection layer for each LSTM layer, and 3 LSTM layers in total.

Following the training setup in [25], LSTM acoustic models are trained with a cross-entropy criterion, using truncated backpropagation-through-time (BPTT) [26] to optimize. Each BPTT segment contains 20 frames, and each mini-batch contains 80 utterances, since we find empirically that 80 utterances give similar performance to 40 utterances. A momentum of 0.9 is used starting from the second epoch [3]. Ten percent of the training data is held out as a validation set to control the learning rate. The learning rate is halved when no gain is observed after an epoch. The same language model is used for decoding in all experiments.

4. EXPERIMENTAL RESULTS AND DISCUSSION

In this section, we report the experimental results on both datasets, and provide insights into the outcomes. Tables 1 and 2 summarize the results on Aurora-4 and CHiME-4 respectively. In both tables, different experiments are separated by horizontal lines and indexed by the Exp. Index in the first column. The second column, Feature, refers to the frame representations used for training ASR models. The third to sixth columns give the model configuration and the discriminative training weight for the VAE or FHVAE models; we separate encoder and decoder parameters by "/" in the third and fourth columns. Averaged and by-condition word error rates (WER) are shown in the remaining columns.

4.1. Baseline

We start by establishing Aurora-4 baseline results trained on different types of feature representations, including (1) FBank, (2) the latent variable z extracted from the VAE, and (3) the latent segment variable z1 extracted from the FHVAE. Because each FHVAE model has two encoders, to have a fair comparison between VAE and FHVAE models, we also consider a VAE model with 512 hidden units at each encoder layer. The results are shown in Table 1, Exp. Index 1. As mentioned, condition A is the matched domain, while conditions B, C, and D are all mismatched domains. FBank degrades significantly in the mismatched conditions, producing between 49% and 79% absolute WER increase. On the other hand, both VAE and FHVAE models improve the performance in the mismatched domains by a large margin, with only a slight degradation in the matched domain. In particular, the features learned by the FHVAE consistently outperform the VAE features in all mismatched conditions, by 14% absolute WER reduction.

Exp.  ASR Feature  #Layers  #Units    α    Seq. Label | Clean  Noisy | BUS    CAF    PED    STR
1     FBank        -        -         -    -          | 19.37  87.69 | 95.56  92.05  78.77  84.37
      z            1/1      512/256   -    -          | 19.47  73.95 | 70.10  91.45  64.26  69.99
      z1           1/1      256/256   10   uttid      | 19.57  67.94 | 71.96  79.37  59.32  61.11
2     z1           1/1      256/256   10   uttid      | 19.57  67.94 | 71.96  79.37  59.32  61.11
      z1           2/2      256/256   10   uttid      | 19.73  62.44 | 71.28  71.86  52.46  54.18
      z1           3/3      256/256   10   uttid      | 19.52  60.39 | 69.13  66.24  51.22  54.96

Table 2. CHiME-4 development set word error rate (%) of acoustic models trained on different features.

We believe this experiment verifies that FHVAEs can successfully retain domain invariant linguistic features in z1, while encoding domain-related information into z2. In contrast, as the results suggest, VAEs encode all the information into a single set of latent variables, z, which still contains domain-related information that can hurt ASR performance in the mismatched domains.

4.2. Comparing Model Architectures

We next explore the optimal FHVAE architectures for extracting domain invariant features. In particular, we study the effect of the number of hidden units at each layer and the number of layers. Results for each variant are listed in Table 1, Exp. Index 2 and Exp. Index 3 respectively. Regarding the averaged WER, the model with 256 hidden units at each layer and three layers in total achieves the lowest WER (24.30%). Interestingly, if we break down the WER by condition, it can be observed that increasing the FHVAE model capacity (i.e.
increasing the number of layers or hidden units) helps reduce the WER in the noisy condition (B), but degrades the channel-mismatched condition (C) above 256 hidden units and 2 layers.

4.3. Effect of FHVAE Discriminative Training

Speaker verification experiments in [15] suggest that discriminative training facilitates factorizing segment-level attributes and sequence-level attributes into two sets of latent variables. Here we study the effect of discriminative training on learning robust ASR features, and show the results in Table 1, Exp. Index 4. When α = 0, the model is not trained with the discriminative objective. When increasing the discriminative weight from 0 to 10, we observe consistent improvement in all four conditions due to better factorization of segment and sequence information; however, when further increasing the weight to 20, the performance starts to degrade. This is because the discriminative objective can adversely affect the modeling capacity by constraining the expressiveness of the latent sequence variables.

4.4. Choice of Sequence Label

A core idea of FHVAEs is to learn sequence-specific priors to model the generation of sequence-level attributes, which have a smaller amount of variation within a sequence. Suppose we treat each utterance as one sequence; then both speaker and noise information belong to sequence-level attributes, because they are consistent within an utterance. Alternatively, we consider two FHVAE models that learn speaker-specific priors and noise-specific priors respectively. This can be easily achieved by concatenating sequences with the same speaker label or noise label, and treating the result as one sequence for FHVAE training. We report the results in Table 1, Exp. Index 5. It may at first seem surprising that utilizing supervised information in this fashion does not improve performance.
We believe that concatenating utterances actually discards some useful information with respect to learning domain invariant features. FHVAEs use latent segment variables to encode attributes that are not consistent within a sequence. By concatenating utterances of the same speaker, noise information is no longer consistent within sequences, and would thus be encoded into the latent segment variables; similarly, the latent segment variables would not be speaker invariant in the other case.

4.5. Use of the S-Vector

Lastly, we study the use of s-vectors, µ_2, derived from the FHVAE model, which can be seen as a summarization of the sequence-level attributes of an utterance. We apply the same procedure as i-vector based speaker adaptation [27]: for each utterance, we first estimate its s-vector, and then concatenate the s-vector with the feature representation of each frame to generate the new feature sequence.

Results are shown in Table 1, Exp. Index 6, from which we observe a significant degradation in WER, similar to that of the VAE models. This is reasonable because z1 and µ_2 in combination contain similar information to the latent variable z in VAE models, and the degradation is due to the mismatch between the distributions of µ_2 in the training and testing sets.

4.6. Verifying Results on CHiME-4

In this section, we repeat the baseline and layer experiments on the CHiME-4 dataset, in order to verify the effectiveness of the FHVAE and the optimality of the FHVAE architecture on a non-artificial dataset. The results are shown in Table 2. From Exp. Index 1, we see that the same trend applies to the CHiME-4 dataset, where the latent segment variables from the FHVAE outperform those from the VAE, and both latent variable representations outperform FBank features.
For the FHVAE architectures, a 7% absolute WER decrease is achieved by increasing the number of encoder/decoder layers from 1 to 3, which is also consistent with the trends we saw on Aurora-4.

5. CONCLUSION AND FUTURE WORK

In this paper, we conduct comprehensive experiments studying the use of FHVAE models as domain invariant ASR feature extractors. Our feature demonstrates superior robustness in mismatched domains compared to FBank and VAE-based features, achieving 41% and 27% absolute WER reductions on Aurora-4 and CHiME-4 respectively. In the future, we plan to study FHVAE-based augmentation methods similar to [4].

6. REFERENCES

[1] Tara N. Sainath, Brian Kingsbury, George Saon, Hagen Soltau, Abdel-rahman Mohamed, George Dahl, and Bhuvana Ramabhadran, "Deep convolutional neural networks for large-scale speech tasks," Neural Networks, vol. 64, pp. 39–48, 2015.

[2] Haşim Sak, Félix de Chaumont Quitry, Tara Sainath, Kanishka Rao, et al., "Acoustic modelling with CD-CTC-sMBR LSTM RNNs," in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015, pp. 604–609.

[3] Wei-Ning Hsu, Yu Zhang, and James Glass, "A prioritized grid long short-term memory RNN for speech recognition," in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 467–473.

[4] Wei-Ning Hsu, Yu Zhang, and James Glass, "Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation," in Automatic Speech Recognition and Understanding (ASRU), 2017 IEEE Workshop on. IEEE, 2017.

[5] Arun Narayanan and DeLiang Wang, "Ideal ratio mask estimation using deep neural networks for robust speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 7092–7096.
[6] Yusuf Isik, Jonathan Le Roux, Zhuo Chen, Shinji Watanabe, and John R. Hershey, "Single-channel multi-speaker separation using deep clustering," arXiv preprint arXiv:1607.02173, 2016.

[7] Xue Feng, Yaodong Zhang, and James Glass, "Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 1759–1763.

[8] Jinyu Li, Dong Yu, Jui-Ting Huang, and Yifan Gong, "Improving wideband speech recognition using mixed-bandwidth training data in CD-DNN-HMM," in Spoken Language Technology Workshop (SLT), 2012 IEEE. IEEE, 2012, pp. 131–136.

[9] Michael L. Seltzer, Dong Yu, and Yongqiang Wang, "An investigation of deep neural networks for noise robust speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 7398–7402.

[10] Brian E. D. Kingsbury, Nelson Morgan, and Steven Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117–132, 1998.

[11] Richard M. Stern and Nelson Morgan, "Features based on auditory physiology and perception," Techniques for Noise Robustness in Automatic Speech Recognition, pp. 193–227, 2012.

[12] Oriol Vinyals and Suman V. Ravuri, "Comparing multilayer perceptron to deep belief network tandem features for robust ASR," in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on. IEEE, 2011, pp. 4596–4599.

[13] Tara N. Sainath, Brian Kingsbury, and Bhuvana Ramabhadran, "Auto-encoder bottleneck features using deep belief networks," in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012, pp. 4153–4156.
[14] Sining Sun, Binbin Zhang, Lei Xie, and Yanning Zhang, "An unsupervised deep domain adaptation approach for robust speech recognition," Neurocomputing, 2017.

[15] Wei-Ning Hsu, Yu Zhang, and James Glass, "Unsupervised learning of disentangled and interpretable representations from sequential data," in Advances in Neural Information Processing Systems, 2017.

[16] David Pearce, Aurora working group: DSR front end LVCSR evaluation AU/384/02, Ph.D. thesis, Mississippi State University, 2002.

[17] Emmanuel Vincent, Shinji Watanabe, Aditya Arie Nugraha, Jon Barker, and Ricard Marxer, "An analysis of environment, microphone and data simulation mismatches in robust speech recognition," Computer Speech & Language, 2016.

[18] Wei-Ning Hsu, Yu Zhang, and James Glass, "Learning latent representations for speech generation and transformation," in Interspeech, 2017, pp. 1273–1277.

[19] John Garofalo, David Graff, Doug Paul, and David Pallett, "CSR-I (WSJ0) complete," Linguistic Data Consortium, Philadelphia, 2007.

[20] Diederik Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[21] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011, number EPFL-CONF-192584.

[22] Dong Yu, Adam Eversole, Mike Seltzer, Kaisheng Yao, Zhiheng Huang, Brian Guenter, Oleksii Kuchaiev, Yu Zhang, Frank Seide, Huaming Wang, et al., "An introduction to computational networks and the computational network toolkit," Tech. Rep., Microsoft Research, 2014.
[23] Haşim Sak, Andrew W. Senior, and Françoise Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Interspeech, 2014, pp. 338–342.

[24] Yu Zhang, Guoguo Chen, Dong Yu, Kaisheng Yao, Sanjeev Khudanpur, and James Glass, "Highway long short-term memory RNNs for distant speech recognition," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5755–5759.

[25] Wei-Ning Hsu, Yu Zhang, Ann Lee, and James R. Glass, "Exploiting depth and highway connections in convolutional recurrent deep neural networks for speech recognition," in Interspeech, 2016, pp. 395–399.

[26] Ronald J. Williams and Jing Peng, "An efficient gradient-based algorithm for on-line training of recurrent network trajectories," Neural Computation, vol. 2, no. 4, pp. 490–501, 1990.

[27] George Saon, Hagen Soltau, David Nahamoo, and Michael Picheny, "Speaker adaptation of neural network acoustic models using i-vectors," in ASRU, 2013, pp. 55–59.