AUTOMATIC GRAMMAR AUGMENTATION FOR ROBUST VOICE COMMAND RECOGNITION

Yang Yang†, Anusha Lalitha⋆, Jinwon Lee†, Chris Lott†
† Qualcomm Research, San Diego
⋆ Department of Electrical and Computer Engineering, University of California San Diego

ABSTRACT

This paper proposes a novel pipeline for automatic grammar augmentation that provides a significant improvement in voice command recognition accuracy for systems with a small-footprint acoustic model (AM). The improvement is achieved by augmenting the user-defined voice command set, also called the grammar set, with alternate grammar expressions. For a given grammar set, a set of potential grammar expressions (candidate set) for augmentation is constructed from an AM-specific statistical pronunciation dictionary that captures the consistent patterns and errors in the decoding of the AM induced by variations in pronunciation, pitch, tempo, accent, ambiguous spellings, and noise conditions. Using this candidate set, greedy-optimization-based and cross-entropy-method (CEM) based algorithms are considered to search for an augmented grammar set with improved recognition accuracy utilizing a command-specific dataset. Our experiments show that the proposed pipeline, along with the algorithms considered in this paper, significantly reduces the mis-detection and mis-classification rate without increasing the false-alarm rate. Experiments also demonstrate the consistently superior performance of the CEM method over greedy-based algorithms.

Index Terms — voice command recognition, CTC, grammar augmentation, cross entropy method, statistical pronunciation dictionary

1. INTRODUCTION

Voice UI is becoming ubiquitous for all types of devices, from smartphones to automobiles.
Although we have seen substantial improvement in speech recognition accuracy reported in the literature since the advent of deep-neural-network-based solutions [1, 2, 3, 4], designing a robust voice UI system for low-memory/power-footprint embedded devices without a cloud-based back-end remains a challenging problem. Compared to its cloud-based counterpart, on-device inference, despite being limited by computation power, memory size, and power consumption, remains appealing for several reasons: (i) there are fewer privacy concerns, as user voice data need not be uploaded to the cloud; (ii) it reduces latency, as it does not involve network access delay; (iii) its usage is not restricted by internet availability, and it can be applied in devices with no built-in communication module.

In this work, we focus on improving the recognition accuracy of on-device voice UI systems designed to respond to a limited set of pre-defined voice commands. Such voice UI systems are commonly used in modern IoT/embedded devices such as bluetooth speakers, portable camcorders, hearables, home appliances, etc. Specifically, we assume a fixed audio front-end and only look at the pipeline of mapping acoustic features to voice commands.

As illustrated in Fig. 1, we focus on a voice command recognition system composed of an acoustic model (AM) encoder that converts the acoustic features into phoneme/grapheme-based probabilistic output, followed by a decoder (e.g., FST) that maps the probabilistic output from the AM to one of the voice commands. State-of-the-art acoustic models utilize either CTC [5], RNN-transducer [4], or attention models [6] (see [7, 8] for a good summary). They generate posterior probabilities over the corresponding phoneme or grapheme labels, which are fed to the decoder.

Fig. 1. Voice command recognition pipeline.
Even though these model architectures and training methodologies lead to satisfactory and even super-human transcription accuracy, the best models obtained are often too large for deployment in small portable devices; e.g., even the smallest model considered in [9] (Table 11 therein) has 18M parameters.

In this work, we utilize a 211K-parameter unidirectional-RNN-based acoustic model trained with the CTC criterion using Librispeech and a few other datasets, which outputs probabilities on grapheme targets. Due to the small model size, its transcription accuracy is low: the greedy decoding word-error-rate (WER) without any language model is 48.6% on the Librispeech test-clean dataset. Hence, one of the challenges addressed by our work is: given a small acoustic model trained with a general speech dataset, how can one improve the command recognition accuracy utilizing limited command-specific data? Such small-footprint AMs have been considered for keyword detection in [10] and [11]. Our work extends these by improving the command recognition accuracy with a small-footprint AM.

In Table 1, we list a few samples of the greedy decoding results from the 211K-parameter acoustic model. It is worth noting that even though the word-error-rate is high, the errors it makes tend to be phonetically plausible renderings of the correct words [1]. Running through a large dataset, we also observe that the error patterns tend to be consistent across different utterances. This leads to a useful insight: for the recognition of a limited set of voice commands (a.k.a. the grammar of the decoder), one could improve recognition accuracy

AM greedy decoding           Ground truth
the recter pawsd and den     the rector paused and then
shaking his classto hands    shaking his clasped hands
before him went on           before him went on
tax for wone o thease        facts form one of these
and itees he other           and ideas the other

Table 1. Greedy decoding samples from the acoustic encoder.
Word errors are labeled in bold.

by adding variations that capture common and consistent errors from the acoustic model to the original command set. We define a grammar as a set of valid voice commands (e.g., the grammar can be {play music, stop music, ...}), and we refer to this technique of adding variations to the original grammar as grammar augmentation. Effective grammar augmentation is the focus of this work.

The main contribution of this paper is the design of an effective grammar augmentation framework which provides significant improvement over the baseline system. Next, we highlight our main contributions in detail: (a) for any given set of original voice commands, we propose the design of a candidate set of all grammar variations which captures the consistent errors of a given AM; (b) we propose a technique for fast evaluation of command recognition accuracy, along with false-alarm and mis-detection rates, for any augmented grammar set; and finally (c) we devise various algorithms to automatically identify an improved augmented grammar set by suitably adding variations from the candidate set to the original grammar. Our novel pipeline using the above techniques is illustrated in Fig. 2.

The rest of the paper is organized as follows: In Section 2, we give an overview of the proposed grammar augmentation pipeline and dive into the generation of a candidate set and fast grammar evaluation techniques. In Section 3, algorithms based on greedy optimization and the CEM algorithm are utilized to automate the grammar augmentation process. The experimental results are presented in Section 4, and we discuss future directions in Section 5.

2. PIPELINE FOR AUTOMATIC GRAMMAR AUGMENTATION

Our AM is trained with CTC loss [5], and can thus assign a posterior probability P_CTC(g | u) for each command g in a command set, for an input utterance u.
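The score P_CTC(g | u) mentioned above is computed with the standard CTC forward recursion [5]. A minimal sketch, assuming a precomputed per-frame posterior matrix from the AM (probabilities are multiplied directly for clarity; a real implementation would work in the log domain):

```python
import numpy as np

def ctc_forward_prob(posteriors, labels, blank=0):
    """CTC forward algorithm: probability of `labels` given per-frame
    posteriors of shape (T, num_symbols)."""
    # Extended label sequence with interleaved blanks: b, l1, b, l2, ..., b
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S, T = len(ext), posteriors.shape[0]
    alpha = np.zeros((T, S))
    # Initialization: start in the leading blank or the first label.
    alpha[0, 0] = posteriors[0, ext[0]]
    if S > 1:
        alpha[0, 1] = posteriors[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            # Skip transition allowed unless the current symbol is a blank
            # or repeats the label two positions back.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * posteriors[t, ext[s]]
    # Total probability: end in the final label or the trailing blank.
    total = alpha[T - 1, S - 1]
    if S > 1:
        total += alpha[T - 1, S - 2]
    return float(total)
```

Summing the forward variables over the last frame yields exactly the probability mass of all frame-level paths that collapse to the given grapheme sequence.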
For a given test utterance, our system picks the command with the highest probability, or rejects the utterance if the highest probability is below a pre-defined confidence threshold (see Section 2.3) [12][13].

Command decoding errors happen if the AM output deviates from the ground truth to the extent that it can no longer successfully discriminate against other grammar sequences. The idea behind grammar augmentation is to restore the discriminative power of the acoustic model by including in the grammar the sequence variations that capture pronunciation variations or consistent AM error patterns. To do so, we begin with the generation of a candidate set containing meaningful variations.

2.1. AM-specific statistical pronunciation dictionary

The augmentation candidates should ideally capture consistent error patterns of the AM, induced by variations in pronunciation, pitch, tempo, accent, ambiguous spellings, or even inherent mistakes made by the AM. For example, if any command includes words that have homophones, then it is necessary to consider adding those homophones to the grammar. To capture these word-level variations, we introduce a novel concept named the AM-specific statistical pronunciation dictionary, obtained by the following steps. First, we run the AM through a large general speech dataset (e.g., the training set of the AM). For each utterance, we obtain its greedy decoding sequence by outputting the character with the maximum probability at each time frame, followed by the CTC squashing function [5] to collapse repeated output graphemes and remove blanks. Given that most utterances from a general speech dataset correspond to a sentence rather than a single word, we use the Levenshtein algorithm to find the minimum-edit-path from the ground truth to the decoding, and by doing so obtain a mapping of each word to its corresponding maximum-probability decoding.
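The minimum-edit-path step can be sketched with a word-level Levenshtein table and a backtrace. This is a simplification of the procedure described above: it records only one-to-one matches and substitutions, ignoring insertions, deletions, and word merges/splits:

```python
def align_words(truth, decoded):
    """Word-level minimum-edit-path alignment between a ground-truth
    sentence and its greedy decoding. Returns (truth_word, decoded_word)
    pairs for matches and substitutions."""
    t, d = truth.split(), decoded.split()
    n, m = len(t), len(d)
    # Standard Levenshtein dynamic-programming table.
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        cost[i][0] = i
    for j in range(m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (t[i - 1] != d[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrace to recover which decoded word each truth word maps to.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if cost[i][j] == cost[i - 1][j - 1] + (t[i - 1] != d[j - 1]):
            pairs.append((t[i - 1], d[j - 1]))
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]
```

Aggregating these pairs over a large dataset gives, for each word, a frequency table of its decodings.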
For each word, we gather statistics on the frequencies of its maximum-probability decoding outputs. Here we sample a few entries from the dictionary obtained using our 211K-parameter AM:

set           pause          two
set    32.2%  pause  15.7%   to    53.3%
said   16.6%  pose   14.9%   two   34.7%
sat    11.4%  pase   7.68%   do     1.0%
sait   8.15%  porse  7.31%   tu     0.7%
sed    4.71%  pas    7.31%   too    0.3%

2.2. Candidate set for grammar augmentation

Utilizing this statistical dictionary, we build a candidate set containing potential grammar variations by repeatedly replacing each word in the original grammar with its top-k likely max-decoding outputs. Consider a voice UI application for a small bluetooth player; one could have the following five commands forming the original grammar.

command (C)     original grammar¹   candidate set for grammar augmentation (G)
play music      play music          pla music, ply music, play mesic, ...
stop music      stop music          stap music, stup music, stup mesic, ...
pause music     pause music         pose music, pase mesic, pause mesic, ...
previous song   previous song       previs song, previous son, ...
next song       next song           nex song, lext song, nex son, ...

By looking up in the statistical dictionary the words contained in the original grammar, one can form an array of alternate expressions for the original commands as shown above. For each command, the set of candidates is the Cartesian product of the top-k decoding lists from the statistical pronunciation dictionary for each word in the command. The value of k can be different for different words, and is chosen to capture at least a certain fraction of all the variations.

2.3. Evaluation of command recognition accuracy

Let us denote the set of commands as C, the set of all grammar candidates as G, and the mapping function from G to C as f. A grammar G is a subset of G.
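With this notation, the candidate-set construction of Section 2.2 and the mapping f can be illustrated concretely. The dictionary entries below are illustrative (truncated from the samples shown earlier), and `candidates` is a hypothetical helper name:

```python
from itertools import product

# Illustrative top-k entries from the statistical pronunciation
# dictionary (probabilities omitted).
stat_dict = {
    "pause": ["pause", "pose", "pase"],
    "music": ["music", "mesic"],
}

def candidates(command, dictionary, k=3):
    """All grammar variations of `command`: the Cartesian product of
    each word's top-k likely max-decoding outputs."""
    options = [dictionary.get(w, [w])[:k] for w in command.split()]
    return [" ".join(words) for words in product(*options)]

# f maps every grammar candidate g back to its original command c.
f = {g: "pause music" for g in candidates("pause music", stat_dict)}
```

With 3 variations of "pause" and 2 of "music", this yields 6 grammar candidates for the single command.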
For the purpose of evaluating the recognition accuracy of any grammar, we need a command-specific dataset containing audio waveforms and the corresponding target commands. We denote such a dataset as (u, t) ∈ D, with u and t denoting an utterance and its corresponding target command. To evaluate the false-alarm rate, we also need an out-of-domain dataset u ∈ D_ood that contains a set of utterances that do not correspond to any of the commands.

As mentioned before, the acoustic decoder compares the posterior probabilities P_CTC(g | u) of all the grammar candidates g included in the grammar set G ⊆ G given the audio waveform u, and outputs the command f(g*), where g* = argmax_{g ∈ G} P_CTC(g | u). This calculation is done by running a forward-only dynamic programming algorithm on the AM output.

¹ Here we assume that the AM is trained with graphemes as targets, and as a result the grammar is exactly the same as the command set. Note that the same grammar augmentation pipeline introduced here can be applied to an AM trained with phoneme targets as well, in which case the grammar is a set of phoneme sequences, and the statistical pronunciation dictionary contains variations of each word in phoneme representation.

Fig. 2. Grammar augmentation pipeline.

In order to avoid having to repeat the calculation of the probability scores for every choice of grammar set G ⊆ G, we pre-compute and store the probability scores for all the grammar candidates and all the utterances in both the command-specific dataset D and the out-of-domain dataset D_ood. Precisely, as a pre-processing step of the grammar augmentation search algorithms, we obtain the following probability scores:

    P_CTC(g | u),  ∀ g ∈ G,  ∀ u ∈ D ∪ D_ood.    (1)

To achieve a false-alarm rate (FAR) target of α, the confidence threshold for the probability score can be computed as

    τ(G, α) = min { τ : |{u ∈ D_ood : max_{g ∈ G} P_CTC(g | u) > τ}| / |D_ood| < α }.
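The threshold τ(G, α) above can be obtained directly from the precomputed out-of-domain scores. A sketch (it returns the smallest observed score satisfying the constraint; ties and the exact infimum are glossed over):

```python
import numpy as np

def far_threshold(ood_scores, alpha):
    """tau(G, alpha): smallest observed score such that the fraction of
    out-of-domain utterances scoring strictly above it is below alpha.
    `ood_scores` holds max_g P_CTC(g | u) for each u in D_ood."""
    s = np.sort(np.asarray(ood_scores))[::-1]   # descending order
    n = len(s)
    m = int(np.floor(alpha * n))                # allowed exceedances
    if m >= alpha * n:                          # the constraint is strict
        m -= 1
    m = min(max(m, 0), n - 1)
    # With tau = s[m], at most m scores lie strictly above tau.
    return float(s[m])
```

Because the scores in (1) are precomputed, this threshold can be re-derived cheaply every time the grammar set G changes during the search.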
The decoded command for an utterance u is

    d(G, α, u) = φ,  if max_{g ∈ G} P_CTC(g | u) < τ(G, α);
    d(G, α, u) = argmax_{c ∈ C} max_{g ∈ G, f(g) = c} P_CTC(g | u),  otherwise,

where φ denotes the decoding being out-of-domain. With a fixed false-alarm rate, there are two types of error events: mis-detection and mis-classification. Mis-detection refers to the case where a voice command is issued but not detected (i.e., decoded as being out-of-domain), whereas mis-classification happens when a voice command is issued and detected, but the wrong command is decoded. Precisely, the mis-detection rate (MDR) and the mis-classification rate (MCR) are defined as

    MDR(G, α) = |{(u, t) ∈ D : d(G, α, u) = φ}| / |D|,
    MCR(G, α) = |{(u, t) ∈ D : d(G, α, u) ∉ {φ, t}}| / |D|.

3. AUGMENTATION SEARCH ALGORITHMS

The grammar augmentation algorithms we consider search for the grammar set G, among all subsets of a candidate set G, that minimizes a weighted sum of the mis-detection rate and mis-classification rate with a fixed false-alarm target α:

    min_{G ⊆ G} MCR(G, α) + β MDR(G, α).    (2)

Here the weight factor β controls the significance of mis-detection versus mis-classification. Since we pre-compute the probabilities as shown in Equation (1), for each grammar G ⊆ G the objective function can be evaluated without invoking the AM, which significantly speeds up the search algorithms.

It is important to note that adding candidates to the grammar does not always improve performance: (i) with a fixed false-alarm target, adding more candidates only increases the confidence threshold τ(G, α), which could potentially result in a degraded mis-detection rate; (ii) the distinguishability of the commands has a complex inter-dependency, hence adding a grammar candidate for one command may reduce the recognition rate of other commands, as it may alter the classification boundary amongst the set of commands.

3.1.
Augmentation via greedy optimization methods

We consider the following three methods based on greedy optimization:

Naive greedy search: Start with the original grammar and iteratively go through all the candidates from G. In each iteration, add the candidate that best improves the objective function and update the confidence threshold to maintain the target FAR, until no candidate can improve further.

Greedy search with refinement: This algorithm is similar to greedy search, except that every time a candidate is added to the grammar, we remove those candidates among the remaining ones which contain the added candidate as a subsequence. For example, for the pause music command, if the candidate pose music is added to the grammar, then porse music is removed from subsequent iterations. Trimming the candidate set in this manner increases the diversity of variations in the grammar.

Beam search: In each iteration, a list of the l best grammar sets is maintained. This degenerates to the naive greedy algorithm when l = 1.

3.2. Augmentation via cross entropy method (CEM)

The cross-entropy method (CEM) is a widely used combinatorial optimization algorithm and has been successfully applied in some reinforcement learning problems [14, 15]. The main idea is rooted in rare-event sampling, for which the algorithm tries to minimize the KL divergence between a proposed sampling distribution and the optimal zero-variance importance sampling distribution [15]. Going back to the grammar augmentation objective function in Equation (2), the search space is the power set of the candidate set G, which can be represented by {0, 1}^|G|, with each grammar choice represented by a |G|-dimensional binary vector.

Applying the idea of CEM, we start with an initial probability distribution on {0, 1}^|G| and iteratively tune its parameters so that it assigns most of the probability mass to the region that minimizes the objective function.
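A generic CEM loop over binary selection vectors can be sketched as follows. This is a simplification: it parameterizes the sampling distribution with independent per-dimension Bernoulli probabilities rather than the sign-of-Gaussian construction used in our design, and the objective passed in would be the precomputed evaluation of MCR(G, α) + β MDR(G, α):

```python
import numpy as np

rng = np.random.default_rng(0)

def cem_search(objective, dim, iters=30, s=100, elite_frac=0.2):
    """Cross-entropy method over binary vectors in {0,1}^dim.
    `objective` is minimized; each sample is a candidate inclusion mask."""
    p = np.full(dim, 0.5)                 # initial inclusion probabilities
    n_elite = max(1, int(elite_frac * s))
    best, best_val = None, np.inf
    for _ in range(iters):
        # Draw a population of s candidate masks from the current distribution.
        samples = (rng.random((s, dim)) < p).astype(int)
        vals = np.array([objective(x) for x in samples])
        # Keep the best fraction (the elite set) and refit the distribution.
        elite = samples[np.argsort(vals)[:n_elite]]
        p = elite.mean(axis=0)
        if vals.min() < best_val:
            best_val = vals.min()
            best = samples[np.argmin(vals)]
    return best, best_val
```

The returned mask selects which grammar candidates enter the augmented grammar set.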
In our design, the distribution on this discrete space is induced by the signs of |G|-dimensional independent Gaussian distributions, parameterized by their mean and variance in each dimension. In each iteration, we start with a population of s samples from the current distribution, each representing a feasible candidate choice. We evaluate the objective function MCR + β MDR for each sampled candidate choice and keep the best γ fraction. We then update the parameters of the distribution using the sample mean and variance of the top γs candidates (also called the elite set), and iterate the procedure by obtaining s samples from the updated distribution.

4. EXPERIMENTS

In this section, we present experiments which illustrate the improvement in recognition accuracy that can be obtained by applying our grammar augmentation algorithms. All the results are obtained with a dataset containing 5 commands: play music, pause music, stop music, next song, and previous song. This dataset contains utterances with varying gender, pitch, volume, noise types, and accents, and is split into training, validation, and testing datasets. The training dataset is used to train the augmentation search algorithms to minimize the objective defined in (2). The validation dataset is used to compare the performance of the grammar sets obtained and decide which one to take. Finally, we report the results of the final grammar set on a test dataset. For the training objective function in Equation (2), we pick β = 1, in which case minimizing the sum of MDR and MCR is equivalent to maximizing the command success rate 1 − MCR(G, α) − MDR(G, α). A candidate set is obtained by running the 211K-parameter AM on a 2000-hour dataset using the steps discussed in Sections 2.1 and 2.2. We consider 150 grammar candidates (|G| = 150) using our statistical pronunciation dictionary.

4.1.
Performance Evaluation

We analyze the grammar augmentation algorithms described in Section 3 with a fixed FAR target of α = 0.1% and compare the augmented grammars output by each algorithm in terms of command success rate. Fig. 3 shows the command success rate and the decomposition of the error into mis-detection and mis-classification. Note that the CEM algorithm provides the most improvement in command success rate, unlike the greedy-optimization-based algorithms, which may commit to suboptimal grammar sets early on. As discussed previously, adding more variations to the grammar set makes it more susceptible to mis-detection errors. In fact, adding all 150 grammar expressions reduces the command success rate to 80% and increases the MDR to 13.76%. However, Fig. 3 shows that performing augmentation in a principled manner can greatly reduce the mis-classification error without increasing the mis-detection errors.

Fig. 3. Performance of grammar augmentation algorithms (FAR = 0.1%). Command success rate (MCR, MDR): Original 90.34% (5.90%, 3.76%); Greedy 93.76% (4.02%, 2.22%); Greedy with refinement 94.02% (3.50%, 2.48%); Beam search (width 5) 93.85% (3.50%, 2.65%); CEM 94.44% (2.31%, 3.25%).

4.2. Complexity of Grammar Augmentation Algorithms

We evaluate the complexity of the augmentation algorithms considered in Section 3. The most computationally expensive step in implementing our augmentation algorithms is the evaluation of MCR and MDR for any candidate grammar set. Hence, we measure the complexity of our augmentation algorithms in terms of the number of grammar evaluations needed to output their best augmented grammar set. Fig. 4 illustrates the improvement in command success rate (1 − MDR − MCR) as the number of grammar evaluations increases. Note that CEM takes only marginally more evaluations while providing the maximum reduction in the sum of MCR and MDR.
While beam search explores more and requires more grammar evaluations, it provides only marginally better improvement over naive greedy. The greedy algorithm with refinement reaches its best performance in the least number of grammar evaluations. This suggests that incentivizing diversity over exploration may provide better improvement in command success rate in fewer evaluations.

Fig. 4. Test dataset performance vs. number of grammar evaluations.

4.3. Effect of Candidate Set Size on Performance

So far we considered a candidate set size of 150 (|G| = 150). Next, we investigate the effect of varying the candidate set size on the performance of the augmentation algorithms. We vary the candidate set size by varying the number of words k we choose from the top-k likely max-decoding outputs for every word in the statistical pronunciation dictionary. Hence, a larger candidate set captures a larger probability mass of max-decoding outputs. We repeat our experiments, varying the candidate set size from 25 to 150. Table 2 shows the performance of the augmentation algorithms for various candidate set sizes. In particular, it shows that CEM improves as we increase the candidate set size and is consistently better than the greedy-based algorithms.

Candidate Set Size |G|   Greedy   Greedy (refinement)   Beam search (width 5)   CEM
25                       92.31    91.79                 92.31                   93.25
50                       93.16    92.74                 93.50                   93.59
75                       93.08    93.16                 92.99                   93.68
100                      92.82    92.65                 92.05                   94.02
150                      93.76    94.02                 93.85                   94.44

Table 2. 1 − MDR − MCR (%) for different algorithms with different candidate set sizes |G|.

5.
CONCLUSION AND FUTURE WORK

In this work, we focus on a small-footprint voice command recognition system composed of a CTC-based small-capacity acoustic encoder and a corresponding maximum a posteriori decoder for the recognition of a limited set of fixed commands. With a command-specific dataset, we proposed a novel pipeline that automatically augments the command grammar for improved mis-detection and mis-classification rates. We achieved this by adapting the decoder to the consistent decoding variations of the acoustic model. An important direction of future work is to extend our grammar augmentation pipeline to provide personalization, i.e., to improve the recognition accuracy for a specific user by adapting the decoder to better fit both the AM and the user's pronunciation pattern.

6. REFERENCES

[1] Awni Y. Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng, "Deep Speech: Scaling up end-to-end speech recognition," 2014, vol. abs/1412.5567.

[2] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, Jie Chen, Jingdong Chen, Zhijie Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, et al., "Deep Speech 2: End-to-End Speech Recognition in English and Mandarin," in Proceedings of The 33rd International Conference on Machine Learning, Maria Florina Balcan and Kilian Q. Weinberger, Eds., New York, New York, USA, 20–22 Jun 2016, vol. 48 of Proceedings of Machine Learning Research, pp. 173–182, PMLR.

[3] Ronan Collobert, Christian Puhrsch, and Gabriel Synnaeve, "Wav2Letter: an End-to-End ConvNet-based Speech Recognition System," 2016, vol. abs/1609.03193.

[4] Alex Graves, "Sequence Transduction with Recurrent Neural Networks," 2012, vol. abs/1211.3711.
[5] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks," in Proceedings of the 23rd International Conference on Machine Learning, New York, NY, USA, 2006, ICML '06, pp. 369–376, ACM.

[6] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, KyungHyun Cho, and Yoshua Bengio, "Attention-Based Models for Speech Recognition," 2015, vol. abs/1506.07503.

[7] Rohit Prabhavalkar, Kanishka Rao, Tara N. Sainath, Bo Li, Leif Johnson, and Navdeep Jaitly, "A Comparison of Sequence-to-Sequence Models for Speech Recognition," in INTERSPEECH, 2017.

[8] E. Battenberg, J. Chen, R. Child, A. Coates, Y. Gaur, Y. Li, H. Liu, S. Satheesh, A. Sriram, and Z. Zhu, "Exploring neural transducers for end-to-end speech recognition," in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Dec 2017, pp. 206–213.

[9] Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, et al., "Deep Speech 2: End-to-End Speech Recognition in English and Mandarin," 2015, vol. abs/1512.02595.

[10] Tara N. Sainath and Carolina Parada, "Convolutional neural networks for small-footprint keyword spotting," in INTERSPEECH, 2015, pp. 1478–1482, ISCA.

[11] G. Chen, C. Parada, and G. Heigold, "Small-footprint keyword spotting using deep neural networks," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 4087–4091.

[12] Y. Miao, M. Gowayyed, and F. Metze, "EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding," in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Dec 2015, pp. 167–174.
[13] Naoyuki Kanda, Xugang Lu, and Hisashi Kawai, "Maximum a posteriori based decoding for CTC acoustic models," in INTERSPEECH, 2016.

[14] István Szita and András Lörincz, "Learning Tetris Using the Noisy Cross-entropy Method," Cambridge, MA, USA, Dec. 2006, vol. 18, pp. 2936–2941, MIT Press.

[15] Pieter-Tjerk de Boer, Dirk P. Kroese, Shie Mannor, and Reuven Y. Rubinstein, "A Tutorial on the Cross-Entropy Method," Feb 2005, vol. 134, pp. 19–67.