Deep context: end-to-end contextual speech recognition


Authors: Golan Pundak, Tara N. Sainath, Rohit Prabhavalkar, Anjuli Kannan, Ding Zhao (Google Inc., USA)

ABSTRACT

In automatic speech recognition (ASR), what a user says depends on the particular context she is in. Typically, this context is represented as a set of word n-grams. In this work, we present a novel, all-neural, end-to-end (E2E) ASR system that utilizes such context. Our approach, which we refer to as Contextual Listen, Attend and Spell (CLAS), jointly optimizes the ASR components along with embeddings of the context n-grams. During inference, the CLAS system can be presented with context phrases which might contain out-of-vocabulary (OOV) terms not seen during training. We compare our proposed system to a more traditional contextualization approach, which performs shallow fusion between independently trained LAS and contextual n-gram models during beam search. Across a number of tasks, we find that the proposed CLAS system outperforms the baseline method by as much as 68% relative WER, indicating the advantage of joint optimization over individually trained components.

Index Terms: speech recognition, sequence-to-sequence models, listen attend and spell, LAS, attention, embedded speech recognition.

1. INTRODUCTION

As speech technologies become increasingly pervasive, speech is emerging as one of the main input modalities on mobile devices and in intelligent personal assistants [1]. In such applications, speech recognition performance can be improved significantly by incorporating information about the speaker's context into the recognition process [2]. Examples of such context include the dialog state (e.g., we might want "stop" or "cancel" to be more likely when an alarm is ringing), the speaker's location (which might make nearby restaurants or locations more likely) [3], as well as personalized information about the user such as her contacts or song playlists [4].
There has been growing interest recently in building sequence-to-sequence models for automatic speech recognition (ASR), which directly output words, word-pieces [5], or graphemes given an input speech utterance. Such models implicitly subsume the components of a traditional ASR system (the acoustic model (AM), the pronunciation model (PM), and the language model (LM)) into a single neural network which is jointly trained to optimize log-likelihood or task-specific objectives such as the expected word error rate (WER) [6]. Representative examples of this approach include connectionist temporal classification (CTC) [7] with word output targets [8], the recurrent neural network transducer (RNN-T) [9, 10], and the "Listen, Attend, and Spell" (LAS) encoder-decoder architecture [11, 12]. In recent work, we have shown that such approaches can outperform a state-of-the-art conventional ASR system when trained on 12,500 hours of transcribed speech utterances [13].

In the present work, we consider techniques for incorporating contextual information dynamically into the recognition process. In traditional ASR systems, one of the dominant paradigms for incorporating such information involves the use of an independently-trained on-the-fly (OTF) rescoring framework which dynamically adjusts the LM weights of a small number of n-grams relevant to the particular recognition context [2]. Extending such techniques to sequence-to-sequence models is important for improving system performance, and is an active area of research. In this context, previous works have examined the inclusion of a separate LM component into the recognition process through either shallow fusion [14] or cold fusion [15], which can bias the recognition process towards a task-specific LM.
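As a minimal sketch of the shallow-fusion baseline described above (not the paper's implementation), beam-search candidates are rescored by interpolating the E2E model's log-probability with an external LM's log-probability. The interpolation weight and the toy candidate scores below are illustrative assumptions:

```python
import math

def shallow_fusion_score(log_p_e2e, log_p_lm, lam=0.3):
    """Shallow fusion: interpolate the E2E model score with an
    independently trained (contextual) LM score during beam search.
    The weight lam is a hypothetical value; in practice it is tuned."""
    return log_p_e2e + lam * log_p_lm

# Toy rescoring of two competing hypotheses: a contextual LM that
# boosts "stop" (e.g., while an alarm is ringing) flips the ranking.
e2e_scores = {"stop": math.log(0.20), "shop": math.log(0.25)}
lm_scores = {"stop": math.log(0.60), "shop": math.log(0.05)}
fused = {w: shallow_fusion_score(e2e_scores[w], lm_scores[w])
         for w in e2e_scores}
```

Without fusion "shop" wins on the E2E score alone; with the contextual LM folded in, "stop" is ranked first, which is the behavior OTF rescoring aims for.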
A shallow fusion approach was also used directly to contextualize LAS in [16], where output probabilities were modified using a special weighted finite state transducer (WFST) constructed from the speaker's context, and was shown to be effective in improving performance.

The use of an external, independently-trained LM for OTF rescoring, as in previous approaches, goes against the benefits derived from the joint optimization of the components of a sequence-to-sequence model. Therefore, in this work, we propose Contextual-LAS (CLAS), a novel, all-neural mechanism which can leverage contextual information, provided as a list of contextual phrases, to improve recognition performance. Our technique consists of first embedding each phrase, represented as a sequence of graphemes, into a fixed-dimensional representation, and then employing an attention mechanism [17] to summarize the available context at each step of the model's output predictions. Our approach can be considered a generalization of the technique proposed in [18] in the context of streaming keyword spotting, in that it allows for a variable number of contextual phrases during inference. The proposed method does not require that the particular context information be available at training time, and crucially, unlike previous works [16, 2], it does not require careful tuning of rescoring weights, while still being able to incorporate out-of-vocabulary (OOV) terms. In experimental evaluations, we find that CLAS, which trains the contextualization components jointly with the rest of the model, significantly outperforms online rescoring techniques when handling hundreds of context phrases, and is comparable to these techniques when handling thousands of phrases.

The organization of the rest of this paper is as follows. We describe the standard LAS model in Section 2.1, and the standard contextualization approach in Section 2.2.
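The CLAS mechanism described above (embed each grapheme-sequence phrase into a fixed-dimensional vector, then attend over the phrase embeddings at each decoding step) can be sketched as follows. This is a simplified stand-in, not the paper's architecture: the mean of per-grapheme embeddings replaces the trained phrase encoder, the embeddings are random, and the phrase list and decoder state are hypothetical:

```python
import math
import random

random.seed(0)
DIM = 8  # fixed embedding dimension (an arbitrary choice here)

def embed_phrase(graphemes, table):
    """Toy phrase encoder: averages per-grapheme embeddings into a
    fixed-dimensional vector. CLAS uses a trained RNN encoder; the
    mean is a simplifying assumption for illustration."""
    vecs = [table.setdefault(g, [random.gauss(0, 1) for _ in range(DIM)])
            for g in graphemes]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(DIM)]

def bias_attention(decoder_state, phrase_embs):
    """Dot-product attention over the variable-length list of phrase
    embeddings; returns softmax weights and the summary context
    vector used at this output step."""
    scores = [sum(d * z for d, z in zip(decoder_state, emb))
              for emb in phrase_embs]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(w * emb[i] for w, emb in zip(weights, phrase_embs))
               for i in range(DIM)]
    return context, weights

table = {}
phrases = ["call mom", "play jazz"]  # hypothetical bias phrases
Z = [embed_phrase(p, table) for p in phrases]
d_t = [random.gauss(0, 1) for _ in range(DIM)]  # stand-in decoder state
c_bias, w = bias_attention(d_t, Z)
```

Because attention is computed over however many phrase embeddings are supplied, the number of contextual phrases can vary freely at inference time, which is the property the text highlights relative to [18].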
We present the proposed modifications to the LAS model in order to obtain the CLAS model in Section 3. We describe our experimental setup and discuss results in Sections 4 and 5, respectively, before concluding in Section 6.

2. BACKGROUND

2.1. The LAS model

We now briefly describe the LAS model; for more details see [11, 13]. The LAS model outputs a probability distribution over sequences of output labels, y (graphemes, in this work), conditioned on a sequence of input audio frames, x (log-mel features, in this work): P(y | x).

[Fig. 1: A schematic representation of the models used in this work.]

The model consists of three modules: an encoder, a decoder, and an attention network, which are trained jointly to predict a sequence of graphemes from a sequence of acoustic feature frames (Figure 1a).

The encoder is comprised of a stacked recurrent neural network (RNN) [19, 20] (unidirectional, in this work) that reads acoustic features, x = (x_1, ..., x_K), and outputs a sequence of high-level features (hidden states), h^x = (h^x_1, ..., h^x_K). The encoder is similar to the acoustic model in an ASR system.

The decoder is a stacked unidirectional RNN that computes the probability of a sequence of output tokens (characters, in this work) y = (y_1, ..., y_T) as follows:

P(y | x) = P(y | h^x) = \prod_{t=1}^{T} P(y_t | h^x, y_0, y_1, \ldots, y_{t-1})    (1)

The conditional dependence on the encoder state vectors, h^x, is modeled using a context vector c_t = c^x_t, which is computed using multi-head attention [21, 13] as a function of the current decoder hidden state, d_t, and the full encoder state sequence, h^x. The hidden state of the decoder, d_t, which captures the previous character context y
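The factorization in Eq. (1) means the log-probability of a full hypothesis is simply the sum of the per-step conditional log-probabilities emitted by the decoder. A minimal sketch (the three per-step probabilities below are illustrative values, not model outputs):

```python
import math

def sequence_log_prob(step_probs):
    """Eq. (1): P(y | x) factorizes over output steps as
    prod_t P(y_t | h^x, y_{<t}), so the sequence log-probability
    is the sum of per-step log conditional probabilities."""
    return sum(math.log(p) for p in step_probs)

# A three-token hypothesis whose per-step conditionals are
# 0.9, 0.8, and 0.5; the product 0.9 * 0.8 * 0.5 = 0.36.
logp = sequence_log_prob([0.9, 0.8, 0.5])
```

Beam search over such a model ranks partial hypotheses by exactly this accumulated sum, which is why the rescoring approaches discussed earlier operate on per-step log-probabilities.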