Sequence-to-Sequence ASR Optimization via Reinforcement Learning


Authors: Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

Andros Tjandra 1, Sakriani Sakti 1,2, Satoshi Nakamura 1,2
1 Graduate School of Information Science, Nara Institute of Science and Technology, Japan
2 RIKEN, Center for Advanced Intelligence Project AIP, Japan
{andros.tjandra.ai6, ssakti, s-nakamura}@is.naist.jp

ABSTRACT

Despite the success of sequence-to-sequence approaches in automatic speech recognition (ASR) systems, the models still suffer from several problems, mainly due to the mismatch between the training and inference conditions. In the sequence-to-sequence architecture, the model is trained to predict the grapheme of the current time-step given the input speech signal and the ground-truth grapheme history of the previous time-steps. However, during inference the ground-truth history is unavailable, so the model must approximate it from its own predictions; generating the whole transcription from scratch based on previous predictions is therefore difficult, and errors can propagate over time. Furthermore, the model is optimized to maximize the likelihood of the training data instead of the error-rate evaluation metrics that actually quantify recognition quality. This paper presents an alternative strategy for training sequence-to-sequence ASR models by adopting the idea of reinforcement learning (RL). Unlike the standard training scheme with maximum likelihood estimation, our proposed approach utilizes the policy gradient algorithm. We can (1) sample the whole transcription based on the model's prediction in the training process and (2) directly optimize the model with negative Levenshtein distance as the reward. Experimental results demonstrate that we significantly improved the performance compared to a model trained only with maximum likelihood estimation.

Index Terms — End-to-end speech recognition, reinforcement learning, policy gradient optimization

1. INTRODUCTION

Sequence-to-sequence models have recently been shown to be very effective for many tasks such as machine translation [1, 2], image captioning [3, 4], and speech recognition [5]. With these models, we are able to learn a direct mapping between variable-length source and target sequences whose alignment is often not known a priori, using only a single neural network architecture. This way, many complicated hand-engineered pipelines can also be simplified by letting DNNs find their own way to map from input to output spaces [5, 6, 7]. Therefore, we can eliminate the need to construct separate components, i.e., a feature extractor, an acoustic model, a lexicon model, or a language model, as is commonly required in conventional ASR systems such as hidden Markov model-Gaussian mixture model (HMM-GMM)-based or hybrid HMM-DNN systems.

A generic sequence-to-sequence model commonly consists of three modules: (1) an encoder module for representing the source information, (2) a decoder module for generating the transcription output, and (3) an attention module for extracting related information from the encoder representation based on the current decoder state. Decoding is done with a left-to-right procedure. In the training stage, given the current input speech signal, the decoder produces the grapheme of the current time-step with maximal probability, conditioned on the ground-truth grapheme history of the previous time-steps. This training scheme is usually referred to as the teacher-forcing method [8]. However, in the inference stage, since the ground-truth transcription is not known, the model must produce the grapheme of the current time-step based on an approximation of the correct graphemes of previous time-steps. Therefore, an incorrect decision in an earlier time-step may propagate through subsequent time-steps.
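The training/inference mismatch described above can be illustrated with a minimal sketch. This is not the paper's model: `decoder_step` is a hypothetical deterministic stand-in for "take the most probable next grapheme"; the point is only that teacher forcing conditions on the reference history while free-running decoding conditions on the model's own outputs.

```python
# Illustrative sketch (assumed toy example, not from the paper):
# teacher-forced training vs. free-running inference.

def decoder_step(prev_token: int, vocab_size: int = 5) -> int:
    # Toy next-token rule standing in for argmax over P(y_t | y_{t-1}, x).
    return (2 * prev_token + 1) % vocab_size

def teacher_forced_predictions(ground_truth):
    # Training: each step is conditioned on the GROUND-TRUTH history.
    preds, prev = [], 0  # 0 plays the role of a <sos> token
    for y_true in ground_truth:
        preds.append(decoder_step(prev))
        prev = y_true              # feed the reference grapheme
    return preds

def free_running_predictions(length):
    # Inference: each step is conditioned on the model's OWN prediction,
    # so an early mistake changes every later conditioning context.
    preds, prev = [], 0
    for _ in range(length):
        prev = decoder_step(prev)
        preds.append(prev)
    return preds
```

With the same toy decoder, the two conditioning schemes already produce different sequences after the first step, which is the mismatch the paper's RL training targets.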
Another drawback is the difference in objective functions between the training and evaluation schemes. In the training stage, the model is mostly optimized by combining the teacher-forcing approach with maximum likelihood estimation (MLE) for each frame. On the other hand, recognition accuracy is evaluated by calculating the minimum string edit distance (Levenshtein distance) between the correct transcription and the recognition output. Such differences may result in suboptimal performance [9]. Optimizing the model parameters with an appropriate objective function is crucial to achieving good model performance; in other words, direct optimization with respect to the evaluation metrics might be necessary.

In this paper, we propose an alternative strategy for training a sequence-to-sequence ASR model by adopting an idea from RL. Specifically, we utilize a policy gradient algorithm (REINFORCE) [10] to simultaneously alleviate both of the above problems. By treating our decoder as a policy network or an agent, we are able to (1) sample the whole transcription based on the model's prediction in the training process and (2) directly optimize the model with negative Levenshtein distance as the reward. Our model thus combines the power of the sequence-to-sequence approach, which learns the mapping between the speech signal and the text transcription, with the strength of reinforcement learning, which optimizes the model directly with an ASR performance metric.

2. SEQUENCE-TO-SEQUENCE ASR

A sequence-to-sequence model is a type of neural network model that directly models the conditional probability P(y|x), where x = [x_1, ..., x_S] is the source sequence with length S and y = [y_1, ..., y_T] is the target sequence with length T. The most common input x is a sequence of feature vectors such as Mel-spectral filterbank and/or MFCC features. Therefore, x ∈ R^{S×F}, where F is the number of features and S is the total frame length for an utterance.
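The negative Levenshtein reward used in the RL objective above can be sketched with the standard dynamic-programming edit distance. This is an illustrative implementation under that assumption, not the authors' code; the function names are hypothetical.

```python
# Sketch: reward = -(minimum edit distance) between a sampled
# transcription and the reference, as used in REINFORCE-style training.

def levenshtein(ref: str, hyp: str) -> int:
    # Classic dynamic-programming edit distance over insert/delete/substitute,
    # keeping only the previous row of the DP table.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def reward(ref: str, sampled: str) -> int:
    # Higher is better: a perfect transcription receives reward 0,
    # every edit away from the reference costs -1.
    return -levenshtein(ref, sampled)
```

Because the reward is bounded above by zero and decreases with each edit, maximizing the expected reward is equivalent to minimizing the expected edit distance, i.e., directly optimizing the evaluation metric.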
The output y, which is a speech transcription sequence, can be either a phoneme or a grapheme (character) sequence.

Figure 1 shows the overall structure of the attention-based encoder-decoder model, which consists of encoder, decoder, and attention modules. (Fig. 1. Attention-based encoder-decoder architecture.) The encoder processes the input sequence x and outputs representative information h^E = [h^E_1, ..., h^E_S] for the decoder. The attention module is an extension scheme that helps the decoder find relevant information on the encoder side based on the current decoder hidden states [2]. An attention module produces context information c_t at time t based on the encoder and decoder hidden states with the following equations:

c_t = \sum_{s=1}^{S} a_t(s) \, h^E_s  (1)

a_t(s) = \mathrm{Align}(h^E_s, h^D_t) = \frac{\exp(\mathrm{Score}(h^E_s, h^D_t))}{\sum_{s'=1}^{S} \exp(\mathrm{Score}(h^E_{s'}, h^D_t))}  (2)

There are several variations for the score function:

\mathrm{Score}(h^E_s, h^D_t) =
\begin{cases}
\langle h^E_s, h^D_t \rangle, & \text{dot product} \\
h_s^{E\top} W_s h^D_t, & \text{bilinear} \\
V_s^{\top} \tanh(W_s [h^E_s, h^D_t]), & \text{MLP}
\end{cases}  (3)

where Score : (R^M × R^N) → R, M is the number of hidden units for the encoder, and N is the number of hidden units for the decoder. Finally, the decoder task, which predicts the target sequence probability at time t based on the previous output and the context information c_t, can be formulated as:

\log P(y|x; \theta) = \sum_{t=1}^{T} \log P(y_t | h^D_t, c_t; \theta)  (4)

where h^D_t is the last decoder layer that contains summarized information from all previous input y
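Equations (1)-(2) can be sketched in a few lines of NumPy using the dot-product score (which assumes M = N). The states here are random stand-ins for learned encoder/decoder hidden states; the function name is illustrative.

```python
import numpy as np

# Minimal sketch of Eqs. (1)-(2) with the dot-product score variant.

def attention_context(h_enc, h_dec_t):
    # h_enc: (S, M) encoder states h^E_1..h^E_S; h_dec_t: (M,) decoder state h^D_t
    scores = h_enc @ h_dec_t                      # Score(h^E_s, h^D_t), dot product
    scores = scores - scores.max()                # shift for numerical stability
    a_t = np.exp(scores) / np.exp(scores).sum()   # Eq. (2): softmax alignment weights
    c_t = a_t @ h_enc                             # Eq. (1): weighted sum of h^E_s
    return a_t, c_t

rng = np.random.default_rng(1)
h_enc = rng.standard_normal((6, 4))   # S = 6 encoder states, M = 4 hidden units
h_dec_t = rng.standard_normal(4)
a_t, c_t = attention_context(h_enc, h_dec_t)
```

By construction the alignment weights a_t(s) are positive and sum to one, so c_t is a convex combination of the encoder states; the bilinear and MLP scores of Eq. (3) would only change how `scores` is computed.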
