A COMPARABLE STUDY OF MODELING UNITS FOR END-TO-END MANDARIN SPEECH RECOGNITION

Wei Zou, Dongwei Jiang, Shuaijiang Zhao, Xiangang Li
AI Labs, Didi Chuxing, Beijing, China
{zouwei,jiangdongwei,zhaoshuaijiang,lixiangang}@didichuxing.com

Abstract

End-to-end speech recognition has become increasingly popular in Mandarin speech recognition and has achieved promising performance. Mandarin is a tonal language, different from English, and requires special treatment for the acoustic modeling units. There have been several different kinds of modeling units for Mandarin, such as phoneme, syllable and Chinese character. In this work, we explore two major end-to-end models for Mandarin speech recognition: the connectionist temporal classification (CTC) model and the attention-based encoder-decoder model. We compare the performance of three different scales of modeling units: context-dependent phoneme (CDP), syllable with tone and Chinese character. We find that all types of modeling units achieve comparable character error rates (CER) in the CTC model, and that the Chinese character attention model outperforms the syllable attention model. Furthermore, we find that the Chinese character is a reasonable unit for Mandarin speech recognition. On the DidiCallcenter task, the Chinese character attention model achieves a CER of 5.68% and the CTC model a CER of 7.29%; on the DidiReading task, the CERs are 4.89% and 5.79%, respectively. Moreover, the attention model achieves better performance than the CTC model on both datasets.

Index Terms: automatic speech recognition, connectionist temporal classification, attention model, modeling units, Mandarin speech recognition

1. Introduction

Traditional speech recognition systems consist of separate modeling components, including acoustic, phonetic and language models. Because these components are trained separately, errors in each component can propagate through the pipeline.
Besides, building the components requires expert knowledge; for example, building a language model requires linguistic knowledge. The acoustic model recognizes context-dependent (CD) states or phonemes [1, 2], bootstrapping from an existing model used for alignment. The pronunciation model maps phoneme sequences into word sequences, and the language model then scores the word sequences. A weighted finite state transducer (WFST) [3] integrates these models and performs the decoding for the final result.

Recently, end-to-end speech recognition systems have become increasingly popular and achieve promising performance in Mandarin [4]. End-to-end methods predict graphemes directly from the acoustic data without linguistic knowledge, which greatly reduces the effort of building ASR systems and makes it easier to support new languages. End-to-end ASR simplifies the system into a single network architecture, which is likely to be more robust than a multi-module architecture. There are two major types of end-to-end architectures for ASR. The first is the connectionist temporal classification (CTC) criterion [5, 6, 7, 8], which has been used to train end-to-end systems that directly predict grapheme sequences. The other is the attention-based encoder-decoder model [9, 10, 11, 12], which applies an attention mechanism to perform alignment between acoustic frames and recognized symbols.

Attention-based encoder-decoder models have become increasingly popular [13, 7, 14, 15]. These models consist of an encoder network, which maps the input acoustic sequence into a higher-level representation, and an attention-based decoder, which predicts the next output symbol conditioned on the full sequence of previous predictions.
A recent comparison of sequence-to-sequence models for speech recognition [9] has shown that Listen, Attend and Spell (LAS) [16], a typical attention-based approach, offers improvements over other sequence-to-sequence models, and the attention-based encoder-decoder model performs considerably well in Mandarin speech recognition [17].

For Mandarin speech recognition, the modeling units of the acoustic model affect performance significantly [18]. CDP is the most commonly used acoustic modeling unit for Mandarin speech recognition [4]. In fact, there have been several different kinds of modeling units for Mandarin [19], such as phoneme, syllable and Chinese character. Compared with CDP, syllable or character units are easier to use because they require no prior model for alignment. Under the current end-to-end framework, we can obtain the target syllable sequence and character sequence directly from the training transcripts and the lexicon. In particular, with Chinese character models we can obtain the desired results directly, without a lexicon or a language model.

In order to find a more suitable end-to-end system and modeling unit for Mandarin speech recognition, we explore two major end-to-end models: the CTC model and the attention-based encoder-decoder model. Meanwhile, we compare the performance of three different scales of modeling units: context-dependent phoneme (CDP), syllable with tone and Chinese character.

The rest of this paper is organized as follows. Section 2 introduces the details of end-to-end speech recognition. Various modeling units for end-to-end speech recognition in Mandarin are studied in Section 3. Section 4 describes the details of the experiments. Section 5 draws some conclusions and outlines our future work.

2. End-to-End Speech Recognition

Recently, end-to-end speech recognition systems have become increasingly popular and achieve encouraging performance in Mandarin.

2.1.
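To make the contrast between unit scales concrete, the following sketch derives target sequences for character and syllable-with-tone units from a transcript. The lexicon here is a toy, hypothetical mapping for illustration only, not the one used in the paper:

```python
# Toy pronunciation lexicon (hypothetical): character -> tonal syllable.
LEXICON = {
    "你": "ni3",
    "好": "hao3",
}

def char_units(transcript):
    # Chinese character units: the target sequence is simply the characters,
    # so no lexicon or alignment model is needed.
    return list(transcript)

def syllable_units(transcript):
    # Syllable-with-tone units: look each character up in the lexicon.
    return [LEXICON[ch] for ch in transcript]

print(char_units("你好"))      # ['你', '好']
print(syllable_units("你好"))  # ['ni3', 'hao3']
```

This also illustrates why character units remove the dependency on a lexicon: the transcript itself is the target sequence.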
Connectionist Temporal Classification (CTC)

The CTC criterion was proposed by Graves et al. [5] as a way of training end-to-end models without requiring a frame-level alignment of the target labels for a training utterance. To achieve this, an extra blank label, denoted ⟨b⟩, is introduced to map frames and labels to the same length; it can be interpreted as "no target label". CTC computes the conditional probability by marginalizing over all possible alignments, assuming conditional independence between output predictions at different time steps given the aligned inputs.

Given a label sequence y corresponding to the utterance x, where y is typically much shorter than x in speech recognition, let β(y, x) be the set of all sequences consisting of the labels in Y ∪ {⟨b⟩} which are of length |x| = T and which are identical to y after first collapsing consecutive repeated targets and then removing any blank symbols (e.g., A⟨b⟩AA⟨b⟩B → AAB). The CTC model defines the probability of the label sequence conditioned on the acoustics as Equation 1:

P_{CTC}(y \mid x) = \sum_{\hat{y} \in \beta(y, x)} P(\hat{y} \mid x) = \sum_{\hat{y} \in \beta(y, x)} \prod_{t=1}^{T} P(\hat{y}_t \mid x)    (1)

With the conditional independence assumption, P_{CTC}(\hat{y} \mid x) decomposes into a product of the per-frame posteriors P(\hat{y}_t \mid x). These per-frame posteriors can be estimated using a BLSTM, which we refer to as the encoder. The model can be trained to maximize Equation 1 by gradient descent, where the required gradients are computed with the forward-backward algorithm [5].

Because CTC models make a conditional independence assumption on their outputs, it is difficult for them to model the interdependencies between words. Therefore, a language model and a word count term are introduced during the beam search process.
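For tiny examples, the collapsing rule and the marginalization in Equation 1 can be sketched by brute-force enumeration (real systems use the forward-backward algorithm; the label set and per-frame posteriors below are made-up numbers):

```python
import itertools

BLANK = "<b>"

def collapse(path):
    # Collapse consecutive repeats, then drop blanks:
    # A <b> A A <b> B -> A A B (the paper's example).
    merged = [s for s, _ in itertools.groupby(path)]
    return tuple(s for s in merged if s != BLANK)

def ctc_prob(y, frame_posteriors):
    # Equation 1 by brute force: sum the probability of every length-T
    # alignment whose collapsed form equals y, using the conditional
    # independence assumption to multiply per-frame posteriors.
    labels = list(frame_posteriors[0])
    total = 0.0
    for path in itertools.product(labels, repeat=len(frame_posteriors)):
        if collapse(path) == tuple(y):
            prob = 1.0
            for t, symbol in enumerate(path):
                prob *= frame_posteriors[t][symbol]
            total += prob
    return total

# Two frames, two labels; posteriors are illustrative.
posteriors = [{"A": 0.6, BLANK: 0.4}, {"A": 0.7, BLANK: 0.3}]
print(collapse(["A", BLANK, "A", "A", BLANK, "B"]))  # ('A', 'A', 'B')
print(ctc_prob(["A"], posteriors))  # ≈ 0.88, from paths AA, A<b>, <b>A
```

The exponential enumeration is only for illustration; the forward-backward recursion computes the same sum in O(T·|y|) time.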
The beam search process of CTC [20] finds

\arg\max_{y} \left( \log P_{CTC}(y \mid x) + \alpha \log P_{LM}(y) + \beta \,\mathrm{wordcount}(y) \right)    (2)

where a language model and a word count term are included, with weights α and β respectively.

2.2. Attention-based models

Chan et al. [16] proposed Listen, Attend and Spell (LAS), a neural network that learns to transcribe speech utterances to characters. As an attention-based encoder-decoder network, LAS is often used to deal with variable-length input and output sequences. Using the attention mechanism, the attention model can align the input and output sequences. As mentioned in Section 2.1, CTC assumes a monotonic alignment and explicitly marginalizes over alignments. Because of the conditional independence assumption, the CTC model cannot explicitly learn co-articulation patterns, which are common in speech. Attention-based models remove the conditional independence assumption on the label sequence that CTC requires; P(y \mid x) is then defined as Equation 3:

P_{Attention}(y \mid x) = P(y \mid h) = \prod_{t=1}^{T} P(y_t \mid c_t, y_{1:t-1})    (3)

where h is the encoder representation and c_t is the attention context vector at step t.
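The rescoring in Equation 2 can be illustrated with a toy step that picks the best of two finished hypotheses (all scores, counts and weights below are made up for illustration):

```python
def beam_score(log_p_ctc, log_p_lm, word_count, alpha, beta):
    # Equation 2: acoustic score plus weighted LM score and word-count bonus.
    return log_p_ctc + alpha * log_p_lm + beta * word_count

# Hypothetical hypotheses: (log P_CTC, log P_LM, word count).
hypotheses = {
    "hyp_a": (-4.0, -2.0, 2),
    "hyp_b": (-3.5, -5.0, 2),
}
alpha, beta = 0.5, 0.3
best = max(hypotheses, key=lambda h: beam_score(*hypotheses[h], alpha, beta))
print(best)  # hyp_a: its LM score outweighs its slightly worse acoustic score
```

This shows how the LM weight α can flip the ranking relative to the pure CTC score, which is the point of introducing the external terms.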