Speech recognition with quaternion neural networks


Authors: Titouan Parcollet, Mirco Ravanelli, Mohamed Morchid, Georges Linarès, Renato De Mori

Titouan Parcollet (1,4), Mirco Ravanelli (2), Mohamed Morchid (1), Georges Linarès (1), Renato De Mori (1,3)

(1) LIA, Université d'Avignon, France; (2) MILA, Université de Montréal, Québec, Canada; (3) McGill University, Québec, Canada; (4) Orkis, Aix-en-Provence, France

titouan.parcollet@alumni.univ-avignon.fr, mirco.ravanelli@gmail.com, firstname.lastname@univ-avignon.fr, rdemori@cs.mcgill.ca

32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.

Abstract

Neural network architectures are at the core of powerful automatic speech recognition (ASR) systems. However, while recent research focuses on novel model architectures, the acoustic input features remain almost unchanged. Traditional ASR systems rely on multidimensional acoustic features, such as the Mel filter bank energies together with their first and second order derivatives, to characterize the time-frames that compose the signal sequence. Considering that these components describe three different views of the same element, neural networks have to learn both the internal relations that exist within these features and the external or global dependencies that exist between the time-frames. Quaternion-valued neural networks (QNN) have recently received important interest from researchers for processing and learning such relations in multidimensional spaces. Indeed, quaternion numbers and QNNs have shown their efficiency at processing multidimensional inputs as entities, encoding internal dependencies, and solving many tasks with up to four times fewer learning parameters than real-valued models. We propose to investigate modern quaternion-valued models, such as convolutional and recurrent quaternion neural networks, in the context of speech recognition with the TIMIT dataset. The experiments show that QNNs always outperform equivalent real-valued models with far fewer free parameters, leading to a more efficient, compact, and expressive representation of the relevant information.

1 Introduction

During the last decade, deep neural networks (DNN) have encountered wide success in automatic speech recognition. Many architectures, such as recurrent (RNN) [34, 15, 1, 31, 13], time-delay (TDNN) [39, 28], or convolutional neural networks (CNN) [42], have been proposed and have achieved better performance than traditional hidden Markov models (HMM) combined with Gaussian mixture models (GMM) on different speech recognition tasks. However, despite such evolution of models and paradigms, the acoustic feature representation remains almost the same. The acoustic signal is commonly split into time-frames, for which Mel filter bank energies or Mel frequency scaled cepstral coefficients (MFCC) [8] are extracted, alongside the first and second order derivatives. In fact, time-frames are characterized by 3-dimensional features whose components represent three different views of the same basic element. Consequently, an efficient neural-network-based model has to learn both the external dependencies between time-frames and the internal relations within the features. Traditional real-valued architectures deal with both dependencies at the same level, due to the lack of a dedicated mechanism to learn the internal and external relations separately.
Quaternions are hypercomplex numbers that contain a real and three separate imaginary components, fitting perfectly to three- and four-dimensional feature vectors, such as those used in image processing and robot kinematics [35, 29, 4]. The idea of bundling groups of numbers into separate entities is also exploited by the recent capsule networks [33]. Contrary to traditional homogeneous representations, capsule and quaternion neural networks bundle sets of features together. Thereby, quaternion numbers allow neural models to code latent inter-dependencies between groups of input features during the learning process, with up to four times fewer parameters than real-valued neural networks, by taking advantage of the Hamilton product as the equivalent of the dot product between quaternions. Early applications of quaternion-valued backpropagation algorithms [3, 2] have efficiently solved quaternion function approximation tasks. More recently, neural networks of complex and hypercomplex numbers have received increasing attention [16, 38, 7, 40], and some efforts have shown promising results in different applications. In particular, a deep quaternion network [24, 25, 37] and a deep quaternion convolutional network [5, 27] have been successfully employed for challenging tasks such as image and language processing.

Contributions: This paper proposes to evaluate previously investigated quaternion-valued models in two different realistic speech recognition settings, to see whether the quaternion encoding of the signal, alongside the quaternion algebra and the important parameter reduction, helps to better capture the nature of the acoustic signal, leading to a more expressive representation of the information. Based on the TIMIT [10] phoneme recognition task, a quaternion convolutional neural network (QCNN) is compared to a real-valued CNN in an end-to-end framework, and a quaternion recurrent neural network (QRNN) is compared to an RNN within a more traditional HMM-based system. In the end-to-end approach, the experiments show that the QCNN outperforms the CNN with a phoneme error rate (PER) of 19.5% against 20.6% for the CNN. Moreover, the QRNN outperforms the RNN with a PER of 18.5% against 19.0% for the RNN. Furthermore, these results are obtained with a reduction of the number of neural network parameters by a factor of up to 3.96.

2 Motivations

A major challenge of current machine learning models is to obtain efficient representations of the information relevant for solving a specific task. Consequently, a good model has to efficiently code both the relations that occur at the feature level, such as between the Mel filter energies and the first and second order derivative values of a single time-frame, and those at a global level, such as phonemes or words described by a group of time-frames. Moreover, to avoid overfitting, better generalize, and be more efficient, such models also have to be as small as possible. Nonetheless, real-valued neural networks usually require a huge set of parameters to perform well on speech recognition tasks, and hardly code internal dependencies within the features, since these are considered at the same level as global dependencies during learning. In the following, we detail the motivations for employing quaternion-valued neural networks instead of real-valued ones to code inter- and intra-feature dependencies with fewer parameters.
First, a better representation of multidimensional data has to be explored to naturally capture the internal relations within the input features. For example, an efficient way to represent the information composing an acoustic signal sequence is to consider each time-frame as a whole entity of three strongly related elements, instead of a group of unidimensional elements that could be related to each other, as in traditional real-valued neural networks. Indeed, with a real-valued NN, the latent relations between the Mel filter bank energies and the first and second order derivatives of a given time-frame are hardly coded in the latent space, since the weights have to find these relations among all the time-frames composing the sequence. Quaternions are four-dimensional entities and allow one to build and process elements made of up to four components, mitigating the problem described above. Indeed, the quaternion algebra, and more precisely the Hamilton product, allows quaternion neural networks to capture these internal latent relations within the features of a quaternion. It has been shown that QNNs are able to restore the spatial relations within 3D coordinates [20] and within color pixels [17], where real-valued NNs failed. In fact, the quaternion weight components are shared across multiple quaternion input parts during the Hamilton product, creating relations within the elements. Indeed, Figure 1 shows that, in a real-valued layer (left), the multiple weights required to code latent relations within a feature are considered at the same level as those coding global relations between different features, while the quaternion weight w codes these internal relations within a unique quaternion Q_out during the Hamilton product (right).

[Figure 1: Illustration of the ability of a quaternion-valued layer (right) to learn the latent relations within the input features (Q_in), due to the quaternion weight sharing of the Hamilton product (Eq. 5), compared to a standard real-valued layer (left).]

Second, quaternion neural networks make it possible to deal with the same signal dimension as real-valued NNs, but with four times fewer neural parameters. Indeed, a 4-number quaternion weight linking two 4-number quaternion units only has 4 degrees of freedom, whereas a standard neural network parametrization has 4 × 4 = 16, i.e., a 4-fold saving in memory. Therefore, the natural multidimensional representation of quaternions, alongside their ability to drastically reduce the number of parameters, indicates that hypercomplex numbers are a better fit than real numbers to create more efficient models in multidimensional spaces such as speech recognition. Indeed, modern automatic speech recognition systems usually employ input sequences composed of multidimensional acoustic features, such as log Mel features, that are often enriched with their first, second, and third time derivatives [8, 9] to integrate contextual information. In standard NNs, static features are simply concatenated with their derivatives to form a large input vector, without effectively considering that signal derivatives represent different views of the same input. Nonetheless, it is crucial to consider that these descriptors represent a particular state of a time-frame and are thus correlated; a quaternion bundles them into a single entity, as sketched below.
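As an illustration of this bundled view (formalized later in Eq. (11), Section 4.1), the following minimal NumPy sketch stacks log Mel energies and their time derivatives into one quaternion entity per filter band. This is a sketch under stated assumptions, not the paper's pipeline: the shapes and the use of np.gradient as a stand-in for Kaldi-style delta features are illustrative choices.

```python
# Sketch (not the authors' code): bundle log Mel energies and their time
# derivatives into quaternion entities. np.gradient is an assumed stand-in
# for Kaldi-style delta features.
import numpy as np

def acoustic_quaternions(log_mel):
    """log_mel: (T, 40) log Mel filter bank energies over T time-frames.

    Returns (T, 40, 4): for each frequency band, the quaternion
    [energy, delta, delta-delta, delta-delta-delta] instead of a flat
    160-dimensional concatenation of correlated views.
    """
    d1 = np.gradient(log_mel, axis=0)  # slope view
    d2 = np.gradient(d1, axis=0)       # concavity view
    d3 = np.gradient(d2, axis=0)       # rate of change of the concavity
    return np.stack([log_mel, d1, d2, d3], axis=-1)

x = acoustic_quaternions(np.random.randn(100, 40))
print(x.shape)  # (100, 40, 4): 40 quaternions per frame, not 160 scalars
```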
Following the above motivations and the results observed in previous works on quaternion neural networks, we hypothesize that, for acoustic data, quaternion NNs naturally provide a more suitable representation of the input sequence, since these multiple views can be directly embedded in the multiple dimensions of the quaternion, leading to smaller and more accurate models.

3 Quaternion Neural Networks

Real-valued neural network architectures are extended to the quaternion domain to benefit from its capacities. Therefore, this section introduces the quaternion algebra (Section 3.1), the quaternion internal representation (Section 3.2), quaternion convolutional neural networks (QCNN, Section 3.3), and quaternion recurrent neural networks (QRNN, Section 3.4).

3.1 Quaternion Algebra

The quaternion algebra H defines operations between quaternion numbers. A quaternion Q is an extension of a complex number defined in a four-dimensional space as:

$$Q = r1 + x\mathbf{i} + y\mathbf{j} + z\mathbf{k}, \tag{1}$$

where r, x, y, and z are real numbers, and 1, i, j, and k are the quaternion unit basis. In a quaternion, r is the real part, while xi + yj + zk, with i² = j² = k² = ijk = −1, is the imaginary part, or the vector part. Such a definition can be used to describe spatial rotations. The information embedded in the quaternion Q can be summarized in the following matrix of real numbers, which turns out to be more suitable for computations:

$$Q_{mat} = \begin{pmatrix} r & -x & -y & -z \\ x & r & -z & y \\ y & z & r & -x \\ z & -y & x & r \end{pmatrix}. \tag{2}$$

The conjugate Q* of Q is defined as:

$$Q^* = r1 - x\mathbf{i} - y\mathbf{j} - z\mathbf{k}. \tag{3}$$

Then, a normalized, or unit, quaternion is expressed as:

$$Q^{\triangleleft} = \frac{Q}{\sqrt{r^2 + x^2 + y^2 + z^2}}. \tag{4}$$

Finally, the Hamilton product ⊗ between two quaternions Q1 and Q2 is computed as follows:

$$Q_1 \otimes Q_2 = (r_1 r_2 - x_1 x_2 - y_1 y_2 - z_1 z_2) + (r_1 x_2 + x_1 r_2 + y_1 z_2 - z_1 y_2)\mathbf{i} + (r_1 y_2 - x_1 z_2 + y_1 r_2 + z_1 x_2)\mathbf{j} + (r_1 z_2 + x_1 y_2 - y_1 x_2 + z_1 r_2)\mathbf{k}. \tag{5}$$

The Hamilton product is used in QRNNs to perform transformations of vectors representing quaternions, as well as scaling and interpolation between two rotations following a geodesic over a sphere in the R³ space, as shown in [21].

3.2 Quaternion internal representation

In a quaternion layer, all parameters are quaternions, including inputs, outputs, weights, and biases. The quaternion algebra is implemented by manipulating matrices of real numbers [27]. Consequently, for each input vector of size N and output vector of size M, the dimensions are split into four parts: the first equals r, the second x i, the third y j, and the last z k, composing a quaternion Q = r1 + xi + yj + zk. In the real-valued space, the inference process is based on the dot product between input features and weight matrices. In any quaternion-valued NN, this operation is replaced with the Hamilton product (Eq. 5) with quaternion-valued matrices (i.e., each entry of the weight matrix is a quaternion).
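To make the algebra concrete, here is a minimal NumPy implementation of the Hamilton product of Eq. (5). It is a sketch under an assumed array layout (the four components stacked on the last axis); the authors' linked PyTorch repository is the reference implementation.

```python
# Sketch of the Hamilton product (Eq. 5); the (r, x, y, z) last-axis layout
# is an assumption, not necessarily the convention of the authors' repository.
import numpy as np

def hamilton_product(q1, q2):
    """q1, q2: arrays of shape (..., 4) holding (r, x, y, z) components."""
    r1, x1, y1, z1 = np.moveaxis(q1, -1, 0)
    r2, x2, y2, z2 = np.moveaxis(q2, -1, 0)
    return np.stack([
        r1 * r2 - x1 * x2 - y1 * y2 - z1 * z2,  # real part
        r1 * x2 + x1 * r2 + y1 * z2 - z1 * y2,  # i component
        r1 * y2 - x1 * z2 + y1 * r2 + z1 * x2,  # j component
        r1 * z2 + x1 * y2 - y1 * x2 + z1 * r2,  # k component
    ], axis=-1)

# The product is not commutative: ij = k but ji = -k.
i = np.array([0., 1., 0., 0.])
j = np.array([0., 0., 1., 0.])
print(hamilton_product(i, j))  # [0. 0. 0. 1.]
print(hamilton_product(j, i))  # [ 0.  0.  0. -1.]
```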
3.3 Quaternion convolutional neural networks

Convolutional neural networks (CNN) [19] have been proposed to capture the high-level relations that occur between neighbouring features, such as shapes and edges in an image. However, internal dependencies within the features are considered at the same level as these high-level relations by real-valued CNNs, and it is thus not guaranteed that they are well captured. To this end, a quaternion convolutional neural network (QCNN) has been proposed by [5, 27]¹.

¹ https://github.com/Orkis-Research/Pytorch-Quaternion-Neural-Networks

Let γ_ab^l and S_ab^l be the quaternion output and the pre-activation quaternion output at layer l and at the indices (a, b) of the new feature map, and let w be the quaternion-valued weight filter map of size K × K. A formal definition of the convolution process is:

$$\gamma_{ab}^{l} = \alpha(S_{ab}^{l}), \tag{6}$$

with

$$S_{ab}^{l} = \sum_{c=0}^{K-1} \sum_{d=0}^{K-1} w^{l} \otimes \gamma_{(a+c)(b+d)}^{l-1}, \tag{7}$$

where α is a quaternion split activation function [41] defined as:

$$\alpha(Q) = f(r) + f(x)\mathbf{i} + f(y)\mathbf{j} + f(z)\mathbf{k}, \tag{8}$$

with f corresponding to any standard activation function. The output layer of a quaternion neural network is commonly either quaternion-valued, as for quaternion approximation [2], or real-valued to obtain a posterior distribution based on a softmax function following the split approach of Eq. 8. Indeed, target classes are often expressed as real numbers. Finally, the full derivation of the backpropagation algorithm for quaternion-valued neural networks can be found in [23].

3.4 Quaternion recurrent neural networks

Despite the fact that CNNs are efficient at detecting and learning patterns in an input volume, recurrent neural networks (RNN) are better adapted to represent sequential data. Indeed, recurrent neural networks have obtained state-of-the-art results on many tasks related to speech recognition [32, 12]. Therefore, a quaternion version of the RNN, called QRNN, has been proposed by [26]².

² https://github.com/Orkis-Research/Pytorch-Quaternion-Neural-Networks

Let us define a QRNN with a hidden state composed of H neurons. Then, let w_hh, w_hγ, and w_γh be the hidden-to-hidden, input-to-hidden, and hidden-to-output weight matrices respectively, and let b_n^l be the bias at neuron n and layer l. With the same parameters as for the QCNN, the hidden state h_n^{t,l} of neuron n at timestep t and layer l is computed as:

$$h_{n}^{t,l} = \alpha\left(\sum_{m=0}^{H} w_{nm,hh}^{l} \otimes h_{m}^{t-1,l} + \sum_{m=0}^{N^{l-1}} w_{nm,h\gamma}^{l} \otimes \gamma_{m}^{t,l-1} + b_{n}^{l}\right), \tag{9}$$

with α any split activation function. Finally, the output of neuron n is computed as:

$$\gamma_{n}^{t,l} = \beta\left(\sum_{m=0}^{N^{l-1}} w_{nm,\gamma h}^{l} \otimes h_{m}^{t,l} + b_{n}^{l}\right), \tag{10}$$

with β any split activation function. The full derivation of the backpropagation through time of the QRNN can be found in [26].
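The real-matrix view of Section 3.2 gives a compact way to prototype Eq. (9). The sketch below (assumed shapes and names, not the authors' API) assembles the weight matrix of Eq. (2) blockwise so that an ordinary matrix product performs the summed Hamilton products, then applies a split activation (Eq. 8).

```python
# Sketch of one QRNN step (Eq. 9) via the real-matrix form of the Hamilton
# product (Eq. 2), since quaternion layers are implemented with real
# matrices [27]. Shapes and names are assumptions, not the authors' API.
import numpy as np

def quat_weight(r, x, y, z):
    """Assemble a (4H, 4N) real matrix from (H, N) component matrices.

    Acting on a vector laid out as [all r, all x, all y, all z] (the split
    of Section 3.2), the product computes sum_m w_nm ⊗ q_m for every output
    neuron n, i.e. the Hamilton products plus the usual summation.
    """
    return np.block([
        [r, -x, -y, -z],
        [x,  r, -z,  y],
        [y,  z,  r, -x],
        [z, -y,  x,  r],
    ])

def qrnn_step(x_t, h_prev, W_hx, W_hh, b, f=np.tanh):
    """One recurrence step; the split activation (Eq. 8) applies f to each
    quaternion component, i.e. element-wise f in this representation."""
    return f(W_hx @ x_t + W_hh @ h_prev + b)

H, N = 2, 3  # quaternion hidden/input sizes (8 and 12 real dimensions)
W_hx = quat_weight(*(0.1 * np.random.randn(4, H, N)))
W_hh = quat_weight(*(0.1 * np.random.randn(4, H, H)))
h = qrnn_step(np.random.randn(4 * N), np.zeros(4 * H), W_hx, W_hh, np.zeros(4 * H))
print(h.shape)  # (8,) = 4 components x H quaternion neurons
```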
4 Experiments on TIMIT

Quaternion-valued models are compared to their real-valued equivalents on two different benchmarks based on the TIMIT phoneme recognition task [10]. First, an end-to-end approach is investigated, comparing QCNNs to CNNs (Section 4.2). Then, a more traditional and powerful method based on QRNNs compared to RNNs, together with HMM decoding, is explored (Section 4.3). In both experiments, training is performed on the standard 3,696 sentences uttered by 462 speakers, while testing is conducted on 192 sentences uttered by 24 speakers of the TIMIT dataset. A validation set composed of 400 sentences uttered by 50 speakers is used for hyper-parameter tuning. All results are averaged over three runs (3 folds) to alleviate any variation due to random initialization.

4.1 Acoustic quaternions

The end-to-end and HMM-based experiments share the same quaternion input vectors extracted from the acoustic signal. The raw audio is first transformed into 40-dimensional log Mel filter bank coefficients using the pytorch-kaldi³ toolkit and the Kaldi s5 recipes [30]. Then, the first, second, and third order derivatives are extracted. Consequently, an acoustic quaternion Q(f, t) associated with a frequency f and a time-frame t is formed as:

$$Q(f,t) = e(f,t) + \frac{\partial e(f,t)}{\partial t}\mathbf{i} + \frac{\partial^2 e(f,t)}{\partial t^2}\mathbf{j} + \frac{\partial^3 e(f,t)}{\partial t^3}\mathbf{k}. \tag{11}$$

³ https://github.com/mravanelli/pytorch-kaldi

Q(f, t) represents multiple views of a frequency f at time-frame t, consisting of the energy e(f, t) in the filter band at frequency f, its first time derivative describing a slope view, its second time derivative describing a concavity view, and the third derivative describing the rate of change of the second derivative. Quaternions are used to learn the spatial relations that exist between the different views that characterize the same frequency. Thus, the quaternion input vector length is 160/4 = 40.

4.2 Towards end-to-end phoneme recognition

End-to-end systems are at the heart of modern research in the speech recognition domain [42]. The task is particularly difficult due to the differences that exist between the raw or pre-processed acoustic signal used as input features and the words or phonemes expected at the output. Indeed, the two are not defined on the same time-scale, and an automatic alignment method has to be defined. This section evaluates the QCNN against a traditional CNN in an end-to-end model based on the connectionist temporal classification (CTC) method, to see whether the quaternion encoding of the signal, alongside the quaternion algebra, helps to better capture the nature of the acoustic signal and therefore to better generalize.

4.2.1 Connectionist Temporal Classification

In the acoustic modeling part of ASR systems, the task of sequence-to-sequence mapping from an input acoustic signal X = [x1, ..., xn] to a sequence of symbols T = [t1, ..., tm] is complex because:

• X and T can be of arbitrary lengths;
• the alignment between X and T is unknown in most cases. In particular, T is usually shorter than X in terms of phoneme symbols.

To alleviate these problems, connectionist temporal classification (CTC) has been proposed [11]. First, a softmax is applied at each timestep, or frame, providing the probability of emitting each symbol at that timestep. This yields a probability over symbol sequences P(O | X), with O = [o1, ..., on] in the latent space O. A blank symbol '−' is introduced as an extra label to allow the classifier to deal with the unknown alignment. Then, O is transformed into the final output sequence with a many-to-one function g(O) defined, for example, as follows:

$$g(z_1, z_2, -, z_3, -) = g(z_1, z_2, z_3, z_3, -) = g(z_1, -, z_2, z_3, z_3) = (z_1, z_2, z_3). \tag{12}$$

Consequently, the probability of the output sequence is a summation over the probabilities of all possible alignments between X and T after applying the function g(O). Following [11], the parameters of the models are learned based on the cross-entropy loss function:

$$\sum_{(X,T) \in \text{train}} -\log P(O \mid X). \tag{13}$$

During inference, a best-path decoding algorithm is performed: the latent sequence with the highest probability is obtained by taking the argmax of the softmax output at each timestep, and the final sequence is obtained by applying the function g(·) to this latent sequence.
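As a pocket illustration of this inference step, the following sketch performs best-path decoding followed by the collapse g of Eq. (12). The blank index and the (T frames × V symbols) posteriors layout are assumptions made for the example.

```python
# Sketch of best-path CTC decoding: per-frame argmax, then the many-to-one
# collapse g (Eq. 12): merge consecutive repeats, drop blanks. The blank
# index is an assumption.
import numpy as np

BLANK = 0  # assumed index of the blank symbol '-'

def ctc_best_path(posteriors):
    """posteriors: (T, V) softmax outputs, one row per time-frame."""
    path = np.argmax(posteriors, axis=1)  # most likely symbol per frame
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:  # a blank separates repeated symbols
            out.append(int(s))
        prev = s
    return out

# e.g. the frame-level path (z1, z1, -, z1, z2) collapses to (z1, z1, z2)
```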
4.2.2 Model Architectures

A first 2D convolutional layer is followed by a max-pooling layer along the frequency axis to reduce the internal dimension. Then, n = 10 further 2D convolutional layers are included, together with three dense layers of sizes 1024 and 256 for the real- and quaternion-valued models respectively. Indeed, the output of a dense quaternion-valued layer has 256 × 4 = 1024 nodes and is four times larger than its number of units. The filter size is rectangular, (3, 5), and padding is applied to keep the sequence and signal sizes unaltered. The number of feature maps varies from 32 to 256 for the real-valued models and from 8 to 64 for the quaternion-valued models. Indeed, the number of output feature maps is four times larger in the QCNN due to the quaternion convolution, meaning that 32 quaternion-valued feature maps (FM) correspond to 128 real-valued ones. Therefore, for a fair comparison, the number of feature maps is expressed in the real-valued space (e.g., 256 real-valued FM correspond to 256/4 = 64 quaternion-valued neurons). The PReLU activation function is employed for both models [14]. A dropout of 0.2 and an L2 regularization of 1e−5 are used across all layers, except the input and output ones. CNNs and QCNNs are trained with the RMSProp optimizer with vanilla hyperparameters [18] for 100 epochs. The learning rate starts at 8 · 10⁻⁴ and is halved every time the results observed on the validation set do not improve. Quaternion parameters, including weights and biases, are initialized following the adapted quaternion initialization scheme provided in [26]. Finally, the standard CTC loss function defined in [11] and implemented in [6] is applied.

4.2.3 Results and discussions

End-to-end results of the QCNN and CNN are reported in Table 1. In agreement with our hypothesis, one may notice an important difference in the number of learning parameters between the real- and quaternion-valued CNNs. The explanation comes from the quaternion algebra: a dense layer with 1,024 input values and 1,024 hidden units contains 1,024² ≈ 1M parameters, while the quaternion equivalent needs only 256² × 4 ≈ 0.26M parameters to deal with the same signal dimension. Such a reduction in the number of parameters has multiple positive impacts on the model: first, a smaller memory footprint for embedded and limited devices; second, as demonstrated in Table 1, a better generalization ability leading to better performance. Indeed, the best PER observed in realistic conditions (i.e., with respect to the development PER) is 19.5% for the QCNN compared to 20.6% for the CNN, an absolute improvement of 1.1% with the QCNN. These results are obtained with 32.1M parameters for the CNN and only 8.1M for the QCNN, a 3.96-fold reduction of the number of parameters.

Table 1: Phoneme error rate (PER%) of CNN and QCNN models on the development and test sets of the TIMIT dataset. "Params" stands for the total number of trainable parameters, and "FM" for the feature map size.

Models | FM  | Dev. | Test | Params
CNN    | 32  | 22.0 | 23.1 | 3.4M
       | 64  | 19.6 | 20.7 | 5.4M
       | 128 | 19.6 | 20.8 | 11.5M
       | 256 | 19.0 | 20.6 | 32.1M
QCNN   | 32  | 22.3 | 23.3 | 0.9M
       | 64  | 19.9 | 20.5 | 1.4M
       | 128 | 18.9 | 19.9 | 2.9M
       | 256 | 18.2 | 19.5 | 8.1M

It is worth noticing that, with far fewer learning parameters for a given architecture, the QCNN always performs better than its real-valued counterpart. Consequently, the quaternion-valued convolutional approach offers an alternative to traditional real-valued end-to-end models that is both more efficient and more accurate.
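As a quick sanity check of the parameter-count argument above, a back-of-the-envelope computation (biases ignored) reproduces the roughly four-fold saving:

```python
# Dense layer parameter counts, biases ignored: 1,024 -> 1,024 real units
# versus 256 -> 256 quaternion units covering the same 1,024 dimensions.
real_params = 1024 * 1024         # one scalar weight per input/output pair
quat_params = 256 * 256 * 4       # one 4-component quaternion weight per pair
print(real_params)                # 1048576  (~1.0M)
print(quat_params)                # 262144   (~0.26M)
print(real_params / quat_params)  # 4.0 -> the four-fold saving
```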
One drawback is training speed: due to the higher number of computations involved in the Hamilton product and to the current lack of properly engineered implementations, the QCNN is slower to train than the CNN. Nonetheless, this behavior can be alleviated with a dedicated CUDA implementation of the Hamilton product. Indeed, this operation is a matrix product and can thus benefit from the parallel computation of GPUs, so a proper implementation of the Hamilton product would lead to a higher and more efficient usage of GPUs.

4.3 HMM-based phoneme recognition

A conventional ASR pipeline based on an HMM decoding process together with recurrent neural networks is also investigated to reach state-of-the-art results on the TIMIT task. While the input features remain the same as in the end-to-end experiments, RNNs and QRNNs are trained to predict the HMM states, which are then decoded with the standard Kaldi recipes [30]. As hypothesized for the end-to-end solution, QRNN models are expected to generalize better than RNNs due to their specific algebra.

4.3.1 Model Architectures

RNN and QRNN models are compared with a fixed number of layers M = 4 and by varying the number of neurons N from 256 to 2,048 and from 64 to 512 for the RNN and QRNN respectively. Indeed, as demonstrated in the previous experiments, the same number of hidden neurons in the quaternion and real spaces does not correspond to the same number of real-valued dimensions. Tanh activations are used across all layers, except for the output layer, which is based on a softmax function. Models are optimized with RMSProp [18] with vanilla hyper-parameters and an initial learning rate of 8 · 10⁻⁴. The learning rate is progressively annealed with a halving factor of 0.5, applied whenever no performance improvement on the validation set is observed. The models are trained for 25 epochs. A dropout rate of 0.2 is applied over all hidden layers [36], except the output one. The negative log-likelihood is used as the objective function. As for QCNNs, quaternion parameters are initialized based on [26]. Finally, decoding is based on Kaldi [30] and weighted finite-state transducers (WFST) [22] that integrate acoustic, lexicon, and language model probabilities into a single HMM-based search graph.

4.3.2 Results and discussions

The results of the QRNNs and RNNs, combined with an HMM decoding phase, are presented in Table 2. A best test PER of 18.5% is reported for the QRNN, compared to 19.0% for the RNN, with respect to the best development PER. These results are obtained with 3.8M and 9.4M parameters for the QRNN and RNN respectively, for an equal hidden dimension of 1,024, giving a 2.47-fold reduction of the number of parameters. As in the previous experiments, the QRNN always outperforms the equivalent architecture in terms of PER, with significantly fewer learning parameters. It is also important to notice that both models tend to overfit with larger architectures. However, this phenomenon is mitigated by the small number of free parameters of the QRNN: a QRNN with a hidden dimension of 2,048 has only 11.2M parameters, compared to 33.4M for an equivalently sized RNN, leading to fewer degrees of freedom and therefore less overfitting.

Table 2: Phoneme error rate (PER%) of QRNN and RNN models on the development and test sets of the TIMIT dataset. "Params" stands for the total number of trainable parameters, and "Hidden dim." represents the dimension of the hidden layers.
As an example, a QRNN layer with "Hidden dim." = 512 is equivalent to 128 hidden quaternion neurons.

Models | Hidden dim. | Dev. | Test | Params
RNN    | 256   | 22.4 | 23.4 | 1M
       | 512   | 19.6 | 20.4 | 2.8M
       | 1,024 | 17.9 | 19.0 | 9.4M
       | 2,048 | 20.0 | 20.7 | 33.4M
QRNN   | 256   | 23.6 | 23.9 | 0.6M
       | 512   | 19.2 | 20.1 | 1.4M
       | 1,024 | 17.4 | 18.5 | 3.8M
       | 2,048 | 17.5 | 18.7 | 11.2M

The reported results show that the QRNN is a better framework than the real-valued RNN for ASR systems dealing with conventional multidimensional acoustic features. Indeed, the QRNN performs better with fewer parameters, leading to a more efficient representation of the information.

5 Conclusion

Summary. This paper investigates novel quaternion-valued architectures in two different speech recognition settings on the TIMIT phoneme recognition task. The experiments show that the quaternion approaches always outperform their real-valued equivalents in both benchmarks, with a reduction of the number of learning parameters by a factor of up to 3.96. It has been shown that the appropriate multidimensional quaternion representation of acoustic features, together with the Hamilton product, helps QCNNs and QRNNs to learn both the internal and external relations that exist within the features, leading to a better generalization capability and to a more efficient representation of the relevant information, with significantly fewer free parameters than traditional real-valued neural networks.

Future Work. A future investigation will be to develop multi-view features that help decrease ambiguities in representing phonemes in the quaternion space. To this end, a recent approach based on a quaternion Fourier transform to create quaternion-valued signals remains to be investigated.

References

[1] Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, and Gerald Penn. Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 4277–4280. IEEE, 2012.

[2] Paolo Arena, Luigi Fortuna, Giovanni Muscato, and Maria Gabriella Xibilia. Multilayer perceptrons to approximate quaternion valued functions. Neural Networks, 10(2):335–342, 1997.

[3] Paolo Arena, Luigi Fortuna, Luigi Occhipinti, and Maria Gabriella Xibilia. Neural networks for quaternion-valued function approximation. In Circuits and Systems, ISCAS'94, IEEE International Symposium on, volume 6, pages 307–310. IEEE, 1994.

[4] Nicholas A. Aspragathos and John K. Dimitros. A comparative study of three methods for robot kinematics. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 28(2):135–145, 1998.

[5] Chase Gaudet and Anthony Maida. Deep quaternion networks. arXiv preprint arXiv:1712.04604v2, 2017.

[6] François Chollet et al. Keras. https://github.com/keras-team/keras, 2015.

[7] Ivo Danihelka, Greg Wayne, Benigno Uria, Nal Kalchbrenner, and Alex Graves. Associative long short-term memory. arXiv preprint, 2016.

[8] Steven B. Davis and Paul Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. In Readings in Speech Recognition, pages 65–74. Elsevier, 1990.

[9] Sadaoki Furui. Speaker-independent isolated word recognition based on emphasized spectral dynamics. In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP'86, volume 11, pages 1991–1994. IEEE, 1986.
[10] John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, and David S. Pallett. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Technical Report N, 93, 1993.

[11] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376. ACM, 2006.

[12] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6645–6649. IEEE, 2013.

[13] Klaus Greff, Rupesh K. Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10):2222–2232, 2017.

[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.

[15] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.

[16] Akira Hirose and Shotaro Yoshida. Generalization characteristics of complex-valued feedforward neural networks in relation to signal coherence. IEEE Transactions on Neural Networks and Learning Systems, 23(4):541–551, 2012.

[17] Teijiro Isokawa, Tomoaki Kusakabe, Nobuyuki Matsui, and Ferdinand Peper. Quaternion neural network and its application. In International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pages 318–324. Springer, 2003.

[18] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[19] Yann LeCun, Patrick Haffner, Léon Bottou, and Yoshua Bengio. Object recognition with gradient-based learning. In Shape, Contour and Grouping in Computer Vision, pages 319–345. Springer, 1999.

[20] Nobuyuki Matsui, Teijiro Isokawa, Hiromi Kusamichi, Ferdinand Peper, and Haruhiko Nishimura. Quaternion neural network with geometrical operators. Journal of Intelligent & Fuzzy Systems, 15(3-4):149–164, 2004.

[21] Toshifumi Minemoto, Teijiro Isokawa, Haruhiko Nishimura, and Nobuyuki Matsui. Feed forward neural network with random quaternionic neurons. Signal Processing, 136:59–68, 2017.

[22] Mehryar Mohri, Fernando Pereira, and Michael Riley. Weighted finite-state transducers in speech recognition. Computer Speech and Language, 16(1):69–88, 2002.

[23] Tohru Nitta. A quaternary version of the back-propagation algorithm. In Neural Networks, 1995. Proceedings, IEEE International Conference on, volume 5, pages 2753–2756. IEEE, 1995.

[24] Titouan Parcollet, Mohamed Morchid, Pierre-Michel Bousquet, Richard Dufour, Georges Linarès, and Renato De Mori. Quaternion neural networks for spoken language understanding. In Spoken Language Technology Workshop (SLT), 2016 IEEE, pages 362–368. IEEE, 2016.
[25] Titouan Parcollet, Mohamed Morchid, and Georges Linarès. Deep quaternion neural networks for spoken language understanding. In Automatic Speech Recognition and Understanding Workshop (ASRU), 2017 IEEE, pages 504–511. IEEE, 2017.

[26] Titouan Parcollet, Mirco Ravanelli, Mohamed Morchid, Georges Linarès, Chiheb Trabelsi, Renato De Mori, and Yoshua Bengio. Quaternion recurrent neural networks, 2018.

[27] Titouan Parcollet, Ying Zhang, Mohamed Morchid, Chiheb Trabelsi, Georges Linarès, Renato De Mori, and Yoshua Bengio. Quaternion convolutional neural networks for end-to-end automatic speech recognition. arXiv preprint, 2018.

[28] Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. A time delay neural network architecture for efficient modeling of long temporal contexts. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.

[29] Soo-Chang Pei and Ching-Min Cheng. Color image processing by using binary quaternion-moment-preserving thresholding technique. IEEE Transactions on Image Processing, 8(5):614–628, 1999.

[30] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, December 2011.

[31] Mirco Ravanelli, Philemon Brakel, Maurizio Omologo, and Yoshua Bengio. Improving speech recognition by revising gated recurrent units. Proc. Interspeech 2017, 2017.

[32] Mirco Ravanelli, Philemon Brakel, Maurizio Omologo, and Yoshua Bengio. Light gated recurrent units for speech recognition. IEEE Transactions on Emerging Topics in Computational Intelligence, 2(2):92–102, 2018.

[33] Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. Dynamic routing between capsules. arXiv preprint arXiv:1710.09829v2, 2017.

[34] Haşim Sak, Andrew Senior, and Françoise Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[35] Stephen John Sangwine. Fourier transforms of colour images using quaternion or hypercomplex numbers. Electronics Letters, 32(21):1979–1980, 1996.

[36] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[37] Titouan Parcollet, Mohamed Morchid, and Georges Linarès. Quaternion denoising encoder-decoder for theme identification of telephone conversations. Proc. Interspeech 2017, pages 3325–3328, 2017.

[38] Mark Tygert, Joan Bruna, Soumith Chintala, Yann LeCun, Serkan Piantino, and Arthur Szlam. A mathematical motivation for complex-valued convolutional networks. Neural Computation, 28(5):815–825, 2016.

[39] Alexander Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, and Kevin J. Lang. Phoneme recognition using time-delay neural networks. In Readings in Speech Recognition, pages 393–404. Elsevier, 1990.

[40] Scott Wisdom, Thomas Powers, John Hershey, Jonathan Le Roux, and Les Atlas. Full-capacity unitary recurrent neural networks. In Advances in Neural Information Processing Systems, pages 4880–4888, 2016.
[41] D. Xu, L. Zhang, and H. Zhang. Learning algorithms in quaternion neural networks using GHR calculus. Neural Network World, 27(3):271, 2017.

[42] Ying Zhang, Mohammad Pezeshki, Philémon Brakel, Saizheng Zhang, César Laurent, Yoshua Bengio, and Aaron Courville. Towards end-to-end speech recognition with deep convolutional neural networks. arXiv preprint arXiv:1701.02720, 2017.
