Augmenting Bottleneck Features of Deep Neural Network Employing Motor State for Speech Recognition at Humanoid Robots


Authors: Moa Lee, Joon Hyuk Chang

Moa Lee and Joon-Hyuk Chang, Senior Member, IEEE

Draft: August 28, 2018

Abstract

In humanoid robots, the internal noise generated by motors, fans, and mechanical components while the robot moves or shakes its body severely degrades speech recognition accuracy. In this paper, a novel speech recognition system robust to ego-noise for humanoid robots is proposed, in which the on/off state of the motor is employed as auxiliary information for finding the relevant input features. For this, we consider bottleneck features, which have been successfully applied to deep neural network (DNN)-based automatic speech recognition (ASR) systems. When learning the bottleneck features, we first exploit the motor on/off state data as supplementary information in addition to the acoustic features as the input of the first DNN, which performs preliminary acoustic modeling. Then, the second DNN, which performs primary acoustic modeling, employs both the bottleneck features passed from the first DNN and the acoustic features. When the proposed method is evaluated in terms of phoneme error rate (PER) on the TIMIT database, the experimental results show that our algorithm achieves a clear improvement (11% relative) over the conventional systems.

Index Terms: Human-robot interaction, bottleneck feature, automatic speech recognition, ego-noise, humanoid robot

I. INTRODUCTION

Automatic speech recognition (ASR) technology, the most natural and intuitive means of communication for human-robot interaction, is becoming more essential because humanoid robots perform actions or respond according to human commands.
Many humanoid and similar robots, including Softbank's Pepper [1], MIT's home robot JIBO [2], and Intel's Jimmy [3], have been developed based on ASR technology. Recently, much research on ASR as an indispensable part of humanoid robots has been actively carried out, but it still remains a challenging problem. Humanoid robots, in particular, generate strong internal noise, which is a significant factor in deteriorating recognition performance. Indeed, because the microphone is closer to the motors and joints than to the human voice source, the internal noise from motors, fans, and mechanical components is recorded loudly by the microphone installed on the robot, especially while the robot is actively moving. This self-created noise in a humanoid robot is referred to as ego-noise [6]; it has not yet been fully addressed, whereas robust speech recognition in the presence of background noise or external interference has been extensively studied thus far [4]-[8].

As for ego-noise suppression, spectral subtraction [9] is one of the common methods. For instance, Ito et al. [4] developed a framewise prediction approach based on a neural network (NN), where the noise spectra are predicted from the angular velocities of the robot's joints. Then, the estimated noise spectra are subtracted from the target signal spectra. One problem with this approach is that the ASR performance becomes quite poor when the noise power is not well estimated, especially when the ego-noise is non-stationary. Several researchers have tackled this problem by predicting and subtracting ego-noise using templates. Nishimura et al. [5] proposed a method to predict the ego-noise of motions such as gestures and walking, using templates obtained from pre-recorded motor noise corresponding to each motion pattern.
With the labeled motion command, the ego-noise template matching the latest motion is selected from the template database and used for subtraction. Ince et al. [6] extended the small noise template database to a larger ego-noise space, in which the template database was enhanced by incorporating more joint-related information such as angular positions, velocities, and accelerations. Schmidt et al. [7] employed the motor data to predict the intrinsic harmonic structure of ego-noise and incorporated the ego-noise harmonics into a multichannel dictionary-based ego-noise reduction approach. These studies on ego-noise reduction show that, unlike in conventional background noise suppression, the instantaneous motor data of the humanoid robot can be used as a secondary information source for dealing with the ego-noise problem.

Recently, we proposed in [8] to use the motor on/off state as auxiliary information when designing the acoustic model of a DNN-based speech recognition system for humanoid robots. However, the auxiliary information is simply designed as a one-hot vector, so the performance gain is limited. In this paper, we propose a new approach based on bottleneck features to further improve the speech recognition performance when using the motor on/off state as auxiliary information. A first DNN is designed to create motor state dependent bottleneck features; its input is formed by concatenating the motor on/off state data with the acoustic features contaminated by the ego-noise. Then, preliminary training is carried out to yield the bottleneck features, which are fed into the input of the second DNN, designed for primary acoustic modeling under ego-noise environments.
Finally, the second DNN is trained with an input that includes both the bottleneck features and the acoustic features to fully represent the complex relationship between the audio signal and the phonemes. To verify the performance of our new approach against the existing methods, extensive experiments are conducted on the TIMIT corpus. The experimental results showed better performance in terms of phoneme error rate (PER) reduction than the baseline models, including [8] and the method that uses only acoustic features.

The rest of this paper is organized as follows. The related work and the proposed methods are described in Section II and Section III, respectively. Section IV presents the experimental setting and shows the results. Then, Section V concludes the work.

II. BOTTLENECK FEATURES

Fig. 1: Structure of autoencoder model.

In the past several years, bottleneck features have been widely used in many tasks, such as speech recognition [10]-[12], audio classification [13], [14], speech synthesis [15], and speaker recognition [16]. The bottleneck features are generated from a multi-layer perceptron (MLP) or deep neural network (DNN) with a middle bottleneck layer that has a small number of hidden units compared to the other hidden layers. This special hidden layer creates a constriction in the network that compresses the task-related (classification or regression) information into a low-dimensional representation. Therefore, the bottleneck features can be considered a nonlinear transformation and dimensionality reduction of the input features.

The bottleneck features can be derived using both unsupervised and supervised methods. In the unsupervised approach, classically, an autoencoder with one hidden layer is trained to predict the input features themselves. The network consists of an encoder and a decoder, as shown in Fig. 1. The autoencoder has three layers (input, hidden, and output).
The input vector $x$ of the autoencoder is encoded to the hidden vector $h$ by a nonlinear activation function $\sigma$, using the learned weight matrix $W^{(1)}$ and bias vector $b^{(1)}$, as follows:

$$h = \sigma\left(W^{(1)} x + b^{(1)}\right). \tag{1}$$

Then, the input vector is decoded from the hidden vector to produce a reconstructed vector $\tilde{x}$ using the learned weight matrix $W^{(2)}$ and bias vector $b^{(2)}$ as follows:

$$\tilde{x} = \sigma\left(W^{(2)} h + b^{(2)}\right). \tag{2}$$

The autoencoder parameter $\theta = \{(W^{(1)}, b^{(1)}), (W^{(2)}, b^{(2)})\}$ is learned using the back-propagation algorithm by minimizing the mean square error (MSE) loss defined as

$$L_{\mathrm{MSE}}(\theta) = \frac{1}{d} \sum_{x \in D} l_{\mathrm{MSE}}(x, \tilde{x}) = \frac{1}{d} \sum_{x \in D} \left\| x - \tilde{x} \right\|^{2}. \tag{3}$$

Fig. 2: Extracting bottleneck features using unsupervised (left) and supervised (right) methods.

Further, a stacked autoencoder can be used to extract bottleneck features, which are progressively encoded using successive hidden layers. First, each layer is pre-trained as a shallow autoencoder, and the learned hidden layer vector $h_l$ is used to learn the next hidden layer $h_{l+1}$. Then, fine-tuning of the entire stack of hidden layers is performed using the back-propagation algorithm. This allows each hidden layer to provide a different level of representation of the input feature. In the stacked autoencoder, the hidden vectors are computed as in (1), for $l = 1, 2, \ldots, L$:

$$h_l = \sigma\left(W^{(l)} h_{l-1} + b^{(l)}\right), \tag{4}$$

where $h_0$ is the input vector $x$ and $L$ denotes the number of hidden layers of the stacked autoencoder.

In the supervised approach, bottleneck features are created by an MLP trained to predict the class label (e.g., phoneme states), as shown in Fig. 2. An MLP is a feed-forward neural network made of an input layer, an output layer, and at least one hidden layer.
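The autoencoder computations in Eqs. (1)-(4) can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the text calls $\sigma$ a nonlinear activation without naming it, so the logistic sigmoid is assumed here; $d$ in Eq. (3) is taken to be the number of vectors in $D$, which the text does not define; the weights are random rather than trained; and the sizes 165 and 40 are borrowed from the experimental section purely for illustration.

```python
import numpy as np

def sigmoid(z):
    # Assumed choice for the nonlinearity sigma in Eqs. (1), (2), and (4).
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, W1, b1):
    """Eq. (1): h = sigma(W1 x + b1)."""
    return sigmoid(W1 @ x + b1)

def decode(h, W2, b2):
    """Eq. (2): x_tilde = sigma(W2 h + b2)."""
    return sigmoid(W2 @ h + b2)

def mse_loss(X, W1, b1, W2, b2):
    """Eq. (3), assuming d is the number of vectors (rows of X) in D."""
    errors = [np.sum((x - decode(encode(x, W1, b1), W2, b2)) ** 2) for x in X]
    return float(np.mean(errors))

def stacked_encode(x, layers):
    """Eq. (4): h_l = sigma(W_l h_{l-1} + b_l), starting from h_0 = x."""
    h = x
    for W, b in layers:
        h = sigmoid(W @ h + b)
    return h

# Hypothetical sizes and untrained random weights, for shape checking only.
rng = np.random.default_rng(0)
in_dim, bn_dim = 165, 40
W1 = rng.normal(scale=0.1, size=(bn_dim, in_dim))
b1 = np.zeros(bn_dim)
W2 = rng.normal(scale=0.1, size=(in_dim, bn_dim))
b2 = np.zeros(in_dim)
X = rng.uniform(size=(8, in_dim))

h = encode(X[0], W1, b1)               # bottleneck representation of one vector
loss = mse_loss(X, W1, b1, W2, b2)     # reconstruction loss over the toy set
```

In a trained network, `h` would play the role of the bottleneck feature; here it only demonstrates the shapes involved.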
Usually, for a classification task, the softmax function is adopted to convert values of arbitrary range into a probabilistic representation, defined by

$$\sigma(y) = \frac{1}{\sum_{k=1}^{K} \exp(y_k)} \left[ \exp(y_1) \; \cdots \; \exp(y_K) \right]^{T}, \tag{5}$$

where $K$ is the number of elements in $y$. The learning process attempts to minimize the prediction error $L(x, \tilde{x})$ with respect to the parameter $\theta = \{(W^{(1)}, b^{(1)}), (W^{(2)}, b^{(2)}), \cdots, (W^{(L)}, b^{(L)})\}$. Typically, the loss function in an MLP is the cross-entropy error function [17]. The supervised method can create valuable information for the classification task. These bottleneck features provide more effective information while preserving enough information of the original input features.

III. PROPOSED METHODS

A. Motor on/off state data

Fig. 3: Extracting auxiliary features from the motor on/off state information ("motor off" state with fan noise only and "motor on" state with additional movement noise).

Since instantaneous motor state information of the robot is intuitively useful for handling the ego-noise problem, in this paper we fuse the acoustic information obtained from spoken utterances and the motor state information carried by the instantaneous motor on/off state data into a single framework. To do this, we propose a method that uses the motor state data as auxiliary features when training the bottleneck features. The motor data derived from the robot can be classified into a basic operation state in which only the fan and the motor are turned on ("motor off"), and a motion state in which the robot shakes its head or body according to the human command ("motor on"). Our robot transmits auxiliary information with the basic operation state as "state off" and the other state as "state on". This auxiliary feature can be observed at each frame and contains instantaneous internal state information of the humanoid robot.
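As a rough illustration of how a per-frame on/off state can be attached to acoustic frames, the following sketch builds a one-hot auxiliary feature, appends it to each frame, and stacks neighbouring frames. The 13-dimensional MFCCs, the 2-dimensional one-hot state, and the 11-frame stacking follow the figures reported in the experimental setup of Section IV; the column order [off, on], the symmetric context of 5 frames per side, the edge padding, and all function names are assumptions made only for this example, and random numbers stand in for real MFCCs.

```python
import numpy as np

N_MFCC = 13   # MFCC dimension per frame, as reported in the experimental setup
CONTEXT = 5   # assumed: 5 frames on each side, giving 11 stacked frames

def motor_state_one_hot(motor_on):
    """Map per-frame booleans to one-hot rows; the [off, on] order is an assumption."""
    motor_on = np.asarray(motor_on, dtype=bool)
    one_hot = np.zeros((motor_on.shape[0], 2))
    one_hot[~motor_on, 0] = 1.0
    one_hot[motor_on, 1] = 1.0
    return one_hot

def augment_frames(mfcc, motor_on):
    """Append the motor state to every frame: (T, 13) and (T,) -> (T, 15)."""
    return np.concatenate([mfcc, motor_state_one_hot(motor_on)], axis=1)

def stack_context(frames, context=CONTEXT):
    """Stack each frame with its neighbours, repeating edge frames as padding."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    width = 2 * context + 1
    return np.stack([padded[t:t + width].reshape(-1)
                     for t in range(frames.shape[0])])

rng = np.random.default_rng(0)
mfcc = rng.normal(size=(20, N_MFCC))               # stand-in for real MFCC frames
motor_on = np.array([False] * 10 + [True] * 10)    # robot starts moving at frame 10
augmented = augment_frames(mfcc, motor_on)
stacked = stack_context(augmented)
print(stacked.shape)  # (20, 165)
```

The final shape reproduces the (13 + 2) × 11 = 165-dimensional network input quoted in the experiments.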
The concatenated input with conventional spectral features, $x' = [x; o]$, where $x$ denotes the spectral features and $o$ the auxiliary motor state feature, is used for bottleneck feature learning.

B. Extracting motor state dependent bottleneck features

Fig. 4: Framework of the proposed ASR system employing bottleneck features.

The elementary question is how to fuse the motor data into the conventional acoustic features. Herein, we propose to extend the bottleneck feature-based ASR method, motivated by [5], [11]. The key novelty is to learn the motor state dependent bottleneck features based on additional instantaneous motor data. In this work, a structure of 4 hidden layers including the bottleneck layer was selected for both the unsupervised and supervised methods, as shown on the left-hand side of Fig. 4. The MFCC and auxiliary features are concatenated at each frame, and consecutive frames are used to train the bottleneck features. In this way, the motor on/off state data can be encoded into a more effective representation.

C. Acoustic model training

Fig. 4 illustrates the overall architecture of the proposed ASR system that employs motor state dependent bottleneck features. The left-hand network is the bottleneck network, which first extracts the ego-noise adaptive bottleneck features. Then, the bottleneck features are stacked alongside the spectral features as input to the right-hand network in order to train the acoustic model. The concatenated features contain the motor state information that is needed to build an ego-noise robust speech recognition system. We investigate the performance of various such system configurations in the next section.

IV. EXPERIMENTS AND RESULTS

A. Corpus Description

In order to evaluate the proposed approach, we conducted experiments with a JIBO humanoid robot, which has 3 full-revolute axes [2]. A brief specification of the robot is given in Table I. We consider a scenario in which humans interact with a robot while the robot shakes its head. To simulate noisy environments in the humanoid robot, we recorded ego-noise using the single microphone located at the front side of the head. These noise signals are of two types: fan noise and movement noise. The mixing was conducted at various signal-to-noise ratio (SNR) levels, including 5 dB, 10 dB, 15 dB, and 20 dB, depending on the distance between the speaker and the robot. These mixtures were then used to train and evaluate the ego-noise robust ASR algorithms described above. Our experiments were conducted on the TIMIT database [18], divided into three subsets: 3969 utterances as the training set, 400 utterances as the development set, and 192 utterances as the testing set. The waveform sampling rate of the corpus and the recorded noises was 16 kHz. We then measured the performance of the proposed algorithm in terms of phoneme error rate (PER) under the aforementioned environments.

Table I. Hardware specifications of JIBO [2].

Hardware     Specifications
Sensors      360 degrees sound localization
Movement     3 full-revolute axes
Sound        2 premium speakers
Processor    High-end ARM-based mobile

B. Experimental Setup

In our experiments, the Kaldi toolkit [19] was utilized to train the bottleneck network and the acoustic model. The systems implemented and used for comparison in our experiments are as follows:

1) DNN (MFCC): A baseline system using no motor data, only the conventional spectral features obtained from the spoken utterances, as the input features for training the acoustic model.
2) DNN (MFCC + motor data): A second baseline system using auxiliary features in addition to the conventional spectral features as the input features. The auxiliary features were given by the one-hot representation of the motor on/off state information.
3) BN-DNN-PHN (MFCC + BN-PHN): The proposed system using motor state dependent bottleneck features as auxiliary features.
The ego-noise adaptive features, rather than the simple one-hot representation of the motor data, were combined with the conventional spectral features and used to train the acoustic model. In order to extract the ego-noise adaptive features, the one-hot encoded motor state data and the spectral features were used as the input features, and phoneme (PHN) states were employed as the output features.
4) BN-DNN-MS (MFCC + BN-MS): Same as BN-DNN-PHN, except that the output features of the bottleneck network were the one-hot representation of the motor state (MS) data. It was trained to classify the motor on/off state of the robot.
5) BN-DNN-MFCC (MFCC + BN-MFCC): Same as BN-DNN-PHN, except that the output features of the bottleneck network were the original MFCC features. It can be regarded as an autoencoder.

All the systems described above employed the same acoustic model structure, with 5 hidden layers of 512 hidden units each. Rectified linear unit (ReLU) activation functions were used in the lower layers, and a softmax function at the output layer. For the conventional spectral features, 13-dimensional MFCC features were extracted using a 25 ms analysis window with a 10 ms frame shift. As the input of the acoustic model in the baseline systems, the MFCC features stacked over 11 adjacent frames were used, and the one-hot representation of the motor data was added for the second baseline system (13 × 11 = 143 dimensions for the first baseline system and 15 × 11 = 165 dimensions for the second).

In the proposed systems, the additional bottleneck networks with 4 hidden layers were trained separately. To compare the effect of the various bottleneck features, experiments were performed with different output features, bottleneck layer dimensions, and bottleneck layer positions. First, the PHN labels, MS labels, and original MFCC features were compared as output features.
The sigmoid and tanh activation functions were used for the classification and regression tasks, respectively. Also, bottleneck sizes of 40 and 80 dimensions were compared. Finally, we varied the placement of the bottleneck layer from the bottom hidden layer (position 1) to the top hidden layer (position 4). For all the bottleneck networks, stacked MFCC and auxiliary features ((13 + 2) × 11 = 165 dimensions) were used as input. Then, the extracted motor state dependent bottleneck features were combined with the spectral features again and used to train the acoustic model, as shown in Fig. 4.

C. Experimental Results and Analysis

Tables II and III present the PER results for the "motor off" and "motor on" states, respectively. It is worth noting first that using the motor state data yields better recognition performance in both states. In particular, the auxiliary features generated by the bottleneck network yielded superior performance compared to the one-hot encoded vectors. This indicates that the bottleneck network can create a more valuable representation of the motor state data by fusing it with the spectral features and compressing the result.

Table II. Performance (PER in %) comparison when only fan noise exists ("motor off" state). PHN, MS, and MFCC indicate the output of the bottleneck network (PHN: phoneme, MS: motor state, MFCC: mel-frequency cepstral coefficients). BN and BN2 indicate 40- and 80-dimensional bottleneck features, respectively.

PER (%)                         5 dB    10 dB   15 dB   20 dB   Avg.
Baseline   MFCC                 31.1    29.1    27.9    26.9    28.8
Baseline   MFCC + MS            31.2    28.7    26.5    25.7    28.0
Proposed   MFCC + BN-PHN        28.0    25.6    24.2    24.2    25.5
Proposed   MFCC + BN-MS         30.9    27.6    25.9    25.5    27.5
Proposed   MFCC + BN-MFCC       30.0    27.6    26.7    26.1    27.6
Proposed   MFCC + BN2-PHN       28.4    25.6    24.6    24.1    25.7
Proposed   MFCC + BN2-MS        29.7    27.2    25.9    25.8    27.1
Proposed   MFCC + BN2-MFCC      30.5    27.9    26.5    25.6    27.6

For comparison of the proposed algorithms, the PER is reported for each type of output feature.
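As a worked check on how a relative PER reduction is obtained from the average (Avg.) columns of Tables II and III, the short sketch below applies the usual definition, (baseline - proposed) / baseline. The function name `relative_reduction` is hypothetical; the four PER values are copied from the MFCC baseline and MFCC + BN-PHN rows of the two tables.

```python
def relative_reduction(baseline_per, proposed_per):
    """Relative PER reduction in percent: 100 * (baseline - proposed) / baseline."""
    return 100.0 * (baseline_per - proposed_per) / baseline_per

# Average PER (%) of the MFCC baseline and of MFCC + BN-PHN, from Tables II and III.
motor_off = relative_reduction(28.8, 25.5)   # Table II, "motor off" state
motor_on = relative_reduction(28.8, 25.8)    # Table III, "motor on" state

print(round(motor_off, 1))  # 11.5
print(round(motor_on, 1))   # 10.4
```

Both values are relative reductions; the corresponding absolute differences in average PER are 3.3 and 3.0 percentage points.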
The results show that the phoneme states are appropriate as target features. The model with bottleneck features predicting phonemes (BN-PHN) achieved relative PER reductions of 11.5% and 10.4% over the baseline model using no motor data in the "motor off" and "motor on" states, respectively. Furthermore, we examined the effect of the bottleneck feature size, and it did not show any significant performance difference. Therefore, considering the computational complexity, the 40-dimensional bottleneck feature is suitable for extracting motor state dependent bottleneck features. In addition, the effect of the bottleneck layer position is presented in Fig. 5 for both the (a) 40- and (b) 80-dimensional bottleneck layer experiments. The results suggest that a middle (second or third) layer is a reasonable choice for the phoneme targets and that the first hidden layer is adequate for the others.

Table III. Performance (PER in %) comparison when there is movement noise in addition to the fan noise ("motor on" state).

PER (%)                         5 dB    10 dB   15 dB   20 dB   Avg.
Baseline   MFCC                 31.6    29.1    27.7    26.8    28.8
Baseline   MFCC + MS            31.4    28.5    26.8    26.1    28.2
Proposed   MFCC + BN-PHN        28.7    26.1    24.7    23.5    25.8
Proposed   MFCC + BN-MS         31.0    27.6    26.0    25.5    27.5
Proposed   MFCC + BN-MFCC       30.8    27.9    26.7    26.2    27.9
Proposed   MFCC + BN2-PHN       29.2    26.0    24.7    23.8    25.9
Proposed   MFCC + BN2-MS        29.7    27.2    26.1    25.6    27.2
Proposed   MFCC + BN2-MFCC      30.3    27.8    26.7    25.4    27.6

Fig. 5: PER as a function of the position of the bottleneck layer: (a) 40-dimensional bottleneck layer, (b) 80-dimensional bottleneck layer.

V. CONCLUSION

In this paper, we proposed a novel method to incorporate the instantaneous motor on/off state information into an ego-noise robust ASR system, which results in better performance than exploiting no motor data.
For this, we employed a bottleneck network to create motor state dependent bottleneck features that effectively integrate the motor data with conventional speech signals. These ego-noise adaptive bottleneck features provide a significant improvement over one-hot encoded motor state features. We investigated the effect of the output features of the bottleneck network and showed that the phoneme state classification output is the most effective for extracting ego-noise adaptive bottleneck features. Additionally, we compared the effect of the bottleneck layer position and concluded that a middle (second or third) layer is a reasonable choice for the phoneme targets and that the first hidden layer is adequate for the others. From the experimental results, we concluded that the robot's instantaneous motor state information is advantageous for human-robot communication. In particular, the bottleneck network can generate a more valuable representation of the motor data than the one-hot encoding method. In future work, more varied states of the robot will be considered as motor data, e.g., a walking state, a right/left arm rotating state, a head shaking state, or multiple states.

REFERENCES

[1] B. Wang, "IBM putting Watson into Softbank Pepper robot," Next Big Future, 2016.
[2] P. Rane, V. Mhatre, and L. Kurup, "Study of a home robot: Jibo," International Journal of Engineering Research and Technology, vol. 3, no. 10, pp. 490-493, 2014.
[3] 21st Century Robot [Website]. (2018, Feb. 26). https://www.21stcenturyrobot.com.
[4] A. Ito, T. Kanayama, M. Suzuki, and S. Makino, "Internal noise suppression for speech recognition by small robots," in Proc. European Conference on Speech Communication and Technology (Eurospeech), 2005.
[5] Y. Nishimura, M. Nakano, K. Nakadai, H. Tsujino, and M. Ishizuka, "Speech recognition for a robot under its motor noises by selective application of missing feature theory and MLLR," in ISCA Tutorial and Research Workshop on Statistical and Perceptual Audition, 2006.
[6] G. Ince et al., "Ego noise suppression of a robot using template subtraction," in Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2009.
[7] A. Schmidt et al., "A novel ego-noise suppression algorithm for acoustic signal enhancement in autonomous systems," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018.
[8] M. Lee and J.-H. Chang, "DNN-based speech recognition system dealing with motor state as auxiliary information of DNN for head shaking robot," to appear in Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018.
[9] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, pp. 113-120, 1979.
[10] F. Grézl, M. Karafiát, S. Kontár, and J. Černocký, "Probabilistic and bottleneck features for LVCSR of meetings," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2007.
[11] D. Yu and M. L. Seltzer, "Improved bottleneck features using pretrained deep neural networks," in Proc. Interspeech, 2011.
[12] T. N. Sainath, B. Kingsbury, and B. Ramabhadran, "Auto-encoder bottleneck features using deep belief networks," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2012.
[13] B. Zhang, L. Xie, Y. Ming, H. Huang, and M. Song, "Deep neural network derived bottleneck features for accurate audio classification," in Proc. IEEE International Conference on Multimedia and Expo Workshops (ICMEW), 2016.
[14] S. Mun, S. Shon, W. Kim, and H. Ko, "Deep neural network bottleneck features for acoustic event recognition," in Proc. Interspeech, 2016.
[15] Z. Wu and S. King, "Improving trajectory modelling for DNN-based speech synthesis by using stacked bottleneck features and minimum generation error training," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 7, pp. 1255-1265, 2016.
[16] S. Yaman, J. Pelecanos, and R. Sarikaya, "Bottleneck features for speaker recognition," in Odyssey 2012: The Speaker and Language Recognition Workshop, 2012.
[17] G. E. Nasr, E. Badr, and C. Joun, "Cross entropy error function in neural networks: Forecasting gasoline demand," in Proc. FLAIRS Conference, 2002.
[18] V. Zue, S. Seneff, and J. R. Glass, "Speech database development at MIT: TIMIT and beyond," Speech Communication, vol. 9, iss. 4, pp. 351-356, 1990.
[19] D. Povey et al., "The Kaldi speech recognition toolkit," in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2011.
